Development and Validation of a Dynamic-Template-Constrained Large Language Model for Generating Fully-Structured Radiology Reports

Md Zabirul Islam, Chuang Niu, Md Sayed Tanveer, Parisa Kaviani, Qing Lyu, Mannudeep K. Kalra, Christopher T. Whitlow, Ge Wang
Rensselaer Polytechnic Institute, Troy, NY 12180, United States

Overview

This work presents a dynamic template-constrained large language model (LLM) framework for converting free-text lung cancer screening (LCS) radiology reports into fully-structured reports (FSR) with zero formatting errors and zero hallucinations.

Unlike prompt-based approaches, this method constrains every generated token using a predefined clinical template with standardized candidate values. The system was validated across two institutions using 7,442 LDCT reports, achieving state-of-the-art performance.

Clinical Motivation

Radiology reporting traditionally uses loosely-structured reporting (LSR) or free-text formats, which limit large-scale statistical analysis and reliable retrieval. Fully-structured reporting (FSR) enables:

  • Standardized discrete feature extraction
  • Automated statistical mining
  • Nodule-level semantic retrieval
  • Cross-institutional consistency
  • Downstream AI model training

However, conventional LLM prompting fails in clinical settings due to:

  • JSON formatting errors
  • Content hallucinations
  • Privacy risks (cloud-based proprietary models)
  • Long inference time and high token cost

📐 Structured Template Design

Two radiologists created a standardized lung nodule template containing 28 features:

  • 24 nodule-level features (e.g., lobe, segment, attenuation, margin, shape, size)
  • 3 report-level management features
  • 1 auxiliary feature (number of nodules)

Each feature has a predefined candidate set (e.g., attenuation ∈ {solid, part-solid, ground-glass, ...}). This predefined schema is the foundation of the constrained decoding.
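A minimal sketch of what such a schema looks like in code. The feature names and candidate values below are illustrative examples only, not the exact 28-feature clinical template designed by the radiologists:

```python
# Illustrative template schema: each feature maps to its predefined candidate set.
# Feature names and values here are examples, not the actual clinical template.
NODULE_TEMPLATE = {
    "lobe": ["right upper", "right middle", "right lower", "left upper", "left lower"],
    "attenuation": ["solid", "part-solid", "ground-glass"],
    "margin": ["smooth", "lobulated", "spiculated"],
    "shape": ["round", "oval", "irregular"],
}

def is_valid_value(feature: str, value: str) -> bool:
    """A value is legal only if it appears in the feature's predefined set."""
    return value in NODULE_TEMPLATE.get(feature, [])
```

Because every feature's value must come from such a closed set, the decoder can never emit a value outside the schema.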

Figure 1: Cross-institutional dataset construction (Institution-1 & Institution-2).

🏗 Dynamic Template-Constrained Decoding

Figure 2: The LLM receives system instruction + free-text report, while decoding is strictly constrained by a structured template.

Figure 3: Modified inference architecture integrated into vLLM. Special template tokens restrict candidate outputs during decoding.

During decoding:

  • Template format tokens are fixed → ensures valid JSON
  • Special template tokens select only predefined candidates
  • Dynamic adjustment handles variable number of nodules
  • Maximum-probability candidate is deterministically selected

This guarantees:

  • 🚫 Zero formatting errors
  • 🚫 Zero hallucinations
  • 🔒 Local deployment (no privacy leakage)
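The decoding constraints above can be sketched in a few lines. The scoring function here is a hypothetical stand-in for the LLM's conditional likelihood over candidate values; the real system operates on template tokens inside vLLM:

```python
import json

def constrained_fill(template: dict, score) -> dict:
    """Sketch of template-constrained decoding.

    template: feature -> list of predefined candidates.
    score(feature, candidate) -> float, a stand-in for the LLM's likelihood.
    JSON structure comes verbatim from the template (fixed format tokens),
    and each value is the maximum-probability predefined candidate, so the
    output is always valid JSON with no hallucinated values.
    """
    return {feature: max(candidates, key=lambda c: score(feature, c))
            for feature, candidates in template.items()}

template = {"attenuation": ["solid", "part-solid", "ground-glass"],
            "margin": ["smooth", "spiculated"]}
# Toy scorer: pretend the model strongly prefers "solid" and "smooth".
toy_scores = {("attenuation", "solid"): 0.9, ("attenuation", "part-solid"): 0.05,
              ("attenuation", "ground-glass"): 0.05,
              ("margin", "smooth"): 0.7, ("margin", "spiculated"): 0.3}
structured = constrained_fill(template, lambda f, c: toy_scores[(f, c)])
print(json.dumps(structured))
```

Note the deterministic argmax over a closed candidate set: the model can rank the candidates but can never write anything outside them.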

📊 Cross-Institutional Validation

Evaluation was initially performed on 500 manually labeled reports (250 per institution) to assess generalizability across clinical settings.

  • Institution-1: F1 = 97.60% (95% CI: 96.5–98.6%)
  • Institution-2: F1 = 96.92% (95% CI: 95.8–98.0%)
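The manuscript summary does not state how the 95% confidence intervals were obtained; a common choice is a nonparametric bootstrap over per-report scores, sketched below on synthetic F1 values:

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of per-report scores."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic per-report F1 scores for 250 reports (one institution's sample size).
rng = random.Random(1)
scores = [min(1.0, 0.95 + 0.04 * rng.random()) for _ in range(250)]
low, high = bootstrap_ci(scores)
```

This is one plausible procedure under stated assumptions, not necessarily the paper's exact method.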

Performance Improvements over Baselines:

  • LLaMA-3.1-405B: +10.42%
  • GPT-4o: +17.18%
  • Consistent gains across all 27 clinical features (24 nodule-level + 3 report-level; p < 0.01)

Figure 6: Institution-1 performance comparison.

Figure 7: Institution-2 performance comparison.

📈 Extended Large-Scale Evaluation (Unpublished Update)

Beyond the cross-institutional validation reported in the manuscript, we further evaluated updated model variants on a 5,000-report dataset derived from our newly curated structured radiology corpus.

  • LLaMA-3.1-8B (Template-Constrained): 98.05% F1 on the 5k evaluation set
  • LLaMA-3.2-1B (Fine-Tuned & Constrained): 94.27% F1 on the 5k evaluation set

These results demonstrate that:

  • The constrained decoding framework scales effectively to larger evaluation sets
  • An 8B model achieves near-clinical-grade performance (>98% F1)
  • Even a lightweight 1B model maintains strong structural fidelity (>94% F1)
  • Model size-performance trade-offs enable deployment flexibility

Note: These extended results are not included in the current arXiv manuscript version but reflect subsequent internal evaluation.

🔎 Large-Scale Statistical Mining

5,192 consecutive LDCT reports collected over three years were automatically converted to FSR.

Automatically derived statistics matched prior clinical literature:

  • Upper lobes had significantly more nodules (p < 0.01)
  • Right lung > left lung (p < 0.01)
  • Ground-glass nodules more common in females (p < 0.01)
  • Lung-RADS distribution consistent with prior studies
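A finding like "right lung > left lung (p < 0.01)" can be checked with a one-sided binomial z-test against a 50/50 split. The counts below are hypothetical, since the per-lung totals are not given in this summary:

```python
import math

def one_sided_binomial_z(successes: int, n: int, p0: float = 0.5) -> float:
    """Normal-approximation p-value for the proportion exceeding p0."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability

# Hypothetical counts: 1,600 of 2,900 localized nodules in the right lung.
p_value = one_sided_binomial_z(1600, 2900)
```

With these illustrative counts the null of an even split is rejected well below the p < 0.01 threshold reported above.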

🔍 Nodule-Level Retrieval System

Figure 8: Complex semantic query: solid AND (increase OR new) → 272 nodules retrieved.

Unlike keyword matching, semantic reasoning detects growth even when "increase" is not explicitly mentioned.
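Once reports are fully structured, the query "solid AND (increase OR new)" reduces to a filter over template fields. The field names below ("attenuation", "change") are illustrative, not necessarily the exact schema keys:

```python
def query_nodules(nodules):
    """Retrieve nodules matching: solid AND (increase OR new)."""
    return [n for n in nodules
            if n.get("attenuation") == "solid"
            and n.get("change") in {"increase", "new"}]

# Toy structured nodules extracted from reports.
nodules = [
    {"id": 1, "attenuation": "solid", "change": "increase"},
    {"id": 2, "attenuation": "ground-glass", "change": "new"},
    {"id": 3, "attenuation": "solid", "change": "stable"},
    {"id": 4, "attenuation": "solid", "change": "new"},
]
hits = query_nodules(nodules)  # nodules 1 and 4
```

The semantic step happens upstream, during structuring: the LLM maps phrasing like "larger than on the prior exam" to the standardized value "increase", so the retrieval itself stays a simple exact-match filter.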

⚙️ Synthetic Fine-Tuning & Efficient Model Scaling

To reduce computational cost and enable local deployment, a LLaMA-3.1-8B model was fine-tuned using a large-scale synthetic radiology dataset generated through template-driven structured sampling and free-text synthesis.

  • Training samples: 189,950
  • Testing samples: 49,965
  • Mean IoU: 0.183 → 0.840
  • Mean F1: 0.349 → 0.899
  • Inference time (5,000 reports): 9h → 2.5h
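One plausible reading of the structural IoU tracked above is intersection-over-union of (feature, value) pairs between a generated and a reference structured report; the paper's exact metric may differ:

```python
def structural_iou(pred: dict, ref: dict) -> float:
    """IoU over (feature, value) pairs of two flat structured reports."""
    p, r = set(pred.items()), set(ref.items())
    return len(p & r) / len(p | r) if (p | r) else 1.0

pred = {"lobe": "right upper", "attenuation": "solid", "margin": "smooth"}
ref  = {"lobe": "right upper", "attenuation": "solid", "margin": "spiculated"}
iou = structural_iou(pred, ref)  # 2 shared pairs / 4 distinct pairs -> 0.5
```

Under this definition, a jump from 0.183 to 0.840 means fine-tuning roughly quadrupled the fraction of feature-value pairs the model gets structurally right.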

Figure 9: Structural IoU & feature-wise F1 improvement after fine-tuning.

Figure 10: Lung/Pleura feature-level F1 improvements across anatomical systems.

🚀 Key Contributions

  • Dynamic template-constrained decoding eliminating hallucinations
  • Cross-institutional validation (n=7,442)
  • Zero formatting errors
  • Open-source vLLM-Structure framework
  • Efficient fine-tuned 8B model for scalable deployment
  • Automatic statistical mining & semantic retrieval