Large language models for extracting histopathologic diagnoses of colorectal cancer and dysplasia from electronic health records

PMID: 40313292
Source: medRxiv
Publication date: 2025-05-02
Year: 2025

Abstract

BACKGROUND: Accurate data resources are essential for impactful medical research, but available structured datasets are often incomplete or inaccurate. Recent advances in open-weight large language models (LLMs) enable more accurate data extraction from unstructured text in electronic health records (EHRs), but these models have not yet been thoroughly validated for challenging diagnoses such as inflammatory bowel disease (IBD)-related neoplasia. OBJECTIVE: To create a validated approach using LLMs for identifying histopathologic diagnoses in pathology reports from the nationwide Veterans Health Administration database, including patients with genotype data within the Million Veteran Program (MVP) biobank. DESIGN: Our approach uses simple 'yes/no' question prompts for the following phenotypes of interest: any colorectal dysplasia, high-grade dysplasia and/or colorectal adenocarcinoma (HGD/CRC), and invasive CRC. We validated the method on diagnostic tasks by applying the prompts to reports from patients with IBD (and, separately, from patients without IBD) and calculated F1-scores as a balanced measure of accuracy. RESULTS: In patients with IBD in MVP, we achieved F1-scores of 96.1% (95% CI 92.5-99.4%) for identifying dysplasia, 93.7% (88.2-98.4%) for identifying HGD/CRC, and 98% (96.3-99.4%) for identifying CRC. In patients without IBD in MVP, we achieved F1-scores of 99.2% (98.2-100%) for identifying any colorectal dysplasia, 96.5% (93.0-99.2%) for identifying HGD/CRC, and 95% (92.8-97.2%) for identifying CRC using the Gemma-2 LLM. CONCLUSION: LLMs provided excellent accuracy in extracting the diagnoses of interest from EHRs. Our validated methods generalized to unstructured pathology notes, even under the constraints of resource-limited computing environments. This may therefore be a promising approach for other clinical phenotypes, given the minimal human-led development required.
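
As an illustration of the evaluation described in the abstract, the following is a minimal sketch of how 'yes/no' model replies might be scored against reviewer labels with an F1-score and a confidence interval. The abstract does not state how its intervals were computed; the percentile bootstrap used here is an assumption, as are all function names and the toy data, none of which come from the study.

```python
import random


def parse_yes_no(response):
    """Map a free-text model reply to True/False; return None if it is neither."""
    words = response.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


def f1_score(labels, preds):
    """F1 = 2*TP / (2*TP + FP + FN). Unparseable replies (None) count as 'no'."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(labels, preds, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(n_boot):
        # Resample reports with replacement and rescore.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([labels[i] for i in idx], [preds[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy data: hypothetical reviewer labels and model replies for six reports.
labels = [True, True, True, False, False, True]
replies = ["Yes", "yes.", "No", "No", "Yes", "Yes, high-grade dysplasia"]
preds = [parse_yes_no(r) for r in replies]

f1 = f1_score(labels, preds)          # 3 TP, 1 FP, 1 FN -> 0.75
lo, hi = bootstrap_ci(labels, preds)
print(f"F1 = {f1:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

In this sketch, replies that are neither 'yes' nor 'no' are treated as negatives when scoring; a real pipeline might instead flag them for manual review.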