The Health Thread | Nepal Health News & IPAC

Performance of large language models as an information resource on functional hypothalamic amenorrhea for patients and healthcare professionals.

Researchers

Nancy Safwan, Jana Karam, Sarah L Berga, Maria D Hurtado Andrade, Kristin Cole, Stacey J Winham, Stephanie S Faubion, Chrisandra L Shufelt

Abstract

To assess and compare the accuracy, readability, and overall performance of large language models (LLMs) in answering questions about functional hypothalamic amenorrhea (FHA) for patients and healthcare professionals.
A total of 11 patient-level and 15 clinician-level FHA-related questions were entered separately into four LLMs: ChatGPT 3.5 (free version), ChatGPT 4.0 (updated, paid subscription), Gemini, and OpenEvidence. OpenEvidence was used only for clinician-level questions. Three expert reviewers, blinded to the LLM used, rated each response as accurate and complete, accurate but incomplete, or inaccurate; a fourth reviewer resolved discordant scores. Readability of responses to patient-level questions was assessed using word count and the Flesch Reading Ease Score (FRES), on which lower scores indicate more difficult reading. Accuracy and completeness were compared using odds ratios (95% CI) with ChatGPT 3.5 as the reference model, and differences in readability were analyzed using Friedman's test.
LLM performance varied across question types. For patient-level questions, ChatGPT 4.0 achieved the highest accuracy (9 of 11; 82%), followed by ChatGPT 3.5 and Gemini (each 8 of 11; 73%), with no statistically significant differences. Among clinician-level questions, OpenEvidence demonstrated perfect accuracy (15 of 15; 100%), compared with 93% for ChatGPT 3.5 and 80% each for ChatGPT 4.0 and Gemini. Completeness followed similar patterns, with OpenEvidence providing the most complete clinician-level responses (93%) and ChatGPT 4.0 the most complete patient-level responses (89%). Readability differed significantly among models (p = 0.012), with Gemini producing the most readable patient-level content (median FRES 43.5 [IQR 36.8-53.4]) compared with ChatGPT 3.5 (30.6 [16.8-48.4]) and ChatGPT 4.0 (28.8 [22.1-37.6]). Word counts did not differ significantly (p = 0.39).
LLMs demonstrated good overall performance in answering FHA-related questions but often provided incorrect or incomplete information. Fine-tuning on field-specific data, engineering prompts, and obtaining human-in-the-loop feedback may help improve the accuracy of these models.
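
To make the two quantitative measures named in the abstract concrete, the following is a minimal Python sketch of how a Flesch Reading Ease Score and an unadjusted odds ratio with a 95% confidence interval could be computed. It is an illustration under stated assumptions, not the authors' analysis: the abstract does not say which software, syllable counter, or interval method was used. The function names `count_syllables`, `flesch_reading_ease`, and `odds_ratio_wald_ci` are hypothetical, the syllable counter is a rough vowel-group heuristic, and the interval is a simple Wald interval on the log odds ratio.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch formula; lower scores indicate harder text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def odds_ratio_wald_ci(events, total, ref_events, ref_total, z=1.96):
    """Unadjusted odds ratio versus a reference group, with a Wald 95% CI.

    Assumption: a plain 2x2 table with no continuity correction, so any
    zero cell is rejected rather than silently adjusted.
    """
    a, b = events, total - events              # model: accurate / not accurate
    c, d = ref_events, ref_total - ref_events  # reference: accurate / not accurate
    if min(a, b, c, d) <= 0:
        raise ValueError("zero cell: use a correction or an exact method")
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    print(flesch_reading_ease("The cat sat on the mat."))
    # Patient-level accuracy counts from the abstract:
    # ChatGPT 4.0 (9 of 11) versus the reference ChatGPT 3.5 (8 of 11).
    print(odds_ratio_wald_ci(9, 11, 8, 11))
```

With only 11 questions per model, the interval around such an odds ratio is wide and spans 1, which is consistent with the abstract's report of no statistically significant differences. A perfect score such as 15 of 15 leaves a zero cell, so this simple sketch raises an error there; handling that case needs a continuity correction or an exact method, and the abstract does not state which approach the authors took.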