The Health Thread | Nepal Health News & IPAC

A large language model pipeline for automated citation quality scoring across engineering journal quartiles with expert validation.

Researchers

Murat Isik, Sultan Guleryuzlu

Abstract

This study proposes and evaluates a two-stage large language model (LLM)-based pipeline for automated citation quality scoring in academic manuscripts. In Stage 1, citation sentences are extracted from full-text PDFs and matched to their referenced articles using the Gemini 2.5 Flash model; in Stage 2, each citation-reference pair is scored for semantic relevance on a continuous 0-10 scale by a second LLM inference call operating under a structured five-tier rubric and a skeptical-reviewer prompt persona. The pipeline was applied to a corpus of 121 Web of Science (WOS)-indexed engineering articles drawn from journals spanning all four Journal Citation Reports quartile strata (Q1-Q4), yielding 5,615 scored citation-reference pairs. Descriptive analysis revealed an overall mean relevance score of 7.76 (SD = 2.36), with 74.7% of citations rated as Strong or Excellent. A Kruskal-Wallis test confirmed statistically significant score differences across quartile groups (H(3) = 157.10, p < 0.001), though the overall effect size was small (ε² = 0.028). Post-hoc Mann-Whitney U tests with Bonferroni correction showed that Q2 articles recorded the highest mean score (M = 8.04), significantly higher than Q1 (M = 7.52), Q3 (M = 7.73), and Q4 (M = 7.74). The Q3 versus Q4 comparison was the sole non-significant pairing (p = 0.756), indicating no detectable difference in citation quality between these two strata. A Spearman analysis of journal quartile against relevance score yielded a weak negative rank correlation (ρ = -0.105, p < 0.001). Q1 also recorded the highest proportion of Irrelevant citations (10.7%). These findings challenge the assumption that citation quality improves monotonically with journal prestige.
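The reported effect size can be checked by hand from the figures in the abstract. The sketch below assumes the common epsilon-squared definition for a Kruskal-Wallis test, ε² = H / (n - 1), and assumes that all 5,615 scored pairs entered the test; the abstract does not state either point. The 0.05 family-wise alpha used for the Bonferroni step is likewise an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def epsilon_squared(h_stat: float, n_obs: int) -> float:
    """Epsilon-squared effect size for a Kruskal-Wallis H statistic.

    Uses the common definition H / (n - 1); assumed here, not confirmed
    by the source.
    """
    if n_obs < 2:
        raise ValueError("need at least two observations")
    return h_stat / (n_obs - 1)


def bonferroni_alpha(groups, family_alpha=0.05):
    """Return all pairwise comparisons and the per-test alpha.

    family_alpha = 0.05 is an assumed default, not a value from the source.
    """
    pairs = list(combinations(groups, 2))
    return pairs, family_alpha / len(pairs)


# H and n are taken from the abstract: H(3) = 157.10, 5,615 scored pairs.
eps2 = epsilon_squared(157.10, 5615)
pairs, alpha_per_test = bonferroni_alpha(["Q1", "Q2", "Q3", "Q4"])

print(round(eps2, 3))                        # 0.028
print(len(pairs), round(alpha_per_test, 4))  # 6 0.0083
```

Under these assumptions the computed ε² rounds to the reported 0.028, and four quartile groups give six pairwise comparisons, each tested at roughly 0.0083.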
The lower mean score of Q1 coexists with one of the highest proportions of highly relevant citations, indicating a bimodal rather than uniformly weaker profile, and a systematic annotation showed that context-dependent pointer citations are disproportionately concentrated in the Q1 Irrelevant set. We therefore attribute Q1's pattern to the broader interdisciplinary scope of top-tier articles together with a measurement effect, rather than to any single cause such as AI-assisted writing. The proposed pipeline offers a scalable, content-aware complement to existing academic integrity tools, with practical applications in editorial pre-screening and automated peer review support. An inter-rater reliability study on a stratified subsample of 150 citation-reference pairs showed strong ordinal agreement between the LLM and the expert majority vote (Spearman ρ = 0.643, p < 0.001). Exact-category agreement was 48.0%, rising to 77.3% under a ±1 adjacent-category tolerance, and agreement was highest at the Irrelevant (80.0%) and Excellent (71.0%) poles.
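The two agreement figures quoted for the validation subsample can be illustrated with a minimal sketch. It assumes that the five rubric tiers are encoded as integers 0 (Irrelevant) to 4 (Excellent) and that "adjacent-category tolerance" means the LLM tier and the expert majority tier differ by at most one. The function name and the toy labels are hypothetical and are not data from the study.

```python
def agreement_rates(llm_tiers, expert_tiers, tolerance=1):
    """Return (exact, within-tolerance) agreement between two tier sequences.

    Tiers are ordinal integers; the encoding 0..4 is an assumption.
    """
    if len(llm_tiers) != len(expert_tiers) or not llm_tiers:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(llm_tiers)
    diffs = [abs(a - b) for a, b in zip(llm_tiers, expert_tiers)]
    exact = sum(d == 0 for d in diffs) / n
    adjacent = sum(d <= tolerance for d in diffs) / n
    return exact, adjacent


# Toy labels for illustration only (not taken from the paper).
llm = [0, 1, 2, 3, 4, 4, 3, 2, 0, 4]
expert = [0, 2, 2, 4, 4, 2, 3, 1, 0, 4]

exact, adjacent = agreement_rates(llm, expert)
print(exact, adjacent)  # 0.6 0.9
```

Read this way, the reported 48.0% and 77.3% on 150 pairs would correspond to 72 exact matches and 116 pairs within one tier.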