Publication-time-valid prediction of citation-risk outcomes in a bounded clinical specialty literature corpus.
Researchers
Sunny Chung, Charles Kahi, Siddharth Singh
Abstract
Citation-prediction studies often estimate citation counts using information unavailable at publication. We evaluated whether citation-risk outcomes can be predicted using only publication-time information: metadata, references, author history, and text available on or before publication. We assembled 9,424 original-research articles published from 2017 to 2022 across seven clinical gastroenterology journals using OpenAlex and PubMed. The primary reference-observed cohort included 8,409 articles with a parsed reference list. The primary outcome was ≤ 3 citations within 2 years; secondary outcomes were 0 citations within 3 years, ≤ 3 citations within 3 years, and > 20 citations within 2 years. We compared models built on a nonsemantic citation/reference/context baseline, author-history variables, whole-document title/abstract embeddings, role-segmented source-text embeddings, and reference-context distributional features. Evaluation used two held-out publication-year folds and three metrics: area under the precision-recall curve (PR-AUC), F1, and precision among the top-ranked 10% of predictions (precision@10%). For the primary outcome, the nonsemantic baseline achieved PR-AUC 0.818, F1 0.722, and precision@10% 0.935. Adding whole-document embeddings improved these to 0.828, 0.735, and 0.962, respectively. Structure-aware features did not improve the primary outcome but provided endpoint-specific gains for secondary outcomes. Author-history features showed standalone signal but did not improve on the baseline. Pooled performance exceeded journal-local performance, indicating that citation-risk signal operated at the corpus level. These findings support publication-time-valid citation-risk modeling as a reproducible framework for studying evidence visibility within bounded literatures and motivate replication across other journal sets, specialties, and publication eras.
Source: PubMed (PMID: 42396484)
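As an illustration of the evaluation metrics named in the abstract, the following is a minimal Python sketch of the primary outcome label (≤ 3 citations within 2 years), precision among the top-ranked 10% of predictions, and a PR-AUC estimate. It is not the authors' code. The function names are hypothetical, the abstract does not state which PR-AUC estimator was used (average precision is assumed here), and the handling of tied scores and the rounding of the top-10% cutoff are also assumptions.

```python
from typing import Sequence


def low_citation_label(citations_within_2y: int, threshold: int = 3) -> int:
    """Binary primary outcome: 1 if the article has <= threshold citations."""
    return int(citations_within_2y <= threshold)


def _rank_by_score(y_true: Sequence[int], scores: Sequence[float]):
    """Pair labels with scores and sort by descending score.

    Ties keep their input order (Python's sort is stable); this is an
    assumption, since the abstract does not describe tie handling.
    """
    if len(y_true) != len(scores) or len(y_true) == 0:
        raise ValueError("y_true and scores must be non-empty and equal length")
    return sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)


def precision_at_top_fraction(
    y_true: Sequence[int], scores: Sequence[float], fraction: float = 0.10
) -> float:
    """Share of true positives among the top-ranked `fraction` of predictions."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = _rank_by_score(y_true, scores)
    # Assumption: round the cutoff down, but always keep at least one item.
    k = max(1, int(len(ranked) * fraction))
    return sum(label for _, label in ranked[:k]) / k


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, one common estimator of PR-AUC (assumed here)."""
    ranked = _rank_by_score(y_true, scores)
    positives = sum(label for _, label in ranked)
    if positives == 0:
        raise ValueError("average precision is undefined without positives")
    hits = 0
    running_total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            running_total += hits / rank  # precision at each positive's rank
    return running_total / positives


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from the study.
    citation_counts = [0, 2, 15, 3, 40, 9, 1, 22, 7, 30]
    labels = [low_citation_label(c) for c in citation_counts]
    risk_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    print("precision@10%:", precision_at_top_fraction(labels, risk_scores))
    print("average precision:", average_precision(labels, risk_scores))
```

In a publication-year hold-out design like the one described, these functions would be applied to the predictions for each held-out year separately; how the two folds' scores were combined is not specified in the abstract.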