Abstract
Brain age prediction from T1-weighted MRI, together with the associated brain age gap (BAG), has emerged as a promising neuroimaging biomarker for assessing deviations from normative aging. However, the robustness, bias, and interpretability of existing models on external datasets remain poorly understood, which limits clinical translation. In this study, we evaluated four publicly available brain age models (ENIGMA, DeepBrainNet, Pyment, and BrainAgeNeXt) on four independent MRI datasets (ADNI, UNSAM Long COVID, and two OpenNeuro cohorts) comprising 1,634 subjects with diverse demographic and clinical profiles. Each model was tested with its original preprocessing pipeline. Performance was assessed using mean absolute error (MAE), mean error (ME), and BAG variability metrics, with additional analyses of biases related to age, dataset, ethnicity, and education. Interpretability was evaluated using Layer-wise Relevance Propagation, and anatomical correlates were explored using BrainChart-derived centile scores. Group-level comparisons were performed between cognitively normal (CN) individuals and patients with mild cognitive impairment (MCI), Alzheimer’s disease (AD), or Long COVID (LC). The models based on 3D convolutional neural networks (Pyment and BrainAgeNeXt) outperformed the 2D CNN DeepBrainNet and the ENIGMA ridge regression model in both accuracy (MAE: 3.9–3.7 vs. 6.2–12.4 years, respectively) and stability (ASTD: 3.2–2.9 vs. 4.6–8.3 years, respectively). Dataset-specific BAG differences were largely explained by age distributions, whereas ethnicity showed a statistically significant but small effect on BAG in some models. Relevance maps highlighted the lateral ventricles as the most consistently relevant anatomical region, with additional cerebellar contributions emerging in older adults for BrainAgeNeXt. Group-level analyses confirmed elevated BAG in MCI and AD patients compared with CN individuals, whereas no significant differences were observed in LC participants.
These findings suggest that, while BAG is a promising biomarker for group-level analyses, current models must address age and demographic biases before individual-level clinical application becomes feasible.
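
As an illustration of the accuracy and bias metrics named above, the following minimal sketch computes BAG, MAE, ME, and a simple age-bias slope from paired chronological and predicted ages. It assumes the common sign convention BAG = predicted age minus chronological age. The function name `brain_age_metrics`, the use of the sample standard deviation as a BAG variability measure, and the least-squares slope of BAG on age are illustrative assumptions, not the definitions used in this study; in particular, ASTD is not defined here and is not computed.

```python
from statistics import mean, stdev


def brain_age_metrics(chronological_age, predicted_age):
    """Illustrative summary metrics for brain age predictions.

    Assumes BAG = predicted age - chronological age (a common
    convention, taken here as an assumption).
    """
    if len(chronological_age) != len(predicted_age):
        raise ValueError("inputs must have the same length")
    if len(chronological_age) < 2:
        raise ValueError("at least two subjects are required")

    age = [float(a) for a in chronological_age]
    pred = [float(p) for p in predicted_age]

    # Per-subject brain age gap.
    bag = [p - a for p, a in zip(pred, age)]

    # Least-squares slope of BAG on chronological age: a simple,
    # hypothetical indicator of age-related bias (0 means none).
    age_mean, bag_mean = mean(age), mean(bag)
    sxx = sum((a - age_mean) ** 2 for a in age)
    sxy = sum((a - age_mean) * (b - bag_mean) for a, b in zip(age, bag))
    slope = sxy / sxx if sxx > 0 else float("nan")

    return {
        "bag": bag,
        "mae": mean(abs(b) for b in bag),  # mean absolute error
        "me": bag_mean,                    # mean (signed) error
        "bag_sd": stdev(bag),              # sample SD of BAG
        "age_bias_slope": slope,
    }


if __name__ == "__main__":
    # Toy values for demonstration only; not data from the study.
    result = brain_age_metrics([60, 70, 80], [62, 69, 85])
    print(result)
```

MAE summarizes the size of prediction errors regardless of direction, whereas ME retains the sign and therefore indicates systematic over- or underestimation of age.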