Epigenetic Age Prediction

RnBeads' age prediction module employs machine learning to infer age information from DNA methylation data. It can predict age from methylation data obtained by virtually any protocol that provides single base pair resolution methylation calls, extending previously developed methods for epigenetic age inference. In addition to Illumina's methylation array series, this is the first method to support bisulfite-sequencing assays. Here we outline the methods employed and describe the data sets that were used to infer predictors for different technologies in detail. A detailed tutorial on how to use epigenetic age prediction in RnBeads can be found here.


Model description and patametrization

The age prediction module employs elastic net regression using the glmnet R-package. A transformed version of the annotated age as suggested by Horvath et al. is used as response variable to reflect different pacing in epigenetic changes in younger and older ages, while DNA methylation values at single CpGs are used as features.

Pre-defined Predictors

Since prediction accuracy increases with the number of analyzed samples, we collected methylation data from multiple, publicly available studies to provide users with a basis for annotating their own samples (see below). From these data sets, we compiled training data sets and inferred age predictors using elastic net regression as described above. The resulting pre-defined predictors are available as CSV files in the RnBeads package and can be directly applied to user-provided data sets.

Evaluation of pre-defined age predictors

Predictors were trained based on data from the Illumina Infinium HumanMethylation27 BeadChip, its successor the Infinium 450 BeadChip and Reduced-Representation Bisulfite Sequencing (RRBS). Note that we did not include the Illumina MethylationEPIC BeadChip and Whole-Genome Bisulfite-Sequencing data due to the lack of large representative data sets (hundreds of age-annotated training samples are required). However, first applications of the pre-defined predictors to EPIC data showed performance consistent with what we describe here.

Infinium 27k BeadChip

The Infinium 27k BeadChip training data set comprised 2,286 sample from 6 different studies, collected from the gene expression omnibus (GEO). A detailed description of the studied can be found in the table below. Mainly blood samples, but also colon samples are contained in the training data set, while the test data set only consists of blood samples. Applying the above workflow, we detected 366 CpG sites to be predictive of the donor’s age. Evaluation of the pre-defined predictor was based on 10-fold cross validation on the training data set and on the independent test data set. While cross validation resulted in a correlation of 0.958 and a median absolute difference (MADiff) of 3.23 years between annotated and predicted ages, performance on the test data set was 86 % correlation and MADiff 5.13.

Infinium 450k BeadChip

Similarly, the training data set consisted of 1,866 samples from 20 studies, either from GEO or TCGA, and contained ages between 0 and 103 years. In addition to the majority of blood samples, kidney and brain tissues were included. An independent test data set comprising 1,007 mainly blood samples was used to evaluate performance of the pre-defined predictor, which selected 761 age-associated CpGs. In line with the findings for its predecessor chip, cross validation proved high predictive performance (correlation 0.95, MADiff 2.8 years) underpinned by the independent test data set (0.89 and 2.94 years).


This data set contains blood samples from 232 individuals of German origin covering an age range of 20-103 years. No independent test data set was available due to the still limited number of bisulfite studies addressing methylation changes associated with healthy human aging. Nevertheless, the feature selection method employed by RnBeads determines 156 sites associated with the aging process and cross validation showed reasonable predictive performance (correlation 0.93, MADiff 6.37 years)

Data set description

The table below describes the data sets used in RnBeads' age prediction module both for training and testing purposes.

Data Type Use Source Accession Size Tissue Donor characteristic
27k Training GEO GSE56606 90 CD14-cells diabetic
GSE41037 720 blood, brain healthy
GSE48988 178 colon aspirin intake
GSE58119 282 blood serum BRCA-1 mutation
GSE19711 540 whole blood healthy
GSE27097 398 cord blood pediatric
Training Data Set 2,289
Test GEO GSE44763 46 brain obese
Test Data Set 46
450k Training TCGA BRCA 96 breast breast carcinoma
ESCA 16 esophagus esophageal carcinoma
GBM 1 brain glioblastoma multiforme
HNSC 50 head and neck head and neck squamous cell carcinoma
KIRC 160 kidney kidney renal clear cell carcinoma
LIHC 48 liver liver hepatocellular carcinoma
LUAD 32 lung lung adenocarcinoma
LUSC 42 lung lung squamous cell carcinoma
PAAD 8 pancreas pancreatic adenocarcinoma
KIRP 45 kidney kidney renal papillary cell carcinoma
GEO GSE51057 177 blood all female
GSE41169 95 brain aging
GSE48472 45 multiple tissues (e.g. blood, hair, omentum)
GSE50498 48 skeletal muscle healthy
GSE41826 145 brain
GSE30870 40 blood newborns
GSE40360 46 brain multiple-sclerosis
GSE40279 656 whole blood
GSE48325 85 liver non-alcoholic fatty liver
Training Data Set 1,866
Test GEO GSE34639 48 CD4+-cells newborns
GSE53841 24 blood HIV-1 infection
GSE59457 130 brain HIV-1 infection
GSE67751 92 blood HIV-1 infection
GSE59505 24 multiple tissues (e.g semen,saliva)
GSE42861 689 blood rheumatoid arthritis
Test Data Set 1,007
RRBS Training Kiel NA 232 blood