RnBeads

RnBeads' age prediction module employs machine learning to infer age information from DNA methylation data. It can predict age from methylation data obtained by virtually any protocol that provides single base pair resolution methylation calls, extending previously developed methods for epigenetic age inference. In addition to Illumina's methylation array series, this is the first method to support bisulfite-sequencing assays. Here we outline the methods employed and describe the data sets that were used to infer predictors for different technologies in detail. A detailed tutorial on how to use epigenetic age prediction in RnBeads can be found here.

Workflow

Model description and patametrization

The age prediction module employs elastic net regression using the glmnet R-package. A transformed version of the annotated age as suggested by Horvath et al. is used as response variable to reflect different pacing in epigenetic changes in younger and older ages, while DNA methylation values at single CpGs are used as features.

Pre-defined Predictors

Since prediction accuracy increases with the number of analyzed samples, we collected methylation data from multiple, publicly available studies to provide users with a basis for annotating their own samples (see below). From these data sets, we compiled training data sets and inferred age predictors using elastic net regression as described above. The resulting pre-defined predictors are available as CSV files in the RnBeads package and can be directly applied to user-provided data sets.

Evaluation of pre-defined age predictors

Predictors were trained based on data from the Illumina Infinium HumanMethylation27 BeadChip, its successor the Infinium 450 BeadChip and Reduced-Representation Bisulfite Sequencing (RRBS). Note that we did not include the Illumina MethylationEPIC BeadChip and Whole-Genome Bisulfite-Sequencing data due to the lack of large representative data sets (hundreds of age-annotated training samples are required). However, first applications of the pre-defined predictors to EPIC data showed performance consistent with what we describe here.

Infinium 27k BeadChip

The Infinium 27k BeadChip training data set comprised 2,286 sample from 6 different studies, collected from the gene expression omnibus (GEO). A detailed description of the studied can be found in the table below. Mainly blood samples, but also colon samples are contained in the training data set, while the test data set only consists of blood samples. Applying the above workflow, we detected 366 CpG sites to be predictive of the donor’s age. Evaluation of the pre-defined predictor was based on 10-fold cross validation on the training data set and on the independent test data set. While cross validation resulted in a correlation of 0.958 and a median absolute difference (MADiff) of 3.23 years between annotated and predicted ages, performance on the test data set was 86 % correlation and MADiff 5.13.

Infinium 450k BeadChip

Similarly, the training data set consisted of 1,866 samples from 20 studies, either from GEO or TCGA, and contained ages between 0 and 103 years. In addition to the majority of blood samples, kidney and brain tissues were included. An independent test data set comprising 1,007 mainly blood samples was used to evaluate performance of the pre-defined predictor, which selected 761 age-associated CpGs. In line with the findings for its predecessor chip, cross validation proved high predictive performance (correlation 0.95, MADiff 2.8 years) underpinned by the independent test data set (0.89 and 2.94 years).

RRBS

This data set contains blood samples from 232 individuals of German origin covering an age range of 20-103 years. No independent test data set was available due to the still limited number of bisulfite studies addressing methylation changes associated with healthy human aging. Nevertheless, the feature selection method employed by RnBeads determines 156 sites associated with the aging process and cross validation showed reasonable predictive performance (correlation 0.93, MADiff 6.37 years)

Data set description

The table below describes the data sets used in RnBeads' age prediction module both for training and testing purposes.

Data Type	Use	Source	Accession	Size	Tissue	Donor characteristic
27k	Training	GEO	GSE56606	90	CD14-cells	diabetic
			GSE41037	720	blood, brain	healthy
			GSE48988	178	colon	aspirin intake
			GSE58119	282	blood serum	BRCA-1 mutation
			GSE19711	540	whole blood	healthy
			GSE27097	398	cord blood	pediatric
	Training Data Set			2,289
	Test	GEO	GSE44763	46	brain	obese
	Test Data Set			46

450k	Training	TCGA	BRCA	96	breast	breast carcinoma
			ESCA	16	esophagus	esophageal carcinoma
			GBM	1	brain	glioblastoma multiforme
			HNSC	50	head and neck	head and neck squamous cell carcinoma
			KIRC	160	kidney	kidney renal clear cell carcinoma
			LIHC	48	liver	liver hepatocellular carcinoma
			LUAD	32	lung	lung adenocarcinoma
			LUSC	42	lung	lung squamous cell carcinoma
			PAAD	8	pancreas	pancreatic adenocarcinoma
			KIRP	45	kidney	kidney renal papillary cell carcinoma
		GEO	GSE51057	177	blood	all female
			GSE41169	95	brain	aging
			GSE48472	45	multiple tissues (e.g. blood, hair, omentum)
			GSE50498	48	skeletal muscle	healthy
			GSE41826	145	brain
			GSE30870	40	blood	newborns
			GSE40360	46	brain	multiple-sclerosis
			GSE40279	656	whole blood
			GSE48325	85	liver	non-alcoholic fatty liver
	Training Data Set			1,866
	Test	GEO	GSE34639	48	CD4+-cells	newborns
			GSE53841	24	blood	HIV-1 infection
			GSE59457	130	brain	HIV-1 infection
			GSE67751	92	blood	HIV-1 infection
			GSE59505	24	multiple tissues (e.g semen,saliva)
			GSE42861	689	blood	rheumatoid arthritis
	Test Data Set			1,007

RRBS	Training	Kiel	NA	232	blood

Epigenetic Age Prediction