RnBeads: Tcells

Exploratory Analysis

Parameter Overview

Option	Value
replicate.id.column	cellType
exploratory.columns	cellType, technology, individual
exploratory.top.dimensions	all
exploratory.principal.components	8
exploratory.correlation.pvalue.threshold	0.01
exploratory.correlation.permutations	10000
exploratory.correlation.qc	yes
exploratory.beta.distribution	yes
exploratory.intersample	no
exploratory.deviation.plots	no
exploratory.clustering	all
exploratory.clustering.top.sites	1000
exploratory.clustering.heatmaps.pdf	yes
distribution.subsample	1000000
exploratory.gene.symbols	CCR5, CSF2, GAS5, ZFAS1, RUNX1, RUNX3, ANPEP, CD22, CD28, CD34, CD36, ITGAM, ITGAV, PECAM1, THPO, FAS, IFNGR1, IRF6, JMJD6
exploratory.custom.loci.bed	default

Sample Groups

Trait	Number of groups
cellType	4
technology	2
individual	3

Region Annotations

In addition to CpG sites, there are 12 sets of genomic regions to be covered in the analysis. The table below gives a summary of these annotations.

Annotation	Description	Regions in the Dataset
tiling1kb	n.a.	2776884
cpgislands	CpG island track of the UCSC Genome browser	27177
genes	Ensembl genes, version Ensembl Genes 75	53059
promoters	Promoter regions of Ensembl genes, version Ensembl Genes 75	56533
51Hf0xBlxxCt.cssv1.20151105.enhmrg	Annotation extracted from file: 51_Hf0X_BlXX_Ct.CSSv1.20151105.EnhMrg.bed	213487
ensembleRegBuildBPall	Ensembl Regulatory build from BLUEPRINT data release 20150128 -- all	555717
ensembleRegBuildBPctcf	Ensembl Regulatory build from BLUEPRINT data release 20150128 -- ctcf	75501
ensembleRegBuildBPdistal	Ensembl Regulatory build from BLUEPRINT data release 20150128 -- distal	154159
ensembleRegBuildBPdnase	Ensembl Regulatory build from BLUEPRINT data release 20150128 -- dnase	41817
ensembleRegBuildBPproximal	Ensembl Regulatory build from BLUEPRINT data release 20150128 -- proximal	143864
ensembleRegBuildBPtfbs	Ensembl Regulatory build from BLUEPRINT data release 20150128 -- tfbs	114796
ensembleRegBuildBPtss	Ensembl Regulatory build from BLUEPRINT data release 20150128 -- tss	25580

Region length distributions

The plots below show region size distributions for the region types above.

Figure 1: Distribution of region lengths (one panel per region type).

Number of sites per region

The plots below show the distributions of the number of sites per region type.

Figure 2: Distribution of the number of sites per region (one panel per region type).

Region site distributions

The plots below show distributions of sites across the different region types.

Figure 3: Distribution of sites across regions (one panel per region type). Relative coordinates of 0 and 1 correspond to the start and end coordinates of a region, respectively. Coordinates smaller than 0 or greater than 1 denote flanking regions, normalized by region length.
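The relative coordinate described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the RnBeads implementation; the function name `relative_coordinate` and the example region are hypothetical.

```python
def relative_coordinate(position: int, start: int, end: int) -> float:
    """Map a genomic position to a coordinate relative to a region.

    0 corresponds to the region start and 1 to the region end. Values
    below 0 or above 1 fall in the flanks, scaled by the region length.
    """
    length = end - start
    if length <= 0:
        raise ValueError("region end must be greater than region start")
    return (position - start) / length


# Hypothetical region spanning positions 1000 to 2000.
inside = relative_coordinate(1500, 1000, 2000)      # middle of the region
upstream = relative_coordinate(500, 1000, 2000)     # left flank
downstream = relative_coordinate(2500, 1000, 2000)  # right flank
```

Because flank positions are divided by the same region length, a site half a region length upstream of the start maps to -0.5 regardless of how long the region is.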

Analysis of Sample Replicates

Sample replicates were compared. This section shows pairwise scatterplots for each sample replicate group on both site and region level.

Figure 4: Scatterplot for replicate methylation comparison (one panel per replicate pair and site/region level). The transparency corresponds to point density. The 1% of points in the most sparsely populated plot regions are drawn explicitly.

The following table contains Pearson correlation coefficients:

	sites	tiling1kb	cpgislands	genes	promoters	51Hf0xBlxxCt.cssv1.20151105.enhmrg	ensembleRegBuildBPall	ensembleRegBuildBPctcf	ensembleRegBuildBPdistal	ensembleRegBuildBPdnase	ensembleRegBuildBPproximal	ensembleRegBuildBPtfbs	ensembleRegBuildBPtss
X51_Hf03_BlCM_Ct_NOMe vs. X51_Hf03_BlCM_Ct_WGBS (TCM)	0.9199	0.8402	0.9941	0.9604	0.9834	0.9232	0.9003	0.9262	0.8453	0.8028	0.859	0.8914	0.9891
X51_Hf03_BlCM_Ct_NOMe vs. X51_Hf04_BlCM_Ct_NOMe (TCM)	0.9026	0.8077	0.9937	0.9484	0.9769	0.9044	0.8749	0.9096	0.8054	0.7656	0.8158	0.8642	0.9871
X51_Hf03_BlCM_Ct_NOMe vs. X51_Hf04_BlCM_Ct_WGBS (TCM)	0.9199	0.8367	0.993	0.9575	0.9824	0.9207	0.8982	0.9245	0.8445	0.8008	0.8579	0.8896	0.987
X51_Hf03_BlCM_Ct_WGBS vs. X51_Hf04_BlCM_Ct_NOMe (TCM)	0.9141	0.8062	0.9958	0.9477	0.9766	0.9032	0.8747	0.901	0.8066	0.7724	0.8131	0.8698	0.9844
X51_Hf03_BlCM_Ct_WGBS vs. X51_Hf04_BlCM_Ct_WGBS (TCM)	0.9385	0.9359	0.9973	0.9761	0.9945	0.9583	0.9498	0.9695	0.924	0.8878	0.9424	0.9375	0.9962
X51_Hf04_BlCM_Ct_NOMe vs. X51_Hf04_BlCM_Ct_WGBS (TCM)	0.9182	0.809	0.9968	0.9483	0.9775	0.9051	0.8777	0.9039	0.8114	0.7794	0.8179	0.8738	0.9854
X51_Hf03_BlEM_Ct_NOMe vs. X51_Hf03_BlEM_Ct_WGBS (TEM)	0.9054	0.8443	0.9917	0.9531	0.98	0.9166	0.8933	0.9163	0.8555	0.8217	0.8722	0.8851	0.9883
X51_Hf03_BlEM_Ct_NOMe vs. X51_Hf04_BlEM_Ct_NOMe (TEM)	0.8947	0.8329	0.9947	0.9469	0.9778	0.9092	0.8821	0.9132	0.8413	0.8005	0.859	0.8713	0.9903
X51_Hf03_BlEM_Ct_NOMe vs. X51_Hf04_BlEM_Ct_WGBS (TEM)	0.9072	0.8416	0.9917	0.9522	0.9793	0.9152	0.8921	0.9151	0.8551	0.8188	0.8696	0.8845	0.987
X51_Hf03_BlEM_Ct_WGBS vs. X51_Hf04_BlEM_Ct_NOMe (TEM)	0.9081	0.8411	0.9937	0.9499	0.9784	0.9135	0.8895	0.9124	0.8516	0.8211	0.8665	0.8842	0.9874
X51_Hf03_BlEM_Ct_WGBS vs. X51_Hf04_BlEM_Ct_WGBS (TEM)	0.9318	0.94	0.9967	0.9729	0.9932	0.9547	0.947	0.9649	0.9281	0.9034	0.9462	0.935	0.9949
X51_Hf04_BlEM_Ct_NOMe vs. X51_Hf04_BlEM_Ct_WGBS (TEM)	0.9146	0.8449	0.9957	0.9535	0.98	0.9179	0.8943	0.9162	0.8592	0.8242	0.8726	0.8891	0.9901
X51_Hf03_BlTN_Ct_NOMe vs. X51_Hf03_BlTN_Ct_WGBS (TN)	0.9474	0.9103	0.996	0.9813	0.9937	0.9574	0.9448	0.9556	0.8952	0.8129	0.9127	0.9328	0.9947
X51_Hf03_BlTN_Ct_NOMe vs. X51_Hf04_BlTN_Ct_NOMe (TN)	0.9401	0.8873	0.996	0.978	0.9921	0.9494	0.9327	0.9501	0.8723	0.7748	0.8893	0.9189	0.9939
X51_Hf03_BlTN_Ct_NOMe vs. X51_Hf04_BlTN_Ct_WGBS (TN)	0.9452	0.9061	0.9939	0.9794	0.9925	0.9552	0.9426	0.9543	0.8911	0.8066	0.9094	0.9309	0.9921
X51_Hf03_BlTN_Ct_WGBS vs. X51_Hf04_BlTN_Ct_NOMe (TN)	0.9428	0.8893	0.9957	0.9763	0.9917	0.9459	0.9311	0.9434	0.8707	0.7866	0.8858	0.9191	0.9919
X51_Hf03_BlTN_Ct_WGBS vs. X51_Hf04_BlTN_Ct_WGBS (TN)	0.9561	0.9497	0.9968	0.9855	0.9967	0.9713	0.9651	0.9779	0.9325	0.8596	0.9487	0.9517	0.9961
X51_Hf04_BlTN_Ct_NOMe vs. X51_Hf04_BlTN_Ct_WGBS (TN)	0.9457	0.893	0.997	0.9785	0.9924	0.9492	0.9348	0.9466	0.8774	0.7981	0.8915	0.9242	0.9929
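Each entry in the table above is a Pearson correlation coefficient between the methylation values of two samples. The sketch below shows how such a coefficient is computed for two vectors of methylation values. The values are made up for illustration, and dropping pairs with a missing value is an assumption rather than documented RnBeads behaviour.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences.

    Pairs in which either value is None (missing) are skipped; this
    handling of missing values is an assumption made for illustration.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two complete pairs")
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Made-up methylation values (beta values in [0, 1]) for two replicates.
replicate_a = [0.05, 0.10, None, 0.80, 0.95, 0.50]
replicate_b = [0.07, 0.12, 0.30, 0.78, 0.97, 0.45]
r = pearson(replicate_a, replicate_b)
```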

Low-dimensional Representation

Dimension reduction is used to visually inspect the dataset for a strong signal in the methylation values that is related to the samples' clinical or batch processing annotation. RnBeads implements two methods for dimension reduction: principal component analysis (PCA) and multidimensional scaling (MDS).

One or more of the methylation matrices were augmented before applying the dimension reduction techniques because they contain missing values. The column Missing lists the number of dimensions ignored due to missing values. In the case of MDS, dimensions are ignored only if they contain missing values for all samples. In contrast, sites or regions with missing values in any sample are ignored prior to PCA.

Sites/regions	Technique	Dimensions	Missing	Selected
sites	MDS	26117504	0	26117504
sites	PCA	26117504	13718327	12399177
tiling1kb	MDS	2776884	0	2776884
tiling1kb	PCA	2776884	480878	2296006
cpgislands	MDS	27177	0	27177
cpgislands	PCA	27177	179	26998
genes	MDS	53059	0	53059
genes	PCA	53059	3985	49074
promoters	MDS	56533	0	56533
promoters	PCA	56533	1204	55329
51Hf0xBlxxCt.cssv1.20151105.enhmrg	MDS	213487	0	213487
51Hf0xBlxxCt.cssv1.20151105.enhmrg	PCA	213487	37242	176245
ensembleRegBuildBPall	MDS	555717	0	555717
ensembleRegBuildBPall	PCA	555717	103516	452201
ensembleRegBuildBPctcf	MDS	75501	0	75501
ensembleRegBuildBPctcf	PCA	75501	10173	65328
ensembleRegBuildBPdistal	MDS	154159	0	154159
ensembleRegBuildBPdistal	PCA	154159	30191	123968
ensembleRegBuildBPdnase	MDS	41817	0	41817
ensembleRegBuildBPdnase	PCA	41817	16478	25339
ensembleRegBuildBPproximal	MDS	143864	0	143864
ensembleRegBuildBPproximal	PCA	143864	13826	130038
ensembleRegBuildBPtfbs	MDS	114796	0	114796
ensembleRegBuildBPtfbs	PCA	114796	32783	82013
ensembleRegBuildBPtss	MDS	25580	0	25580
ensembleRegBuildBPtss	PCA	25580	65	25515
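The two filtering rules behind the Missing column can be sketched as follows. The toy matrix and the function names are hypothetical; rows stand for sites or regions, columns for samples, and `None` marks a missing value.

```python
def select_for_pca(matrix):
    """Keep only rows with no missing value in any sample."""
    return [row for row in matrix if all(v is not None for v in row)]


def select_for_mds(matrix):
    """Drop only rows in which every sample is missing."""
    return [row for row in matrix if any(v is not None for v in row)]


# Toy methylation matrix: 4 sites (rows) x 3 samples (columns).
toy = [
    [0.10, 0.20, 0.15],   # complete: kept by both
    [0.90, None, 0.85],   # partly missing: kept for MDS, dropped for PCA
    [None, None, None],   # fully missing: dropped by both
    [0.50, 0.55, 0.45],   # complete: kept by both
]
pca_rows = select_for_pca(toy)
mds_rows = select_for_mds(toy)

# In the table above, Selected equals Dimensions minus Missing,
# for example for sites under PCA:
sites_selected_pca = 26117504 - 13718327
```

The stricter PCA rule explains why more than half of the sites are dropped for PCA while none are dropped for MDS.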

Multidimensional Scaling

The scatter plot below visualizes the samples transformed into a two-dimensional space using MDS.

Figure 5: Scatter plot showing samples after performing Kruskal's non-metric multidimensional scaling (panels vary by location type, distance, sample representation and sample color).

Principal Component Analysis

Similarly, the figure below shows the values of selected principal components in a scatter plot.

Figure 6: Scatter plot showing the samples' coordinates on principal components (panels vary by location type, principal components, sample representation and sample color).

The figure below shows the cumulative distribution functions of variance explained by the principal components.

Figure 7: Cumulative distribution function of the percentage of variance explained (one panel per location type).

The table below gives, for each location type, the number of principal components needed to explain at least 95 percent of the total variance. The full tables of variances explained by all components are available in comma-separated values files accompanying this report.

Location Type	Number of Components	Full Table File
sites	10	csv
tiling1kb	7	csv
cpgislands	9	csv
genes	7	csv
promoters	6	csv
51Hf0xBlxxCt.cssv1.20151105.enhmrg	9	csv
ensembleRegBuildBPall	8	csv
ensembleRegBuildBPctcf	9	csv
ensembleRegBuildBPdistal	8	csv
ensembleRegBuildBPdnase	8	csv
ensembleRegBuildBPproximal	7	csv
ensembleRegBuildBPtfbs	9	csv
ensembleRegBuildBPtss	7	csv
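The component counts in the table above follow from the cumulative fraction of variance explained. A minimal sketch, assuming the per-component variances are already sorted in decreasing order as PCA returns them (the numbers below are made up, not taken from the report's files):

```python
def components_needed(variances, threshold=0.95):
    """Smallest number of leading components whose cumulative share of
    the total variance reaches the threshold."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    for k, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    return len(variances)


# Made-up variances per principal component, in decreasing order.
example_variances = [50, 30, 10, 6, 4]
k95 = components_needed(example_variances)
```

With these values the cumulative shares are 0.50, 0.80, 0.90 and 0.96, so four components are needed to reach 95 percent.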

Batch Effects

In this section, different properties of the dataset are tested for significant associations. The properties can include sample coordinates in the principal component space, phenotype traits and intensities of control probes. The test used to calculate a p-value for two properties depends on the type of data:

If both properties contain categorical data (e.g. tissue type and sample processing date), the test of choice is a two-sided Fisher's exact test.
If both properties contain numerical data (e.g. coordinates in the first principal component and age of individual), the correlation coefficient between the traits is computed. A p-value is estimated using permutation tests with 10000 permutations.
If property A is categorical and property B contains numeric data, the p-value for association is calculated by comparing the values of B for the different categories in A. The test of choice is a two-sided Wilcoxon rank sum test (when A defines two categories) or a Kruskal-Wallis one-way analysis of variance (when A separates the samples into three or more categories).
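The selection rules above can be summarized as a small decision function. This is only a sketch of the logic as described; the function name, the `kind` labels and the returned strings are hypothetical and do not correspond to any RnBeads API.

```python
def choose_test(kind_a, kind_b, n_categories=None):
    """Return the name of the association test for two properties.

    kind_a and kind_b are "categorical" or "numeric". For the mixed
    case, n_categories is the number of categories defined by the
    categorical property.
    """
    kinds = {kind_a, kind_b}
    if not kinds <= {"categorical", "numeric"}:
        raise ValueError("kind must be 'categorical' or 'numeric'")
    if kinds == {"categorical"}:
        return "Fisher's exact test"
    if kinds == {"numeric"}:
        return "Correlation + permutation test"
    if n_categories is None or n_categories < 2:
        raise ValueError("mixed case needs at least two categories")
    if n_categories == 2:
        return "Wilcoxon rank sum test"
    return "Kruskal-Wallis one-way analysis of variance"


# Traits of this report tested against (numeric) principal components:
cell_type_test = choose_test("categorical", "numeric", n_categories=4)
technology_test = choose_test("categorical", "numeric", n_categories=2)
individual_test = choose_test("categorical", "numeric", n_categories=3)
```

Applied to the group counts listed for this dataset (cellType: 4, technology: 2, individual: 3), the rules give Kruskal-Wallis, Wilcoxon and Kruskal-Wallis, respectively.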

Note that the p-values presented in this report are not corrected for multiple testing.
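For two numeric properties, the permutation-based p-value mentioned above can be sketched as follows. The example data, the two-sided comparison on the absolute correlation and the add-one estimator are assumptions made for illustration; the report only states that 10000 permutations are used.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_permutations=10000, seed=0):
    """Estimate a two-sided p-value for the correlation of x and y by
    shuffling y and counting correlations at least as extreme."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one estimator keeps the p-value strictly positive (an assumption,
    # not necessarily the estimator used by RnBeads).
    return (hits + 1) / (n_permutations + 1)


# Made-up example: coordinates on the first principal component vs. age.
pc1 = [-2.1, -1.4, -0.6, 0.1, 0.8, 1.5, 2.2, 3.0]
age = [21, 25, 30, 34, 41, 47, 52, 60]
p_value = permutation_p_value(pc1, age, n_permutations=2000)
```

With this estimator the smallest attainable p-value is 1 / (n_permutations + 1), so 10000 permutations resolve p-values down to roughly 0.0001.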

Associations between Principal Components and Traits

The computed sample coordinates in the principal component space were tested for association with the specified traits. Below is a list of the traits and the tests performed.

Trait	Test
cellType	Kruskal-Wallis
technology	Wilcoxon
individual	Kruskal-Wallis

The heatmap below summarizes the results of the tests performed for associations. Significant p-values (values less than 0.01) are displayed on a pink background.

Figure 8: Heatmap presenting a table of p-values (one panel per region type). Significant p-values (less than 0.01) are printed in pink boxes. Non-significant values are represented by blue boxes. Bright grey cells, if present, denote missing values.

The full tables of p-values for each location type are available in CSV (comma-separated value) files below.

Location Type	File Name
sites	csv
tiling1kb	csv
cpgislands	csv
genes	csv
promoters	csv
51Hf0xBlxxCt.cssv1.20151105.enhmrg	csv
ensembleRegBuildBPall	csv
ensembleRegBuildBPctcf	csv
ensembleRegBuildBPdistal	csv
ensembleRegBuildBPdnase	csv
ensembleRegBuildBPproximal	csv
ensembleRegBuildBPtfbs	csv
ensembleRegBuildBPtss	csv

Associations between Traits

This section summarizes the associations between pairs of traits.

The figure below visualizes the tests that were performed on trait pairs based on the description provided above. In addition, the calculated p-values for associations between traits are shown. Significant p-values (values less than 0.01) are displayed on a pink background. The full table of p-values is available in a dedicated file that accompanies this report.

Figure 9:
(1) Table of performed tests on pairs of traits. Test names (Correlation + permutation test, Fisher's exact test, Wilcoxon rank sum test and/or Kruskal-Wallis one-way analysis of variance) are color-coded according to the legend given above.
(2) Table of resulting p-values from the performed tests on pairs of traits. Significant p-values (less than 0.01) are printed in pink boxes. Non-significant values are represented by blue boxes. White cells, if present, denote missing values.

Methylation Value Distributions

Clustering

The figure below shows clustering of samples using several algorithms and distance metrics.

Figure 12: Hierarchical clustering of samples based on all methylation values (panels vary by site/region level, dissimilarity metric, agglomeration strategy (linkage) and sample color). The heatmap displays methylation percentiles per sample. The legend for sample coloring can be found in the figure below.
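The clusterings in this section are built on three dissimilarity metrics between samples: correlation-based, Manhattan and Euclidean. The sketch below shows one common definition of each. Taking the correlation-based dissimilarity as one minus the Pearson correlation is an assumption; the report does not state the exact formula RnBeads uses.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two methylation profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan (city block) distance between two methylation profiles."""
    return sum(abs(a - b) for a, b in zip(x, y))


def correlation_dissimilarity(x, y):
    """One minus the Pearson correlation (assumed definition)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return 1.0 - sxy / math.sqrt(sxx * syy)


# Made-up methylation profiles of two samples over four sites.
sample_1 = [0.10, 0.40, 0.70, 0.90]
sample_2 = [0.20, 0.50, 0.80, 1.00]
d_euclidean = euclidean(sample_1, sample_2)
d_manhattan = manhattan(sample_1, sample_2)
d_correlation = correlation_dissimilarity(sample_1, sample_2)
```

The two example profiles differ by a constant offset, so their correlation-based dissimilarity is close to zero while the Manhattan and Euclidean distances are not. This is one reason the metrics can yield different cluster counts.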

Figure 13: Hierarchical clustering of samples based on all methylation values (panels vary by site/region level, dissimilarity metric, agglomeration strategy (linkage), sample color, site/region color and visualization). The heatmap displays only selected sites/regions with the highest variance across all samples. The legend for locus and sample coloring can be found in the figure below.
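Selecting the sites or regions with the highest variance across samples, as done for the heatmap above, can be sketched as follows. The toy matrix and the function name are hypothetical; in this report the number of selected sites is set by the option exploratory.clustering.top.sites (1000).

```python
def top_variable_rows(matrix, top_n):
    """Return the indices of the top_n rows with the highest sample
    variance, in their original order."""
    def variance(row):
        mean = sum(row) / len(row)
        return sum((v - mean) ** 2 for v in row) / (len(row) - 1)

    ranked = sorted(range(len(matrix)),
                    key=lambda i: variance(matrix[i]),
                    reverse=True)
    return sorted(ranked[:top_n])


# Toy methylation matrix: 4 sites (rows) x 3 samples (columns).
toy = [
    [0.1, 0.1, 0.1],   # constant across samples
    [0.0, 0.5, 1.0],   # highly variable
    [0.4, 0.5, 0.6],   # slightly variable
    [0.9, 0.1, 0.9],   # highly variable
]
selected = top_variable_rows(toy, top_n=2)
```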

Figure 14: Probe and sample colors used in the heatmaps in the previous figures (panels vary by site/region level, sample color and site/region color).

Identified Clusters

Using the average silhouette value as a measure of cluster assignment [1], it is possible to infer the number of clusters produced by each of the studied methods. The figure below shows the corresponding mean silhouette value for every observed separation into clusters.

Figure 15: Line plot visualizing mean silhouette values of the clustering algorithm outcomes for each applicable value of K (number of clusters). Panels vary by site/region level and dissimilarity metric.
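The mean silhouette value used above to compare separations into K clusters can be sketched as follows. This is a plain reimplementation of the standard silhouette definition on a precomputed distance matrix, not RnBeads code; the one-dimensional example data are made up.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette value for a clustering.

    dist is a symmetric matrix of pairwise distances and labels assigns
    a cluster to each sample. Samples in singleton clusters contribute 0.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # silhouette of a singleton is defined as 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Made-up one-dimensional data forming two obvious groups.
values = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in values] for a in values]
good = mean_silhouette(dist, [0, 0, 1, 1])   # matches the two groups
bad = mean_silhouette(dist, [0, 1, 0, 1])    # splits both groups
```

The number of clusters reported for a method is then the K whose separation gives the highest mean silhouette value.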

The table below summarizes the number of clusters identified by the algorithms.

Metric	Algorithm	Clusters
correlation-based	hierarchical (average linkage)	3
correlation-based	hierarchical (complete linkage)	3
correlation-based	hierarchical (median linkage)	3
Manhattan distance	hierarchical (average linkage)	2
Manhattan distance	hierarchical (complete linkage)	2
Manhattan distance	hierarchical (median linkage)	3
Euclidean distance	hierarchical (average linkage)	2
Euclidean distance	hierarchical (complete linkage)	2
Euclidean distance	hierarchical (median linkage)	3

Clusters and Traits

The figure below shows associations between clusterings and the examined traits. Associations are quantified using the adjusted Rand index [2]. Values near 1 indicate high agreement, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than expected by chance. The full table of all computed indices is stored in comma-separated files accompanying this report.

Figure 16: Heatmap visualizing Rand indices computed between sample traits (rows) and clustering algorithm outcomes (columns). Panels vary by site/region level and dissimilarity metric.
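The adjusted Rand index between a trait and a clustering can be sketched with the standard pair-counting formula. The labels below are made up, and the function is an illustration rather than the routine used by RnBeads.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    # Pairs of samples grouped together in both labelings.
    together_both = sum(comb(c, 2)
                        for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together in each labeling on its own.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (together_both - expected) / (maximum - expected)


# Made-up trait (e.g. cell type) and two hypothetical clusterings.
trait = ["TCM", "TCM", "TN", "TN"]
matching = [2, 2, 1, 1]      # same grouping, different label names
crossing = [1, 2, 1, 2]      # splits every trait group
ari_matching = adjusted_rand_index(trait, matching)
ari_crossing = adjusted_rand_index(trait, crossing)
```

The index depends only on which samples are grouped together, not on the label names, so a clustering that reproduces the trait groups scores 1 even if its cluster numbers differ.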
