Exploratory Analysis

Sample Groups

The specified traits were tested based on criteria for defining sample groups. The table below summarizes these traits.

Trait Number of groups
description 19
tissue 2
differentiation_level 3
blood_lineage 3

Region Annotations

In addition to CpG sites, there are 4 sets of genomic regions to be covered in the analysis. The table below gives a summary of these annotations.

Annotation   Description                                                   Regions in the Dataset
tiling       Genome tiling regions of length 5000                          165008
genes        Ensembl genes, version Ensembl Genes 67                       21277
promoters    Promoter regions of Ensembl genes, version Ensembl Genes 67   19021
cpgislands   CpG island track of the UCSC Genome Browser                   14293

Region length distributions

The plots below show region size distributions for the region types above.

Figure 1. Distribution of region lengths. Plot option: region type.

Number of sites per region

The plots below show the distributions of the number of sites per region type.

Figure 2. Distribution of the number of sites per region. Plot option: region type.

Region site distributions

The plots below show distributions of sites across the different region types.

Figure 3. Distribution of sites across regions. Relative coordinates of 0 and 1 correspond to the start and end coordinates of a region, respectively. Coordinates smaller than 0 and greater than 1 denote flanking regions normalized by region length. Plot option: region type.
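The relative coordinate described in the caption can be illustrated with a minimal sketch. The function name and the example region are hypothetical, and the formula is the straightforward reading of the caption, not code taken from the report's software.

```python
def relative_coordinate(position, start, end):
    """Map a genomic position to a region-relative coordinate.

    0 is the region start, 1 is the region end; values below 0 or
    above 1 fall in the flanks, measured in units of region length.
    """
    length = end - start
    if length <= 0:
        raise ValueError("region must have positive length")
    return (position - start) / length


# Hypothetical region spanning positions 1000 to 2000.
print(relative_coordinate(1000, 1000, 2000))  # 0.0 (region start)
print(relative_coordinate(2000, 1000, 2000))  # 1.0 (region end)
print(relative_coordinate(500, 1000, 2000))   # -0.5 (upstream flank)
print(relative_coordinate(2500, 1000, 2000))  # 1.5 (downstream flank)
```

Normalizing by region length is what allows regions of very different sizes to be overlaid in a single plot.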

Analysis of Sample Replicates

Sample replicates were compared. This section shows pairwise scatterplots for each sample replicate group at both the site and region levels.

Figure 4. Scatterplot for replicate methylation comparison. The transparency corresponds to point density. The 1% of the points in the most sparsely populated plot regions are drawn explicitly. Plot options: replicate, site/region.

The following table contains Pearson correlation coefficients:

Replicate pair (sample group) sites tiling genes promoters cpgislands
ABSC_1 vs. ABSC_2 (Anagen (activated) bulge stem cell) 0.9835 0.9758 0.9911 0.9914 0.994
B_cell_1 vs. B_cell_2 (B-cell) 0.9875 0.9772 0.9922 0.9928 0.9953
CLP_1 vs. CLP_2 (Common lymphoid progenitor) 0.9867 0.9726 0.989 0.9915 0.9946
CMP_1 vs. CMP_2 (Common myeloid progenitor) 0.9886 0.9787 0.9933 0.9937 0.9957
CLDC_1 vs. CLDC_2 (Companion layer differentiated cell) 0.9746 0.9628 0.987 0.9867 0.9912
CD8_1 vs. CD8_2 (Cytotoxic T cell (CD8+)) 0.9865 0.9764 0.9925 0.9928 0.9942
EDif_1 vs. EDif_2 (Epidermis differentiated cell) 0.9816 0.9727 0.9911 0.9906 0.9934
EPro_1 vs. EPro_2 (Epidermis progenitor cell) 0.9794 0.97 0.9904 0.9894 0.9914
Eryth_1 vs. Eryth_2 (Erythrocyte) 0.9459 0.8855 0.9543 0.9674 0.9833
Gran_1 vs. Gran_2 (Granulocyte) 0.9849 0.9681 0.9886 0.99 0.9944
GMP_1 vs. GMP_2 (Granulocyte-monocyte progenitor) 0.9883 0.9758 0.9905 0.9921 0.9956
HSC_1 vs. HSC_2 (Hematopoietic stem cell) 0.9802 0.9611 0.9845 0.9866 0.9897
MTAC_1 vs. MTAC_2 (Matrix/transit-amplifying cell) 0.9644 0.9418 0.9784 0.9792 0.9866
MEP_1 vs. MEP_2 (Megakaryocyte-erythroid progenitor) 0.9834 0.9653 0.9899 0.9912 0.9959
Mono_1 vs. Mono_2 (Monocyte) 0.9743 0.9532 0.9806 0.9834 0.9892
MPP1_1 vs. MPP1_2 (Multipotent progenitor 1 (Flk2-)) 0.991 0.9837 0.9947 0.9951 0.9969
MPP2_1 vs. MPP2_2 (Multipotent progenitor 2 (Flk2+)) 0.99 0.9809 0.9928 0.9938 0.9967
CD4_1 vs. CD4_2 (T helper cell (CD4+)) 0.9887 0.9786 0.9927 0.9929 0.996
TBSC_1 vs. TBSC_2 (Telogen (quiescent) bulge stem cell) 0.9771 0.9616 0.9848 0.9852 0.9919
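As an illustration of how one entry of the table above could be computed, here is a minimal sketch of the Pearson correlation between two replicates. It assumes that positions with a missing value in either replicate are skipped (pairwise-complete observations); the report does not state how missing values were handled, and the function name and toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation over positions observed in both replicates.

    Missing values are represented as None and skipped pairwise
    (an assumption; the report does not describe this step).
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Toy methylation values for two hypothetical replicates.
replicate_1 = [0.10, 0.85, None, 0.40, 0.95]
replicate_2 = [0.12, 0.80, 0.50, 0.45, 0.97]
print(round(pearson(replicate_1, replicate_2), 4))
```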

Low-dimensional Representation

Dimension reduction is used to visually inspect the dataset for a strong signal in the methylation values that is related to samples' clinical or batch processing annotation. RnBeads implements two methods for dimension reduction - principal component analysis (PCA) and multidimensional scaling (MDS).

One or more of the methylation matrices was augmented before applying the dimension reduction techniques because it contains missing values. The column Missing lists the number of dimensions ignored due to missing values. In the case of MDS, dimensions are ignored only if they contain missing values for all samples. In contrast, sites or regions with missing values in any sample are ignored prior to PCA.

Sites/regions Technique Dimensions Missing Selected
sites MDS 1353082 0 1353082
sites PCA 1353082 663891 689191
tiling MDS 165008 0 165008
tiling PCA 165008 38179 126829
genes MDS 21277 0 21277
genes PCA 21277 1387 19890
promoters MDS 19021 0 19021
promoters PCA 19021 1860 17161
cpgislands MDS 14293 0 14293
cpgislands PCA 14293 665 13628
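The two filtering rules described above can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming a matrix stored as one row per site or region and one column per sample, with None marking a missing value. The second part checks that, for the PCA rows of the table above, Selected equals Dimensions minus Missing.

```python
def select_for_mds(matrix):
    """Keep rows unless they are missing in ALL samples."""
    return [row for row in matrix if any(v is not None for v in row)]


def select_for_pca(matrix):
    """Keep only rows with no missing value in ANY sample."""
    return [row for row in matrix if all(v is not None for v in row)]


# Toy matrix: 4 sites x 3 samples.
toy = [
    [0.1, 0.2, 0.3],     # complete: kept by both rules
    [0.5, None, 0.7],    # partially missing: kept for MDS only
    [None, None, None],  # entirely missing: dropped by both rules
    [0.9, 0.8, 0.9],     # complete: kept by both rules
]
print(len(select_for_mds(toy)), len(select_for_pca(toy)))  # 3 2

# (level, dimensions, missing, selected) for PCA, copied from the table.
REPORTED_PCA = [
    ("sites", 1353082, 663891, 689191),
    ("tiling", 165008, 38179, 126829),
    ("genes", 21277, 1387, 19890),
    ("promoters", 19021, 1860, 17161),
    ("cpgislands", 14293, 665, 13628),
]
for level, dimensions, missing, selected in REPORTED_PCA:
    assert dimensions - missing == selected, level
```

Note that nearly half of the sites (663891 of 1353082) are excluded from PCA under the stricter rule, whereas MDS retains all of them.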

Multidimensional Scaling

The scatter plot below visualizes the samples transformed into a two-dimensional space using MDS.

Figure 5. Scatter plot showing samples after performing Kruskal's non-metric multidimensional scaling. Plot options: location type, distance, sample representation, sample color.

Principal Component Analysis

Similarly, the figure below shows the values of selected principal components in a scatter plot.

Figure 6. Scatter plot showing the samples' coordinates on principal components. Plot options: location type, principal components, sample representation, sample color.

The figure below shows the cumulative distribution functions of variance explained by the principal components.

Figure 7. Cumulative distribution function of the percentage of variance explained. Plot option: location type.

The table below gives, for each location type, the number of principal components needed to explain at least 95 percent of the total variance. The full tables of variances explained by all components are available in comma-separated values files accompanying this report.

Location Type Number of Components Full Table File
sites 24 csv
tiling 20 csv
genes 20 csv
promoters 19 csv
cpgislands 15 csv
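The component counts in the table above follow from the cumulative share of variance explained. A minimal sketch, assuming the per-component variances are available as a list (the function name and the toy values are hypothetical, not taken from the accompanying files):

```python
def components_needed(variances, threshold=0.95):
    """Smallest number of leading components whose cumulative share
    of the total variance reaches the threshold."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    # Components are taken in order of decreasing variance.
    for k, v in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += v
        if cumulative / total >= threshold:
            return k
    return len(variances)


# Toy variances: cumulative shares are 60%, 85%, 97%, 99%, 100%.
print(components_needed([60, 25, 12, 2, 1]))  # 3
```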

Batch Effects

In this section, different properties of the dataset are tested for significant associations. The properties can include sample coordinates in the principal component space, phenotype traits and intensities of control probes. The test used to calculate a p-value for a pair of properties depends on the type of data in each property.

Note that the p-values presented in this report are not corrected for multiple testing.

Associations between Principal Components and Traits

The computed sample coordinates in the principal component space were tested for association with the specified traits. Below is a list of the traits and the tests performed.

Trait Test
description Kruskal-Wallis
tissue Wilcoxon
differentiation_level Kruskal-Wallis
blood_lineage Kruskal-Wallis
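The test assignments above are consistent with the group counts listed in the Sample Groups table: the only trait with two groups is tested with Wilcoxon, and traits with more than two groups with Kruskal-Wallis. The sketch below encodes that rule as inferred from the two tables; it is an assumption about the selection logic, and the function name is hypothetical.

```python
# Number of groups per trait, copied from the Sample Groups table.
TRAIT_GROUPS = {
    "description": 19,
    "tissue": 2,
    "differentiation_level": 3,
    "blood_lineage": 3,
}


def choose_test(n_groups):
    """Pick a rank-based test for a numeric variable (a principal
    component) against a categorical trait with n_groups groups."""
    if n_groups < 2:
        raise ValueError("a trait needs at least two groups to be tested")
    return "Wilcoxon" if n_groups == 2 else "Kruskal-Wallis"


for trait, n_groups in TRAIT_GROUPS.items():
    print(trait, choose_test(n_groups))
```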

The heatmap below summarizes the results of permutation tests performed for associations. Significant p-values (values less than 0.01) are displayed on a pink background.

Figure 8. Heatmap presenting a table of p-values. Significant p-values (less than 0.01) are printed in pink boxes. Non-significant values are represented by blue boxes. Bright grey cells, if present, denote missing values. Plot option: region type.

The full tables of p-values for each location type are available in CSV (comma-separated value) files below.

Location Type File Name
sites csv
tiling csv
genes csv
promoters csv
cpgislands csv

Associations between Traits

This section summarizes the associations between pairs of traits.

The figure below visualizes the tests that were performed on trait pairs based on the description provided above. In some cases, pairs of traits could not be tested for associations. These scenarios are marked by grey shapes, and the underlying reason is given in the figure legend. In addition, the calculated p-values for associations between traits are shown. Significant p-values (values less than 0.01) are displayed on a pink background. The full table of p-values is available in a dedicated file that accompanies this report.

Figure 9

(1) Table of performed tests on pairs of traits. Test names (Correlation + permutation test, Fisher's exact test, Wilcoxon rank sum test and/or Kruskal-Wallis one-way analysis of variance) are color-coded according to the legend given above.
(2) Table of resulting p-values from the performed tests on pairs of traits. Significant p-values (less than 0.01) are printed in pink boxes. Non-significant values are represented by blue boxes. White cells, if present, denote missing values.

Methylation Value Distributions

Methylation value distributions were assessed based on selected sample groups. This was done on site and region levels. This section contains the generated density plots.

Methylation Value Densities of Sample Groups

The plots below compare the distributions of methylation values in different sample groups, as defined by the traits listed above.

Figure 10. Beta value density estimation according to sample grouping. Plot options: sample trait, methylation of.

Methylation Value Densities of Site Categories

In a similar fashion, the plot below compares the distributions of beta values in different site types.

Figure 11. Methylation value density estimation according to sample grouping and site category. Plot options: sample group, site category.

Inter-sample Variability

The variability of the methylation values is measured in two aspects: (1) intra-sample variance, that is, differences of methylation between genomic locations/regions within the same sample, and (2) inter-sample variance, i.e. variability in the methylation degree at a specific locus/region across a group of samples.
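Inter-sample variability, as defined in (2), amounts to computing a mean and a variance per locus across the samples of a group. A minimal sketch, assuming missing values are skipped and the unbiased sample variance (n - 1 denominator) is used; neither choice is stated in the report, and the function name is hypothetical.

```python
from statistics import mean, variance


def site_mean_and_variance(values):
    """Mean and sample variance of one site across a group of samples.

    Missing values (None) are skipped. Returns None when fewer than
    two samples are observed, because the variance is then undefined.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        return None
    return mean(observed), variance(observed)


# One toy site measured in four hypothetical samples, one missing.
site = [0.2, 0.4, None, 0.6]
print(site_mean_and_variance(site))
```

Each site then contributes one point (mean, variance) to a scatter plot of the kind described in this section.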

The following figure shows the relationship between average methylation and methylation variability of a site.

Figure 12. Scatter plot showing the correlation between site mean methylation and the variance across a group of samples. Every point corresponds to one site. Plot options: sample group, point color based on.

Analogously to the plot above, the figure below shows the relationship between average methylation and methylation variability of a genomic region.

Figure 13. Scatter plot showing the correlation between region mean methylation and the variance across a group of samples. Every point corresponds to one region. Plot options: regions, sample group, point color based on.

Clustering

The figure below shows clustering of samples using several algorithms and distance metrics.

Figure 14. Hierarchical clustering of samples based on the 1000 most variable loci. The heatmap displays methylation percentiles per sample. The legend for sample coloring can be found in the figure below. Plot options: site/region level, dissimilarity metric, agglomeration strategy (linkage), sample color based on.
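Selecting the most variable loci before clustering can be sketched as below. This is an illustration only: it assumes complete data and ranks loci by population variance across all samples, whereas the report does not specify the exact variance estimator or how missing values are treated. The function name and toy matrix are hypothetical.

```python
from statistics import pvariance


def most_variable_loci(matrix, n=1000):
    """Return the row indices of the n loci with the highest variance
    across samples (rows are loci, columns are samples)."""
    ranked = sorted(
        range(len(matrix)),
        key=lambda i: pvariance(matrix[i]),
        reverse=True,
    )
    return ranked[:n]


# Toy matrix: 4 loci x 3 samples.
toy_loci = [
    [0.50, 0.50, 0.50],  # constant locus, variance 0
    [0.10, 0.90, 0.50],  # most variable
    [0.40, 0.60, 0.50],  # slightly variable
    [0.20, 0.80, 0.50],  # second most variable
]
print(most_variable_loci(toy_loci, n=2))  # [1, 3]
```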

Figure 15. Hierarchical clustering of samples based on the 1000 most variable loci. The heatmap displays only selected sites/regions with the highest variance across all samples. The legend for locus and sample coloring can be found in the figure below. Plot options: site/region level, dissimilarity metric, agglomeration strategy (linkage), sample color based on, site/region color based on, visualize.

Figure 16. Probe and sample colors used in the heatmaps in the previous figures. Plot options: site/region level, sample color based on, site/region color based on.

Identified Clusters

Using the average silhouette value as a measure of cluster assignment [1], it is possible to infer the number of clusters produced by each of the studied methods. The figure below shows the corresponding mean silhouette value for every observed separation into clusters.

Figure 17. Line plot visualizing mean silhouette values of the clustering algorithm outcomes for each applicable value of K (number of clusters). Plot options: site/region level, dissimilarity metric.

The table below summarizes the number of clusters identified by the algorithms.

Metric Algorithm Clusters
correlation-based hierarchical (average linkage) 2
correlation-based hierarchical (complete linkage) 2
correlation-based hierarchical (median linkage) 2
Manhattan distance hierarchical (average linkage) 2
Manhattan distance hierarchical (complete linkage) 2
Manhattan distance hierarchical (median linkage) 2
Euclidean distance hierarchical (average linkage) 2
Euclidean distance hierarchical (complete linkage) 2
Euclidean distance hierarchical (median linkage) 2
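The mean silhouette value [1] used to pick the number of clusters can be computed from a dissimilarity matrix and a cluster assignment. The following is a minimal sketch with hypothetical names and toy data; it follows the standard definition s(i) = (b - a) / max(a, b), with s(i) = 0 for singleton clusters, and is not the report software's own implementation.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette value for a clustering.

    dist is a symmetric matrix of pairwise dissimilarities and labels
    gives the cluster of each sample.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean dissimilarity to the other members of the own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: lowest mean dissimilarity to any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy example: four samples on a line, forming two obvious pairs.
points = [0, 1, 10, 11]
toy_dist = [[abs(p - q) for q in points] for p in points]
print(mean_silhouette(toy_dist, [0, 0, 1, 1]))  # close to 1: good split
print(mean_silhouette(toy_dist, [0, 1, 0, 1]))  # negative: poor split
```

Evaluating this value for each candidate K and taking the maximum gives the cluster counts of the kind reported in the table above.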

Clusters and Traits

The figure below shows associations between clusterings and the examined traits. Associations are quantified using the adjusted Rand index [2]. Rand indices near 1 indicate high agreement while values close to -1 indicate separation. The full table of all computed indices is stored in comma-separated files accompanying this report.

Figure 18. Heatmap visualizing Rand indices computed between sample traits (rows) and clustering algorithm outcomes (columns). Plot options: site/region level, dissimilarity metric.
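For reference, the adjusted Rand index [2] between a trait and a clustering can be computed from pair counts in their contingency table. The sketch below uses the standard Hubert and Arabie formula; the function name and the toy labelings are hypothetical, and the report's own implementation may differ in edge-case handling.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two labelings of the same >= 2 samples")
    # Pairs of samples placed together in both partitions.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (together - expected) / (maximum - expected)


trait = ["blood", "blood", "skin", "skin"]
print(adjusted_rand_index(trait, [1, 1, 2, 2]))  # 1.0: identical grouping
print(adjusted_rand_index(trait, [1, 2, 1, 2]))  # negative: disagreement
```

The index is invariant to how clusters are named, so a clustering that reproduces a trait's groups scores 1 regardless of label order.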

Regional Methylation Profiles

Methylation profiles were computed for the specified region types. Composite plots are shown below.

Figure 19. Regional methylation profiles (composite plots) according to sample groups. For each region in the corresponding region type, relative coordinates of 0 and 1 correspond to the start and end coordinates of that region, respectively. Coordinates smaller than 0 and greater than 1 denote flanking regions normalized by region length. Scatterplot smoothers for each sample and sample group were fit. Horizontal lines indicate region boundaries. For smoothing, generalized additive models with cubic spline smoothing were used. Deviation bands indicate 95% confidence intervals. Plot options: region type, sample trait.

References

  1. Rousseeuw, P. J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65
  2. Hubert, L. and Arabie, P. (1985) Comparing partitions. Journal of Classification, 2(1), 193-218