The specified traits were tested against the criteria for defining sample groups. The table below lists these traits and the number of groups each one defines.
Trait | Number of groups |
Type | 6 |
Study | 11 |
Gender | 2 |
Vital status | 2 |
In addition to CpG sites, there are 4 sets of genomic regions to be covered in the analysis. The table below gives a summary of these annotations.
Annotation | Description | Regions in the Dataset |
tiling | n.a. | 129076 |
genes | n.a. | 29439 |
promoters | n.a. | 29526 |
cpgislands | n.a. | 25791 |
The plots below show region size distributions for the region types above.
Dimension reduction is used to visually inspect the dataset for a strong signal in the methylation values that is related to the samples' clinical or batch processing annotation. RnBeads implements two methods for dimension reduction: principal component analysis (PCA) and multidimensional scaling (MDS).
The scatter plot below visualizes the samples transformed into a two-dimensional space using MDS.
Scatter plot showing samples after performing Kruskal's non-metric multidimensional scaling.
Similarly, the figure below shows the values of selected principal components in a scatter plot.
Scatter plot showing the samples' coordinates on principal components.
The figure below shows the cumulative distribution functions of variance explained by the principal components.
Cumulative distribution function of the percentage of variance explained.
The table below gives, for each location type, the number of principal components that together explain at least 95 percent of the total variance. The full tables of variances explained by all components are available in comma-separated values files accompanying this report.
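The 95 percent criterion can be illustrated with a short Python sketch. It is not taken from RnBeads: the function name `components_for_variance` and the example variances are hypothetical, and the input is assumed to be a list of per-component variances sorted in decreasing order.

```python
from itertools import accumulate


def components_for_variance(variances, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches `threshold`."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    for count, cumulative in enumerate(accumulate(variances), start=1):
        if cumulative / total >= threshold:
            return count
    # Fallback for rounding effects: all components are needed.
    return len(variances)


# Hypothetical variances per principal component (not values from this report).
example_variances = [50.0, 30.0, 10.0, 6.0, 4.0]
n_components = components_for_variance(example_variances)
```

With these made-up values the cumulative shares are 0.50, 0.80, 0.90 and 0.96, so four components are needed to pass the 0.95 threshold.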
In this section, different properties of the dataset are tested for significant associations. The properties can include sample coordinates in the principal component space, phenotype traits, and intensities of control probes. The test used to calculate a p-value for a pair of properties depends on the nature of the data.
Note that the p-values presented in this report are not corrected for multiple testing.
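Readers who want to account for multiple testing can adjust the reported p-values themselves. The sketch below shows one common option, the Benjamini-Hochberg procedure, in plain Python. It is an illustration only and is not part of this report's pipeline; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values (not values from this report).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

The adjusted values control the false discovery rate rather than the family-wise error rate, which is usually the less conservative choice when many trait and component pairs are tested.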
The computed sample coordinates in the principal component space were tested for association with the specified traits. Below is a list of the traits and the tests performed.
Trait | Test |
Type | Kruskal-Wallis |
Study | Kruskal-Wallis |
Gender | Wilcoxon |
Days to birth | Correlation |
Vital status | Wilcoxon |
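The pattern in the table above (a rank-based two-group test for traits with two groups, Kruskal-Wallis for traits with more groups, and correlation for numerical traits) can be written as a small selection rule. The Python sketch below is inferred from the table only, not from the RnBeads source; `choose_test_for_trait` and the example values are hypothetical.

```python
def choose_test_for_trait(trait_values):
    """Pick a test for associating a numerical coordinate with one trait.

    Assumed rule, inferred from the table: numerical traits use correlation,
    two-group traits use Wilcoxon, traits with more groups use Kruskal-Wallis.
    """
    observed = [v for v in trait_values if v is not None]  # drop missing values
    if not observed:
        return None
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    if is_numeric:
        return "Correlation"
    n_groups = len(set(observed))
    if n_groups == 2:
        return "Wilcoxon"
    if n_groups > 2:
        return "Kruskal-Wallis"
    return None  # a single group cannot be tested


# Hypothetical trait columns (not data from this report).
gender = ["male", "female", "female", None, "male"]
days_to_birth = [-21000, -18500, None, -25010]
study = ["A", "B", "C", "A"]
selected = [choose_test_for_trait(t) for t in (gender, days_to_birth, study)]
```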
The next figure shows the computed correlations between the first 8 principal components and the sample traits.
Heatmap presenting a table of correlations. Grey cells, if present, denote missing values.
The values presented in the figure above are available in CSV (comma-separated values) files accompanying this report.
The heatmap below summarizes the results of the permutation tests performed for associations. Significant p-values (values less than 0.01) are displayed on a pink background.
Heatmap presenting a table of p-values. Significant p-values (less than 0.01) are printed in pink boxes. Non-significant values are represented by blue boxes. Bright grey cells, if present, denote missing values.
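For numerical traits, a permutation test on a correlation coefficient can be sketched as follows. This is a generic illustration in Python and makes no claim about how RnBeads implements the test; the function names, the number of permutations, the seed, and the example data are all assumptions.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any pairing between x and y
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical component coordinates and a perfectly linear numerical trait.
coordinates = [float(v) for v in range(10)]
trait = [2.0 * v + 1.0 for v in coordinates]
p_value = permutation_p_value(coordinates, trait, n_permutations=200)
```

The smallest p-value this estimator can return is 1 / (n_permutations + 1), so the number of permutations limits how far below the 0.01 significance threshold a result can fall.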
The full tables of p-values for each location type are available in CSV (comma-separated value) files below.
This section summarizes the associations between pairs of traits.
The figure below visualizes the tests that were performed on trait pairs based on the description provided above. In addition, the calculated p-values for associations between traits are shown. Significant p-values (values less than 0.01) are displayed on a pink background. The full table of p-values is available in a dedicated file that accompanies this report.
(1) Table of performed tests on pairs of traits. Test names (Correlation + permutation test, Fisher's exact test, Wilcoxon rank sum test and/or Kruskal-Wallis one-way analysis of variance) are color-coded according to the legend given above.
(2) Table of resulting p-values from the performed tests on pairs of traits. Significant p-values (less than 0.01) are printed in pink boxes. Non-significant values are represented by blue boxes. White cells, if present, denote missing values.
This section examines the methylation values of the dataset for quality-associated batch effects.
The heatmaps below visualize the Pearson correlation coefficients between the principal components and the signal levels of selected quality control probes.
Heatmap presenting a table of correlations. Grey cells, if present, denote missing values.
Methylation value distributions were assessed based on selected sample groups. This was done on probe and region levels. This section contains the generated density plots.
The plots below compare the distributions of methylation values in different sample groups, as defined by the traits listed above.
The figure below shows clustering of samples using several algorithms and distance metrics.
Hierarchical clustering of samples based on the 1000 most variable loci. The heatmap displays methylation percentiles per sample. The legend for sample coloring can be found in the figure below.
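Selecting the most variable loci before clustering can be sketched in Python as shown below. This is a minimal illustration, assuming a matrix with one row per locus and one methylation value per sample; `most_variable_loci` and the example values are hypothetical, and the variance estimator used by RnBeads may differ.

```python
from statistics import pvariance


def most_variable_loci(beta_matrix, k=1000):
    """Return the indices of the k rows (loci) with the highest variance
    across samples, ordered from most to least variable."""
    variances = [pvariance(row) for row in beta_matrix]
    ranked = sorted(range(len(beta_matrix)), key=lambda i: variances[i], reverse=True)
    return ranked[:k]


# Hypothetical methylation values: three loci (rows) by three samples (columns).
beta_matrix = [
    [0.1, 0.1, 0.1],  # constant locus, variance 0
    [0.0, 1.0, 0.5],  # highly variable locus
    [0.4, 0.6, 0.5],  # mildly variable locus
]
top_two = most_variable_loci(beta_matrix, k=2)
```

Restricting the clustering to a fixed number of high-variance loci keeps the distance computation cheap and removes loci that carry no information for separating samples.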
Hierarchical clustering of samples based on the 1000 most variable loci. The heatmap displays only selected sites/regions with the highest variance across all samples. The legend for locus and sample coloring can be found in the figure below.
Probe and sample colors used in the heatmaps in the previous figures.
Using the average silhouette value as a measure of cluster assignment [1], it is possible to infer the number of clusters produced by each of the studied methods. The figure below shows the corresponding mean silhouette value for every observed separation into clusters.
Line plot visualizing mean silhouette values of the clustering algorithm outcomes for each applicable value of K (number of clusters).
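As a reminder of what the mean silhouette value measures, the following Python sketch computes it from a precomputed distance matrix and a cluster assignment. It is a generic textbook formulation, not the code used to produce this report; the function name, the singleton-cluster convention, and the example points are assumptions.

```python
def mean_silhouette(distance, labels):
    """Mean silhouette value for a full distance matrix and cluster labels."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, own_label in enumerate(labels):
        own = clusters[own_label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster.
        a = sum(distance[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(distance[i][j] for j in members) / len(members)
            for label, members in clusters.items()
            if label != own_label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical one-dimensional samples forming two well-separated groups.
points = [0.0, 1.0, 10.0, 11.0]
distance = [[abs(p - q) for q in points] for p in points]
good = mean_silhouette(distance, [0, 0, 1, 1])
bad = mean_silhouette(distance, [0, 1, 0, 1])
```

Under this measure, the number of clusters K would be chosen as the value whose cluster assignment gives the highest mean silhouette.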
The table below summarizes the number of clusters identified by the algorithms.
Metric | Algorithm | Clusters |
correlation-based | hierarchical (average linkage) | 2 |
correlation-based | hierarchical (complete linkage) | 2 |
correlation-based | hierarchical (median linkage) | 2 |
Manhattan distance | hierarchical (average linkage) | 2 |
Manhattan distance | hierarchical (complete linkage) | 2 |
Manhattan distance | hierarchical (median linkage) | 2 |
Euclidean distance | hierarchical (average linkage) | 2 |
Euclidean distance | hierarchical (complete linkage) | 2 |
Euclidean distance | hierarchical (median linkage) | 2 |
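The three dissimilarity metrics named in the table can be sketched in Python as follows. The Euclidean and Manhattan distances follow their standard definitions; the correlation-based dissimilarity is written here as one minus the Pearson correlation, which is an assumption, since this report does not state the exact formula. All function names and example vectors are hypothetical.

```python
from math import sqrt
from statistics import mean


def euclidean(a, b):
    """Square root of the summed squared differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def correlation_dissimilarity(a, b):
    """One minus the Pearson correlation (assumed definition)."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        raise ValueError("correlation is undefined for constant input")
    return 1.0 - sab / sqrt(saa * sbb)


# Hypothetical methylation profiles of two samples.
sample_a = [0.0, 0.0]
sample_b = [3.0, 4.0]
d_euclidean = euclidean(sample_a, sample_b)
d_manhattan = manhattan(sample_a, sample_b)
```

The correlation-based variant ignores differences in overall methylation level and scale between samples, whereas the Manhattan and Euclidean distances are sensitive to both.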
The figure below shows associations between clusterings and the examined traits. Associations are quantified using the adjusted Rand index [2]. Rand indices near 1 indicate high agreement, while values close to -1 indicate separation. The full table of all computed indices is stored in the following comma-separated values files:
Heatmap visualizing Rand indices computed between sample traits (rows) and clustering algorithm outcomes (columns).
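For reference, the adjusted Rand index between a trait and a clustering can be computed from pair counts, as in the Python sketch below. This is the standard pair-counting formulation written for illustration, not code from RnBeads; the function name and the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two samples are required")
    # Pairs of samples grouped together in both, in the first, and in the second labeling.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (together_both - expected) / (maximum - expected)


# Hypothetical trait groups and cluster assignments for four samples.
trait = ["tumor", "tumor", "normal", "normal"]
matching_clusters = [2, 2, 1, 1]
crossed_clusters = [1, 2, 1, 2]
ari_matching = adjusted_rand_index(trait, matching_clusters)
ari_crossed = adjusted_rand_index(trait, crossed_clusters)
```

The index depends only on which samples are grouped together, so the label names themselves do not matter: the matching assignment scores 1 even though its cluster numbers differ from the trait names.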