RnBeads: Example 3

Option	Value
replicate.id.column	default
exploratory.columns	default
exploratory.top.dimensions	all
exploratory.principal.components	8
exploratory.correlation.pvalue.threshold	0.01
exploratory.correlation.permutations	10000
exploratory.correlation.qc	yes
exploratory.beta.distribution	yes
exploratory.intersample	yes
exploratory.deviation.plots	yes
exploratory.clustering	all
exploratory.clustering.top.sites	1000
exploratory.clustering.heatmaps.pdf	no
distribution.subsample	1000000
exploratory.gene.symbols	default
exploratory.custom.loci.bed	default

Trait	Number of groups
Sample_Group	2
Treatment	2
CellType	2
Predicted Gender	2

In addition to CpG sites, there are 4 sets of genomic regions to be covered in the analysis. The table below gives a summary of these annotations.

Annotation	Description	Regions in the Dataset
tiling	Genome tiling regions of length 5000	133950
genes	Ensembl genes, version Ensembl Genes 75	30784
promoters	Promoter regions of Ensembl genes, version Ensembl Genes 75	30969
cpgislands	CpG island track of the UCSC Genome browser	26610

Region length distributions

The plots below show region size distributions for the region types above.

Region type

Figure 1

Distribution of region lengths

Number of sites per region

The plots below show the distributions of the number of sites per region type.

Region type

Figure 2

Distribution of the number of sites per region

Region site distributions

The plots below show distributions of sites across the different region types.

Region type

Figure 3

Distribution of sites across regions. relative coordinates of 0 and 1 corresponds to the start and end coordinates of that region respectively. Coordinates smaller than 0 and greater than 1 denote flanking regions normalized by region length.

Low-dimensional Representation

Dimension reduction is used to visually inspect the dataset for a strong signal in the methylation values that is related to samples' clinical or batch processing annotation. RnBeads implements two methods for dimension reduction - principal component analysis (PCA) and multidimensional scaling (MDS).

One or more of the methylation matrices was augmented before applying the dimension reduction techniques because it contains missing values. The column Missing lists the number of dimensions ignored due to missing values. In the case of MDS, dimensions are ignored only if they contain missing values for all samples. In contrast, sites or regions with missing values in any sample are ignored prior to PCA.

Sites/regions	Technique	Dimensions	Missing	Selected
sites	MDS	471549	0	471549
sites	PCA	471549	15	471534
tiling	MDS	133950	0	133950
tiling	PCA	133950	0	133950
genes	MDS	30784	0	30784
genes	PCA	30784	0	30784
promoters	MDS	30969	0	30969
promoters	PCA	30969	0	30969
cpgislands	MDS	26610	0	26610
cpgislands	PCA	26610	0	26610

Multidimensional Scaling

The scatter plot below visualizes the samples transformed into a two-dimensional space using MDS.

Location type
Distance
Sample representation
Sample color

Figure 4

Scatter plot showing samples after performing Kruskal's non-metric mutidimensional scaling.

Principal Component Analysis

Similarly, the figure below shows the values of selected principal components in a scatter plot.

Location type
Principal components
Sample representation
Sample color

Figure 5

Scatter plot showing the samples' coordinates on principal components.

The figure below shows the cumulative distribution functions of variance explained by the principal components.

Location type

Figure 6

Cumulative distribution function of percentange of variance explained.

The table below gives for each location type a number of principal components that explain at least 95 percent of the total variance. The full tables of variances explained by all components are available in comma-separated values files accompanying this report.

Location Type	Number of Components	Full Table File
sites	9	csv
tiling	9	csv
genes	8	csv
promoters	8	csv
cpgislands	8	csv

Batch Effects

In this section, different properties of the dataset are tested for significant associations. The properties can include sample coordinates in the principal component space, phenotype traits and intensities of control probes. The tests used to calculate a p-value given two properties depend on the essence of the data:

If both properties contain categorical data (e.g. tissue type and sample processing date), the test of choice is a two-sided Fisher's exact test.
If both properties contain numerical data (e.g. coordinates in the first principal component and age of individual), the correlation coefficient between the traits is computed. A p-value is estimated using permutation tests with 10000 permutations.
If property A is categorical and property B contains numeric data, p-value for association is calculated by comparing the values of B for the different categories in A. The test of choice is a two-sided Wilcoxon rank sum test (when A defines two categories) or a Kruskal-Wallis one-way analysis of variance (when A separates the samples into three or more categories).

Note that the p-values presented in this report are not corrected for multiple testing.

Associations between Principal Components and Traits

The computed sample coordinates in the principal component space were tested for association with the available traits. Below is a list of the traits and the tests performed.

Trait	Test
Sample_Group	Wilcoxon
Cell_Line	Kruskal-Wallis
Passage_No	Correlation
Treatment	Wilcoxon
CellType	Wilcoxon
Predicted Male Probability	Correlation
Predicted Gender	Wilcoxon
predicted_ages	Correlation
Immune Cell Content (LUMP)	Correlation

The next figure shows the computed correlations between the first 8 principal components and the sample traits.

Region type

Figure 7

Heatmap presenting a table of correlations. Grey cells, if present, denote missing values.

The values presented in the figure above are avaialable in CSV (comma-separated value) files accompanying this report.

Location type	Table file
sites	csv
tiling	csv
genes	csv
promoters	csv
cpgislands	csv

The heatmap below summarizes the results of permutation tests performed for associations. Significant p-values (values less than 0.01) are displayed in pink background.

Region type

Figure 8

Heatmap presenting a table of p-values. Significant p-values (less than 0.01) are printed in pink boxes. Non-significant values are represented by blue boxes. Bright grey cells, if present, denote missing values.

The full tables of p-values for each location type are available in CSV (comma-separated value) files below.

Location Type	File Name
sites	csv
tiling	csv
genes	csv
promoters	csv
cpgislands	csv

Associations between Traits

This section summarizes the associations between pairs of traits.

The figure below visualizes the tests that were performed on trait pairs based on the description provided above. In some cases, pairs of traits could not be tested for associations. These scenarios are marked by grey shapes, and the underlying reason is given in the figure legend. In addition, the calculated p-values for associations between traits are shown. Significant p-values (values less than 0.01) are displayed in pink background. The full table of p-values is available in a dedicated file that accompanies this report.

Heatmap of

Figure 9

(1) Table of performed tests on pairs of traits. Test names (Correlation + permutation test, Fisher's exact test, Wilcoxon rank sum test and/or Kruskal-Wallis one-way analysis of variance) are color-coded according to the legend given above.
(2) Table of resulting p-values from the performed tests on pairs of traits. Significant p-values (less than 0.01) are printed in pink boxes Non-significant values are represented by blue boxes. White cells, if present, denote missing values.

In some cases, a correlation was computed between a pair of traits. As described earlier in this report, these correlation values are used as the basis for a permutation-based test. The table of computed correlations is available as a comma-separated file accompanying this report.

Quality-associated Batch Effects

This section examines the methylation values of the dataset for quality-associated batch effects.

The heatmaps below visualize the Pearson correlation coefficients between the principal components and the signal levels of selected quality control probes.

Location type
Channel
Probe group

Figure 10

Heatmap presenting a table of correlations. Grey cells, if present, denote missing values.

In a complete analogy to the heatmaps above, the figure below visualizes the p-values calculated using permutation tests.

Location type
Channel
Probe group

Figure 11

All computed p-values for associations are available as comma-separated files that accompany this report. The links to the dedicated files are provided in the table below.

Location type

Control probe type \ Target - Channel	Green	Red
bisulfite conversion I	csv	csv
bisulfite conversion II	csv	csv
extension	csv	csv
hybridization	csv	csv
non-polymorphic	csv	csv
specificity I	csv	csv
specificity II	csv	csv
staining	csv	csv
target removal	csv	csv

Methylation Value Distributions

Methylation value distributions were assessed based on selected sample groups. This was done on probe and region levels. This section contains the generated density plots.

Methylation Value Densities of Sample Groups

The plots below compare the distributions of methylation values in different sample groups, as defined by the traits listed above.

Sample trait
Methylation of

Figure 12

Beta value density estimation according to sample grouping.

Methylation Value Densities of Probe Categories

In a similar fashion, the plot below compares the distributions of beta values in different probe types.

Sample group
Probe category

Figure 13

Methylation value density estimation according to sample grouping and probe category.

Inter-sample Variability

The variability of the methylation values is measured in two aspects: (1) intra-sample variance, that is, differences of methylation between genomic locations/regions within the same sample, and (2) inter-sample variance, i.e. variability in the methylation degree at a specific locus/region across a group of samples.

The following figure shows the relationship between average methylation and methylation variability of a probe.

Sample group
Point color based on

Figure 14

Scatter plot showing the correlation betweeen probe mean methylation and the variance across a group of samples. Every point corresponds to one probe.

In a complete analogy to the plots above, the figure below shows the relationship between average methylation and methylation variability of a genomic region.

Regions
Sample group
Point color based on

Figure 15

Scatter plot showing the correlation betweeen region mean methylation and the variance across a group of samples. Every point corresponds to one region.

The figure below shows a methylation deviation plot for all samples in the dataset, as well as other sample groups inferred from the table of phenotypic information.

Sample group
Color legend based on

Figure 16

Deviation plot of a sample group. Probes are sorted in increasing order of their median methylation and are binned in groups of up to 118. The horizontal axis in the plot iterates over probe groups, and the vertical axis measures methylation degree. Median β values are depicted by a blue curve. Grey borders mark the 5th and 95th percentiles of β values in a probe (averaged over the group), ensuring that 90 percent of the observed values lie in the yellow area.
Relative frequencies of probe categories in every group are color-coded and plotted below the horizontal axis. Every segment in the color legend shows the distribution of probe categories that underlie the corresponding segment in the deviation plot above it.

In a similar fashion, the figure below shows deviation plots on the region level.

Regions
Sample group
Color legend based on

Figure 17

Deviation plot of a sample group. Regions are sorted in increasing order of their median methylation and are binned in groups of up to 34. The horizontal axis in the plot iterates over region groups, and the vertical axis measures methylation degree. Median β values are depicted by a blue curve. Grey borders mark the 5th and 95th percentiles of β values in a region (averaged over the group), ensuring that 90 percent of the observed values lie in the yellow area.
Relative frequencies of region categories in every group are color-coded and plotted below the horizontal axis. Every segment in the color legend shows the distribution of region categories that underlie the corresponding segment in the deviation plot above it.

The variability of a sample group is the span between 5th and 95th percentile of β values , averaged over all valid locations/regions. This amounts to a number between 0 and 1 and corresponds to the relative area of deviation in the plots presented above. The table below lists the variabilities of the studied sample groups.

Loci/regions	Sample Group	Based on Trait	Size	Variability
sites	all samples		12	0.0890
sites	hESC	Sample_Group	5	0.0665
sites	hiPSC	Sample_Group	7	0.0746
sites	KOSR	Treatment	2	0.0294
sites	TeSR	Treatment	2	0.0351
sites	CT1	CellType	2	0.0360
sites	CT2	CellType	2	0.0336
sites	female	Predicted Gender	6	0.0685
sites	male	Predicted Gender	6	0.0715
tiling	all samples		12	0.0784
tiling	hESC	Sample_Group	5	0.0573
tiling	hiPSC	Sample_Group	7	0.0661
tiling	KOSR	Treatment	2	0.0239
tiling	TeSR	Treatment	2	0.0311
tiling	CT1	CellType	2	0.0306
tiling	CT2	CellType	2	0.0295
tiling	female	Predicted Gender	6	0.0625
tiling	male	Predicted Gender	6	0.0639
genes	all samples		12	0.0610
genes	hESC	Sample_Group	5	0.0454
genes	hiPSC	Sample_Group	7	0.0510
genes	KOSR	Treatment	2	0.0206
genes	TeSR	Treatment	2	0.0248
genes	CT1	CellType	2	0.0244
genes	CT2	CellType	2	0.0226
genes	female	Predicted Gender	6	0.0461
genes	male	Predicted Gender	6	0.0471
promoters	all samples		12	0.0693
promoters	hESC	Sample_Group	5	0.0523
promoters	hiPSC	Sample_Group	7	0.0578
promoters	KOSR	Treatment	2	0.0231
promoters	TeSR	Treatment	2	0.0281
promoters	CT1	CellType	2	0.0281
promoters	CT2	CellType	2	0.0255
promoters	female	Predicted Gender	6	0.0519
promoters	male	Predicted Gender	6	0.0533
cpgislands	all samples		12	0.0781
cpgislands	hESC	Sample_Group	5	0.0581
cpgislands	hiPSC	Sample_Group	7	0.0652
cpgislands	KOSR	Treatment	2	0.0290
cpgislands	TeSR	Treatment	2	0.0317
cpgislands	CT1	CellType	2	0.0307
cpgislands	CT2	CellType	2	0.0285
cpgislands	female	Predicted Gender	6	0.0568
cpgislands	male	Predicted Gender	6	0.0589

Clustering

The figure below shows clustering of samples using several algorithms and distance metrics.

Site/region level
Dissimilarity metric
Agglomeration strategy (linkage)
Sample color based on

Figure 18

Hierarchical clustering of samples based on all methylation values. The heatmap displays methylation percentiles per sample. The legend for sample coloring can be found in the figure below.

Site/region level
Dissimilarity metric
Agglomeration strategy (linkage)
Sample color based on
Site/region color based on
Visualize

Figure 19

Hierarchical clustering of samples based on all methylation values. The heatmap displays only selected sites/regions with the highest variance across all samples. The legend for locus and sample coloring can be found in the figure below.

Site/region level
Sample color based on
Site/region color based on

Figure 20

Probe and sample colors used in the heatmaps in the previous figures.

Identified Clusters

Using the average silhouette value as a measure of cluster assignment [1], it is possible to infer the number of clusters produced by each of the studied methods. The figure below shows the corresponding mean silhouette value for every observed separation into clusters.

Site/region level
Dissimilarity metric

Figure 21

Line plot visualizing mean silhouette values of the clustering algorithm outcomes for each applicable value of K (number of clusters).

The table below summarizes the number of clusters identified by the algorithms.

Site/region level

Metric	Algorithm	Clusters
correlation-based	hierarchical (average linkage)	7
correlation-based	hierarchical (complete linkage)	7
correlation-based	hierarchical (median linkage)	7
Manhattan distance	hierarchical (average linkage)	7
Manhattan distance	hierarchical (complete linkage)	6
Manhattan distance	hierarchical (median linkage)	8
Euclidean distance	hierarchical (average linkage)	7
Euclidean distance	hierarchical (complete linkage)	7
Euclidean distance	hierarchical (median linkage)	7

Clusters and Traits

The figure below shows associations between clusterings and the examined traits. Associations are quantified using the adjusted Rand index [2]. Rand indices near 1 indicate high agreement while values close to -1 indicate seperation. The full table of all computed indices is stored in the following comma separated files:

Site/region level
Dissimilarity metric

Figure 22

Heatmap visualizing Rand indices computed between sample traits (rows) and clustering algorithm outcomes (columns).

Exploratory Analysis

Parameter Overview

Sample Groups

Region Annotations

Region length distributions

Number of sites per region

Region site distributions

Low-dimensional Representation

Multidimensional Scaling

Principal Component Analysis

Batch Effects

Associations between Principal Components and Traits

Associations between Traits

Quality-associated Batch Effects

Methylation Value Distributions

Methylation Value Densities of Sample Groups

Methylation Value Densities of Probe Categories

Inter-sample Variability

Clustering

Identified Clusters

Clusters and Traits

Regional Methylation Profiles

References