RnBeads: TCGA_integration

The table below lists the options of the executed module.

Option	Value
filtering.whitelist
filtering.blacklist
filtering.snp	3
filtering.missing.value.quantile	1
filtering.greedycut	yes
filtering.greedycut.pvalue.threshold	0.05
filtering.greedycut.rc.ties	row
distribution.subsample	1000000
normalization.method	none
normalization.background.method	none
normalization.plot.shifts	yes
filtering.context.removal	CC, CAG, CAH, CTG, CTH, Other
filtering.sex.chromosomes.removal	yes
filtering.deviation.threshold	0

Removal of SNP-enriched Probes

Greedycut

The Greedycut algorithm iteratively removes from the dataset probes and samples of highest impurity. These correspond to the rows and columns in the detection p-value table that contain the largest fraction of unreliable measurements. This section summarizes the results of applying Greedycut on the analyzed dataset.

Unreliable Measurements

We consider a β value unreliable when its corresponding detection p-value is not below the threshold T:

p ≥ T = 0.05

The figure below summarizes the observed number of unreliable measurements per probe and per sample.

Figure 1

Cumulative distribution function of the number of unreliable values per probe/sample.
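The counting behind Figure 1 can be illustrated with a minimal sketch. The p-value table and the helper `ecdf` are hypothetical and only mirror the definition p ≥ T stated above; they are not taken from the RnBeads implementation.

```python
T = 0.05  # detection p-value threshold stated in the report

# Hypothetical toy detection p-value table: rows are probes, columns are samples.
pvals = [
    [0.001, 0.200, 0.010],
    [0.060, 0.070, 0.900],
    [0.000, 0.049, 0.050],
]

# A measurement is unreliable when its p-value is not below T (p >= T).
unreliable = [[p >= T for p in row] for row in pvals]

per_probe = [sum(row) for row in unreliable]         # one count per row
per_sample = [sum(col) for col in zip(*unreliable)]  # one count per column


def ecdf(counts):
    """Return (value, fraction of items with count <= value) pairs."""
    n = len(counts)
    return [(v, sum(c <= v for c in counts) / n) for v in sorted(set(counts))]


print(per_probe)   # [1, 3, 1]
print(per_sample)  # [1, 2, 2]
```

Note that a p-value exactly equal to T (0.050 in the last row) counts as unreliable, because the criterion is "not below" the threshold.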

Filtered Probes and Samples

RnBeads executed Greedycut using the threshold given above and applied all its steps. Briefly, Greedycut is an iterative algorithm that, at each step, filters out the single probe or sample with the highest fraction of unreliable measurements. Every iteration therefore produces a matrix of retained measurements and a set of removed ones.

Treating the retained measurements as predictions of the reliable ones, we calculated the false positive rate (α) and the sensitivity (s) for every matrix produced by Greedycut. We then selected the matrix that maximizes the expression s + 1 - α, which gives equal weight to sensitivity and specificity (1 - α). On a ROC curve, this is the point farthest from the diagonal. The results of the Greedycut procedure and the selected iteration are presented in the figure below.
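The procedure described above can be sketched as follows. This is a simplified illustration, not the RnBeads implementation: the function `greedycut`, its tie handling (modelled on the option filtering.greedycut.rc.ties = row) and the toy matrix are all assumptions. The sketch also stops once no unreliable values remain, since any further removal could only lower the score.

```python
def greedycut(unreliable, ties="row"):
    """Simplified Greedycut sketch on a boolean matrix (True = unreliable).

    Returns (best score, retained row indices, retained column indices).
    """
    rows = list(range(len(unreliable)))
    cols = list(range(len(unreliable[0])))
    total_bad = sum(sum(r) for r in unreliable)
    total_good = len(rows) * len(cols) - total_bad

    def score(rows, cols):
        # Retained entries act as the prediction of "reliable".
        kept_bad = sum(unreliable[i][j] for i in rows for j in cols)
        kept_good = len(rows) * len(cols) - kept_bad
        s = kept_good / total_good if total_good else 0.0    # sensitivity
        alpha = kept_bad / total_bad if total_bad else 0.0   # false positive rate
        return s + 1 - alpha

    best = (score(rows, cols), list(rows), list(cols))
    while rows and cols:
        row_frac = {i: sum(unreliable[i][j] for j in cols) / len(cols) for i in rows}
        col_frac = {j: sum(unreliable[i][j] for i in rows) / len(rows) for j in cols}
        worst_row = max(rows, key=lambda i: row_frac[i])
        worst_col = max(cols, key=lambda j: col_frac[j])
        if max(row_frac[worst_row], col_frac[worst_col]) == 0:
            break  # no unreliable measurements remain
        drop_row = row_frac[worst_row] > col_frac[worst_col] or (
            row_frac[worst_row] == col_frac[worst_col] and ties == "row"
        )
        if drop_row:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
        current = score(rows, cols)
        if current > best[0]:
            best = (current, list(rows), list(cols))
    return best


# Hypothetical 4 x 4 example: probe 2 and sample 3 are mostly unreliable.
toy = [
    [False, False, False, True],
    [False, False, False, True],
    [True, True, True, True],
    [False, False, False, False],
]
best_score, kept_rows, kept_cols = greedycut(toy)
print(kept_rows, kept_cols)  # [0, 1, 3] [0, 1, 2]
```

In this toy case the selected iteration removes one probe and one sample, which leaves no unreliable values at the cost of one reliable measurement out of ten.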

Figure 2

Change of table dimensions / metric related to accuracy as Greedycut progressively removes probes and samples. Accuracy is calculated by treating the retained entries as predictive of reliable measurements. The red circle, if present, marks the last iteration that was executed.

Based on the criteria described above, 16762 probes and 26 samples were filtered out. Links to the lists of removed items are given below.

Type	Removed	Table
Probes	16762	removed_sites_greedycut.csv
Samples	26	removed_samples_greedycut.csv

Filtering Summary I

As a final outcome of the filtering procedures, 21475 probes and 26 samples were removed. These statistics are presented in a dedicated table that accompanies this report and visualized in the figure below.

Figure 3

Fractions of removed values in the dataset after applying filtering procedures.

The figure below compares the distributions of the removed methylation β values and of the retained ones.

Figure 4

Comparison of removed and retained β values. Both distributions are estimated by randomly sampling 1000000 values in each group.
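The subsampling mentioned in the caption (option distribution.subsample) can be illustrated with a small sketch. The helper `subsample`, the fixed seed and the choice of sampling without replacement are assumptions made for this example; the report does not state how RnBeads draws the values.

```python
import random


def subsample(values, limit=1_000_000, seed=0):
    """Return at most `limit` values, drawn without replacement (assumed)."""
    if len(values) <= limit:
        return list(values)
    return random.Random(seed).sample(values, limit)


# Hypothetical β values for the two groups being compared.
removed = [0.02, 0.95, 0.50, 0.48, 0.51]
retained = [i / 9 for i in range(10)]

# A tiny limit is used here so that the subsampling is visible.
removed_sub = subsample(removed, limit=3)
retained_sub = subsample(retained, limit=3)
```

A group that holds fewer values than the limit is returned unchanged, so the limit only matters for large groups.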

Normalization

The measurements in this dataset were not normalized after loading.

Sample Mean Methylations

The following figure visualizes the average methylation per sample. Samples are grouped by slide.

Figure 5

Point-and-whisker plot showing mean and standard deviation among all beta values in a sample.
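The per-sample statistics shown in the plot can be sketched as below. The helper `mean_sd`, the toy β matrix, the handling of missing values and the use of the n - 1 denominator are assumptions for illustration and may differ from what RnBeads computes.

```python
import math


def mean_sd(values):
    """Mean and sample standard deviation, ignoring missing (None/NaN) values."""
    xs = [v for v in values if v is not None and not math.isnan(v)]
    n = len(xs)
    if n == 0:
        return float("nan"), float("nan")
    mean = sum(xs) / n
    if n < 2:
        return mean, float("nan")
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # n - 1 is an assumption
    return mean, math.sqrt(var)


# Hypothetical β matrix: rows are probes, columns are samples.
betas = [
    [0.10, 0.80],
    [0.20, None],
    [0.30, 0.60],
]

# One (mean, standard deviation) pair per sample, i.e. per column.
per_sample_stats = [mean_sd(col) for col in zip(*betas)]
```

Each pair corresponds to one point (mean) and its whisker (standard deviation) in a plot of this kind.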

Region Annotations

In addition to CpG sites, there are 4 sets of genomic regions to be covered in the analysis. The table below gives a summary of these annotations.

Annotation	Description	Regions in the Dataset
tiling	n.a.	133256
genes	n.a.	30493
promoters	n.a.	30566
cpgislands	n.a.	26536

Context-specific Probe Removal

Context	Probes
CC	0
CAG	986
CAH	138
CTG	7
CTH	1
Other	1282

Removal of Probes on Sex Chromosomes

Filtering Summary II

As a final outcome of the filtering procedures, 12969 probes and 0 samples were removed. These statistics are presented in a dedicated table that accompanies this report and visualized in the figure below.

Figure 6