RnBeads

I am running RnBeads on a computer with limited resources. What can I do to reduce the memory consumption by RnBeads?

First, try to run RnBeads on a single core and NOT in parallel. Additionally, there are several option settings that can be used to reduce the resource requirements. These option settings apply to different (sub)modules:

# Disable greedycut (filtering)
rnb.options("filtering.greedycut"=FALSE)
# Disable intersample variation plots (exploratory analysis)
rnb.options("exploratory.intersample"=FALSE)
# Reduce the subsampling number for estimating density plots
rnb.options("distribution.subsample"=100000)
# Disable regional methylation profiling (exploratory analysis)
rnb.options("exploratory.region.profiles"=NULL)
# Disable chromosome coverage plots (QC, sequencing data only)
rnb.options("qc.coverage.plots"=FALSE)

Can I start RnBeads analysis with Galaxy on the Cloud?

Yes, you can. RnBeads has a wrapper script for integration with Galaxy, accessible from the main Galaxy Tool Shed (http://toolshed.g2.bx.psu.edu/view/pavlo-lutsik/rnbeads). To run it on a custom Galaxy instance on the Amazon cloud, follow the steps provided below. Warning: Following the steps results in additional costs. The exact amount depends on the selected cloud configuration and Amazon Web Services pricing at the time of usage.

1. Subscribe to Amazon Web Services and establish a custom Galaxy instance on the cloud using the Cloud Launch interface as described here: wiki.galaxyproject.org/CloudMan
2. Log in to CloudMan (it becomes accessible upon successful completion of Step 1). Give it some time to start the Galaxy instance and click the button Access Galaxy.
3. Go to the menu item User and register within your Galaxy instance.
4. Go back to the CloudMan interface, access the Admin interface (a link on the top-right), and add yourself as an administrator user.
5. Go to the Galaxy instance and refresh the page until you get a new Admin menu item.
6. Proceed to the Galaxy administrator interface and select the task Search and browse tool sheds.
7. Click on Galaxy main toolshed, select Browse for valid repositories and find the wrapper rnbeads using the search field located at the top of the newly opened page.
8. Click on the button with the wrapper name and select Preview and install.
9. Click Install to Galaxy on the top-right.
10. Check Handle tool dependencies, select the Statistics tool panel section and click Install.
11. Give Galaxy some time to install the tool, then go back to the Analyze Data interface. RnBeads should become available in the Tool panel; just type RnBeads in the search field. Click on the tool to open its Galaxy interface.
12. You can test the functionality of the tool by starting a small analysis of a public data set. Select Gene Expression Omnibus series in the data type field and specify GSE38268 as the GEO series. The series contains only 6 Infinium 450k samples, so the analysis should not take more than an hour. After completion you will get a link to the analysis report, displayed directly in the data display window.

Which genome assemblies does RnBeads support? Can I include a new one?

RnBeads supports the human (hg19, hg38), mouse (mm9 and mm10) and rat (rn5) genomes. If you would like to analyze a different genome, you need to create a new annotation package. Please note that this can be a daunting task if you have very little experience with R or Bioconductor. We developed a dedicated project on GitHub, RnBeadsAnnotationCreator, to assist you in adding support for a new assembly to RnBeads. RnBeadsAnnotationCreator is itself an R package. It contains routines that automate the process of annotation package creation. Its documentation explains how RnBeads annotation packages are structured and includes a tutorial on creating an annotation package for the zebrafish genome.

Which libraries and tools does RnBeads rely on?

RnBeads utilizes many established R packages for data loading and manipulation. Examples include the Bioconductor packages methylumi, minfi, RPMM, ggbio, GEOquery, GOstats and others. RnBeads also includes code from the Google Code project Beta Mixture Quantile Model. Parallelization is implemented using the packages foreach and doParallel. Most of the report figures are created using the ggplot2 package. GO enrichment analysis results are visualized with the help of the wordcloud package. See the output of the following command for a full list of the libraries required by RnBeads:

tools::dependsOnPkgs("RnBeads")
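Note that the R documentation describes tools::dependsOnPkgs() as reporting reverse dependencies, that is, installed packages that depend on the given one. As a hedged alternative, the sketch below lists the packages that an installed package itself depends on, using tools::package_dependencies() with the locally installed packages as the database. The helper name list_dependencies is ours and not part of RnBeads; it is demonstrated on the base package stats so that it runs even without RnBeads installed.

```r
# Hypothetical helper (not part of RnBeads): list the packages that an
# installed package depends on, directly or indirectly.
list_dependencies <- function(pkg) {
  db <- utils::installed.packages()
  deps <- tools::package_dependencies(
    pkg,
    db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  sort(unique(deps[[pkg]]))
}

# Demonstrated on a base package; use "RnBeads" instead once it is installed
stats_deps <- list_dependencies("stats")
print(stats_deps)
```

Only packages present in the local library are resolved, so the result reflects your installation rather than the current state of the repositories.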

The web service submission form uses the cross-browser tooltips library by Walter Zorn. The library is distributed under the GNU Lesser General Public License.

My default RnBeads run generated slightly different results in comparison to an older version of RnBeads. What happened?

In order to keep up-to-date with the most recent developments in Computational Epigenomics, we continually update the RnBeads default option settings. In software version 2.9.3, we changed the following defaults:

Option name	Old default	New default
`import.bed.style`	`"BisSNP"`	`"bismarkCov"`
`normalization.background.method`	`"methylumi.noob"`	`"none"`
`filtering.snp`	`"3"`	`"any"`
`filtering.cross.reactive`	`FALSE`	`TRUE`
`filtering.sex.chromosomes.removal`	`FALSE`	`TRUE`
`filtering.missing.value.quantile`	`1`	`0.5`
`exploratory.intersample`	`NULL`	`FALSE`
`exploratory.deviation.plots`	`NULL`	`FALSE`
`exploratory.region.profiles`	`NULL`	`""`
`differential.adjustment.sva`	`TRUE`	`FALSE`
`differential.adjustment.celltype`	`TRUE`	`FALSE`
`export.to.bed`	`TRUE`	`FALSE`
`export.to.trackhub`	`c("bigBed","bigWig")`	`NULL`
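If you need to reproduce results obtained with the older defaults, one option is to set them explicitly before starting the pipeline. The fragment below is a sketch that copies the "Old default" values from the table above into a single rnb.options() call. Whether every old value is still accepted by your installed version is an assumption you should verify. The three exploratory options whose old default was NULL are left out here.

```r
library(RnBeads)

# Restore the defaults used before version 2.9.3
# (values copied from the "Old default" column of the table above)
rnb.options(
  "import.bed.style" = "BisSNP",
  "normalization.background.method" = "methylumi.noob",
  "filtering.snp" = "3",
  "filtering.cross.reactive" = FALSE,
  "filtering.sex.chromosomes.removal" = FALSE,
  "filtering.missing.value.quantile" = 1,
  "differential.adjustment.sva" = TRUE,
  "differential.adjustment.celltype" = TRUE,
  "export.to.bed" = TRUE,
  "export.to.trackhub" = c("bigBed", "bigWig")
)
```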

Installation

When installing RnBeads using the install script, I get asked: "There are binary versions available but the source versions are later [...] Do you want to install from sources the package which needs compilation?" What should I answer?

No:
Do you want to install from sources the package which needs compilation? y/n: n

When installing RnBeads using the install script, I get asked: "Old packages: [...] Update all/some/none?" What should I answer?

No:
Update all/some/none? [a/s/n]: n

I get an error 'cannot locate Ghostscript / gswin32c'. How can I fix this?

The R platform uses Ghostscript for conversion of PDF files to PNG images. The error you see shows that R cannot locate the Ghostscript executable on your machine. First, make sure Ghostscript is installed. You can download it from Ghostscript's official web site if necessary. After that, one solution is to add Ghostscript's installation directory to the system path. Here we provide a brief description of the steps to follow in order to achieve this on Windows operating systems:

1. Open the Advanced system settings. In Windows 7, for example, it can be reached through: Control Panel > System and Security > System > Advanced System Settings.
2. In the "System Properties" dialog that appears, open the "Advanced" tab. Click the button "Environment Variables..." to update the search path.
3. Locate the environment variable "Path" (it doesn't matter if it is the user or the system variables, as long as you are the user who starts R). Select it, click on Edit, and prepend the location of the Ghostscript executable, followed by a semicolon. The text you need to add is usually similar to C:\Program Files\gs\gs9.15\bin;
4. After starting a new R session, Ghostscript should be accessible from R. If it still cannot be located, you need to check the corresponding environment variable. In an R session, the command Sys.getenv()["R_GSCMD"] shows the contents of the dedicated Ghostscript variable. If the variable does not exist or points to the wrong executable file, you can set it to the full path of the Ghostscript executable. This is achieved by editing or creating the file Renviron.site in the etc subdirectory of your R installation. Make sure the file contains the line R_GSCMD=C:\Program Files\gs\gs9.15\bin\gswin64c.exe (assuming Ghostscript is located in C:\Program Files\gs\gs9.15 and you are using the 64-bit version of R). For more information, please check the R documentation on getting and setting environment variables.
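To verify the result of these steps from within R, a short check like the following may help. It is a minimal sketch that relies only on base R: Sys.getenv() reads the R_GSCMD variable, and tools::find_gs_cmd() searches for a Ghostscript executable and returns an empty string if none is found.

```r
# Report whether R_GSCMD is set and whether R can locate Ghostscript
gs_from_env <- Sys.getenv("R_GSCMD", unset = NA)
gs_found <- tools::find_gs_cmd()

if (nzchar(gs_found)) {
  message("Ghostscript found at: ", gs_found)
} else {
  message("Ghostscript not found; check the Path and R_GSCMD settings")
}
message("R_GSCMD is ", if (is.na(gs_from_env)) "not set" else gs_from_env)
```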

How do I install Ghostscript on a Mac?

You can download Ghostscript for Mac from this website. Just obtain the most recent version of Ghostscript (named Ghostscript 9.**). Then open the package file that you just downloaded and follow the installation instructions.

I receive a warning that I need to install zip on my Windows machine. What can I do?

In order to save disk-backed RnBSet objects on Windows, a Zip archive utility must be installed and properly configured. There are multiple ways to get a Zip utility installed on your Windows system. For instance, Zip is available as a part of the Rtools distribution, which is a collection of packages for R development on Windows. For a minimal install, choose "Custom installation" at the "Select Components" stage of the Rtools installation and check only the "R toolset" item below. In the "Additional Tasks" dialog, which appears a couple of steps later, make sure that both available items for "Edit the system PATH" are checked ("Current value" and "Save version number XXX in registry"). To test the installation, start the Windows terminal ("Start" > "Run" > "CMD") and try executing the command "zip" in the command line. If the installation and configuration were successful, you should see the Zip version and brief usage instructions.
In some cases, the environment variables also need to be set in order for R to locate and use the installed zip utility. One way to do this is to create or edit the file Renviron.site in the etc subdirectory of your R installation. Make sure the file contains the lines:
R_ZIPCMD=zip
R_UNZIPCMD=unzip
For more information, please check the R documentation on getting and setting environment variables.
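The following sketch checks from within R whether a zip executable can be located. It uses only base R: it reads R_ZIPCMD (falling back to the plain command name zip when the variable is unset or empty) and resolves it with Sys.which(). The variable names zip_cmd, zip_path and zip_available are ours.

```r
# Resolve the zip command that R would use and test whether it exists
zip_cmd <- Sys.getenv("R_ZIPCMD", unset = "zip")
if (!nzchar(zip_cmd)) zip_cmd <- "zip"

zip_path <- Sys.which(zip_cmd)
zip_available <- unname(nzchar(zip_path))

if (zip_available) {
  message("zip found at: ", zip_path)
} else {
  message("zip not found; check the system path and R_ZIPCMD")
}
```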

RnBeads installation script fails on my brand new Linux system. What is wrong?

RnBeads is a rich package with many dependencies. When RnBeads is installed into a "clean" R environment, more than a hundred other R packages are installed as well. Many of them have dependencies at the operating system level; however, most of these are also dependencies of R itself and are installed together with it. There are some exceptions, though. As a rule, the dependencies of RnBeads require the following libraries to be installed:

libmysql and libmysql-devel. NB. These packages are aliased as mariadb on RedHat-derived Linux distributions (RHEL, CentOS, Fedora)

libxml2 and libxml2-devel

Furthermore, the package GLAD, used for CNV calling, requires the GNU Scientific Library, known on many Linux systems as the gsl package.
As an example, on RedHat derivatives one can install all the extra dependencies with one command:
yum install mariadb mariadb-devel libxml2 libxml2-devel gsl
Note that the installation normally requires administrator ("root") access rights. In most cases one can temporarily obtain them by adding sudo to the beginning of the above command.
The installation command might be different on other Linux distributions, so please refer to their respective documentation and package repositories. In case of further problems, you should contact the system administrator of your department, or simply a more experienced Linux user.

Compatibility

Why are some analysis options not recognized in RnBeads 0.99.15 and later?

In RnBeads 0.99.15, we reorganized the analysis pipeline. We introduced the new modules Preprocessing and Exploratory Analysis, and renamed the modules Loading to Import, Data Export to Tracks and Tables, and Annotation Inference to Covariate Inference. As a result, we renamed some of the analysis options to match the new modules. The table below lists the renamed options.

Old Option	New Option
loading	import
loading.default.data.type	import.default.data.type
loading.table.separator	import.table.separator
loading.bed.style	import.bed.style
loading.bed.columns	import.bed.columns
loading.bed.frame.shift	import.bed.frame.shift
loading.bed.test	import.bed.test
loading.bed.test.only	import.bed.test.only
batch	exploratory
batch.dreduction.columns	exploratory.columns
batch.top.dimensions	exploratory.top.dimensions
batch.principal.components	exploratory.principal.components
batch.correlation.columns	exploratory.columns
batch.correlation.pvalue.threshold	exploratory.correlation.pvalue.threshold
batch.correlation.permutations	exploratory.correlation.permutations
batch.correlation.qc	exploratory.correlation.qc
profiles	exploratory
profiles.beta.distribution	exploratory.beta.distribution
profiles.intersample	exploratory.intersample
profiles.deviation.plots	exploratory.deviation.plots
profiles.columns	exploratory.columns
profiles.clustering	exploratory.clustering
profiles.clustering.top.sites	exploratory.clustering.top.sites
region.profiles.types	exploratory.region.profiles
export.to.ucsc	export.to.trackhub
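If you have older scripts that store option values in a named list, the table above can be applied mechanically. The following sketch encodes the table as a lookup vector and renames matching entries. The helper rename_options is hypothetical and not part of RnBeads.

```r
# Old-to-new option names, copied from the table above
renamed_options <- c(
  "loading" = "import",
  "loading.default.data.type" = "import.default.data.type",
  "loading.table.separator" = "import.table.separator",
  "loading.bed.style" = "import.bed.style",
  "loading.bed.columns" = "import.bed.columns",
  "loading.bed.frame.shift" = "import.bed.frame.shift",
  "loading.bed.test" = "import.bed.test",
  "loading.bed.test.only" = "import.bed.test.only",
  "batch" = "exploratory",
  "batch.dreduction.columns" = "exploratory.columns",
  "batch.top.dimensions" = "exploratory.top.dimensions",
  "batch.principal.components" = "exploratory.principal.components",
  "batch.correlation.columns" = "exploratory.columns",
  "batch.correlation.pvalue.threshold" = "exploratory.correlation.pvalue.threshold",
  "batch.correlation.permutations" = "exploratory.correlation.permutations",
  "batch.correlation.qc" = "exploratory.correlation.qc",
  "profiles" = "exploratory",
  "profiles.beta.distribution" = "exploratory.beta.distribution",
  "profiles.intersample" = "exploratory.intersample",
  "profiles.deviation.plots" = "exploratory.deviation.plots",
  "profiles.columns" = "exploratory.columns",
  "profiles.clustering" = "exploratory.clustering",
  "profiles.clustering.top.sites" = "exploratory.clustering.top.sites",
  "region.profiles.types" = "exploratory.region.profiles",
  "export.to.ucsc" = "export.to.trackhub"
)

# Hypothetical helper: rename the entries of a named list of option values
rename_options <- function(opts) {
  old <- names(opts)
  hit <- old %in% names(renamed_options)
  names(opts)[hit] <- unname(renamed_options[old[hit]])
  if (anyDuplicated(names(opts)) > 0) {
    warning("several old options map to the same new option")
  }
  opts
}

old_opts <- list(loading.bed.style = "EPP",
                 profiles.clustering = TRUE,
                 filtering.greedycut = FALSE)
new_opts <- rename_options(old_opts)
print(names(new_opts))
```

The renamed list can then be applied with do.call(rnb.options, new_opts). Because several old options map to exploratory.columns, the helper warns when a rename produces duplicate names.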

Why are some function names not recognized in RnBeads 0.99.15 and later?

In RnBeads 0.99.15, we reorganized the analysis pipeline, as described in the answer to the previous question. As a result, we renamed some of the exported functions to match the new modules. The table below lists the renamed functions.

Old Function	New Function
rnb.execute.loading	rnb.execute.import
rnb.execute.export	rnb.execute.tnt
rnb.export.to.ucsc	rnb.export.to.trackhub
rnb.run.loading	rnb.run.import
rnb.run.batch	rnb.run.exploratory
rnb.run.profiles	rnb.run.exploratory
rnb.run.export	rnb.run.tnt

Analysis Pipeline

I want to load my data using `bed` files. What formats does RnBeads support?

In principle, RnBeads can process any tabular file format that has exactly one row per CpG, includes the genomic coordinates (chromosome, start and end), and provides information from which methylation levels can be deduced. For more details, see the package's vignette. However, there are many uncertainties and parameters that have to be taken into account when specifying the exact format of the methylation data files. We thus recommend using one of the package's presets, which can be set using the import.bed.style option (named loading.bed.style before RnBeads 0.99.15). Here is an overview of the currently implemented presets:

EPP
bed files are assumed to be in the format output by the Epigenome Processing Pipeline developed by Fabian Müller and Christoph Bock. A tab-separated file contains: the chromosome name, start coordinate, end coordinate, methylation value and coverage as a string ('#methylated_read/#total_reads'), some score, the strand, and additional information not taken into account by RnBeads. The file should not contain a header line. Coordinates are 0-based, spanning the first coordinate in a site and the first coordinate outside the site (i.e. end-start = 2 for a CpG). Here are some example lines (genome assembly mm9):

chr1 3010957 3010959 '27/27' 1000 +
chr1 3010971 3010973 '10/20' 500 +
chr1 3011025 3011027 '57/70' 814 -
...
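To make the EPP column layout concrete, the sketch below parses the three example lines with base R and derives a methylation level from the '#methylated_read/#total_reads' string. It only illustrates the format and is not the parser used by RnBeads. The column names are our own choice.

```r
# The three example lines, tab-separated as in a real EPP file
epp_lines <- c(
  "chr1\t3010957\t3010959\t'27/27'\t1000\t+",
  "chr1\t3010971\t3010973\t'10/20'\t500\t+",
  "chr1\t3011025\t3011027\t'57/70'\t814\t-"
)

sites <- read.table(
  text = epp_lines, sep = "\t", quote = "", header = FALSE,
  col.names = c("chrom", "start", "end", "counts", "score", "strand"),
  colClasses = c("character", "integer", "integer",
                 "character", "numeric", "character")
)

# Split "'27/27'" into methylated and total read counts
counts <- gsub("'", "", sites$counts, fixed = TRUE)
parts <- strsplit(counts, "/", fixed = TRUE)
sites$methylated <- as.integer(vapply(parts, `[`, character(1), 1))
sites$total <- as.integer(vapply(parts, `[`, character(1), 2))

# Methylation level as a fraction of reads
sites$beta <- sites$methylated / sites$total
print(sites[, c("chrom", "start", "end", "methylated", "total", "beta")])
```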

BisSNP
bed files are assumed to have been generated by the methylation calling tool BisSNP. A tab-separated file contains the chromosome name, start coordinate, end coordinate, methylation value in percent, the coverage, the strand, and additional information not taken into account by RnBeads. The file should contain a header line. Coordinates are 0-based, spanning the first and the last coordinate in a site (i.e. end-start = 1 for a CpG). Sites on the - strand are shifted by +1. Here are some example lines (genome assembly hg19):

track name=file_sorted.realign.recal.cpg.filtered.sort.CG.bed type=bedDetail description="CG methylation
chr1 10496 10497 79.69 64 + 10496 10497 180,60,0 0 0
chr1 10524 10525 90.62 64 + 10524 10525 210,0,0 0 0
chr1 864802 864803 58.70 46 + 864802 864803 120,120,0 0 5
chr1 864803 864804 50.00 4 - 864803 864804 90,150,0 1 45
...

bismarkCov
cov files are assumed to have the format as defined by Bismark's coverage file output converted from its bedGraph output (Bismark's bismark2bedGraph module; see the section "Optional bedGraph output" in the Bismark User Guide). A tab-separated file contains: the chromosome name, cytosine coordinate, cytosine coordinate (again), methylation value in percent, number of methylated reads and the number of unmethylated reads. The file should not contain a header line. Coordinates are 1-based. Strand information does not need to be provided, but is inferred from the coordinates: Coordinates on the - strand specify the C on the - strand (G on the + strand). Coordinates referring to cytosines not in CpG context are automatically discarded. Here are some example lines (genome assembly hg19):

...
chr9 73252 73252 100 1 0
chr9 73253 73253 0 0 1
chr9 73256 73256 100 1 0
chr9 73260 73260 0 0 1
chr9 73262 73262 100 1 0
chr9 73269 73269 100 1 0
...
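As an illustration of the bismarkCov layout, the sketch below reads the first three example lines with base R, derives the read coverage from the two count columns, and checks that the stated percentage agrees with the counts. The column names are ours, and the code is not the parser used by RnBeads.

```r
# First three example lines, tab-separated as in a real coverage file
cov_lines <- c(
  "chr9\t73252\t73252\t100\t1\t0",
  "chr9\t73253\t73253\t0\t0\t1",
  "chr9\t73256\t73256\t100\t1\t0"
)

cov <- read.table(
  text = cov_lines, sep = "\t", quote = "", header = FALSE,
  col.names = c("chrom", "start", "end", "percent",
                "methylated", "unmethylated"),
  colClasses = c("character", "integer", "integer",
                 "numeric", "integer", "integer")
)

# Coverage is the sum of methylated and unmethylated reads
cov$coverage <- cov$methylated + cov$unmethylated

# The percentage column should match the counts (up to rounding in the file)
cov$recomputed <- 100 * cov$methylated / cov$coverage
consistent <- all(abs(cov$percent - cov$recomputed) < 1e-6)
print(consistent)
```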

bismarkCytosine
bed files are assumed to have the format as defined by Bismark's cytosine report output (Bismark's coverage2cytosine module; see the section "Optional genome-wide cytosine report output" in the Bismark User Guide). A tab-separated file contains: the chromosome name, cytosine coordinate, the strand, number of methylated reads, number of unmethylated reads, and additional information not taken into account by RnBeads. The file should not contain a header line. Coordinates are 1-based. Coordinates on the - strand specify the C on the - strand (G on the + strand). CpGs without coverage are allowed, but not required. Here are some example lines (genome assembly hg19):

...
chr22 16050097 + 0 0 CG CGG
chr22 16050098 - 0 0 CG CGA
chr22 16050114 + 0 0 CG CGG
chr22 16050115 - 0 0 CG CGT
...
chr22 16115591 + 1 1 CG CGC
chr22 16117938 - 0 2 CG CGT
chr22 16122790 + 0 1 CG CGC
...

Encode
bed files are assumed to have the same format as the ones that can be downloaded from UCSC's ENCODE data portal. A tab-separated file contains: the chromosome name, start coordinate, end coordinate, some identifier, read coverage, the strand, start and end coordinates again (RnBeads discards this repeated information), some color value, read coverage and the methylation percentage. The file should contain a header line. Coordinates are 0-based. Note that this file format is very similar but not identical to the 'BisSNP' one. Here are some example lines (genome assembly hg19):

track name="SL1815 MspIRRBS" description="HepG2_B1__GC_" visibility=2 itemRgb="On"
chr1 1000170 1000171 HepG2_B1__GC_ 62 + 1000170 1000171 55,255,0 62 6
chr1 1000190 1000191 HepG2_B1__GC_ 62 + 1000190 1000191 0,255,0 62 3
chr1 1000191 1000192 HepG2_B1__GC_ 31 - 1000191 1000192 0,255,0 31 0
chr1 1000198 1000199 HepG2_B1__GC_ 62 + 1000198 1000199 55,255,0 62 10
chr1 1000199 1000200 HepG2_B1__GC_ 31 - 1000199 1000200 0,255,0 31 0
chr1 1000206 1000207 HepG2_B1__GC_ 31 - 1000206 1000207 55,255,0 31 10
...

Can I combine the methylome resources on this site with the data of my samples?

Yes, and it is very easy. You need to copy all data files to a single directory and to merge the sample annotation tables. Just follow the steps below.

1. Create a new directory to host all data files and the generated reports. In the following steps, we use the directory project.
2. Download the files data.zip, samples.csv and analysis.xml from a dataset we provide on the methylome resources page.
3. Unzip the contents of data.zip to project/data. Copy samples.csv to the data directory as well. Keep the file analysis.xml in the parent directory project.
4. Copy the data files of your samples also to project/data.
5. Open and modify the file samples.csv by adding the information for your samples to the annotation table. This file is in comma-separated format and can be edited by any spreadsheet software, such as Microsoft Excel or LibreOffice. If you still have little experience with RnBeads, avoid renaming columns because this might affect the subsequent analysis steps.
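The last step, merging the annotation tables, can also be scripted instead of done in a spreadsheet. The sketch below stacks two annotation tables in base R and fills columns that exist in only one of them with NA. The helper merge_annotations and the column names sample_id, tissue and batch are made up for illustration; the columns in the downloaded samples.csv may differ.

```r
# Hypothetical helper: stack two annotation tables, filling columns that are
# missing in one of them with NA
merge_annotations <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  rbind(a[all_cols], b[all_cols])
}

# Stand-ins for read.csv("project/data/samples.csv") and for your own table
downloaded <- data.frame(
  sample_id = c("ref_1", "ref_2"),
  tissue = c("blood", "liver"),
  stringsAsFactors = FALSE
)
mine <- data.frame(
  sample_id = "my_1",
  tissue = "blood",
  batch = "A",
  stringsAsFactors = FALSE
)

combined <- merge_annotations(downloaded, mine)

# Sample identifiers should stay unique after merging
stopifnot(anyDuplicated(combined$sample_id) == 0)
print(combined)
```

The merged table could then be written back with write.csv(combined, "project/data/samples.csv", row.names = FALSE).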

Once you have added your dataset to the downloaded one, you can start the analysis pipeline using commands similar to the ones provided below:

# Set the working directory
setwd("project")
# Start the analysis pipeline
library(RnBeads)
rnb.run.xml("analysis.xml")

Feel free to experiment with different analysis options by editing the file analysis.xml or setting them in the R session using the function rnb.options().

Why are the option values reset to default when I load a saved session?

The option values are saved and handled internally by the RnBeads package. Therefore, if you save your R session using the function save.image(), the analysis options are not stored. You can copy them to a list, and reset them upon loading, as shown in the example below:

# Saving the current session
RnBeadsOptions <- rnb.options()
save.image(file = "my.analysis.RData")
# Loading a session
library(RnBeads)
load("my.analysis.RData")
do.call(rnb.options, RnBeadsOptions)

How do I tell RnBeads to perform a paired test in the differential analysis module?

Suppose you have a sample annotation table like this one:

sample	individual	diseaseState
sample_1	John	normal
sample_2	John	tumor
sample_3	Jane	normal
sample_4	Jane	tumor
sample_5	George	normal
sample_6	George	tumor

Suppose further that you want to compare tumor vs. normal while taking into account the pairing by patient/individual. Then you would apply the following option setting:

rnb.options("differential.comparison.columns"=c("diseaseState"),"columns.pairing"=c("diseaseState"="individual"))
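Before starting a paired analysis, it can be worth confirming that the annotation table really is paired. The sketch below rebuilds the example table in base R and checks that every individual contributes exactly one sample to each disease state. This check is our own suggestion and not something RnBeads requires you to run.

```r
# The example annotation table from above
annotation <- data.frame(
  sample = paste0("sample_", 1:6),
  individual = rep(c("John", "Jane", "George"), each = 2),
  diseaseState = rep(c("normal", "tumor"), times = 3),
  stringsAsFactors = FALSE
)

# Count samples per individual and group
pair_table <- table(annotation$individual, annotation$diseaseState)
print(pair_table)

# A complete pairing has exactly one sample in every cell
pairing_ok <- all(pair_table == 1)
print(pairing_ok)
```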

Can I introduce additional sample grouping information for analysis?

After loading, you can add sample annotation information (traits) to an RnBSet object. Use the function addPheno() for this purpose. You can introduce a text string for each sample with the same designation for each group that you want to specify. The newly added column in the annotation table can then be used for grouping. You can either let RnBeads figure out the categories by itself, or explicitly set the corresponding group options (see rnb.options() for details). You can set values to NA for samples that you don't want to include in either of the groups. If you want to specify explicit pairwise comparisons, just use the differential.comparison.columns.all.pairwise option.

How does the greedycut algorithm for filtering CpGs and samples work?

In brief, the algorithm iteratively removes the CpGs or samples in the dataset that contain the largest fraction of unreliable measurements as indicated by detection p-values or read coverage.

Here are some mathematical details: Let A = {a_ij} be an indicator matrix in which the columns denote samples and the rows denote sites. We set a_ij to 1 if the corresponding methylation measurement is unreliable, and to 0 otherwise. Furthermore, let B_retained be a submatrix of A, obtained after removing some columns and/or rows, and B_removed be the set of elements of A that were removed. Using this notation, a filtering step splits the elements of A into two groups, retained and removed, which should mirror as closely as possible the properties reliable and unreliable, respectively. We thus consider the step of removing low-quality samples and/or sites (columns and rows in A, respectively) as a classification problem. For every outcome B_retained (and the removed elements B_removed), one can compute sensitivity Se and specificity Sp with respect to the ground truth, i.e. unreliable and reliable. We define D to be the distance of the point (1 – Sp, Se) to the diagonal in a typical ROC curve plot, and search for an optimal submatrix B_retained with respect to the metric D.

There are two hurdles to obtaining an exact solution of this problem: (1) there is not necessarily a unique matrix B_retained with a minimal value of D, and (2) the problem is NP-complete. Therefore, in mathematical terms, Greedycut is a greedy search for an optimal submatrix B_retained of a given indicator matrix A = {a_ij} with respect to a specific criterion D.

Further details for the design of this algorithm (finding a maximum induced indicator submatrix) and some of its properties are described in Section 3.2 of this Ph.D. thesis.
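The description above can be illustrated with a small base R sketch. It is a simplified reading of the text, not the RnBeads implementation. It assumes that "unreliable" is the positive class and "removed" the positive prediction, that each step removes the row or column with the largest fraction of unreliable entries, and that the step with the largest distance D is kept. The function name greedycut_sketch is ours.

```r
# Simplified greedy search on an indicator matrix A (1 = unreliable)
greedycut_sketch <- function(A) {
  stopifnot(is.matrix(A), all(A %in% c(0, 1)))
  total_bad <- sum(A)
  total_good <- length(A) - total_bad
  rows <- seq_len(nrow(A))
  cols <- seq_len(ncol(A))
  best <- list(D = 0, rows = rows, cols = cols)
  # Nothing to classify if all entries are reliable or all are unreliable
  if (total_bad == 0 || total_good == 0) return(best)

  while (length(rows) > 1 && length(cols) > 1) {
    B <- A[rows, cols, drop = FALSE]
    if (sum(B) == 0) break  # no unreliable entries left

    # Remove the row or column with the largest unreliable fraction
    row_frac <- rowMeans(B)
    col_frac <- colMeans(B)
    if (max(row_frac) >= max(col_frac)) {
      rows <- rows[-which.max(row_frac)]
    } else {
      cols <- cols[-which.max(col_frac)]
    }

    # Sensitivity and specificity of "removed" as a predictor of "unreliable"
    B <- A[rows, cols, drop = FALSE]
    bad_retained <- sum(B)
    good_retained <- length(B) - bad_retained
    se <- (total_bad - bad_retained) / total_bad
    sp <- good_retained / total_good

    # Distance of the point (1 - Sp, Se) to the diagonal of the ROC plot
    D <- abs(se + sp - 1) / sqrt(2)
    if (D > best$D) best <- list(D = D, rows = rows, cols = cols)
  }
  best
}

# Toy example: 4 sites x 3 samples, site 2 is unreliable in every sample
A <- matrix(0, nrow = 4, ncol = 3)
A[2, ] <- 1
result <- greedycut_sketch(A)
print(result$rows)
```

In the toy example the search removes site 2 in the first step, after which no unreliable entries remain and the loop stops.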

RnBeads shows the error message: Error in save.rnb.diffmeth(diffmeth, diffmeth.path): trying to get slot "disk.dump" from an object of a basic class ("NULL") with no slots?

It is likely that the grouping information that you specified to RnBeads is invalid, such that RnBeads cannot perform differential analysis. In the RnBeads log, there is likely a line stating: "WARNING No valid grouping information found. NULL returned". Please validate that the column name that you specified using differential.comparison.columns matches the correct column in your sample annotation sheet. Additionally, please check the values for the options min.group.size and max.group.count, which specify how the accepted groups for differential analysis look.
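A quick way to narrow down such a problem is to inspect the grouping column yourself before running the pipeline. The sketch below is a rough base R check, not a copy of RnBeads' internal validation. The helper check_grouping is hypothetical, and the way it interprets min.group.size (minimum samples per group) and max.group.count (maximum number of groups) is an assumption based on the option names.

```r
# Hypothetical helper: report why a column may not define valid groups
check_grouping <- function(annotation, column,
                           min_group_size = 2, max_group_count = NULL) {
  if (!column %in% names(annotation)) {
    return(sprintf("column '%s' not found", column))
  }
  sizes <- table(annotation[[column]], useNA = "no")
  if (length(sizes) < 2) {
    return("fewer than two groups")
  }
  if (any(sizes < min_group_size)) {
    return("at least one group is smaller than min_group_size")
  }
  if (!is.null(max_group_count) && length(sizes) > max_group_count) {
    return("more groups than max_group_count")
  }
  "ok"
}

# Illustrative annotation table
annotation <- data.frame(
  sample = paste0("sample_", 1:6),
  diseaseState = rep(c("normal", "tumor"), times = 3),
  stringsAsFactors = FALSE
)

print(check_grouping(annotation, "diseaseState"))
print(check_grouping(annotation, "disease_state"))  # misspelled column name
```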

Reports & Figures

Can I rescale the images in the figures?

RnBeads typically generates thousands of images in one run of the pipeline, and their resolutions are tailored to the limited space in the HTML reports. In some cases, a high-resolution image can be viewed by clicking on the corresponding image in the report. Examples include the heatmaps in the report on methylation profiles, as well as the plots in the report on differential methylation. In other cases, you can use the generated PDF file underlying the plot of interest. There are links to the corresponding PDF images in the figure captions in the reports. PDF files store graphics in vector format, which allows rescaling to any size without loss of quality.

Can I change the background color or other properties of the generated plots?

Some of the visual properties of the images can be specified using RnBeads options such as colors.category and colors.gradient. See the section Analysis Parameter Overview of the RnBeads vignette for more information.

RnBeads utilizes the package ggplot2 for generating most of the figures. Therefore, many aspects of the plots can be modified by adjusting the corresponding parameters in the default visual theme. As a simple example, executing the following command before starting the analysis pipeline sets the black-and-white theme:

theme_set(theme_bw())

Please check the documentation of ggplot2 for a detailed description of themes. We can also recommend an online quick reference on the subject, put together by members of the Sape research group at the University of Lugano.

Why do all LOLA enrichment plots produce "plotting error", although `differential.enrichment.lola=TRUE`?

RnBeads makes extensive use of other Bioconductor and CRAN packages, which themselves have further dependencies. For instance, the LOLA package makes use of the qvalue package to adjust its p-values for multiple testing. This package is not loaded by default through the LOLA package, since multiple testing correction is optional. However, RnBeads relies on corrected p-values as thresholds and throws an error if no corrected p-value is available. To fix this issue, manually install qvalue and restart your analysis:

source("https://bioconductor.org/biocLite.R")
biocLite("qvalue")
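Recent Bioconductor releases have replaced biocLite() with the BiocManager package, so the command above may not work on a current R installation. The sketch below only checks whether qvalue is available and prints the BiocManager-based installation command if it is not. It does not install anything by itself.

```r
# Check whether the qvalue package can be loaded
qvalue_ready <- requireNamespace("qvalue", quietly = TRUE)

if (!qvalue_ready) {
  message("qvalue is missing; on current Bioconductor releases it can be ",
          "installed with BiocManager::install(\"qvalue\")")
}
```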

Technical Issues

I noticed that my analysis requires a lot of memory when I use multiple cores. What can I do?

R and the foreach package that we use for parallelization across multiple cores have been known to create unnecessary copies of in-memory objects for each parallel task. We therefore recommend reducing the number of cores using the parallel.setup() function in RnBeads. For large datasets, we recommend using no more than 2 to 4 cores and, if possible, parallelizing on a high performance compute cluster (HPC; see the "Deploying RnBeads on a Scientific Compute Cluster" section in the RnBeads vignette) rather than running the entire analysis on too many cores of a single machine.
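As a sketch of this recommendation, the code below picks a conservative number of cores with base R and passes it to parallel.setup() only if RnBeads is installed. The helper choose_cores and its cap of 4 cores are our own choices, following the "2 to 4 cores" guidance above.

```r
# Hypothetical helper: choose a small number of cores, leaving one core free
choose_cores <- function(max_cores = 4L) {
  available <- parallel::detectCores()
  if (is.na(available)) return(1L)
  as.integer(max(1L, min(max_cores, available - 1L)))
}

num.cores <- choose_cores()
message("Using ", num.cores, " core(s)")

if (requireNamespace("RnBeads", quietly = TRUE)) {
  RnBeads::parallel.setup(num.cores)
  # ... run the analysis here ...
  RnBeads::parallel.teardown()
}
```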

I heard that RnBeads stores large objects on disk rather than in main memory. Where are these objects stored? Can temporary files accumulate on my hard drive?

Any temporary files are automatically deleted when an RnBeads analysis is completed and the corresponding R session is closed. Typically the temp directory can be found on the /tmp/ path of a Linux/Unix machine. You can check where R stores your temporary data using the tempdir() command in an R session. By default RnBeads also stores big datasets on the hard drive during the analysis in order to reduce memory consumption. For this task it makes implicit use of the ff package for storing temporary files. Within an R session, you can see the ff temporary directory by executing the following commands:

tempdir()
library(ff)
getOption("fftempdir")

You can change where big RnBeads methylation datasets are stored on disk using

options(fftempdir="MY_DIRECTORY")

before running your analysis. However, if an R session is abnormally terminated, some temporary files might remain, because ff and RnBeads cannot regain control of the R session to delete these files. If you suspect that your computer contains old temporary files from RnBeads analyses, check the contents of the above directories and delete them manually.
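The following sketch shows one way to point fftempdir at a dedicated directory and to list files in it that are older than a given number of days, as candidates for manual cleanup. It uses only base R. The helper stale_files, the directory name rnbeads_ff and the 7-day threshold are illustrative assumptions.

```r
# Hypothetical helper: list files in a directory older than a given age
stale_files <- function(dir, older_than_days = 7) {
  files <- list.files(dir, full.names = TRUE, recursive = TRUE)
  age_days <- as.numeric(
    difftime(Sys.time(), file.info(files)$mtime, units = "days")
  )
  files[age_days > older_than_days]
}

# Use a dedicated directory for disk-backed data (illustrative location)
big_data_dir <- file.path(tempdir(), "rnbeads_ff")
dir.create(big_data_dir, showWarnings = FALSE)
options(fftempdir = big_data_dir)

# Files that were left behind by earlier, interrupted sessions
leftovers <- stale_files(big_data_dir)
print(leftovers)
```

In practice you would choose a directory outside the per-session tempdir(), so that it persists and can be inspected after a crash. Review the listed files before deleting any of them.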
