What it does

Optionally trimmed reads are mapped to the reference genome using a bisulfite-specific aligner (bwa-meth). Quality metrics are collected and synthesized in a QC report, including bisulfite conversion rate, mapping rate, coverage metrics, and methylation bias.

There are two flags that allow skipping certain QC metric calculation, i.e. --skipDOC and --GCbias. These deactivate or activate, respectively, the depth of coverage (DOC) calculations or GC bias calculation done by deepTools. If you run the workflow with --fromBAM, you can also choose to skip the re-calculation of most QC metrics with --skipBamQC.

Methylation ratios are extracted (via MethylDackel) for CpG positions in the reference genome with a minimum coverage specified by --minCoverage and low SNP allelic frequency (<0.25 illegitimate bases). If a sample sheet is provided, Metilene , DMRseq and/or DSS (as specified by --DMRprograms) will be used to find differentially methylated regions (DMRs). Filtering criterion can be changed both for the CpGs used to find DMRs as well as what are considered as significant DMRs.


Input requirements

This pipeline requires fastq files and a genome alias, for which bwa-meth index has been built. Optional inputs include a sample sheet with grouping information to use in differential methylation analysis and a blacklist bed file with genomic positions corresponding to known snps to mask single CpG methylation values.

It is possible to use pipeline-compatible bam files as input. For that, the user has to use the --fromBAM flag and provide the bam file extension if not matched by the default.

Workflow configuration file

## General/Snakemake parameters, only used/set by wrapper or in Snakemake cmdl, but not in Snakefile
local: False
maxJobs: 12
## directory with fastq or bam files
## Genome information
###SNP black list (bed file)
###sample Sheet
###inclusion bounds for methylation extraction
noAutoMethylationBias: False
## FASTQ file extension (default: ".fastq.gz")
ext: '.fastq.gz'
## paired-end read name extension (default: ['_R1', "_R2"])
reads: [_R1, _R2]
## Number of reads to downsample from each FASTQ file
## Options for trimming
trim: False
trimmer: 'fastp'
trimmerOptions: '-q 5 -l 30 -M 5'
## Bin size of output files in bigWig format
bwBinSize: 25
## Run FASTQC read quality control
fastqc: false
verbose: False
plotFormat: 'png'
#### Flag to control the pipeline entry point
fromBAM: False
bamExt: '.bam'
pairedEnd: True
###Flags to control skipping of certain QC calculations
skipDOC: False
GCbias: False
###Thresholds for filtering of statistical comparisons (DMRs and DMLs)
DMRprograms: 'metilene,dmrseq'
maxDist: 300
minCpGs: 10
minCoverage: 5
FDR: 0.1
minMethDiff: 0.1
###MethylDackel options
MethylDackelOptions: '--mergeContext --maxVariantFrac 0.25 --minDepth 4'
UMIBarcode: False
bcPattern: NNNNCCCCCCCCC #default: 4 base umi barcode, 9 base cell barcode (eg. RELACS barcode)
UMIDedup: False
UMIDedupSep: "_"
UMIDedupOpts: --paired
aligner: bwameth

Understanding the outputs

The WGBS pipeline invoked fastq files and a sample sheet and the --trim and --fastqc options will generate the following output:

├── bwameth
├── cluster_logs
├── dmrseq_sampleSheet_minCoverage5
├── FastQC_trimmed
├── FASTQ_fastp
├── MethylDackel
├── metilene_sampleSheet_minCoverage5
├── multiQC
├── originalFASTQ
└── QC_metrics

The workflow produces the following outputs:

  • FASTQ_downsampled: contains read files downsampled to 5mln reads. These are used to calculate conversion rate which would otherwise take a very long time.

  • bwameth: contains bam files obtained through read alignment with bwa-meth and the PCR duplicate removal with sambamba, as well as matching bam index files.

  • dmrseq_sampleSheet_minCoverage<X>: DMRs (DMRs.txt) and a report (Stats_report.html) from DMRseq as well as a saved R session (Session.RData) using the requested minimum coverage. If you rerun the pipeline with a different minimum coverage specified then a new directory will be created.

  • DSS_sampleSheet_minCoverage<X>: As with DMRseq above.

  • FastQC_trimmed: FastQC output on the trimmed reads.

  • FASTQ_fastp: The trimmed reads and QC metrics from FastP.

  • MethylDackel: BigWig coverage and methylation files as well as the bedGraph files produced by MethylDackel.

  • metilene_sampleSheet_minCoverage<X>: contains output files from metilene in DMRs.txt. DMRs.annotated.txt is an annotated version of that, wherein DMRs are annotated with the nearest gene and the distance to it. There is additionally a QC report (Stats_report.html) that summarizes various properties of the DMRs.

  • QC_metrics: contains output files from conversion rate, flagstat, depth of coverage, GCbias and methylation bias calculations. The QC report in pdf format collecting those metrics in tabular form is also found in this folder.

Example output plots

Using data from Habibi et al., Cell Stem Cell 2013 corresponding to mouse chr6:4000000-6000000, following plots could be obtained:

../../_images/DMRseq_methylation_difference.png ../../_images/WGBS_PCA_methylation.png ../../_images/WGBS_PCA_coverage.png

Command line options

MPI-IE workflow for WGBS analysis

usage example:

WGBS -i read_input_dir -o output-dir mm10

usage: WGBS -i INDIR -o OUTDIR [-h] [-v] [--ext EXT] [--reads READS READS]
            [-c CONFIGFILE] [--clusterConfigFile CLUSTERCONFIGFILE] [-j INT]
            [--local] [--keepTemp] [--snakemakeOptions SNAKEMAKEOPTIONS]
            [--DAG] [--version] [--emailAddress EMAILADDRESS]
            [--smtpServer SMTPSERVER] [--smtpPort SMTPPORT] [--onlySSL]
            [--emailSender EMAILSENDER] [--smtpUsername SMTPUSERNAME]
            [--smtpPassword SMTPPASSWORD] [--downsample INT] [--trim]
            [--trimmer {cutadapt,trimgalore,fastp}]
            [--trimmerOptions TRIMMEROPTIONS] [--fastqc] [--bcExtract]
            [--bcPattern BCPATTERN] [--UMIDedup] [--UMIDedupSep UMIDEDUPSEP]
            [--UMIDedupOpts UMIDEDUPOPTS] [--bwBinSize BWBINSIZE]
            [--plotFormat STR] [--aligner {bwameth,bwameth2}]
            [--blacklist BLACKLIST] [--targetRegions TARGETREGIONS]
            [--sampleSheet SAMPLESHEET] [--noAutoMethylationBias]
            [--maxDist MAXDIST] [--minCpGs MINCPGS]
            [--minMethDiff MINMETHDIFF] [--minCoverage MINCOVERAGE]
            [--FDR FDR] [--MethylDackelOptions METHYLDACKELOPTIONS]
            [--fromBAM] [--skipBamQC] [--bamExt BAMEXT] [--singleEnd]
            [--DMRprograms DMRPROGRAMS] [--metileneOptions METILENEOPTIONS]
            [--skipDOC] [--GCbias]

Positional Arguments


Genome acronym of the target organism. Either a yaml file or one of: hg38, mm39_ens106, GRCh38_gencode40, dm6, GRCz10, SchizoSPombe_ASM294v2, mm10_gencodeM19, GRCz11

Required Arguments

-i, --input-dir

input directory containing the FASTQ files, either paired-end OR single-end data

-o, --output-dir

output directory

General Arguments

-v, --verbose

verbose output (default: 'False')


Suffix used by input fastq files (default: '".fastq.gz"').


Suffix used to denote reads 1 and 2 for paired-end data. This should typically be either '_1' '_2' or '_R1' '_R2' (default: '['_R1', '_R2']). Note that you should NOT separate the values by a comma (use a space) or enclose them in brackets.

-c, --configFile

configuration file: config.yaml (default: 'None')


configuration file for cluster usage. In absence, the default options specified in defaults.yaml and workflows/[workflow]/cluster.yaml would be selected (default: 'None')

-j, --jobs

maximum number of concurrently submitted Slurm jobs / cores if workflow is run locally (default: '12')


run workflow locally; default: jobs are submitted to Slurm queue (default: 'False')


Prevent snakemake from removing files marked as being temporary (typically intermediate files that are rarely needed by end users). This is mostly useful for debugging problems.


Snakemake options to be passed directly to snakemake, e.g. use --snakemakeOptions='--dryrun --rerun-incomplete --unlock --forceall'. WARNING! ONLY EXPERT USERS SHOULD CHANGE THIS! THE DEFAULT VALUE WILL BE APPENDED RATHER THAN OVERWRITTEN! (default: '['--use-conda']')


If specified, a file ending in _pipeline.pdf is produced in the output directory that shows the rules used and their relationship to each other.


show program's version number and exit

Email Arguments


If specified, send an email upon completion to the given email address


If specified, the email server to use.


The port on the SMTP server to connect to. A value of 0 specifies the default port.


The SMTP server requires an SSL connection from the beginning.


The address of the email sender. If not specified, it will be the address indicated by --emailAddress


If your SMTP server requires authentication, this is the username to use.


If your SMTP server requires authentication, this is the password to use.



Downsample the given number of reads randomly from of each FASTQ file (default: 'None')


Activate fastq read trimming. If activated, Illumina adaptors are trimmed by default. Additional parameters can be specified under --trimmerOptions. (default: 'True')


Possible choices: cutadapt, trimgalore, fastp

Trimming program to use: Cutadapt, TrimGalore, or fastp. Note that if you change this you may need to change --trimmerOptions to match! (default: '"fastp"')


Additional option string for trimming program of choice. (default: '"-q 5 -l 30 -M 5"')


Run FastQC read quality control (default: 'False')


To extract umi barcode from fastq file via UMI-tools and add it to the read name (default: 'False')


The pattern to be considered for the barcode. 'N' = UMI position (required) 'C' = barcode position (optional) (default: '')


Deduplicate bam file based on UMIs via umi_tools dedup that are present in the read name. (default: 'False')


umi separation character that will be passed to umi_tools.(default: '"_"')


Additional options that will be passed to umi_tools.(default: '')


Bin size of output files in bigWig format (default: '25')


Possible choices: png, pdf, None

Format of the output plots from deepTools. Select 'none' for no plots (default: '"png"')


Possible choices: bwameth, bwameth2

Program used for mapping: bwameth or bwameth2 (default: 'None'). Specifying bwameth2 will run bwameth with bwa-mem2 underneath.


Bed file(s) with positions to mask for methylation calling. Useful for masking SNPs in your strain of interest. (default: 'None')


Bed file(s) with regions of interest to evaluate methylation for. (default: 'None')


Perform differential methylation analysis between groups of samples by providing a text file with sample information to use for statistical analysis. (default: 'None')


If specified, MethylDackel mbias will NOT be run and the suggested parameters from it will NOT be used for methylation extraction. You can instead supply them manually in --MethylDackelOptions.


The maximum distance between CpGs in a DMR (for metilene, default: '300')


The minimum number of CpGs in a DMR (for metilene, default: '10')


The minimum methylation change in methylation for CpG inclusion in DMR detection (for metilene, default: '0.1')


The minimum coverage needed for across all samples for a CpG to be used in DMR calling and PCA. Note that you can change this value without overwritting the DMR output. Default: '5'


FDR threshold for returned DMRs (default: '0.1')


Options to pass to MethylDackel extract. You are highly advised NOT to set a minimum coverage at this step. Default: '"--mergeContext --maxVariantFrac 0.25 --minDepth 4"'


If specified, the input is taking from BAM files containing alignments rather than fastq files. See also --bamExt.


If specified, filtering of input bam files, as well as calculation of quality metrics, will be skipped.


If --fromBAM is specified, this is the expected file extension. Removing it yields sample names. Default: '".bam"'


If --fromBAM is specified, this indicates that the input BAM files contain paired-end data. The option is ignored unless --fromBAM is given.


If a sample sheet is provided, use the specified DMR-calling programs. Multiple programs can be comma-separated with no spaces (e.g., 'metilene,dmrseq,DSS'). The available programs are metilene, dmrseq, and DSS (note that this is very slow). Default: "metilene,dmrseq".


Options to pass to metilene. Default: 'None'


Skip depth of coverage calculation with deepTools.


Perform GC bias calculation with deepTools.

code @ github.