Running snakePipes

Pipelines under snakePipes are designed in a way such that all workflows are configured and ran in a similar way.

An example with ChIP-seq data

A typical ChIP-seq analysis of human samples starts from paired-end FASTQ files in the directory input-dir:

$ ls /path/to/input-dir/
my_H3K27ac_sample_R1.fastq.gz  my_H3K27me3_sample_R1.fastq.gz  my_Input_sample_R1.fastq.gz
my_H3K27ac_sample_R2.fastq.gz  my_H3K27me3_sample_R2.fastq.gz  my_Input_sample_R2.fastq.gz

The ChIP-seq workflow requires the files to be processed via the DNA-mapping workflow first. We therefore run the DNA-mapping workflow :

$ DNA-mapping -i /path/to/input-dir -o /path/to/output-dir --mapq 5 -j 10 --dedup hs37d5
  • --mapq 5 would filter mapped reads for a minimum mapping quality of 5. This would keep only primary alignments from bowtie2, sufficient for downstream analysis.

  • --dedup would remove PCR duplicates (reads with matching 5' position in the genome), a typical step in ChIP-Seq analysis.

  • -j 10 defines 10 jobs to be run in parallel on the cluster (see below).

  • hs37d5 is the name of the genome (keyword for the yaml). The yaml file corresponding to this genome should exist as snakePipes/shared/organisms/hs37d5.yaml. (see Setting up snakePipes for details).

All individual jobs of the workflow will be submitted to the Grid engine using the command specified under shared/cluster.yaml. The parameter -j defines the number of jobs to be run in parallel, while the number of threads per job is hard-coded in the workflows.

To run the workflow locally, use the parameter --local for local mode and the parameter -j 10 to specify the maximal number of used CPU threads (here: 10).

For single-end FASTQ files, Note that single end data still needs a valid suffix (e.g. sample1_R1.fastq.gz). With a proper suffix, single end mode is detected by default. When executing some workflows with the --fromBAM flag, it is still necessary to set --singleEnd.

Once the DNA-mapping run is finished sucessfully. We can run the ChIP-seq analysis in the same directory.

$  ChIP-seq -d /path/to/dna-mapping-output/ hs37d5 chip-samples.yaml
  • -d specifies the directory where the output of DNA-mapping workflow lies. The ChIP-seq workflow would also write it's output there.

  • hs37d5 is the name of the genome (keyword for the yaml).

  • chip-samples.yaml is a yaml file that defines for each ChIP sample, the corresponding control (input) sample and the type of mark (broad/sharp). See ChIP-seq for more details on how to setup this yaml file.

The ChIP-seq workflow would follow up from the DNA-mapping outputs and perform peak calling, create ChIP-input normalized coverage files and also perform differential (control-test) analysis if a sample information file is provided (see below).

The sample sheet

Most of the workflows allow users to perform grouped operations as an option, for example differential expression analysis in mRNA-seq workflow, differential binding analysis in ChIP-Seq workflow, differential open-chromatin analysis in ATAC-seq workflow or merging of groups in Hi-C workflow. For all this analysis, snakePipes needs a sampleSheet.tsv file (file name is not important, but it has to be tab-separated) that contains sample grouping information. In most cases users would want to groups samples by replicates. The format of the file is as follows:

name   condition
sample1    control
sample1    control
sample2    test
sample2    test

The name section referes to sample names (without the read suffix), while the condition section refers to sample group (control/test, male/female, normal/diseased etc..)

Using BAM input

In many workflows it is possible to directly use BAM files as input by specifying --fromBAM. Note that you must then specify whether you have paired-end (the default) or single-end data. This is typically done with the --singleEnd option.

Changing read extensions or mate designators

The default file names produced by Illumina sequencers are of the form <sample>_R1.fastq.gz and <sample_R2.fastq.gz. However, sometimes public datasets will instead have a .fq.gz suffix or use _1 and _2 as mate designators. To enable this, the --ext option can be used to change .fastq.gz default suffix to .fq.gz and --reads to _1 _2.

Common considerations for all workflows

All of the snakePipes workflows that begin with a FASTQ file, perform the same pre-processing steps.

  • Linking/downsampling the FASTQ file : The FASTQ rule in the workflows links the input FASTQ file into the FASTQ folder in the output directory. If downsampling is specified, the FASTQ folder would contain the downsampled FASTQ file.

Note

The DNA-mapping and RNA-mapping pipelines can take either single, or paired-end FASTQ files. For paired-end data, the reads R1 and R2 are expected to have the suffix _R1 and _R2 respectively, which can be modified in the defaults.yaml file using the reads key, to your needs. For example, files downloaded from NCBI would normally have the extention .1.fastq.gz and .2.fastq.gz. Also, please check the ext key in the configuration file if you wish to modify the read extension (default is .fastq.gz).

  • Quality/adapter trimming (optional): If --trim is selected, the trimming rule would run the selected program (either Trimgalore, or Cutadapt) on the files in the FASTQ folder, and would produce another folder with name FASTQ_<program>, where <program> is either Cutadapt or Trimgalore.

  • FastQC (optional): If --fastqc is specified, the FASTQC rule would run FastQC on the input files and store the output under FastQC folder. If trimming is specified, FastQC is always produced on trimmed files, and stored under FastQC_trimmed folder.

  • --snakemakeOptions: All wrappers contain a --snakemakeOptions parameter, which is quite useful as it can be used to pass on any arguments directly to snakemake. One use case is to perform a dry run, i.e. to check which programs would be executed and which outputs would be created by the workflow, without actually running it. This can be executed via --snakemakeOptions="-np". This would also print the commands to be used during the run.

  • --DAG: All workflows can produce a directed acyclic graph of themselves, using the --DAG option in the wrappers. This could be useful in reporting/presenting the results.

  • --keepTemp: This option control temporary/intermediate files are to be kept after the workflow is finished. Normally the temporary files are removed after analysis.

  • --bwBinSize: This option is available for most workflows, and refers to the bin size used to create the coverage files. BigWig files are created by most workflows in order to allow downstream analysis and visualization of outputs. This argument controls the size of the bins in which the genome is divided for creating this file. The default is sufficient for most analysis.

  • Temporary directory/files: Some tools need additonal space during runtime (eg. samtools sort -T [DIR] ...). SnakePipes uses the core tool mktemp to create temporary directories in some rules. On Linux-based systems the global env variabale $TMPDIR is honored. On Mac OS and if $TMPDIR is empty, we fallback to /tmp/ as the parent temporary directory. For performance reasons, it is recommended that the $TMPDIR points to a local drive (and not eg. an NFS share). Please make sure there is enough space!

Logging of outputs

snakePipes produces logs at three diferrent levels.

  • <workflow>.log: This file would be generated on the working directory, and contains everything printed on the screen via snakemake and python wrappers.

  • <workflow>_organism.yaml: This file is a copy of the YAML file specifying where all of the genomic indices, annotations, and other files are located.

  • cluster_logs: In case snakePipes is setup with a cluster, the folder cluster_logs would contain the output and error messages from the cluster scheduler.

  • <output>/logs: Each output folder from snakePipes workflows contain their own log (.err and .out) file under /logs/ folder. This contains the messages directly from the executed tools.

Note

For most cases where a tool fails, these files contain useful debugging information. However sometimes, the error can't be captured in these files and therefore ends up in the cluster_logs folder.

Quality-Checks

All workflows under snakePipes employ various quality-checks (QC) to inform users of the data quality.

  • MultiQC : All workflows in snakePipes output a MultiQC folder, which summerizes the QC metrics obtained from various tools in the workflow via MultiQC, in an interactive HTML report. This output is quite useful to compare samples and get an overview of the data quality from all samples.

  • deepTools: deepTools are a popular set of tools that perform QC, normalization and visualization of NGS data. In snakePipes, most workflows (except HiC and scRNAseq) contain outputs from various deepTools modules on the samples. The coverage files (bigWigs), are also generated by deepTools (bamCoverage and bamCompare modules). Therefore, it's useful to look at the deepTools documentation before inspecting these results.

Note

We strongly encourage users to understand these quality matrices and inspect the results from QC, before making biological conclusions or preceeding to downstream analysis.

code @ github.