Pipelines under snakePipes are designed in a way such that all workflows are configured and ran in a similar way.
An example with ChIP-seq data¶
A typical ChIP-seq analysis of human samples starts from paired-end FASTQ files in the directory
$ ls /path/to/input-dir/ my_H3K27ac_sample_R1.fastq.gz my_H3K27me3_sample_R1.fastq.gz my_Input_sample_R1.fastq.gz my_H3K27ac_sample_R2.fastq.gz my_H3K27me3_sample_R2.fastq.gz my_Input_sample_R2.fastq.gz
$ DNA-mapping -i /path/to/input-dir -o /path/to/output-dir --mapq 5 -j 10 --dedup hs37d5
--mapq 5would filter mapped reads for a minimum mapping quality of 5. This would keep only primary alignments from bowtie2, sufficient for downstream analysis.
--dedupwould remove PCR duplicates (reads with matching 5' position in the genome), a typical step in ChIP-Seq analysis.
-j 10defines 10 jobs to be run in parallel on the cluster (see below).
hs37d5is the name of the genome (keyword for the yaml). The yaml file corresponding to this genome should exist as
snakePipes/shared/organisms/hs37d5.yaml. (see Setting up snakePipes for details).
All individual jobs of the workflow will be submitted to the Grid engine using the command specified under
shared/cluster.yaml. The parameter
-j defines the number of jobs to be run in parallel, while the number of threads per job is hard-coded in the workflows.
To run the workflow locally, use the parameter
--local for local mode and the parameter
-j 10 to specify the maximal number of used CPU threads (here: 10).
For single-end FASTQ files, the workflow would automatically recognize them if the file name has no suffix (eg. "sample1.fastq" instead of "sample1_R1.fastq"). However, mixing of single and paired-end files in the same folder is not supported currently.
Once the DNA-mapping run is finished sucessfully. We can run the ChIP-seq analysis in the same directory.
$ ChIP-seq -d /path/to/dna-mapping-output/ hs37d5 chip-samples.yaml
-dspecifies the directory where the output of DNA-mapping workflow lies. The ChIP-seq workflow would also write it's output there.
hs37d5is the name of the genome (keyword for the yaml).
chip-samples.yamlis a yaml file that defines for each ChIP sample, the corresponding control (input) sample and the type of mark (broad/sharp). See ChIP-seq for more details on how to setup this yaml file.
The ChIP-seq workflow would follow up from the DNA-mapping outputs and perform peak calling, create ChIP-input normalized coverage files and also perform differential (control-test) analysis if a sample information file is provided (see below).
The sample sheet¶
Most of the workflows allow users to perform grouped operations as an option, for example
differential expression analysis in mRNA-seq workflow, differential binding analysis in
ChIP-Seq workflow, differential open-chromatin analysis in ATAC-seq workflow or merging of
groups in Hi-C workflow. For all this analysis, snakePipes needs a
sampleSheet.tsv file (file name is not important, but it has to be tab-separated) that contains sample grouping information. In most cases users would want to groups samples by replicates. The format of the file is as follows:
name condition sample1 control sample1 control sample2 test sample2 test
The name section referes to sample names (without the read suffix), while the condition section refers to sample group (control/test, male/female, normal/diseased etc..)
Using BAM input¶
In many workflows it is possible to directly use BAM files as input by specifying
--fromBAM. Note that you must then specify whether you have paired-end (the default) or single-end data. This is typically done with the
Changing read extensions or mate designators¶
The default file names produced by Illumina sequencers are of the form
<sample_R2.fastq.gz. However, sometimes public datasets will instead have a
.fq.gz suffix or use
_2 as mate designators. To enable this, the
--ext option can be used to change
.fastq.gz default suffix to
Common considerations for all workflows¶
All of the snakePipes workflows that begin with a FASTQ file, perform the same pre-processing steps.
- Linking/downsampling the FASTQ file : The FASTQ rule in the workflows links the input FASTQ file into the FASTQ folder in the output directory. If
downsamplingis specified, the FASTQ folder would contain the downsampled FASTQ file.
The DNA-mapping and RNA-mapping pipelines can take either single, or paired-end FASTQ files. For paired-end data, the reads
R2 are expected to have the suffix
_R2 respectively, which can be modified in the
defaults.yaml file using the
reads key, to your needs. For example, files downloaded from NCBI would normally have the extention
.2.fastq.gz. Also, please check the
ext key in the configuration file if you wish to modify the read extension (default is
- Quality/adapter trimming (optional): If
--trimis selected, the
trimmingrule would run the selected program (either Trimgalore, or Cutadapt) on the files in the FASTQ folder, and would produce another folder with name
FASTQ_<program>, where <program> is either
- FastQC (optional): If
--fastqcis specified, the
FASTQCrule would run FastQC on the input files and store the output under
FastQCfolder. If trimming is specified, FastQC is always produced on trimmed files, and stored under
- --snakemakeOptions: All wrappers contain a
--snakemakeOptionsparameter, which is quite useful as it can be used to pass on any arguments directly to snakemake. One use case is to perform a dry run, i.e. to check which programs would be executed and which outputs would be created by the workflow, without actually running it. This can be executed via
--snakemakeOptions="-np". This would also print the commands to be used during the run.
- --DAG: All workflows can produce a directed acyclic graph of themselves, using the
--DAGoption in the wrappers. This could be useful in reporting/presenting the results.
- --keepTemp: This option control temporary/intermediate files are to be kept after the workflow is finished. Normally the temporary files are removed after analysis.
- --bwBinSize: This option is available for most workflows, and refers to the bin size used to create the coverage files. BigWig files are created by most workflows in order to allow downstream analysis and visualization of outputs. This argument controls the size of the bins in which the genome is divided for creating this file. The default is sufficient for most analysis.
- Temporary directory/files: Some tools need additonal space during runtime (eg.
samtools sort -T [DIR] ...). SnakePipes uses the core tool
mktempto create temporary directories in some rules. On Linux-based systems the global env variabale
$TMPDIRis honored. On Mac OS and if $TMPDIR is empty, we fallback to /tmp/ as the parent temporary directory. For performance reasons, it is recommended that the $TMPDIR points to a local drive (and not eg. an NFS share). Please make sure there is enough space!
Logging of outputs¶
snakePipes produces logs at three diferrent levels.
- <workflow>.log: This file would be generated on the working directory, and contains everything printed on the screen via snakemake and python wrappers.
- <workflow>_organism.yaml: This file is a copy of the YAML file specifying where all of the genomic indices, annotations, and other files are located.
- cluster_logs: In case snakePipes is setup with a cluster, the folder
cluster_logswould contain the output and error messages from the cluster scheduler.
- <output>/logs: Each output folder from snakePipes workflows contain their own log (
.out) file under
/logs/folder. This contains the messages directly from the executed tools.
For most cases where a tool fails, these files contain useful debugging information. However sometimes, the error can't be captured in these files and therefore ends up in the
All workflows under snakePipes employ various quality-checks (QC) to inform users of the data quality.
- MultiQC : All workflows in snakePipes output a
MultiQCfolder, which summerizes the QC metrics obtained from various tools in the workflow via MultiQC, in an interactive HTML report. This output is quite useful to compare samples and get an overview of the data quality from all samples.
- deepTools: deepTools are a popular set of tools that perform QC, normalization and visualization of NGS data. In snakePipes, most workflows (except HiC and scRNAseq) contain outputs from various deepTools modules on the samples. The coverage files (bigWigs), are also generated by deepTools (bamCoverage and bamCompare modules). Therefore, it's useful to look at the deepTools documentation before inspecting these results.
We strongly encourage users to understand these quality matrices and inspect the results from QC, before making biological conclusions or preceeding to downstream analysis.
|code @ github.|