Setting up snakePipes¶
Unlike many other pipelines, setting up snakePipes is easy! All you need is a Linux or OSX system with a Python 3 installation of mamba. Past versions of snakePipes used conda. We are now moving to mamba, a Python-based CLI conceived as a drop-in replacement for conda. It offers faster and more reliable environment solving for snakePipes workflows thanks to its bindings to _libsolv_.
Installing conda & mamba¶
Follow the instructions here to install either miniconda or anaconda first. Once either one is installed, you can add mamba to your base environment:
$ conda install mamba -c conda-forge
After installation, check your Python path and version:
$ command -v python
<your_chosen_installation_path>/bin/python
$ python --version # anything above 3.5 is ok!
Python 3.6.5 :: Anaconda, Inc.
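The same check can be scripted. The sketch below is only an illustration: the helper name `is_supported` is hypothetical, and it reads "anything above 3.5" as "3.5 or newer", which is an assumption.

```python
import shutil
import sys

# Minimum version taken from the comment above; treating 3.5 itself as
# acceptable is an assumption.
MINIMUM = (3, 5)


def is_supported(version_info, minimum=MINIMUM):
    """Return True if the (major, minor, ...) tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # Rough equivalent of `command -v python` and `python --version`.
    print("python found at:", shutil.which("python"))
    print("running version:", ".".join(map(str, sys.version_info[:3])))
    print("supported:", is_supported(sys.version_info))
```

Run it with the interpreter from your conda installation to confirm that this is the Python that will be used.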
Now we are ready to install the latest snakePipes release using mamba.
Installing snakePipes¶
The easiest way to install snakePipes is via our conda channel. The following command installs snakePipes and also creates a conda virtual environment named snakePipes, which you can then activate via conda activate snakePipes. Specifying a snakePipes version (by appending ==<version> to the package name) avoids issues with conda's environment solver.
mamba create -n snakePipes -c mpi-ie -c conda-forge -c bioconda snakePipes
This way, the software used within snakePipes does not conflict with software pre-installed on your system or in your Python environment.
Now, we should create the workflow environments:
conda activate snakePipes
snakePipes createEnvs
Modify global options¶
It is often useful to store organism YAML files and the cluster configuration file outside of snakePipes, so that these can be used across snakePipes versions without needing to make copies. Since snakePipes 1.3.0, this can be done by modifying the defaults.yaml file, the location of which is given by snakePipes info. Instead of manually modifying this file, you may also use snakePipes config.
To see the location of the various YAML files so you can manually inspect them, you can use:
snakePipes info
This shows the locations of:
defaults.yaml : Defines default tool and file paths. See Create the conda environments
cluster.yaml : Defines the execution command for the cluster. See Configure your cluster
organisms/<organism>.yaml : Defines genome indices and annotations for various organisms. See Configure the organisms
Workflow-specific defaults : Defines default options for our command line wrappers. See Configure default options for workflows
It is a good idea to keep a copy of your defaults.yaml, cluster.yaml and the whole organisms folder in a dedicated location outside the snakePipes installation folder, e.g. a folder named "snakePipes_configs".
You can configure snakePipes to use these files after a fresh installation or update with snakePipes config --organismsDir my_organisms_dir --clusterConfig my_cluster_config. This will also work if you add --configMode recycle.
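If you reconfigure snakePipes regularly, for example after every update, the call can be assembled in a small script. This is a minimal sketch that uses only the flags mentioned above; the function name `build_config_command` and the example paths are hypothetical.

```python
import shlex


def build_config_command(organisms_dir, cluster_config, recycle=False):
    """Assemble the `snakePipes config` call described above as an argv list."""
    cmd = [
        "snakePipes", "config",
        "--organismsDir", organisms_dir,
        "--clusterConfig", cluster_config,
    ]
    if recycle:
        cmd += ["--configMode", "recycle"]
    return cmd


# Hypothetical locations outside the snakePipes installation folder.
cmd = build_config_command(
    "/home/me/snakePipes_configs/organisms",
    "/home/me/snakePipes_configs/cluster.yaml",
    recycle=True,
)
print(shlex.join(cmd))  # shell-quoted form, for copy and paste
# To execute it inside the activated snakePipes environment:
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when a path contains spaces.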
Create the conda environments¶
All the tools required for running the various pipelines are installed from conda repositories (mainly bioconda). The following command installs the tools and creates the respective conda environments.
snakePipes createEnvs
Note
snakePipes createEnvs will also set the snakemakeOptions: line in the global snakePipes defaults.yaml file. If you have already modified this line, use the --keepCondaDir option.
The place where the conda envs are created (and therefore where the tools are installed) is defined in the snakePipes/defaults.yaml file in our GitHub repository. You can modify it to suit your needs.
Here is the relevant content of defaults.yaml:
snakemakeOptions: '--use-conda --conda-prefix /data/general/scratch/conda_envs'
Note
Whenever you change the snakemakeOptions: line in defaults.yaml, you should run snakePipes createEnvs to ensure that the conda environments are then created.
Running snakePipes createEnvs is not strictly required, but it makes it easier for multiple users to share the same snakePipes installation.
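To see which directory a given snakemakeOptions: value points to, the option string can be tokenised like a shell command line. The following is an illustrative sketch, not part of snakePipes; the helper name `conda_prefix` is hypothetical.

```python
import shlex


def conda_prefix(snakemake_options):
    """Return the value following --conda-prefix, or None if it is absent."""
    tokens = shlex.split(snakemake_options)
    for i, token in enumerate(tokens):
        if token == "--conda-prefix" and i + 1 < len(tokens):
            return tokens[i + 1]
        if token.startswith("--conda-prefix="):
            return token.split("=", 1)[1]
    return None


# The value shown in the defaults.yaml example above.
options = "--use-conda --conda-prefix /data/general/scratch/conda_envs"
print(conda_prefix(options))
```

This can be handy for checking that the directory exists and is writable before running snakePipes createEnvs.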
Configure the organisms¶
For each organism of your choice, create a file called <organism>.yaml in the folder specified by organismsDir in defaults.yaml and fill in the paths to the required files next to the corresponding yaml entry. For common organisms, the required files can be downloaded and the yaml entries created automatically via the workflow createIndices.
The yaml files look like this after the setup (an example from the drosophila genome dm3):
# Integer, size of genome in base-pairs
genome_size: 142573017
# path to genome.fasta for mapping
genome_fasta: "/data/repository/organisms/dm3_ensembl/genome_fasta/genome.fa"
# path to genome.fasta.fai (fasta index) for mapping
genome_index: "/data/repository/organisms/dm3_ensembl/genome_fasta/genome.fa.fai"
# OPTIONAL. Needed for GC bias estimation by deepTools
genome_2bit: "/data/repository/organisms/dm3_ensembl/genome_fasta/genome.2bit"
# Needed for DNA-mapping workflow
bowtie2_index: "/data/repository/organisms/dm3_ensembl/BowtieIndex/genome"
# index of the genome.fasta using HISAT2, needed for RNA-seq workflow
hisat2_index: "/data/repository/organisms/dm3_ensembl/HISAT2Index/genome"
# needed by HISAT2 for RNA-seq workflow
known_splicesites: "/data/repository/organisms/dm3_ensembl/ensembl/release-78/HISAT2/splice_sites.txt"
bwa_index: "/data/repository/organisms/dm3_ensembl/BWAindex/genome.fa"
# index of the genome.fasta using STAR, needed for RNA-seq workflow
star_index: "/data/repository/organisms/dm3_ensembl/STARIndex/"
# Needed for QC and annotation in DNA-mapping/RNA-Seq workflows
genes_bed: "/data/repository/organisms/dm3_ensembl/Ensembl/release-78/genes.bed"
# Needed for QC and annotation in DNA-mapping/RNA-Seq workflows
genes_gtf: "/data/repository/organisms/dm3_ensembl/Ensembl/release-78/genes.gtf"
# OPTIONAL. For QC and filtering of regions in multiple workflows.
blacklist_bed:
# STRING. Name of the chromosomes to ignore for calculation of normalization factors for coverage files
ignoreForNormalization: "U Uextra X XHet YHet dmel_mitochondrion_genome"
Warning
Do not edit the yaml keywords corresponding to each required entry.
Note
Some fields are optional and can be left empty. For example, if a blacklist file is not available for your organism of interest, leave blacklist_bed: empty. Files for either STAR or HISAT2 can be skipped for RNA-seq if the respective aligner is not used. We nevertheless recommend providing all the files, to allow more flexible analysis.
After setting up the yamls, we can execute a snakePipes workflow on the organism of choice by referring to the organism as dm3, where the keyword dm3 matches the name of the yaml file (dm3.yaml).
Note
The name of the yaml file (except the .yaml suffix) is used as keyword to refer to the organism while running the workflows.
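Two of the rules above lend themselves to a quick sanity check: the keyword is derived from the file name, and only optional entries may stay empty. The sketch below assumes the YAML file has already been loaded into a dict by a YAML parser of your choice. The helper names are hypothetical, and the set of optional keys contains only the entries marked OPTIONAL in the example.

```python
from pathlib import Path

# Keys marked OPTIONAL in the example above. Treating every other key as
# required is a simplification (see the note on STAR/HISAT2).
OPTIONAL_KEYS = {"genome_2bit", "blacklist_bed"}


def organism_keyword(yaml_path):
    """The organism keyword is the file name without the .yaml suffix."""
    return Path(yaml_path).stem


def empty_required_keys(organism):
    """List non-optional keys present in the dict whose value is empty."""
    return sorted(
        key for key, value in organism.items()
        if key not in OPTIONAL_KEYS and value in (None, "")
    )


# A shortened, already-parsed version of the dm3 example; bowtie2_index is
# deliberately left empty to show what the check reports.
dm3 = {
    "genome_size": 142573017,
    "genome_fasta": "/data/repository/organisms/dm3_ensembl/genome_fasta/genome.fa",
    "blacklist_bed": None,
    "bowtie2_index": "",
}
print(organism_keyword("organisms/dm3.yaml"))
print(empty_required_keys(dm3))
```

The check only looks at whether a value is set. It does not verify that the referenced files or index prefixes exist on disk.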
Download premade indices¶
For the sake of convenience, we provide premade indices for the following organisms:
To use these, simply download and extract them. You will then need to modify the provided YAML file to indicate exactly where the indices are located (i.e., replace /data/processing/ryan with whatever is appropriate).
Configure your cluster¶
The cluster.yaml file contains the default memory requirements as well as two options passed to snakemake that control how jobs are submitted to the cluster and how files are retrieved:
snakemake_latency_wait: 300
snakemake_cluster_cmd: module load slurm; SlurmEasy --mem-per-cpu {cluster.memory} --threads {threads} --log {snakePipes_cluster_logDir} --name {rule}.snakemake
snakePipes_cluster_logDir: cluster_logs
__default__:
    memory: 8G
snp_split:
    memory: 10G
The location of this file must be specified by the clusterConfig value in defaults.yaml.
You can change the default per-core memory allocation here if needed. Importantly, the snakemake_cluster_cmd option must be changed to match your needs (see table below). Whatever command you specify must include a {cluster.memory} option and a {threads} option. You can specify other required options here as well.
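To get a feeling for how the placeholders are filled in, the substitution can be imitated with Python's str.format. This is a sketch for illustration only: the real substitution is done by snakePipes and snakemake, the function names are hypothetical, and the cluster configuration is written as a dict instead of being read from cluster.yaml.

```python
from types import SimpleNamespace

# Placeholders that the text above says every command must contain.
REQUIRED = ("{cluster.memory}", "{threads}")


def missing_placeholders(cluster_cmd):
    """Return the required placeholders that do not occur in the command."""
    return [p for p in REQUIRED if p not in cluster_cmd]


def memory_for(rule, cluster_yaml):
    """Rule-specific entries override the values given under __default__."""
    merged = {**cluster_yaml["__default__"], **cluster_yaml.get(rule, {})}
    return merged["memory"]


def render(cluster_cmd, rule, threads, cluster_yaml, log_dir):
    """Imitate the placeholder substitution with str.format."""
    cluster = SimpleNamespace(memory=memory_for(rule, cluster_yaml))
    return cluster_cmd.format(
        cluster=cluster, threads=threads, rule=rule,
        snakePipes_cluster_logDir=log_dir,
    )


cluster_yaml = {"__default__": {"memory": "8G"}, "snp_split": {"memory": "10G"}}
cmd = ("SlurmEasy --mem-per-cpu {cluster.memory} --threads {threads} "
       "--log {snakePipes_cluster_logDir} --name {rule}.snakemake")
print(missing_placeholders(cmd))
print(render(cmd, "snp_split", 4, cluster_yaml, "cluster_logs"))
```

Rendering a command this way before submitting real jobs makes it easy to spot a misspelled placeholder or a missing memory entry.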
The snakemake_latency_wait value defines how long snakemake should wait for files to appear before throwing an error. The default of 300 seconds is typically reasonable when a file system such as NFS is in use. Please also note that there are additional memory settings for each workflow in snakePipes/workflows/[workflow]/cluster.yaml that you might need to adjust.
snakePipes_cluster_logDir: can be used like a wildcard in snakemake_cluster_cmd to specify the directory for the stdout and stderr files of a job running on the cluster. It is given separately to make sure the directory exists before execution. A relative path is interpreted relative to the output directory of the workflow. You can also give an absolute log directory, starting with /.
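The relative versus absolute rule matches how POSIX path joining behaves, so it can be sketched in a few lines. The function name `resolve_log_dir` and the example paths are hypothetical, and the actual snakePipes code may resolve the directory differently.

```python
import posixpath


def resolve_log_dir(output_dir, log_dir):
    """Relative log dirs live under the workflow output directory;
    absolute ones (starting with /) are used unchanged."""
    # posixpath.join discards output_dir when log_dir is absolute.
    return posixpath.normpath(posixpath.join(output_dir, log_dir))


print(resolve_log_dir("/data/run1", "cluster_logs"))
print(resolve_log_dir("/data/run1", "/scratch/logs"))
# Creating the directory up front could then be done with
# os.makedirs(path, exist_ok=True).
```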
Scheduler/Queuing | snakemake_cluster_cmd example |
---|---|
slurm | snakemake_cluster_cmd: module load slurm; sbatch --ntasks-per-node=1 -c {threads} -J {rule}.snakemake --mem-per-cpu={cluster.memory} -p MYQUEUE -o {snakePipes_cluster_logDir}/{rule}.%j.out -e {snakePipes_cluster_logDir}/{rule}.%j.err |
 | snakePipes_cluster_logDir: cluster_logs |
PBS/Torque | snakemake_cluster_cmd: qsub -N {rule}.snakemake -q MYQUEUE -l pmem={cluster.memory} -l walltime=20:00:00 -l nodes=1:ppn={cluster.threads} -o {snakePipes_cluster_logDir}/{rule}.\$PBS_JOBID.out -e {snakePipes_cluster_logDir}/{rule}.\$PBS_JOBID.err |
 | snakePipes_cluster_logDir: cluster_logs |
SGE | Please send us a working example! |
Configure default options for workflows¶
The default options for all command-line arguments as well as for the cluster (memory) are stored in the workflow-specific folders. If you have cloned the repository locally, these files are located in the snakePipes/workflows/<workflow_name> folder. You can modify the values in these yamls to suit your needs. Most of the default values can also be overridden from the command line wrappers when executing a workflow.
Below are some of the workflow defaults from the DNA-mapping pipeline. An empty value means that no default is set:
## key for the genome name (eg. dm3)
genome:
## FASTQ file extension (default: ".fastq.gz")
ext: '.fastq.gz'
## paired-end read name extension (default: ['_R1', "_R2"])
reads: [_R1, _R2]
## mapping mode
mode: mapping
aligner: Bowtie2
## Number of reads to downsample from each FASTQ file
downsample:
## Options for trimming
trim: False
trimmer: cutadapt
trimmerOptions:
## Bin size of output files in bigWig format
bwBinSize: 25
## Run FASTQC read quality control
fastqc: false
## Run computeGCBias quality control
GCBias: false
## Retain only de-duplicated reads/read pairs
dedup: false
## Retain only reads with at least the given mapping quality
mapq: 0
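The relationship between these defaults and the command-line wrappers can be pictured as a simple merge in which explicitly given options win. This is a conceptual sketch, not the actual snakePipes implementation. The function name `effective_options` is hypothetical, and the defaults are written as a dict instead of being read from the YAML file.

```python
def effective_options(defaults, overrides):
    """Start from the YAML defaults and apply every override that was set."""
    merged = dict(defaults)
    merged.update({k: v for k, v in overrides.items() if v is not None})
    return merged


# A subset of the DNA-mapping defaults listed above.
defaults = {"aligner": "Bowtie2", "trim": False, "bwBinSize": 25,
            "dedup": False, "mapq": 0, "downsample": None}
# Hypothetical values as they might arrive from a command-line parser;
# None stands for "option not given".
overrides = {"trim": True, "mapq": 3, "bwBinSize": None}
print(effective_options(defaults, overrides))
```

Options that are left out on the command line keep the value from the YAML file, which is why changing the YAML defaults is mostly useful for settings you would otherwise repeat on every run.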
Test data¶
Test data for the various workflows is available at the following locations: