RNA-seq

What it does

The snakePipes RNA-seq workflow allows users to process their single or paired-end RNA-Seq fastq files upto the point of gene/transcript-counts and differential expression. It also allows full allele-specific RNA-Seq analysis (up to allele-specific differential expression) using the allelic-mapping mode.

../../_images/RNAseq_pipeline.png

Input requirements

The only requirement is a directory of gzipped fastq files. Files could be single or paired end, and the read extensions could be modified using the keys in the defaults.yaml file below.

Configuration file

There is a configuration file in snakePipes/workflows/RNA-seq/defaults.yaml:

## General/Snakemake parameters, only used/set by wrapper or in Snakemake cmdl, but not in Snakefile
pipeline: rna-seq
outdir:
configfile:
cluster_configfile:
local: False
max_jobs: 5
## directory with fastq files
indir:
## preconfigured target genomes (mm9,mm10,dm3,...) , see /path/to/snakemake_workflows/shared/organisms/
## Value can be also path to your own genome config file!
genome:
## FASTQ file extension (default: ".fastq.gz")
ext: '.fastq.gz'
## paired-end read name extension (default: ["_R1", "_R2"])
reads: ["_R1","_R2"]
## Number of reads to downsample from each FASTQ file
downsample:
## Options for trimming
trim: False
trim_prg: cutadapt
trim_options:
## further options
mode: alignment-free,deepTools_qc
sampleSheet:
bw_binsize: 25
fastqc: False
featurecounts_options: -C -Q 10 --primary
filter_annotation:
fragment_length: 200
library_type: 2
salmon_index_options: --type quasi -k 31
dnaContam: False
## supported mappers: STAR HISAT2
mapping_prg: STAR
## N.B., setting --outBAMsortingBinsN too high can result in cryptic errors
star_options: --outBAMsortingBinsN 30
hisat_options:
verbose: False
plot_format: png
# for allele-spcific mapping
snp_file:
Nmasked_index:

Apart from the common workflow options (see Running snakePipes), the following parameters are useful to consider:

  • mapping_prg: You can choose either STAR or HISAT2. While HISAT2 mapping is usually faster than STAR, we keep STAR as the default aliger due to it's superior accuracy (see this paper).

  • star_options/hisat_options: Options to pass on to your chosen aligner. Note that library type and junction definitions doesn't have to be passed to the aligners as options, as they are handeled either automatically, or via other parameters.

  • featurecounts_options: Options to pass to featureCounts (in case the alignment or allelic-mapping mode is used). Note that the paired-end information is automatically passed to featurecounts via the workflow, and the summerization is always performed at gene level, since the workflow assumes that featurecounts output is being used for gene-level differential expression analysis. If you wish to perform a transcript-level DE analysis, please choose the mode alignment-free.

  • filter_annotation: Options you can pass on to filter the original GTF file. This is useful in case you want to filter certain kind of transcripts (such as pseudogenes) before running the counts/DE analysis.

  • library_type: The default library-type is suitable for most RNAseq protocols (using Illumina Tru-Seq). Change this option in case you have a different strandednes.

  • salmon_index_options: In the alignment-free mode (see below), this option allows you to change the type of index created by salmon. New users can leave it to default.

  • dnaContam: Enable this to test for possible DNA contamination in your RNA-seq samples. DNA contamination is quantified as the fraction of reads falling into intronic and intergenic regions, compared to those falling into exons. Enabling this option would produce a directory called GenomicContamination with .tsv files containing this information.

  • plot_format: You can switch the type of plot produced by all deeptools modules using this option. Possible choices : png, pdf, svg, eps, plotly

  • snp_file: For the allelic-mapping mode. The snp_file is the file produced by SNPsplit after creating a dual-hybrid genome. The file has the suffix .vcf.

  • Nmasked_index: For the allelic-mapping mode. The Nmasked_index refers to the basename of the index file created using STAR, on the SNPsplit output.

Note

snp_file and Nmasked_index file could be specified in case you already have this and would like to re-run the analysis on new data. In case you are running the allele-specific analysis for the first time, you would need a VCF file and the name of the two strains. In this case the snp_file as well as the Nmasked_index files would be automatically created by the workflow using SNPsplit.

Differential expression

Like the other workflows, differential expression can be performed using the --sampleSheet option and supplying a sample sheet like that below:

name    condition
sample1      eworo
sample2      eworo
SRR7013047      eworo
SRR7013048      OreR
SRR7013049      OreR
SRR7013050      OreR

Note

The first entry defines which group of samples are control. This way, the order of comparison and likewise the sign of values can be changed. The DE analysis might fail if your sample names begin with a number. So watch out for that!

Complex designs with blocking factors

If the user provides additional columns between 'name' and 'condition' in the sample sheet, the variables stored there will be used as blocking factors in the order they appear in the sample sheet. Eg. if the first line of your sample sheet looks like 'name batch condition', this will translate into a formula batch + condition. 'condition' has to be the final column and it will be used for any statistical inference.

Analysis modes

Following analysis (modes) are possible using the RNA-seq workflow:

"alignment"

In this mode, the pipeline uses one of the selected aligners to create BAM files, followed by gene-level quantification using featureCounts. Gene-level differential expression analysis is then performed using DESeq2.

"allelic-mapping"

allelic-mapping mode follows a similar process as the "mapping" mode, however the alignment performed on an allele-masked genome, followed by allele-specific splitting of mapped files. Gene-level quantification is performed for each allele using featureCounts. Allele-specific, gene-level differential expression analysis is then performed using DESeq2.

Note

allelic-mapping mode is mutually exclusive with mapping mode

"alignment-free"

In this mode, the pipeline uses salmon to perform transcript-level expression quantification. This mode performs both transcript-level differential expression (using Sleuth), and gene-level differential expression (using wasabi, followed by DESeq2).

Note

The salmon index is recreated each time in alignment-free mode. This is done to facilitate changing how the GTF file is filtered, which necessitates reindexing.

"deepTools_qc"

The pipeline provides multiple quality controls through deepTools, which can be triggered using the deepTools_qc mode. It's a very useful add-on with any of the other modes.

Note

Since most deeptools functions require an aligned (BAM) file, the deepTools_qc mode will additionally perform the alignment of the fastq files. However this would not interfere with operations of the other modes.

Understanding the outputs

Assuming the pipline was run with --mode 'alignment-free,alignment,deepTools_qc' on a set of FASTQ files, the structure of the output directory would look like this (files are shown only for one sample)

├── Annotation
│   ├── filter_command.txt
│   ├── genes.annotated.bed
│   ├── genes.filtered.bed
│   ├── genes.filtered.fa
│   ├── genes.filtered.gtf
│   ├── genes.filtered.symbol
│   ├── genes.filtered.t2g
├── bamCoverage
│   ├── logs
│   ├── sample1.coverage.bw
│   ├── sample1.RPKM.bw
│   ├── sample1.uniqueMappings.fwd.bw
│   └── sample1.uniqueMappings.rev.bw
├── cluster_logs
├── deepTools_qc
│   ├── bamPEFragmentSize
│   │   ├── fragmentSize.metric.tsv
│   │   └── fragmentSizes.png
│   ├── estimateReadFiltering
│   │   └── sample1_filtering_estimation.txt
│   ├── logs
│   ├── multiBigwigSummary
│   ├── plotCorrelation
│   │   ├── correlation.pearson.bed_coverage.heatmap.png
│   │   ├── correlation.pearson.bed_coverage.tsv
│   │   ├── correlation.spearman.bed_coverage.heatmap.png
│   │   └── correlation.spearman.bed_coverage.tsv
│   ├── plotEnrichment
│   │   ├── plotEnrichment.png
│   │   └── plotEnrichment.tsv
│   └── plotPCA
│       ├── PCA.bed_coverage.png
│       └── PCA.bed_coverage.tsv
├── DESeq2_Salmon_sampleSheet
│   ├── DESeq2_Salmon.err
│   ├── DESeq2_Salmon.out
│   ├── citations.bib
│   ├── DESeq2_report_files
│   ├── DESeq2_report.html
│   ├── DESeq2_report.Rmd
│   ├── DESeq2.session_info.txt
│   ├── DEseq_basic_counts_DESeq2.normalized.tsv
│   ├── DEseq_basic_DEresults.tsv
│   └── DEseq_basic_DESeq.Rdata
├── DESeq2_sampleSheet
│   ├── DESeq2.err
│   ├── DESeq2.out
│   ├── citations.bib
│   ├── DESeq2_report_files
│   ├── DESeq2_report.html
│   ├── DESeq2_report.Rmd
│   ├── DESeq2.session_info.txt
│   ├── DEseq_basic_counts_DESeq2.normalized.tsv
│   ├── DEseq_basic_DEresults.tsv
│   └── DEseq_basic_DESeq.Rdata
├── FASTQ
│   ├── sample1_R1.fastq.gz
│   └── sample1_R2.fastq.gz
├── featureCounts
│   ├── counts.tsv
│   ├── sample1.counts.txt
│   ├── sample1.counts.txt.summary
│   ├── sample1.err
│   ├── sample1.out
├── multiQC
│   ├── multiqc_data
│   ├── multiQC.err
│   ├── multiQC.out
│   └── multiqc_report.html
├── QC_report
│   └── QC_report_all.tsv
├── RNA-seq.cluster_config.yaml
├── RNA-seq.config.yaml
├── RNA-seq_organism.yaml
├── RNA-seq_pipeline.pdf
├── RNA-seq_run-1.log
├── Salmon
│   ├── counts.genes.tsv
│   ├── counts.tsv
│   ├── Salmon_counts.log
│   ├── Salmon_genes_counts.log
│   ├── Salmon_genes_TPM.log
│   ├── SalmonIndex
│   ├── Salmon_TPM.log
│   ├── sample1
│   ├── sample1.quant.genes.sf
│   ├── sample1.quant.sf
│   ├── TPM.genes.tsv
│   └── TPM.tsv
├── sleuth_Salmon_sampleSheet
│   ├── logs
│   ├── MA-plot.pdf
│   ├── sleuth_live.R
│   ├── so.rds
│   └── Wald-test.results.tsv
└── STAR
        ├── logs
    ├── sample1
    ├── sample1.bam
    └── sample1.bam.bai

Note

The _sampleSheet suffix for the DESeq2_sampleSheet and sleuth_Salmon_sampleSheet is drawn from the name of the sample sheet you use. So if you instead named the sample sheet mySampleSheet.txt then the folders would be named DESeq2_mySampleSheet and sleuth_Salmon_mySampleSheet. This facilitates using multiple sample sheets.

Apart from the common module outputs (see Running snakePipes), the workflow would produce the following folders:

  • Annotation: This folder would contain the GTF and BED files used for analysis. In case the file has been filtered using the --filter_annotation option (see Configuration file), this would contain the filtered files.

  • STAR/HISAT2: (not produced in mode alignment-free) This would contain the output of RNA-alignment by STAR or HISAT2 (indexed BAM files).

  • featureCounts: (not produced in mode alignment-free) This would contain the gene-level counts (output of featureCounts), on the filtered GTF files, that can be used for differential expression analysis.

  • bamCoverage: (not produced in mode alignment-free) This would contain the bigWigs produced by deepTools bamCoverage . Files with suffix .coverage.bw are raw coverage files, while the files with suffix RPKM.bw are RPKM-normalized coverage files.

  • deepTools_QC: (produced in the mode deepTools_QC) This contains the quality checks specific for RNA-seq, performed via deepTools. The output folders are names after various deepTools functions and the outputs are explained under deepTools documentation. In short, they show the insert size distribution(bamPEFragmentSize), mapping statistics (estimateReadFiltering), sample-to-sample correlations and PCA (multiBigwigSummary, plotCorrelation, plotPCA), and read enrichment on various genic features (plotEnrichment)

  • DESeq2_[sampleSheet]/DESeq2_Salmon_[sampleSheet]: (produced in the modes alignment or alignment-free, only if a sample-sheet is provided.) The folder contains the HTML result report DESeq2_report.html, the annotated output file from DESeq2 (DEseq_basic_DEresults.tsv) and normalized counts for all samples, produced via DEseq2 (DEseq_basic_counts_DESeq2.normalized.tsv) as well as an Rdata file (DEseq_basic_DESeq.Rdata) with the R objects dds <- DESeq2::DESeq(dds) and ddr <- DDESeq2::results(dds,alpha = fdr). DESeq2_[sampleSheet] uses gene counts from featureCounts/counts.tsv, whereas DESeq2_Salmon_[sampleSheet] uses transcript counts from Salmon/counts.tsv that are merged via tximport in R.

  • Salmon: (produced in mode alignment-free) This folder contains transcript-level (counts.tsv)and gene-level (counts.genes.tsv) counts estimated by the tool Salmon .

  • sleuth_Salmon_[sampleSheet] (produced in mode alignment-free, only if a sample-sheet is provided) This folder contains a transcript-level differential expression output produced using the tool Sleuth .

Command line options

MPI-IE workflow for RNA mapping and analysis

usage example:

RNA-seq -i input-dir -o output-dir mm10

usage: RNA-seq -i INDIR -o OUTDIR [-h] [-v] [-c CONFIGFILE]
               [--cluster_configfile CLUSTER_CONFIGFILE] [-j INT] [--local]
               [--keepTemp] [--snakemake_options STR] [--DAG] [--version]
               [--emailAddress EMAILADDRESS] [--smtpServer SMTPSERVER]
               [--smtpPort SMTPPORT] [--onlySSL] [--emailSender EMAILSENDER]
               [--smtpUsername SMTPUSERNAME] [--smtpPassword SMTPPASSWORD]
               [--VCFfile STR] [--strains STR] [--SNPfile STR]
               [--Nmasked_index STR] [-m MODE] [--downsample INT] [--trim]
               [--trim_prg {cutadapt,trimgalore}] [--trim_options STR]
               [--fastqc] [--bcExtract] [--bcPattern BCPATTERN] [--umidedup]
               [--umidedup_sep UMIDEDUP_SEP] [--umidedup_opts UMIDEDUP_OPTS]
               [--bw-binsize INT] [--plotFormat STR] [--library_type]
               [--mapping_prg STR] [--star_options STR] [--hisat_options STR]
               [--salmon_index_options STR] [--featurecounts_options STR]
               [--filter_annotation STR] [--sampleSheet SAMPLESHEET]
               [--dnaContam] [--single-end]
               GENOME

Positional Arguments

GENOME

Genome acronym of the target organism. Either a yaml file or one of: mm10_gencodeM13, dm6, GRCz10, dm3, mm10, mm9, hs37d5, hg38, SchizoSPombe_ASM294v2

Required Arguments

-i, --input-dir

input directory containing the FASTQ files, either paired-end OR single-end data

-o, --output-dir

output directory

General Arguments

-v, --verbose

verbose output (default: 'False')

-c, --configfile

configuration file: config.yaml (default: 'None')

--cluster_configfile

configuration file for cluster usage. In absence, the default options from shared/cluster.yaml and workflows/[workflow]/cluster.yaml would be selected (default: 'None')

-j, --jobs

maximum number of concurrently submitted Slurm jobs / cores if workflow is run locally (default: '5')

--local

run workflow locally; default: jobs are submitted to Slurm queue (default: 'False')

--keepTemp

Prevent snakemake from removing files marked as being temporary (typically intermediate files that are rarely needed by end users). This is mostly useful for debugging problems.

--snakemake_options

Snakemake options to be passed directly to snakemake, e.g. use --snakemake_options='--dryrun --rerun-incomplete --unlock --forceall'. WARNING! ONLY EXPERT USERS SHOULD CHANGE THIS! THE DEFAULT VALUE WILL BE APPENDED RATHER THAN OVERWRITTEN! (default: '['--use-conda']')

--DAG

If specified, a file ending in _pipeline.pdf is produced in the output directory that shows the rules used and their relationship to each other.

--version

show program's version number and exit

Email Arguments

--emailAddress

If specified, send an email upon completion to the given email address

--smtpServer

If specified, the email server to use.

--smtpPort

The port on the SMTP server to connect to. A value of 0 specifies the default port.

--onlySSL

The SMTP server requires an SSL connection from the beginning.

--emailSender

The address of the email sender. If not specified, it will be the address indicated by --emailAddress

--smtpUsername

If your SMTP server requires authentication, this is the username to use.

--smtpPassword

If your SMTP server requires authentication, this is the password to use.

Allele-specific mapping arguments

--VCFfile

VCF file to create N-masked genomes (default: 'None')

--strains

Name or ID of SNP strains separated by comma (default: 'None')

--SNPfile

File containing SNP locations (default: 'None')

--Nmasked_index

N-masked index of the reference genome (default: 'None')

Options

-m, --mode

workflow running modes (available: 'alignment-free, alignment, allelic-mapping, deepTools_qc') (default: '"alignment-free,deepTools_qc"')

--downsample

Downsample the given number of reads randomly from of each FASTQ file (default: 'False')

--trim

Activate fastq read trimming. If activated, Illumina adaptors are trimmed by default. Additional parameters can be specified under --trim_options (default: 'False')

--trim_prg

Possible choices: cutadapt, trimgalore

Trimming program to use: Cutadapt or TrimGalore (default: '"cutadapt"')

--trim_options

Additional option string for trimming program of choice. (default: 'None')

--fastqc

Run FastQC read quality control (default: 'False')

--bcExtract

To extract umi barcode from fastq file via UMI-tools and add it to the read name (default: 'False')

--bcPattern

The pattern to be considered for the barcode. 'N' = UMI position (required) 'C' = barcode position (optional) (default: '')

--umidedup

Deduplicate bam file based on UMIs via umi_tools dedup that are present in the read name. (default: 'False')

--umidedup_sep

umi separation character that will be passed to umi_tools.(default: '"_"')

--umidedup_opts

Additional options that will be passed to umi_tools.(default: '')

--bw-binsize

Bin size of output files in bigWig format (default: '25')

--plotFormat

Possible choices: png, pdf, None

Format of the output plots from deepTools. Select 'none' for no plots (default: '"png"')

--library_type

user provided library type strand specificity. featurCounts style: 0, 1, 2 (Illumina TruSeq); default: '2')

--mapping_prg

Program used for mapping: STAR or HISAT2 (default: '"STAR"')

--star_options

STAR option string, e.g.: '--twopassMode Basic' (default: '"--outBAMsortingBinsN 100"')

--hisat_options

HISAT2 option string (default: 'None')

--salmon_index_options

Salmon index options, e.g. '--type fmd' (default: '"--type quasi -k 31"')

--featurecounts_options

featureCounts option string. The options '-p -B' are always used for paired-end data (default: '"-C -Q 10 --primary"')

--filter_annotation

filter annotation GTF by grep for use with Salmon, e.g. use --filter_annotation='-v pseudogene'; default: 'None')

--sampleSheet

Information on samples (required for DE analysis); see 'docs/content/sampleSheet.example.tsv' for example. The column names in the tsv files are 'name' and 'condition'. The first entry defines which group of samples are control. This way, the order of comparison and likewise the sign of values can be changed. The DE analysis might fail if your sample names begin with a number. So watch out for that! (default: 'None')

--dnaContam

Returns a plot which presents the proportion of the intergenic reads (default: 'False')

--single-end

input data is single-end, not paired-end. This is only used if --fromBam is specified.

code @ github.