.. _noncoding-RNA-seq: noncoding-RNA-seq ================= What it does ------------ The snakePipes noncoding-RNA-seq workflow allows users to process their single or paired-end ribosomal-depleted RNA-seq fastq files upto the point of gene/transcript/repeat-element counts and differential expression. Repeat elements are quantified and tested for differential expression at the name, family and class level. Since changes in repeat element expression tend to be unidirectional, size factors from gene expression are used when normalizing repeat element expression. Note that in addition to the normal GTF file, this pipeline requires a repeat masker output file, which can be downloaded from UCSC or other sites. The chromosome names here must match that used in the other files. .. image:: ../images/noncoding-RNA-seq_pipeline.png Input requirements ------------------ The only requirement is a directory of gzipped fastq files. Files could be single or paired end, and the read extensions could be modified using the keys in the ``defaults.yaml`` file below. .. _noncodingRNAconfig: Configuration file ~~~~~~~~~~~~~~~~~~ There is a configuration file in ``snakePipes/workflows/noncoding-RNA-seq/defaults.yaml``:: ## General/Snakemake parameters, only used/set by wrapper or in Snakemake cmdl, but not in Snakefile pipeline: noncoding-rna-seq outdir: configFile: clusterConfigFile: local: False maxJobs: 5 ## directory with fastq files indir: ## preconfigured target genomes (mm9,mm10,dm3,...) , see /path/to/snakemake_workflows/shared/organisms/ ## Value can be also path to your own genome config file! genome: ## FASTQ file extension (default: ".fastq.gz") ext: '.fastq.gz' ## paired-end read name extension (default: ["_R1", "_R2"]) reads: ["_R1","_R2"] ## assume paired end reads pairedEnd: True ## Number of reads to downsample from each FASTQ file downsample: ## Options for trimming trim: False trimmer: fastp trimmerOptions: ## further options mode: alignment,deepTools_qc sampleSheet: bwBinSize: 25 fastqc: False fragmentLength: 200 libraryType: 2 ## supported mappers: STAR HISAT2 aligner: STAR alignerOptions: "--outSAMstrandField intronMotif --outFilterMultimapNmax 1000 --outFilterMismatchNoverLmax 0.1 --outSAMattributes Standard" verbose: False plotFormat: png #### Flag to control the pipeline entry point fromBAM: False bamExt: '.bam' #umi_tools UMIBarcode: False bcPattern: NNNNCCCCCCCCC #default: 4 base umi barcode, 9 base cell barcode (eg. RELACS barcode) UMIDedup: False UMIDedupSep: "_" UMIDedupOpts: --paired Apart from the common workflow options (see :ref:`running_snakePipes`), the following parameters are useful to consider: * **aligner**: The only choice at the moment is `STAR `__. * **alignerOptions**: Options to pass on to your chosen aligner. Note that library type and junction definitions don't have to be passed to the aligners as options, as they are handeled either automatically, or via other parameters. * **plotFormat**: You can switch the type of plot produced by all deeptools modules using this option. Possible choices : png, pdf, svg, eps, plotly Differential expression ----------------------- Like the other workflows, differential expression can be performed using the ``--sampleSheet`` option and supplying a sample sheet like that below:: name condition sample1 eworo sample2 eworo SRR7013047 eworo SRR7013048 OreR SRR7013049 OreR SRR7013050 OreR .. note:: The first entry defines which group of samples are control. This way, the order of comparison and likewise the sign of values can be changed. The DE analysis might fail if your sample names begin with a number. So watch out for that! Complex designs with blocking factors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If the user provides additional columns between 'name' and 'condition' in the sample sheet, the variables stored there will be used as blocking factors in the order they appear in the sample sheet. Eg. if the first line of your sample sheet looks like 'name batch condition', this will translate into a formula ``batch + condition``. 'condition' has to be the final column and it will be used for any statistical inference. Multiple pairwise comparisons ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The user may specify multiple groups of independent comparisons by providing a 'group' column after the 'condition' column. This will cause the sample sheet to be split into the groups defined in this column, and a corresponding number of independent pairwise comparisons will be run, one for each split sheet, and placed in separate output folders named accordingly. This will be applied to DESeq2 pairwise comparison. Specifying a value of 'All' in the 'group' column will cause that sample group to be used in all pairwise comparisons, e.g. if the same set of controls should be used for several different treatment groups. An example sample sheet with the group information provided looks like this: name condition group sample1 Control All sample2 Control All sample3 Treatment Group1 sample4 Treatment Group1 sample5 Treatment Group2 sample6 Treatment Group2 Analysis modes -------------- Following analysis (**modes**) are possible using the noncoding-RNA-seq workflow: "alignment" ~~~~~~~~~~~ In this mode, the pipeline uses **STAR** to create BAM files and **TEtranscripts** to quantify genes and repeat elements. Differential expression of genes and repeat elements is then performed with **DESeq2**. "deepTools_qc" ~~~~~~~~~~~~~~ The pipeline provides multiple quality controls through deepTools, which can be triggered using the **deepTools_qc** mode. It's a very useful add-on with any of the other modes. .. note:: Since most deeptools functions require an aligned (BAM) file, the deepTools_qc mode will additionally perform the alignment of the fastq files. However this would not interfere with operations of the other modes. Understanding the outputs ------------------------- Assuming the pipline was run with ``--mode 'alignment,deepTools_qc'`` on a set of FASTQ files, the structure of the output directory would look like this (files are shown only for one sample) :: ├── bamCoverage │   ├── sample1.coverage.bw │   ├── sample1.RPKM.bw │   ├── sample1.uniqueMappings.fwd.bw │   ├── sample1.uniqueMappings.rev.bw ├── cluster_logs ├── deepTools_qc │   ├── bamPEFragmentSize │   │   ├── fragmentSize.metric.tsv │   │   └── fragmentSizes.png │   ├── estimateReadFiltering │   │   └── sample1_filtering_estimation.txt │   ├── logs │   │   ├── bamPEFragmentSize.err │   │   ├── bamPEFragmentSize.out │   │   ├── multiBigwigSummary.err │   │   └── plotCorrelation_pearson.err │   ├── multiBigwigSummary │   │   └── coverage.bed.npz │   ├── plotCorrelation │   │   ├── correlation.pearson.bed_coverage.heatmap.png │   │   ├── correlation.pearson.bed_coverage.tsv │   │   ├── correlation.spearman.bed_coverage.heatmap.png │   │   └── correlation.spearman.bed_coverage.tsv │   ├── plotEnrichment │   │   ├── plotEnrichment.png │   │   └── plotEnrichment.tsv │   └── plotPCA │   ├── PCA.bed_coverage.png │   └── PCA.bed_coverage.tsv ├── DESeq2_sampleSheet │   ├── DESeq2_report_genes.html │   ├── DESeq2_report_repeat_class.html │   ├── DESeq2_report_repeat_family.html │   ├── DESeq2_report_repeat_name.html │   ├── DESeq2.session_info.txt │   ├── genes_counts_DESeq2.normalized.tsv │   ├── genes_DEresults.tsv │   ├── genes_DESeq.Rdata │   ├── repeat_class_counts_DESeq2.normalized.tsv │   ├── repeat_class_DEresults.tsv │   ├── repeat_class_DESeq.Rdata │   ├── repeat_family_counts_DESeq2.normalized.tsv │   ├── repeat_family_DEresults.tsv │   ├── repeat_family_DESeq.Rdata │   ├── repeat_name_counts_DESeq2.normalized.tsv │   ├── repeat_name_DEresults.tsv │   └── repeat_name_DESeq.Rdata ├── FASTQ ├── filtered_bam │   ├── sample1.filtered.bam │   ├── sample1.filtered.bam.bai ├── multiQC ├── STAR └── TEcount    └── sample1.cntTable .. note:: The ``_sampleSheet`` suffix for the ``DESeq2_sampleSheet`` is drawn from the name of the sample sheet you use. So if you instead named the sample sheet ``mySampleSheet.txt`` then the folder would be named ``DESeq2_mySampleSheet``. This facilitates using multiple sample sheets. Apart from the common module outputs (see :ref:`running_snakePipes`), the workflow would produce the following folders: * **bamCoverage**: This would contain the bigWigs produced by deepTools `bamCoverage `__ . Files with suffix ``.coverage.bw`` are raw coverage files, while the files with suffix ``RPKM.bw`` are `RPKM-normalized `__ coverage files. * **deepTools_QC**: (produced in the mode *deepTools_QC*) This contains the quality checks specific for mRNA-seq, performed via deepTools. The output folders are names after various deepTools functions and the outputs are explained under `deepTools documentation `__. In short, they show the insert size distribution(**bamPEFragmentSize**), mapping statistics (**estimateReadFiltering**), sample-to-sample correlations and PCA (**multiBigwigSummary, plotCorrelation, plotPCA**), and read enrichment on various genic features (**plotEnrichment**) * **DESeq2_[sampleSheet]**: (produced only if a sample-sheet is provided.) The folder contains the HTML result reports **DESeq2_report_genes.html**, **DESeq2_report_repeat_name.html**, **DESeq2_report_repeat_class.html** and **DESeq2_report_repeat_family.html** as we as the annotated output file from DESeq2 (**genes_DEresults.tsv**, etc.) and normalized counts for all samples, produced via DEseq2 (**genes_counts_DESeq2.normalized.tsv**, etc.) as well as an Rdata file (**genes_DESeq.Rdata**, etc.) with the R objects ``dds <- DESeq2::DESeq(dds)`` and ``ddr <- DDESeq2::results(dds,alpha = fdr)``. Sample name to plotting shape mapping on the PCA plot is limited to 36 samples and skipped otherwise. * **filtered_bam**: This contains sorted and indexed BAM files that have been filtered to remove secondary alignments. This are used by deepTools and are appropriate for use in IGV. * **multiQC**: This folder contains the report produced by MultiQC and summarizes alignment metrics from STAR and possibly various deepTools outputs. * **STAR**: This would contain the output logs of RNA-alignment by STAR. The actual BAM files are removed at the end of the pipeline as they're not compatible with typical visualization programs * **TEcount**: (produced in the **alignment** mode) This contains the counts files and logs from the TEcount program in the TEtranscripts package. These are used by DESeq2 for differential expression. Command line options -------------------- .. argparse:: :func: parse_args :filename: ../snakePipes/workflows/noncoding-RNA-seq/noncoding-RNA-seq :prog: noncoding-RNA-seq :nodefault: