.. _DNA-mapping:

DNA-mapping
===========

What it does
------------

This is the primary DNA-mapping pipeline. It can be used either alone or upstream of the ATAC-seq and ChIP-seq pipelines. It offers a wide array of options, including trimming and various QC steps (e.g., marking duplicates and plotting coverage and PCAs). In addition, basic coverage tracks are created to facilitate viewing the data in IGV.

.. image:: ../images/DNAmapping_pipeline.png

Input requirements
------------------

The only requirement is a directory of gzipped fastq files. Files can be single-end or paired-end, and the read-name extensions can be changed using the keys in the ``defaults.yaml`` file below.

Configuration file
~~~~~~~~~~~~~~~~~~

There is a configuration file in ``snakePipes/workflows/DNA-mapping/defaults.yaml``::

    ## General/Snakemake parameters, only used/set by wrapper or in Snakemake cmdl, but not in Snakefile
    pipeline: dna-mapping
    outdir:
    configFile:
    clusterConfigFile:
    local: False
    maxJobs: 5
    ## directory with fastq files
    indir:
    ## preconfigured target genomes (mm9,mm10,dm3,...) , see /path/to/snakemake_workflows/shared/organisms/
    ## Value can be also path to your own genome config file!
    genome:
    ## FASTQ file extension (default: ".fastq.gz")
    ext: '.fastq.gz'
    ## paired-end read name extension (default: ['_R1', "_R2"])
    reads: [_R1, _R2]
    ## mapping mode
    mode: mapping
    aligner: Bowtie2
    ## Number of reads to downsample from each FASTQ file
    downsample:
    ## Options for trimming
    trim: False
    trimmer: cutadapt
    trimmerOptions:
    ## Bin size of output files in bigWig format
    bwBinSize: 25
    ## Run FASTQC read quality control
    fastqc: false
    ## Run computeGCBias quality control
    GCBias: false
    ## Retain only de-duplicated reads/read pairs
    dedup: false
    ## Retain only reads with at least the given mapping quality
    mapq: 0
    ## Retain only reads mapping in proper pairs
    properPairs: false
    ## Mate orientation in paired-end experiments for Bowtie2 mapping
    ## (default "--fr" is appropriate for Illumina sequencing)
    mateOrientation: --fr
    ## other Bowtie2 stuff
    insertSizeMax: 1000
    alignerOpts:
    plotFormat: png
    UMIBarcode: False
    bcPattern: NNNNCCCCCCCC #default: 4 base umi barcode, 8 base cell barcode (eg. RELACS barcode)
    UMIDedup: False
    UMIDedupSep: "_"
    UMIDedupOpts:
    ## Median/mean fragment length, only relevant for single-end data (default: 200)
    fragmentLength: 200
    qualimap: false
    verbose: false

Many of these options can be set more conveniently on the command line (e.g., ``--qualimap`` sets ``qualimap: true``). However, you may need to change the ``reads:`` setting if your paired-end files are named ``sample_1.fastq.gz`` and ``sample_2.fastq.gz`` rather than ``sample_R1.fastq.gz`` and ``sample_R2.fastq.gz``.

Understanding the outputs
--------------------------

The DNA-mapping pipeline generates output with the following structure::

    .
    ├── bamCoverage
    ├── Bowtie2
    ├── deepTools_qc
    │   ├── bamPEFragmentSize
    │   ├── estimateReadFiltering
    │   ├── multiBamSummary
    │   ├── plotCorrelation
    │   ├── plotCoverage
    │   └── plotPCA
    ├── FASTQ
    ├── FastQC
    ├── filtered_bam
    ├── multiQC
    │   └── multiqc_data
    └── Sambamba

In addition to the FASTQ module results (see :ref:`running_snakePipes`), the workflow produces the following outputs:

* **Bowtie2** : Contains the BAM files after mapping with Bowtie2 and indexing with Samtools.

* **filtered_bam** : Contains the BAM files filtered by the provided criteria, such as mapping quality (``--mapq``) or PCR duplicates (``--dedup``). These files are used for most downstream analyses in the DNA-mapping and ChIP-seq/ATAC-seq pipelines.

* **bamCoverage** : Contains the coverage files (bigWig format) produced from the BAM files by deepTools bamCoverage. The files are either raw or 1x normalized (by sequencing depth). They are useful for plotting and inspecting the data in IGV.

* **deepTools_qc** : Contains various QC files and plots produced by deepTools on the filtered BAM files. These are very useful for evaluating data quality. The folders are named after the tools. Please consult the deepTools documentation on how to interpret the outputs from each tool.

* **Sambamba** : Contains the alignment metrics evaluated on the BAM files by Sambamba.

A number of other directories may also be present if you enabled read trimming, Qualimap, or a variety of other options. These are typically self-explanatory.

The pipeline generates, or can optionally generate, a fair number of useful QC plots. These include correlation and PCA plots as well as the output from MultiQC.

.. image:: ../images/DNAmapping_correlation.png

Command line options
--------------------

.. argparse::
    :func: parse_args
    :filename: ../snakePipes/workflows/DNA-mapping/DNA-mapping
    :prog: DNA-mapping
    :nodefault:
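
Settings that are awkward to pass on the command line, such as the ``reads:`` suffixes, can be collected in a small YAML file. The following is a minimal sketch rather than a recommended configuration. It assumes paired-end files named ``sample_1.fastq.gz`` and ``sample_2.fastq.gz``, uses only keys that appear in ``defaults.yaml``, and assumes the wrapper merges such a file over the defaults. The file name ``my_config.yaml`` is hypothetical, and the option used to pass the file to the wrapper should be taken from the argument list above::

    # my_config.yaml (hypothetical file name): overrides for selected defaults
    # Paired-end suffixes for files named sample_1.fastq.gz / sample_2.fastq.gz
    reads: [_1, _2]
    # Same effect as passing --qualimap on the command line
    qualimap: true

Keys left out of such a file would keep the values shown in ``defaults.yaml``, so only the settings that differ need to be listed.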