makePairs

What it does

The snakePipes makePairs workflow allows users to process their HiC/uC data from raw fastq files to HiC matrices (in an allele-specific manner). The workflow utilizes mapping by bwa, followed by analysis using pairtools . The workflow follows the example workflow described in the documentation of pairtools , which explains each step in detail and would be useful for new users to have a look at. Currently the output matrices are produced in the .pairs format.

../../_images/makePairs_pipeline.png

Input requirements and outputs

This pipeline requires paired-end reads fastq files as input in order to build allele-specific contact matrices. The input fastq files will be trimmed (with fastp) and be mapped against a diploid reference genome (with bwa).

Prior to building the matrix, the pipeline generates two reference genomes (from a reference genome and a VCF file) that contains the information on haplotypes. The Haplotypes are set using the --strains flag. The two reference genomes are then merged to yield one reference genome (genome/diploid_genome.fa) which is indexed with bwa as the basis for mapping of paired-end reads. (Notice that this is different from the mono-allelic HiC workflow which map reads individually in single-end mode and combines them into contact pairs afterwards.

The output of mapping step is used by pairtools` to construct different contact matrices for each sample (in pairs format)

Workflow configuration file

Default parameters from the provided config file can be altered by user. Below is the config file description for the makePairs workflow :

  pipeline: makePairs
  outdir:
  configFile:
  clusterConfigFile:
  local: False
  maxJobs: 5
  ## directory with fastq files
  indir:
  ## preconfigured target genomes (mm9,mm10,dm3,...) , see /path/to/snakemake_workflows/shared/organisms/
  ## Value can be also path to your own genome config file!
  genome:
  ## FASTQ file extension (default: ".fastq.gz")
  ext: '.fastq.gz'
  ## paired-end read name extension (default: ["_R1", "_R2"])
  reads: ["_R1","_R2"]
  ## assume paired end reads
  pairedEnd: True
  ## Number of reads to downsample from each FASTQ file
  downsample:
  ## Options for trimming
  trim: True
  trimmer: fastp
  trimmerOptions:

  verbose: False
  fastqc: True
  UMIBarcode: False
  bcPattern: "NNNNCCCCCCCCC"
  UMIDedup: False
  UMIDedupSep: "_"
  UMIDedupOpts: "_"
  plotFormat: png
  bwBinSize: 1000
  aligner: 'bwa'
  alignerOptions: '-SPu -T0'
  alignerThreads: 30

  fromBAM: False
  sampleSheet:


################################################################################
# Call snakemake directly, i.e. without using the wrapper script:
#
# Please save a copy of this config yaml file and provide an adjusted config
# via '--configFile' parameter!
# example call:
#
# snakemake --snakefile /path/to/snakemake_workflows/workflows/makePairs/Snakefile
#           --configFile /path/to/snakemake_workflows/workflows/makePairs/defaults.yaml
#           --directory /path/to/outputdir
#           --VCFfile /path/to/vcf_file
#           --strains strain1_name,strain2_name
#           --cores 32
################################################################################

Structure of output directory

In addition to the FASTQ module results (see Running snakePipes), the workflow produces the following outputs:

.
|-- bam
|-- FASTQ
|-- FastQC
|-- FastQC_trimmed
|-- FASTQ_fastp
|-- genome
|-- multiqc
|-- originalFASTQ
|-- pairs
|-- phase_stats
  • bam folder contains the mapping results in BAM format. The files were obtained after running bwa in paired-end mode.

  • originalFASTQ includes softlinks to the original FASTQ data

  • FASTQ links to originalFASTQ if no further filters are specified

  • FASTQ_fastp: trimmed FASTQ files output by fastp

  • FastQC FASTQC report on FASTQ directory

  • genome folder contains the diploid_genome.fa.gz that was constructed from 2 strain-specific genomes with rule diploid_genome. Chromosome sizes and indices (bwa) can also be found in this directory

  • multiqc folder contains the final QC report generated with MultiQC (including fastqc, fastp, and pairtools modules)

Note

For the pairtools modules to work we used MultiQC from open2c as specified for the makePiars environment

  • pairs folder contains the parsed, phased, sorted and deduplicated contact matrices generated by pairtools.

  • phase_stats contains the 4 subsetted pairs files for each sample (unphased pairs, 2 different strains, trans pairs). QC statistics are also calculated and will be processed by MultiQC

Command line options

MPI-IE workflow for creating HiC matrices with pairtools

usage example:

makePairs -i input-dir -o output-dir --VCFfile vcf --strains s1,s2 dm6

usage: makePairs -i INDIR -o OUTDIR [-h] [-v] [--ext EXT]
                 [--reads READS READS] [-c CONFIGFILE] [--keepTemp]
                 [--snakemakeOptions SNAKEMAKEOPTIONS] [--DAG] [--version]
                 [--emailAddress EMAILADDRESS] [--smtpServer SMTPSERVER]
                 [--smtpPort SMTPPORT] [--onlySSL] [--emailSender EMAILSENDER]
                 [--smtpUsername SMTPUSERNAME] [--smtpPassword SMTPPASSWORD]
                 [--VCFfile VCFFILE] [--strains STRAINS] [--SNPfile SNPFILE]
                 [--NMaskedIndex NMASKEDINDEX] [--downsample INT] [--trim]
                 [--trimmer {cutadapt,trimgalore,fastp}]
                 [--trimmerOptions TRIMMEROPTIONS] [--fastqc] [--bcExtract]
                 [--bcPattern BCPATTERN] [--UMIDedup]
                 [--UMIDedupSep UMIDEDUPSEP] [--UMIDedupOpts UMIDEDUPOPTS]
                 [--plotFormat STR] [--aligner {bwa}]
                 [--alignerOptions ALIGNEROPTIONS] [--fromBAM]
                 GENOME

Positional Arguments

GENOME

Genome acronym of the target organism. Either a yaml file or one of: hg38, mm39_ens106, GRCh38_gencode40, dm6, GRCz10, SchizoSPombe_ASM294v2, mm10_gencodeM19, GRCz11

Required Arguments

-i, --input-dir

input directory containing the FASTQ files, either paired-end OR single-end data

-o, --output-dir

output directory

General Arguments

-v, --verbose

verbose output (default: 'False')

--ext

Suffix used by input fastq files (default: ''.fastq.gz'').

--reads

Suffix used to denote reads 1 and 2 for paired-end data. This should typically be either '_1' '_2' or '_R1' '_R2' (default: '['_R1', '_R2']). Note that you should NOT separate the values by a comma (use a space) or enclose them in brackets.

-c, --configFile

configuration file: config.yaml (default: 'None')

--keepTemp

Prevent snakemake from removing files marked as being temporary (typically intermediate files that are rarely needed by end users). This is mostly useful for debugging problems.

--snakemakeOptions

Snakemake options to be passed directly to snakemake, e.g. use --snakemakeOptions='--dryrun --rerun-incomplete --unlock --forceall'. WARNING! ONLY EXPERT USERS SHOULD CHANGE THIS! THE DEFAULT VALUE WILL BE APPENDED RATHER THAN OVERWRITTEN! (default: '['--use-conda']')

--DAG

If specified, a file ending in _pipeline.pdf is produced in the output directory that shows the rules used and their relationship to each other.

--version

show program's version number and exit

Email Arguments

--emailAddress

If specified, send an email upon completion to the given email address

--smtpServer

If specified, the email server to use.

--smtpPort

The port on the SMTP server to connect to. A value of 0 specifies the default port.

--onlySSL

The SMTP server requires an SSL connection from the beginning.

--emailSender

The address of the email sender. If not specified, it will be the address indicated by --emailAddress

--smtpUsername

If your SMTP server requires authentication, this is the username to use.

--smtpPassword

If your SMTP server requires authentication, this is the password to use.

Allele-specific mapping arguments

--VCFfile

VCF file to create N-masked genomes (default: 'None'). Note that for the makePairs workflow this file is assumed to be gzipped and indexed (with tabix).

--strains

Name or ID of SNP strains separated by comma (default: 'None')

--SNPfile

File containing SNP locations (default: 'None')

--NMaskedIndex

N-masked index of the reference genome (default: 'None'). Note that this should point to a file (i.e. 'Genome' for STAR indices, genome.1.bt2 for bowtie2 indices).

Options

--downsample

Downsample the given number of reads randomly from of each FASTQ file (default: 'False')

--trim

Activate fastq read trimming. If activated, Illumina adaptors are trimmed by default. Additional parameters can be specified under --trimmerOptions. (default: 'False')

--trimmer

Possible choices: cutadapt, trimgalore, fastp

Trimming program to use: Cutadapt, TrimGalore, or fastp. Note that if you change this you may need to change --trimmerOptions to match! (default: ''fastp'')

--trimmerOptions

Additional option string for trimming program of choice. (default: 'None')

--fastqc

Run FastQC read quality control (default: 'False')

--bcExtract

To extract umi barcode from fastq file via UMI-tools and add it to the read name (default: 'False')

--bcPattern

The pattern to be considered for the barcode. 'N' = UMI position (required) 'C' = barcode position (optional) (default: ''NNNNCCCCCCCCC'')

--UMIDedup

Deduplicate bam file based on UMIs via umi_tools dedup that are present in the read name. (default: 'False')

--UMIDedupSep

umi separation character that will be passed to umi_tools.(default: ''_'')

--UMIDedupOpts

Additional options that will be passed to umi_tools.(default: '''')

--plotFormat

Possible choices: png, pdf, None

Format of the output plots from deepTools. Select 'none' for no plots (default: ''png'')

--aligner

Possible choices: bwa

Program used for mapping: BWA (default: ''bwa''). If you change this, please change --alignerOptions to match.

--alignerOptions

aligner option string, e.g.: '-SPu -T0' (default: ''-SPu -T0'')

--fromBAM

Input folder with bam files. If provided, the analysis will start from this point. If bam files contain single ends, please specify --singleEnd additionally.

code @ github.