ChIP-seq

What it does

The ChIP-seq pipeline takes one or more BAM files and attempts to find peaks. If multiple samples and a sample sheet are provided, then CSAW is additionally used to call differential peaks. Both sharp and broad peak calling are supported.

../../_images/ChIPseq_pipeline.png

In addition to peaks, bigWig tracks are also generated.

Input requirements

The DNA mapping pipeline generates output that is fully compatible with the ChIP-seq pipeline input requirements! When running the ChIP-seq pipeline, please specify the output directory of DNA-mapping pipeline as the working directory (-w).

If you need to provides file NOT generated by the DNA-mapping pipeline, then you must provide a directory with the following structure:

.
├── deepTools_qc
│   └── bamPEFragmentSize
│       ├── fragmentSize.metric.tsv
│       └── fragmentSizes.png
├── filtered_bam
│   ├── sample1.filtered.bam
│   ├── sample1.filtered.bam.bai
│   ├── sample2.filtered.bam
│   └── sample2.filtered.bam.bai
├── Sambamba
│   ├── flagstat_report_all.tsv
│   ├── sample1.markdup.txt
│   └── sample2.markdup.txt
└── sampleSheet.yaml
  • deepTools_qc contains the output of bamPEFragmentSize from deepTools, run on all the BAM files.

  • Sambamba directory contains the output of flagstat command from sambamba (the .markdup.txt files) and a single file summarizing that with columns sample (sample name, such as sampl1), total (total reads), dup (number of duplicate reads), and mapped (number of mapped reads).

  • filtered_bam directory contains the input BAM files (either filtered or unfiltered, however you prefer).

  • sampleSheet.tsv (OPTIONAL) is only needed to test for differential binding.

Sample configuration

The ChIP-seq sample configuration yaml file describes what type of peak calling to perform on each sample and which sample to use as the input control

chip_dict:
  SRR6761497:
    control: SRR6761502
    broad: True
  SRR6761498:
    control: SRR6761502
    broad: True
  SRR6761495:
    control: SRR6761502
    broad: False
  SRR6761499:
    control: SRR6761502
    broad: False

As you can see above, the same control can be used for multiple samples.

Note

Set the flag broad to True for broad marks, such as H3K27me and H3K9me3

Spikein Normalization

If chromatin from an external organism was spikein in, it is possible to obtain spikein-derived scaling factors for the ChIP (and input) samples with the flag --useSpikeInForNorm. This requires providing a hybrid bam file, with reads aligned to a hybrid genome of host and spikein chromosomes. Spikein chromosome extention can be specified with --spikeinExt. Scale factors can be obtained either from whole spikein genome in the ChIP samples, from windows centered on TSS in the spikein genome in the ChIP samples, or from whole spikein genome in the input samples . The default scale factors from whole spikein genome in the ChIP samples can be changed to something else with --getSizeFactorsFrom.

DESeq2-style scaling factors produced with deepTools multiBamSummary will then be used to create bam coverage tracks and passed to CSAW as size Factors if sample sheet is provided.

A hybrid genome can be obtained with createIndices workflow and can be passed to the DNA-mapping workflow without any particular arguments.

Differential Binding analysis

If you wish to perform differential binding analysis between two group of samples, for example wild-type vs Knock-outs, via snakePipes. You would require a sample-sheet and the --sampleSheet option. Sample sheet may contain only a subset of samples used in the previous steps e.g. for peak calling. In addition, input samples are filtered out prior to the analysis using the sample configuration yaml (see above).

The sample sheet is a tab-separated file with two columns, named name and condition. An example is below:

name    condition
sample1 wild-type
sample2 wild-type
SRR7013047      wild-type
SRR7013048      mutant
SRR7013049      mutant
SRR7013050      mutant

For comparison between two conditions, the name you assign to "condition" is not relevant, but rather the order is. The group mentioned first (in the above case "wild-type") would be used as a "control" and the group mentioned later would be used as "test".

The differential binding module utilizes the R package CSAW to detect significantly different peaks between two conditions. The analysis is performed on a "union" of peaks from all samples mentioned in the sample sheet. This merged set of regions are provided as an output inside the CSAW folder as the file 'DiffBinding_allregions.bed'. All differentially bound regions are available in 'CSAW/DiffBinding_significant.bed'. Two thresholds are applied to produce Filtered.results.bed : FDR (default 0.05 ) as well as absolute log fold change (1). These can be specified either in the defaults.yaml dictionary or via commandline parameters '--FDR' and '--LFC'. Additionally, filtered results are split into up to 3 bed files, representing direction change (UP, DOWN, or MIXED).

If the user provides additional columns between 'name' and 'condition' in the sample sheet, the variables stored there will be used as blocking factors in the order they appear in the sample sheet. Condition will be the final column and it will be used for any statistical inference.

Note

In order to include or exclude peaks from selected samples in the union of peaks used in the differential binding analysis, the user may provide an additional column named 'UseRegions' and set it to True or False, accordingly. This column must supersede the 'condition' column in the column order.

Merged regions from filtered results with any direction change are further used to produce deepTools heatmaps, using log2 ratio of chip signal to input or depth-normalized coverage. For this purpose, the regions are rescaled to 1kb, and extended by 0.2kb on each side.

An html report summarizing the differential binding analysis is produced in the same folder.

Filtered results are also annotated with the distance to the closest gene using bedtools closest and written as '.txt' files to the AnnotatedResults_* folder.

Configuration file

There is a configuration file in snakePipes/workflows/ChIP-seq/defaults.yaml:

pipeline: chip-seq
configFile:
clusterConfigFile:
local: false
maxJobs: 5
## workingdir need to be required DNA-mapping output dir, 'outdir' is set to workingdir internally
workingdir:
## preconfigured target genomes (mm9,mm10,dm3,...) , see /path/to/snakemake_workflows/shared/organisms/
## Value can be also path to your own genome config file!
genome:
## Which peak caller should be used?
peakCaller: 'MACS2'
## paired end data?
pairedEnd: true
## Bin size of output files in bigWig format
bwBinSize: 25
## Median/mean fragment length, only relevant for single-end data (default: 200)
fragmentLength: 200
verbose: false
# sampleInfo_DB
sample_info:
# windowSize
windowSize: 150
plot_format: png
##dummy string to skip filtering annotation
filter_annotation:
##parameters to filter DB regions on
fdr: 0.05
absBestLFC: 1

The only parameters that are useful to change are bwBinSize, fragmentLength, and windowSize. Note however that those can be more conveniently changed on the command line.

Understanding the outputs

The ChIP-seq pipeline will generate additional output as follows:

.
├── deepTools_ChIP
│   ├── bamCompare
│   │   ├── sample1.filtered.log2ratio.over_SRR6761502.bw
│   │   ├── sample1.filtered.subtract.SRR6761502.bw
│   │   ├── sample2.filtered.log2ratio.over_SRR6761502.bw
│   │   └── sample2.filtered.subtract.SRR6761502.bw
│   └── plotFingerprint
│       ├── plotFingerprint.metrics.txt
│       └── plotFingerprint.png
├── histoneHMM
│   ├── sample2.filtered.histoneHMM-em-posterior.txt.gz
│   ├── sample2.filtered.histoneHMM-regions.gff.gz
│   ├── sample2.filtered.histoneHMM-regions.gff.gz.tbi
│   ├── sample2.filtered.histoneHMM.txt.gz
│   ├── sample2.filtered.histoneHMM-zinba-emfit.pdf
│   ├── sample2.filtered.histoneHMM-zinba-params-em.RData
│   └── sample2.filtered.histoneHMM-zinba-params-em.txt
├── Genrich
│   └── sample2.narrowPeak
└── MACS2
    ├── sample1.filtered.BAM_peaks.narrowPeak
    ├── sample1.filtered.BAM_peaks.qc.txt
    ├── sample1.filtered.BAM_peaks.xls
    ├── sample1.filtered.BAMPE_peaks.narrowPeak
    ├── sample1.filtered.BAMPE_peaks.xls
    ├── sample1.filtered.BAMPE_summits.bed
    ├── sample1.filtered.BAM_summits.bed
    ├── sample2.filtered.BAM_peaks.broadPeak
    ├── sample2.filtered.BAM_peaks.gappedPeak
    ├── sample2.filtered.BAM_peaks.qc.txt
    ├── sample2.filtered.BAM_peaks.xls
    ├── sample2.filtered.BAMPE_peaks.broadPeak
    ├── sample2.filtered.BAMPE_peaks.gappedPeak
    └── sample2.filtered.BAMPE_peaks.xls

Following up on the DNA-mapping module results (see DNA-mapping), the workflow produces the following output directories :

  • deepTools_ChIP: Contains output from two of the deepTools modules. The bamCompare output contains the input-normalized coverage files for the samples, which is very useful for downstream analysis, such as visualization in IGV and plotting the heatmaps. The plotFingerPrint output is a useful QC plot to assess signal enrichment in the ChIP samples.

  • Genrich: This folder contains the output of Genrich. This will only exist IF you specified --peakCaller Genrich and you have samples with non-broad peaks. The output is in narrowPeak format, like that from MACS2.

  • MACS2: This folder contains the output of MACS2 on the ChIP samples, MACS2 would perform either a narrow or broad peak calling on the samples, as indicated by the ChIP sample configuration file (see Configuration file). The outputs files would contain the respective tags (narrowPeak or broadPeak). This folder will only exist if you have non-broad marks and use MACS2 for peak calling

  • histoneHMM: This folder contains the output of histoneHMM. This folder will only exist if you have broad marks.

  • CSAW_sampleSheet: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see Differential Binding analysis) CSAW will be run using peaks called by the chosen peak caller, and the output folder will be named accordingly.

  • AnnotatedResults_sampleSheet: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see Differential Binding analysis). Differentially bound regions annotated with distance to nearest gene are stored here.

Note

Although in case of broad marks, we also perform the MACS2 broadpeak analysis (output available as MACS2/<sample>.filtered.BAM_peaks.broadPeak), we would recommend using the histoneHMM outputs in these cases, since histoneHMM produces better results than MACS2 for broad peaks.

Note

For narrow marks, the user may choose the peak caller from MACS2 (default), Genrich or SEACR. By deafult, SEACR is run in the stringent mode, applying normalization to counts over bed files. If invoked together with --useSpikeInForNorm, SEACR will be run in stringent mode, using spikein-normalized counts. FDR can be set by the user (default 0.05).

Note

The _sampleSheet suffix for the CSAW_sampleSheet is drawn from the name of the sample sheet you use. So if you instead named the sample sheet mySampleSheet.txt then the folder would be named CSAW_mySampleSheet. This facilitates using multiple sample sheets.

Note

If you provide a sampleSheet with name, condition and group columns, "multiple comparison mode" will be detected. The original sampleSheet will be split on the group column, and multiple pairwise comparisons will be run with CSAW, one per group.

Note

If provided with sampleSheet, Genrich will be used to jointly call peaks within a condition group. It will utilize the input controls if they exist.

Command line options

MPI-IE workflow for ChIP-seq analysis

Usage example:

ChIP-seq -d working-dir mm10 samples.yaml

usage: ChIP-seq -d WORKINGDIR [-h] [-v] [-c CONFIGFILE]
                [--clusterConfigFile CLUSTERCONFIGFILE] [-j INT] [--local]
                [--keepTemp] [--snakemakeOptions SNAKEMAKEOPTIONS] [--DAG]
                [--version] [--emailAddress EMAILADDRESS]
                [--smtpServer SMTPSERVER] [--smtpPort SMTPPORT] [--onlySSL]
                [--emailSender EMAILSENDER] [--smtpUsername SMTPUSERNAME]
                [--smtpPassword SMTPPASSWORD]
                [--peakCaller {MACS2,Genrich,SEACR}]
                [--peakCallerOptions PEAKCALLEROPTIONS] [--cutntag]
                [--singleEnd] [--useSpikeInForNorm]
                [--getSizeFactorsFrom {genome,TSS,input}]
                [--spikeinExt SPIKEINEXT] [--bigWigType BIGWIGTYPE]
                [--fragmentLength INT] [--bwBinSize INT]
                [--sampleSheet SAMPLESHEET] [--externalBed EXTERNALBED]
                [--windowSize WINDOWSIZE]
                [--predictChIPDict [PREDICTCHIPDICT]] [--fromBAM FROMBAM]
                [--bamExt BAMEXT] [--plotFormat {png,pdf,None}] [--FDR FDR]
                [--LFC ABSBESTLFC]
                GENOME [SAMPLESCONFIG]

Positional Arguments

GENOME

Genome acronym of the target organism. Either a yaml file or one of: hg38, mm39_ens106, GRCh38_gencode40, dm6, GRCz10, SchizoSPombe_ASM294v2, mm10_gencodeM19, GRCz11

SAMPLESCONFIG

configuration file (eg. 'example.chip_samples.yaml') with sample annotation

Required Arguments

-d, --working-dir

working directory is output directory and must contain DNA-mapping pipeline output files

General Arguments

-v, --verbose

verbose output (default: 'False')

-c, --configFile

configuration file: config.yaml (default: 'None')

--clusterConfigFile

configuration file for cluster usage. In absence, the default options specified in defaults.yaml and workflows/[workflow]/cluster.yaml would be selected (default: 'None')

-j, --jobs

maximum number of concurrently submitted Slurm jobs / cores if workflow is run locally (default: '5')

--local

run workflow locally; default: jobs are submitted to Slurm queue (default: 'False')

--keepTemp

Prevent snakemake from removing files marked as being temporary (typically intermediate files that are rarely needed by end users). This is mostly useful for debugging problems.

--snakemakeOptions

Snakemake options to be passed directly to snakemake, e.g. use --snakemakeOptions='--dryrun --rerun-incomplete --unlock --forceall'. WARNING! ONLY EXPERT USERS SHOULD CHANGE THIS! THE DEFAULT VALUE WILL BE APPENDED RATHER THAN OVERWRITTEN! (default: '['--use-conda']')

--DAG

If specified, a file ending in _pipeline.pdf is produced in the output directory that shows the rules used and their relationship to each other.

--version

show program's version number and exit

Email Arguments

--emailAddress

If specified, send an email upon completion to the given email address

--smtpServer

If specified, the email server to use.

--smtpPort

The port on the SMTP server to connect to. A value of 0 specifies the default port.

--onlySSL

The SMTP server requires an SSL connection from the beginning.

--emailSender

The address of the email sender. If not specified, it will be the address indicated by --emailAddress

--smtpUsername

If your SMTP server requires authentication, this is the username to use.

--smtpPassword

If your SMTP server requires authentication, this is the password to use.

Options

--peakCaller

Possible choices: MACS2, Genrich, SEACR

The peak caller to use. The default is "MACS2" and this is only applicable for sharper peaks (broad peaks will always use histoneHMM).

--peakCallerOptions

Custom options to set for the peak caller. Default is '"--qvalue 0.001"'.

--cutntag

if set, MACS2 peakCaller is used with the parameters have been used in the method section of Meers et al. 2019, Kaya-Okur et al. 2019 and 2020. Setting this flag overwrites the '--peakCallerOptions'. Default is 'False'.

--singleEnd

Input data is single-end, not paired-end

--useSpikeInForNorm

Use the spikeIn chromosomes of the hybrid genome for normalization.

--getSizeFactorsFrom

Possible choices: genome, TSS, input

Which part of the spikein genome to use to calculate sizeFactors from.

--spikeinExt

Extention of spikein chromosome names in the hybrid genome. Ignored if useSpikeInForNorm is False (default: '"_spikein"') .

--bigWigType

Type of bigWig file to create. Options are: 'subtract' (control-subtracted ChIP coverage), 'log2ratio' (for log2 ratio of ChIP over control) or 'both' (create both set of bed files). Note that the allele-specific mode currently only produces 'log2ratio' bigwigs. (default: '"both"')

--fragmentLength

Fragment length in sequencing. Used only if --singleEnd.(default: '200')

--bwBinSize

bin size of output files in bigWig format (default: '25')

--sampleSheet

Information on samples (If differential binding analysis required); see 'https://github.com/maxplanck-ie/snakepipes/tree/master/docs/content/sampleSheet.example.tsv' for example. IMPORTANT: The first entry defines which group of samples are control. By this, the order of comparison and likewise the sign of values can be changed! Also, the condition control should only be used for input samples (control peaks are not evaluated for differential binding) (default: '')

--externalBed

A bed file with intervals to be tested for differential binding. (default: '')

--windowSize

Window size to counts reads in (If differential binding analysis required); Default size is suitable for most transcription factors and sharp histone marks. Small window sizes (~20bp) should be used for very narrow transcription factor peaks, while large window sizes (~500 bp) should be used for broad marks (eg. H3K27me3) (default: '150')

--predictChIPDict

Use existing bam files to predict a CHiP-seq sample configuration file. Write it to the workingdir. If no value is given, samples that contain 'input' are used as ChIP input/ctrl. Provide a custom pattern like 'input,H3$,H4$' to change that!

--fromBAM

Input folder with bam files. If provided, the analysis will start from this point. If bam files contain single ends, please specify --singleEnd additionally. (default: 'False')

--bamExt

Extention of provided bam files, will be substracted from basenames to obtain sample names. (default: '".filtered.bam"')

--plotFormat

Possible choices: png, pdf, None

Format of the output plots from deepTools. Select 'none' for no plots (default: '"png"')

--FDR

FDR threshold to apply for filtering DB regions(default: '0.05')

--LFC

Log fold change threshold to apply for filtering DB regions(default: '1')

code @ github.