createIndices

What it does

This pipeline is special in that it creates the index files required by various tools within snakePipes. It takes as input a fasta file (or URL) and a GTF file (or URL), along with various optional files, and generates both the indices and the organism yaml file used by snakePipes.

Figure: overview of the createIndices pipeline (createIndices_pipeline.png)

Input requirements

The pipeline has two required inputs: a fasta file or URL and a GTF file or URL. These may both be gzipped. Optionally, you may specify a blacklist file (such as that provided by ENCODE), an effective genome size, and a file listing chromosomes to be ignored during normalization steps.

Note

If you specify a blacklist file, please ensure that regions within it do NOT overlap. Overlapping regions in this file will cause incorrect results in some tools. Further, it is best to flank blacklisted regions by at least 50 bases, as otherwise many reads originating within these regions may nonetheless be included.
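
As an illustration of the note above, the following minimal Python sketch pads each blacklisted region by 50 bases and then merges any regions that overlap, so the resulting list is non-overlapping. The function name flank_and_merge and the example coordinates are hypothetical; this is not part of snakePipes, and it does not clamp region ends to chromosome lengths.

```python
def flank_and_merge(regions, flank=50):
    """Pad each (chrom, start, end) region by `flank` bases, then merge overlaps."""
    # Pad both sides, clamping the start at 0 (BED starts are 0-based).
    # Ends are NOT clamped to chromosome length in this sketch.
    padded = sorted((chrom, max(0, start - flank), end + flank)
                    for chrom, start, end in regions)
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous region: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(r) for r in merged]


# Made-up example regions; the first two overlap on chr1.
regions = [("chr1", 100, 200), ("chr1", 150, 400), ("chr2", 10, 20)]
for chrom, start, end in flank_and_merge(regions):
    print(chrom, start, end, sep="\t")
```

Writing the returned tuples out as tab-separated lines gives a BED file that could then be passed via --blacklist.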

Configuration file

There is a configuration file in snakePipes/workflows/createIndices/defaults.yaml:

pipeline: createIndices
outdir:
configFile:
clusterConfigFile:
local: false
maxJobs: 5
verbose: False
## Genome name used in snakePipes (no spaces!)
genome:
## Tools to create indices for. "all" for all of them
tools: all
## URLs or paths for fasta and GTF files
genomeURL:
gtfURL:
## The effective genome size
effectiveGenomeSize: 0
## Regions to blacklist in the ChIP-seq and related workflows
blacklist:
## Regions to ignore during normalization (e.g., with bamCompare)
ignoreForNorm:

These values are most conveniently set on the command line.

Hybrid genome

To create a hybrid fasta, specify the host genome with --genomeURL and the spikein genome with --spikeinGenomeURL. In addition to --gtfURL and --blacklist, you may optionally provide --spikeinGtfURL and --spikeinBlacklist. The default extension added to spikein chromosome names is '_spikein'; it can be changed with --spikeinExt.
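
To make the effect of the extension concrete, here is a minimal Python sketch of what a hybrid fasta looks like conceptually: spikein sequence names get the extension appended and are then placed after the host sequences. The function rename_spikein_headers and the example sequences are hypothetical, and the way snakePipes actually builds the hybrid genome may differ.

```python
def rename_spikein_headers(fasta_lines, ext="_spikein"):
    """Append `ext` to the sequence name on every FASTA header line."""
    for line in fasta_lines:
        if line.startswith(">"):
            # Only the name (text before the first space) is changed;
            # any description after it is kept as-is.
            name, _, rest = line[1:].rstrip("\n").partition(" ")
            yield ">" + name + ext + (" " + rest if rest else "") + "\n"
        else:
            yield line


# Made-up toy sequences standing in for host and spikein genomes.
host = [">chr1\n", "ACGT\n"]
spikein = [">chr2L some description\n", "TTGA\n"]
hybrid = host + list(rename_spikein_headers(spikein))
print("".join(hybrid), end="")
```

The renaming keeps host and spikein chromosomes distinguishable even when both genomes use the same names (e.g., both have a chromosome called "chr1").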

Output structure

The following structure will be created in the designated outdir:

.
├── annotation
│   ├── blacklist.bed
│   ├── genes.bed
│   ├── genes.gtf
│   └── genes.slop.gtf
├── BowtieIndex
├── BWAIndex
├── BWAmethIndex
├── createIndices.cluster_config.yaml
├── createIndices.config.yaml
├── createIndices_run-1.log
├── genome_fasta
│   ├── effectiveSize
│   ├── genome.2bit
│   ├── genome.fa
│   └── genome.fa.fai
├── HISAT2Index
└── STARIndex

These files are used internally within snakePipes and don't require further inspection. The createIndices_run-1.log file contains a full log and will include the URLs or file paths that you specified. The annotation/blacklist.bed file exists only if you specified a blacklist. The genome_fasta/effectiveSize file will contain the effective genome size (if you didn't specify it, the number of non-N bases in the genome will be used).
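
The fallback for the effective genome size (the number of non-N bases) can be sketched in a few lines of Python. This is only an illustration of the definition, with a hypothetical function name and a made-up toy genome; it is not the code the pipeline runs.

```python
def count_non_n_bases(fasta_lines):
    """Count bases other than N/n across all sequences in a FASTA stream."""
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        seq = line.strip()
        # Soft-masked (lowercase) bases count; only N/n are excluded.
        total += len(seq) - seq.count("N") - seq.count("n")
    return total


# Made-up toy genome with two sequences.
example = [">chr1\n", "ACGTNNNN\n", "acgtnnAC\n", ">chr2\n", "NNNNGGCC\n"]
print(count_non_n_bases(example))
```

For a real genome the same function can be fed an open file handle, for instance one obtained with gzip.open(path, "rt") for a gzipped fasta.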

In addition to these, an organism yaml file will be created. Its location can be found with snakePipes info.

Note

The astute observer will note that no Salmon index is created. This is intentional and done to facilitate users changing which transcripts should be included on the fly.

Command line options

Create indices for use by snakePipes. A YAML file will be created by default in the default location where snakePipes looks for organism YAML files.

usage example:
createIndices -o output-dir --genomeURL ftp://ftp.ensembl.org/pub/release-93/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz --gtfURL ftp://ftp.ensembl.org/pub/release-93/gtf/mus_musculus/Mus_musculus.GRCm38.93.gtf.gz --blacklist blacklist.bed --ignoreForNormalization ignore.txt GRCm38_release93

usage: createIndices -o OUTDIR [-h] [-v] [-c CONFIGFILE]
                     [--clusterConfigFile CLUSTERCONFIGFILE] [-j INT]
                     [--local] [--keepTemp]
                     [--snakemakeOptions SNAKEMAKEOPTIONS] [--DAG] [--version]
                     [--emailAddress EMAILADDRESS] [--smtpServer SMTPSERVER]
                     [--smtpPort SMTPPORT] [--onlySSL]
                     [--emailSender EMAILSENDER] [--smtpUsername SMTPUSERNAME]
                     [--smtpPassword SMTPPASSWORD] --genomeURL GENOMEURL
                     [--gtfURL GTFURL] [--spikeinGenomeURL SPIKEINGENOMEURL]
                     [--spikeinGtfURL SPIKEINGTFURL] [--spikeinExt SPIKEINEXT]
                     [--tools {all,bowtie2,hisat2,bwa,bwameth,star,none} [{all,bowtie2,hisat2,bwa,bwameth,star,none} ...]]
                     [--effectiveGenomeSize EFFECTIVEGENOMESIZE]
                     [--spikeinBlacklist SPIKEINBLACKLIST]
                     [--blacklist BLACKLIST]
                     [--ignoreForNormalization IGNOREFORNORMALIZATION]
                     [--rmskURL RMSKURL] [--userYAML]
                     GENOME

Positional Arguments

GENOME The name to save this genome as. No spaces or special characters! Specifying an organism that already exists will cause the old information to be overwritten. See also the --userYAML option.

Required Arguments

-o, --output-dir
 output directory
--genomeURL URL or local path to where the genome fasta file is located. The file may optionally be gzipped.
--gtfURL URL or local path to where the genome annotation in GTF format is located. GFF is NOT supported. The file may optionally be gzipped. If this file is not specified, then RNA-seq related tools will NOT be usable.

General Arguments

-v, --verbose verbose output (default: 'False')
-c, --configFile
 configuration file: config.yaml (default: 'None')
--clusterConfigFile
 configuration file for cluster usage. If absent, the default options specified in defaults.yaml and workflows/[workflow]/cluster.yaml are used (default: 'None')
-j, --jobs maximum number of concurrently submitted Slurm jobs / cores if workflow is run locally (default: '5')
--local run workflow locally; default: jobs are submitted to Slurm queue (default: 'False')
--keepTemp Prevent snakemake from removing files marked as being temporary (typically intermediate files that are rarely needed by end users). This is mostly useful for debugging problems.
--snakemakeOptions
 Snakemake options to be passed directly to snakemake, e.g. use --snakemakeOptions='--dryrun --rerun-incomplete --unlock --forceall'. WARNING! ONLY EXPERT USERS SHOULD CHANGE THIS! THE DEFAULT VALUE WILL BE APPENDED RATHER THAN OVERWRITTEN! (default: '['--use-conda']')
--DAG If specified, a file ending in _pipeline.pdf is produced in the output directory that shows the rules used and their relationship to each other.
--version show program's version number and exit

Email Arguments

--emailAddress If specified, send an email upon completion to the given email address
--smtpServer If specified, the email server to use.
--smtpPort The port on the SMTP server to connect to. A value of 0 specifies the default port.
--onlySSL The SMTP server requires an SSL connection from the beginning.
--emailSender The address of the email sender. If not specified, it will be the address indicated by --emailAddress
--smtpUsername If your SMTP server requires authentication, this is the username to use.
--smtpPassword If your SMTP server requires authentication, this is the password to use.

Options

--spikeinGenomeURL
 URL or local path to where the spikein genome fasta file is located. The file may optionally be gzipped.
--spikeinGtfURL
 URL or local path to where the spikein genome annotation in GTF format is located. GFF is NOT supported. The file may optionally be gzipped.
--spikeinExt Extension of spikein chromosome names in the hybrid genome. (default: 'None')
--tools

Possible choices: all, bowtie2, hisat2, bwa, bwameth, star, none

Only produce indices for the listed tools. The default, 'all', creates every index; 'none' creates everything except aligner indices.

--effectiveGenomeSize
 The effective genome size. If you don't specify a value then the number of non-N bases will be used.
--spikeinBlacklist
 An optional URL or local path to a file to use to blacklist spikein organism regions (such as that provided by the ENCODE consortium).
--blacklist An optional URL or local path to a file to use to blacklist regions (such as that provided by the ENCODE consortium).
--ignoreForNormalization
 An optional file listing, one entry per line, the chromosomes to ignore during normalization. These are typically sex chromosomes, mitochondrial DNA, and unplaced contigs.
--rmskURL URL or local path to where the repeat masker output file is located. This is only required if you plan to run the non-coding RNA-seq workflow.
--userYAML By default, this workflow creates an organism YAML file where snakePipes will look for it by default. If this isn't desired (e.g., you don't want the organism to be selectable by default or you don't have write permissions to the snakePipes installation) you can specify this option and the YAML file will instead be created in the location specified by the -o option.
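
As an illustration of the file expected by --ignoreForNormalization, the Python sketch below selects sex chromosomes, mitochondrial DNA, and unplaced contigs from a list of chromosome names and prints them one per line. The function name and the naming patterns are assumptions (loosely following Ensembl- and UCSC-style names) and will likely need adjusting for your assembly.

```python
import re


def chromosomes_to_ignore(names):
    """Pick sex chromosomes, mitochondrial DNA, and unplaced contigs by name."""
    # Assumed naming patterns; adjust them for your assembly.
    pattern = re.compile(
        r"^(chr)?(X|Y|M|MT)$"       # sex chromosomes and mitochondrial DNA
        r"|^(GL|KI|JH)\d+\.\d+$"    # accession-style contig names
        r"|_random$|^chrUn_"        # UCSC-style unplaced/unlocalized contigs
    )
    return [n for n in names if pattern.search(n)]


# Made-up example names; in practice these could be taken from the first
# column of genome_fasta/genome.fa.fai.
names = ["1", "2", "X", "Y", "MT", "GL456210.1",
         "chrUn_GL456239", "chr1_GL456210_random"]
ignore = chromosomes_to_ignore(names)
print("\n".join(ignore))
```

Saving the printed lines to a text file (ignore.txt in the usage example) gives one chromosome name per line.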