createIndices
What it does
This pipeline is special in that it creates the index files required by the various tools within snakePipes. It takes as input a fasta file (or URL) and a GTF file (or URL), plus several optional files, and generates both the indices and the organism yaml file used by snakePipes.

Input requirements
The pipeline has two required inputs: a fasta file or URL and a GTF file or URL. These may both be gzipped. Optionally, you may specify a blacklist file (such as that provided by ENCODE), an effective genome size, and a file listing chromosomes to be ignored during normalization steps.
Note
If you specify a blacklist file, please ensure that regions within it do NOT overlap. Overlapping regions in this file will cause incorrect results in some tools. Further, it is best to flank blacklisted regions by at least 50 bases, as otherwise many reads originating within these regions may nonetheless be included.
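The two recommendations in the note above (flank each region by at least 50 bases, and make sure no regions overlap) can be checked or enforced before the file is passed to the pipeline. The following is a minimal sketch, not part of snakePipes: the function name `pad_and_merge` and the in-memory tuple representation are assumptions made for illustration, and the code does not clip padded regions to chromosome ends.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end) using BED-style 0-based, half-open coordinates
Interval = Tuple[str, int, int]


def pad_and_merge(regions: Iterable[Interval], flank: int = 50) -> List[Interval]:
    """Extend each region by `flank` bases on both sides, then merge overlaps.

    Hypothetical helper for illustration only. Regions that merely touch
    after padding are merged as well, so the result never overlaps.
    """
    padded = sorted(
        (chrom, max(0, start - flank), end + flank) for chrom, start, end in regions
    )
    merged: List[Interval] = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous region: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Writing the returned tuples back out as tab-separated lines gives a BED file that satisfies both recommendations.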
Configuration file
There is a configuration file in snakePipes/workflows/createIndices/defaults.yaml:
pipeline: createIndices
outdir:
configFile:
clusterConfigFile:
local: false
maxJobs: 5
verbose: False
## Genome name used in snakePipes (no spaces!)
genome:
## Tools to create indices for. "all" for all of them
tools: all
## URLs or paths for fasta and GTF files
genomeURL:
gtfURL:
## The effective genome size
effectiveGenomeSize: 0
## Regions to blacklist in the ChIP-seq and related workflows
blacklist:
## Regions to ignore during normalization (e.g., with bamCompare)
ignoreForNorm:
These values are most conveniently set on the command line.
Hybrid genome
To create a hybrid fasta, specify the host genome with --genomeURL and the spikein genome with --spikeinGenomeURL. In addition to --gtfURL and --blacklist, you may optionally provide --spikeinGtfURL and --spikeinBlacklist. The default extension added to spikein chromosome names is '_spikein' and can be changed with --spikeinExt.
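To make the effect of the spikein extension concrete, the sketch below appends a suffix to the sequence name in each fasta header line. It only illustrates the naming convention described above; how the pipeline itself performs the renaming is not shown in this document, and the function name `rename_spikein_headers` is an assumption.

```python
from typing import Iterable, Iterator


def rename_spikein_headers(lines: Iterable[str], ext: str = "_spikein") -> Iterator[str]:
    """Yield fasta lines with `ext` appended to every sequence name.

    Hypothetical illustration: only the first whitespace-delimited token of
    a header (the sequence name) is changed; any description is kept.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].rstrip("\n").split(None, 1)
            if not fields:
                # Empty header: nothing to rename, pass it through unchanged.
                yield line
                continue
            description = " " + fields[1] if len(fields) > 1 else ""
            yield f">{fields[0]}{ext}{description}\n"
        else:
            yield line
```

With the default suffix, a spikein chromosome named 2L would appear as 2L_spikein in the hybrid genome, which keeps it distinct from any host chromosome of the same name.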
Output structure
The following structure will be created in the designated outdir:
.
├── annotation
│ ├── blacklist.bed
│ ├── genes.bed
│ ├── genes.gtf
│ └── genes.slop.gtf
├── BowtieIndex
├── BWAIndex
├── BWAmethIndex
├── createIndices.cluster_config.yaml
├── createIndices.config.yaml
├── createIndices_run-1.log
├── genome_fasta
│ ├── effectiveSize
│ ├── genome.2bit
│ ├── genome.fa
│ └── genome.fa.fai
├── HISAT2Index
└── STARIndex
These files are used internally within snakePipes and don't require further inspection. The createIndices_run-1.log file contains a full log and will include the URLs or file paths that you specified. Whether the annotation/blacklist.bed file exists depends on whether you specified one. The genome_fasta/effectiveSize file will contain the effective genome size (if you didn't specify it, the number of non-N bases in the genome will be used).
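The fallback described above, counting the non-N bases in the genome, can be sketched as follows. This is an independent illustration rather than the code the pipeline runs; the function names are assumptions, and lower-case (soft-masked) bases are counted as ordinary bases.

```python
import gzip
from typing import Iterable


def count_non_n_bases(lines: Iterable[str]) -> int:
    """Count bases other than N/n in fasta-formatted lines (headers skipped)."""
    total = 0
    for line in lines:
        if line.startswith(">"):
            continue
        seq = line.strip().upper()
        total += len(seq) - seq.count("N")
    return total


def count_non_n_in_file(path: str) -> int:
    """Same count for a fasta file on disk, plain or gzipped (by file suffix)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_non_n_bases(handle)
```

Comparing this count against the value in genome_fasta/effectiveSize is a quick way to see whether a user-supplied effective genome size was applied.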
In addition to these, an organism yaml file will be created. Its location can be found with snakePipes info.
Note
The astute observer will note that no Salmon index is created. This is intentional and done to facilitate users changing which transcripts should be included on the fly.
Command line options
Create indices for use by snakePipes. A YAML file will be created by default in the default location where snakePipes looks for organism YAML files.
- usage example:
createIndices -o output-dir --genomeURL ftp://ftp.ensembl.org/pub/release-93/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz --gtfURL ftp://ftp.ensembl.org/pub/release-93/gtf/mus_musculus/Mus_musculus.GRCm38.93.gtf.gz --blacklist blacklist.bed --ignoreForNormalization ignore.txt GRCm38_release93
usage: createIndices -o OUTDIR [-h] [-v] [-c CONFIGFILE]
[--clusterConfigFile CLUSTERCONFIGFILE] [-j INT]
[--local] [--keepTemp]
[--snakemakeOptions SNAKEMAKEOPTIONS] [--DAG] [--version]
[--emailAddress EMAILADDRESS] [--smtpServer SMTPSERVER]
[--smtpPort SMTPPORT] [--onlySSL]
[--emailSender EMAILSENDER] [--smtpUsername SMTPUSERNAME]
[--smtpPassword SMTPPASSWORD] --genomeURL GENOMEURL
[--gtfURL GTFURL] [--spikeinGenomeURL SPIKEINGENOMEURL]
[--spikeinGtfURL SPIKEINGTFURL] [--spikeinExt SPIKEINEXT]
[--tools {all,bowtie2,hisat2,bwa,bwa-mem2,bwameth,bwameth2,star,none} [{all,bowtie2,hisat2,bwa,bwa-mem2,bwameth,bwameth2,star,none} ...]]
[--effectiveGenomeSize EFFECTIVEGENOMESIZE]
[--spikeinBlacklist SPIKEINBLACKLIST]
[--blacklist BLACKLIST]
[--ignoreForNormalization IGNOREFORNORMALIZATION]
[--rmskURL RMSKURL] [--userYAML]
GENOME
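When createIndices is launched from a script, assembling the argument list programmatically helps avoid quoting mistakes. The sketch below uses only options that appear in the usage synopsis above; the function name `build_create_indices_command` and its parameter names are assumptions, and it covers only a subset of the options.

```python
import shlex
from typing import List, Optional, Sequence


def build_create_indices_command(
    genome_name: str,
    outdir: str,
    genome_url: str,
    gtf_url: Optional[str] = None,
    blacklist: Optional[str] = None,
    ignore_for_normalization: Optional[str] = None,
    tools: Sequence[str] = ("all",),
    user_yaml: bool = False,
) -> List[str]:
    """Assemble a createIndices argument list (hypothetical helper)."""
    if not genome_name or any(ch.isspace() for ch in genome_name):
        raise ValueError("GENOME must be non-empty and contain no spaces")

    # --tools accepts several values, so it is placed before another option;
    # otherwise the positional GENOME could be read as one more tool name.
    cmd = ["createIndices", "-o", outdir, "--tools", *tools, "--genomeURL", genome_url]
    if gtf_url:
        cmd += ["--gtfURL", gtf_url]
    if blacklist:
        cmd += ["--blacklist", blacklist]
    if ignore_for_normalization:
        cmd += ["--ignoreForNormalization", ignore_for_normalization]
    if user_yaml:
        cmd.append("--userYAML")
    cmd.append(genome_name)
    return cmd


if __name__ == "__main__":
    # File names here are placeholders, not files shipped with snakePipes.
    example = build_create_indices_command(
        "GRCm38_release93", "output-dir", "genome.fa.gz", gtf_url="genes.gtf.gz"
    )
    print(shlex.join(example))
```

The returned list can be passed directly to `subprocess.run`, which avoids going through a shell at all.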
Positional Arguments
- GENOME
The name to save this genome as. No spaces or special characters! Specifying an organism that already exists will cause the old information to be overwritten. See also the --userYAML option.
Required Arguments
- -o, --output-dir
output directory
- --genomeURL
URL or local path to where the genome fasta file is located. The file may optionally be gzipped.
- --gtfURL
URL or local path to where the genome annotation in GTF format is located. GFF is NOT supported. The file may optionally be gzipped. If this file is not specified, then RNA-seq related tools will NOT be usable.
General Arguments
- -v, --verbose
verbose output (default: 'False')
- -c, --configFile
configuration file: config.yaml (default: 'None')
- --clusterConfigFile
configuration file for cluster usage. If it is not given, the default options specified in defaults.yaml and workflows/[workflow]/cluster.yaml are used (default: 'None')
- -j, --jobs
maximum number of concurrently submitted Slurm jobs / cores if workflow is run locally (default: '5')
- --local
run workflow locally; by default, jobs are submitted to the Slurm queue (default: 'False')
- --keepTemp
Prevent snakemake from removing files marked as being temporary (typically intermediate files that are rarely needed by end users). This is mostly useful for debugging problems.
- --snakemakeOptions
Snakemake options to be passed directly to snakemake, e.g. use --snakemakeOptions='--dryrun --rerun-incomplete --unlock --forceall'. WARNING! ONLY EXPERT USERS SHOULD CHANGE THIS! THE DEFAULT VALUE WILL BE APPENDED RATHER THAN OVERWRITTEN! (default: '['--use-conda']')
- --DAG
If specified, a file ending in _pipeline.pdf is produced in the output directory that shows the rules used and their relationship to each other.
- --version
show program's version number and exit
Email Arguments
- --emailAddress
If specified, send an email upon completion to the given email address
- --smtpServer
If specified, the email server to use.
- --smtpPort
The port on the SMTP server to connect to. A value of 0 specifies the default port.
- --onlySSL
The SMTP server requires an SSL connection from the beginning.
- --emailSender
The address of the email sender. If not specified, it will be the address indicated by --emailAddress
- --smtpUsername
If your SMTP server requires authentication, this is the username to use.
- --smtpPassword
If your SMTP server requires authentication, this is the password to use.
Options
- --spikeinGenomeURL
URL or local path to where the spikein genome fasta file is located. The file may optionally be gzipped.
- --spikeinGtfURL
URL or local path to where the spikein genome annotation in GTF format is located. GFF is NOT supported. The file may optionally be gzipped.
- --spikeinExt
Extension of spikein chromosome names in the hybrid genome. (default: 'None')
- --tools
Possible choices: all, bowtie2, hisat2, bwa, bwa-mem2, bwameth, bwameth2, star, none
Only produce indices for the listed tools. The default is 'all', which creates all indices. 'none' will create everything except aligner indices.
- --effectiveGenomeSize
The effective genome size. If you don't specify a value then the number of non-N bases will be used.
- --spikeinBlacklist
An optional URL or local path to a file to use to blacklist spikein organism regions (such as that provided by the ENCODE consortium).
- --blacklist
An optional URL or local path to a file to use to blacklist regions (such as that provided by the ENCODE consortium).
- --ignoreForNormalization
An optional file listing, one entry per line, the chromosomes to ignore during normalization. These are typically sex chromosomes, mitochondrial DNA, and unplaced contigs.
- --rmskURL
URL or local path to where the repeat masker output file is located. This is only required if you plan to run the non-coding RNA-seq workflow.
- --userYAML
By default, this workflow creates an organism YAML file where snakePipes will look for it by default. If this isn't desired (e.g., you don't want the organism to be selectable by default or you don't have write permissions to the snakePipes installation) you can specify this option and the YAML file will instead be created in the location specified by the -o option.
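The file expected by --ignoreForNormalization is plain text with one chromosome name per line. A small sketch of producing it is shown below; it checks the requested names against a fasta index (.fai, as written by samtools faidx, where the first tab-separated column is the sequence name) so that typos are caught early. The function name `ignore_list_text` is an assumption, and the choice of which chromosomes to ignore remains with the user.

```python
from typing import Iterable, List


def ignore_list_text(fai_lines: Iterable[str], to_ignore: List[str]) -> str:
    """Return the contents of an ignore file, one chromosome name per line.

    Hypothetical helper: raises ValueError if a requested name does not
    occur in the first column of the fasta index.
    """
    known = {line.split("\t")[0].strip() for line in fai_lines if line.strip()}
    missing = [name for name in to_ignore if name not in known]
    if missing:
        raise ValueError(f"Not present in the fasta index: {', '.join(missing)}")
    return "".join(f"{name}\n" for name in to_ignore)
```

The returned string can be written to a file such as ignore.txt and passed with --ignoreForNormalization.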