To save time, the positions of all genetic variants are loaded into the computer's memory (RAM), so make sure that enough memory is provided to the Java Virtual Machine. As an orientation, the variants from the 1000 Genomes project phase 1 and phase 2 for chr22 alone occupy 6.4 GB on disk and require ~1.5 GB of memory for running splice site scoring.
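The heap limit of a JVM is normally raised with the standard -Xmx option (for example -Xmx2g). The following is a minimal sketch, not part of the program itself, of how the limit of the running JVM could be compared with an estimated requirement; the class name HeapCheck and its method are hypothetical, and the 1.5 GB figure is only the chr22 orientation value quoted above.

```java
// Hypothetical helper: compares the JVM heap limit with an estimated need.
class HeapCheck {
    static final long BYTES_PER_GB = 1024L * 1024L * 1024L;

    /** Returns true if a heap of maxHeapBytes can hold requiredGb gigabytes. */
    static boolean hasEnoughHeap(long maxHeapBytes, double requiredGb) {
        return maxHeapBytes >= (long) (requiredGb * BYTES_PER_GB);
    }

    public static void main(String[] args) {
        // maxMemory() reports the upper heap bound, as configured by -Xmx.
        long maxHeap = Runtime.getRuntime().maxMemory();
        double requiredGb = 1.5; // assumption: chr22 estimate from the text
        if (hasEnoughHeap(maxHeap, requiredGb)) {
            System.out.println("Heap limit looks sufficient: " + maxHeap + " bytes");
        } else {
            System.out.println("Heap limit may be too small: " + maxHeap + " bytes");
        }
    }
}
```

Whole-genome variant sets will need considerably more memory than a single chromosome, so the required value should be adjusted to the data actually loaded.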
Gene Models (annotation in GTF format), REQUIRED: if missing, the program responds with an error like
Chromosome sequences (in separate FASTA files, one per chromosome), REQUIRED: if missing, the program responds with an error like
The chromosome sequences currently have to be provided as separate files, one per chromosome. All of these files have to be in the same folder (e.g., genomes/H.sapiens/hg19), with a filename prefix that corresponds to the tags in column 1 of the GTF file provided and the suffix ".fa" or ".fasta"; e.g., if chromosomes are named "chr1", "chr2", etc., then the program expects files named "chr1.fa", "chr2.fa", ...
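The naming convention above can be checked before a run. The sketch below is an illustration only and does not reproduce the program's own lookup code; the class ChromosomeFiles and its methods are hypothetical names. It assumes the chromosome tags have already been collected from column 1 of the GTF file.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: looks for <tag>.fa or <tag>.fasta in one folder.
class ChromosomeFiles {
    static final String[] SUFFIXES = {".fa", ".fasta"};

    /** Returns the sequence file for a chromosome tag, or null if none exists. */
    static Path findSequenceFile(Path genomeDir, String chromosomeTag) {
        for (String suffix : SUFFIXES) {
            Path candidate = genomeDir.resolve(chromosomeTag + suffix);
            if (Files.isRegularFile(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    /** Lists the tags for which no sequence file was found in genomeDir. */
    static List<String> missingChromosomes(Path genomeDir, Iterable<String> tags) {
        List<String> missing = new ArrayList<>();
        for (String tag : tags) {
            if (findSequenceFile(genomeDir, tag) == null) {
                missing.add(tag);
            }
        }
        return missing;
    }
}
```

Calling missingChromosomes with the folder genomes/H.sapiens/hg19 and the tags "chr1", "chr2" would return an empty list if both "chr1.fa" (or "chr1.fasta") and "chr2.fa" (or "chr2.fasta") are present.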
The first line of every
Genetic variants (as a pseudo VCF file)
Modified bases can be provided in a 5-column file format that resembles the Variant Call Format (VCF): column 1 is the chromosome ID (number or letter), column 2 is the position within the chromosome (integer), column 3 is the variant identifier, column 4 is the reference nucleotide and column 5 is the variant nucleotide.
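As an illustration of the five columns, the following sketch parses one line of such a file. It assumes tab-separated columns, as in standard VCF; the text does not state the delimiter, so this is an assumption. The class VariantRecord is a hypothetical name and not the program's own data structure.

```java
// Hypothetical container for one line of the 5-column variant file.
class VariantRecord {
    final String chromosome; // column 1: chromosome ID (number or letter)
    final int position;      // column 2: position within the chromosome
    final String id;         // column 3: variant identifier
    final String reference;  // column 4: reference nucleotide
    final String variant;    // column 5: variant nucleotide

    VariantRecord(String chromosome, int position, String id,
                  String reference, String variant) {
        this.chromosome = chromosome;
        this.position = position;
        this.id = id;
        this.reference = reference;
        this.variant = variant;
    }

    /** Parses one line; assumes tab-separated columns (not confirmed by the text). */
    static VariantRecord parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 5) {
            throw new IllegalArgumentException(
                "Expected 5 columns but found " + fields.length + ": " + line);
        }
        return new VariantRecord(fields[0], Integer.parseInt(fields[1]),
                                 fields[2], fields[3], fields[4]);
    }
}
```

For an invented example line with the fields 22, 12345, var1, A and G, parse yields a record on chromosome "22" at position 12345 with reference "A" and variant "G".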
The values in the first column of the VCF file have to correspond to the chromosome names used in (1) the gene annotation and (2) the chromosome files, with the prefix "chr" removed; e.g., variants on "chr22" are listed with "22" in the first column.
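The correspondence between the chromosome names and the first column of the variant file can be sketched as follows. This is a minimal illustration with hypothetical names (ChromosomeNames, toVariantChromosomeId), not the program's own code.

```java
// Hypothetical helper: maps an annotation chromosome name to the ID
// expected in column 1 of the variant file by dropping the "chr" prefix.
class ChromosomeNames {
    static final String PREFIX = "chr";

    static String toVariantChromosomeId(String annotationName) {
        if (annotationName.startsWith(PREFIX)) {
            return annotationName.substring(PREFIX.length());
        }
        // Names without the prefix are returned unchanged.
        return annotationName;
    }
}
```

With this mapping, "chr22" corresponds to "22" and "chrX" to "X".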