Description

 

Requirements

Hardware

For time efficiency, the positions of all genetic variants are loaded into the computer's memory (RAM), so make sure that enough memory is made available to the Java Virtual Machine. As a point of reference, the variants from the 1000 Genomes project phase 1 and phase 2 for chr22 alone occupy 6.4 GB on disk and require ~1.5 GB of memory for running splice site scoring.
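As a rough illustration of the memory requirement, the following sketch reports the maximum heap the running JVM may use and compares it with the ~1.5 GB quoted for chr22. The class name HeapCheck and the threshold constant are hypothetical and not part of the program described here; the heap limit itself is usually raised with the standard JVM option -Xmx (e.g., -Xmx2g). The value reported by the JVM can be slightly below the configured -Xmx value.

```java
public class HeapCheck {

    /** Returns true if the available maximum heap covers the required amount (both in bytes). */
    public static boolean hasEnoughHeap(long maxHeapBytes, long requiredBytes) {
        return maxHeapBytes >= requiredBytes;
    }

    public static void main(String[] args) {
        // Assumption: ~1.5 GB, the figure quoted for the chr22 variants.
        long requiredBytes = 1536L * 1024 * 1024;

        // Maximum amount of memory the JVM will attempt to use for the heap.
        long maxHeapBytes = Runtime.getRuntime().maxMemory();

        System.out.printf("Maximum JVM heap: %d MB%n", maxHeapBytes / (1024 * 1024));
        if (!hasEnoughHeap(maxHeapBytes, requiredBytes)) {
            System.out.println("Heap is probably too small; consider raising it with the JVM option -Xmx.");
        }
    }
}
```

Running this class with the same JVM options that are intended for the actual analysis shows whether the heap limit has taken effect.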

Data Files

  1. Gene Models (annotation in GTF format), REQUIRED: if missing, the program responds with an error like

    Hey, you forgot to specify a valid input file!
    This is a bit important, I cannot work without an input annotation. I want a GTF file with transcript annotations (exon features, with a mandatory optional attribute named 'transcript_id') IN THE SAME COLUMN (i.e., if the transcript identifier of the 1st line is in column #10, it has to be in all lines of the file in column #10. The rest of the file should comply with the standard as specified at http://mblab.wustl.edu/GTF2.html.
  2. Chromosome sequences (in individual FASTA files, one per chromosome), REQUIRED: if missing, the program responds with an error like

    [ERROR] Splice site scoring requires the genomic sequence, provide a value for parameter 'CHR_SEQ' in the parameter file, or via the command line flags -c or --chr!

    The chromosome sequences currently have to be provided as separate files, one per chromosome. All of these files have to be in the same folder (e.g., genomes/H.sapiens/hg19), with a filename prefix that corresponds to the tags in column 1 of the GTF file provided and a suffix ".fa" or ".fasta"; e.g., if chromosomes are named "chr1", "chr2", etc., then the program expects files named "chr1.fa", "chr2.fa", ...

    The first line of every

  3. Genetic variants (as a 
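
The naming convention for the chromosome sequence files (item 2) can be checked before a run with a small helper along the following lines. This is a minimal sketch, not part of the program itself: the class ChromosomeFileCheck and its methods are hypothetical, and it assumes a tab-separated GTF file in which lines starting with "#" are comments.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

public class ChromosomeFileCheck {

    /** Collects the distinct chromosome tags found in column 1 of a GTF file. */
    public static Set<String> chromosomeNames(Path gtf) throws IOException {
        Set<String> names = new TreeSet<>();
        try (BufferedReader reader = Files.newBufferedReader(gtf)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                int tab = line.indexOf('\t');
                names.add(tab < 0 ? line : line.substring(0, tab));
            }
        }
        return names;
    }

    /** Returns the chromosome tags for which neither NAME.fa nor NAME.fasta exists in dir. */
    public static Set<String> missingSequences(Set<String> names, Path dir) {
        Set<String> missing = new TreeSet<>();
        for (String name : names) {
            boolean found = Files.isRegularFile(dir.resolve(name + ".fa"))
                    || Files.isRegularFile(dir.resolve(name + ".fasta"));
            if (!found) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ChromosomeFileCheck <annotation.gtf> <sequence folder>");
            return;
        }
        Set<String> names = chromosomeNames(Paths.get(args[0]));
        Set<String> missing = missingSequences(names, Paths.get(args[1]));
        if (missing.isEmpty()) {
            System.out.println("A sequence file was found for every chromosome in the annotation.");
        } else {
            System.out.println("No .fa/.fasta file found for: " + missing);
        }
    }
}
```

For example, given an annotation that mentions chr1 and chr2 and a folder that contains only chr1.fa, the helper reports chr2 as missing.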

 

Basic Setup