The Command Line

The Flux Simulator reads its parameter values from a file that is specified on the command line.

Code Block
$ flux.sh -t simulator -x -l -s -p myParameters.par

The example carries out a complete simulation pipeline, including simulated expression ("-x" flag), library construction ("-l" flag), and sequencing ("-s" flag). With the "-p" flag a parameter file with the description of the corresponding pipeline is passed to the program.

A Minimal Example of a Simulation

Which parameters and values must be set in the file myParameters.par depends on the desired behavior of the simulation. Most parameters have default values; however, the Flux Simulator always requires a qualitative annotation of transcripts, i.e., their intron-exon structure described by genomic coordinates in a GTF file. A minimal parameter file therefore consists of just the line:

myParameters.par
REF_FILE_NAME    myTranscriptome.gtf

Note: With exclusively qualitative data about the transcripts--i.e., their genomic location and their mutual overlap--no sequence-specific attributes are taken into account in the simulation; that includes experimental characteristics caused by sequence biases as well as the in silico production of read sequences, which potentially are affected by sequencing errors.

Requirement for Simulations on Transcript/Read Sequences

If the simulation should produce read sequences, or if intermediate steps of the simulation are to take sequence-dependent biases into account, the Flux Simulator requires the genomic sequences of the chromosomes or scaffolds on which the genes have been annotated. The folder holding these sequences is given by GEN_DIR:

myParameters.par
REF_FILE_NAME    myTranscriptome.gtf
GEN_DIR          <path>/myGenome/

where <path> is the file-system path to the folder myGenome, which contains the complete set of chromosome sequences for the transcripts in the annotation myTranscriptome.gtf. The names of the files in the myGenome folder must match the sequence names given in the first column of the myTranscriptome.gtf file:

Code Block
$ head -n2 myTranscriptome.gtf

10  EnsEMBL  exon  123  456  .  +  .  gene_id="gene1"; transcript_id="transcript1"; 
10  EnsEMBL  exon  789  1012  .  +  .  gene_id="gene1"; transcript_id="transcript1"; 
 
$ ls myGenome

10.fa  12.fa  14.fa  16.fa  18.fa  1.fa   21.fa  2.fa  4.fa  6.fa  8.fa  X.fa
11.fa  13.fa  15.fa  17.fa  19.fa  20.fa  22.fa  3.fa  5.fa  7.fa  9.fa  Y.fa
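
The naming requirement above can be checked before starting a simulation. The following is a minimal sketch, not part of the Flux Simulator: the function name missing_chromosome_files is hypothetical, and the ".fa" extension is an assumption taken from the directory listing above.

```python
import os


def missing_chromosome_files(gtf_path, genome_dir, extension=".fa"):
    """Return GTF sequence names that have no matching file in genome_dir.

    Hypothetical helper; the default extension is assumed from the
    example listing (10.fa, X.fa, ...).
    """
    seq_names = set()
    with open(gtf_path) as gtf:
        for line in gtf:
            # Skip blank lines and comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            # Column 1 holds the chromosome or scaffold name;
            # split() accepts tabs as well as runs of spaces.
            seq_names.add(line.split()[0])
    available = set(os.listdir(genome_dir))
    return sorted(name for name in seq_names if name + extension not in available)
```

An empty result means every sequence name in the annotation has a corresponding file; any names returned would need a matching file in the genome folder.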

My First Simulation

Let's consider the following parameter file:

REF_FILE_NAME     /Users/micha/annotations/hg19_RefSeq_2009-05-13.gtf
POLYA_SHAPE       5
POLYA_SCALE       100
FRAG_METHOD       NB
FRAG_SUBSTRATE    DNA
FILTERING         ON
READ_LENGTH       75
PAIRED_END        TRUE

In this example, RNA molecules as annotated in the RefSeq annotation are simulated to be expressed with approximately normally distributed poly-A tail lengths averaging 100 nt. Fragmentation by nebulization (FRAG_METHOD NB) is carried out after reverse transcription (FRAG_SUBSTRATE DNA). The default size selection is applied (FILTERING ON). Finally, 75 nt paired-end reads are obtained from the fragments.
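
A parameter file like the one in this example can be loaded into a dictionary for inspection, for instance to confirm the read length before a long run. This is a minimal sketch, not the Flux Simulator's own parser: the function name read_parameters is hypothetical, and it assumes each setting sits on one line as a key followed by whitespace and its value, with lines starting with "#" treated as comments.

```python
def read_parameters(path):
    """Parse a parameter file into a dict of strings.

    Assumes one "KEY value" pair per line; this layout and the "#"
    comment convention are assumptions, not a format specification.
    """
    params = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Ignore empty lines and assumed comment lines.
            if not line or line.startswith("#"):
                continue
            # Split on the first run of whitespace only, so values
            # that contain spaces are kept intact.
            parts = line.split(None, 1)
            key = parts[0]
            value = parts[1].strip() if len(parts) > 1 else ""
            params[key] = value
    return params
```

All values are kept as strings; converting READ_LENGTH to an integer or PAIRED_END to a boolean is left to the caller.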

  
  
  

Explanation

Requires: REF_FILE_NAME, LOAD_CODING, LOAD_NONCODING
Outputs: PRO_FILE_NAME, columns 1 (locus name), 2 (transcript name), 3 (CDS/NC), and 4 (spliced length)

The necessary first step in simulating the experiment is loading a reference annotation. Input data must be in GTF format at the path specified by REF_FILE_NAME. Each transcript must have "exon" features; LOAD_CODING selects the transcripts that additionally have "CDS" features, and LOAD_NONCODING those that do not. Starting to read the reference annotation (button "Run" in the toolbar) first triggers a check of whether the GTF file is sorted, which makes the subsequent operations efficient. If it is not, the Flux Simulator sorts the file in the temporary directory, and a sorted copy of the file (with the suffix "_sorted" before the extension) should then appear in the folder containing the project. Use the sorted file instead of the original in future runs, because sorting can account for a substantial part of the running time.

Once the annotation has been read and parsed, several statistics are displayed, including a histogram of spliced transcript lengths (upper panel) and a zoom-in on the first 3 quartiles (lower panel). This step also initializes the pro file by writing its first 4 columns, i.e., splice locus ID, transcript ID, CDS/NC, and spliced transcript length.
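
The CDS/NC flag and the spliced length can be illustrated with a small script. This is a sketch under stated assumptions, not the Flux Simulator's implementation: the function name summarize_transcripts is hypothetical, the input is assumed to be tab-separated GTF lines, the spliced length is taken as the sum of the exon lengths (GTF coordinates are 1-based and inclusive), and the transcript_id attribute is accepted both with a space and with "=" before the quoted value.

```python
import re
from collections import defaultdict

# Matches transcript_id "x" as well as transcript_id="x".
TRANSCRIPT_ID = re.compile(r'transcript_id[ =]"([^"]+)"')


def summarize_transcripts(gtf_lines):
    """Map transcript ID -> ("CDS" or "NC", spliced length in nt)."""
    exon_length = defaultdict(int)
    has_cds = set()
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments and malformed lines with fewer than 9 columns.
        if len(fields) < 9:
            continue
        feature, start, end = fields[2], int(fields[3]), int(fields[4])
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        transcript = match.group(1)
        if feature == "exon":
            # Inclusive coordinates: an exon from 123 to 456 spans 334 nt.
            exon_length[transcript] += end - start + 1
        elif feature == "CDS":
            has_cds.add(transcript)
    return {
        transcript: ("CDS" if transcript in has_cds else "NC", length)
        for transcript, length in exon_length.items()
    }
```

Only transcripts with at least one "exon" feature appear in the result, which mirrors the requirement that each transcript has "exon" features.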