The Command Line

The Flux Simulator reads its parameter values from a file that is specified on the command line.

$ flux.sh -t simulator -x -l -s -p myParameters.par

The example carries out a complete simulation pipeline: simulated expression ("-x" flag), library construction ("-l" flag), and sequencing ("-s" flag). The "-p" flag passes the parameter file that describes the corresponding pipeline to the program.

A Minimal Example of a Simulation

The parameters and values that must be set in the file myParameters.par depend on the desired behavior of the simulation. Most of these parameters have default values; however, the Flux Simulator always requires a qualitative annotation of transcripts, i.e., their intron-exon structure described by genomic coordinates in a GTF file. A minimal parameter file therefore consists of the single line

myParameters.par
REF_FILE_NAME	myTranscriptome.gtf
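
A parameter file like the one above can also be generated from a script. The following is a minimal sketch, assuming that each line holds a parameter name and its value separated by a tab, as in the example; the function names `format_parameters` and `write_parameter_file` are hypothetical helpers and not part of the Flux Simulator.

```python
from pathlib import Path


def format_parameters(params):
    """Render a dict as parameter-file text, one KEY<TAB>VALUE pair per line.

    Assumption: name and value are separated by a single tab, as in the
    example parameter file.
    """
    return "".join(f"{key}\t{value}\n" for key, value in params.items())


def write_parameter_file(path, params):
    """Write the rendered parameters to `path` and return the path."""
    Path(path).write_text(format_parameters(params))
    return path


if __name__ == "__main__":
    # Minimal parameter file: only the reference annotation is set.
    write_parameter_file("myParameters.par",
                         {"REF_FILE_NAME": "myTranscriptome.gtf"})
```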

Note: With exclusively qualitative data about the transcripts (i.e., their genomic location and their mutual overlap), no sequence-specific attributes are taken into account in the simulation. This excludes experimental characteristics caused by sequence biases as well as the in silico production of read sequences, which can be affected by sequencing errors.

Requirement for Simulations on Transcript/Read Sequences

If the simulation should produce read sequences, or if intermediate steps of the simulation are to take sequence-dependent biases into account, the Flux Simulator must be given the genomic sequences of the chromosomes or scaffolds on which the genes have been annotated.

myParameters.par
REF_FILE_NAME	myTranscriptome.gtf
GEN_DIR	<path>/myGenome/

where <path> is the path in the file system pointing to the folder myGenome, which contains the complete set of chromosome sequences for the transcripts in the annotation myTranscriptome.gtf. The names of the files in the myGenome folder must coincide with the sequence names given in the first column of the myTranscriptome.gtf file:

$ head -n2 myTranscriptome.gtf

10  EnsEMBL  exon  123  456  .  +  .  gene_id "gene1"; transcript_id "transcript1";
10  EnsEMBL  exon  789  1012  .  +  .  gene_id "gene1"; transcript_id "transcript1";
 
$ ls myGenome

10.fa  12.fa  14.fa  16.fa  18.fa  1.fa   21.fa  2.fa  4.fa  6.fa  8.fa  X.fa
11.fa  13.fa  15.fa  17.fa  19.fa  20.fa  22.fa  3.fa  5.fa  7.fa  9.fa  Y.fa
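
The naming requirement can be checked before starting a run. The sketch below is an illustration only: it assumes that each sequence is stored in a file named after the sequence plus the ".fa" extension, as in the listing above, and the helper names (`gtf_sequence_names`, `missing_sequence_files`, `check_genome_dir`) are hypothetical, not part of the Flux Simulator.

```python
from pathlib import Path


def gtf_sequence_names(lines):
    """Collect the distinct sequence names from the first GTF column."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        names.add(line.split(None, 1)[0])
    return names


def missing_sequence_files(lines, file_names, extension=".fa"):
    """Return the sorted sequence names that have no matching file.

    Assumption: files are named <sequence name> + extension.
    """
    available = set(file_names)
    return sorted(name for name in gtf_sequence_names(lines)
                  if name + extension not in available)


def check_genome_dir(gtf_path, genome_dir, extension=".fa"):
    """Compare a GTF file on disk against the contents of a genome folder."""
    file_names = [p.name for p in Path(genome_dir).iterdir() if p.is_file()]
    with open(gtf_path) as handle:
        return missing_sequence_files(handle, file_names, extension)
```

Calling `check_genome_dir("myTranscriptome.gtf", "myGenome")` returns an empty list when every annotated sequence has a corresponding file.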

My First Simulation

Let's consider the following parameter file:

REF_FILE_NAME	/Users/micha/annotations/hg19_RefSeq_2009-05-13.gtf
POLYA_SHAPE	5
POLYA_SCALE	100
FRAG_METHOD	NB
FRAG_SUBSTRATE	DNA
FILTERING	ON
READ_LENGTH	75
PAIRED_END	TRUE

In this example, the RNA molecules annotated in the RefSeq annotation are simulated as expressed, with approximately normally distributed poly-A tail lengths averaging 100 nt. Fragmentation by nebulization (FRAG_METHOD NB) is carried out after reverse transcription (FRAG_SUBSTRATE DNA). The default size selection is applied (FILTERING ON). Finally, 75 nt paired-end reads are obtained from the fragments.
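
To inspect such a parameter file from a script, it can be read into a dictionary. This is a minimal sketch, not the Flux Simulator's own parser: it assumes one parameter per line with name and value separated by whitespace, and it treats lines starting with "#" as comments, which is an assumption made here for convenience. The function name `parse_parameters` is hypothetical.

```python
def parse_parameters(text):
    """Parse parameter-file text into a dict of name -> value strings."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' lines carry no settings
        parts = line.split(None, 1)  # split name from value at first whitespace
        params[parts[0]] = parts[1] if len(parts) > 1 else ""
    return params


if __name__ == "__main__":
    example = (
        "FRAG_METHOD\tNB\n"
        "FRAG_SUBSTRATE\tDNA\n"
        "READ_LENGTH\t75\n"
        "PAIRED_END\tTRUE\n"
    )
    for name, value in parse_parameters(example).items():
        print(name, "=", value)
```

All values are kept as strings; converting READ_LENGTH to an integer or PAIRED_END to a boolean is left to the caller.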

Explanation

Requires: REF_FILE_NAME, LOAD_CODING, LOAD_NONCODING
Outputs: PRO_FILE_NAME, columns 1 (locus name), 2 (transcript name), 3 (CDS/NC), and 4 (spliced length)

The necessary first step in simulating the experiment is loading a reference annotation. Input data must be in GTF format at the path specified by REF_FILE_NAME. Each transcript must have "exon" features; LOAD_CODING takes into account the transcripts that additionally have "CDS" features, LOAD_NONCODING those that do not. Starting to read the reference annotation (button "Run" in the toolbar) first triggers a check of whether the GTF file is sorted, which keeps the subsequent operations efficient. If it is not, the Flux Simulator sorts your file in the temporary directory, and a sorted copy of the file (with the suffix "_sorted" before the extension) should then appear in the folder containing the project. Use the sorted file instead of the original in future runs, because sorting can account for a substantial part of the running time.

Once the annotation has been read and parsed, you see several statistics, including a histogram of spliced transcript lengths (upper panel) and a zoom-in onto the first 3 quartiles (lower panel). This step also initializes the PRO file by writing its first 4 columns, i.e., splice locus ID, transcript ID, CDS/NC, and spliced transcript length.
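
The per-transcript values described above can be approximated independently, for example to sanity-check an annotation. The following is a sketch under stated assumptions, not the Flux Simulator's implementation: it assumes tab-delimited GTF lines with 1-based inclusive coordinates, takes the spliced length as the sum of the exon lengths, and labels a transcript "CDS" if it has at least one "CDS" feature and "NC" otherwise. The function name `summarize_transcripts` is hypothetical, and the locus column of the PRO file is not reproduced.

```python
import re
from collections import defaultdict

# Accepts both `transcript_id "x"` and the `transcript_id="x"` spelling.
TRANSCRIPT_ID = re.compile(r'transcript_id[ =]"([^"]+)"')


def summarize_transcripts(lines):
    """Map transcript ID -> (coding label, spliced length in nt)."""
    exon_length = defaultdict(int)
    has_cds = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        transcript = match.group(1)
        feature, start, end = fields[2], int(fields[3]), int(fields[4])
        if feature == "exon":
            # GTF coordinates are 1-based and inclusive.
            exon_length[transcript] += end - start + 1
        elif feature == "CDS":
            has_cds.add(transcript)
    return {
        transcript: ("CDS" if transcript in has_cds else "NC", length)
        for transcript, length in exon_length.items()
    }
```

For the two exons shown earlier (123 to 456 and 789 to 1012), the spliced length is 334 + 224 = 558 nt.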
