
Parameters

Name	Default Value	Parameter Range	Description
REF_FILE			file from which the reference annotation (GTF format) is read
LOAD_CODING	true	{true,false}	flag to consider (true) or disregard (false) transcripts that have an annotated coding sequence
LOAD_NONCODING	true	{true,false}	flag to consider (true) or disregard (false) transcripts that are annotated as non-coding
PRO_FILE			file to which the simulated expression values are written
LIB_FILE			file to which the expressed transcript molecules are written
NB_MOLECULES	5,000,000	>0	number of expressed RNA molecules simulated
EXPRESSION_K	-0.6		exponent of the expression power law ("Pareto coefficient")
EXPRESSION_X0	9,500		controls the exponential decay
EXPRESSION_X1	9,500²		controls the exponential decay

The Distribution of Gene Expression Levels

First, the Flux Simulator reads the transcripts of the reference annotation and clusters transcripts that overlap in the genome into loci.
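The clustering step can be illustrated with a minimal sketch. It is not the Flux Simulator's own code: the tuple layout, the function name `cluster_into_loci`, and the choice to group only transcripts on the same chromosome and strand are assumptions made for illustration, since the text only says that overlapping transcripts form a locus.

```python
from collections import defaultdict


def cluster_into_loci(transcripts):
    """Group overlapping transcripts into loci.

    `transcripts` is an iterable of hypothetical tuples
    (transcript_id, chrom, strand, start, end) with inclusive coordinates.
    Assumption: only transcripts on the same chromosome and strand can
    share a locus.
    """
    by_key = defaultdict(list)
    for tid, chrom, strand, start, end in transcripts:
        by_key[(chrom, strand)].append((start, end, tid))

    loci = []
    for key in sorted(by_key):
        current, current_end = [], None
        # Sweep transcripts by start position and extend the open locus
        # as long as the next transcript starts inside it.
        for start, end, tid in sorted(by_key[key]):
            if current and start > current_end:
                loci.append(current)
                current, current_end = [], None
            current.append(tid)
            current_end = end if current_end is None else max(current_end, end)
        if current:
            loci.append(current)
    return loci
```

Sorting by start position means each transcript only has to be compared with the furthest end seen so far in the open locus, so a single pass per chromosome and strand is enough.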

The simulator then assigns a random expression profile in which not all transcripts of the reference are necessarily expressed. Expression levels are related to the relative expression rank by a mixed power and exponential law of the general form

\[y = x^{k} \exp\left(-\frac{x}{a}-\left(\frac{x}{b}\right)^{2}\right)\]

where \(x\) denotes the rank number of a gene, \(k\) is the exponent of the intrinsic power law, and \(a\) and \(b\) control the exponential decay. The Flux Simulator randomly assigns expression ranks to the transcripts in the reference annotation. The modified Zipf's law above turns these ranks into relative expression levels, and multiplying each relative level by the total number of molecules gives the initial number of molecules per transcript. Default parameter values have been estimated for mammalian cells by non-linear fitting to expression levels observed in experimental results.
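The rank-to-expression law above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the simulator's implementation: the function names are hypothetical, the relative levels are normalised to sum to 1 before scaling, and the value of `b` is illustrative because the text does not state how `a` and `b` map onto EXPRESSION_X0 and EXPRESSION_X1.

```python
import math
import random


def relative_expression(rank, k=-0.6, a=9500.0, b=9500.0):
    """Evaluate y = x^k * exp(-x/a - (x/b)^2) for a 1-based rank x.

    k = -0.6 is the documented EXPRESSION_K default; a and b are
    illustrative values (assumption).
    """
    return rank ** k * math.exp(-rank / a - (rank / b) ** 2)


def assign_molecules(transcript_ids, nb_molecules=5_000_000, seed=0):
    """Assign random ranks, then derive relative levels and molecule counts."""
    rng = random.Random(seed)
    ids = list(transcript_ids)
    rng.shuffle(ids)  # position i in the shuffled list receives rank i + 1

    weights = [relative_expression(i + 1) for i in range(len(ids))]
    total = sum(weights)  # assumption: normalise so the fractions sum to 1

    result = {}
    for tid, weight in zip(ids, weights):
        fraction = weight / total
        # Rounding can yield 0 molecules, i.e. a transcript that is not expressed.
        result[tid] = (fraction, round(fraction * nb_molecules))
    return result
```

Calling `assign_molecules(["t1", "t2", "t3"])` returns, for each transcript, a pair of relative abundance and molecule count. Because of the rounding, the counts only approximately add up to `nb_molecules`.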

Output: The first 6 columns of the PRO_FILE

Transcript Modifications during Expression

After the number of RNA molecules has been determined for each transcript, the in silico expressed transcripts are assigned individual variations in transcription start and in the length of the attached poly-A tail. The Flux Simulator models differences in transcription start as random variables under an exponential model with a mean of around 10 nt. During polyadenylation in the nucleus, usually 200-250 adenine residues are added to the primary transcript. Disregarding other polyadenylation mechanisms, such as cytoplasmic polyadenylation, and the exact mechanisms of degradation by exo- and endonucleases, our model describes poly-A lengths by random sampling from a Gaussian distribution with a mean of 125 nt and a shape adapted such that >99.5% of the random variables fall in the interval [0;250].
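The two sampling models can be sketched as follows. This is a hedged illustration rather than the simulator's code: the constant and function names are hypothetical, the standard deviation is derived here from the stated 99.5% coverage of [0;250], and clamping the rare samples outside that interval is an assumption the text does not make.

```python
import random
from statistics import NormalDist

TSS_MEAN_NT = 10.0      # mean of the exponential model for start-site offsets
POLYA_MEAN_NT = 125.0   # mean poly-A tail length
POLYA_MAX_NT = 250.0    # upper bound of the interval [0;250]
COVERAGE = 0.995        # fraction of samples that should fall in [0;250]

# z such that P(|Z| <= z) = COVERAGE for a standard normal Z (about 2.81).
_z = NormalDist().inv_cdf(0.5 + COVERAGE / 2)
# Largest standard deviation that still keeps COVERAGE of the mass in [0;250].
POLYA_SD_NT = (POLYA_MAX_NT - POLYA_MEAN_NT) / _z


def sample_modifications(rng):
    """Return (tss_offset_nt, polya_length_nt) for one simulated molecule."""
    # expovariate takes the rate, i.e. 1 / mean.
    tss_offset = round(rng.expovariate(1.0 / TSS_MEAN_NT))
    polya = rng.gauss(POLYA_MEAN_NT, POLYA_SD_NT)
    # Assumption: clamp the few out-of-range samples to the interval.
    polya = min(max(polya, 0.0), POLYA_MAX_NT)
    return tss_offset, round(polya)
```

With these constants the standard deviation comes out at roughly 44.5 nt. Any smaller value would also satisfy the ">99.5%" condition, so the exact shape used by the simulator may differ.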

Requires: PRO_FILE_NAME columns 1-4
Outputs: PRO_FILE_NAME columns 5 (relative abundance) and 6 (molecule count), both after gene expression