Empirical expression vector as input

Created by David Casero on Dec 05, 2012

Hi all,

Is there a way to get around the random expression assignment? It would be a nice addition if users could provide an empirical distribution (a vector of RPKMs or something like that) obtained from a real experiment. This would help with validating and comparing observed and simulated profiles at the genome level.

Thanks

David Casero

2 Comments

Micha Sammeth

Hi David,

The question about custom expression values was also raised recently by Moritz from the DKFZ. Yes, there is a way. You need to provide the Simulator with a .PRO file that has 6 columns and contains valid information in 4 of them (columns 1, 2, 4, and 6).

Column Nr	Name	Value	Description
1	Locus	chrom:start-end[W|C]	identifier of the transcriptional locus, given by the chromosome (chrom), the start and end positions, and the strand (Watson or Crick)
2	Transcript_ID	String	transcript identifier from the reference annotation
3	Coding	[CDS|NC]	specifies whether the transcript has an annotated coding sequence (CDS) or not (NC)
4	Length	Integer	the mature length of the transcript after splicing out introns, disregarding the poly-A tail, as annotated in the reference annotation
5	Expressed Fraction	Float	fraction of RNA molecules that represent transcripts that are qualitatively equal to this RNA form
6	Expressed Number	Integer	absolute number of expressed RNA molecules
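
As a rough illustration of the layout above, the following Python sketch checks one row of a custom profile against the six column types. It is not part of the Flux Simulator; the function name `check_pro_row` is hypothetical, and it assumes the columns are tab-separated and that the locus looks like `chr1:100-500W`.

```python
import re

# Assumed locus pattern: chrom:start-end followed directly by W or C.
LOCUS_RE = re.compile(r"^[^:\s]+:\d+-\d+[WC]$")


def check_pro_row(line):
    """Return a list of problems found in one .PRO row (empty if none).

    Assumes tab-separated columns; this is an illustrative check only.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return [f"expected 6 columns, found {len(fields)}"]
    locus, transcript_id, coding, length, fraction, number = fields
    problems = []
    if not LOCUS_RE.match(locus):
        problems.append("column 1: locus is not chrom:start-end[W|C]")
    if not transcript_id:
        problems.append("column 2: empty transcript ID")
    if coding not in ("CDS", "NC"):
        problems.append("column 3: expected CDS or NC")
    if not length.isdecimal() or int(length) == 0:
        problems.append("column 4: length must be a positive integer")
    try:
        float(fraction)
    except ValueError:
        problems.append("column 5: fraction must be a float")
    if not number.isdecimal():
        problems.append("column 6: expressed number must be a non-negative integer")
    return problems
```

Running such a check over every line of a hand-built profile before starting the Simulator can catch formatting slips early.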

Do not worry about columns 3 and 5; they are informative (i.e., "output") rather than used in subsequent steps. If you provide such a custom .PRO file in the parameters and do not request re-generation of expression values (flag -x), the program will use your values in the subsequent steps of the simulation pipeline. For instance, the command

flux-simulator -lsp parameters_pointing_to_custom_profile.par

will "eat" the expression values you provided in the custom profile.

Column 6 of your custom .PRO file has to be filled with Integer values derived from the expression values you have. The numbers represent initial molecules. As a rule of thumb, you may want to start with values of about (10 * RPKM), which should come close to the default settings of the Flux Simulator. These are certainly far fewer molecules than there are in real cells.
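
A minimal Python sketch of that rule of thumb follows. The helper names `rpkm_to_molecules` and `fill_expression` are hypothetical, rows are assumed to be lists of the six column values as strings, and transcripts without an RPKM are assumed to get 0 molecules. Column 5 is left untouched here.

```python
def rpkm_to_molecules(rpkm, factor=10.0):
    """Scale an RPKM value to an integer molecule count (about factor * RPKM)."""
    if rpkm < 0:
        raise ValueError("RPKM must be non-negative")
    return int(round(rpkm * factor))


def fill_expression(rows, rpkm_by_transcript, factor=10.0):
    """Return copies of the .PRO rows with column 6 set from the RPKM values.

    rows: list of 6-element lists of strings (one per transcript).
    rpkm_by_transcript: dict mapping transcript ID (column 2) to RPKM.
    Transcripts missing from the dict get 0 molecules (an assumption).
    """
    filled = []
    for row in rows:
        count = rpkm_to_molecules(rpkm_by_transcript.get(row[1], 0.0), factor)
        filled.append(row[:5] + [str(count)])
    return filled
```

The factor is kept as a parameter so that it can be adjusted if the resulting molecule numbers turn out too low or too high for the simulation at hand.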

Columns 1, 2, and 4 can be obtained from the transcript annotation (GTF file), either with your preferred scripting language or by running the Simulator and "hitchhiking" the corresponding values from a generated expression profile. They are invariant transcript attributes.
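
For the scripting route, here is a hedged Python sketch that derives the transcript IDs (column 2) and spliced lengths (column 4) from GTF exon records. The function name `spliced_lengths` is hypothetical, and it assumes standard GTF with 1-based inclusive coordinates and a `transcript_id "..."` attribute. It does not build the locus string of column 1, since how the Simulator delimits a locus is not described here; copying that column from a generated profile is probably safer.

```python
import re
from collections import defaultdict

# Standard GTF attribute syntax: transcript_id "SOME_ID";
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def spliced_lengths(gtf_lines):
    """Sum exon lengths per transcript from an iterable of GTF lines.

    Returns a dict mapping transcript ID to mature (intron-free) length.
    Non-exon features, comments, and malformed lines are skipped.
    """
    lengths = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID_RE.search(fields[8])
        if match is None:
            continue
        start, end = int(fields[3]), int(fields[4])
        # GTF coordinates are 1-based and inclusive on both ends.
        lengths[match.group(1)] += end - start + 1
    return dict(lengths)
```

Because the function takes any iterable of lines, an open file handle on the annotation can be passed in directly.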

I created a ticket to continue the discussion of whether and how we can improve the program to automate these steps.

Please feel free to put yourself in the watchlist of the ticket to get notified when there are updates.

Micha

Dec 10, 2012

David Casero
Thanks Micha, makes sense to me.
David

Dec 10, 2012
