Hi all,

Is there a way to get around the random expression assignment? It would be a nice addition if users could provide an empirical distribution (a vector of RPKMs or something similar) obtained from a real experiment. This would help with the validation/comparison of observed and simulated profiles at the genome level.

Thanks

David Casero

2 Comments

  1. Hi David,

    The question about custom expression values was also raised recently by Moritz from the DKFZ. Yes, there is a way. You need to provide the Simulator with a .PRO file that has 6 columns and contains valid information in the 4 columns marked with an asterisk (*) below, i.e. columns 1, 2, 4, and 6.

    Column Nr  Name                Value                 Description
    1*         Locus               chrom:start-end[W|C]  identifier of the transcriptional locus, given by the chromosome (chrom), the start and end positions, and the strand (Watson or Crick)
    2*         Transcript_ID       String                transcript identifier from the reference annotation
    3          Coding              [CDS|NC]              specifies whether the transcript has an annotated coding sequence (CDS) or not (NC)
    4*         Length              Integer               the mature length of the transcript after splicing out introns, disregarding the poly-A tail, as annotated in the reference annotation
    5          Expressed Fraction  Float                 fraction of RNA molecules that represent transcripts that are qualitatively equal to this RNA form
    6*         Expressed Number    Integer               absolute number of expressed RNA molecules

    Do not worry about columns 3 and 5; they are informative (i.e., "output") rather than used in subsequent steps. If you provide such a custom .PRO file in the parameters and do not request re-generation of expression values (flag -x), the program will use your values in the subsequent steps of the simulation pipeline. For instance, the command

    flux-simulator -lsp parameters_pointing_to_custom_profile.par

    will "eat" the expression values you provided in the custom profile.

    Column 6 of your custom .PRO file has to be filled with integer values derived from the expression values you have. The numbers represent initial molecules. As a rule of thumb, you may want to start with values of about (10 * RPKM), which should come close to the default settings of the Flux Simulator. Those defaults certainly yield far fewer molecules than there are in real cells.
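
As a rough illustration of the rule of thumb above, the following Python sketch turns a list of RPKM values into integer molecule counts for column 6 and, for completeness, the matching fractions for column 5. The function name `rpkm_to_molecules` and the `scale` parameter are made up for this example and are not part of the Flux Simulator; the factor of 10 is only the starting point suggested above.

```python
def rpkm_to_molecules(rpkm_values, scale=10.0):
    """Convert RPKM values to integer molecule counts and relative fractions.

    scale=10.0 follows the (10 * RPKM) rule of thumb; it is a starting
    point to tune, not a fixed constant of the Simulator.
    """
    counts = []
    for value in rpkm_values:
        if value < 0:
            raise ValueError(f"negative expression value: {value}")
        # Column 6 must hold integers, so round to the nearest molecule.
        counts.append(int(round(scale * value)))
    total = sum(counts)
    # Column 5 is informative only; fill it consistently with column 6.
    fractions = [c / total if total else 0.0 for c in counts]
    return counts, fractions


counts, fractions = rpkm_to_molecules([0.0, 1.26, 12.5, 300.0])
print(counts)  # [0, 13, 125, 3000]
```

Transcripts with very low RPKM round down to zero molecules under this scheme, so a larger scale may be preferable if weakly expressed transcripts matter for the comparison.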

    Columns 1, 2, and 4 can be obtained from the transcript annotation (GTF file), either with your preferred scripting language or by running the Simulator and "hitchhiking" the corresponding values from a generated expression profile; they are invariant transcript attributes.
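
For the scripting route, a minimal Python sketch could look like the one below. It reads the exon lines of a GTF annotation and derives a locus string, the transcript ID, and the spliced length per transcript. Several points are assumptions to check against a profile generated by the Simulator itself: the function name `transcript_attributes` is invented here, "+" is mapped to W and "-" to C, and the locus is approximated by the transcript's own span, whereas the Simulator may define a locus from a cluster of overlapping transcripts.

```python
import re
from collections import defaultdict


def transcript_attributes(gtf_lines):
    """Return (locus, transcript_id, spliced_length) tuples from GTF exon lines."""
    exons = defaultdict(list)  # transcript_id -> list of (start, end)
    meta = {}                  # transcript_id -> (chromosome, strand)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        tid = match.group(1)
        exons[tid].append((int(fields[3]), int(fields[4])))
        meta[tid] = (fields[0], fields[6])

    rows = []
    for tid, spans in exons.items():
        chrom, strand = meta[tid]
        start = min(s for s, _ in spans)
        end = max(e for _, e in spans)
        # GTF coordinates are 1-based and inclusive on both ends.
        length = sum(e - s + 1 for s, e in spans)
        # Assumption: "+" is the Watson strand, anything else is Crick.
        strand_code = "W" if strand == "+" else "C"
        rows.append((f"{chrom}:{start}-{end}{strand_code}", tid, length))
    return rows


example = [
    'chr1\tsrc\texon\t100\t199\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t300\t349\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr2\tsrc\texon\t10\t59\t.\t-\t.\tgene_id "g2"; transcript_id "t2";',
]
rows = transcript_attributes(example)
for row in rows:
    print(row)
```

For a real annotation, pass an open file handle instead of the `example` list. The resulting tuples still need the remaining columns added before they form a complete .PRO line.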

    I created a ticket to continue the discussion on whether and how we can improve the program to automate these steps.

    Please feel free to add yourself to the ticket's watch list to get notified when there are updates.

    Micha

  2. Thanks Micha, makes sense to me. 

    David