I was wondering of why there is such a strong and pronounced bias towards the 3' end of the genes as show below (lots of reads map at the 5' end, no reads are present at the 3' end). This image is obtained from the example.fastq file distributed from the site using a very simple parameter file
---- parameter file ----
# use default 76-bp error model
# create a fastq file
Hi Istvan, my first guess would be that your observation could be due to the simulation of GC bias in the genes you show. Does it change if you add the following line to your paramter file?
What I see is that transcripts that are longer than 1kb get increasingly fewer reads downstream. The rate of decay is very abrupt and there are practically no reads mapping past the 1.5 kb mark. Every single gene shows the same behavior. Every long gene looks like the one below (I have added the PCR_DISTRIBUTION none) flag.
your observation is reproduceable, and it seems to be linked to the reverse transcription step. Though there can be a slight 5'-bias observed in the reverse transcription of (very) long molecule, ie., if carried out before any form of fragmentation, you are right that the phenomenon should not be observed in the protocol you applied (UR fragmentation before RT).
I opened the Ticket BARNA-272 where we will investigate this behaviour, after loggin into JIRA (same account as used here) you can put yourself in the watchlist of the ticket to get automatically notified if there are updates. For a quick workaround, I recommend to turn off reverse transcription (RTRANSCRIPTION NO in the .par file) for the time being.
BTW, do you know that by the setting EXPRESSION_K 0 all transcripts from the reference annotation are expressed at about equal levels? Just be aware that is not what we observe in physiologic experiments, where the distribution follows a power law, and, although the Simulator currently allows the exponent to be (-1)<= EXPRESSION_K<= 0, an exponent of 0 removes the power law characteristics.
thanks for the update.
I do recognize what the effect of EXPRESSION_K 0, I just wanted to remove other factors plus I am trying to make use of your tool as part of a lecture series on RNA seq to demonstrate what to expect in a naive approach and contrast that to a realistic scenario.
In the same vein what would really help is a toy example among your help files, with just one or two transcripts and then using that to show the effects of some of the parameters. That I think would be very instructional. I will prepare a set of such data/samples for the lecture once the issues are worked out and contribute it back so that you can evaluate it on wether these would be appropriate for inclusion into the help files.
actually I gave the answer already in my last comment: the default value for the parameter FRAG_SUBSTRATE is currently set to "DNA", which means that fragmentation is carried out after reverse transcription of RNA into DNA. If carried out before fragmentation, reverse transcription by random hexamers can be prone to biases towards the 5'-end which depends in our models on the concentration of random primers, which can be indirectly controlled by the parameters RT_MIN and RT_MAX.
However, if you would like to carry out simulated hydrolysis (as in the current Illumina protocols), the parameter FRAG_SUBSTRATE should be set to RNA, as in the example in 5.1 RNA Hydrolysis Protocol (M.musculus). To avoid confusion, we will make this setting the default in the next release.
With slight modifications of your sample parameter file
I obtained the following distribution of reads in the sample locus YDR104C
I think that answers the questions about the mysterious 5'-bias.
ok, I can confirm that I do observe the same, switching the substrate to RNA will lead to a more uniform coverage.
I don't have enough experience to comment on experimental techniques but I will read up on this in the next few days. Cursory examination of the data indicates that in DNA fragmentation the bias is so strong that - if this is indeed observed in practice - it would make the DNA based fragmentation inadequate to detect transcripts longer than 1.5kb at least in the yeast genome. I wonder how many papers were published before this was understood...
thanks again for the help, I learned a lot from this tool and the paper that describes it - more than any other resource -