I am trying to use Flux Simulator with a .gtf file from Illumina's iGenomes collection. Specifically, I am using the genes.gtf in the Drosophila_melanogaster/UCSC/dm3/Annotation/Genes subdirectory of this archive: dm3. I have set REF_FILE_NAME appropriately. However, when I run flux-simulator, I get an warning indicating that the GTF file isn't sorted. Next, flux "sorts" the GTF file for me, but then yields an error saying that the sorted file isn't sorted:
$ ./flux-simulator/bin/flux-simulator -p parameters/example.par
Flux-Simulator v1.2.1 (Flux Library: 1.22)
[INFO] No mode selected, executing the full pipeline (-x -l -s)
[INFO] I am collecting information on the run.
[INFO] Reading error model 76 bases model
[INFO] Checking GTF file
Checking GTF *[WARN] Unsorted in line 4 transcript id NM_175941 used twice, on: chr2L,chr2L
[GTF FILE] The GTF reference file given is not sorted, sorting it right now...
sorting GTF file OK (00:00:20)
[GTF FILE] The Simulator will use /Users/langmead/git/tornado/tools/flux_sim/parameters/genes_sorted.gtf
[GTF FILE] You might want to update your parameters file
[PROFILING] I am assigning the expression profile
Checking GTF *[WARN] Unsorted in line 3496 transcript id NR_073697 used twice, on: chr2L,chr2L
[ERROR] The reference annotation GTF is not sorted!
java.lang.RuntimeException: The reference annotation GTF is not sorted!
I'm guessing this has something to do with how the .gtf files from iGenomes are formatted. Apparently, they are formatted to contain information that Cufflinks likes so that it can do differential analysis w/r/t promoters and coding sequences, in addition to transcripts:
Can you help?
from the output you posted I take that the Simulator checked the original input file "genes.gtf" and found that it is not sorted as expected. This would usually trigger an automatic sorting process, however, it seems that the program already found a file named "genes_sorted.gtf" (which is the default name for files that have been sorted by the Simulator) in the same folder as your input. Next, it took this file "genes_sorted.gtf" instead of "genes.gtf" as input, and crashed because the ordering there was also not meeting the expectation. Apologies for not having another trigger for automated sorting here.
It would be interesting to know, wether the "genes_sorted.gtf" file was created by you, or by a previous run of the program; in the former case we might help you with sorting the file manually, in the latter case we should open a ticket in our bugtracker. I downloaded the annotation following your instructions, but I found multiple files called "genes.gtf" in the specified archive:
Which of these files did you use in your example run? And what happens if you temporarily rename or move the file "genes_sorted.gtf"?
Thank you for the quick reply! I tried this again, making extra sure that there was no genes_sorted.gtf file to begin with. The output from the run is the same:
So, yes, the file that ultimately generates the error is one that was sorted by Flux.
The genes.gtf file I'm using is the one in the
subdirectory. Following a couple of symbolic links indicates this is the same as this one:
Thank you again for your help - it is much appreciated,
apologies, that was my error in the first place. The problem you report is indeed caused by the intrinsic sorting, because not all lines in the gtf file have the transcript_id in the same column. Although that is not explicitly required by the gtf specification, we assume the same column for the transcript_id information to speed up the sorting.
An adhoc solution is to reposition the transcript_id field in the input file, for instance by:
With the above command, I converted the example file you provided into a file that can be processed by the Simulator:
I agree that we should at least throw a warning if our assumption on the position consistency of ordering within the gtf optional fields is not met by the file. I updated the documentation and I opened a corresponding ticket. However, I hope that the command above serves you as a temporary workaround.
Please be also warned that the folder
from your annotation database does not provide FASTA sequences for all the chromosome sequences in your gtf file (e.g., chrUextra is absent). This raises a problem at the point when the program attempts to retrieve the read sequences (if not turned OFF), and to get around this issue (when sequence output is turned ON) I had to remove the annotations on chrUextra from the gtf file by
Afterwards, I was able to process the complete Simulator pipeline with the default parameters.
Great! Many thanks. Here's a Python script others can use to perform both the steps you describe:
thanks for contributing with the python script!