Section

The Flux Simulator requires the key "transcript_id" to identify exons of the same transcript. As in the UCSC standard, transcript IDs must be unique within the chromosome on which the transcript is annotated.
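
To illustrate the requirement above, the following is a minimal sketch that groups exon lines by chromosome and transcript_id and counts the exons in each group. The function name exons_per_transcript is hypothetical and not part of the Flux Simulator. The sketch assumes a tab-separated GTF file whose last column holds attributes of the form transcript_id "value";.

```shell
# Hypothetical helper (not part of the Flux Simulator): count exons per
# (chromosome, transcript_id) pair in a tab-separated GTF file.
exons_per_transcript() {
  awk 'BEGIN { FS = "\t"; OFS = "\t" }
    # Only exon lines that carry a quoted transcript_id attribute.
    $3 == "exon" && match($NF, /transcript_id "[^"]*"/) {
      # Skip the 15 characters of: transcript_id "  and drop the closing quote.
      id = substr($NF, RSTART + 15, RLENGTH - 16)
      n[$1 OFS id]++
    }
    END { for (k in n) print k, n[k] }' "$1" | sort
}

# Usage (assumes a file named genes.gtf exists):
#   exons_per_transcript genes.gtf
```

Because the grouping key is the pair of chromosome and transcript ID, the same ID on two different chromosomes is reported as two separate transcripts. Note that the simple pattern would also match a longer key that merely ends in "transcript_id", should a file contain one.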

Warning

The automated sorting of GTF files requires the transcript_id to be in the same column throughout the GTF file. This column is inferred from the first lines of the file and then assumed to be consistent. If the position of "transcript_id" varies within the GTF file, the automated sorting will fail. Such files can be fixed with the command

awk 'BEGIN{FS="\t";OFS="\t"}{n=split($NF,a," ");pfx="";s="";for(i=1;i<=n;i+=2){if(a[i]=="transcript_id"){pfx=a[i]" "a[i+1]}else{s=s" "a[i]" "a[i+1]}}if(pfx==""){print "[WARN] line "NR" without transcript_id!" > "/dev/stderr"}else{$NF=pfx""s;print $0}}' genes.gtf > genes_clean.gtf

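or similar, where "genes.gtf" is the file with inconsistent transcript_id columns and "genes_clean.gtf" is the output file, in which the attributes have been reordered so that transcript_id always appears in the same position. Lines without a transcript_id are reported on standard error and are not written to the output file.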
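
Before or after running the fix, it can help to check whether the position of transcript_id actually varies. The following is a minimal sketch, not a Flux Simulator tool: the function name transcript_id_positions is hypothetical, and it makes the same assumption as the command above, namely that the last tab-separated column consists of space-separated key/value pairs with no spaces inside the values.

```shell
# Hypothetical helper (not part of the Flux Simulator): report at which
# attribute position the key transcript_id occurs, and on how many lines.
transcript_id_positions() {
  awk 'BEGIN { FS = "\t" }
    /^#/ { next }                        # skip comment lines
    {
      n = split($NF, a, " ")             # attribute column as key/value tokens
      pos = 0                            # 0 means: no transcript_id found
      for (i = 1; i <= n; i += 2)        # keys sit at odd token indices
        if (a[i] == "transcript_id") { pos = (i + 1) / 2; break }
      count[pos]++
    }
    END {
      for (p in count) {
        label = (p + 0 == 0) ? "missing" : "position " p
        print label ": " count[p] " line(s)"
      }
    }' "$1" | sort
}

# Usage (assumes the file names from the command above):
#   transcript_id_positions genes.gtf
#   transcript_id_positions genes_clean.gtf
```

A file that is consistent in this sense yields a single "position" line. More than one "position" line, or a "missing" line, points to the kind of inconsistency described in the warning.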