
The Flux Simulator uses FASTA/FASTQ sequences at different points: as the (optional) genomic sequence input, and for the read sequences that are (optionally) produced from it. Genomic references are expected to provide one FASTA file per reference sequence (i.e., chromosome, scaffold, etc.), as described in the Sequencing Section.

 

 

FASTA formats are very commonly used because they provide simple (descriptor, sequence) tuples. A distinction is generally made between single-FASTA files, which contain a single sequence, and multi-FASTA files, which contain more than one sequence. The Flux Capacitor and Simulator programs usually output multi-FASTA files. The genomic sequence files are an exception: they must be located in a common directory, with one file chr.fa for each chr annotated in the corresponding GTF annotation file.
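
The directory layout described above can be checked before a run. The following is a minimal sketch and not part of the Flux tools: the function names gtf_chromosomes and missing_reference_files are hypothetical, and the sketch assumes that the reference files are named exactly <chr>.fa, where <chr> is the value in the first column of the GTF file.

```python
import os


def gtf_chromosomes(gtf_lines):
    """Collect the distinct sequence names (column 1) from GTF lines."""
    names = []
    for line in gtf_lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name = line.split("\t", 1)[0]
        if name not in names:
            names.append(name)
    return names


def missing_reference_files(gtf_lines, genome_dir):
    """Return the expected <chr>.fa files that are absent from genome_dir.

    Assumption of this sketch: one file named <chr>.fa per sequence name.
    """
    return [
        name + ".fa"
        for name in gtf_chromosomes(gtf_lines)
        if not os.path.isfile(os.path.join(genome_dir, name + ".fa"))
    ]
```

An empty result from missing_reference_files means that every sequence name annotated in the GTF file has a matching file in the genome directory.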

FA, FASTA format

...

The read sequence output is a multi-FASTA file in which each FASTA block consists of a description line that starts with a ">" ("greater than") symbol, followed by one or more lines containing the read sequence. If a quality/error model is provided, the closely related FASTQ format is produced instead: the ">" identifier is replaced by the "@" symbol, and each FASTA block is followed by a quality block that starts with a "+" separator and then gives the qualities of the read sequence. The description line contains the read identifier as described in the

Example

  itself. Further examples of the FASTA format can be found, for instance, here. Often, the description line is tokenized into different tags, separated by either "|" ("pipe", as in the NCBI standard) or ";" ("semicolon", as in the Pearson FASTA format). The Flux Capacitor and the Flux Simulator use these separators to divide the descriptor line into the fields of the Flux Mapped Read Descriptor.
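
To illustrate the two record layouts and the tokenized description line, the following is a minimal reading sketch. It is not the parser used by the Flux tools: the names read_records and split_description are hypothetical, FASTQ records are assumed to span exactly four lines each, and the fields are returned as plain strings without interpreting them as Flux Mapped Read Descriptor fields.

```python
import re


def read_records(lines):
    """Yield (description, sequence, quality) tuples from FASTA or FASTQ lines.

    quality is None for FASTA records. FASTQ records are assumed to span
    exactly four lines (an assumption of this sketch).
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    i = 0
    while i < len(lines):
        header = lines[i]
        if header.startswith(">"):
            # FASTA: the sequence may continue over several lines.
            i += 1
            seq = []
            while i < len(lines) and not lines[i].startswith(">"):
                seq.append(lines[i])
                i += 1
            yield header[1:], "".join(seq), None
        elif header.startswith("@"):
            # FASTQ: header, sequence, "+" separator, qualities.
            if i + 3 >= len(lines) or not lines[i + 2].startswith("+"):
                raise ValueError("malformed FASTQ record: %r" % header)
            yield header[1:], lines[i + 1], lines[i + 3]
            i += 4
        else:
            raise ValueError("unexpected line: %r" % header)


def split_description(description):
    """Split a description line on the "|" or ";" separators."""
    return re.split(r"[|;]", description)
```

Reading FASTQ in fixed four-line steps avoids misreading a quality string that happens to begin with "@" as the start of a new record.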