Section |
---|
The Flux Simulator uses FASTA/FASTQ sequences at different points: for the (optional) input of a genomic sequence and for the (optional) output of read sequences produced from it. Genomic references are expected to provide one single FASTA file per reference sequence (i.e., chromosome, scaffold, etc.), as described in the Sequencing Section. |
FASTA formats are very commonly used because they provide simple (descriptor, sequence) tuples. Generally, one distinguishes between single-FASTA files, which contain a single sequence, and multi-FASTA files, which correspondingly contain more than one sequence. The Flux Capacitor and Simulator programs usually output multi-FASTA files. The genomic sequence files are an exception: they are to be located in a common directory, with one file chr.fa for each chr annotated in the corresponding GTF annotation file.
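The directory layout described above can be checked before a run. The following is a minimal sketch and not part of the Flux tools: it assumes that the reference sequence name is the first tab-separated column of each GTF line and that the sequence files are named `<chr>.fa`. The function name `missing_chromosome_files` is hypothetical.

```python
from pathlib import Path


def missing_chromosome_files(gtf_path, genome_dir):
    """Return the sorted names of chromosomes annotated in the GTF file
    for which no <chr>.fa file exists in genome_dir."""
    chromosomes = set()
    with open(gtf_path) as gtf:
        for line in gtf:
            # Skip blank lines and comment/header lines.
            if not line.strip() or line.startswith("#"):
                continue
            # Assumption: column 1 of a GTF line holds the reference sequence name.
            chromosomes.add(line.split("\t", 1)[0])
    genome_dir = Path(genome_dir)
    return sorted(c for c in chromosomes if not (genome_dir / f"{c}.fa").is_file())
```

An empty list means that every annotated reference sequence has a matching single-FASTA file in the directory.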
...
Section |
---|
The read sequence output is a multi-FASTA file, where each FASTA block consists of a description line that starts with a ">" ("greater than") symbol, followed by one or more lines containing the read sequence. If a quality/error model is provided, the closely related FASTQ file format is produced instead: the ">" identifier is replaced by the "@" symbol, and each FASTA block is followed by a quality block, which starts with a "+" separator and then provides the qualities of the read sequence. The description line contains the read identifier as described in the |
Example
itself. Further examples of the FASTA format can be found, for instance, here. Often, the description line is tokenized into different tags, separated by either "|" ("pipe", as in the NCBI standard) or ";" ("semicolon", as in the Pearson FASTA format). The Flux Capacitor and the Flux Simulator use these separators to divide the descriptor line into the fields of the Flux Mapped Read Descriptor.
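The block structure and the tokenization of the description line can be illustrated with a short reader. This is a sketch under stated assumptions, not the parser used by the Flux tools: the function names are hypothetical, the FASTQ reader assumes exactly four lines per record (sequence and qualities on one line each), and the descriptors in the usage note are made-up placeholders rather than real Flux read identifiers.

```python
import re


def parse_fasta(lines):
    """Yield (descriptor, sequence) tuples from the lines of a (multi-)FASTA file."""
    descriptor, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new description line closes the previous block.
            if descriptor is not None:
                yield descriptor, "".join(chunks)
            descriptor, chunks = line[1:], []
        elif line and descriptor is not None:
            # The sequence may span one or multiple lines.
            chunks.append(line)
    if descriptor is not None:
        yield descriptor, "".join(chunks)


def parse_fastq(lines):
    """Yield (descriptor, sequence, qualities) tuples from FASTQ lines.

    Assumption: every record occupies exactly four lines.
    """
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, qualities = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield header[1:], sequence, qualities


def split_descriptor(descriptor):
    """Split a description line into tags at "|" or ";" separators."""
    return re.split(r"[|;]", descriptor)
```

For example, `list(parse_fasta([">r1|x|y\n", "ACGT\n", "TT\n"]))` gives `[("r1|x|y", "ACGTTT")]`, and `split_descriptor("r1|x|y")` gives `["r1", "x", "y"]`.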