.BED read locations

The default output format is BED, which describes the genomic region of each read. Reads that fall partially into the poly-A tail are truncated to the part that overlaps genomic sequence. Reads that fall completely into the poly-A tail and that are sequenced receive poly-A as a special sequence name in the generated BED file. The tag name is composed according to the FMRD (Flux Mapped Read Descriptor) convention. Multi-map information is not provided and has to be obtained by subsequently aligning the reads. If the genomic sequence is provided, FASTA/FASTQ sequences can additionally be produced (see sequencing error models). The corresponding tag equals the BED name field plus additional information about the genomic alignment, i.e., fields 1, 2, 3, 6, 11 and 12 of the BED format.

Examples:
Here is an example of a BED line that represents a spliced read:

chr1 2082 2503 chr1:1116-4272W:uc009vip.1:105:2772:695:1003:968:1003|P2 0 - 0 0 0,0,0 2 8,28 0,393

The complete region of the read spans from position 2083 (BED start coordinates are 0-based, so the value 2082 translates to 2083 in a 1-based coordinate system) to position 2503 (the BED end is the first excluded position and therefore directly translates to the last included position in a 1-based coordinate system) on the reference sequence chr1. The read alignment is split into two parts, one from 2083 to 2083+8-1=2090, and the other from 2083+393=2476 to 2476+28-1=2503. The name field denotes that the read is the downstream mate P2 of a read pair, derived from the 105th transcript copy of the annotated uc009vip.1 structure (which has a spliced length of 2772) in splicing locus chr1:1116-4272W. The fragment of this transcript that has been sequenced starts at position 695 and ends at position 1003 in the spliced sequence, relative to the annotated transcription start. From this fragment, the subarea 968-1003 relative to the annotated transcription start has generated the read.
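The coordinate arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function name `bed_blocks_1based` is hypothetical and not part of the Flux Simulator, and it assumes a whitespace-separated BED line with all 12 standard columns.

```python
from typing import List, Tuple


def bed_blocks_1based(bed_line: str) -> List[Tuple[int, int]]:
    """Return the (start, end) of each alignment block, 1-based and inclusive.

    Hypothetical helper for illustration; assumes a 12-column BED line.
    """
    fields = bed_line.split()
    if len(fields) < 12:
        raise ValueError("expected a BED line with 12 columns")

    chrom_start = int(fields[1])  # 0-based start of the whole read region
    block_count = int(fields[9])
    # Some tools write a trailing comma in the list columns, so skip empties.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("block count does not match block sizes/starts")

    # Block starts are offsets from chrom_start; +1 converts to 1-based.
    return [
        (chrom_start + offset + 1, chrom_start + offset + size)
        for offset, size in zip(starts, sizes)
    ]


example = (
    "chr1 2082 2503 "
    "chr1:1116-4272W:uc009vip.1:105:2772:695:1003:968:1003|P2 "
    "0 - 0 0 0,0,0 2 8,28 0,393"
)
for start, end in bed_blocks_1based(example):
    print(start, end)
```

For the example line this yields the two blocks 2083-2090 and 2476-2503, matching the positions derived in the text.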

The Read Identifier

The read identifiers produced by the Flux Simulator encode some of the information about where the read originated in the simulation. In the BED format, these read identifiers are found in the 4th column; in FASTA/FASTQ files, they correspond to the identifier line without the initial '>' or '@' character, respectively. Simulated read identifiers are colon-separated, and each token carries a specific piece of information.

Token Number	Name	Type	Example	Description
1	Reference ID	String	chr1	Name of the reference sequence, usually the chromosome, a read has been sequenced from
2	Locus ID	[0-9]+\-[0-9]+[WC]	4847775-4887990W	Genomic start and end position, and the strand of the locus from which the read has been obtained; W denotes the Watson strand (i.e., transcribed RNA will have the same directionality as the genomic reference), C denotes the Crick strand (i.e., transcribed RNA sequences are reverse-complemented substrings of the genomic reference sequence)
3	Transcript ID	String	NM_001159750	Identifier of the transcript form from which the read has been sequenced
4	Molecule Nr	Integer	1	Number, i.e., identifier, of the specific molecule that has been simulated from the transcript form
5	Annotated Length	Integer	2668	Length of the transcript form as annotated in the reference annotation, after removal of introns and without considering simulated variations of the transcription start or the poly-A tail
6	Fragment Start	Integer	917	Start position of the fragment from which the read has been derived. Coordinates are provided relative to the annotated transcription start, excluding introns (i.e., relative positions in the processed transcripts). Negative values can occur where in silico TSS variations move the transcription start site further upstream; values greater than the annotated length lie in the poly-A tail.
7	Fragment End	Integer	1137	End position of the fragment from which the read has been derived. Coordinates are provided relative to the annotated transcription start, excluding introns (i.e., relative positions in the processed transcripts). Negative values can occur where in silico TSS variations move the transcription start site further upstream; values greater than the annotated length lie in the poly-A tail.
8	Relative Orientation	[AS]	S	Orientation of the read relative to the direction of transcription. S stands for sense, A for anti-sense. Note that this is not the absolute directionality with respect to the reference chromosome sequence, e.g., an anti-sense read of a form produced from a locus that is transcribed from the Crick strand reproduces a substring in the same orientation as the reference genomic sequence.

chr1:4847775-4887990W:NM_001159750:1:2668:917:1137:S/2
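The token table can be turned into a small parser. The sketch below rests on several assumptions: the names `FluxReadId`, `parse_read_id` and `genomic_strand` are hypothetical, the trailing `/1` or `/2` is taken to be a mate number (the table does not describe it), and reference and transcript identifiers are assumed to contain no colons.

```python
import re
from typing import NamedTuple, Optional


class FluxReadId(NamedTuple):
    """Tokens 1-8 of the table; the locus ID is split into three fields."""
    reference: str
    locus_start: int
    locus_end: int
    locus_strand: str      # 'W' (Watson) or 'C' (Crick)
    transcript: str
    molecule: int
    annotated_length: int
    fragment_start: int    # may be negative (upstream of the annotated TSS)
    fragment_end: int
    orientation: str       # 'S' (sense) or 'A' (anti-sense)
    mate: Optional[int]    # assumption: '/1' or '/2' suffix is the mate


def parse_read_id(read_id: str) -> FluxReadId:
    read_id = read_id.lstrip(">@")  # tolerate FASTA/FASTQ header prefixes
    mate = None
    suffix = re.search(r"/([12])$", read_id)
    if suffix:
        mate = int(suffix.group(1))
        read_id = read_id[:suffix.start()]

    tokens = read_id.split(":")
    if len(tokens) != 8:
        raise ValueError(f"expected 8 colon-separated tokens, got {len(tokens)}")
    locus = re.fullmatch(r"([0-9]+)-([0-9]+)([WC])", tokens[1])
    if locus is None:
        raise ValueError(f"malformed locus ID: {tokens[1]!r}")
    if tokens[7] not in ("S", "A"):
        raise ValueError(f"malformed orientation: {tokens[7]!r}")

    return FluxReadId(
        tokens[0], int(locus.group(1)), int(locus.group(2)), locus.group(3),
        tokens[2], int(tokens[3]), int(tokens[4]),
        int(tokens[5]), int(tokens[6]), tokens[7], mate,
    )


def genomic_strand(rid: FluxReadId) -> str:
    """'+' if the read has the same orientation as the reference sequence.

    Sense on Watson and anti-sense on Crick both read like the reference.
    """
    same = (rid.locus_strand == "W") == (rid.orientation == "S")
    return "+" if same else "-"


rid = parse_read_id("chr1:4847775-4887990W:NM_001159750:1:2668:917:1137:S/2")
print(rid.transcript, rid.fragment_start, rid.fragment_end, genomic_strand(rid))
```

Keeping the relative orientation and the locus strand as separate fields mirrors the table, and `genomic_strand` makes the note in row 8 explicit: the absolute orientation follows only from combining the two.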
