Parameters

Name             Description

LIB_FILE_NAME    input file that contains the set of fragments from which reads are sampled

SEQ_FILE_NAME    output file to which the genomic positions of the sequenced reads are written

Overview

During the "Final Library Preparation" step of the Flux Simulator pipeline the cDNA fragments are optionally filtered by gel segregation, and also optionally amplified by a PCR (i.e., polymerase chain reaction).

Details

Requires: LIB_FILE_NAME, READ_LENGTH, READ_NUMBER, PAIRED_END, FASTQ, GEN_DIR, ERR_FILE_NAME, QTHOLD
Outputs: SEQ_FILE_NAME

This step produces about READ_NUMBER sequencing reads from the library in LIB_FILE_NAME. The simulator iterates over the input annotation and maps the READ_LENGTH long stretches at the ends of the cDNA molecules in the library LIB_FILE_NAME to genomic coordinates. If READ_LENGTH exceeds the length of a cDNA molecule, the read is truncated to the length of the molecule. For each fragment a Bernoulli trial is carried out: the fragment is sequenced if $r < p$, where $r$ is a random variable sampled uniformly from $[0, 1)$ and $p$ is the sequencing probability

$$p = \frac{\text{READ\_NUMBER}}{\text{LIBRARY\_NUMBER}} \qquad (7)$$

Consequently, the number of reads (or read pairs, respectively) that are generated never exceeds LIBRARY_NUMBER, the number of cDNA molecules in the library. In the case of single-end sequencing, one randomly chosen end of each cDNA molecule that passed the Bernoulli trial is sequenced; if PAIRED_END is set, both ends are sequenced. The Flux Simulator reports the number of reads and their fraction (relative to the planned number), the number of splicing loci represented in these reads (and the ratio they constitute of the total number of expressed loci), and the number of transcripts (and the ratio of the total number of expressed spliceforms, respectively).
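The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulator's actual code: the function name `sample_reads`, the representation of a fragment as a `(start, end)` pair of 1-based inclusive positions, and the 50/50 choice of the end in single-end mode are all hypothetical.

```python
import random


def sample_reads(fragments, read_number, read_length, paired_end=False, rng=None):
    """Select fragments by a Bernoulli trial and cut reads from their ends.

    fragments: list of (start, end) tuples, 1-based inclusive (assumed layout).
    Returns a list of tuples: one read per tuple in single-end mode,
    two reads per tuple (left end, right end) in paired-end mode.
    """
    rng = rng or random.Random()
    library_number = len(fragments)
    if library_number == 0:
        return []

    # Sequencing probability p = READ_NUMBER / LIBRARY_NUMBER.
    # If p >= 1, every fragment passes, because r is always below 1.
    p = read_number / library_number

    reads = []
    for start, end in fragments:
        r = rng.random()          # uniform in [0, 1)
        if not r < p:             # Bernoulli trial failed: fragment is skipped
            continue

        # A read can never be longer than the molecule it comes from.
        length = min(read_length, end - start + 1)
        left = (start, start + length - 1)
        right = (end - length + 1, end)

        if paired_end:
            reads.append((left, right))
        else:
            # Assumption: both ends are equally likely in single-end mode.
            reads.append((left if rng.random() < 0.5 else right,))
    return reads
```

Because each fragment is tried exactly once, the expected number of selected fragments is $p \cdot$ LIBRARY_NUMBER, which equals READ_NUMBER as long as $p \le 1$, and the result can never contain more entries than there are fragments.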

Please note that the final number of molecules you obtain puts an upper limit on your sequencing capacity, because oversampling a small number of molecules does not increase the diversity of the produced reads. For example, if you produced 1000 reads from 10 molecules left after RT/fragmentation, you would find groups of about 100 reads that map to identical locations. Upon termination, the step copies the .frg file from the temporary directory to the project directory and updates columns 7 and 8 of the .PRO file.

The default output format is BED, which describes the genomic region of the read. Reads that fall partially into the poly-A tail are truncated to the part that consists of genomic sequence. Reads that fall completely into the poly-A tail and that are sequenced receive poly-A as a special sequence name in the generated BED file. The tag name is composed according to the FMRD (Flux Mapped Read Descriptor) convention. Multi-map information is not provided and has to be obtained by a subsequent alignment of the reads. If the genomic sequence is provided, FASTA/FASTQ sequences can be produced in addition (see sequencing error models). The corresponding tag equals the BED name field plus the additional information about the genomic alignment, i.e., fields 1, 2, 3, 6, 11 and 12 of the BED format.

Examples:
Here is an example of a BED line that represents a spliced read:

chr1 2082 2503 chr1:1116-4272W:uc009vip.1:105:2772:695:1003:968:1003|P2 0 - 0 0 0,0,0 2 8,28 0,393

The complete region of the read spans from position 2083 (note the 0-based start in BED format) to position 2503 (which is the first excluded position in BED format and therefore directly translates to the last included position in a 1-based coordinate system) on the reference sequence chr1. The read alignment is split into two parts, one from 2083 to 2083+8-1=2090, and the other from 2083+393=2476 to 2476+28-1=2503. The name field denotes that the read is the downstream mate P2 of a read pair, derived from the 105th transcript copy of the annotated uc009vip.1 structure (which has a spliced length of 2772) in the splicing locus chr1:1116-4272W. The fragment of this transcript that has been sequenced starts at position 695 and ends at position 1003 in the spliced sequence, relative to the annotated transcription start. From this fragment, the subarea 968-1003, again relative to the annotated transcription start, has generated the read.
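The coordinate arithmetic above can be checked with a short Python sketch. It parses the example BED line, converts the blocks to 1-based inclusive coordinates, and splits the name field. The helper names `parse_bed12` and `parse_read_name` are hypothetical, and the layout of the name field is inferred from this single example only; it is not a full implementation of the FMRD convention, and it assumes that the transcript identifier contains no colon.

```python
def parse_bed12(line):
    """Parse one 12-column BED line into a dict with 1-based inclusive blocks."""
    f = line.split()
    chrom, start, end = f[0], int(f[1]), int(f[2])
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    offsets = [int(x) for x in f[11].rstrip(",").split(",")]
    # BED starts are 0-based, so the first included 1-based position is start + 1.
    blocks = [(start + off + 1, start + off + size)
              for off, size in zip(offsets, sizes)]
    return {"chrom": chrom, "start_1based": start + 1, "end_1based": end,
            "name": f[3], "strand": f[5], "blocks": blocks}


def parse_read_name(name):
    """Split the read name as laid out in the example (assumed field order)."""
    descriptor, _, mate = name.partition("|")
    parts = descriptor.split(":")
    return {
        "locus": ":".join(parts[:2]),       # e.g. chr1:1116-4272W
        "transcript": parts[2],
        "copy": int(parts[3]),              # number of the transcript copy
        "spliced_length": int(parts[4]),
        "fragment": (int(parts[5]), int(parts[6])),
        "read": (int(parts[7]), int(parts[8])),
        "mate": mate or None,               # "P2" for the downstream mate
    }


line = ("chr1 2082 2503 "
        "chr1:1116-4272W:uc009vip.1:105:2772:695:1003:968:1003|P2 "
        "0 - 0 0 0,0,0 2 8,28 0,393")
bed = parse_bed12(line)
info = parse_read_name(bed["name"])
print(bed["blocks"])   # [(2083, 2090), (2476, 2503)]
print(info["read"])    # (968, 1003)
```

The two blocks cover 8 + 28 = 36 positions, which matches the length of the subarea 968-1003 in the spliced sequence.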