Hi everyone,

I've been running the Flux Simulator to generate 20M reads, 100bp, paired-end. The total running time is rather long:

[END] I finished, took me 37099 sec.

That'd be OK with me, but I've noticed that most of the running time (~90%) is spent on the sequencing step, and I was wondering whether this is expected.

thanks

David


1 Comment

  1. Ho ho ho,

    the sequencing step, especially when reproducing the actual read sequences rather than only their genomic locations, is dominated by I/O overhead: many bytes of the genomic sequence have to be read, and many bytes of the (possibly mutated) read sequences have to be written. To get an idea, please have a look at the profiling output provided for the transition to the sequencing step in ticket BARNA-117.

    The efficiency of the simulator in the sequencing step therefore depends directly on the underlying hardware/system configuration. That said, approx. 10h for producing 2 gigabases does seem like too much overhead; please consider whether some of the following points can improve your situation:

    • Files with the chromosome sequences should be located on the local file system. Accessing a shared file system is likely to be much slower than reading bytes over the local bus, and at high throughput the difference becomes substantial.
    • Reserved RAM size should be larger than the longest chromosome. The Flux Simulator is implemented such that all reads from a given chromosome are produced in one batch, for which the complete chromosome sequence is loaded into main memory (RAM). Warning: if not enough RAM is reserved for the execution, the program will access the disk for every read. Consult Section 3.1 - System Requirements to change the default RAM reservation for the Flux Simulator, i.e., by setting the environment variable FLUX_MEM.
    • Provide a temporary directory in the local file system. The output of the sequencing step is first written to a file in the temporary directory (TMP_DIR) before being transferred to its final destination (MAPPING_FILE). Although it would be optimal to have both locations set to local folders, parameter TMP_DIR has a higher impact on the efficiency of the sequencing step than MAPPING_FILE. Refer to Section Parameters - PAR format and see the examples in Appendix B - Frequently Asked Questions (FAQ) for setting these parameters accordingly.
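
    As a rough pre-flight check for the second point, the sketch below compares the size of the largest chromosome file with the amount of reserved memory. It is not part of the Flux Simulator: the function names, the ".fa"/".fasta" suffixes, and the "8G"-style memory format are assumptions made for illustration, so please check Section 3.1 for the format FLUX_MEM actually accepts.

```python
import os


def largest_sequence_file(directory, suffixes=(".fa", ".fasta")):
    """Return (file name, size in bytes) of the largest sequence file.

    The file size only approximates the chromosome length: a FASTA file
    also contains a header line and line breaks.
    Returns (None, 0) if no matching file is found.
    """
    best_name, best_size = None, 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.lower().endswith(suffixes):
            size = os.path.getsize(path)
            if size > best_size:
                best_name, best_size = name, size
    return best_name, best_size


def parse_mem(value):
    """Parse a memory string such as '512M' or '8G' into bytes.

    ASSUMPTION: this K/M/G suffix format is hypothetical; the format
    expected by FLUX_MEM may differ.
    """
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    value = value.strip().upper()
    if value and value[-1] in units:
        return int(float(value[:-1]) * units[value[-1]])
    return int(value)


def memory_is_sufficient(directory, reserved):
    """True if the reserved memory exceeds the largest chromosome file."""
    _, size = largest_sequence_file(directory)
    return parse_mem(reserved) > size


# Example use (paths and values are placeholders):
#   reserved = os.environ.get("FLUX_MEM", "1G")
#   print(memory_is_sufficient("/local/genome/chromosomes", reserved))
```

    If the check returns False, the reserved memory is probably too small to hold the longest chromosome, and the sequencing step may fall back to disk access for every read.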

    You may want to watch ticket BARNA-117 to get notified of updates on the topic.

    Currently we are still busy with deliveries on the Eastern routes, but who knows, maybe we'll have a super-fast new hard disk for you tonight? Just in case, please remember to hang a 3.5" sock by your chimney. Merry Christmas :)