This page describes the command lines and steps used to generate the results of the AstaFunk paper.
The gene annotation files were downloaded using the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables). The second column links to the genome assembly downloads at the UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/downloads.html).
|Species||Assembly (and link to download)||Group||Track||Table||Description||Update|
|C. elegans||WS190/ce6||Gene and Gene Predictions||Wormbase Genes||sangerGene|
Sanger Gene predictions from the Wormbase version WS190 files downloaded from the Sanger Institute FTP site.
|sangerGeneToWBGeneID||File with gene IDs from the Wormbase Genes track at UCSC and the respective gene IDs from Wormbase. Download here: ce6_sanger_wormbase_map.txt||2008-06-04|
|D. melanogaster||BDGP R5/dm3||Gene and Gene Predictions||Flybase Genes||flyBaseGene|
Protein-coding genes annotated by FlyBase and the Drosophila Heterochromatin Genome Project (DHGP). Annotations on both heterochromatin and euchromatic
|flyBase2004xref||File with gene IDs from the Flybase Genes track at UCSC and the respective gene IDs from Flybase. Download here: dm3_bdgp_flybase_map.txt||2008-10-21|
|Gene and Gene Predictions||RefSeq genes||refGene|
Known human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq)
|Gene and Gene Predictions||UCSC genes||knownGene|
Set of gene predictions based on data from RefSeq, GenBank, CCDS, Rfam, and the tRNA Genes track.
|Gene and Gene Predictions||GENCODE Genes V19||Comprehensive (wgEncodeGencodeCompV19)|
High-quality manual annotations merged with evidence-based automated annotations across the entire human genome generated by the GENCODE project. The GENCODE gene set presents a full merge between HAVANA manual annotation process and Ensembl automatic annotation pipeline.
Transcript Accession Number
(Ensembl release 87)
|Version||Link to Download|
HMMER (hmmsearch) is used to create reference domain files.
To obtain AStalavista events only for coding sequence structures, the gene annotation must be pre-processed:
This command line creates a GTF file that contains the CDS entries of the original file and, for each of them, a duplicate entry in which the feature column is changed from CDS to exon while all remaining fields are preserved.
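The pre-processing step described above can be sketched in Python as follows. This is a minimal illustration, not the original command: the function name `add_exon_for_cds` is hypothetical, the input is assumed to be a tab-separated GTF with the feature type in column 3, and non-CDS lines are assumed to be dropped.

```python
def add_exon_for_cds(gtf_lines, new_feature="exon"):
    """Yield every CDS line, followed by a copy whose feature is `new_feature`.

    Assumptions (not confirmed by the source): lines are tab-separated GTF
    records, comments start with '#', and non-CDS records are discarded.
    """
    for line in gtf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        yield line                      # original CDS entry, unchanged
        fields[2] = new_feature         # only the feature column differs
        yield "\t".join(fields)
```

A typical use would be to iterate over an open annotation file and write each yielded line to a new GTF file.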
Create a multi-FASTA file with the sequence of the reference transcript of each gene, i.e. the transcript with the longest coding sequence for alternatively spliced (AS) genes and the respective transcript for non-AS genes.
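As a rough sketch of how the reference transcript could be chosen, the Python function below sums the CDS length of each transcript and keeps the longest one per gene. It only selects transcript IDs and does not extract sequences. The name `longest_cds_transcripts` is hypothetical, and the code assumes standard GTF attributes `gene_id` and `transcript_id`; ties are resolved in favor of the transcript seen first.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gene_id "ABC";
_ATTR = re.compile(r'(\S+) "([^"]*)"')

def longest_cds_transcripts(gtf_lines):
    """Return {gene_id: transcript_id} with the longest total CDS per gene."""
    cds_len = defaultdict(int)  # (gene_id, transcript_id) -> summed CDS length
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        attrs = dict(_ATTR.findall(fields[8]))
        key = (attrs["gene_id"], attrs["transcript_id"])
        # GTF coordinates are 1-based and inclusive.
        cds_len[key] += int(fields[4]) - int(fields[3]) + 1

    best = {}
    for (gene, transcript), length in cds_len.items():
        if gene not in best or length > best[gene][1]:
            best[gene] = (transcript, length)
    return {gene: transcript for gene, (transcript, _) in best.items()}
```

Genes with a single transcript are returned with that transcript, which matches the non-AS case.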
Instead of using the whole Pfam-A.hmm database to search for protein domains, you can fetch only the HMM models for a specific reference domain file:
The resulting HMM database is specific to the alternatively spliced (AS) reference transcripts of the RefSeq annotation.
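To illustrate what fetching a subset of models involves, here is a minimal Python sketch that copies only selected profiles from a HMMER3 text file. It is not the original command; HMMER itself ships an `hmmfetch` tool whose `-f` option retrieves the profiles listed in a key file. The function name `filter_hmm_profiles` and the `wanted_names` set are hypothetical, and the sketch assumes each profile carries a `NAME` header line and ends with a `//` line.

```python
def filter_hmm_profiles(hmm_lines, wanted_names):
    """Yield the lines of every profile whose NAME is in `wanted_names`.

    Assumes HMMER3 ASCII format: one record per profile, with a
    'NAME  <name>' header line and a terminating '//' line.
    """
    record, name = [], None
    for line in hmm_lines:
        record.append(line)
        if line.startswith("NAME"):
            name = line.split()[1]
        elif line.startswith("//"):
            if name in wanted_names:
                yield from record
            record, name = [], None   # start collecting the next profile
```

The set of names would come from the domain IDs listed in the reference domain file, whose exact layout is not specified here.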
The naive approach to searching for AS domains consists of scanning the whole coding sequence of the alternative transcripts. In contrast, the AstaFunk approach scans only the coding sequence regions flanking the alternative splicing events, extending the begin and end positions of each event by a specific window Δπ for each HMM π from Pfam-A.hmm.
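The window extension can be sketched as below. This is only an illustration under stated assumptions: positions are 1-based and inclusive, the window Δπ is given as an integer in the same coordinate units as the event, and the region is clamped to the sequence bounds. The names `scan_region` and `regions_per_hmm` are hypothetical and do not come from AstaFunk.

```python
def scan_region(event_begin, event_end, delta, seq_length):
    """Extend an event by `delta` on both sides, clamped to [1, seq_length]."""
    start = max(1, event_begin - delta)
    end = min(seq_length, event_end + delta)
    return start, end

def regions_per_hmm(event_begin, event_end, windows, seq_length):
    """Map each HMM to its scan region; `windows` is {hmm_name: delta}."""
    return {
        hmm: scan_region(event_begin, event_end, delta, seq_length)
        for hmm, delta in windows.items()
    }
```

Because each HMM has its own window, the same event yields a different scan region for each profile, and only that region is searched instead of the full coding sequence.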
|Transcript annotation (GTF)|
|Exon read count|
|Read counts for each exon across samples||https://gtexportal.org/home/datasets|
|Genome assembly||GRCh37/hg19||H. sapiens genome assembly||http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/|
|Pfam domains v28||Pfam-A.hmm||ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam28.0/Pfam-A.hmm.gz|
|GTEx Samples and Tissues||samples_tissues||Tab-separated file with GTEx samples and respective tissue.||Download|
Domain clusters are predictions of the same domain that overlap in their genomic coordinates. We assumed the highest-scoring prediction to represent the wild type of the domain in the gene. For each alternative prediction of the domain in a cluster, we then computed the "domain conservation" as the ratio between the domain score assigned to the alternatively spliced domain and the wild-type score.
The file output.txt is the output of the default run of AstaFunk; each line is a domain prediction. Using awk, we create a hash data structure whose key is built from the following fields (columns of output.txt): $2 (locus ID, e.g., the gene ID or the list of transcripts overlapping the locus), $3 (domain cluster), $5 (domain ID) and $15 (domain profile length). The value stored in this data structure is the "domain conservation". This command prints out the domain name, length and domain conservation for each cluster.
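The computation described above can be sketched in Python as follows. It is not the original awk command. The function name `domain_conservation` is hypothetical, the file is assumed to be tab-separated without a header line, and the column holding the domain score is not stated in the text, so it is passed in as the `score_col` parameter (1-based, like awk fields).

```python
from collections import defaultdict

def domain_conservation(lines, score_col, sep="\t"):
    """Return (domain, profile_length, conservation) per alternative prediction.

    The key mirrors the awk fields $2 (locus), $3 (cluster), $5 (domain ID)
    and $15 (profile length). `score_col` is an assumed 1-based column index
    for the domain score. The highest score in each cluster is treated as the
    wild type and is not reported itself.
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split(sep)
        key = (f[1], f[2], f[4], f[14])
        clusters[key].append(float(f[score_col - 1]))

    rows = []
    for (_, _, domain, length), scores in clusters.items():
        wild_type = max(scores)
        if wild_type <= 0:
            continue  # a ratio is not meaningful without a positive reference
        alternatives = sorted(scores, reverse=True)[1:]
        rows.extend((domain, int(length), s / wild_type) for s in alternatives)
    return rows
```

Clusters with a single prediction produce no rows, since there is no alternative to compare against the wild type.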