Table of Contents |
---|
In this page, we describe the command lines and steps to generate the results of the AstaFunk paper.
The gene annotation files were downloaded using the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables). The 2nd column has the links to download the genome assemblies from UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/downloads.html)
Specie | Assembly (and link to download) | Group | Track | Table | Description | Update |
---|---|---|---|---|---|---|
C. elegans | WS190/ce6 | Gene and Gene Predictions | Wormbase Genes | sangerGene | Sanger Gene predictions from the Wormbase version WS190 files downloaded from the Sanger Institute FTP site. | 2008-06-03 |
sangerGeneToWBGeneID | File with gene Id's from Wormbase Genes Track from UCSC and the respective gene Id's from Wormbase. Download here: ce6_sanger_wormbase_map.txt | 2008-06-04 | ||||
D. melanogaster | BDGP R5/dm3 | Gene and Gene Predictions | Flybase Genes | flyBaseGene | Protein-coding genes annotated by FlyBase and the Drosophila Heterochromatin Genome Project (DHGP). Annotations on both heterochromatin and euchromatic | 2008-10-21 |
flyBase2004xref | File with gene Id's from Flybase Genes Track from UCSC and the respective gene Id's from Flybase. Download here: dm3_bdgp_flybase_map.txt | 2008-10-21 | ||||
H. sapiens |
| Gene and Gene Predictions | RefSeq genes | refGene | Known human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq) | 2015-09-07 |
Gene and Gene Predictions | UCSC genes | knownGene | Set of gene predictions based on data from RefSeq, GenBank, CCDS, Rfam, and the tRNA Genes track. | 2013-06-14 | ||
Gene and Gene Predictions | GENCODE Genes V19 | Comprehensive (wgEncodeGencodeCompV19) | High-quality manual annotations merged with evidence-based automated annotations across the entire human genome generated by the GENCODE project. The GENCODE gene set presents a full merge between HAVANA manual annotation process and Ensembl automatic annotation pipeline. | 2013-12-13 |
Gene ID | Transcript Accession Number (Ensembl release 87) | |
---|---|---|
Hs.133034 (ZFP69B) | ENST00000361584.4 | Link |
ZNF263 | ENST00000219069.5 | Link |
ZNF174 | ENST00000268655.4 | Link |
ZNF24 | ENST00000261332.10 | Link |
ZNF317 | ENST00000247956.10 | Link |
ZNF74 | ENST00000611540.4 | Link |
ZNF85 | ENST00000345030.6 | Link |
EZFIT (ZNF71) | ENST00000328070.10 | Link |
ZNF222 | ENST00000391960.3 | Link |
Version | Link to Download | |
---|---|---|
Pfam-A | 28 | ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam28.0/Pfam-A.hmm.gz |
Pfam-A | 27 | ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/Pfam-A.hmm.gz |
HMMER (hmmsearch) is used to create reference domain files.
Version | Download |
---|---|
v3.1b2 | http://eddylab.org/software/hmmer3/3.1b2/hmmer-3.1b2-linux-intel-x86_64.tar.gz |
To obtain AStalavista events only for coding sequence structures, the gene annotation must be pre-processed:
Code Block |
---|
~$ cat ce6_original.gtf | awk -v FS="\t" -v OFS="\t" '{if($3=="CDS") {print $0; $3="exon"; print $0}}' > ce6.gtf |
This command line creates a GTF file with the same CDS entries from the original file, but duplicating the theses entries changing the feature column CDS to EXON, preserving the remaining fields.
Create a multi-fasta of sequences of the reference transcript of each alternatively spliced gene, i.e. the AS transcript with the longest coding sequence and the respective transcript of non-AS genes.
Code Block |
---|
Create a multi-fasta of sequences of the reference transcript of each alternatively spliced gene, i.e. the AS transcript with the longest coding sequence and the respective transcript of non-AS genes.
Code Block |
---|
astalavista -t astafunk --tref --genome ~/genome/worm/ce6/ --gtf ~/genome/worm/ce6/annotation/ce6.gtf > ce6_ref_transcripts.fa astalavista -t astafunk --tref --genome ~/genome/fly/dm3/ --gtf ~/genome/fly/dm3/annotation/dm3.gtf > dm3_ref_transcripts.fa astalavista -t astafunk --tref --genome ~/genome/human/hg19/ --gtf ~/genome/human/hg19/annotation/refseq.gtf > refseq_ref_transcripts.fa astalavista -t astafunk --tref --genome ~/genome/humanworm/hg19ce6/ --gtf ~/genome/humanworm/hg19ce6/annotation/ucscce6.gtf > ucscce6_ref_transcripts.fa astalavista -t astafunk --tref --genome ~/genome/humanfly/hg19dm3/ --gtf ~/genome/humanfly/hg19dm3/annotation/gencodedm3.gtf > gencodedm3_ref_transcripts.fa |
Code Block |
---|
hmmsearch astalavista -t astafunk --cut_gatref --domtblout ce6_ref_domains.txt Pfam-A.hmm ce6genome ~/genome/human/hg19/ --gtf ~/genome/human/hg19/annotation/refseq.gtf > refseq_ref_transcripts.fa hmmsearchastalavista -t astafunk --cut_gatref --genome ~/genome/human/hg19/ --gtf ~/genome/human/hg19/annotation/ucsc.gtf > ucsc_ref_transcripts.fa astalavista -t astafunk --tref --genome ~/genome/human/hg19/ --gtf ~/genome/human/hg19/annotation/gencode.gtf > gencode_ref_transcripts.fa |
Code Block |
---|
domtblout dm3_ref_domains.txt Pfam-A.hmm dm3_ref_transcripts.fa hmmsearch --cut_ga --domtblout refseq_ref_domains.txt Pfam-A.hmm refseq_ref_transcripts.fa hmmsearch --cut_ga --domtblout ucsc_ref_domains.txt Pfam-A.hmm ucsc_ref_transcripts.fa hmmsearch --cut_ga --domtblout gencodece6_ref_domains.txt Pfam-A.hmm gencodece6_ref_transcripts.fa |
Tip |
---|
Code Block | |
---|---|
Instead to use the whole Pfam-A.hmm database to search protein domains, you can fetch only HMM models for a specific reference domain file: Code Block | hmmsearch --cut_ga --domtblout dm3_ref_domains.txt| awk '{print $5}' | sort | uniq | hmmfetch -f Pfam-A.hmm dm3_ref_transcripts.fa hmmsearch --cut_ga --domtblout refseq_ref_domains.txt Pfam-A.hmm- > as_refseq.hmm The resulting HMM database is specific for the (AS, alternatively spliced) reference transcripts of RefSeq annotation. |
astalavista -t astafunk --cpu 20 --genome ~/genome/worm/ce6/ --gtf ~/genome/worm/ce6/annotation/ce6.gtf --hmmrefseq_ref_transcripts.fa hmmsearch --cut_ga --domtblout ucsc_ref_domains.txt Pfam-A.hmm ucsc_ref_transcripts.fa hmmsearch --cut_ga --domtblout gencode_ref_domains.txt Pfam-A.hmm --reference ce6gencode_ref_domains.txt > as_ce6.output astalavista -t astafunk --cpu 20 --genome ~/genome/fly/dm3/ --gtf ~/genome/fly/dm3/annotation/dm3.gtf --hmm Pfam-A.hmm --reference dm3transcripts.fa |
Tip | ||||
---|---|---|---|---|
Instead to use the whole Pfam-A.hmm database to search protein domains, you can fetch only HMM models for a specific reference domain file:
The resulting HMM database is specific for the (AS, alternatively spliced) reference transcripts of RefSeq annotation. |
Code Block |
---|
output astalavista -t astafunk --cpu 20 --genome ~/genome/humanworm/hg19ce6/ --gtf ~/genome/humanworm/hg19ce6/annotation/ucscce6.gtf --hmm Pfam-A.hmm --reference ucscce6_ref_domains.txt > as_ucscce6.output astalavista -t astafunk --cpu 20 --genome ~/genome/humanfly/hg19dm3/ --gtf ~/genome/humanfly/hg19dm3/annotation/gencodedm3.gtf --hmm Pfam-A.hmm --reference gencodedm3_ref_domains.txt > as_gencode.output |
The Näive approach to search AS domains consists of scanning the whole coding sequence of the alternative transcripts. Differently, AstaFunk approach only scans the coding sequence regions flanking the alternative splicing events, extending the begin and end position of the events by a specific window Δπ for each HMM π from Pfam-A.hmm.
Code Block |
---|
dm3.output astalavista -t astafunk --cpu 20 --genome ~/genome/human/hg19/ --gtf ~/genome/human/hg19/annotation/refseq.gtf --hmm Pfam-A.hmm --reference refseq_ref_domains.txt > as_refseq.output astalavista -t astafunk --naive --cpu 20 --genome ~/genome/wormhuman/ce6hg19/ --gtf ~/genome/wormhuman/ce6hg19/annotation/ce6ucsc.gtf --hmm Pfam-A.hmm --reference ce6ucsc_ref_domains.txt > as_ucsc.output astalavista -t astafunk --naive --cpu 20 --genome ~/genome/flyhuman/dm3hg19/ --gtf ~/genome/flyhuman/dm3hg19/annotation/dm3gencode.gtf --hmm Pfam-A.hmm --reference dm3gencode_ref_domains.txt > as_gencode.output |
The Näive approach to search AS domains consists of scanning the whole coding sequence of the alternative transcripts. Differently, AstaFunk approach only scans the coding sequence regions flanking the alternative splicing events, extending the begin and end position of the events by a specific window Δπ for each HMM π from Pfam-A.hmm.
Code Block |
---|
astalavista -t astafunk --naive --cpu 20 --genome ~/genome/human/hg19/ --gtf ~/genome/human/hg19/annotation/refseq.gtf --hmm Pfam-A.hmm --reference refseq_ref_domains.txt astalavista -t astafunk --naive --cpu 20 --genome ~/genome/humanworm/hg19ce6/ --gtf ~/genome/humanworm/hg19ce6/annotation/ucscce6.gtf --hmm Pfam-A.hmm --reference ucscce6_ref_domains.txt astalavista -t astafunk --naive --cpu 20 --genome ~/genome/humanfly/hg19dm3/ --gtf ~/genome/humanfly/hg19dm3/annotation/gencodedm3.gtf --hmm Pfam-A.hmm --reference gencodedm3_ref_domains.txt |
Code Block | ||
---|---|---|
| ||
hmmsearch --cut_ga --domtblout znf_genes_hmmer_output database.hmm znf_genes.fa |
Code Block | ||
---|---|---|
| ||
astalavista -t astafunk --naive --cpu 20 --genome ~/genome/human/hg19/ --gtf ~/genome/human/hg19/annotation/refseq.gtf --hmm Pfam-A.hmm --reference refseq_ref_domains.txt astalavista -t astafunk --testnaive --localcpu 20 --fa znf_genes.fagenome ~/genome/human/hg19/ --gtf ~/genome/human/hg19/annotation/ucsc.gtf --hmm databasePfam-A.hmm >--reference znfucsc_predictions_astafunk |
File | Name | Description | Download |
---|---|---|---|
Transcript annotation (GTF) | gencode.v19.transcripts.patched_contigs.gtf.gz | GENCODE annotation | https://gtexportal.org/home/datasets |
Exon read count | GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_exon_reads.txt.gz | Read counts for each exon across samples | https://gtexportal.org/home/datasets |
Genome assembly | GRCh37/hg19 | H. sapiens genome assembly | http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/ |
Pfam domains v28 | Pfam-A.hmm | ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam28.0/Pfam-A.hmm.gz | |
GTEx Samples and Tissues | samples_tissues | Tab-separated file with GTEx samples and respective tissue. | Download |
ref_domains.txt
astalavista -t astafunk --naive --cpu 20 --genome ~/genome/human/hg19/ --gtf ~/genome/human/hg19/annotation/gencode.gtf --hmm Pfam-A.hmm --reference gencode_ref_domains.txt |
Code Block | ||
---|---|---|
| ||
hmmsearch --cut_ga --domtblout znf_genes_hmmer_output database.hmm znf_genes.fa |
Code Block | ||
---|---|---|
| ||
astalavista -t astafunk --test --local --fa znf_genes.fa --hmm database.hmm > znf_predictions_astafunk |
File | Name | Description | Download |
---|---|---|---|
Transcript annotation (GTF) | gencode. |
title | Obtain GTF annotation of the target genes |
---|
v19.transcripts.patched_contigs.gtf.gz | GENCODE annotation | https://gtexportal.org/home/datasets | |
Exon read count | GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_exon_reads.txt.gz | Read counts for each exon across samples | https://gtexportal.org/home/datasets |
Genome assembly | GRCh37/hg19 | H. sapiens genome assembly | http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/ |
Pfam domains v28 | Pfam-A.hmm | ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam28.0/Pfam-A.hmm.gz | |
GTEx Samples and Tissues | samples_tissues | Tab-separated file with GTEx samples and respective tissue. | Download |
Code Block | ||
---|---|---|
| ||
~$: zcat ../gencode.v19.transcripts.patched_contigs.gtf.gz | grep 'ENSG00000075415.\|ENSG00000066405.\|ENSG00000078328.' > target_genes.gtf |
Code Block | ||
---|---|---|
| ||
~$: astalavista-4.0.1-SNAPSHOT/bin/astalavista -t astafunk --tref --genome ./ target_genes.gtf > ref_txs.fa |
Code Block | ||
---|---|---|
| ||
~$: hmmsearch --cut_ga --domtblout target_ref_domains.txt Pfam-A.hmm ref_txs.fa |
Code Block | ||
---|---|---|
| ||
~$: astalavista-4.0.1-SNAPSHOT/bin/astalavista -t astafunk --genome ./ --gtf target_genes.gtf --hmm Pfam-A.hmm --local --reference target_ref_domains.txt > as_domains_target.txt |
Code Block | ||
---|---|---|
| ||
~$: ./calculate_mean_exon_count.sh samples_tissue GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_exon_reads.txt |
Code Block | ||
---|---|---|
| ||
#!/bin/sh
SAMPLES_TISSUE=$1
TX_RPKM=$2
cat $SAMPLES_TISSUE | awk -v FS="\t" -v q="'" '{str=str"s/"$1"/"$3"/g;"} END {print str}' > sed_command
sed -f sed_command $TX_RPKM | awk -v FS="\t" -v OFS="\t" '{
if(NR==1){ |
Code Block | ||
---|---|---|
| ||
~$: astalavista-4.0.1-SNAPSHOT/bin/astalavista -t astafunk --tref --genome ./ target_genes.gtf > ref_txs.fa |
Code Block | ||
---|---|---|
| ||
~$: hmmsearch --cut_ga --domtblout target_ref_domains.txt Pfam-A.hmm ref_txs.fa |
Code Block | ||
---|---|---|
| ||
~$: astalavista-4.0.1-SNAPSHOT/bin/astalavista -t astafunk --genome ./ --gtf target_genes.gtf --hmm Pfam-A.hmm --local --reference target_ref_domains.txt > as_domains_target.txt |
Code Block | ||
---|---|---|
| ||
~$: ./calculate_mean_exon_count.sh samples_tissue GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_exon_reads.txt |
Code Block | ||
---|---|---|
| ||
#!/bin/sh SAMPLES_TISSUE=$1 TX_RPKM=$2 cat $SAMPLES_TISSUE | awk -v FS="\t" -v q="'" '{str=str"s/"$1"/"$3"/g;"} END {print str}' > sed_command sed -f sed_command $TX_RPKM | awk -v FS="\t" -v OFS="\t" '{ if(NR==1){ header = "transcript_id" for(i=2;i<=NF;i++){ headers[i]=$i; sum[headers[i]] = 0; num_samples[headers[i]] = 0; } forheader (i in sum){= "transcript_id" header=header"\t"ifor(i=2;i<=NF;i++){ } headers[i]=$i; print header sum[headers[i]] = 0; }else{ num_samples[headers[i]] = 0; for(i=2;i<=NF;i++){ } for (i in sum[headers[i]]+= $i){ num_samples[headers[i]] = num_samples[headers[i]] + 1header=header"\t"i } curr_line = $1 for(i in sum){ curr_line=curr_line"\t"sum[i]/num_samples[i] sum[i]=0 num_samples[i] = 0 } print curr_line } }'print header }else{ for(i=2;i<=NF;i++){ sum[headers[i]]+= $i num_samples[headers[i]] = num_samples[headers[i]] + 1 } curr_line = $1 for(i in sum){ curr_line=curr_line"\t"sum[i]/num_samples[i] sum[i]=0 num_samples[i] = 0 } print curr_line } }' |
Domain clusters are predictions of the same domain that overlap in their genomic coordinates. We assumed the highest scoring prediction to represent the wild-type of the domain in the gene. We then computed for each alternative prediction of the domain in a cluster the "domain conservation" as the fraction between the domain score assigned to the alternatively spliced domain and the wild-type score. File output.txt is output file of the default run of AstaFunk. Each line is a domain prediction. Using awk, we create a hash data structure where the key is the fields (columns of output.txt) $2 (loci id, e.g., gene id; list of transcripts overlapping the loci, etc), $3 (domain cluster) $5 (domain id) and $15 (domain profile length). The stored value of this data structure is the "domain conservation",. This command prints out the domain name, length and domain conservation for each cluster
Code Block |
---|
~$ cat output.txt | grep -v "NO_HIT\|NO_CDS" | awk -v FS="\t" 'NR>1{cluster[$2"_"$3"_"$5"_"$15]=cluster[$2"_"$3"_"$5"_"$15]" "$6}END{for(i in cluster){split(cluster[i],scores," ");max = 0;for(j in scores){if(scores[j] > max)max=scores[j]}split(i,key,"_");for(j in scores){if(scores[j]!=max)print key[3],key[4],scores[j]/max;}}}' |
Generic pipeline to search alternatively spliced domains | Download |