Alternative splicing (AS) is an important mechanism of post-transcriptional gene regulation and contributes substantially to proteomic diversity and function. To advance large-scale research on the functional impact of AS on the proteome, automated methods are required that identify AS events and link them systematically to functional regions of proteins.
AstaFunk is a Java tool to study how the diversity of a custom transcriptome translates into functional variation, based on standard transcriptome annotations and protein family profiles. In a nutshell, AstaFunk translates the alternatively spliced parts of open reading frames (given by a GTF annotation) on the fly into amino acid sequences. Profile HMMs from the Pfam database are then searched against these amino acid sequences, restricted to the regions of alternative splicing events. The AstaFunk algorithm is designed to avoid redundant sequence scans in AS-enriched transcriptomes.
This document explains how to download the binaries, build AstaFunk from source code, and execute basic commands.
Alternatively, the current version can always be obtained from the Git repository. Clone the Git repository of Barna. The Barna library consists of a set of tools bundled with the package.
Build the binaries of AStalavista and create a distribution version.
Enter the distribution directory and extract the archive (.tgz or .zip).
The current bundle uses 'astalavista' as the default tool. You can switch tools with the -t option and get help for a specific tool with -t <toolname> --help. This prints the usage and description of the specified tool.
You will see:
In this section, we describe the optional and mandatory input data required to run AstaFunk:
<HMM_FILE> is a file (with extension .hmm) containing a single profile HMM or multiple profile HMMs from the Pfam-A database. You can download the complete Pfam-A database from the FTP site: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz or download individual profiles using the family browser: http://pfam.xfam.org/family/browse.
<GTF> is the gene annotation of the input genome in GTF (Gene Transfer Format).
<GENOME_DIR> is the directory path to FASTA files (one chromosome per file) of the genome assembly.
Assume your annotation GTF file is (some fields are hidden after coordinates):
In that case, the FASTA files in the directory <GENOME_DIR> must be named chr1.fasta, chr2.fasta, chr3.fasta, chr4.fasta and chr5.fasta, that is, one file per sequence name in the first column of the GTF.
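The naming rule above can be checked before starting a run. The following is a minimal sketch, not part of AstaFunk: the function name `missing_fasta_files` is hypothetical, and the `.fasta` extension is assumed from the example file names.

```python
from pathlib import Path


def missing_fasta_files(gtf_path, genome_dir, extension=".fasta"):
    """Return the GTF sequence names that have no FASTA file in genome_dir."""
    seqnames = set()
    with open(gtf_path) as gtf:
        for line in gtf:
            # Skip blank lines and GTF comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            # The first tab-separated field is the sequence name (e.g. chr1).
            seqnames.add(line.split("\t", 1)[0])
    genome = Path(genome_dir)
    # Assumption: files are named <seqname> + extension, as in the example.
    return sorted(
        name for name in seqnames
        if not (genome / f"{name}{extension}").is_file()
    )
```

An empty list means every sequence name in the annotation has a matching file in the directory.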
Reference file with predicted domains for the reference transcript of each alternatively spliced gene.
<REFERENCE_FILE> is computed with hmmsearch from the HMMER package, using the command line below:
hmmsearch is the HMMER program (hmmer.org) that searches one or more profiles (here, the Pfam-A.hmm database) against the amino acid sequences of the reference transcripts (in <REFERENCE_TRANSCRIPTS>.fasta, see help below). The --cut_ga option makes hmmsearch apply the gathering (GA) thresholds stored in each HMM profile. The --domtblout option saves a parseable table of per-domain hits to <REFERENCE_FILE>. The reference transcript is the transcript with the longest ORF of a gene.
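As an illustration of the options described above, the sketch below assembles an hmmsearch argument list and reads the resulting per-domain table. It assumes the standard HMMER 3 `--domtblout` layout (whitespace-separated, 22 fixed columns followed by a free-text description, comment lines starting with `#`). The function names and dictionary keys are hypothetical, and this is not the parser used inside AstaFunk.

```python
def hmmsearch_command(hmm_file, transcripts_fasta, reference_file):
    """Argument list for: hmmsearch [options] <hmmfile> <seqdb>."""
    return ["hmmsearch", "--cut_ga", "--domtblout", reference_file,
            hmm_file, transcripts_fasta]


def parse_domtblout(lines):
    """Parse per-domain hits from the lines of a --domtblout table."""
    hits = []
    for line in lines:
        # Skip blank lines and the '#' header/footer lines.
        if not line.strip() or line.startswith("#"):
            continue
        # 22 fixed columns; anything after them is the description.
        f = line.split(None, 22)
        hits.append({
            "sequence": f[0],          # target name (reference transcript)
            "name_hmm": f[3],          # query name (profile HMM)
            "acc": f[4],               # profile accession
            "bitscore": float(f[13]),  # per-domain bit score
            "start_model": int(f[15]),
            "end_model": int(f[16]),
            "start_seq": int(f[17]),
            "end_seq": int(f[18]),
        })
    return hits
```

The argument list can be passed to `subprocess.run(...)`, and `parse_domtblout(open(path))` then reads the table that hmmsearch wrote.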
AstaFunk includes a feature to generate a multi-fasta file with the amino acid sequences of reference transcripts for a given annotation.
First, run AstaFunk to print the amino acid sequences of the reference transcripts to standard output (redirected to the file <REFERENCE_TRANSCRIPTS.fasta>). A reference transcript is the transcript with the longest Open Reading Frame (ORF) of an alternatively spliced gene.
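The definition of a reference transcript can be made concrete with a small sketch. It assumes the translated ORFs are already available as a nested dictionary. The function name is hypothetical, and the tie-breaking rule (first transcript wins among equally long ORFs) is an assumption, not documented AstaFunk behaviour.

```python
def pick_reference_transcripts(proteins_by_gene):
    """Map each gene to the transcript with the longest ORF translation.

    proteins_by_gene: {gene_id: {transcript_id: amino_acid_sequence}}
    """
    reference = {}
    for gene_id, transcripts in proteins_by_gene.items():
        # max() returns the first transcript among equally long sequences.
        reference[gene_id] = max(
            transcripts, key=lambda tid: len(transcripts[tid])
        )
    return reference
```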
Obtain the reference transcript FASTA file with the command:
The basic search command of AstaFunk is:
Path to the profile HMM file
Path to the directory with the genomic sequences, i.e., one fasta file per chromosome/scaffold/contig with a file name
corresponding to the identifiers of the first column in the GTF annotation
Path to the GTF reference annotation
Path to the reference domain file
Output FASTA file with the reference transcript of each gene. This parameter is only used with the --gtf and --genome parameters.
Output best non-overlapping domain hits of the alternatively spliced (AS) gene. (Default: output best non-overlapping
domain hits of each variant; Method to obtain results for the paper).
Hit overlapping allowed [0.0 to 1.0] (default: 0)
Run local search (Default: glocal)
Output all domain hits of each alternative variant. (Default: output best non-overlapping domain hits of each variant).
Perform exhaustive search against HMM database.
Run naïve search. Search domains against all genes with alternative splicing. This search uses a reference file
(Method to obtain results for the paper).
Perform a domain search only on constitutive regions of all genes (Method to obtain results for the paper).
Number of threads to run (Default: 1)
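To illustrate what "best non-overlapping domain hits" with an allowed overlap between 0.0 and 1.0 can mean, here is a greedy sketch. It is not AstaFunk's implementation: the overlap is measured relative to the shorter hit, hits are ranked by bit score, and the dictionary keys `start`, `end` and `bitscore` are all assumptions made for the example.

```python
def overlap_fraction(a, b):
    """Shared length divided by the shorter hit's length (inclusive coordinates)."""
    shared = min(a["end"], b["end"]) - max(a["start"], b["start"]) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a["end"] - a["start"], b["end"] - b["start"]) + 1
    return shared / shorter


def best_non_overlapping(hits, max_overlap=0.0):
    """Keep the highest-scoring hits whose pairwise overlap stays within max_overlap."""
    kept = []
    # Visit hits from highest to lowest bit score.
    for hit in sorted(hits, key=lambda h: h["bitscore"], reverse=True):
        if all(overlap_fraction(hit, other) <= max_overlap for other in kept):
            kept.append(hit)
    # Report the surviving hits in sequence order.
    return sorted(kept, key=lambda h: h["start"])
```

With `max_overlap=0.0` any shared position excludes the lower-scoring hit, while larger values tolerate partially overlapping domains.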
AstaFunk prints the domain predictions for each variant to standard output. The columns of the standard output are:
chr: Field "seqname" of the GTF annotation; name of the chromosome or scaffold; Example: “chr1”.
gene_cluster_name: string of concatenated AS transcript/gene identifiers. Example: “uc001dhm.2,uc001dhn.3,uc001dho.3”.
name_hmm: Name of the protein family in the profile HMM. Example: “ADK”.
acc: Accession number of the profile HMM. Example: “PF00406.19”.
description: Description of the profile HMM. Example: “Adenylate kinase”.
bitscore: Bit score of the alignment.
start_seq, end_seq: Start/End position of the alignment in the sequence.
start_genomic, end_genomic: Start/End position of the alignment in the genome.
first_source, last_sink: Source is the start genomic position of the merged AS events. Sink is the end genomic position of the merged AS events.
start_model, end_model: Alignment start/end state number of the profile HMM.
length_model: Number of states of the profile HMM.
sequence: Sequence ID of the gene/transcript used to search the protein domain/family. This sequence is randomly selected from the variants.
variants: Set of transcripts that share the same exon/intron composition between the first source and the last sink. Multiple transcript identifiers are separated by commas.
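For downstream analysis, the standard output can be loaded into dictionaries. The sketch below assumes the output is tab-separated and has no header line, and that the caller supplies the column names in the order printed by their AstaFunk version. Neither the delimiter nor the column order is confirmed by this document, and the function name is hypothetical.

```python
def read_predictions(lines, columns):
    """Turn tab-separated prediction lines into a list of dictionaries."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Pair each field with the caller-supplied column name.
        record = dict(zip(columns, line.split("\t")))
        # Convert the fields this sketch knows how to interpret.
        if "bitscore" in record:
            record["bitscore"] = float(record["bitscore"])
        if "variants" in record:
            record["variants"] = record["variants"].split(",")
        records.append(record)
    return records
```

Redirecting the AstaFunk output to a file and calling `read_predictions(open(path), columns)` yields one dictionary per predicted domain, with `variants` as a list of transcript identifiers.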