Alternative splicing (AS) is an important mechanism of gene regulation at the transcript level and contributes substantially to proteomic diversity and function. To advance large-scale research on the functional impact of AS on the proteome, automated methods are required that identify AS events and link them systematically to functional regions of proteins.
AstaFunk is a Java tool for studying how the diversity of a custom transcriptome translates into functional variation, based on standard transcriptome annotations and protein family profiles. In a nutshell, AstaFunk translates the alternatively spliced parts of open reading frames (given by a GTF annotation) on the fly into amino acid sequences. Profile HMMs from the Pfam database are then searched against these amino acid sequences, but only in the regions of alternative splicing events. The AstaFunk algorithm is designed to avoid redundant sequence scans in AS-enriched transcriptomes.
This document explains how to download the binaries, build AstaFunk from source code, and execute basic commands.
Alternatively, the current version can always be obtained from the Git repository. Clone the Barna Git repository; the Barna library is a set of tools bundled in a single package.
Code Block |
---|
$> git clone http://sammeth.net/bitbucket/scm/barna/barna.git |
Cloning into 'barna'... |
remote: Counting objects: 29522, done. |
remote: Compressing objects: 100% (11638/11638), done. |
remote: Total 29522 (delta 11254), reused 27997 (delta 10681) |
Receiving objects: 100% (29522/29522), 99.43 MiB | 706.00 KiB/s, done. |
Resolving deltas: 100% (11254/11254), done. |
Checking connectivity... done. |
$> git -C barna checkout vitor_dev_fix1 |
Branch vitor_dev_fix1 set up to track remote branch vitor_dev_fix1 from origin. |
Switched to a new branch 'vitor_dev_fix1' |
Build the AStalavista binaries and create a distribution.
Code Block |
---|
$> cd barna/ |
$> cd barna.astalavista/ |
$> ../gradlew dist |
. |
. {some log messages} |
. |
BUILD SUCCESSFUL |
Total time: 2 mins 11.574 secs |
From the barna.astalavista directory, enter the distribution directory and extract the archive (.tgz or .zip):
Code Block |
---|
$> cd build/distributions/ |
$> unzip astalavista-3.2.1-SNAPSHOT.zip |
The current bundle uses 'astalavista' as the default tool. You can switch tools with the -t option and get help for a specific tool with -t <toolname> --help, which prints the usage and description of that tool.
Code Block |
---|
$> ./astalavista -t astafunk --help |
You will see:
Code Block |
---|
$> ./astalavista -t astafunk --help |
[INFO] Astalavista v4.0 (Flux Library: 1.30) |
-------Documentation & Issue Tracker------- |
Barna Wiki (Docs): http://sammeth.net/confluence |
Barna JIRA (Bugs): http://sammeth.net/jira |
Please feel free to create an account in the public JIRA and reports any bugs or feature requests. |
------------------------------------------- |
Current tool: astafunk |
Search HMM-profiles of protein families (Pfam) on alternatively spliced genes. |
Tool specific options: |
. |
. {help messages} |
. |
Option | Description |
---|---|
[--hmm <HMM_FILE>] | Profile-HMM file. |
[--gtf <GTF>] | Gene annotation (GTF). |
[--genome <GENOME>] | Path to the directory with the genomic sequences, i.e., one FASTA file per chromosome/scaffold/contig, with a file name corresponding to the identifiers in the first column of the GTF annotation. |
[(-r|--reference) <REFERENCE_FILE>] | Path to the reference domain file. See Usage Examples. |
[-e|--exh] | Perform exhaustive search against HMM database (default: heuristic search) |
[-g|--output-hits-per-gene] | Output the best non-overlapping domain hits of the AS gene. (Default: output the best non-overlapping domain hits of each variant). |
[--all] | Output all distinct overlapping domain hits of each alternative variant. (Default: output the best non-overlapping domain hits of each variant). |
[-l|--local] | Run local search mode. (Default: glocal) |
[(-o|--overlapping) <OVERLAPPING>] | Hit overlapping threshold (integer) (default: 0) |
[--tref] | Print the sequence of the reference transcript of each gene on standard output in FASTA format. This parameter is used only with the --gtf and --genome parameters. |
[--const] | Perform a domain search only on the constitutive regions of all genes. (Method to obtain results for the paper). |
[--naive] | Run a naïve search: search domains against all genes with alternative splicing, without the AstaFunk heuristics. Needs a reference domain file. (Method to obtain results for the paper). |
[--test] | Search HMM database (--hmm) against FASTA sequences. (Method to obtain results for the paper). |
[--fa <SEQUENCE_FILE>] | Path to FASTA Sequence file. This file is used as input to evaluate the method employed by AstaFunk to align sequences (--test). |
[--cpu <CPU>] | Number of threads (Default: 1) |
[--verbose] | Verbose. |
In this section, we describe the optional and mandatory input data required to run AstaFunk:
--hmm <HMM_FILE.hmm>
<HMM_FILE.hmm> is a file (with extension .hmm) containing a single profile HMM or multiple profile HMMs from the Pfam-A database of Pfam. You can download the complete Pfam-A database from the FTP site: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz or download individual profiles using the family browser: http://pfam.xfam.org/family/browse.
--gtf <GTF_FILE.gtf>
<GTF_FILE.gtf> is the gene annotation of the input genome in GTF (Gene Transfer Format).
If you only have a GFF annotation file, convert it to GTF with gffread (part of Cufflinks) or another script.
--genome <GENOME_DIR>
<GENOME_DIR> is the path to the directory containing the FASTA files of the genome assembly (one chromosome per file).
Warning | ||
---|---|---|
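The FASTA file names must correspond to the identifiers in the first column of the GTF annotation.
For example, if the first column of your GTF annotation contains chr1, chr2, chr3, chr4 and chr5, the FASTA files in the directory <GENOME_DIR> must be chr1.fasta, chr2.fasta, chr3.fasta, chr4.fasta and chr5.fasta. |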
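The naming rule in the warning above can be checked before a run. The following is a minimal Python sketch, not part of AstaFunk: the function names are hypothetical, and the `.fasta` extension is assumed from the example file names in the warning (adjust it if your files use another extension).

```python
from pathlib import Path


def gtf_sequence_ids(gtf_path):
    """Return the distinct identifiers of the first GTF column, in file order."""
    ids = []
    seen = set()
    with open(gtf_path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment lines
            seq_id = line.split("\t", 1)[0]
            if seq_id not in seen:
                seen.add(seq_id)
                ids.append(seq_id)
    return ids


def missing_fasta_files(gtf_path, genome_dir, extension=".fasta"):
    """List the expected FASTA file names that are absent from genome_dir.

    Assumption: one file per identifier, named <identifier><extension>.
    """
    genome = Path(genome_dir)
    return [
        seq_id + extension
        for seq_id in gtf_sequence_ids(gtf_path)
        if not (genome / (seq_id + extension)).is_file()
    ]
```

An empty list returned by `missing_fasta_files` means that every identifier in the annotation has a matching file in the genome directory.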
-r|--reference <REFERENCE_FILE>
Reference domain file with the predicted domains for the reference transcript of each alternatively spliced gene. See Usage Examples for how to create a reference domain file.
<REFERENCE_FILE> is computed with hmmsearch from the HMMER package, using the command line below:
Code Block |
---|
$> hmmsearch --cut_ga --domtblout <REFERENCE_FILE> <HMM_FILE> <REFERENCE_TRANSCRIPTS.fasta> |
hmmsearch is the HMMER program (hmmer.org) that searches one or more profiles (from the Pfam-A.hmm database) against the amino acid sequences of the reference transcripts (in <REFERENCE_TRANSCRIPTS.fasta>, see below). The --cut_ga option makes hmmsearch use the gathering thresholds stored in the HMM profiles when reporting hits. The --domtblout option saves a parseable table of per-domain hits to <REFERENCE_FILE>. The reference transcript is the transcript with the longest ORF of a gene. See below how to obtain the reference transcript FASTA file <REFERENCE_TRANSCRIPTS.fasta>.
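To inspect a reference domain file before passing it to AstaFunk, the per-domain table can be read with a few lines of Python. This is a minimal sketch, not AstaFunk's own parser. It assumes the standard HMMER3 `--domtblout` layout (22 whitespace-separated columns followed by a free-text description, with comment lines starting with `#`). The `DomainHit` type and `parse_domtblout` function are hypothetical names.

```python
from collections import namedtuple

# Subset of the per-domain columns that is usually needed downstream.
DomainHit = namedtuple(
    "DomainHit",
    "target query i_evalue score hmm_from hmm_to ali_from ali_to",
)


def parse_domtblout(lines):
    """Parse hmmsearch --domtblout lines into DomainHit records."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        # 22 fixed columns; everything after them is the description.
        fields = line.split(None, 22)
        if len(fields) < 22:
            continue  # not a complete per-domain row
        hits.append(DomainHit(
            target=fields[0],            # sequence (reference transcript) name
            query=fields[3],             # profile HMM name
            i_evalue=float(fields[12]),  # independent E-value of this domain
            score=float(fields[13]),     # bit score of this domain
            hmm_from=int(fields[15]),
            hmm_to=int(fields[16]),
            ali_from=int(fields[17]),
            ali_to=int(fields[18]),
        ))
    return hits
```

With hmmsearch, the target is the sequence and the query is the profile, so `target` holds the reference transcript identifier and `query` the Pfam family name.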
AstaFunk includes a feature to generate a multi-fasta file with the amino acid sequences of reference transcripts for a given annotation.
First, run AstaFunk with --tref to print the amino acid sequences of the reference transcripts on standard output, redirected here to the file <REFERENCE_TRANSCRIPTS.fasta>. A reference transcript is the transcript with the longest open reading frame (ORF) of an alternatively spliced gene.
Code Block |
---|
$> astalavista -t astafunk --tref --genome <GENOME_DIR> --gtf <GTF_FILE.gtf> > <REFERENCE_TRANSCRIPTS.fasta> |
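The two preparation steps (exporting the reference transcripts with --tref, then running the hmmsearch command shown earlier) can be chained in a small script. The sketch below is one possible way to do this from Python. It uses only the options documented on this page; the function names and the `launcher` parameter are hypothetical, and it assumes that `astalavista` and `hmmsearch` are on the PATH.

```python
import subprocess


def tref_command(genome_dir, gtf_file, launcher="astalavista"):
    """Argument list that prints the reference transcripts on standard output."""
    return [launcher, "-t", "astafunk", "--tref",
            "--genome", genome_dir, "--gtf", gtf_file]


def hmmsearch_command(reference_file, hmm_file, transcripts_fasta):
    """Argument list that writes the per-domain table to reference_file."""
    return ["hmmsearch", "--cut_ga", "--domtblout", reference_file,
            hmm_file, transcripts_fasta]


def build_reference(genome_dir, gtf_file, hmm_file,
                    transcripts_fasta, reference_file):
    """Run both steps; raises CalledProcessError if either tool fails."""
    # Step 1: redirect the standard output of --tref into the FASTA file.
    with open(transcripts_fasta, "w") as out:
        subprocess.run(tref_command(genome_dir, gtf_file),
                       stdout=out, check=True)
    # Step 2: search the profiles against the reference transcripts.
    # The human-readable hmmsearch report is discarded; only the
    # --domtblout table is kept.
    subprocess.run(
        hmmsearch_command(reference_file, hmm_file, transcripts_fasta),
        stdout=subprocess.DEVNULL, check=True)
```

Passing the arguments as lists avoids shell quoting issues with paths that contain spaces.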
AstaFunk prints the domain predictions for each variant on standard output as tab-separated columns. The column names are described below:
events: the code (pattern) and the splice chain, pipe-separated, of each AS event overlapped by the domain hit. Multiple events are separated by a single space. Example:
Code Block |
---|
... start_model end_model length_model events ... |
1 150 150 code_event1|splice_chain_event1 code_event2|splice_chain_event2 |
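For downstream analysis, the events column can be split back into its parts. This is a minimal sketch based only on the format described above (events separated by a single space, code and splice chain separated by a pipe). The function name `parse_events` is hypothetical, and the sketch assumes that the event code itself contains no pipe character.

```python
def parse_events(field):
    """Split the events column into (code, splice_chain) pairs."""
    events = []
    for item in field.split(" "):
        if not item:
            continue  # tolerate leading, trailing or repeated spaces
        # Split on the first pipe only: code on the left, chain on the right.
        code, _, splice_chain = item.partition("|")
        events.append((code, splice_chain))
    return events
```

Applied to the example row, this yields one pair per event: `("code_event1", "splice_chain_event1")` and `("code_event2", "splice_chain_event2")`.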
To learn more about AS event patterns, see the references in 3.1 - Tool ASTA (AS Event Retriever).
Parameters without brackets are mandatory for the respective mode; parameters in square brackets are optional. Alternatives separated by a pipe ("|") are mutually exclusive.
Code Block |
---|
astalavista -t astafunk [--verbose] [--cpu <INT>] [--all | -g] [--local] [-o <INT>] --genome <GENOME_DIR> --gtf <GTF_FILE> --hmm <HMM_FILE> --reference|-r <REFERENCE_FILE> |
Code Block |
---|
astalavista -t astafunk [--verbose] [--cpu <INT>] [--local] [-o <INT>] --const --genome <GENOME_DIR> --gtf <GTF_FILE> --hmm <HMM_FILE> --reference|-r <REFERENCE_FILE> |
Note: for AS genes, the current version of this mode searches for constitutive domains only on the reference transcript (longest ORF).
Exhaustive mode searches the complete HMM database against the variant sequences, i.e., without a reference domain file.
Code Block |
---|
astalavista -t astafunk [--verbose] [--cpu <INT>] [--all | -g] [--local] [-o <INT>] -e|--exh --genome <GENOME_DIR> --gtf <GTF_FILE> --hmm <HMM_FILE> |
Code Block |
---|
astalavista -t astafunk [--verbose] [--cpu <INT>] [--local] [-o <INT>] --naive --genome <GENOME_DIR> --gtf <GTF_FILE> --hmm <HMM_FILE> --reference|-r <REFERENCE_FILE> |
Code Block |
---|
astalavista -t astafunk --tref --genome <GENOME_DIR> --gtf <GTF_FILE> |
Code Block |
---|
astalavista -t astafunk [--local] [-o <INT>] --test --hmm <HMM_FILE> --fa <SEQUENCE_FILE> |
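When AstaFunk is called from a script, the rules in the synopses above can be checked before the process is started. The following sketch covers only the default (reference-based) and exhaustive modes and is an illustration, not part of AstaFunk. The function name, its keyword arguments and the `launcher` parameter are hypothetical. Rejecting a reference file in exhaustive mode is a choice of this sketch, based on the synopsis listing no --reference for that mode.

```python
def astafunk_command(genome_dir, gtf_file, hmm_file, reference_file=None,
                     exhaustive=False, all_hits=False, per_gene=False,
                     local=False, overlapping=None, cpu=None,
                     launcher="astalavista"):
    """Build the argument list for the default or the exhaustive mode."""
    if all_hits and per_gene:
        raise ValueError("--all and -g are mutually exclusive")
    if exhaustive and reference_file is not None:
        raise ValueError("exhaustive mode takes no reference domain file")
    if not exhaustive and reference_file is None:
        raise ValueError("default mode requires a reference domain file")

    cmd = [launcher, "-t", "astafunk"]
    if cpu is not None:
        cmd += ["--cpu", str(cpu)]
    if all_hits:
        cmd.append("--all")
    if per_gene:
        cmd.append("-g")
    if local:
        cmd.append("--local")
    if overlapping is not None:
        cmd += ["-o", str(overlapping)]
    if exhaustive:
        cmd.append("--exh")
    cmd += ["--genome", genome_dir, "--gtf", gtf_file, "--hmm", hmm_file]
    if reference_file is not None:
        cmd += ["--reference", reference_file]
    return cmd
```

The returned list can be passed directly to `subprocess.run`.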
The complete Javadoc of Barna is available at http://sammeth.net/jenkins/job/barna-devel/javadoc/. The AstaFunk documentation is in the barna.astafunk.* packages.