Introduction

Alternative splicing (AS) is an important mechanism of gene regulation at the transcriptional level and contributes substantially to proteomic diversity and function. To advance large-scale research on the functional impact of AS on the proteome, automated methods are required that identify AS events and link them to functional regions of proteins in a systematic manner.

AstaFunk is a Java tool to study how the diversity of a custom transcriptome translates into functional variation, based on standard transcriptome annotations and protein family profiles. In a nutshell, AstaFunk translates the alternatively spliced parts of open reading frames (given by a GTF annotation) on the fly into amino acid sequences. Subsequently, profile HMMs of the Pfam database are searched against these amino acid sequences, but only in the regions of alternative splicing events. The AstaFunk algorithm is designed to avoid redundant sequence scans in AS-enriched transcriptomes.

This document describes how to download the binaries, build AstaFunk from source code and execute basic commands.

Recommended: Obtaining the AstaFunk Source Code

The current version can always be obtained from the Git repository. Clone the Git repository of Barna. The Barna library consists of a set of tools bundled with the package.

$> git clone http://sammeth.net/bitbucket/scm/barna/barna.git
Cloning into 'barna'...
remote: Counting objects: 29522, done.
remote: Compressing objects: 100% (11638/11638), done.
remote: Total 29522 (delta 11254), reused 27997 (delta 10681)
Receiving objects: 100% (29522/29522), 99.43 MiB | 706.00 KiB/s, done.
Resolving deltas: 100% (11254/11254), done.
Checking connectivity... done.
 
$> git -C barna/ checkout vitor_dev_fix1
Branch vitor_dev_fix1 set up to track remote branch vitor_dev_fix1 from origin.
Switched to a new branch 'vitor_dev_fix1'

Building binaries

Build the binaries of AStalavista and create a distribution version.

$> cd barna/
$> cd barna.astalavista/
$> ../gradlew dist
.
. {some log messages}
.
BUILD SUCCESSFUL
Total time: 2 mins 11.574 secs

Extracting binary files

From the barna.astalavista directory, enter the distribution directory and extract the archive (.tgz or .zip):

$> cd build/distributions/
$> unzip astalavista-3.2.1-SNAPSHOT.zip

Check build

The current bundle uses 'astalavista' as the default tool. You can switch tools with the -t option and get help for a specific tool with -t <toolname> --help, which prints the usage and description of that tool:

$> ./astalavista -t astafunk --help

You will see:

$> ./astalavista -t astafunk --help
[INFO] Astalavista v4.0 (Flux Library: 1.30)

-------Documentation & Issue Tracker-------
Barna Wiki (Docs): http://sammeth.net/confluence
Barna JIRA (Bugs): http://sammeth.net/jira

Please feel free to create an account in the public
JIRA and reports any bugs or feature requests.
-------------------------------------------

Current tool: astafunk
Search HMM-profiles of protein families (Pfam) on alternatively spliced genes.
Tool specific options:
.
. {help messages}
.

Input data

This section describes the mandatory and optional input data for AstaFunk:

--hmm <HMM_FILE.hmm>

<HMM_FILE.hmm> is a file (with extension .hmm) that contains a single profile HMM or multiple profile HMMs from the Pfam-A database. You can download the complete Pfam-A database from the FTP site: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz or download individual profiles using the family browser: http://pfam.xfam.org/family/browse.

--gtf <GTF_FILE.gtf>

<GTF_FILE.gtf> is the gene annotation of the input genome in GTF (Gene Transfer Format).

If you only have a GFF annotation file, convert it to GTF using gffread (part of Cufflinks) or another script.

--genome <GENOME_DIR>

<GENOME_DIR> is the path to the directory that contains the FASTA files (one chromosome per file) of the genome assembly.

Suppose your GTF annotation file contains the following lines (the fields after the coordinates are omitted):

chr1 hg19_refGene start_codon 67000042  67000044  ...;
chr2 hg19_refGene start_codon 201173796 201173798 ...;
chr3 hg19_refGene exon        134204575 134204894 ...;
chr4 hg19_refGene start_codon 41937502  41937504  ...;
chr5 hg19_refGene start_codon 134210118 134210120 ...;

The directory <GENOME_DIR> must then contain the FASTA files chr1.fasta, chr2.fasta, chr3.fasta, chr4.fasta and chr5.fasta.
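
Before a run it can help to verify this naming rule. The following Python sketch is not part of AstaFunk: it collects the seqname column of a GTF file and reports which expected FASTA files are missing from the genome directory. The function name missing_fasta_files is hypothetical, and the fixed .fasta extension is an assumption taken from the example above.

```python
import os
import tempfile


def missing_fasta_files(gtf_path, genome_dir, extension=".fasta"):
    """Return the sorted FASTA file names implied by the GTF seqnames
    (first column) that are not present in genome_dir."""
    seqnames = set()
    with open(gtf_path) as gtf:
        for line in gtf:
            # Skip blank lines and comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            # The seqname is the first whitespace-delimited field.
            seqnames.add(line.split(None, 1)[0])
    expected = {name + extension for name in seqnames}
    present = set(os.listdir(genome_dir))
    return sorted(expected - present)


# Small self-contained demonstration with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    demo_gtf = os.path.join(tmp, "demo.gtf")
    with open(demo_gtf, "w") as fh:
        fh.write("chr1\tdemo\texon\t1\t10\t.\t+\t.\tgene_id \"g1\";\n")
        fh.write("chr2\tdemo\texon\t1\t10\t.\t+\t.\tgene_id \"g2\";\n")
    demo_genome = os.path.join(tmp, "genome")
    os.mkdir(demo_genome)
    open(os.path.join(demo_genome, "chr1.fasta"), "w").close()
    print(missing_fasta_files(demo_gtf, demo_genome))
```

An empty list means that every chromosome named in the annotation has a matching file.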

-r|--reference <REFERENCE_FILE>

Reference file with predicted domains for the reference transcript of each alternatively spliced gene.

How to Create a Reference File

<REFERENCE_FILE> is computed with hmmsearch from the HMMER package, using the command line below:

$> hmmsearch --cut_ga --domtblout <REFERENCE_FILE> Pfam-A.hmm <REFERENCE_TRANSCRIPTS.fasta>

hmmsearch is the HMMER program (hmmer.org) that searches one or more profiles (from the Pfam-A.hmm database) against the amino acid sequences of the reference transcripts (in <REFERENCE_TRANSCRIPTS.fasta>, see below). The option --cut_ga makes hmmsearch apply the gathering thresholds stored in the HMM profiles during prediction. The option --domtblout saves a parseable table of per-domain hits to <REFERENCE_FILE>. The reference transcript is the transcript with the longest ORF of a gene.
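
To inspect the reference file outside of AstaFunk, the --domtblout table can be read with a few lines of Python. This is a minimal sketch and not AstaFunk's own parser. It relies on the documented HMMER layout of 22 whitespace-delimited columns followed by a free-text description. The function names and the sample line (including the accession PF00000.1) are made up for illustration.

```python
def parse_domtblout_line(line):
    """Parse one data line of an hmmsearch --domtblout table.

    The line is split at most 22 times so that the trailing free-text
    description stays in one piece."""
    fields = line.rstrip("\n").split(None, 22)
    if len(fields) < 22:
        raise ValueError("expected at least 22 columns, got %d" % len(fields))
    return {
        "target_name": fields[0],      # sequence, here a reference transcript
        "query_name": fields[3],       # profile HMM name
        "query_accession": fields[4],  # profile HMM accession
        "domain_score": float(fields[13]),
        "hmm_from": int(fields[15]),
        "hmm_to": int(fields[16]),
        "ali_from": int(fields[17]),
        "ali_to": int(fields[18]),
    }


def profile_accessions(lines):
    """Return the sorted unique profile accessions, skipping '#' comments."""
    accessions = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        accessions.add(parse_domtblout_line(line)["query_accession"])
    return sorted(accessions)


# Synthetic line for demonstration only, not a real search result.
sample = ("seqA - 300 ExampleDomain PF00000.1 100 1e-30 100.5 0.1 1 1 "
          "1e-33 1e-30 100.5 0.1 1 100 5 104 3 106 0.97 Made-up example domain")
print(parse_domtblout_line(sample))
print(profile_accessions(["# comment line", sample]))
```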

Using AstaFunk to Generate a Multi-FASTA File with the Reference Transcripts

AstaFunk includes a feature to generate a multi-FASTA file with the amino acid sequences of the reference transcripts for a given annotation.

First, run AstaFunk with the --tref option to print the amino acid sequences of the reference transcripts to standard output, redirected here to the file <REFERENCE_TRANSCRIPTS.fasta>. A reference transcript is the transcript with the longest Open Reading Frame (ORF) of an alternatively spliced gene.

Obtain the reference transcript FASTA file with the command:

$> astalavista -t astafunk --tref --genome <GENOME_DIR> --gtf <GTF_FILE.gtf> > <REFERENCE_TRANSCRIPTS.fasta>
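
The selection rule stated above (longest ORF per gene) can be illustrated with a short Python sketch. It is not AstaFunk's implementation. The input structure, the function name and the tie-breaking rule (first listed transcript wins) are assumptions made for the example.

```python
def pick_reference_transcripts(orf_lengths):
    """Map each gene to its transcript with the longest ORF.

    orf_lengths maps gene id -> {transcript id: ORF length}."""
    reference = {}
    for gene, transcripts in orf_lengths.items():
        # max() returns the first maximal key, so ties go to the
        # transcript listed first (an assumption of this sketch).
        reference[gene] = max(transcripts, key=transcripts.get)
    return reference


# Made-up identifiers and lengths.
example = {"geneA": {"tx1": 300, "tx2": 450}, "geneB": {"tx3": 120}}
print(pick_reference_transcripts(example))
```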

Program Usage

The basic search command of AstaFunk is:

$> astalavista -t astafunk --gtf <GTF_FILE.gtf> --genome <GENOME_DIR> --hmm <HMM_FILE.hmm> --reference <REFERENCE_FILE>
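
When AstaFunk is called from a script, the same command can be assembled programmatically. The Python sketch below uses only the options documented on this page. The helper names, the file names in the demonstration and the assumption that the astalavista executable is on the PATH are illustrative.

```python
import shlex
import subprocess


def build_astafunk_command(gtf, genome_dir, hmm, reference, cpu=None,
                           executable="astalavista"):
    """Return the argument list of the basic AstaFunk search command."""
    argv = [executable, "-t", "astafunk",
            "--gtf", gtf,
            "--genome", genome_dir,
            "--hmm", hmm,
            "--reference", reference]
    if cpu is not None:
        argv += ["--cpu", str(cpu)]
    return argv


def run_astafunk(argv, output_path):
    """Run the command and write its standard output to output_path."""
    with open(output_path, "w") as out:
        subprocess.run(argv, stdout=out, check=True)


# Show the command line without executing it.
argv = build_astafunk_command("tnnt1.gtf", "genome/", "database.hmm",
                              "reference_file", cpu=4)
print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.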

Options

[--hmm <HMM_FILE.hmm>]	Path to the profile HMM file (Pfam format).
[--genome <GENOME_DIR>]	Path to the directory with the genomic sequences, i.e., one FASTA file per chromosome/scaffold/contig with a file name corresponding to the identifiers of the first column in the GTF annotation.
[--gtf <GTF_FILE.gtf>]	Path to the GTF reference annotation file.
[(-r|--reference) <REFERENCE_FILE>]	Path to the reference domain file.
[--tref]	Print to standard output the FASTA sequences of non-AS transcripts and reference transcripts (longest ORF) of AS genes. This parameter is only used with the --gtf and --genome parameters.
[-g|--output-hits-per-gene]	Output the best non-overlapping domain hits of the alternatively spliced (AS) gene. (Default: output the best non-overlapping domain hits of each variant; method used to obtain the results for the paper).
[(-o|--overlapping) <OVERLAPPING>]	Hit overlap threshold [0.0 to 1.0] (default: 0)
[-l|--local]	Run local search (default: glocal)
[--all]	Output all distinct overlapping domain hits. (Default: output the best non-overlapping domain hits of each variant).
[-e|--exh]	Perform exhaustive search against the HMM database.
[--naive]	Run naive search: search domains against all genes with alternative splicing. This search uses a reference file (method used to obtain the results for the paper).
[--const]	Output constitutive domain hits (method used to obtain the results for the paper).
[--test]	Search the HMM database (--hmm) against FASTA sequences (method used to obtain the results for the paper).
[--cpu <CPU>]	Number of threads to run (default: 1)
[--verbose]	Verbose

Program Output

AstaFunk prints the domain predictions for each variant to standard output. The columns of the tab-separated output are:

chr: Field "seqname" of the GTF annotation; name of the chromosome or scaffold. Example: “chr1”.
gene_cluster: String of concatenated AS transcript/gene identifiers. Example: “uc001dhm.2,uc001dhn.3,uc001dho.3”.
variant: Set of transcripts with the same exon/intron composition between the first source and the last sink. Transcripts that share this composition are listed together, separated by commas.
acc: Accession number of the profile HMM. Example: “PF00406.19”.
bitscore: Bit score of the alignment.
start_seq: Start position of the alignment in the sequence.
end_seq: End position of the alignment in the sequence.
start_genomic: Start position of the alignment in the genome.
end_genomic: End position of the alignment in the genome.
first_source: Start genomic position (source) of the fused AS events.
last_sink: End genomic position (sink) of the fused AS events.
start_model: Alignment start state of the profile HMM.
end_model: Alignment end state of the profile HMM.
length_model: Number of states of the profile HMM.
events:
- Contains the code and splice chain (pipe-separated) of each AS event overlapped by the domain hit. Events are separated by a single whitespace. Example:
  ... start_model end_model length_model events ... 1 150 150 code_event1|splice_chain_event1 code_event2|splice_chain_event2
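
The events field described above can be split into (code, splice chain) pairs with a few lines of Python. This is a sketch under the assumption that neither part contains whitespace and that the first pipe separates the two parts. The function name parse_events is hypothetical.

```python
def parse_events(field):
    """Split an events field into a list of (code, splice_chain) tuples."""
    events = []
    for item in field.split():
        # Everything before the first '|' is the event code,
        # everything after it is the splice chain.
        code, _, splice_chain = item.partition("|")
        events.append((code, splice_chain))
    return events


# Placeholder values taken from the example line above.
print(parse_events("code_event1|splice_chain_event1 code_event2|splice_chain_event2"))
```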

Getting Started

Searching for protein domains in alternatively spliced regions of the human gene TNNT1

According to RefSeq (NM_003283):

This gene encodes a protein that is a subunit of troponin, which is a regulatory complex located on the thin filament of the sarcomere. This complex regulates striated muscle contraction in response to fluctuations in intracellular calcium concentration.

Input

Annotation of eight alternative transcripts from GENCODE Basic v24 (Download)
Chromosome 19 FASTA file from GRCh38/hg38 (Download)
Reference file (Download)
HMM file (Download)

Command line

$> astalavista -t astafunk --tref --gtf tnnt1.gtf --genome ~/example/genome/ > reference_tx.fasta

$> hmmsearch --domtblout reference_file ~/Databases/Pfam/Pfam-A.hmm reference_tx.fasta

$> grep -v "^#" reference_file | awk '{print $5}' | sort -u > list-hmm-tnnt1
$> hmmfetch -f ~/Databases/Pfam/Pfam-A.hmm list-hmm-tnnt1 > database.hmm

Alternatively, skip these two commands and pass the whole Pfam-A.hmm database directly to the --hmm option.

$> astalavista -t astafunk --genome ~/example/genome/ --gtf tnnt1.gtf --reference reference_file --hmm database.hmm

Extra: Constitutive Domains

$> astalavista -t astafunk --const --genome ~/example/genome/ --gtf tnnt1.gtf --reference reference_file --hmm database.hmm

Output

AstaFunk identifies six complete alternative events among the eight alternative transcripts of the gene TNNT1 (the paper presents only two of these events as an example). The standard output is:

#	chr	gene_cluster_name	name_hmm	acc	description	bitscore	start_seq	end_seq	start_genomic	end_genomic	first_source	last_sink	start_model	end_model	length_model	sequence	variants
EHIT	chr19	ENST00000291901.12,ENST00000356783.9,ENST00000587465.6, ENST00000587758.5,ENST00000585321.6,ENST00000588981.5, ENST00000536926.5,ENST00000588426.5	Troponin	PF00992.17	Troponin	35.2309670601	1	102	-55147129	-55134152	-55147168	-55147168	1	134	134	ENST00000588426.5	ENST00000588426.5
EHIT	chr19	ENST00000291901.12,ENST00000356783.9,ENST00000587465.6, ENST00000587758.5,ENST00000585321.6,ENST00000588981.5, ENST00000536926.5,ENST00000588426.5	Troponin	PF00992.17	Troponin	98.6465479489	85	205	-55141239	-55134200	-55147168	-55147168	1	134	134	ENST00000588981.5	ENST00000588981.5
EHIT	chr19	ENST00000291901.12,ENST00000356783.9,ENST00000587465.6, ENST00000587758.5,ENST00000585321.6,ENST00000588981.5, ENST00000536926.5,ENST00000588426.5	Troponin	PF00992.17	Troponin	139.0575466987	1	135	-55141281	-55134152	-55147168	-55147168	1	134	134	ENST00000587465.6	ENST00000587465.6
EHIT	chr19	ENST00000291901.12,ENST00000356783.9,ENST00000587465.6, ENST00000587758.5,ENST00000585321.6,ENST00000588981.5, ENST00000536926.5,ENST00000588426.5	Troponin	PF00992.17	Troponin	139.0575466987	1	135	-55141281	-55134152	-55147168	-55147168	1	134	134	ENST00000585321.6	ENST00000585321.6
EHIT	chr19	ENST00000291901.12,ENST00000356783.9,ENST00000587465.6, ENST00000587758.5,ENST00000585321.6,ENST00000588981.5, ENST00000536926.5,ENST00000588426.5	Troponin	PF00992.17	Troponin	139.0575466987	1	135	-55141281	-55134152	-55147168	-55147168	1	134	134	ENST00000536926.5	ENST00000536926.5
EHIT	chr19	ENST00000291901.12,ENST00000356783.9,ENST00000587465.6, ENST00000587758.5,ENST00000585321.6,ENST00000588981.5, ENST00000536926.5,ENST00000588426.5	Troponin	PF00992.17	Troponin	158.3576407894	69	205	-55141287	-55134152	-55147168	-55147168	1	134	134	ENST00000291901.12	ENST00000291901.12
EHIT	chr19	ENST00000291901.12,ENST00000356783.9,ENST00000587465.6, ENST00000587758.5,ENST00000585321.6,ENST00000588981.5, ENST00000536926.5,ENST00000588426.5	Troponin	PF00992.17	Troponin	158.3576407894	58	194	-55141287	-55134152	-55147168	-55147168	1	134	134	ENST00000356783.9	ENST00000356783.9
EHIT	chr19	ENST00000291901.12,ENST00000356783.9,ENST00000587465.6, ENST00000587758.5,ENST00000585321.6,ENST00000588981.5, ENST00000536926.5,ENST00000588426.5	Troponin	PF00992.17	Troponin	158.3576407894	58	194	-55141287	-55134152	-55147168	-55147168	1	134	134	ENST00000587758.5	ENST00000587758.5
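
For downstream analysis, this output can be loaded into dictionaries keyed by the header names. The Python sketch below is not part of AstaFunk. It assumes the layout shown above: a tab-separated header row whose first field is "#" and hit rows whose first field is "EHIT". The function names and the model coverage calculation are illustrative.

```python
import io


def read_hits(handle):
    """Return one dict per EHIT row, keyed by the header column names."""
    header = None
    hits = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "#":
            header = fields[1:]
        elif fields[0] == "EHIT" and header is not None:
            hits.append(dict(zip(header, fields[1:])))
    return hits


def model_coverage(hit):
    """Fraction of the profile HMM states spanned by the alignment."""
    start, end = int(hit["start_model"]), int(hit["end_model"])
    return (end - start + 1) / int(hit["length_model"])


# Header and first hit row from the output above; the gene cluster is
# shortened to two identifiers for readability.
HEADER = ["#", "chr", "gene_cluster_name", "name_hmm", "acc", "description",
          "bitscore", "start_seq", "end_seq", "start_genomic", "end_genomic",
          "first_source", "last_sink", "start_model", "end_model",
          "length_model", "sequence", "variants"]
ROW = ["EHIT", "chr19", "ENST00000291901.12,ENST00000356783.9", "Troponin",
       "PF00992.17", "Troponin", "35.2309670601", "1", "102", "-55147129",
       "-55134152", "-55147168", "-55147168", "1", "134", "134",
       "ENST00000588426.5", "ENST00000588426.5"]
text = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"

for hit in read_hits(io.StringIO(text)):
    print(hit["variants"], hit["acc"], float(hit["bitscore"]), model_coverage(hit))
```

To read a saved result, open the file that holds the redirected standard output and pass the file object to read_hits.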

Documentation of the Java source code

The complete Javadoc of Barna is available at http://sammeth.net/jenkins/job/barna-devel/javadoc/. The AstaFunk documentation is in the packages barna.astafunk.*
