This walk-through demonstrates how VAPoR can be used for the functional annotation of genomic plant sequences. In the example we will annotate a sequence from the purple yam (D. alata) assembly published by Benjamen White at The Genome Analysis Centre (ENA: ERP012202). This sequence was identified as a potential plant disease resistance gene in an earlier step of analysis. Therefore, we are interested in learning more about its function using VAPoR.

>TDa95/00328_TGAC00001_contig_2946
ATTTTTTGCATTTTTCAAATTGAGAGATTAATAAGTGGGCATGCAAATCTGGTGATTTTGTTTCTGATGTTTTTCTTGATTTGGAGATTTGCAGAATGACTC
TGCTTTTAGGCCCTCCAGGCTCAGGGAAGACAACTTTATTGTTGGCTTTGGCTGGAAAACTGAGTTCAGATCTGAAGGTGAGGGAAGACAACTTTTTTGTTT
TGTAATTGATTTCTAAAAATCATCCACTAAAGTAATATGATGGACAGCTGACAAATGGGAATTACTCTACTATTGAAATCAGGTTACCGGTAGTGTGACATA
CAATGGTCACAGCATGAAAGAGTTTGTTCCTGGGAGAACAGCAGCTTATATTGGTCAATATGACCTGCATATTGGTGAGATGACAGTGCGTGAAACTCTGGC
CTTCTCTGCAAGATGTCAAGGAGTTGGTACTCGACATGGTATACTTTTCTTCTGCAAACTGGGAATGTTTCATCATTTGATGTTCCTCAAAGTTGATAATTT
TCCCCACCATGAAGGCAGAACTAGTGAAATTTTGATACCCATAAACAGACTAAGCCTTCAACAGTTAGTTATAATTCCCGGAAATTGTGTCAAAGACTTCTA
TTTGATAAAGAGAAAATATTGAGATTTGAGTTCTTCAGAATTAGCCATTCTGCTTTGAGGGATTCAATTCTTTGTTGGTCTCCATTCCTTTCCATCTGCAAA
ATTTTAACTGCATTTCATGACTGCAGATATGTTGTCTGAATTAGTAAGAAGAGAGAAGCAAGCAAATATCAAACCTGATTCAGATATTGATGTCTTCATGAA
GGTAAAATTAAAAAAAAAAAAAAAAAAAACCTCTTGGTTTCAATCACCATAATCCTGTGGCACAAATGTATATATAGAAATTCTTAGTTTGGTATTGAGAAC
CTTTGTTTTCACACACTCAGGCAATAGGAACTGAAGGACAAGAGACCAGTGTGGTTACAGATTATATACTGAAGGTAAGTATAGTTTTCAAGACAAACATTT
GACACTATGCTCGTTAAAGAAAATAGACAACCTAAGCTCTTTTTTTTAAATGCAAACAGATTTTAGGATTGGAAGTCTGCGCCGATACCATGGTAGGAGATG
ACATGGTGAGAGGCATCTCTGGAGGACAAAGGAAGCGTGTTACTACAGGTGAATTCTAGCTTTACTTGTAGTTTCTCTAGTGCTTGTCATGTTCAATCTGCG
GGGGAACACTAACATGCTACTCTTCATCTCCAACTCAAATACTGCAGGTGAGATGCTTGTCGGTCCAGCCAGAGTGTTGCTGATGGATGAGATATCCACTGG
TCTAGATAGCTCCACAACCTTCCAGATTGTTGATTCACTCAGGCATTCCATTCACATTCTTGGTGGTACAGCAGTAATCTCATTGCTGCAACCTGCACCAGA
GACATATGACCTCTTTGACGATATAATTCTCCTCTCTGATGGGCAAGTTGTGTATCAAGGCCCCCGTGAACATGTGCTTGAGTTCTTTGAGTCCATGGGTTT
CAGATGCCCTGAGAGGAAAGGTGTTGCTGACTTCTTGCAAGAAGTAGGTCCTCCGCAATACTCCCCCAGTATCCATTCAATGTAATACTAACTTCAATGTGG
TTCTAATTTTTCTTTGTCATTTAAATCCACAGGTGACATCAAGAAAAGATCAGCAGCAGTATTGGGCACGTCATGAGGAACCTTATAGGTATGTGCCTGTGA
GAGAATTTGCAGAGGCATTCCACTCATTTCATATCGGCGCGAGCATGGGACATGAGCTCTCTGTCCCTTACGATAAGACCAAGAGCCACCCTGCTGCCCTGG
CAACTTCAAAATTTGGTGTTGGCAAGATGGAACTACTGAAAGCTTGCATTTGGAGAGAACAATTGCTGATGAAGAGGAACTCGTTTGTCTACATCTTCAAGG
CAGTTCAGGTAAGAATGGAATCTTTCACTGCAGTACAAAATTTCAAATTTCATTAGCAAAACTGTGCTTTGAGCAAATTCATGCTTGTATTTTCAGCTTTGT
GTCATGGCGTTCATCACAATGACACTCTTCTTCCGCACAAATATGCACCATGATACAGTAACTGATGGAGGAATTTACATGGGTGCACTCTTCTTCGGGATC
CTCTCAATCATGTTCAATGGATTCTCAGAACTTGCCATGACCATAATGAAGCTTCCTGTTTTCTTCAAGCAAAGAGATCTTCTCTTTTTCCCTGCATGGGCT
TATGCCTTGCCATCGTGGATTCTGAAGATACCCATCACATTTATGGAAGTTGGAGTCTGGGTGTTCACAACATACTATGTCATAGGATTTGATCCCAATGTT
GGAAGGTGAGAAACCACTCTGACTTCAGTACTGCTGCAATTAGTTGCAGAGTGAAATTAATGTCAAGCCTCCAATTCTTGATTCAGTGAGACAATACGATGA
TTTTCTCACATGTACTGCAGGCTGTTCAAGCAATATCTGCTCCTCCTTTGTGTCCAGCAAATGGCATCTGCTTTATTTCGGTTCATTGCAGCGCTAGGTAGG
AACATGATTGTTGCCAATACTTTTGGATCTTTTGCGCTCCTTGTGCTAATGGTGCTGGGTGGATTCATCATTTCAAGAGGTATTTCTGATAGCCATCTCTCT
GTTGTCAGCCATACAAACTTTGCTGCTCTAATTTTTCACAAATTAAAACTCTACTTGACCTTTAATTACTTTTAACTGACAGAGGACATAAAGAAATGGTGG
ATATGGGGTTACTGGATTTCACCCTTGATGTACTCACAAAATGCAATTACCACAAATGAATTTCTAGGGAAAAAGTGGAGACATGTAAGAGTGCTTATCATA
ATTATTTTTCCCCTACAAAAACAAATAATGCTAAGAAAAGAAAAACCATCAACTTATTATGCAATAAAGGAAGGAAAGGCTCTAGGGAAACAAAAACAGCTT
ACCATTCAAAATTGTAACAGATTTCTCCTGGATCAACGGAGC

Running VAPoR

Step 1: Copy the query sequence from above, set the E-value to "1.0e-60" and run VAPoR

First, we copy the purple yam query sequence and paste it into the text input field. The input must be in FASTA format, so a header line is required. VAPoR begins by running a BLAST search to identify homologs of the query in A. thaliana. Because we are particularly interested in A. thaliana proteins that are as closely related to our query as possible, we choose a stringent E-value threshold of 1.0e-60 for the BLAST search. Clicking the "Submit" button starts the VAPoR run, which should take only a few seconds.
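The two input requirements in this step (a FASTA header and an E-value cut-off) can be sketched in a few lines of Python. This is only an illustration, not VAPoR's own code: the function names parse_fasta and passes_evalue are hypothetical, and the default threshold simply mirrors the value chosen in this walk-through.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Raises ValueError if sequence data appears before any '>' header,
    which is the situation the walk-through warns about.
    """
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before any '>' header line")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def passes_evalue(evalue, threshold=1.0e-60):
    """Return True if a hit's E-value is at or below the threshold.

    Assumption: a hit is kept when evalue <= threshold; smaller E-values
    indicate more significant matches.
    """
    return evalue <= threshold
```

Pasting the sequence without its ">TDa95/00328_TGAC00001_contig_2946" line would make parse_fasta raise an error, whereas the full record parses into a single (header, sequence) pair.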

Analysing the results

Step 2: Get familiar with MSA and phylogenetic tree functionality

Once the program has finished running, a phylogenetic tree and a multiple sequence alignment (MSA) appear on the page. Both components are interactive. The following image illustrates the user interactions the visualisation responds to. Mousing over nodes in the tree displays the relevant GO terms as a tooltip. The tree also reacts to mouse clicks. Clicking on a tree leaf or protein identifier triggers the visualisation of genetic interaction data (derived from STRING) and tissue-specific gene expression data (derived from GEO). Clicking on an inner tree node collapses it, hiding the node's sub-tree. Clicking on a collapsed node makes the hidden sub-tree reappear. Tree nodes have different colours, indicating the number of GO terms associated with them (the greener a node is, the more GO terms are available). The MSA visualisation can be navigated by dragging or by using the scroll bar.
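The collapse behaviour and the colour coding described above can be modelled with a small tree structure. The sketch below is an assumption-laden illustration rather than VAPoR's implementation: the TreeNode class, its method names and the linear green scale in green_intensity are all hypothetical.

```python
class TreeNode:
    """Minimal tree node with a collapsed flag (hypothetical model)."""

    def __init__(self, name, children=None, go_terms=()):
        self.name = name
        self.children = list(children or [])
        self.go_terms = set(go_terms)
        self.collapsed = False

    def is_leaf(self):
        return not self.children

    def toggle(self):
        """Simulate a click: inner nodes collapse or expand, leaves are unaffected."""
        if not self.is_leaf():
            self.collapsed = not self.collapsed

    def visible_names(self):
        """Return the names of all nodes that are currently drawn."""
        names = [self.name]
        if not self.collapsed:
            for child in self.children:
                names.extend(child.visible_names())
        return names


def green_intensity(n_terms, max_terms):
    """Map a GO-term count to a 0-255 green channel value.

    Assumption: a simple linear scale; the actual colour scheme used
    by the tool is not specified in this walk-through.
    """
    if max_terms <= 0:
        return 0
    return round(255 * min(n_terms, max_terms) / max_terms)
```

Calling toggle() on an inner node once hides its sub-tree from visible_names(); calling it a second time restores it, matching the click behaviour described above.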

Step 3: Move the mouse over different tree leaves to learn about the function of A. thaliana homologs from their GO terms.

Mousing over a leaf in the tree causes a tooltip to pop up, showing the GO terms associated with the corresponding A. thaliana protein. This way, the general function of related proteins can be explored quickly. Inner nodes hold the intersection of the GO terms of all their leaves. The image on the right shows the GO terms for protein AB39G. These are displayed when the mouse is moved over the phylogenetic tree leaf next to the "AB39G_ARATH" protein identifier.
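The rule that an inner node shows only the GO terms shared by all leaves below it can be written as a short recursive function. This is a sketch under stated assumptions: the nested-dictionary tree layout, the function name go_terms_for_node and the placeholder term labels are hypothetical and do not reflect VAPoR's internal data model.

```python
def go_terms_for_node(node):
    """Return the GO terms to display for a tree node.

    A leaf is a dict with a "go_terms" entry; an inner node is a dict
    with a non-empty "children" list. For an inner node the result is
    the intersection of the GO term sets of all leaves below it.
    """
    if "children" not in node:
        return set(node["go_terms"])
    child_sets = [go_terms_for_node(child) for child in node["children"]]
    return set.intersection(*child_sets)


# Placeholder labels stand in for real GO identifiers.
example_tree = {
    "children": [
        {"go_terms": {"term_a", "term_b", "term_c"}},
        {
            "children": [
                {"go_terms": {"term_b", "term_c"}},
                {"go_terms": {"term_c", "term_d"}},
            ]
        },
    ]
}

shared = go_terms_for_node(example_tree)
```

Because only terms common to every leaf survive, nodes closer to the root tend to show fewer and more general terms than the leaves themselves.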

The more of the proteins' GO terms we look through, the more terms directly related to plant stress response come up. These observations match the initial assumption that our query sequence may code for a disease resistance protein. The relationship between the query and the hits is further supported by the high sequence conservation shown in the MSA. The figure to the left shows the result of mousing over the tree leaf next to the "AB40G_ARATH" identifier.

The image to the right shows the GO terms for A. thaliana protein AB36G. This protein seems to be especially well studied, as indicated by the large number of annotated GO terms. As a number of these GO terms are of particular interest to us (as indicated by the red box), we would like to learn more about this protein. Therefore, we click on the leaf (or, alternatively, the protein ID) to display genetic interactions and tissue-specific gene expression information.

Step 4: Click on "AB36G_ARATH"

Once we click on the AB36G protein identifier (or leaf), a genetic interaction network and an image of A. thaliana showing tissue-specific gene expression levels appear.

Step 5: Get familiar with the genetic network and expression visualisations.

The following two images illustrate the interactive features of both visualisations. The interaction network can be navigated by panning and zooming (via mouse drag and scrolling, respectively). Nodes can be dragged around to arrange the network as the user prefers. Clicking on a node in the network displays the tissue-specific gene expression for that protein. Mousing over a green node opens a small window containing information about the function and GO terms of the node. For blue nodes no such data is available from SWISS-PROT. Mousing over an edge shows the experiment that was used to determine the interaction. The edges are colour-coded according to the interaction confidence, as indicated by the graph legend. Mousing over the different tissues of the A. thaliana plant in the gene expression visualisation displays a tooltip showing the protein identifier and the exact expression value. It is also possible to mouse over the legend to get more details on the expression level associated with each colour.

Step 6: Click on "MLO2" in the gene interaction network.

The tissue-specific gene expression data show high expression in all tissues except the flower. The genetic interaction network is especially interesting, as it shows that AB36G interacts with several important plant resistance genes known to be involved in antifungal resistance (e.g. MLO2). This is indicated by the yellow and red boxes in the image on the right.

Summary

All in all, we were able to use VAPoR to support our initial hypothesis (that the query sequence might code for a plant resistance gene) within a few minutes. We did not have to run BLAST, MAFFT or NINJA ourselves, nor did we have to perform any file conversions. Moreover, we did not have to visit the UniProt, STRING and GEO pages and mine them for relevant information. Finally, as VAPoR does not just gather these data but also puts them into context, we are able to draw a number of biological conclusions from our findings. We find that the query is likely to be a member of the ABC transporter G protein family. It may be an interesting starting point for future analysis of purple yam resistance to fungi. For example, interacting purple yam genes could be identified through gene coexpression analysis of RNA-seq data.