BioKnowledge Transfer (BKT)

BioKnowledge Transfer is the assignment of title lines and the prediction of Gene Ontology (GO) properties for uncharacterized proteins. Annotation is based on family membership and domain structure (Pfam analysis), and on similarity to characterized proteins from the Proteome databases (BLAST analysis). BioKnowledge Transfer curation, like all curation within the Proteome module, is not automated but is done manually by expert curators.

BioKnowledge Transfer is a proprietary annotation process for the Proteome module. For an uncharacterized protein of interest, it entails writing a one-line description, or Title Line, and predicting GO terms from the molecular function, biological process, and cellular component hierarchies. BioKnowledge Transfer GO terms and Title Lines are based on Pfam analysis, which reveals the presence of conserved domain or protein family sequences, and on BLAST analysis, which reveals sequence similarity to characterized proteins. Title Lines written using BioKnowledge Transfer are always referenced (references are visible in the relational tables) and are written in a standard style described below. GO terms predicted by BioKnowledge Transfer have the evidence code PK, are described as resulting from "Direct Transfer", "Consensus Transfer", or "Pfam Transfer", and are referenced to the literature source from which the GO term was derived for the related, characterized protein.


Pfam Analysis

Pfam analysis uses HMMer (version 3), a multiple sequence alignment program using Hidden Markov Models, to compare protein sequences from the BioKnowledge Library to all Pfam domain/family seed sequences. A query protein is considered to contain a particular domain or belong to a particular protein family only if the HMMer alignment score is greater than or equal to the trusted cut-off score defined by Pfam for each Pfam seed sequence. HMMer analysis using the most recent version of Pfam is performed approximately every month, or as new versions of Pfam are released. Pfam domain/family alignments are displayed in the Sequence section, referenced with the Pfam version used in the comparison.

For each alignment of a query protein with a Pfam protein family or protein domain alignment, BIOBASE generates a standard descriptive phrase and, where possible, assigns GO terms, based on our curation of information available from Pfam, InterPro, and published literature. GO terms are only assigned to a Pfam protein alignment if the GO terms are applicable to all known members of the family or domain. These descriptive phrases and associated GO terms are then used for BioKnowledge Transfer curation.


BLAST Analysis

In addition to the Pfam information, BioKnowledge Transfer curation also uses two complementary methods of BLAST analysis, termed Direct Transfer and Consensus Transfer. In the case of Direct Transfer, BLAST analysis is used to identify the target protein with the "best" similarity to the uncharacterized protein. Proteins are ranked according to Smith-Waterman score, and preference is given to the highest ranking characterized protein with full-length similarity to the uncharacterized protein of interest. A BLAST target is considered to have "full-length similarity" to the uncharacterized protein if the region of overlap, including any gaps introduced to improve alignment, covers at least 70% of the length of the uncharacterized protein as well as at least 70% of the length of the BLAST target. 
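
The full-length criterion above can be sketched as a small check. This is only an illustration, assuming a single overlap length (gaps included) is measured against both sequences; the function and parameter names are hypothetical and not part of the BioKnowledge Library.

```python
def has_full_length_similarity(overlap_len, query_len, target_len, min_percent=70):
    """Return True if the overlap covers at least min_percent of BOTH
    the uncharacterized (query) protein and the BLAST target.

    overlap_len is the length of the aligned region, including gaps.
    Integer arithmetic is used so that exactly 70% passes the test.
    """
    if overlap_len < 0 or query_len <= 0 or target_len <= 0:
        raise ValueError("lengths must be positive and overlap non-negative")
    covers_query = overlap_len * 100 >= min_percent * query_len
    covers_target = overlap_len * 100 >= min_percent * target_len
    return covers_query and covers_target
```

A hit that covers most of the query but only a small part of a much longer target fails this test and would be treated as regional similarity instead.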

A BLAST target is considered "characterized" if it has an experimentally determined GO term from within either the molecular function or biological process hierarchy that was published in a peer-reviewed journal and curated in our BioKnowledge Library. If there are no characterized proteins with full-length similarity to the uncharacterized protein of interest, a characterized protein with less than full-length similarity ("regional similarity") is selected as the "best" BLAST target. If there are no characterized proteins with any similarity to the uncharacterized protein of interest, an uncharacterized protein (i.e. a protein that does not have an experimentally determined GO term from within either the molecular function or biological process hierarchy published in a peer-reviewed journal and curated in our BioKnowledge Library) is selected as the "best" BLAST target.
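
The selection order described above (characterized with full-length similarity, then characterized with regional similarity, then uncharacterized) can be sketched as follows. The `Hit` record and its field names are hypothetical, and ranking uncharacterized hits by score alone is an assumption of this sketch; in practice the choice is made by curators, not by code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Hit:
    name: str             # hypothetical identifier of the BLAST target
    sw_score: float       # Smith-Waterman score used for ranking
    characterized: bool   # has a curated, experimentally determined GO term
    full_length: bool     # overlap covers >= 70% of both query and target


def select_best_target(hits: Sequence[Hit]) -> Optional[Hit]:
    """Pick the 'best' BLAST target by tier, then by highest score."""
    def tier(hit: Hit) -> int:
        if hit.characterized and hit.full_length:
            return 0   # preferred: characterized, full-length similarity
        if hit.characterized:
            return 1   # next: characterized, regional similarity
        return 2       # last resort: uncharacterized protein

    if not hits:
        return None
    return min(hits, key=lambda h: (tier(h), -h.sw_score))
```

Note that a lower-scoring characterized hit is preferred over a higher-scoring uncharacterized one, which mirrors the preference stated in the text.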

In the case of Consensus Transfer, the entire set of BLAST target protein sequences with an E-value of 1e-10 or less is considered, in order to identify the most predominant GO terms (published in a peer-reviewed journal and curated in our BioKnowledge Library) shared among the members of the group.


Title Lines

The title line for an uncharacterized protein contains two types of information:  (1) a phrase describing its similarity to the best BLAST target, along with a descriptive phrase describing that BLAST target's function, and (2) a phrase or phrases describing the protein's membership in conserved families or the presence of conserved domains or motifs (determined by Pfam analysis), along with standard descriptive phrase(s) about those families, domains or motifs.

An example Title Line is as follows: "Protein with high similarity to reticulocalbin (human RCN1), which binds calcium and is found in the lumen of the ER, contains four EF hand domains, which are found in signaling, buffering or transport proteins". In this case, the phrase "which binds calcium and is found in the lumen of the ER" is a literature-based description of the characterized BLAST target, RCN1, and is not a description of the uncharacterized protein itself. Similarly, the phrase "which are found in signaling, buffering or transport proteins" is a generalized description of the EF hand domain, and is not a specific description of the uncharacterized protein itself.

The level of similarity to the best BLAST target is indicated by the following controlled vocabulary:

Definition of Adjectives Used in BioKnowledge Transfer Title Lines

Adjective      Overlap*    Identity
very strong    >=80%       >=95%
strong         >=80%       80-95%
high           >=70%       45-100%
moderate       >=70%       35-44%
low            >=70%       25-34%
weak           >=70%       20-24%

* "Overlap" refers to the % of match length over both the uncharacterized protein of interest and the BLAST target. For example, BOTH values must be 80% or more to be considered as having "very strong similarity".
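
As an illustration of how the controlled vocabulary could be applied, the sketch below reads the table top to bottom and returns the first row whose thresholds are met. Treating the lower bound of each identity range as a threshold (so that 95% identity with 80% overlap counts as "very strong") is an assumption of this sketch, as are the function and variable names. Alignments with less than 70% overlap on either protein fall outside this table and return None here.

```python
from typing import Optional

# (adjective, minimum overlap %, minimum identity %), checked top to bottom.
SIMILARITY_LEVELS = [
    ("very strong", 80, 95),
    ("strong", 80, 80),
    ("high", 70, 45),
    ("moderate", 70, 35),
    ("low", 70, 25),
    ("weak", 70, 20),
]


def similarity_adjective(query_overlap_pct: float,
                         target_overlap_pct: float,
                         identity_pct: float) -> Optional[str]:
    """Return the title-line adjective, or None if no row applies."""
    # BOTH overlap values must meet the threshold, so use the smaller one.
    overlap = min(query_overlap_pct, target_overlap_pct)
    for adjective, min_overlap, min_identity in SIMILARITY_LEVELS:
        if overlap >= min_overlap and identity_pct >= min_identity:
            return adjective
    return None
```

For example, an alignment with 97% identity but only 75% overlap of the query is "high" rather than "very strong", because the 80% overlap requirement is not met.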

If the region of overlap from the alignment with the best BLAST target covers less than 70% of the BLAST target, the description "has adjective (strong, high, etc.) similarity to a region of protein X" is used. Conversely, if the region of overlap covers less than 70% of the uncharacterized protein of interest, the description "has a region of adjective (strong, high, etc.) similarity to protein X" is used.

The information in the BioKnowledge Transfer title line is written in a standard order to emphasize the most important information. If the similarity to the best BLAST target is very strong, strong, or high, information about the BLAST target is listed first in the title line. If the similarity to the best BLAST target is moderate, low, weak, or regional, or if the best BLAST target is uncharacterized, then information about the family membership or domain structure based on Pfam analysis of the uncharacterized protein of interest is placed first in the title line.
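
The ordering rule above reduces to a single condition, sketched here with hypothetical names. The labels "blast" and "pfam" simply stand for the two kinds of phrase in a title line; this is not a description of any actual tooling.

```python
# Adjectives for which the BLAST target is described first.
BLAST_FIRST_ADJECTIVES = {"very strong", "strong", "high"}


def title_line_order(adjective, regional, target_characterized):
    """Return the order of the two information types in a title line.

    adjective: similarity adjective for the best BLAST target
    regional: True if the similarity is regional rather than full-length
    target_characterized: True if the best BLAST target is characterized
    """
    blast_first = (
        adjective in BLAST_FIRST_ADJECTIVES
        and not regional
        and target_characterized
    )
    return ["blast", "pfam"] if blast_first else ["pfam", "blast"]
```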

The title line "Protein of unknown function" indicates that the uncharacterized protein contains no conserved domain or family sequences from Pfam analysis and has no significant similarity (has less than 20% full-length or regional identity to a characterized protein or has less than 20% full-length identity to an uncharacterized protein) to a protein of the BioKnowledge Library by BLAST analysis.

Title line references are visible in the relational tables. If information about the family membership or domain structure of the protein is used in the title line, then the version of Pfam used in the analysis is referenced. If information about similarity to another protein, based on BLAST analysis, is used in the title line, this is referenced as "see BLAST".


GO Terms

In addition to writing a descriptive title line for the uncharacterized protein of interest using both Pfam and BLAST analysis, the BioKnowledge Transfer process predicts GO terms (representing molecular function, biological process, and cellular component hierarchies) for the uncharacterized protein. GO terms are predicted from family membership and domain structure (i.e. from Pfam analysis) only if the GO terms apply to all known members of the family or domain. GO terms are predicted from the best BLAST target according to the criteria listed in the table "Conditions for Direct Transfer of GO terms by BioKnowledge Transfer" (Direct Transfer), and are also predicted from the entire set of BLAST hits with an E-value of 1e-10 or less (Consensus Transfer).

Importantly, GO terms are only transferred to the uncharacterized protein of interest if the corresponding term from the BLAST target(s) was originally curated from published literature. Additionally, in the case of the Direct Transfer method, GO terms are only predicted for the uncharacterized protein of interest if the corresponding GO term from the best BLAST target was experimentally determined in the literature.

All GO terms predicted by BioKnowledge Transfer analysis are identified as being derived from "Direct Transfer", "Consensus Transfer", or "Pfam Transfer". In cases where GO terms were assigned based on Pfam analysis, the Pfam domain(s) is listed along with the corresponding amino acid coordinates and the associated E-value. In cases where the Direct Transfer and/or Consensus Transfer method was used to assign a GO term, up to five of the top BLAST hits are listed, each with its E-value and the published literature source from which the indicated GO term (with evidence code) was originally derived for the hit(s).

In cases where the "Consensus Transfer" method was used to assign a GO term, a value is presented (as a fraction) specifying the number of characterized BLAST hits (with a cutoff E-value of 1e-10) which possess the assigned GO term versus the number of all characterized BLAST hits that possess GO terms in the corresponding hierarchy. For example, "Signal Transduction" is part of the Biological Process hierarchy. A value of 4/6 for the GO term "Signal Transduction" indicates that out of 6 characterized BLAST targets with GO terms in the Biological Process hierarchy, 4 shared "Signal Transduction" or a child of "Signal Transduction" as a curated GO term.
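
The fraction described above can be computed as in the following sketch. The input layout is an assumption made for illustration: each hit is an (E-value, set of curated GO terms in one hierarchy) pair, and `ancestors` is a hypothetical mapping from a GO term to the set of its parent terms, used to recognize children of the queried term.

```python
def consensus_support(term, hits, ancestors, max_evalue=1e-10):
    """Return (supporting, total) for one GO term in one hierarchy.

    total: hits at or below the E-value cutoff that have at least one
           curated GO term in the hierarchy.
    supporting: those hits that carry `term` or one of its children.
    """
    eligible = [terms for evalue, terms in hits
                if evalue <= max_evalue and terms]

    def matches(candidate):
        # A candidate supports `term` if it is the term itself or a child.
        return candidate == term or term in ancestors.get(candidate, set())

    supporting = sum(1 for terms in eligible
                     if any(matches(t) for t in terms))
    return supporting, len(eligible)
```

The result is returned as a pair rather than a reduced fraction so that a value such as 4/6 is reported as written, not simplified to 2/3.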

Conditions for Direct Transfer of GO terms by BioKnowledge Transfer

BLAST similarity    Are GO terms predicted?
                    Molecular Function    Biological Process    Cellular Component
very strong         yes                   yes                   yes
strong              yes                   yes                   yes
high                yes                   yes                   no
moderate            yes                   yes                   no
low                 yes                   yes                   no
weak                no                    no                    no
regional            no                    no                    no
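
The table above maps directly onto a lookup structure. The sketch below is a plain transcription of the table under hypothetical names; it does not cover the additional requirement that the source GO term be experimentally determined in the literature.

```python
HIERARCHIES = ("molecular function", "biological process", "cellular component")

# One row per BLAST similarity level; flags follow the HIERARCHIES order.
DIRECT_TRANSFER = {
    "very strong": (True, True, True),
    "strong": (True, True, True),
    "high": (True, True, False),
    "moderate": (True, True, False),
    "low": (True, True, False),
    "weak": (False, False, False),
    "regional": (False, False, False),
}


def transferable_hierarchies(similarity):
    """List the GO hierarchies eligible for Direct Transfer."""
    flags = DIRECT_TRANSFER[similarity]
    return [name for name, allowed in zip(HIERARCHIES, flags) if allowed]
```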

The ranges of percent identity used in similarity to the best BLAST target are based on the threshold values for structural homology from Sander and Schneider (1991). The validity of the BioKnowledge Transfer process has been rigorously tested by BIOBASE's research and development team by blind BioKnowledge Transfer curation of characterized proteins.


References

BLAST
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol 215:403-410.

gapped BLAST
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402.

Gene Ontology
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet 25:25-29.

HMMer
Krogh A, Brown M, Mian IS, Sjolander K, Haussler D. (1994) Hidden Markov models in computational biology: Applications to protein modeling. J Mol Biol 235:1501-1531.

Interpro
Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MDR, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H, Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder NJ, Oinn TM, Pagni M, Servant F, Sigrist CJA, Zdobnov EM. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 29:37-40.

Pfam
Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer ELL. (2000) The Pfam Protein Families Database. Nucleic Acids Res 28:263-266.

SEG filter program
Wootton JC, Federhen S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol 266:554-571.

Smith-Waterman algorithm
Waterman MS. (1995) Introduction to Computational Biology: Maps, sequences and genomes. Chapman & Hall, London. ISBN 0-412-99391-0

Threshold values for structural homology
Sander C, Schneider R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9:56-68.


Copyright © geneXplain. All rights reserved.
Contact us at support@genexplain.com