Guide to Statistics in the Ontology Search Tool

Overview

Term statistics presented in the Ontology Search tool identify curated vocabulary terms that are unusually common within a pool of selected genes, relative to the overall occurrence of those terms in the complete subscribed set of genes. For instance, if a user performs an experiment to detect genes expressed in liver cells, and enters that set of genes using the input feature, Ontology Search vocabulary terms that are related to liver cell gene expression would be expected to be over-represented. While the term "EX:liver" might have been curated to about 20% of the total genes with "Expression" curation in the user's subscribed copy of BKL, the number of genes annotated to "EX:liver" among the user's input set of genes might be closer to 95% or higher, making that term dramatically over-represented.

Likewise, if a user focuses on the term "MF:motor activity", the resulting set of genes will also show a dramatic over-representation of curated terms such as "CC:cytoskeleton" and "MF:ATPase activity", because these terms are commonly assigned to many of the same genes. The displayed P-values are a way to identify such co-occurrence relationships, some of which may not be intuitively obvious.

Specifically, the P-value represents the probability that the observed over-representation of a given term arose by chance. A value of 1 indicates that the term is either under-represented or occurs at very close to the same frequency in the chosen gene set as in the total subscribed BKL set of genes. A very low value, in contrast, indicates that the term is significantly over-represented and that this is unlikely to be due to chance.

Please note: For simplicity and brevity, the documentation uses genes as the focus of the Set Analysis feature, but diseases or drugs can also be made the focus by selecting either one from the pull-down menu at the top right corner of the tool.

Calculating Term Statistics

The hypergeometric distribution is used to calculate P-values based on counts of genes associated with individual terms, using both the selected and the overall sets of genes.

x = number of genes associated with a chosen term or its children, in the selected set
n = total number of genes in the selected (focused or input) set associated with the ultimate parent of the chosen term that is still within the same category (called the root term).
M = number of genes associated with the chosen term or its children in the total subscribed BKL
N = total number of genes associated with the ultimate parent of the chosen term that is still within the same category (called the root term). For a GO term such as "BP:transport", this would be the number of genes associated with "GO:biological process" in the full subscribed version of BKL.

P = C(M, x) * C(N-M, n-x) / C(N, n)
where C(a, b) is the binomial coefficient: a! / ((a-b)! * b!)

We use a speed-optimized approximation of this formula, which provides precision to 2 decimal places for all P-values down to 1e-300. Only over-represented values are calculated; all under-represented terms (x < n * (M/N)) are set to 1. The value for x is shown in the driller as the gene count, and in the popup display of ranked P-values as the "Gene count". The value for M is the count shown on the same term after pressing reset. The "Expected count" shown in the popup for comparison purposes is n * (M/N).
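The calculation above can be sketched in a few lines of Python. This is only an illustration, not the tool's own speed-optimized approximation (which is not described here): it evaluates the formula directly through log-gamma so that the factorials never overflow. The function names log_choose, expected_count and term_p_value are hypothetical, and the example counts at the bottom are invented to match the 20% versus 95% "EX:liver" scenario from the overview.

```python
import math


def log_choose(a: int, b: int) -> float:
    """Natural log of the binomial coefficient 'a choose b'."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def expected_count(n: int, M: int, N: int) -> float:
    """Expected number of genes with the term in the selected set: n * (M/N)."""
    return n * (M / N)


def term_p_value(x: int, n: int, M: int, N: int) -> float:
    """P = C(M, x) * C(N-M, n-x) / C(N, n), with under-represented terms set to 1.

    x: genes with the term in the selected set
    n: genes under the root term in the selected set
    M: genes with the term in the full subscribed set
    N: genes under the root term in the full subscribed set
    """
    if not (0 <= x <= min(n, M) and n <= N and M <= N and n - x <= N - M):
        raise ValueError("inconsistent counts")
    # Only over-represented terms are calculated; the rest are reported as 1.
    if x < expected_count(n, M, N):
        return 1.0
    log_p = log_choose(M, x) + log_choose(N - M, n - x) - log_choose(N, n)
    # Guard against tiny floating-point overshoot above 1.
    return min(1.0, math.exp(log_p))


if __name__ == "__main__":
    # Hypothetical counts: 200 of 1000 genes carry the term overall (20%),
    # but 95 of the 100 selected genes carry it (95%).
    print("expected count:", expected_count(100, 200, 1000))
    print("P-value:", term_p_value(95, 100, 200, 1000))
```

In this sketch math.exp underflows to 0 for probabilities far below the smallest double; keeping the value as log_p would be one way to handle more extreme cases.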

FAQs

How are terms "grouped by proximity"?

After P-values are calculated for all terms, and collected up to the user-chosen threshold value, those terms that are directly connected in the vocabulary tree (either parents or children of each other) are grouped together and then re-ranked. The geometric average P-value for those terms is provided as the group average. (This average is not itself a P-value, however.)
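The grouping step can be illustrated with a short Python sketch. Several details are assumptions, because the description above does not specify them: groups are taken to be connected sets of terms linked by parent/child edges, groups are ranked by their geometric average, and all P-values are assumed to be greater than zero. The function name group_by_proximity and the input shapes are hypothetical.

```python
import math
from collections import defaultdict


def group_by_proximity(p_values, parent_child_edges):
    """Group directly connected terms and rank groups by geometric average.

    p_values: dict mapping term name -> P-value (already filtered by threshold)
    parent_child_edges: iterable of (parent_term, child_term) pairs
    Returns a list of (geometric_average, [terms sorted by P-value]) tuples.
    """
    # Union-find over the terms that passed the threshold.
    root = {term: term for term in p_values}

    def find(term):
        while root[term] != term:
            root[term] = root[root[term]]  # path halving
            term = root[term]
        return term

    for parent, child in parent_child_edges:
        # Ignore edges that touch a term outside the thresholded list.
        if parent in root and child in root:
            root[find(parent)] = find(child)

    groups = defaultdict(list)
    for term in p_values:
        groups[find(term)].append(term)

    ranked = []
    for members in groups.values():
        # Geometric average = exp of the mean of the logs (assumes P > 0).
        mean_log = sum(math.log(p_values[t]) for t in members) / len(members)
        ranked.append((math.exp(mean_log), sorted(members, key=p_values.get)))

    # Assumption: groups are re-ranked by their geometric average, lowest first.
    ranked.sort(key=lambda group: group[0])
    return ranked


if __name__ == "__main__":
    example = {"A": 1e-4, "B": 1e-6, "C": 1e-2}  # hypothetical terms
    for average, terms in group_by_proximity(example, [("A", "B")]):
        print(average, terms)
```

As noted above, the group average is useful for ordering groups but is not itself a P-value.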

Can I speed up the popup statistics feature?

All P-value calculations are done in the browser, and are therefore sensitive to the speed of the browser's JavaScript implementation. Firefox has a particularly fast JavaScript engine, and is the recommended browser for running the Ontology Search tool on all platforms. Mozilla-related browsers will all be relatively fast, while Internet Explorer and Safari are slower in our tests.

Another way to speed up processing is to use this feature on smaller sets of genes, whether from input or focus sources. The smaller the set, the fewer associated terms and calculations there are.