Hypergeometric test
Background
The hypergeometric test is a simple statistical procedure to test for
over- or under-representation of items (groups) in a defined global set.
For example, if a global set consists of 50 black and 50 white balls,
and 20 balls are randomly selected from this set, what is the
probability that 17 white balls and only 3 black balls are chosen? The
hypergeometric test can answer such questions by using the probability
function of the hypergeometric
distribution. This distribution is highly related to the binomial
distribution which uses a similar definition but with
replacement, instead of without.
This document walks through one example where the hypergeometric test
is used to determine gene enrichment for pathways of the [subtiwiki
pathway categories](http://www.subtiwiki.uni-goettingen.de/v3/category/ for
the bacterium Bacillus subtilis.
Libraries and test data
- Install
Category
package from Bioconductor
BiocManager::install("Category")
suppressPackageStartupMessages({
library(Category)
library(tidyverse)
})
- import category test data (from subtiwiki)
df_subtiwiki <- read_csv("../data/geneCategories-2022-09-02.csv",
name_repair = function(x) str_replace_all(x, " ", "_"),
show_col_types = FALSE)
head(df_subtiwiki)
- convert dataframe into a nested list format
subti_categories <- df_subtiwiki %>%
mutate(category = str_sub(category, 1, 25)) %>%
select(category, gene_locus) %>%
group_by(category) %>%
summarize(gene_locus = list(gene_locus)) %>%
deframe()
head(subti_categories, n = 3)
$`6S RNA`
[1] "BSU_misc_RNA_32" "BSU_misc_RNA_41"
$`A/P endonucleases`
[1] "BSU_22340" "BSU_25130" "BSU_40880"
$`ABC transporters for the `
[1] "BSU_01610" "BSU_01610" "BSU_01620" "BSU_01620" "BSU_01630" "BSU_01630" "BSU_03800" "BSU_03800"
[9] "BSU_03810" "BSU_03810" "BSU_03820" "BSU_03820" "BSU_03830" "BSU_03830" "BSU_07490" "BSU_07490"
[17] "BSU_07500" "BSU_07500" "BSU_07510" "BSU_07510" "BSU_07520" "BSU_07520" "BSU_08440" "BSU_08440"
[25] "BSU_08450" "BSU_08450" "BSU_08460" "BSU_08460" "BSU_10330" "BSU_10330" "BSU_32940" "BSU_32940"
[33] "BSU_33290" "BSU_33290" "BSU_33300" "BSU_33300" "BSU_33310" "BSU_33310" "BSU_33320" "BSU_33320"
[41] "BSU_39610" "BSU_39610"
Run test and inspect results
- run hypergeometric test using
hyperg
function from
Categories
result <- hyperg(
assayed,
significant,
universe,
representation = "over"
)
result <- mutate(result, across(everything(), unlist))
- expected result: all p-values are very similar (no enrichment of a
specific pathway)
head(result)
- what happens if one pathway has more than 10% sign. genes?
significant[[1]] <- sample(assayed[[1]], 20)
- run hypergeometric test again and inspect result
- the first (modified) pathway has lower p-value and higher odds
result2 <- hyperg(
assayed,
significant,
universe,
representation = "over"
) %>%
mutate(across(everything(), unlist))
head(result2)
P-value adjustment
- p-values from testing hundreds to thousands of pathways can be
highly inflated
- the chance of discovering false positive correlations increases with
the number of carried out tests
- to correct for multiple hypothesis testing, p-value adjustment is
often advisable
- simply speaking, p-value adjustment works by correcting the observed
p-value distribution depending on the number of tests that were carried
out
- various methods are available, differing in their
conservativeness
- a good option to start with is usually the Benjamini-Hochberg
method, implemented in R with
stats::p.adjust(p, method = "BH")
- the example below shows how p-values increase (become less
“significant”) after adjustment
result2 <- mutate(result2, p_adj = p.adjust(p, method = "BH"))
result2 %>%
ggplot(aes(x = p, y = p_adj)) +
geom_abline(slope = 1, intercept = 0, linetype = 2, color = grey(0.5)) +
geom_point() +
theme_light() +
theme(aspect.ratio = 1.0)

Interpretation
The output table from the gene enrichment function can be interpreted
as follows:
p
: column with un-adjusted p-values
odds
: intermediate statistic from hypergeometric test –
if we picture an urn containing genes from the universe, genes annotated
at a given category are white (W
) and the rest black
(B
), then odds ratio is
(W_drawn * (B - B_drawn)) / (B_drawn * (W - W_drawn))
expected
: the number of expected genes from the assayed
group
LS0tCnRpdGxlOiAiSHlwZXJnZW9tZXRyaWMgdGVzdCBpbiBSIgphdXRob3I6IE1pY2hhZWwgSmFobgpkYXRlOiAiYHIgZm9ybWF0KFN5cy50aW1lKCksICclZCAlQiwgJVknKWAiCm91dHB1dDoKICBodG1sX25vdGVib29rOgogICAgdGhlbWU6IGNvc21vCiAgICB0b2M6IG5vCiAgICBudW1iZXJfc2VjdGlvbnM6IG5vCiAgaHRtbF9kb2N1bWVudDoKICAgIHRvYzogbm8KICAgIGRmX3ByaW50OiBwYWdlZAotLS0KCmBgYHtyIHNldHVwLCBpbmNsdWRlPUZBTFNFfQprbml0cjo6b3B0c19jaHVuayRzZXQoZWNobyA9IFRSVUUpCmBgYAoKIyMgSHlwZXJnZW9tZXRyaWMgdGVzdAoKIyMjIEJhY2tncm91bmQKClRoZSBoeXBlcmdlb21ldHJpYyB0ZXN0IGlzIGEgc2ltcGxlIHN0YXRpc3RpY2FsIHByb2NlZHVyZSB0byB0ZXN0IGZvciBvdmVyLSBvciB1bmRlci1yZXByZXNlbnRhdGlvbiBvZiBpdGVtcyAoZ3JvdXBzKSBpbiBhIGRlZmluZWQgZ2xvYmFsIHNldC4gRm9yIGV4YW1wbGUsIGlmIGEgZ2xvYmFsIHNldCBjb25zaXN0cyBvZiA1MCBibGFjayBhbmQgNTAgd2hpdGUgYmFsbHMsIGFuZCAyMCBiYWxscyBhcmUgcmFuZG9tbHkgc2VsZWN0ZWQgZnJvbSB0aGlzIHNldCwgd2hhdCBpcyB0aGUgcHJvYmFiaWxpdHkgdGhhdCAxNyB3aGl0ZSBiYWxscyBhbmQgb25seSAzIGJsYWNrIGJhbGxzIGFyZSBjaG9zZW4/IFRoZSBoeXBlcmdlb21ldHJpYyB0ZXN0IGNhbiBhbnN3ZXIgc3VjaCBxdWVzdGlvbnMgYnkgdXNpbmcgdGhlIHByb2JhYmlsaXR5IGZ1bmN0aW9uIG9mIHRoZSBbaHlwZXJnZW9tZXRyaWMgZGlzdHJpYnV0aW9uXShodHRwczovL2VuLndpa2lwZWRpYS5vcmcvd2lraS9IeXBlcmdlb21ldHJpY19kaXN0cmlidXRpb24pLiBUaGlzIGRpc3RyaWJ1dGlvbiBpcyBoaWdobHkgcmVsYXRlZCB0byB0aGUgYmlub21pYWwgZGlzdHJpYnV0aW9uIHdoaWNoIHVzZXMgYSBzaW1pbGFyIGRlZmluaXRpb24gYnV0ICp3aXRoIHJlcGxhY2VtZW50KiwgaW5zdGVhZCBvZiAqd2l0aG91dCouCgpUaGlzIGRvY3VtZW50IHdhbGtzIHRocm91Z2ggb25lIGV4YW1wbGUgd2hlcmUgdGhlIGh5cGVyZ2VvbWV0cmljIHRlc3QgaXMgdXNlZCB0byBkZXRlcm1pbmUgZ2VuZSBlbnJpY2htZW50IGZvciBwYXRod2F5cyBvZiB0aGUgW3N1YnRpd2lraSBwYXRod2F5IGNhdGVnb3JpZXNdKGh0dHA6Ly93d3cuc3VidGl3aWtpLnVuaS1nb2V0dGluZ2VuLmRlL3YzL2NhdGVnb3J5LyBmb3IgdGhlIGJhY3Rlcml1bSAqQmFjaWxsdXMgc3VidGlsaXMqLgoKIyMjIExpYnJhcmllcyBhbmQgdGVzdCBkYXRhCgotIEluc3RhbGwgYENhdGVnb3J5YCBwYWNrYWdlIGZyb20gQmlvY29uZHVjdG9yCgpgYGB7ciwgZXZhbCA9IEZBTFNFfQpCaW9jTWFuYWdlcjo6aW5zdGFsbCgiQ2F0ZWdvcnkiKQpgYGAKCgotIGxvYWQgcmVxdWlyZWQgbGlicmFyaWVzCgpgYGB7cn0Kc3VwcHJlc3NQYWNrYWdlU3RhcnR1cE1lc3NhZ2VzKHsKICBsaWJyYXJ5KENhdGVnb3J5KQogIGxpYnJhcnkodGlkeXZlcnNlKQp9KQpgYGAKCi0gaW1wb3J0IGNhdGVnb3J5IHRlc3QgZGF0YSAoZnJvbSBzdWJ0aXdpa2kpCgpgYGB7cn0KZGZfc3VidGl3aWtpIDwtIHJlYWRfY3N2KCIuLi9kYXRhL2dlbmVDYXRlZ29yaWVzLTIwMjItMDktMDIuY3N2IiwKICBuYW1lX3JlcGFpciA9IGZ1bmN0aW9uKHgpIHN0cl9yZXBsYWNlX2FsbCh4LCAiICIsICJfIiksCiAgc2hvd19jb2xfdHlwZXMgPSBGQUxTRSkKCmhlYWQoZGZfc3VidGl3aWtpKQpgYGAKCi0gY29udmVydCBkYXRhZnJhbWUgaW50byBhIG5lc3RlZCBsaXN0IGZvcm1hdAoKYGBge3J9CnN1YnRpX2NhdGVnb3JpZXMgPC0gZGZfc3VidGl3aWtpICU+JQogIG11dGF0ZShjYXRlZ29yeSA9IHN0cl9zdWIoY2F0ZWdvcnksIDEsIDI1KSkgJT4lCiAgc2VsZWN0KGNhdGVnb3J5LCBnZW5lX2xvY3VzKSAlPiUKICBncm91cF9ieShjYXRlZ29yeSkgJT4lCiAgc3VtbWFyaXplKGdlbmVfbG9jdXMgPSBsaXN0KGdlbmVfbG9jdXMpKSAlPiUKICBkZWZyYW1lKCkKCmhlYWQoc3VidGlfY2F0ZWdvcmllcywgbiA9IDMpCmBgYAoKIyMjIFByZXBhcmUgaW5wdXQgZGF0YQoKLSAqKmFzc2F5ZWQqKjogZ2VuZSBzZXRzIHRoYXQgd2VyZSB0ZXN0ZWQgaW4gYW4gZXhwZXJpbWVudCwgdXN1YWxseSBhbGwgZ2VuZXMvcGF0aHdheXMgaW4gdGhlIHVuaXZlcnNlCi0gZXhhbXBsZTogd2Ugc2VsZWN0IGEgc3Vic2V0IG9mIDIwIHN1YnRpd2lraSBjYXRlZ29yaWVzIHdpdGggZWFjaCBtb3JlIHRoYW4gMjAgZ2VuZXMKCmBgYHtyfQphc3NheWVkIDwtIHN1YnRpX2NhdGVnb3JpZXNbbGFwcGx5KHN1YnRpX2NhdGVnb3JpZXMsIGxlbmd0aCkgPj0gMjBdWzE6MjBdCmBgYAoKLSAqKnNpZ25pZmljYW50Kio6IGdlbmVzIHdpdGhpbiB0aGUgYXNzYXllZCBzZXQgdGhhdCB3ZXJlIG9ic2VydmVkLCBvciBmYWxsIGJlbG93IGEgc2lnbi4gdGhyZXNob2xkCi0gZXhhbXBsZTogd2Ugc2VsZWN0IDEwJSBvZiBnZW5lcyBhcyBzaWduaWZpY2FudAoKYGBge3J9CnNpZ25pZmljYW50IDwtIGxhcHBseShhc3NheWVkLCBmdW5jdGlvbih4KSBzYW1wbGUoeCwgbGVuZ3RoKHgpLzEwKSkKYGBgCgotICoqdW5pdmVyc2UqKjogYWxsIHVuaXF1ZSBnZW5lIG5hbWVzCgpgYGB7cn0KdW5pdmVyc2UgPSBhc3NheWVkICU+JQogIHVubGlzdCh1c2UubmFtZXMgPSBGQUxTRSkgJT4lCiAgdW5pcXVlCmBgYAoKIyMjIFJ1biB0ZXN0IGFuZCBpbnNwZWN0IHJlc3VsdHMKCi0gcnVuIGh5cGVyZ2VvbWV0cmljIHRlc3QgdXNpbmcgYGh5cGVyZ2AgZnVuY3Rpb24gZnJvbSBgQ2F0ZWdvcmllc2AKCmBgYHtyfQpyZXN1bHQgPC0gaHlwZXJnKAogIGFzc2F5ZWQsCiAgc2lnbmlmaWNhbnQsCiAgdW5pdmVyc2UsCiAgcmVwcmVzZW50YXRpb24gPSAib3ZlciIKKQoKcmVzdWx0IDwtIG11dGF0ZShyZXN1bHQsIGFjcm9zcyhldmVyeXRoaW5nKCksIHVubGlzdCkpCmBgYAoKLSBleHBlY3RlZCByZXN1bHQ6IGFsbCBwLXZhbHVlcyBhcmUgdmVyeSBzaW1pbGFyIChubyBlbnJpY2htZW50IG9mIGEgc3BlY2lmaWMgcGF0aHdheSkKCmBgYHtyfQpoZWFkKHJlc3VsdCkKYGBgCgotIHdoYXQgaGFwcGVucyBpZiBvbmUgcGF0aHdheSBoYXMgbW9yZSB0aGFuIDEwJSBzaWduLiBnZW5lcz8KCmBgYHtyfQpzaWduaWZpY2FudFtbMV1dIDwtIHNhbXBsZShhc3NheWVkW1sxXV0sIDIwKQpgYGAKCi0gcnVuIGh5cGVyZ2VvbWV0cmljIHRlc3QgYWdhaW4gYW5kIGluc3BlY3QgcmVzdWx0Ci0gdGhlIGZpcnN0IChtb2RpZmllZCkgcGF0aHdheSBoYXMgbG93ZXIgcC12YWx1ZSBhbmQgaGlnaGVyIG9kZHMKCmBgYHtyfQpyZXN1bHQyIDwtIGh5cGVyZygKICAgIGFzc2F5ZWQsCiAgICBzaWduaWZpY2FudCwKICAgIHVuaXZlcnNlLAogICAgcmVwcmVzZW50YXRpb24gPSAib3ZlciIKICApICU+JSAKICBtdXRhdGUoYWNyb3NzKGV2ZXJ5dGhpbmcoKSwgdW5saXN0KSkKCmhlYWQocmVzdWx0MikKYGBgCgojIyMgUC12YWx1ZSBhZGp1c3RtZW50CgotIHAtdmFsdWVzIGZyb20gdGVzdGluZyBodW5kcmVkcyB0byB0aG91c2FuZHMgb2YgcGF0aHdheXMgY2FuIGJlIGhpZ2hseSBpbmZsYXRlZAotIHRoZSBjaGFuY2Ugb2YgZGlzY292ZXJpbmcgZmFsc2UgcG9zaXRpdmUgY29ycmVsYXRpb25zIGluY3JlYXNlcyB3aXRoIHRoZSBudW1iZXIgb2YgY2FycmllZCBvdXQgdGVzdHMKLSB0byBjb3JyZWN0IGZvciBtdWx0aXBsZSBoeXBvdGhlc2lzIHRlc3RpbmcsIHAtdmFsdWUgYWRqdXN0bWVudCBpcyBvZnRlbiBhZHZpc2FibGUKLSBzaW1wbHkgc3BlYWtpbmcsIHAtdmFsdWUgYWRqdXN0bWVudCB3b3JrcyBieSBjb3JyZWN0aW5nIHRoZSBvYnNlcnZlZCBwLXZhbHVlIGRpc3RyaWJ1dGlvbiBkZXBlbmRpbmcgb24gdGhlIG51bWJlciBvZiB0ZXN0cyB0aGF0IHdlcmUgY2FycmllZCBvdXQKLSB2YXJpb3VzIG1ldGhvZHMgYXJlIGF2YWlsYWJsZSwgZGlmZmVyaW5nIGluIHRoZWlyIGNvbnNlcnZhdGl2ZW5lc3MKLSBhIGdvb2Qgb3B0aW9uIHRvIHN0YXJ0IHdpdGggaXMgdXN1YWxseSB0aGUgQmVuamFtaW5pLUhvY2hiZXJnIG1ldGhvZCwgaW1wbGVtZW50ZWQgaW4gUiB3aXRoIGBzdGF0czo6cC5hZGp1c3QocCwgbWV0aG9kID0gIkJIIilgCi0gdGhlIGV4YW1wbGUgYmVsb3cgc2hvd3MgaG93IHAtdmFsdWVzIGluY3JlYXNlIChiZWNvbWUgbGVzcyAic2lnbmlmaWNhbnQiKSBhZnRlciBhZGp1c3RtZW50CgpgYGB7cn0KcmVzdWx0MiA8LSBtdXRhdGUocmVzdWx0MiwgcF9hZGogPSBwLmFkanVzdChwLCBtZXRob2QgPSAiQkgiKSkKCnJlc3VsdDIgJT4lCiAgZ2dwbG90KGFlcyh4ID0gcCwgeSA9IHBfYWRqKSkgKwogIGdlb21fYWJsaW5lKHNsb3BlID0gMSwgaW50ZXJjZXB0ID0gMCwgbGluZXR5cGUgPSAyLCBjb2xvciA9IGdyZXkoMC41KSkgKwogIGdlb21fcG9pbnQoKSArCiAgdGhlbWVfbGlnaHQoKSArCiAgdGhlbWUoYXNwZWN0LnJhdGlvID0gMS4wKQpgYGAKCiMjIyBJbnRlcnByZXRhdGlvbgoKVGhlIG91dHB1dCB0YWJsZSBmcm9tIHRoZSBnZW5lIGVucmljaG1lbnQgZnVuY3Rpb24gY2FuIGJlIGludGVycHJldGVkIGFzIGZvbGxvd3M6CgotIGBwYDogY29sdW1uIHdpdGggKnVuLWFkanVzdGVkKiBwLXZhbHVlcwotIGBvZGRzYDogaW50ZXJtZWRpYXRlIHN0YXRpc3RpYyBmcm9tIGh5cGVyZ2VvbWV0cmljIHRlc3QgLS0gaWYgd2UgcGljdHVyZSBhbiB1cm4gY29udGFpbmluZyBnZW5lcyBmcm9tIHRoZSB1bml2ZXJzZSwgZ2VuZXMgYW5ub3RhdGVkIGF0IGEgZ2l2ZW4gY2F0ZWdvcnkgYXJlIHdoaXRlIChgV2ApIGFuZCB0aGUgcmVzdCBibGFjayAoYEJgKSwgdGhlbiBvZGRzIHJhdGlvIGlzIGAoV19kcmF3biAqIChCIC0gQl9kcmF3bikpIC8gKEJfZHJhd24gKiAoVyAtIFdfZHJhd24pKWAKLSBgZXhwZWN0ZWRgOiB0aGUgbnVtYmVyIG9mIGV4cGVjdGVkIGdlbmVzIGZyb20gdGhlIGFzc2F5ZWQgZ3JvdXAK