Hypergeometric test

Background

The hypergeometric test is a simple statistical procedure to test for over- or under-representation of items (groups) in a defined global set. For example, if a global set consists of 50 black and 50 white balls, and 20 balls are randomly selected from this set, what is the probability that 17 white balls and only 3 black balls are chosen? The hypergeometric test can answer such questions by using the probability function of the hypergeometric distribution. This distribution is highly related to the binomial distribution which uses a similar definition but with replacement, instead of without.

This document walks through one example where the hypergeometric test is used to determine gene enrichment for pathways of the [subtiwiki pathway categories](http://www.subtiwiki.uni-goettingen.de/v3/category/ for the bacterium Bacillus subtilis.

Libraries and test data

  • Install Category package from Bioconductor
BiocManager::install("Category")
  • load required libraries
suppressPackageStartupMessages({
  library(Category)
  library(tidyverse)
})
  • import category test data (from subtiwiki)
df_subtiwiki <- read_csv("../data/geneCategories-2022-09-02.csv",
  name_repair = function(x) str_replace_all(x, " ", "_"),
  show_col_types = FALSE)

head(df_subtiwiki)
  • convert dataframe into a nested list format
subti_categories <- df_subtiwiki %>%
  mutate(category = str_sub(category, 1, 25)) %>%
  select(category, gene_locus) %>%
  group_by(category) %>%
  summarize(gene_locus = list(gene_locus)) %>%
  deframe()

head(subti_categories, n = 3)
$`6S RNA`
[1] "BSU_misc_RNA_32" "BSU_misc_RNA_41"

$`A/P endonucleases`
[1] "BSU_22340" "BSU_25130" "BSU_40880"

$`ABC transporters for the `
 [1] "BSU_01610" "BSU_01610" "BSU_01620" "BSU_01620" "BSU_01630" "BSU_01630" "BSU_03800" "BSU_03800"
 [9] "BSU_03810" "BSU_03810" "BSU_03820" "BSU_03820" "BSU_03830" "BSU_03830" "BSU_07490" "BSU_07490"
[17] "BSU_07500" "BSU_07500" "BSU_07510" "BSU_07510" "BSU_07520" "BSU_07520" "BSU_08440" "BSU_08440"
[25] "BSU_08450" "BSU_08450" "BSU_08460" "BSU_08460" "BSU_10330" "BSU_10330" "BSU_32940" "BSU_32940"
[33] "BSU_33290" "BSU_33290" "BSU_33300" "BSU_33300" "BSU_33310" "BSU_33310" "BSU_33320" "BSU_33320"
[41] "BSU_39610" "BSU_39610"

Prepare input data

  • assayed: gene sets that were tested in an experiment, usually all genes/pathways in the universe
  • example: we select a subset of 20 subtiwiki categories with each more than 20 genes
assayed <- subti_categories[lapply(subti_categories, length) >= 20][1:20]
  • significant: genes within the assayed set that were observed, or fall below a sign. threshold
  • example: we select 10% of genes as significant
significant <- lapply(assayed, function(x) sample(x, length(x)/10))
  • universe: all unique gene names
universe = assayed %>%
  unlist(use.names = FALSE) %>%
  unique

Run test and inspect results

  • run hypergeometric test using hyperg function from Categories
result <- hyperg(
  assayed,
  significant,
  universe,
  representation = "over"
)

result <- mutate(result, across(everything(), unlist))
  • expected result: all p-values are very similar (no enrichment of a specific pathway)
head(result)
  • what happens if one pathway has more than 10% sign. genes?
significant[[1]] <- sample(assayed[[1]], 20)
  • run hypergeometric test again and inspect result
  • the first (modified) pathway has lower p-value and higher odds
result2 <- hyperg(
    assayed,
    significant,
    universe,
    representation = "over"
  ) %>% 
  mutate(across(everything(), unlist))

head(result2)

P-value adjustment

  • p-values from testing hundreds to thousands of pathways can be highly inflated
  • the chance of discovering false positive correlations increases with the number of carried out tests
  • to correct for multiple hypothesis testing, p-value adjustment is often advisable
  • simply speaking, p-value adjustment works by correcting the observed p-value distribution depending on the number of tests that were carried out
  • various methods are available, differing in their conservativeness
  • a good option to start with is usually the Benjamini-Hochberg method, implemented in R with stats::p.adjust(p, method = "BH")
  • the example below shows how p-values increase (become less “significant”) after adjustment
result2 <- mutate(result2, p_adj = p.adjust(p, method = "BH"))

result2 %>%
  ggplot(aes(x = p, y = p_adj)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = grey(0.5)) +
  geom_point() +
  theme_light() +
  theme(aspect.ratio = 1.0)

Interpretation

The output table from the gene enrichment function can be interpreted as follows:

  • p: column with un-adjusted p-values
  • odds: intermediate statistic from hypergeometric test – if we picture an urn containing genes from the universe, genes annotated at a given category are white (W) and the rest black (B), then odds ratio is (W_drawn * (B - B_drawn)) / (B_drawn * (W - W_drawn))
  • expected: the number of expected genes from the assayed group
LS0tCnRpdGxlOiAiSHlwZXJnZW9tZXRyaWMgdGVzdCBpbiBSIgphdXRob3I6IE1pY2hhZWwgSmFobgpkYXRlOiAiYHIgZm9ybWF0KFN5cy50aW1lKCksICclZCAlQiwgJVknKWAiCm91dHB1dDoKICBodG1sX25vdGVib29rOgogICAgdGhlbWU6IGNvc21vCiAgICB0b2M6IG5vCiAgICBudW1iZXJfc2VjdGlvbnM6IG5vCiAgaHRtbF9kb2N1bWVudDoKICAgIHRvYzogbm8KICAgIGRmX3ByaW50OiBwYWdlZAotLS0KCmBgYHtyIHNldHVwLCBpbmNsdWRlPUZBTFNFfQprbml0cjo6b3B0c19jaHVuayRzZXQoZWNobyA9IFRSVUUpCmBgYAoKIyMgSHlwZXJnZW9tZXRyaWMgdGVzdAoKIyMjIEJhY2tncm91bmQKClRoZSBoeXBlcmdlb21ldHJpYyB0ZXN0IGlzIGEgc2ltcGxlIHN0YXRpc3RpY2FsIHByb2NlZHVyZSB0byB0ZXN0IGZvciBvdmVyLSBvciB1bmRlci1yZXByZXNlbnRhdGlvbiBvZiBpdGVtcyAoZ3JvdXBzKSBpbiBhIGRlZmluZWQgZ2xvYmFsIHNldC4gRm9yIGV4YW1wbGUsIGlmIGEgZ2xvYmFsIHNldCBjb25zaXN0cyBvZiA1MCBibGFjayBhbmQgNTAgd2hpdGUgYmFsbHMsIGFuZCAyMCBiYWxscyBhcmUgcmFuZG9tbHkgc2VsZWN0ZWQgZnJvbSB0aGlzIHNldCwgd2hhdCBpcyB0aGUgcHJvYmFiaWxpdHkgdGhhdCAxNyB3aGl0ZSBiYWxscyBhbmQgb25seSAzIGJsYWNrIGJhbGxzIGFyZSBjaG9zZW4/IFRoZSBoeXBlcmdlb21ldHJpYyB0ZXN0IGNhbiBhbnN3ZXIgc3VjaCBxdWVzdGlvbnMgYnkgdXNpbmcgdGhlIHByb2JhYmlsaXR5IGZ1bmN0aW9uIG9mIHRoZSBbaHlwZXJnZW9tZXRyaWMgZGlzdHJpYnV0aW9uXShodHRwczovL2VuLndpa2lwZWRpYS5vcmcvd2lraS9IeXBlcmdlb21ldHJpY19kaXN0cmlidXRpb24pLiBUaGlzIGRpc3RyaWJ1dGlvbiBpcyBoaWdobHkgcmVsYXRlZCB0byB0aGUgYmlub21pYWwgZGlzdHJpYnV0aW9uIHdoaWNoIHVzZXMgYSBzaW1pbGFyIGRlZmluaXRpb24gYnV0ICp3aXRoIHJlcGxhY2VtZW50KiwgaW5zdGVhZCBvZiAqd2l0aG91dCouCgpUaGlzIGRvY3VtZW50IHdhbGtzIHRocm91Z2ggb25lIGV4YW1wbGUgd2hlcmUgdGhlIGh5cGVyZ2VvbWV0cmljIHRlc3QgaXMgdXNlZCB0byBkZXRlcm1pbmUgZ2VuZSBlbnJpY2htZW50IGZvciBwYXRod2F5cyBvZiB0aGUgW3N1YnRpd2lraSBwYXRod2F5IGNhdGVnb3JpZXNdKGh0dHA6Ly93d3cuc3VidGl3aWtpLnVuaS1nb2V0dGluZ2VuLmRlL3YzL2NhdGVnb3J5LyBmb3IgdGhlIGJhY3Rlcml1bSAqQmFjaWxsdXMgc3VidGlsaXMqLgoKIyMjIExpYnJhcmllcyBhbmQgdGVzdCBkYXRhCgotIEluc3RhbGwgYENhdGVnb3J5YCBwYWNrYWdlIGZyb20gQmlvY29uZHVjdG9yCgpgYGB7ciwgZXZhbCA9IEZBTFNFfQpCaW9jTWFuYWdlcjo6aW5zdGFsbCgiQ2F0ZWdvcnkiKQpgYGAKCgotIGxvYWQgcmVxdWlyZWQgbGlicmFyaWVzCgpgYGB7cn0Kc3VwcHJlc3NQYWNrYWdlU3RhcnR1cE1lc3NhZ2VzKHsKICBsaWJyYXJ5KENhdGVnb3J5KQogIGxpYnJhcnkodGlkeXZlcnNlKQp9KQpgYGAKCi0gaW1wb3J0IGNhdGVnb3J5IHRlc3QgZGF0YSAoZnJvbSBzdWJ0aXdpa2kpCgpgYGB7cn0KZGZfc3VidGl3aWtpIDwtIHJlYWRfY3N2KCIuLi9kYXRhL2dlbmVDYXRlZ29yaWVzLTIwMjItMDktMDIuY3N2IiwKICBuYW1lX3JlcGFpciA9IGZ1bmN0aW9uKHgpIHN0cl9yZXBsYWNlX2FsbCh4LCAiICIsICJfIiksCiAgc2hvd19jb2xfdHlwZXMgPSBGQUxTRSkKCmhlYWQoZGZfc3VidGl3aWtpKQpgYGAKCi0gY29udmVydCBkYXRhZnJhbWUgaW50byBhIG5lc3RlZCBsaXN0IGZvcm1hdAoKYGBge3J9CnN1YnRpX2NhdGVnb3JpZXMgPC0gZGZfc3VidGl3aWtpICU+JQogIG11dGF0ZShjYXRlZ29yeSA9IHN0cl9zdWIoY2F0ZWdvcnksIDEsIDI1KSkgJT4lCiAgc2VsZWN0KGNhdGVnb3J5LCBnZW5lX2xvY3VzKSAlPiUKICBncm91cF9ieShjYXRlZ29yeSkgJT4lCiAgc3VtbWFyaXplKGdlbmVfbG9jdXMgPSBsaXN0KGdlbmVfbG9jdXMpKSAlPiUKICBkZWZyYW1lKCkKCmhlYWQoc3VidGlfY2F0ZWdvcmllcywgbiA9IDMpCmBgYAoKIyMjIFByZXBhcmUgaW5wdXQgZGF0YQoKLSAqKmFzc2F5ZWQqKjogZ2VuZSBzZXRzIHRoYXQgd2VyZSB0ZXN0ZWQgaW4gYW4gZXhwZXJpbWVudCwgdXN1YWxseSBhbGwgZ2VuZXMvcGF0aHdheXMgaW4gdGhlIHVuaXZlcnNlCi0gZXhhbXBsZTogd2Ugc2VsZWN0IGEgc3Vic2V0IG9mIDIwIHN1YnRpd2lraSBjYXRlZ29yaWVzIHdpdGggZWFjaCBtb3JlIHRoYW4gMjAgZ2VuZXMKCmBgYHtyfQphc3NheWVkIDwtIHN1YnRpX2NhdGVnb3JpZXNbbGFwcGx5KHN1YnRpX2NhdGVnb3JpZXMsIGxlbmd0aCkgPj0gMjBdWzE6MjBdCmBgYAoKLSAqKnNpZ25pZmljYW50Kio6IGdlbmVzIHdpdGhpbiB0aGUgYXNzYXllZCBzZXQgdGhhdCB3ZXJlIG9ic2VydmVkLCBvciBmYWxsIGJlbG93IGEgc2lnbi4gdGhyZXNob2xkCi0gZXhhbXBsZTogd2Ugc2VsZWN0IDEwJSBvZiBnZW5lcyBhcyBzaWduaWZpY2FudAoKYGBge3J9CnNpZ25pZmljYW50IDwtIGxhcHBseShhc3NheWVkLCBmdW5jdGlvbih4KSBzYW1wbGUoeCwgbGVuZ3RoKHgpLzEwKSkKYGBgCgotICoqdW5pdmVyc2UqKjogYWxsIHVuaXF1ZSBnZW5lIG5hbWVzCgpgYGB7cn0KdW5pdmVyc2UgPSBhc3NheWVkICU+JQogIHVubGlzdCh1c2UubmFtZXMgPSBGQUxTRSkgJT4lCiAgdW5pcXVlCmBgYAoKIyMjIFJ1biB0ZXN0IGFuZCBpbnNwZWN0IHJlc3VsdHMKCi0gcnVuIGh5cGVyZ2VvbWV0cmljIHRlc3QgdXNpbmcgYGh5cGVyZ2AgZnVuY3Rpb24gZnJvbSBgQ2F0ZWdvcmllc2AKCmBgYHtyfQpyZXN1bHQgPC0gaHlwZXJnKAogIGFzc2F5ZWQsCiAgc2lnbmlmaWNhbnQsCiAgdW5pdmVyc2UsCiAgcmVwcmVzZW50YXRpb24gPSAib3ZlciIKKQoKcmVzdWx0IDwtIG11dGF0ZShyZXN1bHQsIGFjcm9zcyhldmVyeXRoaW5nKCksIHVubGlzdCkpCmBgYAoKLSBleHBlY3RlZCByZXN1bHQ6IGFsbCBwLXZhbHVlcyBhcmUgdmVyeSBzaW1pbGFyIChubyBlbnJpY2htZW50IG9mIGEgc3BlY2lmaWMgcGF0aHdheSkKCmBgYHtyfQpoZWFkKHJlc3VsdCkKYGBgCgotIHdoYXQgaGFwcGVucyBpZiBvbmUgcGF0aHdheSBoYXMgbW9yZSB0aGFuIDEwJSBzaWduLiBnZW5lcz8KCmBgYHtyfQpzaWduaWZpY2FudFtbMV1dIDwtIHNhbXBsZShhc3NheWVkW1sxXV0sIDIwKQpgYGAKCi0gcnVuIGh5cGVyZ2VvbWV0cmljIHRlc3QgYWdhaW4gYW5kIGluc3BlY3QgcmVzdWx0Ci0gdGhlIGZpcnN0IChtb2RpZmllZCkgcGF0aHdheSBoYXMgbG93ZXIgcC12YWx1ZSBhbmQgaGlnaGVyIG9kZHMKCmBgYHtyfQpyZXN1bHQyIDwtIGh5cGVyZygKICAgIGFzc2F5ZWQsCiAgICBzaWduaWZpY2FudCwKICAgIHVuaXZlcnNlLAogICAgcmVwcmVzZW50YXRpb24gPSAib3ZlciIKICApICU+JSAKICBtdXRhdGUoYWNyb3NzKGV2ZXJ5dGhpbmcoKSwgdW5saXN0KSkKCmhlYWQocmVzdWx0MikKYGBgCgojIyMgUC12YWx1ZSBhZGp1c3RtZW50CgotIHAtdmFsdWVzIGZyb20gdGVzdGluZyBodW5kcmVkcyB0byB0aG91c2FuZHMgb2YgcGF0aHdheXMgY2FuIGJlIGhpZ2hseSBpbmZsYXRlZAotIHRoZSBjaGFuY2Ugb2YgZGlzY292ZXJpbmcgZmFsc2UgcG9zaXRpdmUgY29ycmVsYXRpb25zIGluY3JlYXNlcyB3aXRoIHRoZSBudW1iZXIgb2YgY2FycmllZCBvdXQgdGVzdHMKLSB0byBjb3JyZWN0IGZvciBtdWx0aXBsZSBoeXBvdGhlc2lzIHRlc3RpbmcsIHAtdmFsdWUgYWRqdXN0bWVudCBpcyBvZnRlbiBhZHZpc2FibGUKLSBzaW1wbHkgc3BlYWtpbmcsIHAtdmFsdWUgYWRqdXN0bWVudCB3b3JrcyBieSBjb3JyZWN0aW5nIHRoZSBvYnNlcnZlZCBwLXZhbHVlIGRpc3RyaWJ1dGlvbiBkZXBlbmRpbmcgb24gdGhlIG51bWJlciBvZiB0ZXN0cyB0aGF0IHdlcmUgY2FycmllZCBvdXQKLSB2YXJpb3VzIG1ldGhvZHMgYXJlIGF2YWlsYWJsZSwgZGlmZmVyaW5nIGluIHRoZWlyIGNvbnNlcnZhdGl2ZW5lc3MKLSBhIGdvb2Qgb3B0aW9uIHRvIHN0YXJ0IHdpdGggaXMgdXN1YWxseSB0aGUgQmVuamFtaW5pLUhvY2hiZXJnIG1ldGhvZCwgaW1wbGVtZW50ZWQgaW4gUiB3aXRoIGBzdGF0czo6cC5hZGp1c3QocCwgbWV0aG9kID0gIkJIIilgCi0gdGhlIGV4YW1wbGUgYmVsb3cgc2hvd3MgaG93IHAtdmFsdWVzIGluY3JlYXNlIChiZWNvbWUgbGVzcyAic2lnbmlmaWNhbnQiKSBhZnRlciBhZGp1c3RtZW50CgpgYGB7cn0KcmVzdWx0MiA8LSBtdXRhdGUocmVzdWx0MiwgcF9hZGogPSBwLmFkanVzdChwLCBtZXRob2QgPSAiQkgiKSkKCnJlc3VsdDIgJT4lCiAgZ2dwbG90KGFlcyh4ID0gcCwgeSA9IHBfYWRqKSkgKwogIGdlb21fYWJsaW5lKHNsb3BlID0gMSwgaW50ZXJjZXB0ID0gMCwgbGluZXR5cGUgPSAyLCBjb2xvciA9IGdyZXkoMC41KSkgKwogIGdlb21fcG9pbnQoKSArCiAgdGhlbWVfbGlnaHQoKSArCiAgdGhlbWUoYXNwZWN0LnJhdGlvID0gMS4wKQpgYGAKCiMjIyBJbnRlcnByZXRhdGlvbgoKVGhlIG91dHB1dCB0YWJsZSBmcm9tIHRoZSBnZW5lIGVucmljaG1lbnQgZnVuY3Rpb24gY2FuIGJlIGludGVycHJldGVkIGFzIGZvbGxvd3M6CgotIGBwYDogY29sdW1uIHdpdGggKnVuLWFkanVzdGVkKiBwLXZhbHVlcwotIGBvZGRzYDogaW50ZXJtZWRpYXRlIHN0YXRpc3RpYyBmcm9tIGh5cGVyZ2VvbWV0cmljIHRlc3QgLS0gaWYgd2UgcGljdHVyZSBhbiB1cm4gY29udGFpbmluZyBnZW5lcyBmcm9tIHRoZSB1bml2ZXJzZSwgZ2VuZXMgYW5ub3RhdGVkIGF0IGEgZ2l2ZW4gY2F0ZWdvcnkgYXJlIHdoaXRlIChgV2ApIGFuZCB0aGUgcmVzdCBibGFjayAoYEJgKSwgdGhlbiBvZGRzIHJhdGlvIGlzIGAoV19kcmF3biAqIChCIC0gQl9kcmF3bikpIC8gKEJfZHJhd24gKiAoVyAtIFdfZHJhd24pKWAKLSBgZXhwZWN0ZWRgOiB0aGUgbnVtYmVyIG9mIGV4cGVjdGVkIGdlbmVzIGZyb20gdGhlIGFzc2F5ZWQgZ3JvdXAK