Plot logos with python logomaker
¶
Description¶
This small notebook shows examples of how to use the python package logomaker
to plot different types of logos.
- 'Logos' are a type of visualization that shows the probability of certain bases at certain positions in a sequence
- the underlying data is a matrix of aligned DNA or RNA sequences, in fasta format or as a data frame
- Probability of enrichment or depletion of certain bases is visualized by magnified or shrinked one-letter symbols for each position
Requirements¶
- need to install
logomaker
e.g. with pip - optionally need
biopython
for sequence import and transformations - can set up virtual environment for local package installs
- or simply using
pip install
will install a package in the user's local site library
To install from a python chunk in a jupyter notebook, run:
import sys
!{sys.executable} -m pip install logomaker biopython requests
- import libraries
In [ ]:
import Bio.SeqIO
import Bio.SeqUtils
from Bio import motifs
from Bio.Seq import Seq
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from requests import get
from io import StringIO
import logomaker
import itertools
Data import¶
- first step is to import the data in fasta format
- use biopython to parse the fasta file
- extract the sequences only and convert to list
In [ ]:
path_fasta = ("../data/motif_alignments.fa")
fasta_db = Bio.SeqIO.parse(path_fasta, "fasta")
fasta_db = [str(i.seq) for i in fasta_db]
In [ ]:
fasta_db[0:5]
Out[ ]:
['AGAGAAGGAAAUAUCGAGAAAUGCCACAGGGAACUGUGAAGUGGUUCAAC', 'GCUUCAUGAUCGUCGGUUGCUCCCCAGCGGUCACCGGCGGUGACACCACC', 'UGUCCGGAGCAACCCACCACAUGCCAAGUCCCUCCGUCACCUCGCCGCAA', 'UCGUGGCGAGCGUGAACAUCAAGCCACUCGAGGACAAGAUCCUCGUUCAG', 'UCGCCAACUUCACCGUGGCGUCGACACCGCGCAUGUUCGACCGCCAGAGC']
- sequence data can be stored to a matrix of nucleotide (cols) X sequence positions (rows)
- values can be one of ‘counts’, ‘probability’, ‘weight’, or ‘information’
In [ ]:
logomaker.alignment_to_matrix(fasta_db, to_type="counts")[0:5]
/home/michael/micromamba/envs/jupyter/lib/python3.9/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)` counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
Out[ ]:
A | C | G | U | |
---|---|---|---|---|
pos | ||||
0 | 5.0 | 9.0 | 7.0 | 5.0 |
1 | 4.0 | 13.0 | 9.0 | 0.0 |
2 | 8.0 | 6.0 | 6.0 | 6.0 |
3 | 5.0 | 6.0 | 5.0 | 10.0 |
4 | 4.0 | 9.0 | 8.0 | 5.0 |
In [ ]:
logomaker.alignment_to_matrix(fasta_db, to_type="weight")[0:5]
/home/michael/micromamba/envs/jupyter/lib/python3.9/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)` counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
Out[ ]:
A | C | G | U | |
---|---|---|---|---|
pos | ||||
0 | -0.321928 | 0.415037 | 0.093109 | -0.321928 |
1 | -0.584963 | 0.900464 | 0.415037 | -2.906891 |
2 | 0.263034 | -0.099536 | -0.099536 | -0.099536 |
3 | -0.321928 | -0.099536 | -0.321928 | 0.552541 |
4 | -0.584963 | 0.415037 | 0.263034 | -0.321928 |
- supply a custom background matrix in order to correct for the GC content of your organism
- example: Streptococcus pyogenes has GC content 39%, AT content 61%: probability for A and T: 0.305, and for G and C: 0.195
- compare the following computed weights with the previous without GC correction
In [ ]:
array_bg = [0.305, 0.195, 0.195, 0.305]
logomaker.alignment_to_matrix(
fasta_db, to_type="weight", background=np.array([0.305, 0.195, 0.195, 0.305])
)[0:5]
/home/michael/micromamba/envs/jupyter/lib/python3.9/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)` counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
Out[ ]:
A | C | G | U | |
---|---|---|---|---|
pos | ||||
0 | -0.608809 | 0.773491 | 0.451563 | -0.608809 |
1 | -0.871844 | 1.258918 | 0.773491 | -3.193772 |
2 | -0.023847 | 0.258918 | 0.258918 | -0.386417 |
3 | -0.608809 | 0.258918 | 0.036526 | 0.265660 |
4 | -0.871844 | 0.773491 | 0.621488 | -0.608809 |
Logo plots¶
- logo plot depends on the type of input
- default for centered logos can be
weight
- make a generalized function for plotting
In [ ]:
def plot_logo(data, to_type="weight", background=None,
ignore="", title="", ylim=[-6, 2],
colors = "classic", vpad=0, figsize = [10,3]):
data_logo = logomaker.alignment_to_matrix(
data, to_type=to_type, background=background,
characters_to_ignore=ignore
)
data_logo.index = data_logo.index + 1
title = "{0} (n = {1})".format(title, len(data))
logo = logomaker.Logo(data_logo,
fade_below=0.33, flip_below = False,
color_scheme=colors,
vpad=vpad, figsize = figsize)
logo.ax.set_ylim(ylim)
logo.ax.text(0.5, ylim[1] + 0.1, title, fontsize=14)
logo.ax.set_xticks(np.arange(2, len(data[0]) + 1, 2))
In [ ]:
plot_logo(fasta_db, title="all cleavage sites")
/home/michael/micromamba/envs/jupyter/lib/python3.9/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)` counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
In [ ]:
# define custom colors for the four bases
colors = {"A": "#E7298A", "C": "#66A61E", "G": "#E6AB02", "U": "#999999"}
plot_logo(fasta_db, title="all cleavage sites", colors=colors)
/home/michael/micromamba/envs/jupyter/lib/python3.9/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)` counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
In [ ]:
# use a custom background array for A, C, G, U
plot_logo(fasta_db, title="all cleavage sites",
background=np.array([0.305, 0.195, 0.195, 0.305]),
colors=colors
)
/home/michael/micromamba/envs/jupyter/lib/python3.9/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)` counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
In [ ]:
# use probability instead of weight as alternative
plot_logo(fasta_db, title="all cleavage sites", colors=colors,
to_type="probability", ylim=[0, 1])
/home/michael/micromamba/envs/jupyter/lib/python3.9/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)` counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
In [ ]:
# use information instead of weight as alternative
plot_logo(fasta_db, title="all cleavage sites", colors=colors,
to_type="information", ylim=[0, 1])
/home/michael/micromamba/envs/jupyter/lib/python3.9/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)` counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
In [ ]:
# leave more space between letters: vpad parameter between 0 and 1
plot_logo(fasta_db, title="all cleavage sites", colors=colors, vpad = 0.5)
/home/michael/micromamba/envs/jupyter/lib/python3.9/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)` counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
In [ ]:
# change size and xport figure
plot_logo(fasta_db, title="all cleavage sites", colors=colors, figsize = [7.0,3.0])
plt.savefig("../output/sequence_logo.svg")
/home/michael/micromamba/envs/jupyter/lib/python3.9/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)` counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
Extract sequence probabilities¶
Variant 1¶
- from the probability matrix, we can extract the consensus sequence (best hit per position)
- we also want to extract the N next best sequences, for a given window
- to this end use again probability matrix with background correction
- compute all possible k-mers (itertools)
In [ ]:
# compute k-mers, for example with length = 5
kmer_length = 5
kmers = [''.join(p) for p in itertools.product(['G', 'A', 'U', 'C'], repeat=kmer_length)]
print("A total of {0} k-mers with length {1} have been computed.".format(len(kmers), kmer_length))
# make data frame with probabilities
df_prob = logomaker.alignment_to_matrix(
fasta_db, to_type="probability", background=np.array([0.305, 0.195, 0.195, 0.305])
)
# select a window of length 5, here the interesting region with CA motif
df_prob = df_prob.iloc[23:28].reset_index()
A total of 1024 k-mers with length 5 have been computed.
/home/michael/micromamba/envs/jupyter/lib/python3.9/site-packages/logomaker/src/matrix.py:584: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)` counts_df.loc[:, c] = tmp_mat.sum(axis=0).T
- then for each k-mer, fetch proability of respective nt per position
- sort by sum, or mean of probability to get best average score
- print top hits
In [ ]:
# extract nucleotide probabilities and make new df
result = {}
for kmer in kmers:
result[kmer] = [df_prob.at[pos, kmer[pos]] for pos in range(0, len(kmer))]
result = (
pd.DataFrame(result)
.melt()
.groupby(by="variable", as_index=False, sort=False)
.agg(["sum", "mean", "median"])
.value
.sort_values("sum", ascending=False)
)
result["rank"] = result["sum"].rank(ascending=False)
# print top 10 hits
result.head(10)
Out[ ]:
sum | mean | median | rank | |
---|---|---|---|---|
variable | ||||
CCACG | 2.500000 | 0.500000 | 0.400000 | 1.0 |
ACACG | 2.466667 | 0.493333 | 0.400000 | 2.0 |
CCACC | 2.400000 | 0.480000 | 0.400000 | 3.0 |
CCAGG | 2.400000 | 0.480000 | 0.366667 | 4.0 |
ACAGG | 2.366667 | 0.473333 | 0.366667 | 6.0 |
ACACC | 2.366667 | 0.473333 | 0.400000 | 6.0 |
CCACA | 2.366667 | 0.473333 | 0.400000 | 6.0 |
ACACA | 2.333333 | 0.466667 | 0.400000 | 8.0 |
UCACG | 2.300000 | 0.460000 | 0.400000 | 9.5 |
CCAGC | 2.300000 | 0.460000 | 0.366667 | 9.5 |
Variant 2¶
- we use the set of fasta sequences from before and turn it into a
Bio.motifs
class - can find consensus, which should be the same as the top hits from logo
- then we map this object to arbitrary sequences to find best N hits
- this strategy is more flexible and allows comparison with new sequences, instead of relying on the predefined window
In [ ]:
fasta_dna = [i.replace("U", "T") for i in fasta_db]
motif = motifs.create([Seq(i[23:28]) for i in fasta_dna], alphabet="ACGT")
motif.consensus
Out[ ]:
Seq('CCACG')
- the degenerate consensus masks ambiguous positions with
N
In [ ]:
motif.degenerate_consensus
Out[ ]:
Seq('NCANN')
- searching for instances : the most frequent use for a motif is to find its instances in a reference sequence
- for the reference we simply concenate all entries from making the motif to a single sequence
In [ ]:
ref_seq=Seq("".join(fasta_dna))
print(f"Length of reference sequence: {len(ref_seq)}")
ref_seq
Length of reference sequence: 1300
Out[ ]:
Seq('AGAGAAGGAAATATCGAGAAATGCCACAGGGAACTGTGAAGTGGTTCAACGCTT...GTG')
In [ ]:
result2 = pd.DataFrame(columns=(["start", "end", "strand", "seq", "score"]))
for position, score in motif.pssm.search(ref_seq, threshold=4.0):
result2 = pd.concat(
[
result2,
pd.DataFrame({
"start": position,
"end": position + kmer_length,
"strand": "+" if position > 0 else "-",
"seq": str(ref_seq[position:position + kmer_length]),
"score": score,
}, index = [1]),
]
).sort_values("score", ascending=False)
result2["rank"] = result2["score"].rank(ascending=False)
# print top 10 hits
result2.head(10)
Out[ ]:
start | end | strand | seq | score | rank | |
---|---|---|---|---|---|---|
1 | -1149 | -1144 | - | CGTGG | 5.163332 | 3.0 |
1 | -1087 | -1082 | - | CGTGG | 5.163332 | 3.0 |
1 | 1093 | 1098 | + | CCACG | 5.163332 | 3.0 |
1 | -95 | -90 | - | CGTGG | 5.163332 | 3.0 |
1 | -724 | -719 | - | CGTGG | 5.163332 | 3.0 |
1 | 598 | 603 | + | ACACG | 5.011329 | 7.5 |
1 | -778 | -773 | - | CGTGT | 5.011329 | 7.5 |
1 | 1073 | 1078 | + | ACACG | 5.011329 | 7.5 |
1 | -645 | -640 | - | CGTGT | 5.011329 | 7.5 |
1 | 623 | 628 | + | CCAGG | 4.703900 | 12.5 |
Session info¶
In [ ]:
!pip freeze
anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1688651106312/work/dist argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1692818318753/work argon2-cffi-bindings @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi-bindings_1649500321618/work asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1667325728359/work attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1683424013410/work backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1618230623929/work beautifulsoup4 @ file:///home/conda/feedstock_root/build_artifacts/beautifulsoup4_1680888073205/work biopython==1.79 black==22.10.0 bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1696630167146/work certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1700303426725/work/certifi cffi @ file:///tmp/abs_98z5h56wf8/croots/recipe/cffi_1659598650955/work chardet==4.0.0 charset-normalizer==3.0.1 click==8.1.3 contourpy==1.0.6 cycler==0.11.0 debugpy @ file:///tmp/build/80754af9/debugpy_1637091799509/work decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work defusedxml @ file:///home/conda/feedstock_root/build_artifacts/defusedxml_1615232257335/work entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1700579780973/work executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1667317341051/work fastjsonschema @ file:///home/conda/feedstock_root/build_artifacts/python-fastjsonschema_1700055509243/work/dist fonttools==4.38.0 idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1701026962277/work importlib-metadata==7.0.0 importlib-resources @ file:///home/conda/feedstock_root/build_artifacts/importlib_resources_1699364556997/work ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1667262163608/work ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1667140637743/work ipython-genutils==0.2.0 isodate==0.6.1 jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1659959867326/work Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1654302431367/work jsonschema @ file:///home/conda/feedstock_root/build_artifacts/jsonschema-meta_1700159890288/work jsonschema-specifications @ file:///home/conda/feedstock_root/build_artifacts/jsonschema-specifications_1701365715051/work jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1633454794268/work jupyter-server @ file:///croot/jupyter_server_1671707632269/work jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1667424442212/work jupyterlab_pygments @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_pygments_1700744013163/work kiwisolver==1.4.4 logomaker==0.8 MarkupSafe @ file:///opt/conda/conda-bld/markupsafe_1654597864307/work matplotlib==3.6.2 matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1660814786464/work mistune @ file:///home/conda/feedstock_root/build_artifacts/mistune_1698947099619/work mypy-extensions==0.4.3 nbclassic @ file:///home/conda/feedstock_root/build_artifacts/nbclassic_1683202081046/work nbclient @ file:///home/conda/feedstock_root/build_artifacts/nbclient_1662750566673/work nbconvert==7.12.0 nbformat @ file:///home/conda/feedstock_root/build_artifacts/nbformat_1690814868471/work nest-asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1664684991461/work notebook @ file:///home/conda/feedstock_root/build_artifacts/notebook_1695225629675/work notebook_shim @ file:///home/conda/feedstock_root/build_artifacts/notebook-shim_1682360583588/work numpy==1.23.4 packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1637239678211/work pandas==1.5.1 pandocfilters @ file:///home/conda/feedstock_root/build_artifacts/pandocfilters_1631603243851/work parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work pathspec==0.10.2 pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1667297516076/work pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work Pillow==9.3.0 pkgutil_resolve_name @ file:///home/conda/feedstock_root/build_artifacts/pkgutil-resolve-name_1694617248815/work platformdirs==2.5.4 prometheus-client @ file:///home/conda/feedstock_root/build_artifacts/prometheus_client_1700579315247/work prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1667565496306/work psutil @ file:///opt/conda/conda-bld/psutil_1656431268089/work ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1660666458521/work pyopenproject==0.7.4 pyparsing @ file:///home/conda/feedstock_root/build_artifacts/pyparsing_1652235407899/work python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work pytz==2022.6 PyYAML==5.3.1 pyzmq==19.0.2 referencing @ file:///home/conda/feedstock_root/build_artifacts/referencing_1701974441252/work requests==2.25.1 rpds-py @ file:///croot/rpds-py_1698945930462/work Send2Trash @ file:///home/conda/feedstock_root/build_artifacts/send2trash_1682601222253/work six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work sniffio @ file:///home/conda/feedstock_root/build_artifacts/sniffio_1662051266223/work soupsieve @ file:///home/conda/feedstock_root/build_artifacts/soupsieve_1693929250441/work stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1667334518791/work terminado @ file:///home/conda/feedstock_root/build_artifacts/terminado_1699810101464/work tinycss2 @ file:///home/conda/feedstock_root/build_artifacts/tinycss2_1666100256010/work tomli==2.0.1 tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1648827245914/work traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1666115969632/work typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1702176139754/work urllib3==1.26.14 wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1600965781394/work webencodings @ file:///home/conda/feedstock_root/build_artifacts/webencodings_1694681268211/work websocket-client @ file:///home/conda/feedstock_root/build_artifacts/websocket-client_1701630677416/work zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1695255097490/work