MPUSP/snakemake-ms-proteomics

Pipeline for automatic processing and quality control of mass spectrometry data

Overview

Last update: 2026-01-09

Latest release: v1.0.0

Topics: bioinformatics conda mass-spectrometry pipeline proteomics snakemake snakemake-workflow workflow

Authors: @m-jahn

Configuration

The following configuration details are extracted from the config's README file.


snakemake-ms-proteomics

This workflow is a best-practice workflow for the automated analysis of mass spectrometry proteomics data. It currently supports data-dependent acquisition (DDA) data with label-free quantification. Support for additional workflows (DIA, isotope labeling) is planned. The workflow is mainly a wrapper for the excellent tools fragpipe and MSstats, with additional modules that supply and check the required input files and generate reports. It is built using snakemake and processes MS data in the following steps:

  1. Prepare workflow file (python script)
  2. Check user-supplied sample sheet (python script)
  3. Fetch protein database from NCBI or use user-supplied fasta file (python, NCBI Datasets)
  4. Generate decoy proteins (DecoyPyrat)
  5. Import raw files, search protein database (fragpipe)
  6. Align feature maps using IonQuant (fragpipe)
  7. Import quantified features, infer and quantify proteins (R MSstats)
  8. Compare different biological conditions, export results (R MSstats)
  9. Generate HTML report with embedded QC plots (R markdown)
  10. Generate PDF report from HTML (weasyprint)
  11. Send out report by email (python script)
  12. Clean up temporary files after workflow execution (bash script)

If you want to contribute, report issues, or suggest features, please get in touch on GitHub.

Installation

Snakemake

Step 1: Install snakemake with conda, mamba, or micromamba (or any other conda flavor). This step creates a new conda environment called snakemake-ms-proteomics, which will be used for all further installations.

conda create -c conda-forge -c bioconda -n snakemake-ms-proteomics snakemake

Step 2: Activate conda environment with snakemake

source /path/to/conda/bin/activate
conda activate snakemake-ms-proteomics

Alternatively, install snakemake using pip:

pip install snakemake

Or install snakemake globally from the Linux package archives (Debian/Ubuntu):

sudo apt install snakemake

Fragpipe

Fragpipe is not available on conda or other package archives. However, to make the workflow as user-friendly as possible, the latest fragpipe release from GitHub (currently v22.0) is automatically installed to the respective conda environment when the workflow is used for the first time. After installation, the GUI (graphical user interface) will pop up and ask you to finish the installation by downloading the missing modules MSFragger, IonQuant, and Philosopher. This step is necessary to abide by license restrictions. From then on, fragpipe will run in headless mode through the command line only.

All other dependencies for the workflow are automatically pulled as conda environments by snakemake.

Running the workflow

Input data

The workflow requires the following input files:

  1. mass spectrometry data, such as Thermo *.raw or *.mzML files
  2. an (organism) database in *.fasta format OR an NCBI RefSeq ID. Decoys (rev_ prefix) will be added if necessary
  3. a sample sheet in tab-separated format (aka manifest file)
  4. a workflow file for fragpipe (see resources dir)

The sample sheet has the following structure with five mandatory columns and no header (example file: test/input/samplesheet/samplesheet.tsv).

  • sample: names/paths to raw files
  • condition: experimental group, treatments
  • replicate: replicate number, consecutively numbered. Repeating numbers (e.g. 1,2,1,2) will be treated as paired samples!
  • type: the type of MS data, will be used to determine the workflow
  • control: reference condition for testing differential abundance
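
| sample   | condition   | replicate | type | control     |
|----------|-------------|-----------|------|-------------|
| sample_1 | condition_1 | 1         | DDA  | condition_1 |
| sample_2 | condition_1 | 2         | DDA  | condition_1 |
| sample_3 | condition_2 | 3         | DDA  | condition_1 |
| sample_4 | condition_2 | 4         | DDA  | condition_1 |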
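
The kind of check a sample sheet needs can be sketched in a few lines of Python. The example below is a minimal illustration, not the workflow's own check script: the function name `check_samplesheet`, the tolerated optional header row, and the specific rules (five columns, whole-number replicates, control must be one of the listed conditions) are assumptions derived from the column descriptions above.

```python
import csv
import io

COLUMNS = ["sample", "condition", "replicate", "type", "control"]


def check_samplesheet(text):
    """Validate tab-separated sample sheet text and return a list of row dicts.

    Raises ValueError describing the first problem that is found.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    rows = [r for r in rows if any(cell.strip() for cell in r)]
    # Assumption: tolerate an optional header row naming the five columns.
    if rows and [c.strip() for c in rows[0]] == COLUMNS:
        rows = rows[1:]
    if not rows:
        raise ValueError("sample sheet contains no samples")
    records = []
    for i, row in enumerate(rows, start=1):
        if len(row) != len(COLUMNS):
            raise ValueError(f"row {i}: expected {len(COLUMNS)} columns, got {len(row)}")
        rec = dict(zip(COLUMNS, (c.strip() for c in row)))
        try:
            rec["replicate"] = int(rec["replicate"])
        except ValueError:
            raise ValueError(f"row {i}: replicate must be a whole number") from None
        records.append(rec)
    # The control must refer to a condition that actually occurs in the sheet.
    conditions = {r["condition"] for r in records}
    for i, rec in enumerate(records, start=1):
        if rec["control"] not in conditions:
            raise ValueError(f"row {i}: control '{rec['control']}' is not a listed condition")
    return records


sheet = (
    "sample_1\tcondition_1\t1\tDDA\tcondition_1\n"
    "sample_2\tcondition_1\t2\tDDA\tcondition_1\n"
    "sample_3\tcondition_2\t3\tDDA\tcondition_1\n"
    "sample_4\tcondition_2\t4\tDDA\tcondition_1\n"
)
records = check_samplesheet(sheet)
print(len(records), "samples validated")
```

Failing early with a row number makes it easier to fix a malformed sheet before any raw files are processed.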

Execution

To run the workflow from the command line, change the working directory to the workflow repository.

cd /path/to/snakemake-ms-proteomics

Adjust options in the default config file config/config.yml. Before running the entire workflow, you can perform a dry run using:

snakemake --dry-run

To run the complete workflow with test files using conda, execute the following command. Specifying the number of compute cores is mandatory.

snakemake --cores 10 --sdm conda --directory .test

To supply options that override the defaults, run the workflow like this:

snakemake --cores 10 --sdm conda --directory .test \
--configfile 'config/config.yml' \
--config \
samplesheet='my/sample_sheet.tsv'
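
When the workflow is launched from a script rather than typed by hand, the same call can be assembled programmatically. The following is a minimal sketch that uses only the flags shown above; the helper name `build_snakemake_command` is hypothetical, and nothing is executed unless the commented `subprocess.run` line is enabled.

```python
import shlex


def build_snakemake_command(cores, directory=None, configfile=None, overrides=None):
    """Assemble the snakemake call as an argument list (nothing is executed)."""
    cmd = ["snakemake", "--cores", str(cores), "--sdm", "conda"]
    if directory:
        cmd += ["--directory", directory]
    if configfile:
        cmd += ["--configfile", configfile]
    if overrides:
        # Each override is passed as key=value after a single --config flag.
        cmd.append("--config")
        cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


cmd = build_snakemake_command(
    cores=10,
    directory=".test",
    configfile="config/config.yml",
    overrides={"samplesheet": "my/sample_sheet.tsv"},
)
print(shlex.join(cmd))
# To actually start the run:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when paths contain spaces.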

Parameters

This table lists all global parameters to the workflow.

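| parameter   | type                  | details             | example                                              |
|-------------|-----------------------|---------------------|------------------------------------------------------|
| samplesheet | *.tsv                 | tab-separated file  | test/input/config/samplesheet.tsv                    |
| database    | *.fasta OR refseq ID  | plain text          | test/input/database/database.fasta, GCF_000009045.1  |
| workflow    | *.workflow OR string  | a fragpipe workflow | workflows/LFQ-MBR.workflow, from_samplesheet         |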
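
Because the database parameter accepts either a file path or an accession, some code has to tell the two apart. The sketch below shows one way this could be done; it is not taken from the workflow. The helper `classify_database` is hypothetical, and the pattern assumes RefSeq assembly accessions of the form GCF_ followed by nine digits and a version number, as in the example GCF_000009045.1.

```python
import re

# RefSeq assembly accession: "GCF_", nine digits, a dot, and a version number.
REFSEQ_ASSEMBLY = re.compile(r"^GCF_\d{9}\.\d+$")


def classify_database(value):
    """Return 'refseq' for an assembly accession or 'fasta' for a *.fasta path."""
    value = value.strip()
    if REFSEQ_ASSEMBLY.match(value):
        return "refseq"
    if value.lower().endswith(".fasta"):
        return "fasta"
    raise ValueError(f"'{value}' is neither a *.fasta file nor a RefSeq ID")


for candidate in ["test/input/database/database.fasta", "GCF_000009045.1"]:
    print(candidate, "->", classify_database(candidate))
```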

This table lists all module-specific parameters and their default values, as included in the config.yml file.

| module     | parameter      | default                          | details                                                          |
|------------|----------------|----------------------------------|------------------------------------------------------------------|
| decoypyrat | cleavage_sites | KR                               | amino acid residues used for decoy peptide generation            |
|            | decoy_prefix   | rev                              | decoy prefix prepended to protein names                          |
| fragpipe   | target_dir     | share                            | default path in conda env to store fragpipe                      |
|            | executable     | fragpipe/bin/fragpipe            | path to fragpipe executable                                      |
|            | download       | FragPipe-22.0 (see config)       | download link to Fragpipe GitHub repo                            |
| msstats    | logTrans       | 2                                | base for log fold change transformation                          |
|            | normalization  | equalizeMedians                  | normalization strategy for feature intensity, see MSstats manual |
|            | featureSubset  | all                              | which features to use for quantification                         |
|            | summaryMethod  | TMP                              | how to calculate protein intensity from feature intensities      |
|            | MBimpute       | True                             | impute missing values with an accelerated failure time model     |
| report     | html           | True                             | generate HTML report                                             |
|            | pdf            | True                             | generate PDF report                                              |
| email      | send           | False                            | whether reports should be sent out by email                      |
|            | port           | 0                                | default port for email server                                    |
|            | smtp_server    | smtp.example.com                 | smtp server address                                              |
|            | smtp_user      | user                             | smtp server user name                                            |
|            | smtp_pw        | password                         | smtp server user password                                        |
|            | from           | sender@email.com                 | sender's email address                                           |
|            | to             | ["receiver@email.com"]           | receiver's email address(es), a list                             |
|            | subject        | "Results MS proteomics workflow" | subject line for email                                           |
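
To show how the email parameters fit together, the sketch below builds a report message from a dictionary that mirrors the defaults listed above, using Python's standard `email` and `smtplib` modules. It is an illustration under stated assumptions, not the workflow's own email script: the helper `build_report_email`, the placeholder attachment, and the use of STARTTLS are all assumptions. With `send` set to False, as in the defaults, no connection is opened.

```python
import smtplib
from email.message import EmailMessage

# Mirrors the email defaults from the table above.
email_config = {
    "send": False,
    "port": 0,
    "smtp_server": "smtp.example.com",
    "smtp_user": "user",
    "smtp_pw": "password",
    "from": "sender@email.com",
    "to": ["receiver@email.com"],
    "subject": "Results MS proteomics workflow",
}


def build_report_email(cfg, body, attachments=None):
    """Build an EmailMessage; attachments maps file name to bytes content."""
    msg = EmailMessage()
    msg["From"] = cfg["from"]
    msg["To"] = ", ".join(cfg["to"])
    msg["Subject"] = cfg["subject"]
    msg.set_content(body)
    for name, data in (attachments or {}).items():
        msg.add_attachment(
            data, maintype="application", subtype="octet-stream", filename=name
        )
    return msg


# Placeholder bytes stand in for a real PDF report.
msg = build_report_email(
    email_config, "The workflow finished.", {"report.pdf": b"placeholder"}
)

if email_config["send"]:
    # Assumption: the server supports STARTTLS and password login.
    with smtplib.SMTP(email_config["smtp_server"], email_config["port"]) as server:
        server.starttls()
        server.login(email_config["smtp_user"], email_config["smtp_pw"])
        server.send_message(msg)
```

Keeping message construction separate from sending makes it possible to inspect the message without access to a mail server.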