MPUSP/snakemake-bacterial-rnaseq-processing
A Snakemake workflow for the processing of short read rnaseq data in bacteria.
Overview
Last update: 2026-05-12
Latest release: v2.0.0
Topics: bioinformatics bioinformatics-pipeline computational-biology conda rnaseq-pipeline snakemake workflow rnaseq biosciences snakemake-workflow
Configuration
The following configuration details are extracted from the config's README file.
Workflow overview
This workflow can be used in combination with subsequent workflows for follow-up analyses. For example, differential expression analysis can be performed using snakemake-bacterial-rnaseq-deseq.
This workflow is a best-practice workflow for the processing of short-read sequencing data in bacteria. The workflow is built using Snakemake and consists of the following steps:
- Obtain genome database in fasta and gff format (python, NCBI Datasets)
  - Using automatic download from NCBI with a RefSeq ID
  - Using user-supplied files
- Check quality of input sequencing data (FastQC)
- Cut adapters and filter by length and/or sequencing quality score (fastp)
- Identify unique molecular identifiers (UMIs, UMI-tools)
- Map reads to the reference genome (STAR aligner)
- Sort and index aligned RNA-Seq data (Samtools)
- Deduplicate reads by unique molecular identifier (UMI, UMI-tools)
- Generate CPM-normalized coverage files (deepTools)
- Quantify biotype features (featureCounts)
- Generate summary report for all processing steps (MultiQC)
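As a rough illustration of the CPM (counts per million) normalization mentioned in the coverage step, the sketch below scales per-bin read counts by the total number of mapped reads in millions. The function name `cpm_normalize` and the example numbers are hypothetical. In the workflow itself this step is handled by deepTools, not by this code.

```python
def cpm_normalize(bin_counts, total_mapped_reads):
    """Scale raw per-bin read counts to counts per million mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    scale = 1_000_000 / total_mapped_reads
    return [count * scale for count in bin_counts]


# Hypothetical example: three bins from a library with 2 million mapped reads
print(cpm_normalize([10, 0, 250], 2_000_000))  # [5.0, 0.0, 125.0]
```

Because every bin is divided by the same library size, coverage tracks from samples with different sequencing depth become directly comparable.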
Running the workflow
Input
Reference genome
An NCBI RefSeq ID, e.g. GCF_000006785.2. Find your genome assembly and corresponding ID on NCBI genomes. Alternatively, use a custom pair of *.fasta and *.gff files that describe the genome of choice.
Important requirements when using custom *.fasta and *.gff files:
- *.gff genome annotation must have the same chromosome/region name as the *.fasta file (example: NC_002737.2)
- *.gff genome annotation must have gene and CDS type annotation that is automatically parsed to extract transcripts
- all chromosomes/regions in the *.gff genome annotation must be present in the *.fasta sequence
- but not all sequences in the *.fasta file need to have annotated genes in the *.gff file
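The requirements above can be checked before starting a run. The following is a minimal sketch, not part of the workflow: the function names (`fasta_ids`, `gff_summary`, `check_genome_pair`) and the toy inputs are hypothetical, and it assumes standard FASTA headers and tab-separated, 9-column GFF feature lines.

```python
def fasta_ids(fasta_text):
    """Return the set of sequence IDs (first word of each '>' header line)."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gff_summary(gff_text):
    """Return (sequence IDs, feature types) found in GFF feature lines."""
    seqids, types = set(), set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete 9-column feature line
        seqids.add(cols[0])
        types.add(cols[2])
    return seqids, types


def check_genome_pair(fasta_text, gff_text):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    seqids, types = gff_summary(gff_text)
    missing = seqids - fasta_ids(fasta_text)
    if missing:
        problems.append(f"GFF regions missing from FASTA: {sorted(missing)}")
    for required in ("gene", "CDS"):
        if required not in types:
            problems.append(f"GFF has no '{required}' features")
    return problems


# Toy data (hypothetical): two sequences, only one of them annotated
fasta = ">NC_002737.2 toy sequence\nACGT\n>plasmid1\nACGT\n"
gff = (
    "##gff-version 3\n"
    "NC_002737.2\tRefSeq\tgene\t1\t4\t.\t+\t.\tID=gene0\n"
    "NC_002737.2\tRefSeq\tCDS\t1\t4\t.\t+\t0\tID=cds0;Parent=gene0\n"
)
print(check_genome_pair(fasta, gff))  # []
```

The check is deliberately one-directional: sequences that appear only in the FASTA file (here `plasmid1`) are accepted, matching the last requirement in the list.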
Read data
RNA sequencing data in *.fastq.gz format. The currently supported input data are second generation reads. Input data files are supplied via a mandatory table, whose location is indicated in the config.yml file (default: samples.tsv). The sample sheet has the following layout:
| sample | condition | replicate | read1 | read2 | readumi |
|---|---|---|---|---|---|
| RNA-1 | RNA | 1 | RNA-1_R1.fastq.gz | RNA-1_R2.fastq.gz | - |
| RNA-2 | RNA | 2 | RNA-2_R1.fastq.gz | RNA-2_R2.fastq.gz | - |
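A quick sanity check of the sample sheet can catch layout mistakes early. The sketch below is an illustration only: it assumes the sheet is tab-separated (as the `.tsv` extension suggests), the function name `read_sample_sheet` is hypothetical, and the uniqueness check on sample names is an added assumption rather than a documented rule of the workflow.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "condition", "replicate", "read1", "read2", "readumi"]


def read_sample_sheet(handle):
    """Parse a tab-separated sample sheet and run basic sanity checks."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    names = [row["sample"] for row in rows]
    if len(names) != len(set(names)):
        # Assumption: each sample name identifies exactly one row
        raise ValueError("sample names must be unique")
    return rows


# In-memory example; for a real file use: open("samples.tsv", newline="")
sheet = io.StringIO(
    "sample\tcondition\treplicate\tread1\tread2\treadumi\n"
    "RNA-1\tRNA\t1\tRNA-1_R1.fastq.gz\tRNA-1_R2.fastq.gz\t-\n"
    "RNA-2\tRNA\t2\tRNA-2_R1.fastq.gz\tRNA-2_R2.fastq.gz\t-\n"
)
rows = read_sample_sheet(sheet)
print([row["sample"] for row in rows])  # ['RNA-1', 'RNA-2']
```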
Some configuration parameters of the pipeline may be specific for your data and library preparation protocol. The options should be adjusted in the config.yml file.
Configuration files for different sequencing protocols can be found in resources/protocols/.
Currently available protocols include rnaseq_nextflex, rnaseq_neb_umi, and the custom protocol rnaseq_mpusp_custom.
To run the workflow with the respective test data for the different protocols, use the following commands:
snakemake --sdm conda --cores 12 --directory .test --configfile resources/protocols/rnaseq_mpusp_custom.yml
snakemake --sdm conda --cores 12 --directory .test --configfile resources/protocols/rnaseq_neb_umi.yml
snakemake --sdm conda --cores 12 --directory .test --configfile resources/protocols/rnaseq_nextflex.yml
Output
| Output File/Folder | Description |
|---|---|
| results/genome/ | Downloaded or user-supplied reference genome and annotation files. |
| results/fastp/ | Adapter-trimmed and quality-filtered FASTQ files. |
| results/mapped/ | Aligned reads in BAM format, coverage in BigWig format. |
| results/deduplicated/ | Aligned and UMI-deduplicated reads in BAM format, coverage in BigWig format. |
| results/qc/ | Quality control reports for raw and processed reads (FastQC HTML files). |
| results/quantify_biotypes/ | Gene/feature count tables (tab-delimited text files). |
| results/multiqc/ | MultiQC report aggregating QC metrics from all steps. |
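After a run, the folders listed in the table can be checked with a few lines of Python. This is a minimal sketch with a hypothetical helper name (`missing_outputs`). Whether every folder is produced for every protocol is not stated above, so a missing folder is better treated as a hint to investigate than as a definite error.

```python
from pathlib import Path

# Folder names taken from the output table above
EXPECTED_DIRS = [
    "genome",
    "fastp",
    "mapped",
    "deduplicated",
    "qc",
    "quantify_biotypes",
    "multiqc",
]


def missing_outputs(results_dir):
    """Return the expected sub-folders that do not exist under results_dir."""
    root = Path(results_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Example: report anything missing below ./results
print(missing_outputs("results"))
```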