
MPUSP/snakemake-assembly-postprocessing

A Snakemake workflow for the post-processing of microbial genome assemblies.

Overview

Testing: GitHub Actions Workflow Status

Last update: 2025-12-10

Latest release: v1.1.0

Topics: apptainer bacteria conda genome-assembly genome-sequencing microbes pipeline postprocessing quality-control snakemake-workflow genomics

Authors: @rabioinf @m-jahn

Configuration

The following configuration details are extracted from the config's README file.


Workflow overview

A Snakemake workflow for the post-processing of microbial genome assemblies.

  1. Parse the samples.csv table containing the samples' metadata (python)
  2. Annotate assemblies using one of the following tools:
    1. NCBI's Prokaryotic Genome Annotation Pipeline (PGAP). Note: PGAP needs to be installed manually
    2. Prokka, a fast and lightweight prokaryotic annotation tool
    3. Bakta, a fast, alignment-free annotation tool. Note: Bakta will automatically download its companion database from Zenodo (light: 1.5 GB, full: 40 GB)
  3. Create a QC report for the assemblies using QUAST
  4. Create a pangenome analysis (orthologs/homologs) using Panaroo

Running the workflow

Input data

This workflow requires genome assemblies in FASTA format as input. The sample sheet (a CSV table) has the following layout:

| sample | species | strain | id_prefix | file |
| --- | --- | --- | --- | --- |
| EC2224 | "Streptococcus pyogenes" | SF370 | SPY | assembly.fasta |
| ... | ... | ... | ... | ... |

Note: Pangenome analysis with Panaroo requires at least two samples.
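
The sample sheet layout and the two-sample requirement can be checked before starting a run. The following is a minimal sketch, not the workflow's own parsing code: the helper names `read_samplesheet` and `can_run_pangenome` are hypothetical, and the second row of the example sheet is invented for illustration.

```python
import csv
import io

# Column names taken from the sample sheet layout described above.
REQUIRED_COLUMNS = ["sample", "species", "strain", "id_prefix", "file"]


def read_samplesheet(handle):
    """Read a CSV sample sheet and return one dict per sample row."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"sample sheet is missing columns: {missing}")
    return list(reader)


def can_run_pangenome(samples):
    """Panaroo requires at least two samples."""
    return len(samples) >= 2


# Hypothetical two-row sheet; the second row is made up for this example.
example = io.StringIO(
    "sample,species,strain,id_prefix,file\n"
    'EC2224,"Streptococcus pyogenes",SF370,SPY,assembly.fasta\n'
    'EC2225,"Streptococcus pyogenes",SF370,SPY,assembly_2.fasta\n'
)
samples = read_samplesheet(example)
```

With a real sheet, the same function can be called on an open file handle, for example `read_samplesheet(open("samples.csv", newline=""))`.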

Parameters

This table lists all parameters that can be used to run the workflow.

| Parameter | Type | Details | Default |
| --- | --- | --- | --- |
| samplesheet | string | Path to the sample sheet file in CSV format | |
| tool | array[string] | Annotation tool to use (one of prokka, pgap, bakta) | |
| pgap | | PGAP configuration object | |
| bin | string | Path to the PGAP script | |
| use_yaml_config | boolean | Whether to use YAML configuration for PGAP | False |
| prepare_yaml_files | | Paths to YAML templates for PGAP | |
| generic | string | Path to the generic YAML configuration file | |
| submol | string | Path to the submol YAML configuration file | |
| prokka | | Prokka configuration object | |
| center | string | Center name for Prokka annotation (used in sequence IDs) | |
| extra | string | Extra command-line arguments for Prokka | --addgenes |
| bakta | | Bakta configuration object | |
| download_db | string | Bakta database type (full, light, or none) | light |
| existing_db | string | Path to an existing Bakta database (optional). Needs to be combined with download_db='none' | |
| extra | string | Extra command-line arguments for Bakta | --keep-contig-headers --compliant |
| quast | | QUAST configuration object | |
| reference_fasta | string | Path to the reference genome for QUAST | |
| reference_gff | string | Path to the reference annotation for QUAST | |
| extra | string | Extra command-line arguments for QUAST | |
| panaroo | | Panaroo configuration object | |
| remove_source | string | Source types to remove in Panaroo (regex supported) | cmsearch |
| remove_feature | string | Feature types to remove in Panaroo (regex supported) | tRNA\|rRNA\|ncRNA\|exon\|sequence_feature |
| extra | string | Extra command-line arguments for Panaroo | --clean-mode strict --remove-invalid-genes |
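
Two constraints in the parameter list are easy to get wrong: `tool` only accepts prokka, pgap, or bakta, and `existing_db` needs `download_db` set to `none`. The sketch below checks both on a configuration loaded as a Python dict. It is an illustration only and not part of the workflow: the function `check_config` is hypothetical, the nesting of `download_db` and `existing_db` under `bakta` is assumed from the order of the parameter list, and the paths in the example are placeholders.

```python
# Allowed values as listed in the parameter descriptions.
ANNOTATION_TOOLS = {"prokka", "pgap", "bakta"}
BAKTA_DB_TYPES = {"full", "light", "none"}


def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []

    # 'tool' is an array of strings; every entry must be a known tool.
    tools = config.get("tool", [])
    if not tools:
        problems.append("tool: no annotation tool selected")
    for tool in tools:
        if tool not in ANNOTATION_TOOLS:
            problems.append(f"tool: unknown annotation tool '{tool}'")

    # Assumption: Bakta options are nested under a 'bakta' key.
    bakta = config.get("bakta", {})
    download_db = bakta.get("download_db", "light")  # documented default
    if download_db not in BAKTA_DB_TYPES:
        problems.append(f"bakta.download_db: unknown database type '{download_db}'")
    if bakta.get("existing_db") and download_db != "none":
        problems.append("bakta.existing_db: requires download_db='none'")

    return problems


# Hypothetical configuration; the paths are placeholders.
example_config = {
    "samplesheet": "config/samples.csv",
    "tool": ["bakta"],
    "bakta": {"download_db": "light", "existing_db": "/path/to/bakta_db"},
}
problems = check_config(example_config)
```

For this example, `problems` holds a single entry about `existing_db`, because the existing database is combined with `download_db` set to `light` rather than `none`.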