Submit raw read data to ENA (European Nucleotide Archive)

Description

This notebook shows how to submit raw read data to the European Nucleotide Archive (ENA). The ENA is a database of nucleotide sequences and associated metadata and is part of the International Nucleotide Sequence Database Collaboration (INSDC), along with GenBank and the DNA Data Bank of Japan (DDBJ). ENA offers several submission routes and accepts many data types. This example covers only fastq.gz read data and two types of submission:

  1. Webin Online Submission: The web-based interface for submitting data and metadata. It is best suited to simple samples such as single- or paired-end RNA-Seq or Amplicon-Seq.
  2. Webin-CLI: A command-line tool provided by ENA for submitting data. It can handle more complex samples, such as RNA-Seq samples with multiple fastq files (e.g. paired-end reads with UMIs).

ENA Documentation: https://ena-docs.readthedocs.io/

Requirements

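  • a recent Python version and Java (Java is needed to run the Webin CLI .jar file)
  • the command-line tools md5sum and lftp, which are used for the FTP upload in Option 1
  • all of this should be available on any recent Linux distribution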

Metadata required for upload

Study

  • the study is the top-level entry for a project, and it can contain multiple samples and experiments
  • it needs to be registered before uploading any data, and it needs to be referenced in the sample and experiment metadata
  • to register a study, go to https://www.ebi.ac.uk/ena/submit/webin/ and log in with your ENA account, then click on 'Register study'
  • it is mandatory to have a study/project title and description (abstract)
  • upon registration, we obtain a primary accession (e.g. PRJEB012345), and a secondary accession (e.g. ERP123456)

Samples

  • separate samples should be registered for each real-life sample (Webin: 'Register samples' button)
  • register one sample for each biological replicate
  • technical replicates derived from the same real-world sample should reference the same sample name
  • a template TSV table can be downloaded from ENA Webin

Runs / experiments

  • how to annotate runs: https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html
  • runs / experiments describe the sequencing data files
  • one experiment can have multiple runs (e.g. technical replicates or different lanes)
  • a template TSV table can be downloaded from ENA Webin

Submission process

The submission process consists of several steps, and the order matters:

  • step 1: register study
  • step 2: prepare and upload samples TSV table
  • step 3: submit .fastq.gz files using:
    • (1) Webin Online Submission
    • (2) Webin CLI tool

Option 1: Separate FTP fastq.gz and metadata upload

  • this upload is ideal for standard experiments like Amplicon-seq or RNA-Seq with single or paired-end reads
  • first, we need to create MD5 checksum files for each fastq.gz file
# write one checksum file per fastq.gz file into the md5/ subdirectory
mkdir -p md5
for f in *.fastq.gz; do
  md5sum "$f" > "md5/${f}.md5"
done
  • in a terminal, connect to the ENA Webin FTP service
lftp -d -u Webin-<user-id> ftp://webin2.ebi.ac.uk
  • if you encounter problems, try activating passive mode
set ftp:passive-mode on
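  • upload the files from within the lftp session; mput expands the wildcards itself, so no shell variables are needed (lftp does not understand shell syntax such as $(ls ...))
mput *.fastq.gz
mput md5/*.md5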
  • to remove erroneously uploaded files by file pattern (here *.fastq.gz), use
mrm *.fastq.gz
  • disconnect with bye
  • upload the prepared run / experiment TSV table on the ENA Webin dashboard
  • check that all runs/samples are linked correctly, done!
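
The checksum files created at the start of this option can be double-checked before the upload. The following is a minimal Python sketch, not part of the original workflow, that recomputes the MD5 digest of every fastq.gz file and compares it with the matching file in md5/. It assumes the layout produced by the md5sum loop above (md5/<file>.md5 with the digest as the first field); the function names md5_of_file and check_md5_files are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5_files(data_dir=".", md5_dir="md5"):
    """Compare each *.fastq.gz in data_dir with <data_dir>/<md5_dir>/<name>.md5.

    Returns a dict mapping file name to "ok", "mismatch" or "missing".
    The directory layout is an assumption based on the shell loop above.
    """
    results = {}
    for fastq in sorted(Path(data_dir).glob("*.fastq.gz")):
        md5_file = Path(data_dir) / md5_dir / (fastq.name + ".md5")
        if not md5_file.exists():
            results[fastq.name] = "missing"
            continue
        # md5sum writes "<digest>  <file name>"; the digest is the first field
        fields = md5_file.read_text().split()
        recorded = fields[0] if fields else ""
        results[fastq.name] = "ok" if recorded == md5_of_file(fastq) else "mismatch"
    return results


if __name__ == "__main__":
    for name, status in check_md5_files().items():
        print(f"{status}\t{name}")
```

Run in the directory that holds the fastq.gz files, the script prints one status line per file; any "mismatch" or "missing" entry is worth resolving before the files are uploaded.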

Option 2: Combined fastq and metadata upload using Webin CLI tool

  • this option is ideal for more complex experiments with multiple fastq-files per sample, e.g. paired-end reads with UMIs
  • the Webin CLI tool uploads the fastq.gz files and the metadata together, so neither a manual FTP upload nor a run table upload is needed
  • instead, a JSON file needs to be prepared for each sample/experiment, which contains the metadata and the paths to the fastq.gz files
  • on the server that stores the data, cd into the relevant submission directory
  • download the Webin CLI tool from GitHub
cd /<ena-submission-dir>
wget https://github.com/enasequence/webin-cli/releases/download/9.0.3/webin-cli-9.0.3.jar
  • take a look at the JSON template from ENA: https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#json-manifest-file-format
  • it should look similar to this (values in angle brackets are placeholders):
{
  "study": "<must-match-study-ID>",
  "sample": "<must-match-sample-submission-ID>",
  "name": "<human-readable-sample>",
  "platform": "ILLUMINA",
  "instrument": "Illumina NovaSeq 6000",
  "insert_size": "200",
  "libraryName": "<human-readable-sample>",
  "library-source": "TRANSCRIPTOMIC",
  "library_selection": "cDNA",
  "libraryStrategy": "RNA-Seq",
  "fastq": [
    {
      "value": "my_sample_R1.fastq.gz",
      "attributes": {
        "read_type": "paired"
      }
    },
    {
      "value": "my_sample_R2.fastq.gz",
      "attributes": {
        "read_type": "paired"
      }
    },
    {
      "value": "my_sample_R3.fastq.gz",
      "attributes": {
        "read_type": "umi_barcode"
      }
    }
  ]
}
  • create one JSON file per sample from the template using the Python script prepare_json.py
  • the script fills the template with metadata from a standard TSV run/experiment table that can be prepared beforehand
  • --raw_prefix is an optional path to the directory that contains the fastq.gz files (needed if the TSV file does not already contain full paths)
python prepare_json.py \
  --tsv experiment.tsv \
  --raw_prefix <path-to-raw-data> \
  --out_prefix json
  • finally, validate each JSON file using the Webin CLI tool's -validate option
  • then submit using the -submit option, which uploads the fastq.gz files and metadata to ENA
  • note: in the process, XML files are created that represent the actual data submitted to ENA
# loop over all manifests; globbing directly avoids parsing the output of ls
for f in ./json/*.json; do
  java -jar webin-cli-9.0.3.jar \
    -context reads \
    -username Webin-<user-id> \
    -password <password> \
    -manifest "$f" \
    -validate  # replace with -submit to upload
done
  • check that all runs/samples are linked correctly, done!
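
The script prepare_json.py is not shown in this notebook. As a rough illustration of what the TSV-to-JSON conversion step might involve, the sketch below turns each row of a TSV table into one JSON manifest. Everything about the table layout is an assumption: the column names (including fastq_files and read_types with semicolon-separated values), the function names and the output file naming are hypothetical and do not describe the real script. The manifest keys are copied verbatim from the template above.

```python
import csv
import json
from pathlib import Path

# Hypothetical TSV column name -> manifest key as spelled in the template above
FIELD_MAP = {
    "study": "study",
    "sample": "sample",
    "name": "name",
    "platform": "platform",
    "instrument": "instrument",
    "insert_size": "insert_size",
    "library_name": "libraryName",
    "library_source": "library-source",
    "library_selection": "library_selection",
    "library_strategy": "libraryStrategy",
}


def row_to_manifest(row, raw_prefix=""):
    """Build one manifest dict from a TSV row (a dict of column -> value)."""
    # copy only the metadata columns that are present and non-empty
    manifest = {key: row[col] for col, key in FIELD_MAP.items() if row.get(col)}
    files = [f.strip() for f in row["fastq_files"].split(";") if f.strip()]
    types = [t.strip() for t in row["read_types"].split(";") if t.strip()]
    if len(files) != len(types):
        raise ValueError(f"{row.get('name')}: {len(files)} files but {len(types)} read types")
    # one entry per fastq file, with the optional directory prefix prepended
    manifest["fastq"] = [
        {"value": str(Path(raw_prefix) / f), "attributes": {"read_type": t}}
        for f, t in zip(files, types)
    ]
    return manifest


def write_manifests(tsv_path, out_dir="json", raw_prefix=""):
    """Write one <name>.json per TSV row and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            target = out / f"{row['name']}.json"
            target.write_text(json.dumps(row_to_manifest(row, raw_prefix), indent=2) + "\n")
            written.append(target)
    return written
```

Under these assumptions, write_manifests("experiment.tsv", "json", "<path-to-raw-data>") would fill the json/ directory that the validation loop above iterates over. Raising an error when the number of files and read types differ catches a malformed row before Webin CLI is ever started.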