Submit raw read data to ENA (European Nucleotide Archive)¶

Description¶

This small notebook shows examples of how to submit raw read data to the European Nucleotide Archive (ENA). The ENA is a database of nucleotide sequences and associated metadata, and is part of the International Nucleotide Sequence Database Collaboration (INSDC) along with GenBank and the DNA Data Bank of Japan (DDBJ). ENA offers several submission routes and accepts many data types. This example covers only fastq.gz read data and two types of submission:

Webin CLI: A command-line tool provided by ENA for submitting data. It handles more complex samples, such as RNA-Seq with multiple fastq files per sample.
Webin Online Submission: The web-based interface for submitting (meta)data. It is more suitable for simple samples such as single- and paired-end RNA-Seq or Amplicon-Seq.

ENA Documentation: https://ena-docs.readthedocs.io/

Requirements¶

no requirements except for a recent Python version and Java (needed to run the Webin CLI .jar file)
option 1 additionally uses md5sum and lftp
all of this should be available on any recent Linux distribution

Meta data required for upload¶

Study¶

the study is the top-level entry for a project, and it can contain multiple samples and experiments
it needs to be registered before uploading any data, and it needs to be referenced in the sample and experiment metadata
to register a study, go to https://www.ebi.ac.uk/ena/submit/webin/ and log in with your ENA account, then click on 'Register study'
it is mandatory to have a study/project title and description (abstract)
upon registration, we obtain a primary accession (e.g. PRJEB012345), and a secondary accession (e.g. ERP123456)
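
As a small sanity check before pasting accessions into sample or experiment tables, the two accession types can be told apart by their prefix. The sketch below assumes the accession patterns given in the ENA documentation (project accessions starting with PRJ, study accessions with ERP/DRP/SRP); the function name `classify_study_accession` is made up for this example.

```python
import re

# Patterns are an assumption based on the accession formats listed in the
# ENA documentation; adjust them if ENA changes its formats.
PROJECT_RE = re.compile(r"^PRJ(E|D|N)[A-Z][0-9]+$")
STUDY_RE = re.compile(r"^(E|D|S)RP[0-9]{6,}$")


def classify_study_accession(accession: str) -> str:
    """Return 'primary', 'secondary', or 'unknown' for a study accession."""
    if PROJECT_RE.match(accession):
        return "primary"
    if STUDY_RE.match(accession):
        return "secondary"
    return "unknown"


if __name__ == "__main__":
    for acc in ("PRJEB012345", "ERP123456", "my-study"):
        print(acc, classify_study_accession(acc))
```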

Samples¶

separate samples should be registered for each real-life sample (Webin: 'Register samples' button)
register one sample for each biological replicate
technical replicates which used the same real-world sample should be referenced using the same sample name
template TSV table can be downloaded from ENA Webin

Runs / experiments¶

how to annotate runs: https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html
Runs / experiments annotate sequencing data files
One experiment can have multiple runs (e.g. technical replicates or different lanes)
template TSV table can be downloaded from ENA Webin

Submission process¶

The submission process follows several steps, and their order matters: samples and runs can only reference a study that has already been registered.

step 1: register study
step 2: prepare and upload samples TSV table
step 3: submit .fastq.gz files using:
- (1) Webin Online Submission
- (2) Webin CLI tool

Option 1: Separate FTP `fastq.gz` and metadata upload¶

this upload is ideal for standard experiments like Amplicon-seq or RNA-Seq with single or paired-end reads
first, we need to create MD5 checksum files for each fastq.gz file

mkdir -p md5
for f in *.fastq.gz; do
  md5sum "$f" > "md5/${f}.md5"
done
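
On systems without md5sum, the same checksum files can be produced from Python. This is a minimal sketch using only the standard library; the function names and the default chunk size are choices made for this example, and the output mimics the "<checksum>  <file name>" layout that md5sum prints.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_files(data_dir: Path, out_dir: Path) -> list[Path]:
    """Write one .md5 file per *.fastq.gz file found in data_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for fastq in sorted(data_dir.glob("*.fastq.gz")):
        target = out_dir / f"{fastq.name}.md5"
        # Same layout as md5sum output: checksum, two spaces, file name
        target.write_text(f"{md5_of_file(fastq)}  {fastq.name}\n")
        written.append(target)
    return written


if __name__ == "__main__":
    for md5_file in write_md5_files(Path("."), Path("md5")):
        print("wrote", md5_file)
```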

in a terminal, connect to ENA Webin FTP service

lftp -d -u Webin-<user-id> ftp://webin2.ebi.ac.uk

when encountering problems, try to activate passive mode

set ftp:passive-mode on

upload files from within the lftp session (mput expands local wildcards itself; shell variables and $(...) are not available inside lftp)

mput *.fastq.gz
mput md5/*.md5

to remove erroneously uploaded files using file pattern (fastq.gz) use

mrm *.fastq.gz

disconnect with bye
upload the prepared run / experiment TSV table on Webin ENA dashboard
check if all runs/samples are connected correctly, done!

Option 2: Combined fastq and metadata upload using Webin CLI tool¶

this option is ideal for more complex experiments with multiple fastq-files per sample, e.g. paired-end reads with UMIs
the Webin CLI tool will upload fastq.gz files and metadata simultaneously, so no need for manual FTP or run table upload
instead, a corresponding JSON file needs to be prepared for each sample/experiment, which contains the metadata and paths to the fastq.gz files
on the server that stores the data, cd into the relevant submission dir
download the Webin CLI tool from Github

cd /<ena-submission-dir>
wget https://github.com/enasequence/webin-cli/releases/download/9.0.3/webin-cli-9.0.3.jar

take a look at the JSON template from ENA: https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#json-manifest-file-format
it should look similar to this:

{
  "study": "<must-match-study-ID>",
  "sample": "<must-match-sample-submission-ID>",
  "name": "<human-readable-sample>",
  "platform": "ILLUMINA",
  "instrument": "Illumina NovaSeq 6000",
  "insert_size": "200",
  "libraryName": "<human-readable-sample>",
  "library_source": "TRANSCRIPTOMIC",
  "library_selection": "cDNA",
  "libraryStrategy": "RNA-Seq",
  "fastq": [
    {
      "value": "my_sample_R1.fastq.gz",
      "attributes": {
        "read_type": "paired"
      }
    },
    {
      "value": "my_sample_R2.fastq.gz",
      "attributes": {
        "read_type": "paired"
      }
    },
    {
      "value": "my_sample_R3.fastq.gz",
      "attributes": {
        "read_type": "umi_barcode"
      }
    }
  ]
}
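
Before handing a manifest to the Webin CLI tool, a quick local check can catch typos early. The sketch below is not part of ENA's tooling: the list of required keys and the rule "either zero or two paired files" are assumptions derived from the template above, and `check_manifest` is a name made up for this example.

```python
import json
from pathlib import Path

# Assumption: these keys must be present and non-empty, as in the template above
REQUIRED_KEYS = ("study", "sample", "name", "platform", "instrument", "fastq")


def check_manifest(manifest: dict, data_dir: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_KEYS:
        if not manifest.get(key):
            problems.append(f"missing field: {key}")
    entries = manifest.get("fastq") or []
    for entry in entries:
        file_name = entry.get("value", "")
        if not file_name.endswith(".fastq.gz"):
            problems.append(f"not a fastq.gz file: {file_name!r}")
        elif not (data_dir / file_name).is_file():
            problems.append(f"file not found: {file_name}")
    n_paired = sum(
        1 for e in entries if e.get("attributes", {}).get("read_type") == "paired"
    )
    if n_paired not in (0, 2):
        problems.append(f"expected 0 or 2 paired files, found {n_paired}")
    return problems


if __name__ == "__main__":
    for path in sorted(Path("json").glob("*.json")):
        issues = check_manifest(json.loads(path.read_text()), Path("."))
        print(path.name, "OK" if not issues else issues)
```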

create one JSON file per sample from the template using the python script prepare_json.py
it is filled with metadata information from a standard TSV run/experiment table that can be prepared beforehand
--raw_prefix is an optional path to the directory that contains the fastq.gz files (only needed if the TSV file lists file names without their directory)

python prepare_json.py \
  --tsv experiment.tsv \
  --raw_prefix <path-to-raw-data> \
  --out_prefix json
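
The script prepare_json.py itself is not shown here. To illustrate the idea, the following is a minimal sketch of a TSV-to-JSON conversion, not the actual script: the column names (`study`, `sample`, `name`, `platform`, `instrument`, `files`, `read_types`) and the semicolon-separated file lists are hypothetical and would need to match your own run/experiment table.

```python
import csv
import json
import os
from pathlib import Path


def row_to_manifest(row: dict, raw_prefix: str = "") -> dict:
    """Turn one TSV row into a manifest dict (hypothetical column names)."""
    files = row["files"].split(";")
    read_types = row["read_types"].split(";")
    if len(files) != len(read_types):
        raise ValueError(f"{row['name']}: files and read_types differ in length")
    return {
        "study": row["study"],
        "sample": row["sample"],
        "name": row["name"],
        "platform": row["platform"],
        "instrument": row["instrument"],
        "fastq": [
            {
                "value": os.path.join(raw_prefix, file_name),
                "attributes": {"read_type": read_type},
            }
            for file_name, read_type in zip(files, read_types)
        ],
    }


def write_manifests(tsv_path: Path, out_dir: Path, raw_prefix: str = "") -> list[Path]:
    """Write one JSON manifest per TSV row and return the created paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with tsv_path.open(newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            target = out_dir / f"{row['name']}.json"
            target.write_text(json.dumps(row_to_manifest(row, raw_prefix), indent=2))
            written.append(target)
    return written
```

Further library fields (insert size, library source, selection, strategy) could be added to the dict in the same way if they are columns in the table.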

finally, validate the JSON file using the Webin CLI tool's -validate option
submit using the -submit option, which will upload fastq.gz and metadata to ENA
note: XML files are created in the process that represent the actual data submitted to ENA

# loop over the manifests directly instead of parsing the output of ls
for f in ./json/*.json; do
  java -jar webin-cli-9.0.3.jar \
    -context reads \
    -username Webin-<user-id> \
    -password <password> \
    -manifest "$f" \
    -validate  # replace with -submit once validation passes
done
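
The same loop can be driven from Python, which makes it easy to validate everything first and submit only afterwards. This is a sketch under a few assumptions: the .jar file sits in the working directory, the manifests are in ./json, and the password is read from an environment variable named `WEBIN_PASSWORD` (a name chosen for this example) so that it does not end up in the notebook or shell history. Only the command-line options already used above are passed.

```python
import os
import subprocess
from pathlib import Path


def build_webin_command(manifest, user, password,
                        jar="webin-cli-9.0.3.jar", submit=False):
    """Assemble the Webin CLI call for one manifest as an argument list."""
    return [
        "java", "-jar", jar,
        "-context", "reads",
        "-username", user,
        "-password", password,
        "-manifest", str(manifest),
        "-submit" if submit else "-validate",
    ]


def run_all(json_dir, user, submit=False):
    """Run the Webin CLI tool on every manifest; stop at the first failure."""
    password = os.environ["WEBIN_PASSWORD"]  # hypothetical variable name
    for manifest in sorted(Path(json_dir).glob("*.json")):
        command = build_webin_command(manifest, user, password, submit=submit)
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Validate first; call run_all(..., submit=True) once this passes.
    run_all("json", "Webin-<user-id>")
```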

check if all runs/samples are connected correctly, done!