Submit raw read data to ENA (European Nucleotide Archive)¶
Description¶
This small notebook shows examples of how to submit raw read data to the European Nucleotide Archive (ENA). The ENA is a database of nucleotide sequences and associated metadata, which is part of the International Nucleotide Sequence Database Collaboration (INSDC) along with GenBank and the DNA Data Bank of Japan (DDBJ). ENA offers several submission routes and accepts many different data types. This example covers only fastq.gz read data and two types of submission:
- Webin-CLI: This is a command-line tool provided by ENA for submitting data. It allows you to submit more complex samples such as multi-fastq RNA-Seq.
- Webin Online Submission: The web-based interface for submitting data and metadata. This is more suitable for simple samples such as single- and paired-end RNA-Seq or Amplicon-Seq.
ENA Documentation: https://ena-docs.readthedocs.io/
Requirements¶
- no requirements except for a recent python version and java
- all of this should be available on any recent Linux distribution
Meta data required for upload¶
Study¶
- the study is the top-level entry for a project, and it can contain multiple samples and experiments
- it needs to be registered before uploading any data, and it needs to be referenced in the sample and experiment metadata
- to register a study, go to https://www.ebi.ac.uk/ena/submit/webin/ and log in with your ENA account, then click on 'Register study'
- it is mandatory to have a study/project title and description (abstract)
- upon registration, we obtain a primary accession (e.g. PRJEB012345), and a secondary accession (e.g. ERP123456)
Samples¶
- separate samples should be registered for each real-life sample (Webin: 'Register samples' button)
- register one sample for each biological replicate
- technical replicates which used the same real-world sample should be referenced using the same sample name
- template TSV table can be downloaded from ENA Webin
Runs / experiments¶
- how to annotate runs: https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html
- Runs / experiments annotate sequencing data files
- One experiment can have multiple runs (e.g. technical replicates or different lanes)
- template TSV table can be downloaded from ENA Webin
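Since every run must reference a registered sample, a quick local cross-check of the two tables before upload can catch typos early. The following is a minimal sketch, not part of ENA's tooling: the column names `sample_alias` (samples table) and `sample` (run table) are assumptions and should be adapted to the headers of the templates you downloaded.

```python
from collections import Counter


def unknown_samples(sample_rows, run_rows, sample_key="sample_alias", run_key="sample"):
    """Return sample names used in the run table that are missing from the samples table."""
    known = {row[sample_key] for row in sample_rows}
    referenced = {row[run_key] for row in run_rows}
    return sorted(referenced - known)


def runs_per_sample(run_rows, run_key="sample"):
    """Count runs per sample; technical replicates show up as counts above 1."""
    return Counter(row[run_key] for row in run_rows)


# Rows are plain dicts, e.g. as produced by csv.DictReader(handle, delimiter="\t")
samples = [{"sample_alias": "s1"}, {"sample_alias": "s2"}]
runs = [{"sample": "s1"}, {"sample": "s1"}, {"sample": "s3"}]
print(unknown_samples(samples, runs))
print(runs_per_sample(runs))
```

Note that the downloaded samples template may contain extra header lines, which would need to be skipped before the rows are parsed.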
Submission process¶
The submission process consists of several steps, and their order matters:
- step 1: register study
- step 2: prepare and upload samples TSV table
- step 3: submit .fastq.gz files using:
  - (1) Webin Online Submission
  - (2) Webin CLI tool
Option 1: Separate FTP fastq.gz and metadata upload¶
- this upload is ideal for standard experiments like Amplicon-seq or RNA-Seq with single or paired-end reads
- first, we need to create MD5 checksum files for each fastq.gz file
# -p avoids an error if the md5 directory already exists
mkdir -p md5
for f in *.fastq.gz; do
  md5sum "$f" > "md5/${f}.md5"
done
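For systems without md5sum, the same checksum files can be produced with Python's standard library. This is a minimal sketch under the assumption that the output should mirror the shell loop above (one `<name>.md5` file per `fastq.gz` file, in md5sum's `<digest>  <file name>` format); the function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_files(data_dir=".", out_dir="md5"):
    """Write one '<name>.md5' file for each *.fastq.gz file found in data_dir."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    written = []
    for fastq in sorted(Path(data_dir).glob("*.fastq.gz")):
        target = out / f"{fastq.name}.md5"
        # md5sum writes '<digest>  <file name>' with two spaces in between
        target.write_text(f"{md5_of_file(fastq)}  {fastq.name}\n")
        written.append(target)
    return written
```

Calling `write_md5_files()` in the directory that holds the reads creates the `md5` directory next to them.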
- in a terminal, connect to ENA Webin FTP service
lftp -d -u Webin-<user-id> ftp://webin2.ebi.ac.uk
- when encountering problems, try to activate passive mode
set ftp:passive-mode on
- upload files
mput *.fastq.gz
mput md5/*.md5
- to remove erroneously uploaded files matching a file pattern (fastq.gz), use
mrm *.fastq.gz
- disconnect with bye
- upload the prepared run / experiment TSV table on Webin ENA dashboard
- check if all runs/samples are connected correctly, done!
Option 2: Combined fastq and metadata upload using Webin CLI tool¶
- this option is ideal for more complex experiments with multiple fastq-files per sample, e.g. paired-end reads with UMIs
- the Webin CLI tool will upload fastq.gz files and metadata simultaneously, so no need for manual FTP or run table upload
- instead, a corresponding JSON file needs to be prepared for each sample/experiment, which contains the metadata and paths to the fastq.gz files
- on the server that stores the data, cd into the relevant submission dir
- download the Webin CLI tool from GitHub
cd /<ena-submission-dir>
wget https://github.com/enasequence/webin-cli/releases/download/9.0.3/webin-cli-9.0.3.jar
- take a look at the JSON template from ENA: https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#json-manifest-file-format
- it should look similar to this:
{
  "study": "<must-match-study-ID>",
  "sample": "<must-match-sample-submission-ID>",
  "name": "<human-readable-sample>",
  "platform": "ILLUMINA",
  "instrument": "Illumina NovaSeq 6000",
  "insert_size": "200",
  "libraryName": "<human-readable-sample>",
  "library_source": "TRANSCRIPTOMIC",
  "library_selection": "cDNA",
  "libraryStrategy": "RNA-Seq",
  "fastq": [
    {
      "value": "my_sample_R1.fastq.gz",
      "attributes": {
        "read_type": "paired"
      }
    },
    {
      "value": "my_sample_R2.fastq.gz",
      "attributes": {
        "read_type": "paired"
      }
    },
    {
      "value": "my_sample_R3.fastq.gz",
      "attributes": {
        "read_type": "umi_barcode"
      }
    }
  ]
}
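Before running the Webin CLI tool, it can be useful to confirm that every file named in a manifest actually exists on disk. The sketch below is an illustration only, not part of the Webin CLI tool: it assumes the manifest layout shown above and that relative paths are resolved against a directory you pass in as `input_dir` (a hypothetical parameter name).

```python
import json
from pathlib import Path


def missing_fastq_files(manifest_path, input_dir="."):
    """Return the fastq paths named in a JSON manifest that do not exist on disk."""
    manifest = json.loads(Path(manifest_path).read_text())
    missing = []
    for entry in manifest.get("fastq", []):
        path = Path(entry["value"])
        # Assumption: relative paths are resolved against input_dir
        if not path.is_absolute():
            path = Path(input_dir) / path
        if not path.is_file():
            missing.append(entry["value"])
    return missing
```

An empty list means that all listed files were found.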
- create one JSON file per sample from the template using the python script prepare_json.py
- it is filled with metadata information from a standard TSV run/experiment table that can be prepared beforehand
- --raw_prefix is an optional path to the directory that contains the fastq.gz files (if not contained in TSV file)
python prepare_json.py \
--tsv experiment.tsv \
--raw_prefix <path-to-raw-data> \
--out_prefix json
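The script prepare_json.py itself is not reproduced here. As a rough idea of what such a conversion might involve, the following sketch turns one TSV row into a manifest dictionary and writes one JSON file per row. Everything in it is an assumption: the column names (including the comma-separated `fastq_files` and `read_types` columns) and the function names are hypothetical, and the manifest keys are modeled on the ENA template.

```python
import csv
import json
from pathlib import Path

# Hypothetical TSV columns that are copied into the manifest unchanged
MANIFEST_FIELDS = [
    "study", "sample", "name", "platform", "instrument", "insert_size",
    "libraryName", "library_source", "library_selection", "libraryStrategy",
]


def row_to_manifest(row, raw_prefix=""):
    """Build a manifest dict from one TSV row (a dict of column name to value)."""
    manifest = {field: row[field] for field in MANIFEST_FIELDS}
    files = row["fastq_files"].split(",")
    read_types = row["read_types"].split(",")
    manifest["fastq"] = [
        {
            # Prepend the optional directory that holds the raw data
            "value": str(Path(raw_prefix) / name.strip()),
            "attributes": {"read_type": read_type.strip()},
        }
        for name, read_type in zip(files, read_types)
    ]
    return manifest


def write_manifests(tsv_path, out_dir, raw_prefix=""):
    """Write one '<name>.json' manifest per row of the TSV table."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            target = out / f"{row['name']}.json"
            target.write_text(json.dumps(row_to_manifest(row, raw_prefix), indent=2))
```

Keeping the row-to-dict step separate from file writing makes it easy to inspect a single manifest before generating all of them.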
- finally, validate the JSON file using the Webin CLI tool's -validate option
- submit using the -submit option, which will upload fastq.gz and metadata to ENA
- note: XML files are created in the process that represent the actual data submitted to ENA
# loop over the manifests directly instead of parsing the output of ls
for f in ./json/*.json; do
  java -jar webin-cli-9.0.3.jar \
    -context reads \
    -username Webin-<user-id> \
    -password <password> \
    -manifest "$f" \
    -validate  # replace with -submit once validation passes
done
- check if all runs/samples are connected correctly, done!
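The validate-then-submit loop can also be driven from Python, which makes it easier to collect the manifests that failed. This is a minimal sketch that only uses the command-line options shown above; the function names and the default jar file name are assumptions, and the Webin CLI tool's exit code is assumed to be non-zero on failure.

```python
import subprocess
from pathlib import Path


def webin_command(manifest, user, password, jar="webin-cli-9.0.3.jar", submit=False):
    """Assemble the argument list for one Webin CLI call in the 'reads' context."""
    return [
        "java", "-jar", jar,
        "-context", "reads",
        "-username", user,
        "-password", password,
        "-manifest", str(manifest),
        "-submit" if submit else "-validate",
    ]


def run_all(manifest_dir, user, password, submit=False):
    """Run the Webin CLI tool on every manifest and return those that failed."""
    failed = []
    for manifest in sorted(Path(manifest_dir).glob("*.json")):
        result = subprocess.run(webin_command(manifest, user, password, submit=submit))
        # Assumption: a non-zero exit code signals a failed validation or submission
        if result.returncode != 0:
            failed.append(manifest)
    return failed
```

A typical pattern would be to call `run_all("json", user, password)` first and to repeat the call with `submit=True` only once the returned list is empty.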