On downloading genome assemblies from NCBI GenBank and RefSeq

2023-08-26

Recently, I wanted to download all fungal genome assemblies from NCBI GenBank and RefSeq to build a Kraken database. I used my download-refseq-genomes program to download the assemblies from RefSeq, which worked just fine:

download-refseq-genomes.pl -t fna -r -g -a 4751

(The NCBI taxon ID for fungi is 4751)

Similarly, for GenBank I wanted to use download-genbank-genomes to download the assemblies marked as representative or reference assemblies, but that would again include assemblies that I had already downloaded from RefSeq. So the problem was to keep only those GenBank assemblies that are not also contained in RefSeq.

Luckily, the assembly_summary_genbank.txt file already has a column with the corresponding RefSeq accession number if it exists, so I extended download-genbank-genomes.pl with an option (-g) to skip assemblies that have a RefSeq accession:

download-genbank-genomes.pl -t fna -r -g -a 4751
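
To illustrate the idea behind this filter, here is a minimal sketch in plain awk. It is not the code used in download-genbank-genomes.pl. It assumes that the header line of assembly_summary_genbank.txt names the paired-accession column gbrs_paired_asm and that assemblies without a RefSeq counterpart carry the placeholder "na" there. The function name genbank_without_refseq is made up for this example.

```shell
# Print GenBank accessions that have no paired RefSeq assembly.
# Assumptions: the column is called "gbrs_paired_asm" in the header line,
# and unpaired assemblies have "na" (or an empty field) in that column.
genbank_without_refseq() {
  awk -F '\t' '
    # Comment lines start with "#"; one of them holds the column names.
    /^#/ {
      if ($0 ~ /assembly_accession/) {
        for (i = 1; i <= NF; i++) if ($i == "gbrs_paired_asm") col = i
      }
      next
    }
    # Data lines: print column 1 if the paired-accession column is unset.
    col && ($col == "na" || $col == "") { print $1 }
  ' "$@"
}
```

It could be called as `genbank_without_refseq assembly_summary_genbank.txt`. Looking the column up by name avoids hard-coding a column number that may differ between versions of the file.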

There is also a solution to this problem using the datasets and dataformat tools from NCBI:

datasets summary genome taxon 4751 --as-json-lines --assembly-source genbank | \
dataformat tsv genome --fields accession,assminfo-paired-assm-accession

which prints this two-column TSV table:

Assembly Accession  Assembly Paired Assembly Accession
[...]
GCA_030573135.1
GCA_030561325.1
GCA_030574775.1
GCA_001049995.1
GCA_001189475.1     GCF_001189475.1
GCA_002759435.2
GCA_002775015.1     GCF_002775015.1
GCA_003013715.2     GCF_003013715.1
GCA_003014415.1
GCA_004287075.1
GCA_005234155.1
GCA_007168705.1
GCA_008275145.1
GCA_008729165.1
[...]

Now we can filter the output to keep only the lines without a RefSeq accession in the second column and use xargs to pass the remaining accessions to datasets for downloading:

datasets summary genome taxon 4751 --as-json-lines --assembly-source genbank |  \
dataformat tsv genome --elide-header --fields accession,assminfo-paired-assm-accession | \
grep -v GCF_ | \
xargs datasets download genome accession
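
The grep above matches GCF_ anywhere in the line. A slightly stricter variant, sketched here with awk, checks the second tab-separated column explicitly and prints only the GenBank accession. It assumes that dataformat leaves the second column empty when there is no paired assembly, as in the output shown above. The function name genbank_only is made up for this example.

```shell
# Read the two-column TSV on stdin and print the GenBank accession (column 1)
# of every row whose paired-accession column (column 2) is empty.
genbank_only() {
  awk -F '\t' '$1 ~ /^GCA_/ && $2 == "" { print $1 }'
}
```

In the pipeline, `genbank_only` would take the place of `grep -v GCF_`.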

or even better: use the RefSeq accession if it exists, otherwise use the GenBank accession for downloading:

datasets summary genome taxon 4751 --as-json-lines --assembly-source genbank | \
dataformat tsv genome --elide-header --fields accession,assminfo-paired-assm-accession | \
perl -lane 'print $F[1] ? $F[1] : $F[0]' | \
xargs datasets download genome accession
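
For a long list of accessions, xargs may split the arguments over several datasets invocations, and as far as I can tell each of them writes to the same default output file. A possible alternative, sketched below, is to write the accessions to a file first and pass that file to datasets with --inputfile. The function name write_accession_list and the file names accessions.txt and fungi.zip are made up for this example.

```shell
# Read the two-column TSV on stdin and write one accession per line to the
# file given as $1: the RefSeq accession if present, otherwise the GenBank one.
# Duplicates are removed by sort -u.
write_accession_list() {
  awk -F '\t' '$1 != "" { print ($2 != "" ? $2 : $1) }' | sort -u > "$1"
}

# Usage (needs the NCBI datasets and dataformat tools and network access):
#   datasets summary genome taxon 4751 --as-json-lines --assembly-source genbank | \
#   dataformat tsv genome --elide-header --fields accession,assminfo-paired-assm-accession | \
#   write_accession_list accessions.txt
#   datasets download genome accession --inputfile accessions.txt --filename fungi.zip
```

Keeping the list in a file also makes it easy to inspect or count the accessions before starting the download.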