mmseqs download will no longer get the most up-to-date nr/nt databases #893

Open
charlesfoster opened this issue Oct 3, 2024 · 1 comment

@charlesfoster

Expected Behavior

mmseqs download would be expected to download an up-to-date version of the target 'nr' and 'nt' databases.

Current Behavior

The FASTA download targets for the 'nr' and 'nt' databases are no longer being updated by NCBI. Taking 'nr' as an example, the URL in databases.sh points to https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz. The README in that FTP location states:

In April 2024, the BLAST FASTA files in this directory will no longer be
available. You can easily generate FASTA files yourself from the formatted
BLAST databases by using the BLAST utility blastdbcmd that comes with the
standalone BLAST programs. See NCBI Insights for more details
https://ncbiinsights.ncbi.nlm.nih.gov/2024/01/25/blast-fasta-unavailable-on-ftp/

Indeed, the nr.gz file was last updated on 2024-02-07.

Looking in the parent directory, the various NR database files have been updated as recently as 2024-10-02. Therefore, the download targets for mmseqs2 are out of date by about 8 months, and this problem will get worse over time.
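For anyone who wants to check the staleness themselves, here is a rough sketch that pulls the Last-Modified date out of the HTTP headers of a download target. The helper name `extract_last_modified` is made up for illustration, and the sketch assumes the server still answers for that URL and sends a Last-Modified header.

```shell
# Sketch only: extract the Last-Modified value from raw HTTP headers on stdin.
# extract_last_modified is a hypothetical helper, not part of mmseqs or BLAST+.
extract_last_modified() {
    # Drop carriage returns, then print the value of the Last-Modified header.
    # The pattern covers the usual capitalizations (HTTP/2 sends lowercase names).
    tr -d '\r' | sed -n 's/^[Ll]ast-[Mm]odified:[[:space:]]*//p'
}

# Usage (needs curl and network access; assumes the URL is still served):
#   curl -sI https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz | extract_last_modified
```

Comparing the printed date with the dates of the nr volumes in the parent directory shows how far behind the FASTA target is.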

NCBI's solution for users is to download the blast-format files and then generate their own FASTA files:

  • Sequences in FASTA format can be generated from the pre-formatted databases by using the blastdbcmd utility;

Obviously this isn't ideal for many users, and it's been getting at least some hate.

Solution

Unless NCBI backflips on their decision, the only solution would be to change the mmseqs databases workflow for these databases to include an intermediate (slow) step that extracts a FASTA file before the mmseqs createdb step is run. However, this would require an extra dependency, i.e. blastdbcmd. Otherwise, I suppose you could host periodic builds of the databases on a server or something.
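Until something like that exists, the intermediate step can be done by hand. Below is a minimal sketch of what I mean, not what mmseqs databases does: it assumes BLAST+ (`update_blastdb.pl`, `blastdbcmd`) and `mmseqs` are on the PATH, and the function names `run` and `blastdb_to_mmseqs` as well as the `DRY_RUN` switch are invented for this example. Expect it to be slow and to need a lot of disk space for nr/nt.

```shell
# Sketch only: rebuild a FASTA file from NCBI's pre-formatted BLAST database
# and turn it into an MMseqs2 database. Hypothetical helpers, not mmseqs code.

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# blastdb_to_mmseqs NAME OUTDIR
#   NAME   BLAST database name, e.g. nr or nt
#   OUTDIR directory that receives the BLAST volumes, the FASTA and the MMseqs2 DB
blastdb_to_mmseqs() {
    [ $# -eq 2 ] || { echo "usage: blastdb_to_mmseqs NAME OUTDIR" >&2; return 2; }
    name="$1"
    outdir="$2"

    run mkdir -p "$outdir" || return 1

    # update_blastdb.pl downloads into the current directory, so work in a subshell.
    (
        run cd "$outdir" &&
        run update_blastdb.pl --decompress "$name"
    ) || return 1

    # Dump every sequence of the BLAST database as FASTA (the slow step).
    run blastdbcmd -db "$outdir/$name" -entry all -out "$outdir/$name.fasta" || return 1

    # Convert the FASTA file into an MMseqs2 sequence database.
    run mmseqs createdb "$outdir/$name.fasta" "$outdir/${name}DB"
}

# Usage:
#   DRY_RUN=1; blastdb_to_mmseqs nr ./nr_build   # only print the commands
#   DRY_RUN=0; blastdb_to_mmseqs nr ./nr_build   # actually run them
```

The dry-run switch is only there so the command sequence can be inspected before committing to a multi-hour download.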

Just thought I should bring this to your attention in case you are unaware 😄

@milot-mirdita
Member

I would recommend just using UniProt instead of NR. It’s much better maintained, especially with the versioning. Any annotations against the NR are essentially unreproducible due to the lack of versioning by the NCBI.

I don’t plan on integrating the blast databases for the foreseeable future.
