KMCP's MetaPhlAn output doesn't follow the MetaPhlAn file format #34

Closed
apcamargo opened this issue Jun 25, 2023 · 3 comments

apcamargo commented Jun 25, 2023

The MetaPhlAn output generated by KMCP is not the same as the one generated by MetaPhlAn. In the KMCP output, the taxid column only contains the taxid of the lowest taxonomic rank (e.g. 1224), while the one generated by MetaPhlAn contains the full lineage, separated by | (e.g. 2|1224).
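To make the difference concrete, here is a minimal sketch of how the full-lineage taxid column could be rebuilt from a profile that only carries the lowest-rank taxid. It is not KMCP's or MetaPhlAn's actual code. The function name `lineage_taxid_paths` is hypothetical, and the sketch assumes every ancestor clade appears as its own row in the same table; leaving an empty slot for an ancestor with no row is also an assumption.

```python
def lineage_taxid_paths(rows):
    """Replace each row's taxid with the '|'-joined taxids of its lineage.

    rows: list of (clade_name, taxid, abundance) tuples, where clade_name
    is a MetaPhlAn-style lineage such as 'k__Bacteria|p__Pseudomonadota'.
    """
    # Map every clade name in the table to its own (lowest-rank) taxid.
    taxid_of = {clade: taxid for clade, taxid, _ in rows}
    out = []
    for clade, _, abundance in rows:
        parts = clade.split("|")
        path = []
        for i in range(1, len(parts) + 1):
            prefix = "|".join(parts[:i])
            # Assumption: an ancestor without its own row gets an empty slot.
            path.append(taxid_of.get(prefix, ""))
        out.append((clade, "|".join(path), abundance))
    return out


rows = [
    ("k__Bacteria", "2", 100.0),
    ("k__Bacteria|p__Pseudomonadota", "1224", 100.0),
]
for row in lineage_taxid_paths(rows):
    print(row)
```

With the two rows above, the phylum row's taxid column becomes `2|1224`, which is the form described in the report.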

This makes the KMCP output incompatible with TAXPASTA. Actually, the TAXPASTA error is due to ranks not summing to 100% (because some genomes' lineages skip ranks).

@apcamargo apcamargo changed the title MetaPhlAn output doesn't follow the MetaPhlAn file format KMCP's MetaPhlAn output doesn't follow the MetaPhlAn file format Jun 25, 2023
shenwei356 added a commit that referenced this issue Jun 26, 2023
@shenwei356
Owner

Thanks for reporting this. I did not notice this for such a long time ...

> Actually, the TAXPASTA error is due to ranks not summing to 100% (because some genomes' lineages skip ranks).

Skipping ranks should not cause that. Can you attach a file?

Another option is to use taxonkit cami-filter (without setting -t, --taxids) to recompute the abundances in CAMI format, which is one of the input formats taxpasta accepts.

@apcamargo
Author

Sure! TAXPASTA failed at the step where it checks the composition, specifically where it checks whether all taxa within a given rank sum to 100%. I summed the abundances manually and saw that some ranks had summed abundances lower than that.
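The manual check described above can be sketched roughly as follows. This is not TAXPASTA's implementation; `rank_sums` is a hypothetical helper that assumes a tab-separated profile with the clade name in column 1 and the relative abundance in column 3. The input rows are illustrative only (the unplaced-species row is invented), not taken from the attached files.

```python
from collections import defaultdict


def rank_sums(lines):
    """Sum relative abundances per rank of a MetaPhlAn-style profile."""
    sums = defaultdict(float)
    for line in lines:
        # Skip blank lines and '#' header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        clade, abundance = fields[0], float(fields[2])
        # The rank of a row is the prefix of the last clade in its lineage.
        rank = clade.split("|")[-1].split("__", 1)[0]
        sums[rank] += abundance
    return dict(sums)


# Illustrative rows; the last one is a made-up species with no phylum.
lines = [
    "k__Bacteria\t2\t100.0",
    "k__Bacteria|p__Bacillota\t1239\t60.0",
    "k__Bacteria|p__Bacillota|s__Firmicutes bacterium UBA1422\t1947935\t60.0",
    "k__Bacteria|s__Hypothetical unplaced bacterium\t0\t40.0",
]
for rank, total in rank_sums(lines).items():
    print(rank, total)
```

In this toy input the kingdom and species ranks each sum to 100, while the phylum rank sums to only 60 because one species has no phylum in its lineage.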

ERR7569999.metaphlan.txt
ERR7569998.metaphlan.txt
ERR7569997.metaphlan.txt

@shenwei356
Owner

shenwei356 commented Jun 29, 2023

I see. Some reference genomes' lineages do not have all 7 ranks, which is quite normal, I think. Maybe ask taxpasta to support this?

for r in k p c o f g s; do \
     echo -n "$r ";
     cat ERR7569997.metaphlan.txt  \
        | csvtk grep -H -r -p "${r}__[^\|]+$" \
        | csvtk summary -Ht -f 3:sum; \
done

k 100.00
p 85.88
c 68.43
o 55.63
f 41.11
g 19.43
s 100.00

k__Bacteria|p__Bacillota|s__Firmicutes bacterium UBA1422	1947935	0.038769	
k__Bacteria|p__Pseudomonadota|c__Betaproteobacteria|o__Burkholderiales|s__Burkholderiales bacterium	1891238	0.038604
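One way to see where the shortfall at each rank comes from is to add up, per rank, the abundance of species rows whose lineage lacks that rank. The sketch below does this for the two example rows above. `abundance_missing_per_rank` is a hypothetical helper, and it assumes species rows are the leaves of the profile and that the file is tab-separated with the abundance in column 3.

```python
RANKS = ["k", "p", "c", "o", "f", "g", "s"]


def abundance_missing_per_rank(lines):
    """For each rank, sum the abundance of species rows lacking that rank."""
    missing = {rank: 0.0 for rank in RANKS}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        clades = fields[0].split("|")
        # Only species rows are counted (assumed to be the leaves).
        if clades[-1].split("__", 1)[0] != "s":
            continue
        present = {clade.split("__", 1)[0] for clade in clades}
        for rank in RANKS:
            if rank not in present:
                missing[rank] += float(fields[2])
    return missing


lines = [
    "k__Bacteria|p__Bacillota|s__Firmicutes bacterium UBA1422\t1947935\t0.038769\t",
    "k__Bacteria|p__Pseudomonadota|c__Betaproteobacteria|o__Burkholderiales"
    "|s__Burkholderiales bacterium\t1891238\t0.038604",
]
for rank, value in abundance_missing_per_rank(lines).items():
    print(rank, round(value, 6))
```

For these two rows, the first contributes nothing to class, order, family, or genus, and the second contributes nothing to family or genus, so those ranks fall short of the species total by the corresponding amounts.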
