Package size issue: consider a different format for lemmatizer dictionaries #3258
Comments
Thanks for bringing this up and yes, I definitely agree. We've actually been thinking about solutions for this as well 👍
I think it could also be useful to make this optional, so that there would be a package without the data included and you would initialize it by pointing to files you provide from outside.
I wish there was something like
Then you can have two Python packages.
I think whether or not that is done, we should minify the data. The current size is unnecessary. I do not know how crucial the data is to normal spaCy operation. Perhaps a "lite" package could be made, but that seems to me like a separate endeavor.
Quick update on this – copying over my reply from #3983:
Howdy folks, just wanted to mention I'm working on this over here. I've moved most of the large language data files to gzipped JSON files and the uncompressed spaCy install is ~80 MB now. It still needs more cleanup and testing, but it should be ready for a first PR soon.
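For reference, loading such a gzipped JSON lookup table at runtime only needs the standard library. This is a minimal sketch for illustration; the file name and table layout are assumptions, not spaCy's actual data files or loading API:

```python
import gzip
import json

def load_lookup_table(path):
    """Load a lemma lookup table stored as gzipped JSON (hypothetical layout)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical usage, assuming a file like {"mice": "mouse", "was": "be", ...}:
# lemma_lookup = load_lookup_table("lemma_lookup.json.gz")
# print(lemma_lookup.get("mice", "mice"))
```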
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Feature description
The spacy package is almost 300 MB, which is a lot. This comes mostly from the spacy/lang/*/lemmatizer files, which include huge lookup tables in uncompressed Python source files. Storing this as package_data, in gzip'd form, could save about 90%.
Could the feature be a custom component or spaCy plugin?
No. This would mean a change to this repository's structure, and the need for the maintainers to run a little script to create the gzip'd files from the source data before release. But I think going from 300 MB to 30 MB is worth it.
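A rough sketch of the kind of release-time script described above, converting a lookup dict defined in a Python source module into a gzipped JSON package_data file. The module path, attribute name, and output file are made up for illustration and do not reflect spaCy's actual layout:

```python
import gzip
import importlib
import json
import os

def convert_lookup_module(module_name, attr, out_path):
    """Dump a large in-module lookup dict to a gzipped JSON file.

    module_name, attr, and out_path are hypothetical, e.g. a module like
    spacy.lang.en.lemmatizer exposing a LOOKUP dict.
    """
    module = importlib.import_module(module_name)
    table = getattr(module, attr)
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        json.dump(table, f, ensure_ascii=False, separators=(",", ":"))
    size_mb = os.path.getsize(out_path) / 1e6
    print(f"{out_path}: {size_mb:.1f} MB")

# Hypothetical usage before release:
# convert_lookup_module("spacy.lang.en.lemmatizer", "LOOKUP", "en_lemma_lookup.json.gz")
```

At install time the compressed files would ship as package_data, and at runtime they could be read back with gzip.open and json.load, as in the earlier sketch.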