Package size issue: consider a different format for lemmatizer dictionaries #3258

Closed
remram44 opened this issue Feb 11, 2019 · 8 comments
Labels
enhancement Feature requests and improvements install Installation issues lang / all Global language data

Comments

@remram44

remram44 commented Feb 11, 2019

Feature description

The spacy package is almost 300 MB, which is a lot. This comes mostly from the spacy/lang/*/lemmatizer files which include huge lookup tables in uncompressed Python source files.

Storing this as package_data, in gzip'd form, could save about 90%.

Could the feature be a custom component or spaCy plugin?

No. This would mean a change to this repository's structure, and the need for the maintainers to run a little script to create the gzip'd files from the source data before release. But going from 300 MB to 30 MB I think is worth it.
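A minimal sketch of the kind of "little script" described above, assuming a lookup table is a plain dict of word form to lemma. The function names and the toy `LOOKUP` data are hypothetical and are not part of spaCy; the actual savings depend on the real tables.

```python
import gzip
import json
import os
import tempfile


def write_gzipped_table(table, path):
    """Serialize a lookup table to compact JSON, gzip it, return the raw byte size."""
    payload = json.dumps(table, separators=(",", ":"), sort_keys=True).encode("utf-8")
    with gzip.open(path, "wb") as f:
        f.write(payload)
    return len(payload)


def read_gzipped_table(path):
    """Load a lookup table back from a gzipped JSON file."""
    with gzip.open(path, "rb") as f:
        return json.loads(f.read().decode("utf-8"))


# Toy stand-in for a lemmatizer lookup table (hypothetical data).
LOOKUP = {
    "word%d%s" % (i, suffix): "word%d" % i
    for i in range(2000)
    for suffix in ("s", "ed", "ing")
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lookup.json.gz")
    raw_size = write_gzipped_table(LOOKUP, path)
    gz_size = os.path.getsize(path)
    restored = read_gzipped_table(path)

print("raw JSON bytes:", raw_size, "gzipped bytes:", gz_size)
```

The resulting `.json.gz` files could then be shipped via `package_data` instead of as Python source files.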

@ines ines added enhancement Feature requests and improvements install Installation issues lang / all Global language data labels Feb 11, 2019
@ines
Member

ines commented Feb 11, 2019

Thanks for bringing this up and yes, I definitely agree. We've actually been thinking about solutions for this as well 👍

@mitar

mitar commented Feb 11, 2019

I think it could also be useful if this could be made optional. So that there would be a package without this included and you would have to initialize it so that you point to those files you provide from outside.

@ines
Member

ines commented Feb 12, 2019

> So that there would be a package without this included and you would have to initialize it so that you point to those files you provide from outside.

I wish there was something like extras_require but for not installing certain submodules. So we could do something like pip install spacy[light]. But unfortunately, this only works for adding stuff (unless I'm missing something?). We briefly thought about moving the language data out into a separate package but this would make the installation more frustrating, especially for beginners, because things wouldn't just work out-of-the-box anymore.

@mitar

mitar commented Feb 12, 2019

Then you can have two Python packages, spacy-lite and spacy, where spacy depends on spacy-lite and adds the data.

@remram44
Author

I think whether or not that is done, we should minify the data. The current size is unnecessary.

I do not know how crucial the data is to normal spacy operation. Perhaps a "lite" package can be made, but that seems to me like a separate endeavor.

@ines
Member

ines commented Jul 17, 2019

Quick update on this – copying over my reply from #3983:

I definitely agree that the package size is a problem and it's actually something we're actively working on at the moment.

There are some constraints that make it difficult to just allow a "spaCy light" installation (also see #3258), so we're starting by compressing the existing resources at build time to just make everything smaller. Next, we'll be moving the lookup tables and everything else that takes up space out of the library entirely – either by updating the lemmatization so lookups aren't needed (see #2668 and @guadi1994's great talk at spaCy IRL) or by allowing efficient lookups using external resources and model packages (see #3971). @polm will be working with us on the compression and efficient dictionary part of this btw 🙌

@polm
Contributor

polm commented Aug 15, 2019

Howdy folks, just wanted to mention I'm working on this over here. I've moved most of the large language data files to gzipped json files and the uncompressed spaCy install is ~80MB now.

It still needs more cleanup and testing but it should be ready for a first PR soon.
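
For illustration only, here is one way a gzipped JSON table could be read lazily, so the file is decompressed the first time a lookup is needed rather than at import time. This is a sketch under assumptions, not the code from the branch mentioned above: `LazyLookup`, the file name, and the sample entries are all hypothetical.

```python
import gzip
import json
import os
import tempfile


class LazyLookup:
    """Hypothetical wrapper that decompresses a gzipped JSON table on first access."""

    def __init__(self, path):
        self.path = path
        self._table = None  # nothing is read from disk until the first lookup

    def _load(self):
        if self._table is None:
            with gzip.open(self.path, "rt", encoding="utf-8") as f:
                self._table = json.load(f)
        return self._table

    def get(self, word, default=None):
        return self._load().get(word, default)


with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny sample table (hypothetical data) to a gzipped JSON file.
    path = os.path.join(tmp, "lemma_lookup.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump({"mice": "mouse", "went": "go"}, f)

    lookup = LazyLookup(path)
    not_loaded_yet = lookup._table is None  # True: file not read yet
    lemma = lookup.get("mice")              # triggers the load
    fallback = lookup.get("spaCy", "spaCy")  # unknown words fall back to the default
```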

@lock

lock bot commented Sep 19, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 19, 2019
4 participants