Package size issue: consider a different format for lemmatizer dictionaries #3258

remram44 · 2019-02-11T21:29:51Z

Feature description

The spacy package is almost 300 MB, which is a lot. This comes mostly from the spacy/lang/*/lemmatizer files which include huge lookup tables in uncompressed Python source files.

Storing this as package_data, in gzip'd form, could save about 90%.

Could the feature be a custom component or spaCy plugin?

No. This would mean a change to this repository's structure, and the need for the maintainers to run a little script to create the gzip'd files from the source data before release. But going from 300 MB to 30 MB I think is worth it.
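
As a rough illustration of the "little script" idea, here is a minimal sketch that writes a lookup table as gzip-compressed JSON and compares it with the same table stored as Python source. The table contents, file names and helper functions are made up for the example and are not spaCy's actual data or build tooling. The ratio it prints depends entirely on the data.

```python
import gzip
import json
import os
import tempfile


def write_gzipped_table(table, path):
    """Serialize a lookup dict as gzip-compressed JSON."""
    with gzip.open(path, "wt", encoding="utf8") as f:
        json.dump(table, f, ensure_ascii=False, sort_keys=True)


def read_gzipped_table(path):
    """Load a lookup dict back from gzip-compressed JSON."""
    with gzip.open(path, "rt", encoding="utf8") as f:
        return json.load(f)


# Hypothetical stand-in for a lemmatizer lookup table (inflected form -> lemma).
LOOKUP = {"word%ds" % i: "word%d" % i for i in range(50000)}

tmp_dir = tempfile.mkdtemp()
py_path = os.path.join(tmp_dir, "lookup.py")
gz_path = os.path.join(tmp_dir, "lookup.json.gz")

# The same table as uncompressed Python source, for comparison.
with open(py_path, "w", encoding="utf8") as f:
    f.write("LOOKUP = " + repr(LOOKUP) + "\n")

write_gzipped_table(LOOKUP, gz_path)

py_size = os.path.getsize(py_path)
gz_size = os.path.getsize(gz_path)
print("python source: %d bytes, gzipped json: %d bytes" % (py_size, gz_size))
```

The compressed file could then be shipped via `package_data` and read back with `read_gzipped_table` at load time.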


ines · 2019-02-11T22:07:46Z

Thanks for bringing this up and yes, I definitely agree. We've actually been thinking about solutions for this as well 👍

mitar · 2019-02-11T22:55:17Z

I think it could also be useful if this could be made optional. So that there would be a package without this included and you would have to initialize it so that you point to those files you provide from outside.

ines · 2019-02-12T12:13:47Z

> So that there would be a package without this included and you would have to initialize it so that you point to those files you provide from outside.

I wish there was something like extras_require but for not installing certain submodules. So we could do something like pip install spacy[light]. But unfortunately, this only works for adding stuff (unless I'm missing something?). We briefly thought about moving the language data out into a separate package but this would make the installation more frustrating, especially for beginners, because things wouldn't just work out-of-the-box anymore.

mitar · 2019-02-12T17:13:55Z

Then you can have two Python packages: spacy-lite and spacy. spacy depends on spacy-lite and adds the data.

remram44 · 2019-02-12T17:16:10Z

I think whether or not that is done, we should minify the data. The current size is unnecessary.

I do not know how crucial the data is to normal spacy operation. Perhaps a "lite" package can be made, but that seems to me like a separate endeavor.

ines · 2019-07-17T22:20:35Z

Quick update on this – copying over my reply from #3983:

I definitely agree that the package size is a problem and it's actually something we're actively working on at the moment.

There are some constraints that make it difficult to just allow a "spaCy light" installation (also see #3258), so we're starting by compressing the existing resources at build time to just make everything smaller. Next, we'll be moving the lookup tables and everything else that takes up space out of the library entirely – either by updating the lemmatization so lookups aren't needed (see #2668 and @guadi1994's great talk at spaCy IRL) or by allowing efficient lookups using external resources and model packages (see #3971). @polm will be working with us on the compression and efficient dictionary part of this btw 🙌

polm · 2019-08-15T14:57:59Z

Howdy folks, just wanted to mention I'm working on this over here. I've moved most of the large language data files to gzipped json files and the uncompressed spaCy install is ~80MB now.

It still needs more cleanup and testing but it should be ready for a first PR soon.
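
For readers wondering what reading such a file might look like, here is a minimal sketch of a lookup table that is only decompressed on first access. This is an assumption about the general approach, not the code from the branch mentioned above. The `LazyLookup` class, the file name and the sample entries are all hypothetical.

```python
import gzip
import json
import os
import tempfile


class LazyLookup:
    """Hypothetical wrapper that loads a gzipped JSON table on first use."""

    def __init__(self, path):
        self.path = path
        self._table = None  # nothing is read from disk until needed

    @property
    def loaded(self):
        return self._table is not None

    def _load(self):
        if self._table is None:
            with gzip.open(self.path, "rt", encoding="utf8") as f:
                self._table = json.load(f)
        return self._table

    def get(self, word, default=None):
        return self._load().get(word, default)


# Write a tiny made-up table so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "lemma_lookup.json.gz")
with gzip.open(demo_path, "wt", encoding="utf8") as f:
    json.dump({"mice": "mouse", "went": "go"}, f)

lemmas = LazyLookup(demo_path)
print(lemmas.loaded)           # the file has not been read yet
print(lemmas.get("mice"))      # first access triggers the load
print(lemmas.get("cat", "cat"))  # unknown words fall back to the default
```

Loading lazily means a language's table only costs memory and load time if that language is actually used.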

lock · 2019-09-19T13:42:49Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the enhancement (Feature requests and improvements), install (Installation issues), and lang / all (Global language data) labels Feb 11, 2019

ines mentioned this issue Feb 18, 2019

Debugger not working with Spanish language load #2673

Closed

ines mentioned this issue Mar 10, 2019

Question/Feature Request: Reducing spaCy package size #2851

Closed

ines mentioned this issue Apr 3, 2019

Feature Request: Seamless installation of language models from PyPI #3536

Closed

ines mentioned this issue Apr 17, 2019

Moving lemmatization lookup outside the library/lazy loading from json file #3603

Closed

This was referenced Jul 16, 2019

💫 Proposal: API for efficient serializable dictionaries and lookup tables #3971

Closed

Installing spacy with fewer languages to save disk space #3983

Closed

This was referenced Aug 18, 2019

Reduce size of language data #4140

Closed

Reduce size of language data #4141

Merged

ines closed this as completed Aug 20, 2019

lock bot locked as resolved and limited conversation to collaborators Sep 19, 2019
