Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import pynndescent lazily to speed up import #1090

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sliedes
Copy link

@sliedes sliedes commented Feb 1, 2024

Importing umap tended to take long; pynndescent is the culprit. This moves the substantial cost of importing pynndescent from import time to the first use. Among other things, this is likely to make it much nicer to run script.py --help and other operations that actually do not use UMAP on programs that nevertheless import UMAP.

Importing umap tended to take long; pynndescent is the culprit. This
moves the substantial cost of importing pynndescent from import time
to the first use. Among other things, this is likely to make it much
nicer to run `script.py --help` and other operations that actually do
not use UMAP on programs that import UMAP.
@sliedes
Copy link
Author

sliedes commented Feb 1, 2024

Related old issue: #631

@lmcinnes
Copy link
Owner

lmcinnes commented Feb 2, 2024

I understand that import time can be annoying. On the other hand I'm not convinced that punting it to first use necessarily improves the experience. Long term I would like to have caching work better in numba and then pynndescent will take less time to load (as we can safely cache more) but that's a different problem.

I'm not entirely averse to this proposal, but would it be possible to pitch me on why you think this approach will be better?

@sliedes
Copy link
Author

sliedes commented Feb 5, 2024

Thank you for giving me the opportunity to pitch!

It's likely not uncommon for libraries using UMAP either conditionally or unconditionally to import it in a module; generally, it requires work (similar to the patch I proposed here) to do things more lazily. For example, I come from using BERTopic, which by default uses UMAP when fitting a model, but I'm not sure it needs it when just loading and using a model.

I view it as more of an interactivity and iteration time issue. When I execute a script that indirectly imports UMAP, I only get to see any trivial error messages before getting to the point where UMAP would be used after several seconds due to the import time. Just running python script.py --help takes several seconds.

With this change, I'd assume most such scripts would be able to perform the error-prone steps of command line argument parsing, data loading etc. without having to spend the time it takes to import pynndescent.

In the case where UMAP is actually not needed (like I think with BERTopic when loading a model), the execution time of the import is, of course, saved entirely.

Now, it would obviously be possible for all such modules to do the same thing to lazily import UMAP, but then each of them would need to do it separately. Generally, I agree with you that the ideal case would be fixing things in numba so that pynndescent can be loaded faster; failing that, pushing the laziness as deep in the chain as possible results in the most generally useful fix in the sense that every user of it will automatically get the benefits. That is, the second nicest thing would be if pynndescent could only do its compilation lazily, but I understand that due to numba limitations this is not currently feasible without, I think, a more involved wrapping there. Having said that, if you wish, I can try to develop a similar workaround for pynndescent.

@lmcinnes
Copy link
Owner

lmcinnes commented Feb 8, 2024

You do make a decent case. On the other hand it will result in a significantly slower first run of UMAP because it will be deferring the import to that point, and as that import is hidden the delay won't be obvious to users. Let me think about it for a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants