Potential performance issue: .to_dict method slow in pandas below 2.2 #891

Closed
TendouArisu opened this issue Feb 29, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@TendouArisu

Problem Description

Hello.
I have found a performance problem in the .to_dict method of pandas 1.5.3, and I noticed that parts of this repository depend on pandas 1.5.3. Several files, such as skrub/_table_vectorizer.py, use the affected API, and there may be more. I am not sure whether this pandas performance problem affects this repository. The issue is discussed on the pandas GitHub in #50990 and #54824.
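To check whether the slowdown matters in a given environment, a rough timing sketch like the one below can be run under each installed pandas version. The helper name `time_to_dict`, the table shape, and the `orient` value are illustrative assumptions; this issue does not state which orientations or dtypes are affected, so it is worth trying the ones the calling code actually uses.

```python
import time

import numpy as np
import pandas as pd


def time_to_dict(n_rows=10_000, n_cols=20, orient="records", repeats=3):
    """Return the best wall-clock time in seconds of DataFrame.to_dict.

    Hypothetical helper for illustration; shape and orient are assumptions.
    """
    rng = np.random.default_rng(0)
    columns = [f"c{i}" for i in range(n_cols)]
    df = pd.DataFrame(rng.random((n_rows, n_cols)), columns=columns)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        df.to_dict(orient=orient)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Print the pandas version next to the timing so runs can be compared.
    print(pd.__version__, time_to_dict())
```

Running the same script in two environments (for example one with pandas 1.5.3 and one with a recent release) gives a direct comparison on the same machine.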

Feature Description

I recommend upgrading to pandas >= 2.2, or exploring other ways to avoid the slowdown.
Any other workarounds or solutions would be greatly appreciated.
Thank you!

Alternative Solutions

No response

Additional Context

No response

@TendouArisu TendouArisu added the enhancement New feature or request label Feb 29, 2024
@jeromedockes
Member

hello, thanks for investigating and reporting this!
a lot of this code is likely to change due to (i) adding support for polars dataframes, #888 and (ii) refactoring the table vectorizer, #877.

in any case 1.5.3 is a rather old version, so users can always update pandas and benefit from the fast to_dict in recent versions.
1.5.3 is the oldest supported version, but skrub also works with more recent versions of pandas (including the latest release). by default pip and conda will install the latest version.
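For users who want to confirm that their environment is past the slow versions, a minimal sketch of a version check follows. The function name `has_fast_to_dict` is hypothetical, and the (2, 2) threshold is taken from the issue title rather than verified independently.

```python
import re

# Threshold from the issue title ("slow in pandas below 2.2"); an assumption.
MIN_FAST_TO_DICT = (2, 2)


def has_fast_to_dict(version):
    """Return True if a version string is at or above MIN_FAST_TO_DICT.

    Only the leading "major.minor" part is compared, so pre-release
    suffixes such as "rc0" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return major_minor >= MIN_FAST_TO_DICT
```

With pandas imported as pd, calling has_fast_to_dict(pd.__version__) reports whether the installed version meets the threshold; if it does not, upgrading pandas with pip or conda is the fix suggested in this thread.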

@jeromedockes
Member

I think this can be closed, because a user who experiences bad performance due to this can just upgrade their pandas version (if I understand correctly). But if I misunderstood or am missing something, feel free to reopen, @TendouArisu
