Add dataframe dispatch #888

Merged: 35 commits into skrub-data:main, Mar 1, 2024

Conversation

@jeromedockes (Member)

The goal of this PR is to make it easy to support both polars and pandas in skrub.

To make a function compatible with both polars and pandas, we use the dispatch
decorator. The function then has an attribute specialize which we can use to
register implementations for polars or for pandas (or for other backends we may
add in the future).

Compared to the current approach of having a _pandas and a _polars module and a function _get_df_namespace which returns the module
corresponding to a dataframe, this has several advantages:

  • callers don't need to know whether a function is explicitly dispatched or
    not in order to know where to import it from and how to call it. This
    allows changing how a function is implemented (e.g. introducing the
    dataframe API) without changing all call sites.
  • specializations may be defined in any skrub module, so modules are not
    forced to collect their dispatched helpers in the _pandas and _polars
    modules. Functions can be grouped by functionality (as usual) rather than
    by backend.
  • a function and its specializations can be found next to each other.

For example:
>>> import pandas as pd

>>> from skrub._dispatch import dispatch
>>> @dispatch
... def drop_nulls(column):
...     raise NotImplementedError()

We can now register specializations for pandas and polars:

>>> @drop_nulls.specialize("pandas")
... def _drop_nulls_pandas(column):
...     return column.dropna()


>>> @drop_nulls.specialize("polars")
... def _drop_nulls_polars(column):
...     return column.drop_nulls()


>>> df = pd.DataFrame(dict(A=[0, 1, None, 3]))
>>> df
     A
0  0.0
1  1.0
2  NaN
3  3.0
>>> drop_nulls(df)
     A
0  0.0
1  1.0
3  3.0
>>> import polars as pl
>>> polars_df = pl.from_pandas(df)
>>> polars_df
shape: (4, 1)
┌──────┐
│ A    │
│ ---  │
│ f64  │
╞══════╡
│ 0.0  │
│ 1.0  │
│ null │
│ 3.0  │
└──────┘
>>> drop_nulls(polars_df)
shape: (3, 1)
┌─────┐
│ A   │
│ --- │
│ f64 │
╞═════╡
│ 0.0 │
│ 1.0 │
│ 3.0 │
└─────┘

It is also possible to define a specialization specifically for dataframes or
for series (series correspond to the generic type name "Column").

>>> @dispatch
... def f(obj):
...     pass

>>> @f.specialize("pandas", "DataFrame")
... def _(df):
...     print("DataFrame")

>>> @f.specialize("pandas", "Column")
... def _(df):
...     print("Column")

>>> f(pd.DataFrame())
DataFrame
>>> f(pd.Series([0]))
Column

@jeromedockes (Member Author)

see more details in the dispatch module docstring

@MarcoGorelli

hey - so, in #786 (comment) I was asked for my thoughts on this

first reaction to the dispatch mechanism - well done! this looks broadly useful beyond skrub; perhaps it could be its own separate package? A lot of libraries are now trying to support both pandas and polars, so a community of people interested in reusing this kind of code may well arise.

@jeromedockes (Member Author) commented Feb 16, 2024 via email

@MarcoGorelli commented Feb 16, 2024

I'm tempted to repurpose what I'd worked on as a polars-api-compat, so you can write pandas/polars/cudf/modin-agnostic code with a subset of the Polars API. Maybe this would be useful to you? If we don't make any claims about trying to make a Standard or anything like that, but just offer compatibility with the Polars API, then I don't think we would be stepping on the Consortium's toes. They've explicitly and repeatedly rejected the Polars-like expressions API anyway; this is clearly something they don't want, and nothing stops us from doing what would be useful to us.

The API I had in mind was:

import polars_api_compat  # hypothetical: the proposed package


def my_agnostic_function(df):
    dfx, plx = polars_api_compat.convert(df, api_version="0.20")

    # use some stable subset of Polars API, with `plx` as namespace
    # (`var_1` would be a date threshold defined elsewhere)
    result = (
        dfx.filter(plx.col("l_shipdate") <= var_1)
        .group_by("l_returnflag", "l_linestatus")
        .agg(
            plx.sum("l_quantity").alias("sum_qty"),
            plx.sum("l_extendedprice").alias("sum_base_price"),
            (plx.col("l_extendedprice") * (1 - plx.col("l_discount")))
            .sum()
            .alias("sum_disc_price"),
            (
                plx.col("l_extendedprice")
                * (1.0 - plx.col("l_discount"))
                * (1.0 + plx.col("l_tax"))
            )
            .sum()
            .alias("sum_charge"),
            plx.mean("l_quantity").alias("avg_qty"),
            plx.mean("l_extendedprice").alias("avg_price"),
            plx.mean("l_discount").alias("avg_disc"),
            plx.len().alias("count_order"),
        )
        .sort("l_returnflag", "l_linestatus")
    ).collect()

    # return result in original dataframe class
    return result.dataframe


my_agnostic_function(pandas.read_parquet("lineitem.parquet"))  # works, returns pandas dataframe
my_agnostic_function(polars.scan_parquet("lineitem.parquet"))  # works, returns polars dataframe

polars_api_compat would be a lightweight pure-Python package with no dependencies; it would just wrap pandas / cudf / modin (and in theory any other package) with a subset of the Polars API

I guess we could start using the dispatch internally in skrub and see how it plays out in practice. And then, if we find it convenient to work with, consider moving it out?

agree - start with getting something working, add tests, and then consider generalising 👍

@jeromedockes (Member Author)

I think that would be very useful, yes. Indeed, a compatibility layer for a few packages might be achievable faster than a more general standard.

@machow commented Feb 16, 2024

Hey! @MarcoGorelli pointed me here. I've been working on solutions to this problem, documented in a tool called databackend. It supports Python singledispatch without requiring the library being dispatched on to be imported. (I've vendored it into a tool called Great Tables, and used it to create a polars/pandas function layer in great_tables._tbl_data.py.)

I've been working on how to get good type hints, and think there's a pretty good solution (see this comment in this plum issue).

Happy to work more on this, but wanted to drop what I've got on this type of problem!

@jeromedockes (Member Author)

Thanks for sharing this, @machow! If we look at it from a distance, we landed on somewhat similar solutions: a decorator for defining generic functions and registering single-dispatch implementations without needing to import the actual types, plus a private module of generic helper functions defined with this dispatch mechanism. The ABC approach you use in databackend has the advantage of allowing explicit isinstance checks as an alternative to calling generic functions, and of allowing implementations to be registered by providing type hints rather than an argument to the decorator.
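
For illustration, a minimal sketch of that ABC idea using only the standard library; this mimics databackend's approach rather than reproducing its exact API, and a real implementation would handle ABC subclass caching more carefully:

import sys
from abc import ABC
from functools import singledispatch


class AbstractPandasFrame(ABC):
    # matches pandas.DataFrame without importing pandas: we only look at
    # sys.modules, so this is safe even when pandas is not installed
    @classmethod
    def __subclasshook__(cls, subclass):
        pd = sys.modules.get("pandas")
        if pd is None:
            return False  # pandas was never imported, so nothing can match
        return issubclass(subclass, pd.DataFrame)


@singledispatch
def n_rows(obj):
    raise NotImplementedError()


# implementations are registered through type hints
@n_rows.register
def _(df: AbstractPandasFrame):
    return len(df)

With this, isinstance(obj, AbstractPandasFrame) also works directly, which is the explicit-check alternative mentioned above.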

@jeromedockes (Member Author)

In the short term for skrub, I suggest we move forward with the dispatch decorator and the few helper functions introduced in this PR and see how far that takes us. As this is an internal implementation detail, it should be easy to swap (or complement) this approach with a more complete external library of backend-agnostic dataframe functions in the future, whether it relies on generic functions / dispatch or on the dataframe_api_compat / polars_api_compat way of getting to the concrete function definitions.

@jeromedockes (Member Author)

One thing that will need to be adjusted: as it stands, this PR defines a few functions that accept and return "Columns", i.e. pandas or polars Series.
This will have to be adapted for lazy frames, and we will have to decide how to pass around information about a single column in a lazy frame, e.g. a (LazyFrame, column name) pair, a (LazyFrame, expression) pair, a LazyFrame with just one column, … or whether we should rely on the concept of a column at all (the first two options are sketched below).
Still, I think I'd rather save that for another PR and improve polars support in skrub iteratively, one small improvement at a time.
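
Purely illustrative; none of these types exist in skrub, they just make the first two options above concrete:

from dataclasses import dataclass

import polars as pl


@dataclass
class LazyColumnByName:
    frame: pl.LazyFrame
    name: str  # the (LazyFrame, column name) option


@dataclass
class LazyColumnByExpr:
    frame: pl.LazyFrame
    expr: pl.Expr  # the (LazyFrame, expression) option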

@glemaitre self-requested a review February 19, 2024 14:17
@MarcoGorelli

Totally agree! If you start with this and get it working, then it's going to be a lot easier for me to take a step back and ask "what's skrub doing, and what can we abstract into an easily vendorable reusable solution?"

@glemaitre (Member) left a comment

A couple of thoughts regarding the dispatcher.


def _load_dataframe_module_info(name):
    # if the module is not installed, import errors are propagated
    if name == "pandas":
Member:

I'm wondering if this is overkill for now, but I would register the supported backends in a dict-like class. That might be one step closer to being able to register another backend by just writing the specialized version.
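
A rough sketch of that registry idea (the names here are hypothetical, not skrub's actual API):

# hypothetical: backends self-register, so adding one means writing a single
# loader function instead of editing _load_dataframe_module_info
_BACKEND_LOADERS = {}


def register_backend(name):
    def decorator(loader):
        _BACKEND_LOADERS[name] = loader
        return loader

    return decorator


@register_backend("pandas")
def _pandas_info():
    import pandas

    return {"module": pandas, "types": {"DataFrame": (pandas.DataFrame,)}}


def _load_dataframe_module_info(name):
    # import errors from a missing backend are still propagated
    return _BACKEND_LOADERS[name]()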

Member:

(Maybe going towards over-engineering:) we could do something similar to scikit-learn's set_output (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_set_output.py) by creating a manager: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_set_output.py#L183-L197

As it looks now, we return a dictionary, so we don't really need a protocol, but we could use a dataclass instead.

Member Author:

do you mean allowing backends to be registered dynamically? Unlike scikit-learn's set_output, here adding a backend involves defining many functions and adding them to the appropriate skrub modules, so I don't think that is likely to happen dynamically. My instinct would be to keep things simple until we need something more, but I may have misunderstood the goal of your suggestion.

Member:

do you mean to allow dynamically registering backends?

Yes, that is what I meant. But as I said, it is probably over-engineering for the moment. Maybe using a dataclass instead of a dict would better express what those functions should return.

    return {
        "module": pandas,
        "types": {
            "DataFrame": [pandas.DataFrame],
Member:

I would tend to have tuples instead of something mutable

Member Author:

ok! Note the dict that contains it is mutable too; should I replace that with a MappingProxyType to make it less easily mutable, too?
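
For reference, that would be a read-only view over the dict (writes raise a TypeError):

import pandas
from types import MappingProxyType

# reads work as usual; assignment such as types["LazyFrame"] = ... raises
types = MappingProxyType({"DataFrame": (pandas.DataFrame,)})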

Member Author:

the only function that accesses this dict is the dispatch function itself, so it's relatively easy to make sure it doesn't modify the dict

Member:

dict that contains it is mutable too

That is where I was thinking of a dataclass, indeed.

Member:

the only function that accesses this dict is the dispatch function itself, so it's relatively easy to make sure it doesn't modify the dict

Yep, I'm not worried about it actually being modified. But using an immutable type shows the reader that it is not supposed to be modified.

Member Author:

sounds good! When something doesn't need to be hashable, I tend to use tuples for structured records (i.e. something I'd unpack) and lists for sequences of arbitrary length (i.e. something I'd iterate over), but I get your point about showing it's not supposed to be modified; I'll make the change.

Member Author:

That is where I was thinking of a dataclass, indeed.

the "types" dict doesn't need to have the same keys for all backends though (eg there is no LazyFrame in pandas), so do you mean something like

from dataclasses import dataclass


@dataclass
class ModuleInfo:
    name: str
    types: dict[str, tuple[type, ...]]

?

Member:

yes, that is what I had in mind.
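
Picking up the ModuleInfo sketch above, a backend entry might then be built like this (illustrative only; "Column" maps to pandas.Series as in the dispatch examples at the top of this PR):

import pandas

PANDAS_INFO = ModuleInfo(
    name="pandas",
    types={"DataFrame": (pandas.DataFrame,), "Column": (pandas.Series,)},
)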

if generic_type_names is None:
    generic_type_names = list(module_info["types"].keys())
if isinstance(generic_type_names, str):
    generic_type_names = [generic_type_names]
Member:

a tuple as well

@@ -0,0 +1,40 @@
import pytest
Member:

is there a reason to start the name of the file with _?

Member:

Is the file going to be discovered automatically by pytest?

Member Author:

no reason, that's a typo :)

@GaelVaroquaux (Member) commented Feb 19, 2024 via email

#


def test_skrub_namespace(df_module):
Member:

I think it would be worth stating where this feature is defined in a module docstring.

Member Author:

just to make sure I understood correctly: you mean state in test_common.__doc__ that df_module is a pytest fixture defined in conftest.py?

Member:

you mean state in test_common.__doc__ that df_module is a pytest fixture defined in conftest.py?

Yes, just at the top of the file, adding a multiline docstring with the info that you mentioned. "module docstring" was a bit vague.

@jeromedockes (Member Author)

Would it not be more judicious to consider them as categorical data

so does this mean is_categorical should return True for bool columns?

and in the TableVectorizer we could always special-case bool columns and just cast them to float, or something like that

@GaelVaroquaux (Member) commented Feb 23, 2024 via email

@jeromedockes (Member Author)

as this is an internal addition (and is not used yet to add polars support in any of the skrub estimators) I don't think it needs a whatsnew entry

apart from that, @glemaitre, you can have another look

@jeromedockes (Member Author)

should to_numeric output float32 by default?

@glemaitre (Member)

should to_numeric output float32 by default?

I would tend to say no, but there may be a consideration to have here: are pandas and polars behaving the same?

@jeromedockes (Member Author)

I would tend to say no, but there may be a consideration to have here: are pandas and polars behaving the same?

it depends on the input; when converting strings they output int64 or float64
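
For instance, with pandas:

>>> import pandas as pd
>>> pd.to_numeric(pd.Series(["1", "2"])).dtype
dtype('int64')
>>> pd.to_numeric(pd.Series(["1.5", "2"])).dtype
dtype('float64')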

@glemaitre (Member)

I just found this issue: #870

If we want to go this way, then we can make np.float32 the primary data type.

@glemaitre (Member) left a comment

LGTM.

@glemaitre (Member)

@GaelVaroquaux do you want to have a final look, or are we good to merge as-is?

@glemaitre merged commit e9e3e11 into skrub-data:main on Mar 1, 2024 (28 checks passed)
@GaelVaroquaux (Member)

GaelVaroquaux commented Mar 3, 2024 via email

@MarcoGorelli

Hey all - just wanted to raise that I'll likely be getting 3 months' funding to work on Narwhals with 2 interns 🥳

If this would be useful to you and there's anything you'd like to see in it, please do give me a shout - anything that'd be useful to open source projects is considered in scope (so long as it's already in Polars and easy enough to do in pandas). The current API reference is here: https://marcogorelli.github.io/narwhals/api-reference/. I think it covers most of what you have here?

I hope I'm not coming across as trying to "force" Narwhals onto you; I'm just raising this in case it may help you reduce your cross-dataframe maintenance and focus on skrub's main mission. A sketch of what using it could look like is below.
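
Based on the linked API reference (names may have evolved since), a backend-agnostic helper could look like:

import narwhals as nw


def drop_nulls_agnostic(native_df):
    # accepts a pandas or polars dataframe and returns the same native type
    df = nw.from_native(native_df)
    return nw.to_native(df.drop_nulls())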


By the way, London is not very far from Paris, if/when we organize a sprint, we'll ping you :)

I hope to be able to make it to PyData Paris, hopefully we can meet there!

@jeromedockes (Member Author)

Thanks for letting us know @MarcoGorelli! That's great news. Indeed, it seems to cover most of the subset of the polars API we might need, and we should definitely consider relying on it if/when: the small set of private skrub functions we've added is not enough; maintaining it becomes a burden; using module-level functions becomes annoying and we want methods on generic dataframe objects as Narwhals offers; or we want something that exactly matches the polars API.
