[WIP] InterpolationJoiner dataframe API #827

Conversation

jeromedockes
Member

This changes the InterpolationJoiner to rely on the dataframe API (and on some utilities added to skrub._dataframe) so that it works with Polars.

The tests are now parametrized with a fixture px that resolves to either the pandas or the polars module.
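
For illustration, a minimal sketch of what such a fixture could look like (the use of pytest.importorskip and the exact parametrization here are assumptions, not necessarily the PR's implementation):

import pytest

@pytest.fixture(params=["pandas", "polars"])
def px(request):
    # Each test requesting `px` runs once with pandas and once with
    # polars; the run is skipped if the module is not installed.
    return pytest.importorskip(request.param)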

@jeromedockes
Member Author

In the utilities we added to skrub._dataframe, I believe make_dataframe, make_series, and join could use the dataframe API instead.

@jeromedockes
Member Author

Also, it seems our oldest supported pandas version does not support the dataframe API?

@jeromedockes
Member Author

I bumped the pandas version just to see if the CI runs, but having the dataframe API requires pandas 2.1.0 (release notes), which dates from August 2023.

@jeromedockes
Member Author

@MarcoGorelli, in case you have the time: I'm sure you would have advice on better use of the dataframe API in this one!

@MarcoGorelli

Ooh, seeing you try this out has made my day! I've got some things I need to finish now, but I'll take a careful look and see what we need to change upstream (I'm sure something will come up 😄).

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 17, 2023 via email

@Vincent-Maladiere (Member) left a comment

Hey @jeromedockes, some first comments.
Overall, I'm quite concerned by the complexity this PR introduces. Sure, using dataframe-api-compat will be challenging for readability, but we need to be extra careful not to introduce code that is even more challenging to read and understand.
If dataframe-api-compat is not mature enough yet, it might be wiser to wait rather than have to change a lot of logic three months later. LMKWYT.



@pytest.fixture
def df(px):
Member

I'm -1 for having a fixture and a function name that also looks like the standard dataframe variable df. I'm confused when I read this.

Member Author

OK, but note that fixture names are the names of the parameters in test functions, so in a test function df will be an actual dataframe (and we never call a fixture explicitly). What name would you prefer? A longer one like example_dataframe?
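
To illustrate how the injection works (a minimal, hypothetical sketch, not code from the PR):

import pytest

@pytest.fixture
def df(px):
    # `px` is the pandas or polars module provided by the `px` fixture.
    return px.DataFrame({"a": [1, 2, 3]})

def test_column_names(df):
    # pytest calls the fixture and passes its return value here, so
    # inside the test `df` is an actual dataframe, not a function.
    assert "a" in df.columns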

Member

If that's not too much of a constraint, then yes, I would prefer dataframe or example_dataframe for instance.

import enum


class Selector(enum.Enum):
Member

I'm in favor of avoiding enums and having strings instead for simplicity.

Member Author

Strings are already used for column names, so using a different type helps distinguish the two. We could also have more complex selectors that support set operations, as in Polars, but I thought it would be better to start with something simpler. Alternatively, we could have two functions, select and select_dtypes, which both accept strings; however, we might want support for selection by dtype in the SelectCols transformer, and we probably don't want a separate SelectColsByDtype transformer. How do you suggest we handle selection by dtype?
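
For context, a minimal sketch of the enum-based approach under discussion, shown for the pandas backend (the members and the dispatch below are illustrative, not the PR's exact code):

import enum

class Selector(enum.Enum):
    # A distinct type, so a selector can never be mistaken for a
    # plain column-name string.
    ALL = enum.auto()
    NUMERIC = enum.auto()
    CATEGORICAL = enum.auto()

def select(dataframe, columns):
    # Selector members select by dtype; anything else selects by name.
    if columns is Selector.ALL:
        return dataframe
    if columns is Selector.NUMERIC:
        return dataframe.select_dtypes("number")
    if columns is Selector.CATEGORICAL:
        return dataframe.select_dtypes("category")
    return dataframe[columns]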

  def select(dataframe, columns):
-     return dataframe.select(columns)
+     return dataframe.select(_check_selector(columns))
Member

I'm confused. Why is there a _check_selector function in _polars but not _pandas?

Member Author

This helper wouldn't really be appropriate in pandas, because there, depending on the selector, we need to select with either [] or select_dtypes.
In Polars we can just select() the output of _check_selector.
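
To make the asymmetry concrete, a rough sketch (the _check_selector body is an assumption based on this comment, not the PR's code; it reuses the hypothetical Selector enum from the sketch above):

import polars.selectors as cs

def _check_selector(columns):
    # Dtype selectors become expressions that pl.DataFrame.select
    # understands; plain column names pass through unchanged.
    if columns is Selector.NUMERIC:
        return cs.numeric()
    return columns

# Polars: a single code path through select().
def select(dataframe, columns):
    return dataframe.select(_check_selector(columns))

# pandas: no equivalent helper, because name-based and dtype-based
# selection go through different methods ([] vs select_dtypes).
def select_pandas(dataframe, columns):
    if columns is Selector.NUMERIC:
        return dataframe.select_dtypes("number")
    return dataframe[columns]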



def any_rowwise(dataframe):
    return _collect(dataframe.select(pl.any_horizontal(pl.all()))).get_column("any")
Member

This is not readable.

CATEGORICAL = enum.auto()


def std(obj):
Member

std is already the name of the standard deviation in all scientific modules. We must find another name for this function.

  assignments = []
- regression_table = aux_table.select_dtypes("number")
+ regression_table = ns.select(aux_table.dataframe, ns.Selector.NUMERIC)
Member

As said earlier, a string (e.g., "numeric") would be nicer, IMHO.

MarcoGorelli commented Nov 23, 2023

The dataframe-api way to do this is:

pdx = aux_table.__dataframe_namespace__()
aux_table.select(
    *[
        col.name
        for col in aux_table.columns_iter()
        if pdx.is_dtype(col.dtype, "numeric")
    ]
)

  estimator = clone(estimator)
- kept_rows = target_table.notnull().all(axis=1).to_numpy()
+ ns = skrubns(target_table.dataframe)
+ kept_rows = ~(std(ns.any_rowwise(target_table.is_null().dataframe)).to_array())
Member

This is hard to read, honestly.
An attempt to make it simpler:

ns, _ = get_df_namespace(target_table)
target_table = to_standard_df(target_table)
kept_rows = ns.any_rowwise(
    target_table.is_null().dataframe
)
kept_rows = ~to_standard_df(kept_rows).to_array()

WDYT?

Hopefully any_rowwise will be implemented in a future version of dataframe-api-compat, and this snippet can be simplified to:

kept_rows = ~target_table.is_null().any_rowwise().to_array()

Also, @MarcoGorelli, having all_rowwise and notnull methods in dataframe-api-compat would help!

MarcoGorelli commented Nov 17, 2023

You can already use namespace.any_rowwise in dataframe-api-compat

For example, here's a function which keeps all rows where any element is greater than 0:

In [1]: import pandas as pd

In [2]: df_pd = pd.DataFrame({"a": [-1, 1, 3], "b": [-2, -1, 8]})

In [3]: 
   ...: def my_dataframe_agnostic_function(df):
   ...:     df = df.__dataframe_consortium_standard__()
   ...:     pdx = df.__dataframe_namespace__()
   ...:     df = df.filter(pdx.any_horizontal(*[col>0 for col in df.columns_iter()]))
   ...:     return df.dataframe
   ...: 

In [4]: my_dataframe_agnostic_function(df_pd)
Out[4]: 
   a  b
0  1 -1
1  3  8

I need to get this changed in the standard, as DataFrame.any_rowwise doesn't really work if columns are backed by expressions (there's a reason it's a top-level function in Polars).

Might be nice to have a more convenient way to write *[df.col(col_name) > 0 for col_name in df.column_names], though. In Polars there's pl.all(); need to think of something.
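
For comparison, the Polars spelling this refers to (plain Polars, not the standard API):

import polars as pl

df = pl.DataFrame({"a": [-1, 1, 3], "b": [-2, -1, 8]})
# pl.all() expands to every column, so no explicit comprehension over
# column names is needed.
kept = df.filter(pl.any_horizontal(pl.all() > 0))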

MarcoGorelli

BTW, I think here you just need DataFrame.drop_nulls? That's in both the Standard and in dataframe-api-compat now.

Member Author

WDYT?

Sure, I can add names for the intermediate steps. Probably not the same name kept_rows for both the kept rows and their complement, though, so maybe kept_rows = ~discarded_rows or something like that.

Member Author

for example, here's a function which keeps all rows where any element is greater than 0:

When I run this I get AttributeError: 'Namespace' object has no attribute 'any_rowwise'; do I need to install the development version of one of the packages? I did notice the error message about using namespace.any_rowwise, but couldn't find it in the API specification.

Member Author

Here I would like to build a boolean mask that I also need to apply to y, which is a NumPy array, so any_rowwise seemed like a good option.
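
For illustration, a rough sketch of that pattern in plain Polars (the table and y below are made up; the PR itself goes through the standard API):

import numpy as np
import polars as pl

table = pl.DataFrame({"a": [1, None, 3], "b": [4.0, 5.0, None]})
y = np.array([10.0, 20.0, 30.0])

# One boolean per row: True if the row contains any null.
discarded_rows = (
    table.select(pl.any_horizontal(pl.all().is_null())).to_series().to_numpy()
)
kept_rows = ~discarded_rows

# The same mask filters both the dataframe and the NumPy target.
table = table.filter(pl.Series(kept_rows))
y = y[kept_rows]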

MarcoGorelli

We only got this merged into the standard yesterday: data-apis/dataframe-api#324

It's called any_horizontal; if you install the latest dataframe-api-compat you should have it.

  key_values = key_values[kept_rows]
- Y = target_table.to_numpy()[kept_rows]
+ target_table = target_table.persist()
+ Y = target_table.to_array(None)[kept_rows]
Member

For readability:

Suggested change
- Y = target_table.to_array(None)[kept_rows]
+ Y = target_table.to_array(dtype=None)[kept_rows]

MarcoGorelli

I also need to change this in the standard; I think the output dtype should always be inferable from the dtypes of the columns.

if Y_values.ndim == 1:
    Y_values = Y_values[:, None]
cols = [
    api_ns.column_from_1d_array(y.astype(type(y[0])), name=c, dtype=dt)
Member

This is hard to read.

api_ns.column_from_sequence(
    itertools.repeat(None, res["shape"][0]), name=c, dtype=dt
)
for c, dt in res["schema"].items()
Member

For readability:

Suggested change
- for c, dt in res["schema"].items()
+ for name, dtype in res["schema"].items()

MarcoGorelli commented Nov 17, 2023

If dataframe-api-compat is not mature enough yet, it might be wiser to wait rather than have to change a lot of logic three months later

FWIW, I'm aiming to tag the first non-beta version by February (data-apis/dataframe-api#319). Till then, I'm extremely happy if people experiment with it, but I would caution against putting a lot of work into using it.

@jeromedockes
Member Author

After all, we won't be using the dataframe API for this, so it will be easier to just start a new branch.
