[WIP] Interpolationjoiner dataframe api #827

Closed
1 change: 1 addition & 0 deletions .github/workflows/testing.yml
@@ -75,6 +75,7 @@ jobs:
   - shell: bash {0}
     run: |
       cp $GITHUB_WORKSPACE/pyproject.toml .
+      cp $GITHUB_WORKSPACE/conftest.py .
       $GITHUB_WORKSPACE/build_tools/github/test.sh
     working-directory: ${{ runner.temp }}
     name: 'Run tests'
2 changes: 1 addition & 1 deletion CHANGES.rst
@@ -17,7 +17,7 @@ Major changes

 * :class:`InterpolationJoiner` was added to join two tables by using
   machine-learning to infer the matching rows from the second table.
-  :pr:`742` by :user:`Jérôme Dockès <jeromedockes>`.
+  :pr:`742` and :pr:`827` by :user:`Jérôme Dockès <jeromedockes>`.

* Pipelines including :class:`TableVectorizer` can now be grid-searched, since
we can now call `set_params` on the default transformers of :class:`TableVectorizer`.
15 changes: 15 additions & 0 deletions conftest.py
@@ -0,0 +1,15 @@
import pandas
import pytest

DATAFRAME_MODULES = [pandas]
try:
    import polars

    DATAFRAME_MODULES.append(polars)
except ImportError:
    pass


@pytest.fixture(params=DATAFRAME_MODULES)
def px(request):
    return request.param
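
With this fixture, any test that declares a px parameter runs once per available dataframe module. A minimal sketch of how a test picks it up (the test itself is illustrative, not part of the PR):

def test_make_dataframe(px):
    # px is the pandas module on the first run, polars on the second (if installed).
    df = px.DataFrame({"a": [1, 2]})
    assert list(df.columns) == ["a"]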
9 changes: 5 additions & 4 deletions setup.cfg
@@ -25,11 +25,12 @@ project_urls =
include_package_data = True
packages = find:
install_requires =
-    scikit-learn>=1.2.1
+    scikit-learn>=1.3.0
     numpy>=1.23.5
     scipy>=1.9.3
-    pandas>=1.5.3
+    pandas>=2.1.0
     packaging>=23.1
+    dataframe-api-compat>=0.1.28
python_requires = >=3.10

[options.extras_require]
@@ -76,10 +77,10 @@ benchmarks =
# Overwrite the previous install_requires for CI testing purposes
# as defined in testing.yml.
min-py310 =
-    scikit-learn==1.2.1
+    scikit-learn==1.3.0
Member: Why this bump for scikit-learn? Is it related to Pandas?

Member Author: Yes. For the dataframe API we need a recent pandas, and because of breaking changes in recent pandas versions we also need a recent scikit-learn. I think this means we may have to wait a bit before using the dataframe API.

Commenter (quoting the above):

> yes for the dataframe api we need a recent pandas

You can use an earlier version, but you'll have to write your own helper function to opt in to it, like:

def try_convert_to_standard_compliant_dataframe(df):
    if hasattr(df, '__dataframe_consortium_standard__'):
        return df.__dataframe_consortium_standard__(api_version='2023.11-beta')
    try:
        import pandas as pd
    except ModuleNotFoundError:
        pass
    else:
        if isinstance(df, pd.DataFrame):
            # __dataframe_consortium_standard__ was added in pandas 2.1.0, so we need this for earlier versions
            from dataframe_api_compat.pandas_standard import convert_to_standard_compliant_dataframe
            return convert_to_standard_compliant_dataframe(df)
    try:
        import polars as pl
    except ModuleNotFoundError:
        pass
    else:
        if isinstance(df, (pl.DataFrame, pl.LazyFrame)):
            # __dataframe_consortium_standard__ was added in Polars 0.18.13, so we need this for earlier versions
            from dataframe_api_compat.polars_standard import convert_to_standard_compliant_dataframe
            return convert_to_standard_compliant_dataframe(df)
    return df
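
For what it's worth, a hypothetical usage sketch of the helper above (assumes dataframe-api-compat is installed; column_names is part of the consortium standard API):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
std_df = try_convert_to_standard_compliant_dataframe(df)
# From here on, only standard-API attributes are used, whatever the backend.
print(std_df.column_names)  # ['a', 'b']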

Member Author: Cool, thanks! That will definitely help us start using the dataframe API sooner. I think it's not a problem to require a recent version of the dataframe-api-compat package, whereas we should support a few older releases of pandas and scikit-learn.

     numpy==1.23.5
     scipy==1.9.3
-    pandas==1.5.3
+    pandas==2.1.0

[flake8]
# max line length for black
4 changes: 4 additions & 0 deletions skrub/_dataframe/__init__.py
@@ -0,0 +1,4 @@
from ._common import Selector, std, stdns
from ._namespace import get_df_namespace, skrubns

__all__ = ["get_df_namespace", "skrubns", "std", "stdns", "Selector"]
22 changes: 22 additions & 0 deletions skrub/_dataframe/_common.py
@@ -0,0 +1,22 @@
import enum


class Selector(enum.Enum):
Member: I'm in favor of avoiding enums and having strings instead, for simplicity.

Member Author: Strings are already used for column names, so using a different type helps distinguish the two. We could also have more complex selectors that support set operations, as in Polars, but I thought it would be better to start with something simpler. Another option is two functions, select and select_dtypes, which both accept strings; however, we might want support for selection by dtype in the SelectCols transformers, and we probably don't want a separate SelectColsByDtype transformer. How do you suggest we handle selection by dtypes?

    ALL = enum.auto()
    NONE = enum.auto()
    NUMERIC = enum.auto()
    CATEGORICAL = enum.auto()
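
A selector is passed wherever a list of column names would otherwise go; a short sketch grounded in the tests added further down:

import pandas as pd

from skrub._dataframe import Selector, skrubns

df = pd.DataFrame({"ID": [2, 3], "name": ["ab", "cd"]})
ns = skrubns(df)
ns.select(df, Selector.NUMERIC)  # dtype-based selection -> column "ID"
ns.select(df, ["name"])          # plain column names still work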


def std(obj):
Member: std is already the name of the standard deviation in all scientific modules. We must find another name for this function.

    try:
        return obj.__dataframe_consortium_standard__()
    except AttributeError:
        return obj.__column_consortium_standard__()


def stdns(obj):
Member: Ditto for this name.

    try:
        return obj.__dataframe_consortium_standard__().__dataframe_namespace__()
    except AttributeError:
        return obj.__column_consortium_standard__().__column_namespace__()
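
For context, a quick sketch of what these two helpers return, inferred from the tests added in this PR (assumes pandas >= 2.1, where __dataframe_consortium_standard__ is available):

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
std_df = std(df)  # standard-compliant wrapper; std_df.dataframe is the original frame
ns = stdns(df)    # the standard namespace, exposing e.g. ns.dataframe_from_columns(...)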
5 changes: 5 additions & 0 deletions skrub/_dataframe/_namespace.py
@@ -97,3 +97,8 @@ def get_df_namespace(*dfs):
"Only Pandas or Polars dataframes are currently supported, "
f"got {modules=!r}."
)


def skrubns(*dataframes):
Member: Is this function worth its cognitive load? I fear it will make our code less readable, for small gains.

Member Author: Fair enough, I can remove it. The gain is that we almost never need the second item returned by get_df_namespace, so I found myself repeatedly typing df, _ = get_df_namespace(...), which is OK in an assignment but not possible in an expression; that would require writing get_df_namespace(...)[0], and then the 0 seems a bit magic.

Member: OK, it makes sense. We could instead split get_df_namespace into two functions, returning either the namespace or the skrub namespace, so that you don't have to type df, _ = .... WDYT?

    ns, _ = get_df_namespace(*dataframes)
    return ns
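
A minimal sketch of the suggested split (function names are hypothetical, and it assumes get_df_namespace returns the skrub namespace followed by the underlying module, as its usage here suggests):

def get_skrub_namespace(*dataframes):
    # What skrubns does today: only the skrub helper module (_pandas or _polars).
    ns, _ = get_df_namespace(*dataframes)
    return ns


def get_df_module(*dataframes):
    # The underlying dataframe library itself.
    _, module = get_df_namespace(*dataframes)
    return module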
48 changes: 47 additions & 1 deletion skrub/_dataframe/_pandas.py
@@ -9,6 +9,22 @@

from skrub._utils import atleast_1d_or_none

from ._common import Selector

__all__ = [
    "make_dataframe",
    "make_series",
    "aggregate",
    "join",
    "split_num_categ_cols",
    "select",
    "drop",
    "Selector",
    "concat_horizontal",
    "any_rowwise",
    "to_pandas",
]


def make_dataframe(X, index=None):
"""Convert an dictionary of columns into a Pandas dataframe.
@@ -329,4 +345,34 @@


 def select(dataframe, columns):
-    return dataframe[columns]
+    if not isinstance(columns, Selector):
+        return dataframe[columns]
+    if columns is Selector.ALL:
+        return dataframe
+    elif columns is Selector.NONE:
+        return dataframe[[]]
+    elif columns is Selector.NUMERIC:
+        return dataframe.select_dtypes("number")
+    elif columns is Selector.CATEGORICAL:
+        return dataframe.select_dtypes(["object", "string", "category"])
+    # we have covered all items in the enumeration
+    assert False


def drop(dataframe, columns):
    return dataframe.drop(select(dataframe, columns).columns.values, axis=1)


def any_rowwise(dataframe):
    return dataframe.any(axis=1)


def concat_horizontal(dataframe, *other_dataframes):
    other_dataframes = [
        df.set_axis(dataframe.index, axis="index") for df in other_dataframes
    ]
    return pd.concat([dataframe] + list(other_dataframes), axis=1)


def to_pandas(dataframe):
    return dataframe
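
One design choice worth noting: concat_horizontal re-labels the other frames with the first frame's index before concatenating, so rows are paired by position rather than aligned by index label. A small illustration (not part of the diff):

import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
b = pd.DataFrame({"y": [3, 4]}, index=[10, 11])

# Plain pd.concat aligns on index labels and fills the gaps with NaN:
pd.concat([a, b], axis=1)
#       x    y
# 0   1.0  NaN
# 1   2.0  NaN
# 10  NaN  3.0
# 11  NaN  4.0

# concat_horizontal first gives b the index of a, so rows pair up by position:
concat_horizontal(a, b)
#    x  y
# 0  1  3
# 1  2  4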
61 changes: 60 additions & 1 deletion skrub/_dataframe/_polars.py
@@ -13,6 +13,23 @@

from skrub._utils import atleast_1d_or_none

from ._common import Selector

__all__ = [
    "POLARS_SETUP",
    "make_dataframe",
    "make_series",
    "aggregate",
    "join",
    "split_num_categ_cols",
    "select",
    "drop",
    "Selector",
    "concat_horizontal",
    "any_rowwise",
    "to_pandas",
]


def make_dataframe(X, index=None):
"""Convert an dictionary of columns into a Polars dataframe.
@@ -263,5 +280,47 @@
    return num_cols, categ_cols


def _check_selector(columns):
    if not isinstance(columns, Selector):
        return columns
    if columns is Selector.ALL:
        return cs.all()
    elif columns is Selector.NONE:
        return []
    elif columns is Selector.NUMERIC:
        return cs.numeric()
    elif columns is Selector.CATEGORICAL:
        return cs.string(include_categorical=True)
    # we have covered all items in the enumeration
    assert False


 def select(dataframe, columns):
-    return dataframe.select(columns)
+    return dataframe.select(_check_selector(columns))
Member: I'm confused. Why is there a _check_selector function in _polars but not in _pandas?

Member Author: This helper wouldn't really be appropriate in pandas, because there, depending on the selector, we need to select with either [] or select_dtypes. In Polars we can just select() the output of _check_selector.



def drop(dataframe, columns):
    return dataframe.drop(_check_selector(columns))


def any_rowwise(dataframe):
    return _collect(dataframe.select(pl.any_horizontal(pl.all()))).get_column("any")
Member: This is not readable.
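
One way to unpack that one-liner (an illustrative rewrite, not a change in the PR; like the original, it relies on pl.any_horizontal naming its output column "any"):

def any_rowwise(dataframe):
    # Build a single boolean column that is True wherever any value in the row is True.
    result = dataframe.select(pl.any_horizontal(pl.all()))
    result = _collect(result)  # materialize if we were handed a LazyFrame
    return result.get_column("any")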



def concat_horizontal(dataframe, *other_dataframes):
    return pl.concat(
        [_collect(dataframe)] + [_collect(df) for df in other_dataframes],
        how="horizontal",
    )


def _collect(dataframe):
    if hasattr(dataframe, "collect"):
        dataframe = dataframe.collect()
    return dataframe


def to_pandas(dataframe):
    if hasattr(dataframe, "collect"):
        dataframe = dataframe.collect()
    return dataframe.to_pandas()
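
A quick sketch of the intent (assumes a polars install with pandas and pyarrow available, which .to_pandas() requires):

import polars as pl

lazy = pl.LazyFrame({"a": [1, 2]})
to_pandas(lazy)            # collected first, then converted to a pandas DataFrame
to_pandas(lazy.collect())  # already materialized, converted directly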
43 changes: 43 additions & 0 deletions skrub/_dataframe/tests/test_all.py
@@ -0,0 +1,43 @@
import pytest

from skrub._dataframe import Selector, skrubns


@pytest.fixture
def df(px):
Member: I'm -1 on having a fixture and a function name that also looks like the standard dataframe variable df. I'm confused when I read this.

Member Author: OK, but note that fixture names are the names of parameters in the test functions, so in test functions df will be a dataframe (and we never explicitly call a fixture). What name would you prefer, a longer one like example_dataframe?

Member: If that's not too much of a constraint, then yes, I would prefer dataframe or example_dataframe, for instance.

    return px.DataFrame(
        {"ID": [2, 3, 7], "name": ["ab", "cd", "01"], "temp": [20.3, 40.9, 11.5]}
    )


def test_select(df):
    ns = skrubns(df)
    assert list(ns.select(df, []).columns) == []
    assert list(ns.select(df, ["name"]).columns) == ["name"]
    assert list(ns.select(df, Selector.ALL).columns) == list(df.columns)
    assert list(ns.select(df, Selector.NONE).columns) == []
    assert list(ns.select(df, Selector.NUMERIC).columns) == ["ID", "temp"]
    assert list(ns.select(df, Selector.CATEGORICAL).columns) == ["name"]


def test_drop(df):
    ns = skrubns(df)
    assert list(ns.drop(df, []).columns) == list(df.columns)
    assert list(ns.drop(df, ["name"]).columns) == ["ID", "temp"]
    assert list(ns.drop(df, Selector.ALL).columns) == []
    assert list(ns.drop(df, Selector.NONE).columns) == list(df.columns)
    assert list(ns.drop(df, Selector.NUMERIC).columns) == ["name"]
    assert list(ns.drop(df, Selector.CATEGORICAL).columns) == ["ID", "temp"]


def test_concat_horizontal(df):
    ns = skrubns(df)
    df1 = (
        df.__dataframe_consortium_standard__()
        .rename_columns({c: f"{c}_1" for c in df.columns})
        .dataframe
    )
    out = ns.concat_horizontal(df)
    assert list(out.columns) == list(df.columns)
    out = ns.concat_horizontal(df, df1)
    assert list(out.columns) == list(df.columns) + list(df1.columns)
11 changes: 11 additions & 0 deletions skrub/_dataframe/tests/test_common.py
@@ -0,0 +1,11 @@
from skrub._dataframe import skrubns, std, stdns


def test_std(px):
    df = px.DataFrame({"A": [1, 2]})
    assert hasattr(std(df), "dataframe")
    assert hasattr(stdns(df), "dataframe_from_columns")
    ns = skrubns(df)
    s = ns.make_series([1, 2], name="A")
    assert hasattr(std(s), "column")
    assert hasattr(stdns(s), "dataframe_from_columns")