
Improve DatetimeEncoder #784

Merged

Conversation

@Vincent-Maladiere (Member) commented Oct 5, 2023

References
Addresses issues raised in #743 and detailed in #768.

What changes?

  1. DatetimeEncoder exposes format_per_column_, which maps datetime-parsable columns (seen during fit) to their first non-null entry. We aim to use this column selection in the TableVectorizer to unify the datetime column selection logic.
  2. We now keep constant features, since removing them dynamically changed the shape of the output. That previous behavior could break pipelines requiring a specific shape and was surprising for the user.
  3. On the other hand, we detect date columns (as opposed to datetime columns), e.g. formatted like "2023-01-01", and we only extract features up to the "day" level for those.
  4. extract_until=None doesn't create an extra "total_second" column. It simply doesn't extract any feature among {"year", ..., "nanoseconds"}.
  5. The parameter add_total_second creates the "total_second" feature. The default is True.
  6. The parameter errors either raises (errors="raise") or outputs NaT (errors="coerce") when non-datetime-parsable values are given during transform. The default is "coerce" (see the sketch after this list).
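
To make these concrete, here is a minimal usage sketch based on the parameter and attribute names described in this PR (they may differ in the released API):

import pandas as pd
from skrub import DatetimeEncoder

X = pd.DataFrame({"login": ["2023-01-01 12:00:00", "2023-01-02 00:00:00", None]})

# Extract features down to the "day" level, add the "total_second" feature,
# and output NaT (instead of raising) for non-parsable values at transform time.
enc = DatetimeEncoder(extract_until="day", add_total_second=True, errors="coerce")
enc.fit(X)

# Maps each datetime-parsable column seen during fit to its first non-null entry.
enc.format_per_column_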

This PR required additional tests.

cc @LeoGrin @jeromedockes @GaelVaroquaux

np_dtypes_candidates = [np.object_, np.str_, np.datetime64]
if any(np.issubdtype(X.dtype, np_dtype) for np_dtype in np_dtypes_candidates):
    try:
        _ = pd.to_datetime(X)
Comment (Member):

I think to_datetime is stricter than DatetimeIndex, which was used before (@LeoGrin also mentioned that in #768 here); as a result, the docstring example prints an empty vectorized dataframe on my machine.

Being a bit more strict is probably a good idea, especially at first; using to_datetime instead of DatetimeIndex avoids issues such as this one, but the example needs to be updated.

@jeromedockes (Member):

In a later PR, will the TableVectorizer start using this module's is_datetime_parsable to identify datetime columns? Note there is some logic in the TableVectorizer to handle month-first vs day-first parsing that may need to be moved into the DatetimeEncoder module.

@jeromedockes (Member):

I wonder if we could have a transformer that just parses string columns that contain dates and replaces them with datetime64 columns? This is one thing that is annoying to do in pandas if we don't know in advance which columns are dates:

import pandas as pd

df = pd.DataFrame({"date": ["2023-10-06", "2023-10-07"]})
df["date"] = pd.to_datetime(df["date"]) # avoid having to do this

@Vincent-Maladiere (Member, Author) commented Oct 6, 2023

In a later PR, will the TableVectorizer start using this module's is_datetime_parsable to identify datetime columns?

My idea was to run enc = DatetimeEncoder().fit(X) in the TableVectorizer and then look up enc.format_per_column_ to get the datetime-parsable columns. WDYT?

is_datetime_parsable can also be used externally, but it is less convenient IMO.
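
Concretely, the lookup could be something like this (a sketch of the idea, not actual TableVectorizer code):

enc = DatetimeEncoder().fit(X)
datetime_columns = list(enc.format_per_column_)  # the datetime-parsable columns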

Note there is some logic in the TableVectorizer to handle month-first vs day-first parsing that may need to be moved into the DatetimeEncoder module.

Yes, I need to add some of the logic from _infer_date_format to handle dayfirst vs monthfirst better.

@Vincent-Maladiere (Member, Author):

this is one thing that is annoying to do in pandas if we don't know in advance which columns are dates

I like this suggestion a lot; this is indeed very annoying. Does it need to be a transformer? I feel we could use a "switch-bait" function in the sense of fuzzy_join, like skrub.to_datetime(). WDYT? cc @GaelVaroquaux

@Vincent-Maladiere (Member, Author) commented Oct 12, 2023

@jeromedockes I went for the switch-bait to_datetime with a rework on the backend side to decouple DatetimeEncoder from the column selection, the format guessing, and the parsing logic. LMKWYT :)

I need to update the tests.

@jeromedockes (Member) left a comment:

Thanks, it looks great! I wonder if we should make it for dataframes only (and possibly 2d arrays), and maybe give it a different name, because not all columns are converted. I get your "switch-bait" point, but it may make the function more complex; 1d (or scalar) inputs are well handled by the polars and pandas functions, and the skrub use case is dataframes.

I found a couple of issues; anything related to datetimes, timezones, and the interactions between standard-library datetimes, pandas, and polars is always a bit tricky :D

X_split = {col: X_split[col_idx] for col_idx, col in enumerate(X.columns)}
X = pd.DataFrame(X_split, index=index)
# conversion if px is Polars, no-op if Pandas
return px.DataFrame(X)
Comment (Member):

could we build a polars dataframe directly instead of pandas first and then convert? AFAIK at the moment initializing a polars dataframe with a pandas one will make a copy
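
For illustration, the direct construction could look like this (a sketch; it assumes X_split is a dict mapping column names to 1d arrays and that no index is needed):

import polars as pl

X = pl.DataFrame(X_split)  # build the polars frame directly, no pandas round-trip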

Comment (Member):

Also, this is not due to the skrub code, but the conversion fails when timezones are involved:

import polars as pl
from skrub import to_datetime

df = pl.DataFrame({"date": ["2023-10-12T17:36:50+01:00"]})
to_datetime(df) # pl.exceptions.ComputeError

but this seems to be more of a polars & pandas compatibility issue:

import pandas as pd
import polars as pl

df = pd.DataFrame({"date": ["2023-10-12T17:36:50+02:00"]})
df["date"] = pd.to_datetime(df["date"])
pl.DataFrame(df)

I'll open an issue on the polars repo

Reply (Member Author):

could we build a polars dataframe directly instead of pandas first and then convert? AFAIK at the moment initializing a polars dataframe with a pandas one will make a copy

Polars is still an optional dependency, so I guess we can't

Reply (Member Author):

I'll open an issue on the polars repo

Thank you for this! Note that I haven't written tests for Polars yet, maybe in a subsequent PR to avoid obfuscating this one.

Comment (Member):

Polars is still an optional dependency, so I guess we can't

Maybe not in this PR, but in the polars/pandas namespace you added, we could have a function to build a DataFrame from a dict of columns and an index (in the polars version the index would be ignored, possibly after checking that it is None).
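
A hypothetical sketch of such a helper (the name and signature are invented here for illustration):

def make_dataframe(px, columns, index=None):
    # px is the dataframe module in use: either pandas or polars
    if px.__name__ == "polars":
        if index is not None:
            raise ValueError("polars dataframes have no index")
        return px.DataFrame(columns)
    return px.DataFrame(columns, index=index)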

Reply (@Vincent-Maladiere, Member Author, Oct 13, 2023):

Yes indeed! I'll create an issue for it.

Comment (Member):

I'll open an issue on the polars repo

I opened pola-rs/polars#11774

@Vincent-Maladiere (Member, Author):

Thanks, it looks great! I wonder if we should make it for dataframes only (and possibly 2d arrays), and maybe give it a different name, because not all columns are converted. I get your "switch-bait" point, but it may make the function more complex; 1d (or scalar) inputs are well handled by the polars and pandas functions, and the skrub use case is dataframes.

Thanks! I get your idea, but we might end up with different behavior for series (pd.to_datetime) vs dataframes (skrub.to_datetime). For consistency, I'd advocate for handling both 1d and 2d inputs, so that the user can always rely on skrub.to_datetime, WDYT? I also simplified the logic for series to be the same as for dataframes.

I found a couple of issues, anything related to datetimes, timezones, and the interactions between standard library datetimes, pandas and polars is always a bit tricky :D

Thanks for pointing out the issues! I haven't tested it extensively with Polars yet.

@Vincent-Maladiere (Member, Author) commented Oct 13, 2023

The CI fails with the old pandas version.

[Screenshot: CI failure, Oct 13, 2023]

@jeromedockes (Member):

Thanks! I get your idea, but we might end up with different behavior for series (pd.to_datetime) vs dataframes (skrub.to_datetime). For consistency, I'd advocate for handling both 1d and 2d inputs, so that the user can always rely on skrub.to_datetime, WDYT? I also simplified the logic for series to be the same as for dataframes.

Yes, that makes sense. skrub.to_datetime will be different from pd.to_datetime and handle only a subset of use cases, so we should be careful to document that, and how we handle parameters that don't apply (such as "unit").

For example, pandas can convert a dataframe with separate columns for year, month, etc. into a single datetime series, whereas skrub looks at each column separately:

import pandas as pd
from skrub import to_datetime

df = pd.DataFrame({"year": [2023, 2024], "month": [1, 2], "day": [12, 13]})
print("pandas:")
dt = pd.to_datetime(df)
print(dt)
print("""
------------
skrub:""")
sdt = to_datetime(df)
print(sdt)

pandas:
0   2023-01-12
1   2024-02-13
dtype: datetime64[ns]

------------
skrub:
   year  month  day
0  2023      1   12
1  2024      2   13

@jeromedockes (Member):

Also, I like your choice to handle series basically as one-column dataframes, and I wonder if we should have this harmonization for arrays and scalars too:

import pandas as pd
from skrub import to_datetime

s = pd.Series(["a", "b"])
print(to_datetime(s))
print("""
-----------------
""")
print(to_datetime(s.values))

0    a
1    b
dtype: object

-----------------

/home/jerome/workspace/backedup_repositories/skrub/skrub/_datetime_encoder.py:85: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  return pd.to_datetime(X, **kwargs)
DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

@jeromedockes (Member):

We also have doctest errors everywhere. It seems very picky.

Yes, doctest compares the output as strings, so from this point of view "2022" and "2.022e+03" are not the same. I'll look into setting numpy and pandas formatting options so we get more sensible outputs.
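
For instance, something along these lines (standard numpy and pandas display options, shown here as a sketch rather than the final configuration):

import numpy as np
import pandas as pd

np.set_printoptions(precision=3, suppress=True)  # avoid "2.022e+03"-style output
pd.set_option("display.float_format", "{:.3f}".format)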

@Vincent-Maladiere (Member, Author) commented Nov 3, 2023

One caveat of this first implementation of skrub.to_datetime is that when we pass a numpy array of object dtype, with mixed-type entries, we might end up with numeric representations of the dates that are not helpful.

import numpy as np
import pandas as pd
from skrub import to_datetime

df = pd.DataFrame(
    dict(
        a=["2020-01-01", "2021-01-01"],
        b=[2020, 2021],
    )
)

# OK
to_datetime(df)
# a	b
# 0	2020-01-01	2020
# 1	2021-01-01	2021

# Not very helpful
to_datetime(df.values)
# array([[1609459200000000000, 2020],
#       [1609545600000000000, 2021]], dtype=object)

# Better, but needs all columns to be datetime
to_datetime(df.values[:, [0]])
# array([['2020-01-01T00:00:00.000000000'],
#       ['2021-01-01T00:00:00.000000000']], dtype='datetime64[ns]')

# OK, by converting to pydatetime
X_split = list(df.values.T)
X_split[0] = pd.to_datetime(X_split[0]).to_pydatetime()
out = np.vstack(X_split).T
out
# array([[datetime.datetime(2020, 1, 1, 0, 0), 2020],
#      [datetime.datetime(2021, 1, 1, 0, 0), 2021]], dtype=object)

# Notice how we still lose the integer representation of the second column.
pd.DataFrame(out).dtypes
# 0    datetime64[ns]
# 1            object

That being said, we can't cast datetime.datetime or pd.Timestamp entries to integers or floats, so maybe numpy's numeric representation, i.e. the output of to_datetime(df.values), is not that bad. WDYT @jeromedockes?

Anyway, we convert dataframes to a set of 1d numpy arrays, which might be problematic for the Polars datetime representation, for example.

Therefore, we should consider improving the to_datetime logic in a subsequent PR to specialize the ndarray, Polars, and Pandas dataframe / series cases, instead of reducing everything to a set of 1d numpy arrays.

@jeromedockes (Member):

# Not very helpful
to_datetime(df.values)
# array([[1609459200000000000, 2020],
#       [1609545600000000000, 2021]], dtype=object)

for this one I see a different result:

[[1577836800000000000 2020]
 [1609459200000000000 2021]]

which corresponds to the timestamps of the 2 dates. Why do you think that is not helpful?
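
(For reference, these integers are nanosecond epoch timestamps, which is easy to check:)

import pandas as pd

pd.Timestamp("2020-01-01").value  # 1577836800000000000, nanoseconds since the epoch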

@jeromedockes (Member):

In the case where we have to output a numerical numpy array, the timestamp is probably the best we can do. But as you said, in most cases we probably want dataframes as inputs and outputs, with proper datetime columns in the output.

"""
Parameters
----------
X_col : ndarray of shape ``(n_samples,)``
Comment (Member):

If missing values are not allowed in X_col I think we should mention it in the docstring

with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    # pd.unique handles None
    month_first_formats = pd.unique(vfunc(X_col, dayfirst=False))
Comment (Member):

Not sure if it could be a problem but note here we might end up with the string 'None' in the result, eg for _guess_datetime_format(np.asarray(["01/23/2021", ""])) we get array(['%m/%d/%Y', 'None'], dtype='<U8') in month_first_formats

@Vincent-Maladiere (Member, Author):

for this one I see a different result:

Yes, the results are just for illustration; don't mind them.

which corresponds to the timestamps of the 2 dates. Why do you think that is not helpful?

More specifically, it is not helpful in the context of working with dataframes, because we lose the datetime representation (using pd.to_datetime with timestamps is always tricky because of the unit kwarg, and this is not something we can do with skrub.to_datetime), but it is helpful for downstream ML applications.
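
For example, round-tripping such a timestamp back to a datetime requires knowing the unit (a quick illustration):

import pandas as pd

pd.to_datetime(1577836800000000000)   # default unit is nanoseconds -> 2020-01-01
pd.to_datetime(1577836800, unit="s")  # the same instant, as seconds since the epoch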

@jeromedockes (Member) left a comment:

LGTM!

@jeromedockes (Member):

@LeoGrin it has changed a bit since your approval, do you want to have another look? If not I will merge it

@LeoGrin (Contributor) commented Nov 8, 2023

@LeoGrin it has changed a bit since your approval, do you want to have another look? If not I will merge it

I'll have a look tonight, thanks!


Parameters
----------
X_col : ndarray of shape ``(n_samples,)``
Comment (Contributor):

Upping Jerome's comment "If missing values are not allowed in X_col I think we should mention it in the docstring"

@LeoGrin (Contributor) commented Nov 8, 2023

Just a few comments but otherwise LGTM, thanks!

@Vincent-Maladiere (Member, Author):

Let's merge then!

@Vincent-Maladiere merged commit 2bda119 into skrub-data:main on Nov 9, 2023 (26 checks passed), and deleted the refacto_datetime_encoder branch on November 9, 2023 at 10:37.