Generic DeltaTable error: type_coercion in Struct column in merge operation #1998

Closed
gustavodecarlo opened this issue Dec 28, 2023 · 7 comments
Labels
bug Something isn't working

@gustavodecarlo

Environment

Delta-rs version: rust-v0.16.5

Binding: python-v0.14.0

Environment: local machine

  • Cloud provider:
  • OS: macOS Sonoma 14.2.1
  • Other: python version 3.9.13

Bug

What happened:
_internal.DeltaError: Generic DeltaTable error: type_coercion
caused by
Error during planning: Failed to coerce then ([Struct([Field { name: "created_datetime", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "version", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), Struct([Field { name: "created_datetime", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "version", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), Struct([Field { name: "created_datetime", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "version", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), Struct([Field { name: "created_datetime", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "version", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), Struct([Field { name: "created_datetime", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "version", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])]) and else (None) to common types in CASE WHEN expression

What you expected to happen:

merge executed successfully

How to reproduce it:
Install both polars and deltalake with pip.
The same code is in this repo: https:/gustavodecarlo/poc-python-deltars-upsert/blob/main/poc_python_deltars_upsert/dataflow.py#L12

from datetime import datetime

from deltalake import DeltaTable, Schema
import polars as pl

# Initial data, including a struct column ("metadata") with string fields.
df = pl.DataFrame(
    {
        "sales_order_id": ["1000", "1001", "1002", "1003"],
        "product": ["bike", "scooter", "car", "motorcycle"],
        "order_date": [
            datetime(2023, 1, 1),
            datetime(2023, 1, 5),
            datetime(2023, 1, 10),
            datetime(2023, 2, 1),
        ],
        "sales_price": [120.25, 2400, 32000, 9000],
        "paid_by_customer": [True, False, False, True],
        "metadata": [
            {"created_datetime": "2023-01-10", "version": "1"},
            {"created_datetime": "2023-01-12", "version": "1"},
            {"created_datetime": "2023-01-10", "version": "1"},
            {"created_datetime": "2023-01-13", "version": "1"},
        ],
    }
)
print(df)

# Create the target table.
df.write_delta("data/sales_orders", mode="append")

# Rows to upsert: one existing order (1002) and one new order (1004).
new_data = pl.DataFrame(
    {
        "sales_order_id": ["1002", "1004"],
        "product": ["car", "car"],
        "order_date": [datetime(2023, 1, 10), datetime(2023, 2, 5)],
        "sales_price": [30000.0, 40000.0],
        "paid_by_customer": [True, True],
        "metadata": [
            {"created_datetime": "2023-01-10", "version": "1"},
            {"created_datetime": "2023-01-13", "version": "1"},
        ],
    }
)

dt = DeltaTable("data/sales_orders")
source = new_data.to_arrow()
# Round-trip the schema through the Delta schema and cast the source to it.
delta_schema = Schema.from_pyarrow(source.schema).to_pyarrow()
source = source.cast(delta_schema)

# This merge raises the type_coercion error shown above.
(
    dt.merge(
        source=source,
        predicate="s.sales_order_id = t.sales_order_id",
        source_alias="s",
        target_alias="t",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)

print(pl.read_delta("data/sales_orders"))
gustavodecarlo added the bug label Dec 28, 2023
@ion-elgreco
Collaborator

ion-elgreco commented Dec 28, 2023

Struct type coercion is not yet built into DataFusion, sadly.

I will expose the large_dtypes parameter so you can control it; set it to False and the merge should work.
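
To make the mismatch in the error message easier to see, here is a minimal sketch that lists the field data types of each Struct branch in the CASE WHEN expression. It is only a reading aid, not part of delta-rs: `ERROR_EXCERPT` is an abbreviated excerpt of the error quoted in this issue (two of the five branches, with the dict_id/metadata attributes trimmed), and `branch_types` is a hypothetical helper name.

```python
import re

# Abbreviated excerpt of the planning error quoted in this issue.
# Assumption: attributes such as dict_id and metadata are trimmed for brevity.
ERROR_EXCERPT = (
    "Failed to coerce then (["
    'Struct([Field { name: "created_datetime", data_type: LargeUtf8 }, '
    'Field { name: "version", data_type: LargeUtf8 }]), '
    'Struct([Field { name: "created_datetime", data_type: Utf8 }, '
    'Field { name: "version", data_type: Utf8 }])'
    "]) and else (None) to common types in CASE WHEN expression"
)


def branch_types(message):
    """Return the field data types of each Struct(...) branch, in order."""
    branches = re.findall(r"Struct\(\[(.*?)\]\)", message)
    return [tuple(re.findall(r"data_type: (\w+)", branch)) for branch in branches]


for index, types in enumerate(branch_types(ERROR_EXCERPT)):
    print(index, types)
```

Branches that differ only in LargeUtf8 versus Utf8 are the ones the planner cannot coerce to a common struct type.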

@gustavodecarlo
Author

Struct type coercion is not yet built in data fusion sadly.

I will expose the large_dtypes parameter so you can control that and set it to False, then it should work

Where do I set this config? @ion-elgreco

@ion-elgreco
Collaborator

@gustavodecarlo it will be on .merge()

@gustavodecarlo
Author

@gustavodecarlo it will be on .merge()

Let me try. Thanks for the insight.

@ion-elgreco
Collaborator

@gustavodecarlo it's not yet available. I will make it available in the next release, though.

@gustavodecarlo
Author

@gustavodecarlo it's not yet available. I will make it available in the next release though

Thanks for the information, let's wait. Can we close this issue?

ion-elgreco added a commit that referenced this issue Jan 2, 2024
# Description
This helps to avoid this [error](#1998), since you can now set large_dtypes=False.

Also, once there is better type coercion upstream in arrow-rs, this param
should be removable from the writer and merge operation entirely.
@gustavodecarlo
Author

@ion-elgreco, it worked with large_dtypes=False. Thanks for the help.
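
For anyone landing here later, a minimal sketch of the working call, assuming a deltalake release that includes the commit referenced above (so that `.merge()` accepts `large_dtypes`). `upsert_sales_orders` is a hypothetical helper name; `dt` and `source` stand for the `DeltaTable` and pyarrow table from the reproduction script.

```python
def upsert_sales_orders(dt, source):
    """Upsert `source` (a pyarrow.Table) into `dt` (a deltalake.DeltaTable).

    Hypothetical helper: the merge call mirrors the reproduction script in
    this issue, with large_dtypes=False added as suggested in the thread.
    """
    return (
        dt.merge(
            source=source,
            predicate="s.sales_order_id = t.sales_order_id",
            source_alias="s",
            target_alias="t",
            # Per this thread: avoids the Utf8 vs LargeUtf8 struct mismatch.
            large_dtypes=False,
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )


# Usage with the objects from the reproduction script (not executed here):
#   dt = DeltaTable("data/sales_orders")
#   upsert_sales_orders(dt, source)
```

The helper takes the table as an argument rather than opening it itself, so the same call can be reused for other tables with a matching key column.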
