IN (...) clauses appear to be ignored in merge commands with S3 - extra partitions scanned #2726

Closed
MuneebBaderoen opened this issue Aug 4, 2024 · 0 comments · Fixed by #2807
Labels
bug Something isn't working

MuneebBaderoen commented Aug 4, 2024

Environment

Delta-rs version: 0.18.2

Binding: Python

Environment:

  • Cloud provider: AWS - S3
  • OS: macOS Sonoma 14.5
  • Other: Python 3.12

Bug

What happened:

When performing a merge operation with partition_column IN ('value_1') in the initial predicate, the IN clause appears to be ignored as if it did not exist, whether it lists a single value or multiple values. No errors are raised.

A separate but potentially related issue: when performing a merge operation with partition_column = 'value_1', I sometimes see additional partitions being queried from S3. Which additional partition is retrieved is non-deterministic, but there is always an extra one in the example setup I have. I set up the example to debug the IN clause behaviour described above and spotted this along the way. Performing the same operation a second time queries only the exact partitions specified by the clauses in the predicate.

What you expected to happen:

  • When querying with a predicate that uses an IN clause, only the files for the partitions matching all clauses are requested from S3
  • When querying with a predicate that uses an = clause, only the files for the partitions matching all clauses are requested from S3
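
To make the expected behaviour concrete, here is a minimal model in plain Python. It is an illustration only, not delta-rs code: the function name expected_partitions and the dict-based representation of partitions are hypothetical, and the partition values are made up for the example.

```python
def expected_partitions(partitions, allowed):
    """Return the partitions whose values satisfy every clause.

    partitions: list of dicts, e.g. {"partition_day": "05"}
    allowed:    dict mapping column name -> set of permitted values
                (an `=` clause is a one-element set, an IN clause a larger set)
    """
    return [
        p
        for p in partitions
        if all(p.get(col) in values for col, values in allowed.items())
    ]


# Hypothetical table layout: four day partitions within one month.
all_partitions = [
    {"partition_month": "07", "partition_day": day}
    for day in ("04", "05", "06", "07")
]

# partition_day IN ('05', '06') should touch exactly two partitions.
in_result = expected_partitions(all_partitions, {"partition_day": {"05", "06"}})

# partition_day = '05' should touch exactly one partition.
eq_result = expected_partitions(all_partitions, {"partition_day": {"05"}})
```

In the behaviour reported here, the IN case instead reads files from every partition, and the = case reads one extra partition on the first run.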

How to reproduce it:
MRE can be found here: https:/MuneebBaderoen/delta-rs-in-predicate-mre

More details:
Slack thread: https://delta-users.slack.com/archives/C013LCAEB98/p1722382232123799

The impact of this behaviour is that attempting to upsert data across two partitions (for example at the boundary of days, or the boundary of months) dramatically increases the volume of data downloaded from S3.

On the boundary between days, the implementation I have would upsert data for partition_day IN ('05', '06') - but this clause would be ignored, resulting in all data for all partitions in the month being downloaded. This is visible in the localstack logs in the MRE provided.

On the boundary between months, the implementation I have would attempt to upsert data for partition_month IN ('07', '08') AND partition_day IN ('31', '01') - but both IN clauses would be ignored, resulting in all data for all partitions in the year being downloaded. This is visible in the localstack logs in the MRE provided.
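
For reference, the boundary predicates described above could be assembled along these lines. This is a sketch under assumptions: the helper name build_partition_predicate is hypothetical, the clauses are assumed to be combined with AND, and any join condition between source and target rows is omitted because it is not shown in this report.

```python
def build_partition_predicate(constraints):
    """Build a predicate string from {column: [values]}.

    Each column becomes one IN (...) clause; clauses are joined with AND.
    Single quotes inside values are doubled, as in standard SQL string literals.
    """
    clauses = []
    for column, values in constraints.items():
        quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
        clauses.append(f"{column} IN ({quoted})")
    return " AND ".join(clauses)


# Day boundary: a single IN clause on the day partition.
day_boundary = build_partition_predicate({"partition_day": ["05", "06"]})

# Month boundary: IN clauses on both the month and the day partitions.
month_boundary = build_partition_predicate(
    {"partition_month": ["07", "08"], "partition_day": ["31", "01"]}
)
```

Strings of this shape are what get passed as the merge predicate. With the behaviour reported here, neither form narrows the set of partitions read from S3.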
