Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release GIL in deltalake.write_deltalake #2234

Closed
mrocklin opened this issue Mar 1, 2024 · 12 comments · Fixed by #2257
Closed

Release GIL in deltalake.write_deltalake #2234

mrocklin opened this issue Mar 1, 2024 · 12 comments · Fixed by #2257
Labels
enhancement New feature or request

Comments

@mrocklin
Copy link

mrocklin commented Mar 1, 2024

I'm running deltalake.write_deltalake many times in parallel and noticing that my Python process is freezing up a bit. I suspect that this function has a long period where it doesn't release the GIL. Is this true?

If so, I suspect that it's likely accidental, and could be easily changed. I'm not familiar with the rust-Python tooling, but most Python binding systems make this pretty easy.

@mrocklin mrocklin added the enhancement New feature or request label Mar 1, 2024
@ion-elgreco
Copy link
Collaborator

@mrocklin which engine are you using to write with?

With the rust engine I don't think we are releasing the GIL but shouldn't be to tricky to add. It's there for MERGE

@mrocklin
Copy link
Author

mrocklin commented Mar 1, 2024

It looks like this maybe also happens for compact (and presumably other operations)

@mrocklin
Copy link
Author

mrocklin commented Mar 1, 2024

Whatever's default. My code looks like this:

    deltalake.write_deltalake(
        outfile,
        df,
        mode="append",
        storage_options=STORAGE_OPTIONS,
        partition_by="date",
    )

@ion-elgreco
Copy link
Collaborator

The default uses the PyArrow engine.

You could try compiling a version where the called method in rust is wrapped inside py.allow_threads, feel free to open a PR :)

@mrocklin
Copy link
Author

mrocklin commented Mar 1, 2024

Alas I'm not familiar with the Rust compilation/build process, so it's unlikely that I'll do this work myself (hopefully it's ok to raise an issue without volunteering to do the work myself).

@franz101
Copy link
Contributor

franz101 commented Mar 4, 2024

mhh I tried:

  deltalake.write_deltalake(
        outfile,
        df,
        mode="append",
        storage_options=STORAGE_OPTIONS,
        partition_by="date",
        engine="rust"
    )

so you suggest to add this line
m.add_function(py.allow_threads(|| [search_sequential(contents, needle)](pyo3::wrap_pyfunction!(write_to_deltalake, m)?)))?;
to lib.rs correct?

@ion-elgreco
Copy link
Collaborator

@franz101 you can do the same thing as done for merge:

https:/delta-io/delta-rs/blob/main/python%2Fsrc%2Flib.rs#L520-L541

@franz101
Copy link
Contributor

franz101 commented Mar 4, 2024

Ah nice, is there any benefit of making it up to the user if the function should block GIL or not?

@mrocklin
Copy link
Author

mrocklin commented Mar 4, 2024

Ah nice, is there any benefit of making it up to the user if the function should block GIL or not?

Almost certainly not is my guess. If the code genuinely creates/destroys Python objects then it should not release the GIL. If it doesn't screw around with Python objects at all (my hope) then it should definitely release the GIL.

FWIW I don't know of any other project that makes GIL-holding optional. It's either a good idea or it isn't, entirely dependent on the wrapped code.

@mrocklin
Copy link
Author

mrocklin commented Mar 4, 2024

(also, thank you for looking into this!)

@franz101
Copy link
Contributor

franz101 commented Mar 6, 2024

(also, thank you for looking into this!)
Thank you for the one billion challenge (though the trend goes to one trillion now).

I opened the PR tests are passing fine.
#2257

My next PR if the demand is there to use custom PK SK for DynamoDB since for DynamoDB latency is lower if using an existing database

ion-elgreco pushed a commit that referenced this issue Mar 8, 2024
# Description
Release GIL in deltalake.write_deltalake by wrapping it in
py.allow_threads
# Related Issue(s)
- closes #2234
# Documentation
@mrocklin
Copy link
Author

mrocklin commented Mar 8, 2024

Thank you @franz101 and @ion-elgreco !

Thank you for the one billion challenge (though the trend goes to one trillion now).

Ha, yes, the 1TRC is fun. A non-trivial amount of time is listing the parquet files, so adding delta into the mix would probably shave off a non-trivial amount of time :)

Should I raise another issue for the functions other than write_deltalake?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants