Paratext <-> Apache Arrow bridge #55

Open
wesm opened this issue Feb 17, 2017 · 1 comment

Comments

@wesm
Collaborator

wesm commented Feb 17, 2017

@deads at some point in the next 6 months, I would like to use the paratext codebase to emit native Arrow C++ array objects (and native categorical aka arrow::DictionaryArray). Eventually we can deprecate the existing CSV reader in pandas and make the paratext+Arrow-powered CSV reader the next-gen CSV reader for pandas (since I've already spent a lot of time optimizing the Arrow->pandas code path -- in pandas 2.0 the overhead should drop to 0).

The simplest thing would be to fork the codebase into a libarrow_csv shared library that lives in the Arrow codebase, since the code might diverge (and there will be overlapping concerns where code sharing might benefit, like on-the-fly dictionary encoding). Another option is to add a libparatext_arrow library within this repo, and make that a dependency of the pyarrow library, similar to how we've already built libparquet_arrow inside parquet-cpp (https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow). Thoughts?
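
For readers unfamiliar with the term, on-the-fly dictionary encoding can be sketched in a few lines of plain Python. This is only an illustration of the idea behind a categorical column (a list of unique values plus one integer code per row); the function names `dictionary_encode` and `dictionary_decode` are hypothetical and are not part of paratext, Arrow, or pandas.

```python
def dictionary_encode(values):
    """Encode an iterable of values as (dictionary, indices) in one pass.

    Hypothetical helper for illustration only, not an Arrow or paratext API.
    """
    dictionary = []   # unique values, in first-seen order
    code_of = {}      # value -> its position in `dictionary`
    indices = []      # one integer code per input value
    for value in values:
        code = code_of.get(value)
        if code is None:
            # First time this value is seen: assign the next free code.
            code = len(dictionary)
            code_of[value] = code
            dictionary.append(value)
        indices.append(code)
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values from the dictionary and the codes."""
    return [dictionary[i] for i in indices]


column = ["red", "green", "red", "blue", "green", "red"]
dictionary, indices = dictionary_encode(column)
```

Because the dictionary grows as values are encountered, the encoding can happen while the file is being parsed rather than in a second pass over the column.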

@deads
Contributor

deads commented Feb 20, 2017

Hi @wesm, this sounds like a very interesting idea. The paratext reader is still missing a lot of functionality that's available in pandas.read_csv, so I imagine it will take some work to flesh out the feature matrix before deprecating read_csv. Full DateTime support and reading arbitrary objects will take a lot of work to get the details right. The chunking features and usecols in pandas should be much easier. Calling Python functions in multi-threaded code is deadly, so a read_csv feature that takes in a pure Python function will be problematic. Let's discuss further in person.
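
As a rough illustration of what chunked reading with column selection means, here is a minimal standard-library sketch. It is an assumption-laden stand-in, not how pandas or paratext implement these features: the function name `iter_csv_chunks` and its parameters are hypothetical, values are left as strings, and everything runs in a single thread.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(text_stream, usecols, chunksize):
    """Yield column-oriented chunks containing only the requested columns.

    Hypothetical helper for illustration; not a pandas or paratext API.
    """
    reader = csv.reader(text_stream)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    # Resolve the requested column names to positions once, up front.
    keep = [header.index(name) for name in usecols]
    while True:
        rows = list(islice(reader, chunksize))
        if not rows:
            break
        # Each chunk maps a column name to that column's values (as strings).
        yield {name: [row[i] for row in rows] for name, i in zip(usecols, keep)}


data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n")
chunks = list(iter_csv_chunks(data, usecols=["a", "c"], chunksize=2))
```

Both features only restrict which rows and columns are materialized at a time, so they do not require calling back into user-supplied Python code during parsing.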
