Paratext <-> Apache Arrow bridge #55

wesm · 2017-02-17T19:52:33Z

@deads at some point in the next 6 months, I would like to use the paratext codebase to emit native Arrow C++ array objects (and native categorical aka arrow::DictionaryArray). Eventually we can deprecate the existing CSV reader in pandas and make the paratext+Arrow-powered CSV reader the next-gen CSV reader for pandas (since I've already spent a lot of time optimizing the Arrow->pandas code path -- in pandas 2.0 the overhead should drop to 0).

The simplest thing would be to fork the codebase into a libarrow_csv shared library that lives in the Arrow codebase, since the code might diverge (and there will be overlapping concerns where code sharing might benefit, like on-the-fly dictionary encoding). Another option is to add a libparatext_arrow library within this repo and make that a dependency of the pyarrow library, similar to how we've already built libparquet_arrow inside parquet-cpp (https:/apache/parquet-cpp/tree/master/src/parquet/arrow). Thoughts?
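
For readers unfamiliar with the term, here is a minimal pure-Python sketch of on-the-fly dictionary encoding, the idea behind a categorical column or an arrow::DictionaryArray. It is illustrative only: `dictionary_encode` is a hypothetical helper, not part of paratext or Arrow, and a real implementation would do this in C++ per parsing thread and merge the dictionaries afterwards.

```python
import csv
import io


def dictionary_encode(values):
    """Encode values as (dictionary, indices), assigning codes in first-seen order."""
    dictionary = []  # unique values, in order of first appearance
    code_of = {}     # value -> integer code
    indices = []     # one integer code per input value
    for value in values:
        code = code_of.get(value)
        if code is None:
            code = len(dictionary)
            code_of[value] = code
            dictionary.append(value)
        indices.append(code)
    return dictionary, indices


# Encode one string column while streaming over the CSV rows.
sample = io.StringIO("city,temp\nNYC,3\nSF,14\nNYC,5\nLA,20\nSF,13\n")
reader = csv.DictReader(sample)
dictionary, indices = dictionary_encode(row["city"] for row in reader)

print(dictionary)  # ['NYC', 'SF', 'LA']
print(indices)     # [0, 1, 0, 2, 1]
```

The column is stored as small integer codes plus one copy of each distinct string, which is why sharing this logic between the CSV parser and Arrow could pay off.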

deads · 2017-02-20T05:09:23Z

Hi @wesm, this sounds like a very interesting idea. The paratext reader is still missing a lot of functionality that's available in pandas.read_csv, so I imagine it will take some work to flesh out the feature matrix before deprecating read_csv. Full DateTime support and reading arbitrary objects will take a lot of work to get the details right. The chunking features and usecols in pandas should be much easier. Calling Python functions in multi-threaded code is deadly, so a read_csv feature that takes in a pure Python function will be problematic. Let's discuss further in person.
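
As a rough illustration of why chunking and column selection are the easier features, here is a minimal pure-Python sketch using only the standard library. `read_csv_chunks` is a hypothetical helper whose `usecols` and `chunksize` arguments merely mirror the names of the pandas.read_csv options; it is not the paratext or pandas implementation.

```python
import csv
import io
from itertools import islice


def read_csv_chunks(handle, usecols, chunksize):
    """Yield lists of rows (dicts restricted to usecols), at most chunksize rows each."""
    reader = csv.DictReader(handle)
    while True:
        # islice pulls up to chunksize rows from the shared reader iterator.
        chunk = [
            {name: row[name] for name in usecols}
            for row in islice(reader, chunksize)
        ]
        if not chunk:
            return
        yield chunk


sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n")
chunks = list(read_csv_chunks(sample, usecols=["a", "c"], chunksize=2))

for chunk in chunks:
    print(chunk)
```

Both features only restrict which rows and columns are materialized, so neither needs to call back into Python from worker threads, unlike a user-supplied converter function.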
