Gandiva C++ Merge. #7

Closed

praveenbingo

Replicating the gandiva cpp tree structure into arrow.

Note: it does not build yet; the work is being done locally.

pravindra and others added 30 commits May 29, 2018 01:45
Bootstrap evaluation using llvm code generation

LLVM code generation is done using a mix of:
- glue IR code that loops over the vector and generates function
  calls, and
- byte-code files generated from simple C++ functions using
  clang (emit-llvm).
The glue code and the pre-compiled byte code are merged and
optimized together.

Expressions are specified using a "tree builder" where each 
node is an arrow vector, or a binary/unary function.

During code generation, the expressions are "decomposed" so 
that the value array and bitmap array are evaluated separately 
to compute the expression result. This avoids the use of too 
many branch/conditional instructions (checks for "if null"), and
hence, can be vectorized efficiently.
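
The decomposition described above can be sketched roughly as follows. This is an illustration only, not the generated code: `Int32Column` and `AddInt32` are hypothetical names, and validity is stored as one byte per row for brevity, whereas Arrow packs it into bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical column layout: values plus one validity byte per row.
struct Int32Column {
  std::vector<int32_t> values;
  std::vector<uint8_t> valid;  // 1 = non-null, 0 = null
};

// Evaluates a + b with the value array and the validity array handled in
// separate passes, so neither loop contains an "if null" branch.
Int32Column AddInt32(const Int32Column& a, const Int32Column& b) {
  assert(a.values.size() == b.values.size());
  assert(a.valid.size() == a.values.size());
  assert(b.valid.size() == b.values.size());
  const std::size_t n = a.values.size();
  Int32Column out{std::vector<int32_t>(n), std::vector<uint8_t>(n)};

  // Value pass: compute every slot, including slots that are null.
  // The addition is done on unsigned values so that garbage in null
  // slots cannot trigger signed-overflow undefined behaviour.
  for (std::size_t i = 0; i < n; ++i) {
    out.values[i] = static_cast<int32_t>(static_cast<uint32_t>(a.values[i]) +
                                         static_cast<uint32_t>(b.values[i]));
  }
  // Validity pass: the result is non-null only where both inputs are.
  for (std::size_t i = 0; i < n; ++i) {
    out.valid[i] = a.valid[i] & b.valid[i];
  }
  return out;
}
```

Both loops have a fixed trip count and no data-dependent branches, which is the property that lets a compiler vectorize them.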

Support added for arithmetic and logical expressions on 
numeric types.

Travis CI support added for build on ubuntu.
Separate out the public and private target dependencies.

For arrow, export an interface target. This avoids the need to add
include dirs for each dependency on arrow.

Removed dependency on gtest. Instead, build it as an external project.
This is the recommended practice for googletest.

For pre-compiled files, generate the bitcode files for each of them
independently and then, link them to generate a unified bitcode file.
Removed cpplint exceptions since there is no more sourcing of .cc
files.

Separate out the public include files from private includes, and 
add them in the dependency list in cmake.

pass the bytecode filepath from cmake (instead of /tmp)
…he#8)

* GDV-43: [C++] Introduce error codes as error handling strategy.

Introduced status codes and using the same as the error handling strategy.

The decision was taken to accommodate existing libraries that use error codes and
because Arrow also uses error codes and not exceptions.

Changed the signatures across the board for the same.
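
A minimal sketch of what a status-code signature can look like, assuming a much simplified `Status` class. The class layout, the `StatusCode` values and the `GetElement` helper are illustrative and are not taken from the Gandiva or Arrow sources.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified status type: a code plus a message.
enum class StatusCode { kOk, kInvalid };

class Status {
 public:
  static Status OK() { return Status(StatusCode::kOk, ""); }
  static Status Invalid(std::string msg) {
    return Status(StatusCode::kInvalid, std::move(msg));
  }

  bool ok() const { return code_ == StatusCode::kOk; }
  StatusCode code() const { return code_; }
  const std::string& message() const { return message_; }

 private:
  Status(StatusCode code, std::string message)
      : code_(code), message_(std::move(message)) {}

  StatusCode code_;
  std::string message_;
};

// A fallible function returns Status and hands its result back through
// an out-parameter instead of throwing.
Status GetElement(const std::vector<int>& data, std::size_t index, int* out) {
  if (index >= data.size()) {
    return Status::Invalid("index out of range");
  }
  *out = data[index];
  return Status::OK();
}
```

Callers check `ok()` after each call and propagate the status upwards, which is why the signature change ripples through every layer.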
The pre-compiled functions take an extra arg (bool *) to set the
result validity.

At decompose time, a local bitmap is assigned to track the result
validity bits for such functions.

At evaluate time, enough local bitmaps are allocated to back every
local bitmap assigned at decompose time.

For the final computation of the expression validity, the input bitmaps
can be either one of the value-vector bitmaps, or a local bitmap.
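
The interplay between the extra `bool *` argument and a local bitmap might look like the sketch below. `DivideInt32`, `EvalDivide` and `DivideResult` are hypothetical names, and validity is again kept as one byte per row rather than as packed bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical pre-compiled function: besides its operands it takes a
// bool* through which it reports whether the result is valid.
int32_t DivideInt32(int32_t a, int32_t b, bool* out_valid) {
  const bool overflow = (a == std::numeric_limits<int32_t>::min() && b == -1);
  if (b == 0 || overflow) {
    *out_valid = false;
    return 0;
  }
  *out_valid = true;
  return a / b;
}

struct DivideResult {
  std::vector<int32_t> values;
  std::vector<uint8_t> valid;  // 1 = non-null, 0 = null
};

// Evaluates a / b row by row. `local_valid` plays the role of the local
// bitmap: it records the validity reported by the function itself.
DivideResult EvalDivide(const std::vector<int32_t>& a,
                        const std::vector<uint8_t>& a_valid,
                        const std::vector<int32_t>& b,
                        const std::vector<uint8_t>& b_valid) {
  const std::size_t n = a.size();
  assert(a_valid.size() == n && b.size() == n && b_valid.size() == n);

  std::vector<uint8_t> local_valid(n);
  DivideResult out{std::vector<int32_t>(n), std::vector<uint8_t>(n)};
  for (std::size_t i = 0; i < n; ++i) {
    bool ok = false;
    out.values[i] = DivideInt32(a[i], b[i], &ok);
    local_valid[i] = ok ? 1 : 0;
  }
  // Final validity: the input bitmaps combined with the local bitmap.
  for (std::size_t i = 0; i < n; ++i) {
    out.valid[i] = a_valid[i] & b_valid[i] & local_valid[i];
  }
  return out;
}
```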
Replaced MakeUnaryFunction, MakeBinaryFunction with a simpler
MakeFunction that takes a vector of args.
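
A rough sketch of a single builder entry point that takes a vector of arguments. The `Node` layout, `MakeField` and `ToString` are hypothetical and far simpler than the real tree builder, which also carries type information.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node;
using NodePtr = std::shared_ptr<Node>;

// Hypothetical expression node: a name plus child nodes.
struct Node {
  std::string name;
  std::vector<NodePtr> children;  // empty for field nodes
};

NodePtr MakeField(std::string name) {
  return std::make_shared<Node>(Node{std::move(name), {}});
}

// One entry point covers unary, binary and n-ary functions alike.
NodePtr MakeFunction(std::string name, std::vector<NodePtr> args) {
  return std::make_shared<Node>(Node{std::move(name), std::move(args)});
}

// Renders the tree as text. In this sketch a node without children is
// printed as a bare name.
std::string ToString(const NodePtr& node) {
  if (node->children.empty()) return node->name;
  std::string out = node->name + "(";
  for (std::size_t i = 0; i < node->children.size(); ++i) {
    if (i > 0) out += ", ";
    out += ToString(node->children[i]);
  }
  return out + ")";
}
```

With this shape, `MakeFunction("add", {MakeField("a"), MakeField("b")})` and a one-argument call go through the same code path, so no per-arity builder is needed.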
An if-else expression has three sub-expressions :
- condition
- then-expression
- else-expression
Each of these can again be a node in the expression tree.

The result validity of the if-else expression is saved in a local bitmap.

Also, moved all of the integ tests to a different folder (integ) so that there is no mix of include files.
- moved the expression decomposition logic from Node class
  to a visitor class (ExprDecomposer)
- moved node.h out of external includes
- renamed Evaluator to Projector
Added support for literals (int32, int64, float, double and bool).
In case of nested if-else conditions, eg.

if A
else if B
     else if C
          else D

The else parts of A & B will not update validity bitmaps.
Only the if parts and the terminal else (i.e D) update bitmaps.
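
The rule above can be illustrated with a small sketch. `Column`, `Arm` and `EvalIfChain` are hypothetical names, conditions are treated as plain booleans (null conditions are left out), and validity is one byte per row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical column: values plus one validity byte per row.
struct Column {
  std::vector<int32_t> values;
  std::vector<uint8_t> valid;  // 1 = non-null, 0 = null
};

// One "if cond then value" arm of the chain.
struct Arm {
  std::vector<uint8_t> cond;
  Column then_value;
};

// Evaluates: if arms[0] else if arms[1] ... else terminal_else.
// out.valid plays the role of the local bitmap for the whole chain.
Column EvalIfChain(const std::vector<Arm>& arms, const Column& terminal_else) {
  const std::size_t n = terminal_else.values.size();
  assert(terminal_else.valid.size() == n);
  Column out{std::vector<int32_t>(n), std::vector<uint8_t>(n)};
  for (std::size_t i = 0; i < n; ++i) {
    bool taken = false;
    for (const Arm& arm : arms) {
      assert(arm.cond.size() == n && arm.then_value.values.size() == n &&
             arm.then_value.valid.size() == n);
      if (arm.cond[i]) {
        // A taken "if" part writes both the value and the validity.
        out.values[i] = arm.then_value.values[i];
        out.valid[i] = arm.then_value.valid[i];
        taken = true;
        break;
      }
      // Not taken: this arm's "else" only hands over to the next arm
      // and leaves the bitmap untouched.
    }
    if (!taken) {
      // Terminal else: the only else part that writes the bitmap.
      out.values[i] = terminal_else.values[i];
      out.valid[i] = terminal_else.valid[i];
    }
  }
  return out;
}
```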
Split gandiva into two sub modules : codegen & jni.
- codegen is the core having cpp APIs and LLVM
- jni deals with protobufs & interfacing with java
Dremio allocates the output vectors in java and passes the pointers
to gandiva. In that case, gandiva will use the passed in buffers.

Made Evaluate use ArrayData internally for output buffers, since Array
is expected to be immutable.
Also, added "Adapted from XX" comments in ci/travis
- Added definitions for other integer types (int8, int16)
- Added definitions for unsigned types
- Added a test for arithmetic ops on all int types
- The functions should be inlined in the pre-compiled library, but
not in the unit tests. Added a  compiler flag to control this.
* GDV-58: [CPP] Fix order of includes.

Fixing the order of includes to follow style guideline.
The order to follow is documented here : https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes
Also enabled the check in lint.
GDV-7: Gandiva Java APIs

Added the JNI Implementation of the Java APIs
Added Java based unit and integration tests
Use cmake to build gandiva_jni
Added pom.xml to build Java files
Validating the input schema and expressions during the projector build.
- tree builder api for and/or
- decomposer/validator for and/or
- code generator for and/or
- tests for and/or
- add tree-builder, codegen support for null literals
- moved the code for final bitmap computation to class BitMapAccumulator
add java bindings for null literals
Support date/time types in Java
Add cpp/Java tests for date/time types
Loading Gandiva dynamically in java bindings.
Packaging the dynamic library and byte code files in Gandiva JAR.
Introduced configuration object to customize Gandiva at runtime.
- Track offsets buffer for string/binary
- annotator/generator support for string/binary
- literal support for string/binary
Modified the build to package the gandiva jni as a stand alone library that
can be packaged in the Gandiva JAR.

Also producing two versions of gandiva core - a static and a shared one.

Fixed LLVM dependencies to be target based.
- added target "make stylecheck" to check style
- added target "make stylefix" to fix style
- fixed README.md
- fixed ci script
- used stylefix to fix all existing style violations
- added java bindings for varlen types/literals
- minor cleanups in llvm generator and engine
  (reported by clang-tidy)
Added microbenchmarks in both cpp and Java
vvellanki and others added 22 commits July 14, 2018 16:18
Support isnull, isnotnull, equal, and not_equal for date/time types
Support date/time types for less_than, less_than_or_equal_to, greater_than, greater_than_or_equal_to
Implement all extractXxx functions
- Switched to gcc-4.9, since the stdc++ linked with 4.8 doesn't work with llvm libs.
- Build arrow in travis instead of the conda build (the conda built libarrow.a has undefined symbols je_arrow_allocx, ..)
- fixed an error in node.h that showed up when I toyed with clang compiler
Exporting supported data types and functions from Gandiva.
Added a JNI bridge to access this from the java layer.
* Fix the missing include directory setting for gtest
* Fix to use same format as other dependencies
Fixed the implementation of extract second from time.
* GDV-28: [C++] Add hash functions on all data types

* GDV-28: Fix stylecheck in travis to print diff

* GDV-28: pick clang-format from llvm-binary dir

* GDV-28: handle case when seed is null

* GDV-28: [C++] Fix a style check
Added support for literals and null for time types.
Class references are local by default and eligible for GC.

We would need to convert it to global reference on library load for it
to be safely used for the program lifetime.
Add support for timestampaddXxx functions
Add support for is_distinct_from, is_not_distinct_from, isnull, isnotnull, date_add/add, date_sub/subtract/date_diff, date_trunc_Xxx functions
…e#74)

* Temporarily matching what Dremio does for mod zero.
* Used the latest Arrow APIs for allocating buffers.
- similar to projection, filter is built for a specific schema and
  condition (i.e expression)
- the output of filter is a selection vector (Int16Array)
* Add java bindings for filter expr
* Move selection vector impl to internal
Fixed some bugs in the filter code path.
Changed the selection vector arrays to unsigned to match Dremio.
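
As a sketch of the filter output described above: a selection vector lists the indices of the rows that pass the condition, here with unsigned 16-bit indices. `BuildSelectionVector` is a hypothetical helper, and dropping rows whose condition is null is an assumption borrowed from SQL semantics rather than something these commit messages state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the indices of the rows for which the condition is true and
// non-null. Validity is one byte per row in this sketch.
std::vector<uint16_t> BuildSelectionVector(const std::vector<uint8_t>& cond,
                                           const std::vector<uint8_t>& cond_valid) {
  assert(cond.size() == cond_valid.size());
  assert(cond.size() <= 65536);  // every row index must fit in 16 bits
  std::vector<uint16_t> selected;
  for (std::size_t i = 0; i < cond.size(); ++i) {
    if (cond_valid[i] && cond[i]) {
      selected.push_back(static_cast<uint16_t>(i));
    }
  }
  return selected;
}
```

For conditions `{1, 0, 1, 1}` with validity `{1, 1, 0, 1}`, rows 0 and 3 are selected; row 2 is dropped because its condition is null.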
1. Added lock to holder read to address potential race condition.
2. Fixed log message.
3. Addressed breaking arrow change.
1. In evaluate, look up the module without the lock first, and fall back to a
   locked lookup only if the module is not found.
2. Use release builds in travis.
Introducing a cache to hold the projectors and filters for re-use.
The cache is an LRU cache that can hold 100 entries.
* GDV-31: [Java][C++] Fixed concurrency issue in cache.

Modifications were happening in get without a mutex.
Wrote a test to verify and prevent regression.
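
A minimal sketch of an LRU cache whose `Get` takes the mutex, assuming C++17 for `std::optional`. `LruCache` and its members are hypothetical names, not the actual Gandiva cache. The point it shows is that a lookup reorders the recency list, so it is a write and needs the same lock as an insert.

```cpp
#include <cstddef>
#include <list>
#include <mutex>
#include <optional>
#include <unordered_map>
#include <utility>

template <typename Key, typename Value>
class LruCache {
 public:
  explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

  std::optional<Value> Get(const Key& key) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = index_.find(key);
    if (it == index_.end()) return std::nullopt;
    // Moving the entry to the front mutates the list, which is why a
    // lookup must hold the lock too.
    order_.splice(order_.begin(), order_, it->second);
    return it->second->second;
  }

  void Put(const Key& key, Value value) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = std::move(value);
      order_.splice(order_.begin(), order_, it->second);
      return;
    }
    if (capacity_ == 0) return;
    if (order_.size() >= capacity_) {
      // Evict the least recently used entry, kept at the back.
      index_.erase(order_.back().first);
      order_.pop_back();
    }
    order_.emplace_front(key, std::move(value));
    index_[key] = order_.begin();
  }

 private:
  using Entry = std::pair<Key, Value>;

  std::size_t capacity_;
  std::mutex mutex_;
  std::list<Entry> order_;  // most recently used first
  std::unordered_map<Key, typename std::list<Entry>::iterator> index_;
};
```

A cache sized as described above would be constructed as `LruCache<KeyType, ValueType> cache(100);`, where the key and value types stand for whatever identifies and holds a projector or filter.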
Literal string conversion was ignoring types, leading
to a mismatch in hashing of expressions.
- add a registry for "function holders" implemented in cpp
- the function holder is instantiated at expression decomposition time
- at eval time, the registered fn gets an extra param (the function holder)
- To get around the java load issue, create a native library and load it in the LLVM module.
  This module has the hooks for all the c++ function helpers.
- for files that are compiled in libgandiva_helpers, add them into the gandiva::helpers namespace.
- merged status.cc into status.h
pprudhvi pushed a commit that referenced this pull request May 26, 2020
This PR enables tests for `ARROW_COMPUTE`, `ARROW_DATASET`, `ARROW_FILESYSTEM`, `ARROW_HDFS`, `ARROW_ORC`, and `ARROW_IPC` (default on). apache#7131 enabled a minimal set of tests as a starting point.

I confirmed that these tests pass locally with the current master. In the current TravisCI environment, we cannot see this result due to a lot of error messages in `arrow-utility-test`.

```
$ git log | head -1
commit ed5f534
% ctest
...
      Start  1: arrow-array-test
 1/51 Test  #1: arrow-array-test .....................   Passed    4.62 sec
      Start  2: arrow-buffer-test
 2/51 Test  #2: arrow-buffer-test ....................   Passed    0.14 sec
      Start  3: arrow-extension-type-test
 3/51 Test  #3: arrow-extension-type-test ............   Passed    0.12 sec
      Start  4: arrow-misc-test
 4/51 Test  #4: arrow-misc-test ......................   Passed    0.14 sec
      Start  5: arrow-public-api-test
 5/51 Test  #5: arrow-public-api-test ................   Passed    0.12 sec
      Start  6: arrow-scalar-test
 6/51 Test  #6: arrow-scalar-test ....................   Passed    0.13 sec
      Start  7: arrow-type-test
 7/51 Test  #7: arrow-type-test ......................   Passed    0.14 sec
      Start  8: arrow-table-test
 8/51 Test  #8: arrow-table-test .....................   Passed    0.13 sec
      Start  9: arrow-tensor-test
 9/51 Test  #9: arrow-tensor-test ....................   Passed    0.13 sec
      Start 10: arrow-sparse-tensor-test
10/51 Test #10: arrow-sparse-tensor-test .............   Passed    0.16 sec
      Start 11: arrow-stl-test
11/51 Test #11: arrow-stl-test .......................   Passed    0.12 sec
      Start 12: arrow-concatenate-test
12/51 Test #12: arrow-concatenate-test ...............   Passed    0.53 sec
      Start 13: arrow-diff-test
13/51 Test #13: arrow-diff-test ......................   Passed    1.45 sec
      Start 14: arrow-c-bridge-test
14/51 Test #14: arrow-c-bridge-test ..................   Passed    0.18 sec
      Start 15: arrow-io-buffered-test
15/51 Test #15: arrow-io-buffered-test ...............   Passed    0.20 sec
      Start 16: arrow-io-compressed-test
16/51 Test #16: arrow-io-compressed-test .............   Passed    3.48 sec
      Start 17: arrow-io-file-test
17/51 Test #17: arrow-io-file-test ...................   Passed    0.74 sec
      Start 18: arrow-io-hdfs-test
18/51 Test #18: arrow-io-hdfs-test ...................   Passed    0.12 sec
      Start 19: arrow-io-memory-test
19/51 Test #19: arrow-io-memory-test .................   Passed    2.77 sec
      Start 20: arrow-utility-test
20/51 Test #20: arrow-utility-test ...................***Failed    5.65 sec
      Start 21: arrow-threading-utility-test
21/51 Test #21: arrow-threading-utility-test .........   Passed    1.34 sec
      Start 22: arrow-compute-compute-test
22/51 Test #22: arrow-compute-compute-test ...........   Passed    0.13 sec
      Start 23: arrow-compute-boolean-test
23/51 Test #23: arrow-compute-boolean-test ...........   Passed    0.15 sec
      Start 24: arrow-compute-cast-test
24/51 Test #24: arrow-compute-cast-test ..............   Passed    0.22 sec
      Start 25: arrow-compute-hash-test
25/51 Test #25: arrow-compute-hash-test ..............   Passed    2.61 sec
      Start 26: arrow-compute-isin-test
26/51 Test #26: arrow-compute-isin-test ..............   Passed    0.81 sec
      Start 27: arrow-compute-match-test
27/51 Test #27: arrow-compute-match-test .............   Passed    0.40 sec
      Start 28: arrow-compute-sort-to-indices-test
28/51 Test #28: arrow-compute-sort-to-indices-test ...   Passed    3.33 sec
      Start 29: arrow-compute-nth-to-indices-test
29/51 Test #29: arrow-compute-nth-to-indices-test ....   Passed    1.51 sec
      Start 30: arrow-compute-util-internal-test
30/51 Test #30: arrow-compute-util-internal-test .....   Passed    0.13 sec
      Start 31: arrow-compute-add-test
31/51 Test #31: arrow-compute-add-test ...............   Passed    0.12 sec
      Start 32: arrow-compute-aggregate-test
32/51 Test #32: arrow-compute-aggregate-test .........   Passed   14.70 sec
      Start 33: arrow-compute-compare-test
33/51 Test #33: arrow-compute-compare-test ...........   Passed    7.96 sec
      Start 34: arrow-compute-take-test
34/51 Test #34: arrow-compute-take-test ..............   Passed    4.80 sec
      Start 35: arrow-compute-filter-test
35/51 Test #35: arrow-compute-filter-test ............   Passed    8.23 sec
      Start 36: arrow-dataset-dataset-test
36/51 Test #36: arrow-dataset-dataset-test ...........   Passed    0.25 sec
      Start 37: arrow-dataset-discovery-test
37/51 Test #37: arrow-dataset-discovery-test .........   Passed    0.13 sec
      Start 38: arrow-dataset-file-ipc-test
38/51 Test #38: arrow-dataset-file-ipc-test ..........   Passed    0.21 sec
      Start 39: arrow-dataset-file-test
39/51 Test #39: arrow-dataset-file-test ..............   Passed    0.12 sec
      Start 40: arrow-dataset-filter-test
40/51 Test #40: arrow-dataset-filter-test ............   Passed    0.16 sec
      Start 41: arrow-dataset-partition-test
41/51 Test #41: arrow-dataset-partition-test .........   Passed    0.13 sec
      Start 42: arrow-dataset-scanner-test
42/51 Test #42: arrow-dataset-scanner-test ...........   Passed    0.20 sec
      Start 43: arrow-filesystem-test
43/51 Test #43: arrow-filesystem-test ................   Passed    1.62 sec
      Start 44: arrow-hdfs-test
44/51 Test #44: arrow-hdfs-test ......................   Passed    0.13 sec
      Start 45: arrow-feather-test
45/51 Test #45: arrow-feather-test ...................   Passed    0.91 sec
      Start 46: arrow-ipc-read-write-test
46/51 Test #46: arrow-ipc-read-write-test ............   Passed    5.77 sec
      Start 47: arrow-ipc-json-simple-test
47/51 Test #47: arrow-ipc-json-simple-test ...........   Passed    0.16 sec
      Start 48: arrow-ipc-json-test
48/51 Test #48: arrow-ipc-json-test ..................   Passed    0.27 sec
      Start 49: arrow-json-integration-test
49/51 Test #49: arrow-json-integration-test ..........   Passed    0.13 sec
      Start 50: arrow-json-test
50/51 Test #50: arrow-json-test ......................   Passed    0.26 sec
      Start 51: arrow-orc-adapter-test
51/51 Test #51: arrow-orc-adapter-test ...............   Passed    1.92 sec

98% tests passed, 1 tests failed out of 51

Label Time Summary:
arrow-tests      =  27.38 sec (27 tests)
arrow_compute    =  45.11 sec (14 tests)
arrow_dataset    =   1.21 sec (7 tests)
arrow_ipc        =   6.20 sec (3 tests)
unittest         =  79.91 sec (51 tests)

Total Test time (real) =  79.99 sec

The following tests FAILED:
	 20 - arrow-utility-test (Failed)
Errors while running CTest
```

Closes apache#7142 from kiszk/ARROW-8754

Authored-by: Kazuaki Ishizaki <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>