This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Fix Compiled Parallel Aggregations #1211

Open
wants to merge 16 commits into master

Conversation

turingcompl33t
Contributor

@turingcompl33t turingcompl33t commented Sep 29, 2020

Closes #1099.

Parallel static aggregations fail in compiled mode. A simple procedure to reproduce follows:

CREATE TABLE test (a INT, b INT);
INSERT INTO test (a , b) VALUES (1, 2);
SELECT SUM(a), SUM(b) FROM test;

Executing the final query produces the following error messages:

[2020-09-17 18:31:34.843] [execution_logger] [error] ERROR IN MODULE:
Stored value type does not match pointer operand type!
  store %1* %16, %"class.terrier::execution::sql::IntegerSumAggregate"** %5
 %"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
  store %1* %22, %"class.terrier::execution::sql::IntegerSumAggregate"** %8
 %"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
  store %1* %5, %"class.terrier::execution::sql::IntegerSumAggregate"** %2
 %"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
  store %1* %7, %"class.terrier::execution::sql::IntegerSumAggregate"** %3
 %"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
  store %1* %31, %"class.terrier::execution::sql::IntegerSumAggregate"** %10
 %"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
  store %1* %35, %"class.terrier::execution::sql::IntegerSumAggregate"** %12
 %"class.terrier::execution::sql::IntegerSumAggregate"*

The majority of this error message is produced by the Verifier::visitStoreInst() function in Verifier.cpp within LLVM. As the message suggests, we generate IR with a type mismatch between the stored value and the location (address) at which it is stored. The (simplified) syntax for an LLVM IR store is store <ty> <value>, <ty>* <pointer>, and the LLVM type system (enforced in part by Verifier.cpp) requires that "the type of the <pointer> operand must be a pointer to the first class type of the <value> operand".
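
As a concrete illustration of that rule, the check can be approximated with a small helper against the LLVM C++ API. This is a sketch only, assuming pre-opaque-pointer (typed-pointer) IR; val and ptr stand for operands built elsewhere:

#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Value.h"
#include "llvm/Support/Casting.h"

// Sketch of the rule Verifier::visitStoreInst() enforces with typed pointers:
// the pointer operand must point to exactly the type of the value being stored.
// Our generated IR fails this because the value is a
// *struct{IntegerSumAggregate, IntegerSumAggregate} while the pointer is a
// **IntegerSumAggregate.
bool StoreWouldVerify(llvm::Value *val, llvm::Value *ptr) {
  auto *ptr_type = llvm::dyn_cast<llvm::PointerType>(ptr->getType());
  return ptr_type != nullptr &&
         ptr_type->getPointerElementType() == val->getType();
}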

The issue arises in the MergeAggregates function. We produce the following bytecode for MergeAggregates:

Function 0 <MergeAggregates>:
  Frame size 48 bytes (2 parameters, 6 locals)
    param     queryState:  offset=0       size=8       align=8       type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
    param  pipelineState:  offset=8       size=8       align=8       type=*struct{struct{IntegerSumAggregate,IntegerSumAggregate}}
    local           tmp1:  offset=16      size=8       align=8       type=*struct{IntegerSumAggregate,IntegerSumAggregate}
    local           tmp2:  offset=24      size=8       align=8       type=*struct{IntegerSumAggregate,IntegerSumAggregate}
    local           tmp3:  offset=32      size=8       align=8       type=*IntegerSumAggregate
    local           tmp4:  offset=40      size=8       align=8       type=*IntegerSumAggregate

  0x00000000    Lea                                             local=&tmp1  local=queryState  i32=8
  0x00000010    IntegerSumAggregateMerge                        local=tmp1  local=pipelineState
  0x0000001c    Lea                                             local=&tmp2  local=queryState  i32=8
  0x0000002c    Lea                                             local=&tmp3  local=tmp2  i32=16
  0x0000003c    Lea                                             local=&tmp4  local=pipelineState  i32=16
  0x0000004c    IntegerSumAggregateMerge                        local=tmp3  local=tmp4
  0x00000058    Return  

The error occurs at the final Lea bytecode: Lea &tmp4, pipelineState, 16. Within the LLVM engine, the Lea TPL instruction is implemented in terms of a store in LLVM IR.

We have that the pipelineState type looks like:

struct {
    struct {
        IntegerSumAggregate,
        IntegerSumAggregate
    }
}

The type of local tmp4, meanwhile, is simply *IntegerSumAggregate.

To construct the store instruction from the bytecode, we perform the following procedure (a sketch of the key computation follows the list):

  • Get the offset from the 2nd (0-indexed) operand to the Lea instruction (16)
  • Compute the element within pipelineState that contains this offset; pipelineState is a structure with only a single member (itself a structure) so the result of this computation is the sole member of pipelineState with type: struct {IntegerSumAggregate, IntegerSumAggregate}
  • Compute the index of this element within the top level structure (pipelineState); this calculation results in 0 as the index
  • Compute a pointer to the element at the index computed above; this results in a pointer with type *struct {IntegerSumAggregate, IntegerSumAggregate}
  • Create a store instruction that stores this computed pointer to &tmp4
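
To make step 2 concrete, below is a small self-contained model in plain C++ (not the LLVM API itself; sizes are assumed, with each IntegerSumAggregate taking 16 bytes so that the inner struct spans bytes [0, 32), consistent with the i32=16 offset in the bytecode) of the computation StructLayout::getElementContainingOffset() performs for pipelineState at offset 16:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// A struct layout modeled as a list of (element byte offset, element size) pairs.
struct Element { std::uint64_t offset; std::uint64_t size; };

// Return the index of the element whose byte range contains `offset`,
// mirroring what StructLayout::getElementContainingOffset() does.
std::size_t ElementContainingOffset(const std::vector<Element> &layout,
                                    std::uint64_t offset) {
  for (std::size_t i = 0; i < layout.size(); ++i) {
    if (offset >= layout[i].offset &&
        offset < layout[i].offset + layout[i].size) {
      return i;
    }
  }
  return layout.size();  // offset falls outside the struct
}

int main() {
  // pipelineState has a single member, the inner
  // struct{IntegerSumAggregate, IntegerSumAggregate}, assumed to span [0, 32).
  const std::vector<Element> pipeline_state = {{0, 32}};
  // Offset 16 (the second aggregate) still lands inside element 0, so the GEP
  // yields a pointer to the whole inner struct, not to the aggregate itself.
  std::cout << ElementContainingOffset(pipeline_state, 16) << "\n";  // prints 0
}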

The resulting store IR therefore looks like (in pseudo-IR):

store *struct{IntegerSumAggregate, IntegerSumAggregate} val, **IntegerSumAggregate ptr

Verifier::visitStoreInst() contains an assertion that verifies (essentially) the following type constraint:

typeof(*ptr) == typeof(val)

We fail this assertion because we attempt to store a pointer to the structure struct {IntegerSumAggregate, IntegerSumAggregate} to a location that expects a pointer to a lone IntegerSumAggregate.

@turingcompl33t added the in-progress and bug labels on Sep 29, 2020
@turingcompl33t
Contributor Author

turingcompl33t commented Jan 21, 2021

Some preliminary results from the benchmarks supported by OLTPBench. About to run through TPC-H benchmarks with the setup suggested by @thepinetree.

Benchmarks run on Ubuntu 20.04 server instances (c5.9xlarge) on AWS with the following specs:

  • 18 X 3404.27 MHz CPUs
  • L1 Data 32 KiB (x18)
  • L1 Instruction 32 KiB (x18)
  • L2 Unified 1024 KiB (x18)
  • L3 Unified 25344 KiB (x1)

Results for the benchmarks are averaged over 5 runs.

Average Bytecode Count

Benchmark    Master    Optimization Removed
TPC-C        124       141
Smallbank    58        67
TATP         197       227

Benchmarks (Interpreted)

Benchmark    Master     Optimization Removed
TPC-C        337 tps    337 tps
Smallbank    595 tps    590 tps
TATP         442 tps    426 tps

Benchmarks (Compiled)

Benchmark    Master     Optimization Removed
TPC-C        337 tps    327 tps
Smallbank    594 tps    588 tps
TATP         545 tps    537 tps

@turingcompl33t
Contributor Author

Results from running TPC-H benchmarks comparing the current state of master and this PR branch. I ran these benchmarks using the "custom" TPC-H setup found in the latest commit on @thepinetree's branch. Scale factor is 1. Each query is executed a minimum of 100 times in each benchmark. Runtimes are averaged over 10 runs; the standard deviation among individual measurements is included following the average runtime (in parentheses).

Benchmarks run on Ubuntu 20.04 server instances (c5.9xlarge) on AWS with the following specs:

  • 18 X 3404.27 MHz CPUs
  • L1 Data 32 KiB (x18)
  • L1 Instruction 32 KiB (x18)
  • L2 Unified 1024 KiB (x18)
  • L3 Unified 25344 KiB (x1)

Interpreted

Master           PR               Diff
77392ms (759)    77182ms (570)    -0.27%

Compiled

Master          PR              Diff
56000ms (42)    55961ms (87)    -0.07%

In both interpreted and compiled execution modes, we see a very slight decrease in the overall runtime of the benchmark. Obviously it makes no sense that adding additional bytecode to be interpreted / compiled makes the system run faster. I think the takeaway from these results is that the overhead of introducing the additional bytecode (rather, of not eliding the Lea in circumstances in which that is possible) has such a negligible impact on overall performance that the difference is lost in the noise of the system, despite the fact that Google Benchmark works hard to give us reproducible results and all of the benchmarks were run multiple times independently.

Interested to hear what others' takes may be. I also think it may be worthwhile at this point to mark this "ready for CI" to see whether perfguard uncovers anything different, although I suppose it has also been showing spurious changes in benchmark performance, so it is difficult to arrive at a definitive answer.

@turingcompl33t
Contributor Author

The other thing to consider here is the fact that, currently, we don't have any automatic testing mechanism to verify this functionality. For instance, we can now run queries like

CREATE TABLE test (a INT, b INT);
INSERT INTO test (a , b) VALUES (1, 2);
SELECT SUM(a), SUM(b) FROM test;

where we couldn't before, but this behavior is not verified by either the existing SQL tests or the unit tests, because neither form of testing is run in anything other than interpreted execution mode. We have issues open for both of these: #1223 for the former and #1165 for the latter.

From what I can tell from looking at the test infrastructure and the relevant implementation details in the DBMS itself, support for both of these will require some minor refactoring:

  • Currently, SQL tracefile tests are all run against a DBMS server instance that, while restarted between individual tracefile tests, is always restarted with the same set of server options (see e.g. here). We could refactor this to accept a set of configurations under which the DBMS should be run, or, perhaps more expediently, just add another for-loop around here to run the entirety of each tracefile test under whatever configurations we see fit.
  • For compiler unit tests we simply hard-code the execution mode to Interpret (here). As Prashanth points out in the associated issue, it shouldn't be difficult to refactor the structure of the tests to execute all of the existing tests in multiple execution modes (a rough sketch follows this list), but it does appear to require some changes within the execution engine itself to account for the fact that bytecode handlers are loaded from a file whose location may vary (also pointed out in the associated issue); a stopgap may be a post-build, pre-test step that copies the file into the test directory. There is also the question of whether compiler_test.cpp should be "broken up" in some way: at over 3,600 lines it is currently the largest unit test file, which makes it somewhat harder to navigate.
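
For what it's worth, a value-parameterized GoogleTest suite is one way the existing compiler tests could be fanned out across execution modes; the enum, fixture, and test body below are hypothetical placeholders rather than the actual noisepage types:

#include <gtest/gtest.h>

enum class ExecutionMode { Interpret, Compiled, Adaptive };  // assumed modes

class CompilerModeTest : public ::testing::TestWithParam<ExecutionMode> {};

TEST_P(CompilerModeTest, SimpleStaticAggregate) {
  const ExecutionMode mode = GetParam();
  // The real body would build the SUM(a), SUM(b) plan and execute it under
  // `mode`; this placeholder only demonstrates the parameterization.
  (void)mode;
  SUCCEED();
}

INSTANTIATE_TEST_SUITE_P(AllExecutionModes, CompilerModeTest,
                         ::testing::Values(ExecutionMode::Interpret,
                                           ExecutionMode::Compiled,
                                           ExecutionMode::Adaptive));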

@pmenon
Member

pmenon commented Jan 22, 2021

Nice analysis! Just a few points:

  1. Can you provide performance + avg. bytecode counts for TPC-H queries individually?
  2. Overhead in other parts of the system is most likely hiding any potential overhead incurred by this change - TPC-H SF1 should complete far faster than 56s. Can you hand-code a TPL program that stresses your modification and compare with master?
  3. If you want to add more testing, pull the codegen harness from upstream TPL here. It runs codegen tests in all permutations of parallelism and execution modes.

@turingcompl33t
Contributor Author

Thanks for the feedback @pmenon!

  1. Absolutely, should have done this in the first place, I'll run those and post them here.
  2. Perhaps my explanation above wasn't all that clear: these results aren't directly comparable to TPC-H at scale factor 1 because in this benchmark setup each query is executed (at least) 100 times - I'm not sure how many times each query is run in vanilla TPC-H - so the total runtime isn't useful apart from the relative comparison between master and the PR branch. That said, I can absolutely work up a TPL program that stresses this change, but I'm not entirely sure what we would hope to learn from it. My thinking is that I could write an adversarial test program that defines an arbitrarily deeply nested structure and performs a load from the 0-offset position; we would surely see the impact of removing the optimization (one additional bytecode for each level of nesting), but what do we learn from this other than that it is possible to construct adversarial input? I would think the more relevant question is: "how often does this type of structure arise naturally in queries that we care about?", which is approximately what the benchmarks above attempt to determine. If I'm misinterpreting your request or missing its utility, let me know! Either way, I'll work on the relevant TPL.
  3. Will do.

@turingcompl33t
Contributor Author

Bytecode counts for individual TPC-H queries. Need to qualify these results slightly because they are computed based on the query plans that @malin1993ml manually constructed (see this branch for reference).

Query Number    Master Bytecode Count    PR Bytecode Count    Percentage Increase
1               837                      931                  +11%
4               912                      1036                 +13%
5               1790                     2032                 +13%
6               316                      355                  +12%
7               1814                     2032                 +12%
11              1749                     1988                 +13%
16              1145                     1268                 +10%
18              1643                     1863                 +13%
19              677                      748                  +10%

The percentage increase in the number of bytecode instructions is relatively consistent across the individual queries (or at least, more consistent than I anticipated). Furthermore, the increases are larger in magnitude than I expected, at least based on the previous runtime benchmark results - a ~10% increase in the number of bytecode instructions seems less trivial than I had been making it out to be. This is likely the type of result that you may have been expecting to see, @pmenon, or at least the kind that you'd want to be aware of before anything gets merged.

I am not super familiar with the internals of the interpreter, so I can't say with any authority how worrying a 10% increase in bytecode instructions is in that context. However, I can definitely say that such an increase is worrying in the context of machine code. Each of these additional bytecodes corresponds to an additional Lea in the LLVM IR; there is of course the question of whether LLVM might be able to optimize any of these additional instructions away (do we run any optimization passes in non-adaptive compiled mode?), so this might not translate directly to a 10% increase in machine instructions, but it is still reason enough to warrant a second look.

Is removing this optimization prohibitively expensive? Should I revisit the solution? The alternative that comes to mind is somehow manually "collapsing" the inconsistencies at the LLVM type-system level during bytecode generation, but I haven't actually pursued this yet to see how feasible it would be.

@turingcompl33t
Contributor Author

turingcompl33t commented Feb 12, 2021

Proposing a new approach to solving this problem. While the above does constitute a fix, it does not actually address the root cause. Namely, it is not the fact that we are eliding Leas that is causing the issue, but rather the fact that the Leas we do have do not function properly when confronted with nested structures, which happen to arise in parallel aggregations (but it is interesting that we don't see this phenomenon arise in other scenarios?). An example follows.

Per usual, the setup queries:

CREATE TABLE test (a INT, b INT);
INSERT INTO test (a, b) VALUES (0, 1), (1, 2);

And the query of interest:

SELECT SUM(a), SUM(b) FROM test;

For this query, we generate the following bytecode for MergeAggregates:

Function 0 <MergeAggregates>:
  Frame size 48 bytes (2 parameters, 6 locals)
    param     queryState:  offset=0       size=8       align=8       type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
    param  pipelineState:  offset=8       size=8       align=8       type=*struct{struct{IntegerSumAggregate,IntegerSumAggregate}}
    local           tmp1:  offset=16      size=8       align=8       type=*struct{IntegerSumAggregate,IntegerSumAggregate}
    local           tmp2:  offset=24      size=8       align=8       type=*struct{IntegerSumAggregate,IntegerSumAggregate}
    local           tmp3:  offset=32      size=8       align=8       type=*IntegerSumAggregate
    local           tmp4:  offset=40      size=8       align=8       type=*IntegerSumAggregate

  0x00000000    Lea                                             local=&tmp1  local=queryState  i32=8
  0x00000010    IntegerSumAggregateMerge                        local=tmp1  local=pipelineState
  0x0000001c    Lea                                             local=&tmp2  local=queryState  i32=8
  0x0000002c    Lea                                             local=&tmp3  local=tmp2  i32=16
  0x0000003c    Lea                                             local=&tmp4  local=pipelineState  i32=16
  0x0000004c    IntegerSumAggregateMerge                        local=tmp3  local=tmp4
  0x00000058    Return 

So far, so good. As we have seen before, it is the final Lea instruction that leads to LLVM type system errors upon translation to IR, but it is more than this - it also computes the incorrect result. To see why, we need to look at the translation of the Lea bytecode to its corresponding IR representation (a store instruction) in LLVMEngine::DefineFunction(). The relevant part of the function is quite concise (I've simplified it a little bit, but not much):

// The type pointed to by the Lea's source operand (e.g. pipelineState).
llvm::Type *pointee_type = args[1]->getType()->getPointerElementType();
// The constant byte offset encoded in the Lea instruction.
int64_t offset = llvm::cast<llvm::ConstantInt>(args[2])->getSExtValue();
if (auto struct_type = llvm::dyn_cast<llvm::StructType>(pointee_type)) {
  // Find the element of the (outermost) struct whose byte range contains the offset ...
  const uint32_t elem_index = dl.getStructLayout(struct_type)->getElementContainingOffset(offset);
  // ... compute a pointer to that element, and store the pointer into the destination local.
  llvm::Value *addr = ir_builder->CreateStructGEP(args[1], elem_index);
  ir_builder->CreateStore(addr, args[0]);
}

So now apply this to the final Lea from the bytecode above. We are attempting to compute the address of the second IntegerSumAggregate member of the pipelineState local, where pipelineState looks like:

struct {
  struct {
    IntegerSumAggregate,
    IntegerSumAggregate
  }
}

The value of offset is 16 (an argument to the instruction). The index we compute at elem_index = ... is 0, though, and this is where the problem arises: we simply compute a pointer to index 0 within the outermost structure of pipelineState, which (as far as the subsequent IntegerSumAggregateMerge call is concerned) constitutes a pointer to the first IntegerSumAggregate member rather than the second.

So, I argue, we actually have a logic bug here in the translation of the Lea bytecode to the corresponding LLVM IR store.

Accordingly, the fix should occur here at the level of bytecode -> IR translation, rather than at the level of bytecode generation (as I was previously pursuing). I wrote a proof-of-concept fix that rewrites the logic here to account for nested structures. One complication that arises, however, is the fact that some of the types that are non-structural at the level of bytecode are nonetheless represented as structural types in the LLVM IR (e.g. SQL Integer types). This makes the code to resolve nested structures slightly more complex because we lose some of the needed type information when we go from AST types to LLVM types. To account for this, I added AST type information for locals to the FunctionLocalsMap that we instantiate when defining each function. Of course, there may be a more elegant solution available that I have not yet considered.
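
For reference, the proof-of-concept roughly takes the following shape (a sketch only, not the actual patch; dl, ir_builder, and args are as in the excerpt above, ExpectedPointeeType() is a hypothetical stand-in for the AST-type lookup through the FunctionLocalsMap, and error handling is omitted):

// Drill down through nested struct types, consuming the Lea's byte offset,
// until we reach a pointer whose type matches what the destination local
// (args[0]) expects. This collapses the nesting that a single StructGEP
// cannot express.
llvm::Value *addr = args[1];
llvm::Type *pointee = addr->getType()->getPointerElementType();
int64_t remaining = llvm::cast<llvm::ConstantInt>(args[2])->getSExtValue();

while (auto *struct_type = llvm::dyn_cast<llvm::StructType>(pointee)) {
  const llvm::StructLayout *layout = dl.getStructLayout(struct_type);
  const uint32_t elem_index = layout->getElementContainingOffset(remaining);
  addr = ir_builder->CreateStructGEP(struct_type, addr, elem_index);
  remaining -= layout->getElementOffset(elem_index);
  pointee = struct_type->getElementType(elem_index);
  // Stop once the offset is consumed and the element type is the one the
  // destination local expects (e.g. IntegerSumAggregate for tmp4).
  if (remaining == 0 && pointee == ExpectedPointeeType(args[0])) break;
}
ir_builder->CreateStore(addr, args[0]);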

However it is implemented, this new approach has a number of benefits over the old one:

  • Doesn't impact interpreted query execution at all (generated bytecode is not modified)
  • Doesn't impact the execution time of the compiled query (the exact same IR will be generated); only the time required for compilation itself is affected, and this difference should be negligible because the number of iterations required to resolve the desired address for the store instruction scales linearly with the nesting depth of structures in the TPL bytecode

@apavlo apavlo requested a review from pmenon February 12, 2021 21:56
@pmenon
Member

pmenon commented Feb 18, 2021

So, I argue, we actually have a logic bug here in the translation of the Lea bytecode to the corresponding LLVM IR store.

I disagree with this. Lea is meant to be a simple (potentially scaled) pointer adjustment. It is expected that the higher levels perform the nested drill-down using multiple Leas. The optimization during bytecode generation broke this contract - the logic during LLVM codegen is correct.

Two things:

  1. Did you collect the numbers from a smaller targeted benchmark to ascertain the potential overhead from the first (technically correct) solution? I'm curious as to what they are.
  2. While it's possible to retain the bytecode optimization by tweaking LLVM code generation, we'd need to be able to navigate down arbitrary nestings that mix real TPL structs and arrays, and "synthetic" structs like sql::Integer. If you have a branch I can take a look at, we can work on it.
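
To illustrate the contract described above in ordinary C++ terms, the two chained Leas from the bytecode dump amount to nothing more than two plain byte-offset adjustments (the struct definitions below are hypothetical stand-ins that mirror the layout in the frame dump, not the real noisepage types):

#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins mirroring the pipelineState layout above.
struct IntegerSumAggregate { std::int64_t sum; std::int64_t is_null; };
struct Inner { IntegerSumAggregate a; IntegerSumAggregate b; };
struct PipelineState { Inner inner; };

// Each Lea is just "base + constant byte offset"; reaching a nested member
// means chaining two adjustments, not teaching one Lea about nesting.
IntegerSumAggregate *SecondAggregate(PipelineState *pipeline_state) {
  auto *base = reinterpret_cast<std::byte *>(pipeline_state);
  std::byte *inner = base + offsetof(PipelineState, inner);  // first Lea (offset 0)
  std::byte *second = inner + offsetof(Inner, b);            // second Lea (offset 16)
  return reinterpret_cast<IntegerSumAggregate *>(second);
}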

@turingcompl33t
Contributor Author

turingcompl33t commented Mar 19, 2021

Came back to this PR today in light of the fact that it is blocking a number of other PRs, namely #1470 and #1402. I sat down and fired up the usual query to break the DBMS and dump the bytecode so I could look at it:

CREATE TABLE test (a INT, b INT);
INSERT INTO test (a, b) VALUES (1, 2);
SELECT SUM(a), SUM(b) FROM test;

Imagine my surprise then when the final query returned the correct results...

 ?column? | ?column? 
----------+----------
        1 |        2
(1 row)

So someone beat me to the punch (not that it was difficult - this PR has been open since September 2020...); the question was: who, and how?

Further investigation reveals that it was none other than @lmwnshn and his PR #1482. Now, hats off to Wan for his PR and the fact that it gets these queries to run successfully, but unfortunately it really circumvents the issue, rather than addressing it directly.

Prior to Wan's PR, we generate the following bytecode for MergeAggregates for the query above:

Function 0 <MergeAggregates>:
  Frame size 48 bytes (2 parameters, 6 locals)
    param     queryState:  offset=0       size=8       align=8       type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
    param  pipelineState:  offset=8       size=8       align=8       type=*struct{struct{IntegerSumAggregate,IntegerSumAggregate}}
    local           tmp1:  offset=16      size=8       align=8       type=*struct{IntegerSumAggregate,IntegerSumAggregate}
    local           tmp2:  offset=24      size=8       align=8       type=*struct{IntegerSumAggregate,IntegerSumAggregate}
    local           tmp3:  offset=32      size=8       align=8       type=*IntegerSumAggregate
    local           tmp4:  offset=40      size=8       align=8       type=*IntegerSumAggregate

  0x00000000    Lea                                             local=&tmp1  local=queryState  i32=8
  0x00000010    IntegerSumAggregateMerge                        local=tmp1  local=pipelineState
  0x0000001c    Lea                                             local=&tmp2  local=queryState  i32=8
  0x0000002c    Lea                                             local=&tmp3  local=tmp2  i32=16
  0x0000003c    Lea                                             local=&tmp4  local=pipelineState  i32=16
  0x0000004c    IntegerSumAggregateMerge                        local=tmp3  local=tmp4
  0x00000058    Return      

And after Wan's PR, this bytecode becomes:

Function 0 <MergeAggregates>:
  Frame size 64 bytes (2 parameters, 8 locals)
    param     queryState:  offset=0       size=8       align=8       type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
    param  pipelineState:  offset=8       size=8       align=8       type=*struct{TableVectorIterator,bool,[7]int8,struct{IntegerSumAggregate,IntegerSumAggregate}}
    local           tmp1:  offset=16      size=8       align=8       type=*struct{IntegerSumAggregate,IntegerSumAggregate}
    local           tmp2:  offset=24      size=8       align=8       type=*struct{IntegerSumAggregate,IntegerSumAggregate}
    local           tmp3:  offset=32      size=8       align=8       type=*struct{IntegerSumAggregate,IntegerSumAggregate}
    local           tmp4:  offset=40      size=8       align=8       type=*IntegerSumAggregate
    local           tmp5:  offset=48      size=8       align=8       type=*struct{IntegerSumAggregate,IntegerSumAggregate}
    local           tmp6:  offset=56      size=8       align=8       type=*IntegerSumAggregate

  0x00000000    Lea                                             local=&tmp1  local=queryState  i32=8
  0x00000010    Lea                                             local=&tmp2  local=pipelineState  i32=4312
  0x00000020    IntegerSumAggregateMerge                        local=tmp1  local=tmp2
  0x0000002c    Lea                                             local=&tmp3  local=queryState  i32=8
  0x0000003c    Lea                                             local=&tmp4  local=tmp3  i32=16
  0x0000004c    Lea                                             local=&tmp5  local=pipelineState  i32=4312
  0x0000005c    Lea                                             local=&tmp6  local=tmp5  i32=16
  0x0000006c    IntegerSumAggregateMerge                        local=tmp4  local=tmp6
  0x00000078    Return     

The salient point here is the change in how we compute the address of the second IntegerSumAggregate in the pipeline state. In brief, prior to Wan's PR we elided the fourth Lea in the second group of Lea instructions when computing this address, because the member we are loading is located at offset 0 from the base of the structure; eliding that Lea led to the incorrect result and to the type system error at the LLVM layer. With Wan's PR, however, the TableVectorIterator is stashed in the pipelineState structure, and importantly it is located before the nested structure containing the two IntegerSumAggregates. Because of this, the nested structure is no longer located at offset 0 from the start of the structure in which it is embedded, the logic in codegen (VisitMemberExpr) cannot elide the Lea, and everything works as intended.
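
For clarity, the elision rule at play can be modeled with a small self-contained function (hypothetical names, not the actual noisepage codegen API): a member access emits a Lea only when its byte offset is non-zero, and at offset 0 the base pointer is reused, which is exactly the case the LLVM translation mishandles for nested structures.

#include <cstdint>
#include <vector>

// Model of the offset-0 elision: when a member sits at the start of its
// enclosing struct, no Lea is emitted and the base local is reused. With
// Wan's layout the aggregates sit at offset 4312, so the Lea is emitted.
struct Lea { int dest; int base; std::uint32_t offset; };

int EmitMemberAddress(int base_local, std::uint32_t offset,
                      std::vector<Lea> *bytecode, int *next_local) {
  if (offset == 0) {
    return base_local;               // elided: the member aliases its base
  }
  const int dest = (*next_local)++;  // allocate a fresh temporary local
  bytecode->push_back(Lea{dest, base_local, offset});
  return dest;
}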

Upside: we can now execute compiled parallel aggregations successfully, but there is likely still a logic bug (or oversight) in code generation that we don't hit under any of the current tests.

@lmwnshn
Contributor

lmwnshn commented Mar 19, 2021

at times like this, github needs a 💩 react

@turingcompl33t
Contributor Author

turingcompl33t commented Mar 19, 2021

Per the discussion in standup, although the updates by @lmwnshn fixed the bug as we currently experience it, I am still going to look at addressing the root cause. High-level plan now is:

  • Hand code some TPL that triggers the bug
  • Fix the bug
  • Profit

@turingcompl33t turingcompl33t removed the request for review from pmenon August 23, 2021 13:24

Successfully merging this pull request may close the following issue: Parallel aggregations return the wrong results in compiled mode.