
tool: lsm visualization tool #508

Merged · 1 commit into master · Jan 28, 2020

Conversation


@petermattis (Collaborator) commented on Jan 17, 2020

Add a new lsm command which takes a MANIFEST as input and outputs a
d3-powered visualization of the evolution of the LSM. Each version edit
in the MANIFEST becomes a single step in the visualization. The 7
levels of the LSM are visually depicted with each sstable represented as
a 1-pixel wide rectangle. The height of the rectangle is proportional to
the size in bytes of the sstable. The sstables are displayed in the same
order as they occur in the LSM. Note that the sstables from different
levels are NOT aligned according to their start and end keys (doing so
is also interesting, but it works against using the area of the
rectangle to indicate size).

@petermattis (Collaborator, author)

This change is Reviewable

@petermattis (Collaborator, author)

I really need to comment lsm.js better (heck, this entire PR needs more comments), though I'm not sure how much comments can help with understanding d3 if you haven't been exposed to it before. I have a number of enhancements I'd like to make, but this is already useful and worth merging.

@petermattis force-pushed the pmattis/lsm-visualize branch 6 times, most recently from 107e3e8 to 49031b2 on January 21, 2020 15:43

@itsbilal (Member) left a comment


Mostly just looked at the JavaScript for now, and have some comments that should make the code simpler. Overall looks good. For a lot of the calculations and SVG hackery, I'm taking an "if it works, don't complain" approach.

Will look at the Go parts soon.

Reviewed 1 of 8 files at r1.
Reviewable status: 1 of 8 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @petermattis, and @sumeerbhola)


tool/data/lsm.js, line 1 at r1 (raw file):

// TODO(peter):

Starting this file with:
"use strict";

should enable JS strict mode. Gives you nice things like throwing an error if you use an undeclared variable, so it's probably a good idea to enable it.
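For example, a minimal sketch (not from the PR itself) of what strict mode catches:

```javascript
// The "use strict" directive applies per-script or per-function. Under
// strict mode, assigning to an undeclared name throws a ReferenceError
// instead of silently creating a global variable.
function strictAssign() {
  "use strict";
  misspelledVariable = 1; // never declared anywhere: ReferenceError
}

let caught = "";
try {
  strictAssign();
} catch (e) {
  caught = e.name;
}
console.log(caught); // "ReferenceError"
```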


tool/data/lsm.js, line 18 at r1 (raw file):

var levelWidth = 0;

{

JS variable scoping can be weird. The traditional, older way to prevent variables from leaking from a block is to do this:

(function(){
    <code>
})()

The way it's written right now, c and h become global variables since they're declared with var, which hoists them to the enclosing function scope, or to global scope if there's no enclosing function. You could get block scoping with s/var/let, which works on almost all browsers these days, or you could enclose this in a function block.
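A quick stand-alone demonstration of the difference:

```javascript
{
  var leaky = "still visible";   // var hoists to the enclosing function
                                 // (or global) scope, escaping the block
  let contained = "block only";  // let is scoped to the block itself
}
console.log(typeof leaky);     // "string"
console.log(typeof contained); // "undefined"
```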


tool/data/lsm.js, line 67 at r1 (raw file):

function styleWidth(e) {
    var width = +e.style("width").slice(0, -2);

If I'm reading this right, you're taking an array of widths corresponding to (potentially multiple) elements passed in and then converting it to a number with the + prefix somehow? It might be better to just use the [] operator to refer to a value instead of slicing it down to one value, or use more specific selectors when calling this function.
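For reference, a stand-alone sketch of the idiom in question (styleString stands in for the d3 `e.style("width")` call; this is not the PR's actual code):

```javascript
// d3's selection.style("width") returns a CSS string such as "640px".
// slice(0, -2) drops the trailing "px" and the unary + coerces the
// remaining digits to a number; it is string-to-number conversion on a
// single value, not an array operation.
const styleString = "640px"; // stand-in for e.style("width")
const width = +styleString.slice(0, -2);
console.log(width, typeof width); // 640 "number"
console.log(width / 2);           // 320, numeric arithmetic works
```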


tool/data/lsm.js, line 101 at r1 (raw file):

        for (; this.index < index; this.index++) {
            var edit = data.Edits[this.index + 1];
            for (var level in edit.Deleted) {

Use for..of (newer approach) or edit.Deleted.forEach(function(i){...}) (older approach) for iterating through arrays. for..in is used to iterate between properties in an object, and while arrays are also objects and this works, I wouldn't be surprised if methods started popping up in level instead of values.
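A quick contrast of the two loop forms (the file names are made up):

```javascript
const files = ["000012.sst", "000017.sst"];

// for..in visits enumerable property keys; for an array these are the
// string indices "0", "1", ... plus anything patched onto the prototype.
for (const key in files) {
  console.log(typeof key, key); // "string" "0", then "string" "1"
}

// for..of visits the values of an iterable, which is usually what you
// actually want when walking an array.
for (const value of files) {
  console.log(value); // "000012.sst", then "000017.sst"
}
```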


tool/data/lsm.js, line 104 at r1 (raw file):

                this.remove(level, edit.Deleted[level]);
            }
            for (var level in edit.Added) {

Same as before


tool/data/lsm.js, line 172 at r1 (raw file):

            if (fileNums.indexOf(l[i]) != -1) {
                l[i] = l[l.length - 1];
                l.pop();

Nit: You could just do l.splice(i,1). Has the added benefit of not messing up sorts.
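A side-by-side sketch of the two removal idioms being discussed:

```javascript
// splice removes in place and preserves order, shifting later elements
// down by one.
const ordered = [10, 20, 30, 40];
ordered.splice(1, 1);
console.log(ordered); // [ 10, 30, 40 ]

// Swap-and-pop avoids the shift (O(1)) at the cost of reordering, which
// is fine if the array is re-sorted afterwards anyway.
const swapped = [10, 20, 30, 40];
swapped[1] = swapped[swapped.length - 1];
swapped.pop();
console.log(swapped); // [ 10, 40, 30 ]
```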


tool/data/lsm.js, line 307 at r1 (raw file):

            // TODO(peter): Why do closures capture the wrong variable
            // if this is declared as "var indicators =". It is as-if
            // there is only a single variable defined outside of the

Answer to the TODO: use let instead of var.
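The classic demonstration of why:

```javascript
// With var there is a single i shared by the whole loop, so every
// closure reads its final value.
const fromVar = [];
for (var i = 0; i < 3; i++) {
  fromVar.push(function() { return i; });
}
console.log(fromVar.map(f => f())); // [ 3, 3, 3 ]

// With let each iteration gets a fresh binding, so each closure
// captures the value from its own pass.
const fromLet = [];
for (let j = 0; j < 3; j++) {
  fromLet.push(function() { return j; });
}
console.log(fromLet.map(f => f())); // [ 0, 1, 2 ]
```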


tool/data/lsm.js, line 320 at r1 (raw file):

                .attr("opacity", 0)
                .attr("pointer-events", "all")
                .on("mousemove", function(i) {

This function seems to not depend on any variables inside this loop, and it's a pretty large event handler. Maybe define it in the object directly and just name it here? It'll reduce the levels of indentation required

You'd have to do some this-capturing since you have two levels of this: first, version is the outer-this referring to the outer object, and inside the handler, this refers to the element where this event occurred. You could probably do a function generator:

var version = {
...
   onClipMouseMove: function(i) {
       var version = this;
       var level = version.levels[i];
       return function() {
            // Here, version = outer this, and this = inner this = the rect
       }
    },
...
}

And in here, you could just do .on("mousemove", version.onClipMouseMove(i))


@petermattis (Collaborator, author) left a comment


TFTR!

I made some simplifications based on my newfound var/let knowledge.

Reviewable status: 1 of 8 files reviewed, 7 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


tool/data/lsm.js, line 1 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Starting this file with:
"use strict";

should enable JS strict mode. Gives you nice things like throwing an error if you use an undeclared variable, so it's probably a good idea to enable it.

Done.


tool/data/lsm.js, line 18 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

JS variable scoping can be weird. The traditional, older way to prevent variables from leaking from a block is to do this:

(function(){
    <code>
})()

The way it's written right now, c and h become global variables since they're declared with var, which hoists them to the enclosing function scope, or to global scope if there's no enclosing function. You could get block scoping with s/var/let, which works on almost all browsers these days, or you could enclose this in a function block.

TIL. I was super confused with the scoping behavior of var. I've switched the usage here to let.


tool/data/lsm.js, line 67 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

If I'm reading this right, you're taking an array of widths corresponding to (potentially multiple) elements passed in and then converting it to a number with the + prefix somehow? It might be better to just use the [] operator to refer to a value instead of slicing it down to one value, or use more specific selectors when calling this function.

I stole this code from the internets. I think it was written by Mike Bostock himself. I'd prefer to leave as-is.


tool/data/lsm.js, line 101 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Use for..of (newer approach) or edit.Deleted.forEach(function(i){...}) (older approach) for iterating through arrays. for..in is used to iterate between properties in an object, and while arrays are also objects and this works, I wouldn't be surprised if methods started popping up in level instead of values.

I thought for..in iterated over the keys, while for..of iterated over the values. I need the key for this loop. Is there a way to iterate over key+value?


tool/data/lsm.js, line 104 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Same as before

Ditto.


tool/data/lsm.js, line 172 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Nit: You could just do l.splice(i,1). Has the added benefit of not messing up sorts.

That involves shifting a ton of elements, though. And the later sort is required for correctness regardless.


tool/data/lsm.js, line 307 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Answer to todo: use let instead of var.

Ah-ha! I could not find the answer on the internets this weekend. Function scope is confusing. Why would they do that?

Should I primarily be using let instead of var? Is there ever a reason to use var?

This comment wasn't accurate anymore as I actually need the indicator array now.


tool/data/lsm.js, line 320 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This function seems to not depend on any variables inside this loop, and it's a pretty large event handler. Maybe define it in the object directly and just name it here? It'll reduce the levels of indentation required

You'd have to do some this-capturing since you have two levels of this: first, version is the outer-this referring to the outer object, and inside the handler, this refers to the element where this event occurred. You could probably do a function generator:

var version = {
...
   onClipMouseMove: function(i) {
       var version = this;
       var level = version.levels[i];
       return function() {
            // Here, version = outer this, and this = inner this = the rect
       }
    },
...
}

And in here, you could just do .on("mousemove", version.onClipMouseMove(i))

Good idea, though I took a slightly different approach and just reference a single version.onMouseMove function instead of returning a closure as I didn't need the this = the rect functionality.

@petermattis force-pushed the pmattis/lsm-visualize branch 4 times, most recently from 551d619 to 441745d on January 24, 2020 14:33

@itsbilal (Member) left a comment


LGTM save for some minor comments

Reviewed 5 of 8 files at r1, 1 of 1 files at r2.
Reviewable status: 6 of 8 files reviewed, 4 unresolved discussions (waiting on @petermattis and @sumeerbhola)


tool/lsm.go, line 88 at r2 (raw file):

	l.Root.Flags().Var(&l.fmtKey, "key", "key formatter")
	l.Root.Flags().BoolVar(&l.embed, "embed", true, "embed javascript in HTML (disable for development)")

Is the purpose of this just so we can ship one HTML file over Slack or something and have the other side be able to view the visualization?


tool/data/lsm.css, line 69 at r2 (raw file):

}

.index {

Both .index and #index are being used. May as well put everything under the ID and get rid of the class? I only see one element using this class, and that one also has id index, and since IDs are more specific than classes, the ID properties will take precedence. In this case, that means width: 50px;.


tool/data/lsm.js, line 67 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I stole this code from the internets. I think it was written by Mike Bostock himself. I'd prefer to leave as-is.

Sounds good.


tool/data/lsm.js, line 101 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I thought for..in iterated over the keys, while for..of iterated over the values. I need the key for this loop. Is there a way to iterate over key+value?

Oh, this is an object, not an array. My bad. The numerical keys confused me.

for..of iterates over values of an iterable. An array is an iterable, an object is not. You can use Object.keys(obj) to make an iterable array out of an object's keys, and Object.entries(obj) to make an iterable of [key, value] pairs, which you can use like this:

for (let [key, value] of Object.entries(edit.Deleted)) {
 ...
}

The destructuring assignment seems well supported enough: https://caniuse.com/#feat=mdn-javascript_operators_destructuring

With for..in, it's generally a good idea to enclose the loop body with an if (edit.Deleted.hasOwnProperty(level)) { ... } (most JS linters require it), in case you start looping through prototype methods by mistake. I'm fine with any of the above, or even with leaving it as-is if it works. I just thought you were for..in-ing through an array which would have thrown member functions of Array into the loop.
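Putting the Object.entries variant together (the edit object here is a hypothetical stand-in shaped roughly like the MANIFEST data, with level numbers as keys and file-number arrays as values):

```javascript
// Hypothetical edit object, not the PR's real data.
const edit = { Deleted: { 0: [12, 17], 6: [42] } };

for (const [level, fileNums] of Object.entries(edit.Deleted)) {
  // Note: object keys always come back as strings, so level is "0"/"6".
  console.log(level, fileNums);
}
```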


tool/data/lsm.js, line 172 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

That involves shifting a ton of elements, though. And the later sort is required for correctness regardless.

Sounds good then.


tool/data/lsm.js, line 307 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ah-ha! I could not find the answer on the internets this weekend. Function scope is confusing. Why would they do that?

Should I primarily be using let instead of var? Is there ever a reason to use var?

This comment wasn't accurate anymore as I actually need the indicator array now.

I'm pretty sure entire books can be written about JS and "why would they do that" 😂

There isn't a reason to use var these days, but let wasn't really widely supported until 2017 so many resources online still refer to var: https://caniuse.com/#feat=let


tool/data/lsm.js, line 320 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Good idea, though I took a slightly different approach and just reference a single version.onMouseMove function instead of returning a closure as I didn't need the this = the rect functionality.

Ah I see. That makes sense.

So the i here is coming from the rect's event call, but happens to match the level number?


@petermattis (Collaborator, author) left a comment


Reviewable status: 6 of 8 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


tool/lsm.go, line 88 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Is the purpose of this just so we can ship one HTML file over Slack or something and have the other side be able to view the visualization?

Yes. The embedded CSS and JavaScript are small enough in comparison to the data that this convenience is really nice. The reason for --embed=false is to simplify the JavaScript/CSS development workflow: you can just reload the HTML.


tool/data/lsm.css, line 69 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Both .index and #index are being used. May as well put everything under the ID and get rid of the class? I only see one element using this class, and that one also has id index, and since IDs are more specific than classes, the ID properties will take precedence. In this case, that means width: 50px;.

Heh, this was confusing. .index is the class of the parent div for #index (the input). I've switched to using #index-container for the parent div to make this clearer.


tool/data/lsm.js, line 101 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Oh, this is an object, not an array. My bad. The numerical keys confused me.

for..of iterates over values of an iterable. An array is an iterable, an object is not. You can use Object.keys(obj) to make an iterable array out of an object's keys, and Object.entries(obj) to make an iterable of [key, value] pairs, which you can use like this:

for (let [key, value] of Object.entries(edit.Deleted)) {
 ...
}

The destructuring assignment seems well supported enough: https://caniuse.com/#feat=mdn-javascript_operators_destructuring

With for..in, it's generally a good idea to enclose the loop body with an if (edit.Deleted.hasOwnProperty(level)) { ... } (most JS linters require it), in case you start looping through prototype methods by mistake. I'm fine with any of the above, or even with leaving it as-is if it works. I just thought you were for..in-ing through an array which would have thrown member functions of Array into the loop.

I'm going to leave as-is because it works, but really appreciate the explanation of the alternatives. Feel free to follow-up with cleanup PRs to further continue my javascript (re-)education.


tool/data/lsm.js, line 320 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Ah I see. That makes sense.

So the i here is coming from the rect's event call, but happens to match the level number?

The i here is coming from the data binding done in render. See the call to .data([i]) in the for (let i in this.levels) loop. I think looping over the levels like that is a bit of a wart. The real d3 way would be to bind this.levels, but I couldn't figure out how to make that work. I'm sure it is possible, just slightly beyond my d3/javascript abilities.

@petermattis merged commit b8d57b9 into master on Jan 28, 2020
@petermattis deleted the pmattis/lsm-visualize branch on January 28, 2020 14:55
anish-shanbhag added a commit to anish-shanbhag/pebble that referenced this pull request Jul 15, 2024
Refactors `manifest.Annotator` to use generics and a simplified API.
This eliminates the need to perform pointer manipulation and unsafe
typecasting when defining a new Annotator.

This change also adds the concept of a "range annotation", which is
a computation that aggregates some value over a specific key range
within a level. Level-wide annotations are now computed internally as a
range annotation with a key range spanning the whole level. Range annotations
use the same B-tree caching behavior as regular annotations, so queries
remain fast even with thousands of tables because they avoid a sequential
iteration over a level's files.

The goal of this change is to expand and improve the Annotator interface
while not changing any existing behavior. However, there are a number of
potential use cases for range annotations which could be added next:
- Calculating the number of keys shadowed by a tombstone-dense key range,
  for use in the heuristic proposed at cockroachdb#3719
- Computing the total file size that a read compaction overlaps with,
  which is used to prevent read compactions that are too wide
  [here](https://github.com/cockroachdb/pebble/blob/9a4ea4dfc5a8129937e3fdc811ea87543d88565b/compaction_picker.go#L1930)
- Estimating disk usage for a key range without having to iterate over files, which is done [here](https://github.com/jbowens/pebble/blob/master/db.go#L2249)
- Calculating average value size and compression ratio for a key range,
  which we [currently use when estimating the potential space that compacting
  point tombstones would reclaim](https://github.com/jbowens/pebble/blob/646c6bab1af3c72dc7db59a0dcc38b5955fc15cc/table_stats.go#L350).
  Range annotations could also be used to implement the TODO from @jbowens.
- Estimating the reclaimed space from compacting range deletions, for which
  we also [currently use sequential iteration](https://github.com/jbowens/pebble/blob/master/table_stats.go#L557).
- Because annotations are in-memory, if we can find a way to refactor those
  last two without using I/O at all, then this would eliminate the need to
  defer table stats collection to a separate goroutine for newly written tables.
- Expand/rewrite the LSM visualizer tool (cockroachdb#508) to show overlapping ranges, as recommended
  in cockroachdb#1598. Range annotations would allow us to efficiently compute statistics
  including the # of sstables, # of keys, etc. in chunks of the keyspace and
  visualize this on a graph showing overlapping ranges from each level.

`BenchmarkNumFilesAnnotator` shows a slight speedup over master when compared
to the equivalent implementation of `orderStatistic`:
```
                     │     old     │                new                 │
                     │   sec/op    │   sec/op     vs base               │
NumFilesAnnotator-10   1.953µ ± 1%   1.618µ ± 3%  -17.15% (p=0.002 n=6)

                     │    old     │               new                │
                     │    B/op    │    B/op     vs base              │
NumFilesAnnotator-10   536.0 ± 0%   544.0 ± 0%  +1.49% (p=0.002 n=6)

                     │    old     │                new                │
                     │ allocs/op  │ allocs/op   vs base               │
NumFilesAnnotator-10   7.000 ± 0%   8.000 ± 0%  +14.29% (p=0.002 n=6)
```

`BenchmarkNumFilesRangeAnnotation` shows that range annotations remain fast
for arbitrary length ranges:
```
BenchmarkNumFilesRangeAnnotation-10    	  460471	      2191 ns/op	     944 B/op	       7 allocs/op
```
anish-shanbhag added a commit to anish-shanbhag/pebble that referenced this pull request Jul 26, 2024
This change adds a "range annotation" feature to Annotators, which
are computations that aggregate some value over a specific key range
within a level. Level-wide annotations are now computed internally as a
range annotation with a key range spanning the whole level. Range annotations
use the same B-tree caching behavior as regular annotations, so queries
remain fast even with thousands of tables because they avoid a sequential
iteration over a level's files.

This PR only sets up range annotations without changing any existing
behavior. However, there are a number of potential use cases for range
annotations which could be added next:
- Calculating the number of keys shadowed by a tombstone-dense key range,
  for use in the heuristic proposed at cockroachdb#3719
- Computing the total file size that a read compaction overlaps with,
  which is used to prevent read compactions that are too wide
  [here](https://github.com/cockroachdb/pebble/blob/9a4ea4dfc5a8129937e3fdc811ea87543d88565b/compaction_picker.go#L1930)
- Estimating disk usage for a key range without having to iterate over files,
  which is done [here](https://github.com/jbowens/pebble/blob/master/db.go#L2249)
- Calculating average value size and compression ratio for a key range,
  which we [currently use when estimating the potential space that compacting
  point tombstones would reclaim](https://github.com/jbowens/pebble/blob/646c6bab1af3c72dc7db59a0dcc38b5955fc15cc/table_stats.go#L350).
  Range annotations could also be used to implement the TODO from @jbowens.
- Estimating the reclaimed space from compacting range deletions, for which
  we also [currently use sequential iteration](https://github.com/jbowens/pebble/blob/master/table_stats.go#L557).
- Because annotations are in-memory, if we can find a way to refactor those
  last two without using I/O at all, then this would eliminate the need to
  defer table stats collection to a separate goroutine for newly written
  tables.
- Refactoring [`db.ScanStatistics`](https://github.com/jbowens/pebble/blob/646c6bab1af3c72dc7db59a0dcc38b5955fc15cc/db.go#L2823),
  which could increase performance significantly over the current
  implementation where we scan every key in the DB.
- Expand the LSM visualizer tool (cockroachdb#508) or the LSM viewer (cockroachdb#3339) to show
  aggregate statistics about key ranges. Range annotations would allow us to efficiently
  compute statistics including the # of sstables, # of keys, etc. in chunks
  of the keyspace and visualize this on a graph showing overlapping ranges
  from each level.

`BenchmarkNumFilesRangeAnnotation` shows that range annotations are
significantly faster than using `version.Overlaps` to aggregate over
a key range:
```
pkg: github.com/cockroachdb/pebble/internal/manifest
BenchmarkNumFilesRangeAnnotation/annotator-10         	  232282	      4716 ns/op	     112 B/op	       7 allocs/op
BenchmarkNumFilesRangeAnnotation/overlaps-10          	    2110	    545482 ns/op	     400 B/op	       9 allocs/op
```