
Synthetic source #85649

Merged: 54 commits into elastic:master, May 10, 2022
Conversation

@nik9000 (Member) commented Apr 1, 2022


This attempts to shrink the index by implementing a "synthetic _source" field.
You configure it in the mapping:

{
  "mappings": {
    "_source": {
      "synthetic": true
    }
  }
}

And we just stop storing the _source field - kind of. When you go to access
the _source we regenerate it on the fly by loading doc values. Doc values
don't preserve the original structure of the source you sent so we have to
make some educated guesses. And we have a rule: the source we generate would
result in the same index if you sent it back to us. That way you can use it
for things like _reindex.
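For example, here's a minimal end-to-end sketch of what that looks like from the
API side. The index name test, the message field, and the unsecured localhost
cluster are all made up for illustration, and because this is behind a feature
flag the exact syntax may still change:

curl -XPUT 'localhost:9200/test' -H 'Content-Type: application/json' -d'{
  "mappings": {
    "_source": { "synthetic": true },
    "properties": {
      "message": { "type": "keyword" }
    }
  }
}'
curl -XPOST 'localhost:9200/test/_doc?refresh' -H 'Content-Type: application/json' -d'{
  "message": ["foo", "foo", "bar"]
}'
# the hit comes back with a synthesized _source, expected to look like
# {"message": ["bar", "foo"]} - sorted and de-duplicated, per the rules below
curl -XGET 'localhost:9200/test/_search?pretty'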

Fetching the _source from doc values does slow down loading somewhat. See
numbers further down.

Supported fields

This only works for the following fields:

  • boolean
  • byte
  • date
  • double
  • float
  • geo_point (with precision loss)
  • half_float
  • integer
  • ip
  • keyword
  • long
  • scaled_float
  • short
  • text (when there is a keyword sub-field that is compatible with this feature; see the sketch just below this list)
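
For the text case, that means a multi-field mapping along these lines (the
field names are made up for illustration); the generator can then rebuild the
text value from the keyword sub-field's doc values:

{
  "message": {
    "type": "text",
    "fields": {
      "raw": { "type": "keyword" }
    }
  }
}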

Educated guesses

The synthetic source generator:

  • sorts fields alphabetically
  • keeps the result as "objecty" as possible
  • pushes all arrays to the "leaf" fields
  • sorts most array values
  • removes duplicate text and keyword values

These are mostly artifacts of how doc values are stored.

sorted alphabetically

{
  "b": 1,
  "c": 2,
  "a": 3
}

becomes

{
  "a": 3,
  "b": 1,
  "c": 2
}

as "objecty" as possible

{
  "a.b": "foo"
}

becomes

{
  "a": {
    "b": "foo"
  }
}

pushes all arrays to the "leaf" fields

{
  "a": [
    {
      "b": "foo",
      "c": "bar"
    },
    {
      "c": "bort"
    },
    {
      "b": "snort"
    }
  ]
}

becomes

{
  "a": {
    "b": ["foo", "snort"],
    "c": ["bar", "bort"]
  }
}

sorts most array values

{
  "a": [2, 3, 1]
}

becomes

{
  "a": [1, 2, 3]
}

removes duplicate text and keyword values

{
  "a": ["bar", "baz", "baz", "baz", "foo", "foo"]
}

becomes

{
  "a": ["bar", "baz", "foo"]
}

_recovery_source

Elasticsearch's shard "recovery" process sometimes needs _source. So does
cross cluster replication. If you disable source or filter it somehow, we store
a _recovery_source field for as long as the recovery process might need it.
When everything is running smoothly that's generally a few seconds or minutes.
Then the field is removed on merge. This synthetic source feature continues
to produce _recovery_source and relies on it for recovery. It's possible
to synthesize _source during recovery, but we don't do it.

That means that synthetic source doesn't speed up writing the index. But in the
future we might be able to synthesize _source on the recovery side too, trading
less data written at index time for slower recovery and cross cluster
replication. That's an area of future improvement.

perf numbers

I loaded the entire tsdb data set with this change; here's the size:

           standard -> synthetic
store size  31.0 GB ->  7.0 GB  (77.5% reduction)
_source  24695.7 MB -> 47.6 MB  (99.8% reduction - synthetic is in _recovery_source)

A second _forcemerge a few minutes after rally finishes should remove the
remaining 47.6MB of _recovery_source.
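
For anyone reproducing this, the merge and the size check can be done along
these lines (index name tsdb and an unsecured localhost cluster assumed; the
size numbers above may well have come from rally itself, but the
analyze-disk-usage API gives a similar per-field breakdown):

# force a merge so segments still carrying _recovery_source get rewritten
curl -XPOST 'localhost:9200/tsdb/_forcemerge?max_num_segments=1'

# per-field disk usage report; run_expensive_tasks is required for the breakdown
curl -XPOST 'localhost:9200/tsdb/_disk_usage?run_expensive_tasks=true&pretty'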

With this, fetching the source for 1,000 documents seems to take about 500ms. I
spot checked a lot of different areas and haven't seen the hit differ much. I
expect this performance impact depends on the number of doc values fields
in the index and how sparse they are.
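
The fetch being timed is roughly this shape of request (index name tsdb
assumed); _source is returned by default, so each hit's _source is synthesized
from doc values on the way out:

curl -XPOST 'localhost:9200/tsdb/_search?pretty' -H 'Content-Type: application/json' -d'{
  "size": 1000,
  "query": { "match_all": {} }
}'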

@nik9000 requested a review from romseygeek on April 1, 2022 at 20:10
@nik9000 (Member Author) commented Apr 5, 2022

I hacked together something to test the differences:

export from=$(curl -s -HContent-Type:application/json -uelastic:D2MVMBAUE0fUDu30A6yO -XPOST -k 'https://localhost:9201/tsdb/_search?size=0&pretty' -d'{
  "aggs": {
    "min": {
      "min": {
        "field": "@timestamp",
        "format": "epoch_millis"
      }
    }
  }
}' | jq -r .aggregations.min.value_as_string)
export to=$(curl -s -HContent-Type:application/json -uelastic:D2MVMBAUE0fUDu30A6yO -XPOST -k 'https://localhost:9201/tsdb/_search?size=0&pretty' -d'{
  "aggs": {
    "max": {
      "max": {
        "field": "@timestamp",
        "format": "epoch_millis"
      }
    }
  }
}' | jq -r .aggregations.max.value_as_string)
for date in $(seq $from 1000 $to); do
  for id in $(curl -s -HContent-Type:application/json -uelastic:D2MVMBAUE0fUDu30A6yO -XPOST -k 'https://localhost:9201/tsdb/_search?size=10000&pretty' -d'{
    "stored_fields": ["__none__"],
    "sort": {"@timestamp": "desc"},
    "query": {
      "range": {
        "@timestamp": {
          "gte": '$date',
          "format": "epoch_millis"
        }
      }
    }
  }' | jq -r .hits.hits[]._id); do
    echo $date $id
    diff \
      <(curl -s -HContent-Type:application/json -uelastic:D2MVMBAUE0fUDu30A6yO -XGET -k 'https://localhost:9201/tsdb/_doc/'$id | jq 'del(._seq_no)' -S) \
      <(curl -s -HContent-Type:application/json -uelastic:D2MVMBAUE0fUDu30A6yO -XGET -k 'https://localhost:9200/tsdb/_doc/'$id | jq 'del(._seq_no)' -S)
  done
done | tee diffs

Which spits out:

1619630303410 aJUt5LaG4q6Jz3sDAAABeR4M1KU
1619630303410 JFpDQXhLo-DNlumpAAABeR4M1KU
1619630303410 uwFiL6jzbCaBLd-LAAABeR4M1KU
1619630303410 DDQFZr8fpaviwfqcAAABeR4M1GI
36c36
<               "pct": 0.0076478614634146345
---
>               "pct": 0.008
40c40
<               "pct": 0.003919529
---
>               "pct": 0.004
72c72
<               "pct": 0.4777109375
---
>               "pct": 0.47800000000000004
75c75
<               "pct": 0.015884058091486793
---
>               "pct": 0.016
97c97
<         "start_time": "2021-04-29T08:18:44Z"
---
>         "start_time": "2021-04-29T08:18:44.000Z"
1619630303410 iHhCyXEez-elWsWXAAABeR4M1GI
36c36
<               "pct": 0.2534879995
---
>               "pct": 0.253
40c40
<               "pct": 0.2534879995
---
>               "pct": 0.253
72c72
<               "pct": 0.01896543080120626
---
>               "pct": 0.019
75c75
<               "pct": 0.01896543080120626
---
>               "pct": 0.019
97c97
<         "start_time": "2021-04-29T14:31:13Z"
---
>         "start_time": "2021-04-29T14:31:13.000Z"
1619630303410 wWEXn97ymH76klIbAAABeR4M1GI
36c36
<               "pct": 0.165416355
---
>               "pct": 0.165
40c40
<               "pct": 0.00827081775
---
>               "pct": 0.008
72c72
<               "pct": 0.619265625
---
>               "pct": 0.619
75c75
<               "pct": 0.010295400826531081
---
>               "pct": 0.01
97c97
<         "start_time": "2021-04-07T10:08:31Z"
---
>         "start_time": "2021-04-07T10:08:31.000Z"

The test data has values like "pct": 0.010295400826531081 and the mapping is configured to use a scaled_float with a scaling factor of 1000, so what we actually store in doc values is "pct": 0.01 - and that is what we put in the synthetic source.
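
For reference, that mapping is along these lines (the field name pct is
assumed). scaled_float stores round(value * scaling_factor) as a long, so
0.010295400826531081 becomes 10, which reads back from doc values as
10 / 1000 = 0.01, and that rounded value is all synthetic source has to work
with:

{
  "pct": {
    "type": "scaled_float",
    "scaling_factor": 1000
  }
}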

@nik9000 (Member Author) commented Apr 5, 2022

One thing I've noticed that we probably don't want, but that I don't know how to get rid of, is copy_to - if you use copy_to to index foo.bar.message at message then both fields will have doc values for it and the _source will contain the value twice. This is ok for now, but would prevent us from using the source for recovery.

@nik9000 (Member Author) commented Apr 6, 2022

> One thing I've noticed that we probably don't want, but that I don't know how to get rid of, is copy_to - if you use copy_to to index foo.bar.message at message then both fields will have doc values for it and the _source will contain the value twice. This is ok for now, but would prevent us from using the source for recovery.

I've forbidden copy_to for synthetic source indices in this PR. We can figure out how to allow it later.
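
For illustration, a combination shaped like this (hypothetical field names) is
what now gets rejected: a synthetic _source index containing a field that uses
copy_to:

{
  "mappings": {
    "_source": { "synthetic": true },
    "properties": {
      "message": { "type": "keyword" },
      "note": { "type": "keyword", "copy_to": "message" }
    }
  }
}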

@nik9000 (Member Author) commented May 5, 2022

@romseygeek could you have another look at this? I've pushed some extra testing for round trips and it all passes. Well, sort of. I have to stub out a little of it because of mystery precision things. But I think we can get those in a follow up change.

@@ -203,4 +206,24 @@ protected void randomFetchTestFieldConfig(XContentBuilder b) throws IOException
protected boolean allowsNullValues() {
return false; // null is an error for constant keyword
}

A Contributor commented on the diff above:

We have enough test cases that have to implement these four identical 'empty' methods that maybe it's worth consolidating them into a NoSyntheticSourceTest interface with default methods, so the test cases can just implement that?

@nik9000 (Member Author) commented May 9, 2022

For those following along at home: this used to be activated with enabled: synthetic but now it is activated with synthetic: true. I'm debating with a few folks about which is better. But, because this is behind a feature flag, I think it's safe to merge it either way. And, since the code currently supports synthetic: true, that's what I'd like to merge in the first cut.
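
Concretely, earlier revisions of this PR used:

  "_source": { "enabled": "synthetic" }

and the version being merged uses:

  "_source": { "synthetic": true }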

@nik9000 (Member Author) commented May 9, 2022

Now that this is merged I've moved the follow up work to a meta issue: #86603

@romseygeek (Contributor) left a comment

LGTM. Thanks for all the back and forth, let's get this merged and look at the follow-ups.

@nik9000 (Member Author) commented May 10, 2022

I have some perf numbers from a hack that turns off _recovery_source. This shows the potential indexing speed improvement we could get from using synthetic source on the recovery side:

|                         Metric |   Baseline |   Contender |       Diff |   Unit |  Change |
|                 Min Throughput | 18024.6    | 19495.8     | 1471.23    | docs/s |  +8.16% |
|                Mean Throughput | 19426.9    | 21727.1     | 2300.2     | docs/s | +11.84% |
|              Median Throughput | 19169.7    | 21310.8     | 2141.08    | docs/s | +11.17% |
|                 Max Throughput | 22742.4    | 26960.5     | 4218.16    | docs/s | +18.55% |
|       Cumulative indexing time |   829.772  |   768.031   |  -61.741   |    min |  -7.44% |
|          Cumulative merge time |   235.641  |   230.36    |   -5.28152 |    min |  -2.24% |
| Cumulative merge throttle time |    39.8483 |    61.9718  |   22.1234  |    min | +55.52% |
|        Cumulative refresh time |    11.0899 |     7.02558 |   -4.06432 |    min | -36.65% |
|          Cumulative flush time |    41.3977 |    30.4483  |  -10.9495  |    min | -26.45% |

The short version is about 11% improvement in docs per second in TSDB, probably more in non-TSDB. Significantly faster merges, flushes, and refreshes - at least in TSDB, probably much faster in non-TSDB.

TSDB in its current form has a somewhat inefficient indexing pipeline, mostly because it can never skip the _id lookup on write. We will fix that eventually, but for now TSDB is known to be slower to write. So the 11% speed boost on write here will likely jump once that slowness is resolved. I'm running a test against a non-TSDB index now to see.

The merge time is funny to read - it looks like a 2% speed up, but I believe a lot of that speed up is being throttled. See the 55% bump in merge throttling time. My guess is that we're looking at a reduction in load from merge in the 25% range, similar to flush and refresh.

Here's what the disk looks like with _recovery_source enabled:

Device   r/s     w/s  rMB/s     wMB/s  ... wareq-sz  svctm  %util
md0     0.00  176.67   0.00     30.34  ...   175.88   0.00   0.00
md0     0.00  157.33   0.00     31.66  ...   206.07   0.00   0.00
md0     0.00  196.67   0.00     35.70  ...   185.91   0.00   0.00
md0     0.00  288.00   0.00     62.00  ...   220.44   0.00   0.00
md0     0.00  185.00   0.00     41.12  ...   227.62   0.00   0.00
md0     0.00  126.33   0.00     27.97  ...   226.71   0.00   0.00
md0     0.00  192.00   0.00     25.95  ...   138.38   0.00   0.00
md0     0.00  208.64   0.00     47.55  ...   233.36   0.00   0.00
md0     0.00 1167.00   0.00    206.48  ...   181.18   0.00   0.00
md0     0.00  206.64   0.00     23.55  ...   116.69   0.00   0.00
md0     0.00  221.67   0.00     30.54  ...   141.06   0.00   0.00
md0     0.00  158.33   0.00     29.90  ...   193.36   0.00   0.00
md0     0.00  208.00   0.00     33.23  ...   163.59   0.00   0.00
md0     0.00  266.33   0.00     70.11  ...   269.54   0.00   0.00
md0     0.00  122.67   0.00     12.91  ...   107.77   0.00   0.00
md0     0.00  184.67   0.00     28.96  ...   160.57   0.00   0.00
md0     0.00  951.67   0.00    103.28  ...   111.13   0.00   0.00
md0     0.00  214.00   0.00     31.92  ...   152.72   0.00   0.00
md0     0.00  184.00   0.00     31.07  ...   172.93   0.00   0.00

Note the bursty writes. Here's what it looks like without _recovery_source:

Device   r/s     w/s  rMB/s     wMB/s   ... wareq-sz  svctm  %util
md0     0.00  252.00   0.00     43.25   ...   175.74   0.00   0.00
md0     0.00  703.00   0.00     51.49   ...    75.00   0.00   0.00
md0     0.00  250.00   0.00     41.11   ...   168.40   0.00   0.00
md0     0.00  194.67   0.00     44.45   ...   233.82   0.00   0.00
md0     0.00  192.00   0.00     44.57   ...   237.71   0.00   0.00
md0     0.00  176.00   0.00     41.98   ...   244.23   0.00   0.00
md0     0.00  157.67   0.00     26.27   ...   170.62   0.00   0.00
md0     0.00  854.67   0.00     77.48   ...    92.83   0.00   0.00
md0     0.00  174.67   0.00     38.78   ...   227.35   0.00   0.00
md0     0.00  186.67   0.00     40.56   ...   222.51   0.00   0.00
md0     0.00  174.67   0.00     36.90   ...   216.34   0.00   0.00
md0     0.00  219.67   0.00     44.61   ...   207.96   0.00   0.00
md0     0.00  187.67   0.00     41.46   ...   226.21   0.00   0.00
md0     0.00   79.33   0.00     13.53   ...   174.67   0.00   0.00
md0     0.00  670.33   0.00     71.00   ...   108.46   0.00   0.00
md0     0.00  307.67   0.00     33.58   ...   111.75   0.00   0.00
md0     0.00  182.00   0.00     44.03   ...   247.74   0.00   0.00
md0     0.00  219.00   0.00     46.32   ...   216.57   0.00   0.00
md0     0.00  210.67   0.00     41.19   ...   200.21   0.00   0.00
md0     0.00  204.00   0.00     48.01   ...   241.01   0.00   0.00
md0     0.00  101.33   0.00     18.07   ...   182.63   0.00   0.00
md0     0.00  773.33   0.00    105.11   ...   139.18   0.00   0.00
md0     0.00  287.67   0.00     28.89   ...   102.82   0.00   0.00
md0     0.00  176.00   0.00     40.39   ...   234.99   0.00   0.00
md0     0.00  209.67   0.00     40.71   ...   198.82   0.00   0.00
md0     0.00  209.67   0.00     41.95   ...   204.86   0.00   0.00
md0     0.00  145.67   0.00     25.39   ...   178.52   0.00   0.00
md0     0.00  203.33   0.00     38.60   ...   194.41   0.00   0.00
md0     0.00  737.00   0.00     66.14   ...    91.90   0.00   0.00

The writes are less bursty. Still bursty, but less so. I believe the infrastructure that I used to run this captured graphs of this data over a longer period of time, but I don't know how to access it. I'm digging.

Edit:
Here is the indexing performance for non-tsdb indices:

|                         Metric |    Baseline |   Contender |       Diff |   Unit |   Change |
|                 Min Throughput | 54252.1     | 63616.7     | 9364.55    | docs/s |  +17.26% |
|                Mean Throughput | 55221.9     | 64975.2     | 9753.27    | docs/s |  +17.66% |
|              Median Throughput | 55085.7     | 65084.5     | 9998.85    | docs/s |  +18.15% |
|                 Max Throughput | 56526.7     | 66064.2     | 9537.54    | docs/s |  +16.87% |
|       Cumulative indexing time |   266.365   |   223.578   |  -42.7869  |    min |  -16.06% |
|          Cumulative merge time |   110.918   |    90.5038  |  -20.4143  |    min |  -18.40% |
| Cumulative merge throttle time |     1.19758 |     0.76905 |   -0.42853 |    min |  -35.78% |
|        Cumulative refresh time |     1.71403 |     1.25598 |   -0.45805 |    min |  -26.72% |
|          Cumulative flush time |     7.67595 |     5.885   |   -1.79095 |    min |  -23.33% |

This one is better: in the neighborhood of 17.5% rather than 11%.

@nik9000 merged commit a589456 into elastic:master on May 10, 2022
@nik9000 mentioned this pull request on May 10, 2022
@nik9000 (Member Author) commented May 10, 2022

I got charts! Here's disk write for non-tsdb indices:

[chart: stacked line graph of disk writes during the non-TSDB runs]

It's a stacked line graph of writes on all physical disks on the machine, so md0 above is basically the topmost line. The first run has _recovery_source and the second one doesn't. The second run writes faster and hits the disk less hard.

@nik9000 (Member Author) commented May 10, 2022

Here's the TSDB run:

[chart: stacked line graph of disk writes during the TSDB runs]

This time the second run has _recovery_source and the first one doesn't. It's the same picture: turning off _recovery_source hits the disk less hard and increases write speed.

@ruslaniv commented Dec 2, 2022

Is there any way to disable creation of _recovery_source because of this:
#82595 (comment)

@nik9000 (Member Author) commented Dec 2, 2022

We've talked a little about this - rebuilding the _source on the fly using synthetic _source. At the time we decided it wasn't worth it because folks were looking at doing other kinds of replication. I believe they are still working on that. In that replication mechanism we wouldn't need _recovery_source at all. That'd be lovely. No synthetic _source required. I still think that's a good plan.

@ruslaniv commented Dec 6, 2022

Nik, thank you for your answer!
Do you think the issue of "dangling" _recovery_source could be addressed in the near future? Right now this issue is causing our index to grow to 250GB instead of 50GB. Not only is this wasting 200GB of disk space, which is not critical, but the index no longer fits in available RAM, which is very critical.

@nik9000 (Member Author) commented Dec 6, 2022

> Not only is this wasting 200GB of disk space, which is not critical, but the index no longer fits in available RAM, which is very critical.

Bleh. And _source is stored next to _id and friends so you'll end up paging it in even if you weren't intending to load it from disk. Lovely. It looks like @DaveCTurner is talking to you on the linked issue about the dangling _recovery_source which is a good sign. He should be able to figure out what's going on for you.

I do think _recovery_source is being used much more now - mostly because folks are removing dense vectors from the _source, but partly because of synthetic _source. I wouldn't be surprised if we found more "fun" things in it now - but it should work as he describes. I read a lot of that code when working on this. But computers are sneaky.

Labels: >feature, :Search Foundations/Mapping (Index mappings, including merging and defining field types), v8.3.0