
Searches against a large number of unavailable shards result in very large responses #90622

Closed
Tracked by #77466
original-brownbear opened this issue Oct 4, 2022 · 3 comments · Fixed by #91365
Assignees: benwtrent
Labels: >bug, :Search/Search (Search-related issues that do not fall into other categories), Team:Search (Meta label for search team)

Comments

@original-brownbear
Member

Searching across a large number of unavailable shards, e.g. via the `*` pattern while a large cluster is recovering from a full restart, leads to extremely large responses containing an exception for each unavailable shard.

As an example, a response for ~25k unavailable shards came to 375M on heap (roughly 15K per shard failure). It serializes to a similar size over the wire, so a single search request can drive 700M+ of peak heap usage, while the valid search response in a green cluster might be much smaller than that.
It seems to me that we could mostly resolve this by not returning the stack trace in this specific case of the shard-not-available exception (it doesn't seem valuable for users, and we made a similar fix around unavailable shards in snapshot status responses)?
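
To make the suggestion concrete, here is a minimal Java sketch of the general technique, not the actual Elasticsearch class (the class name and message below are hypothetical): an exception can opt out of stack trace capture entirely via the four-argument `Throwable` constructor with `writableStackTrace = false`, so each per-shard failure carries only its short message.

```java
// Hypothetical sketch, not Elasticsearch code: an exception that never records
// a stack trace, so serializing one instance per unavailable shard only ships
// the short message rather than a full (and identical) trace.
public class NoShardAvailableSketchException extends RuntimeException {

    public NoShardAvailableSketchException(String shardDescription) {
        // cause = null, enableSuppression = true, writableStackTrace = false:
        // the JVM skips capturing stack frames, which are the bulky part of
        // every unavailable-shard failure.
        super("no shard available for [" + shardDescription + "]", null, true, false);
    }
}
```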

@original-brownbear original-brownbear added the >bug and :Search/Search labels Oct 4, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search label Oct 4, 2022
@benwtrent benwtrent self-assigned this Oct 14, 2022
@benwtrent
Member

@original-brownbear

It seems that originally we did not record these failures at all, but the behavior was changed in #64337.

So, as you said, a good middle ground seems to be not returning the trace for this class of failures.

Sounds fair?

@benwtrent
Member

benwtrent commented Oct 14, 2022

Well, keeping unavailable shards as shard failures but adjusting their exception serialization is a bit more complicated.

Will need to discuss the best way to approach this.

Some silly ideas:

@benwtrent benwtrent removed their assignment Oct 14, 2022
@benwtrent benwtrent self-assigned this Nov 7, 2022
benwtrent added a commit that referenced this issue Jan 13, 2023
…ailable (#91365)

When there are many shards unavailable, we repeatedly store the exact same stack trace and exception; the only difference is the exception message.

This commit fixes that by slightly modifying the created exception so that it neither provides a stack trace nor prints its stack trace as the "reason" when a shard is unavailable.


closes #90622
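
As a rough sketch of the rendering side (an illustration under assumptions, not the actual change in #91365; the helper name is made up): when a failure carries no recorded stack trace, the "reason" string can fall back to just the exception type and message instead of a full trace dump.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Sketch only: build the "reason" text for a shard failure. If the exception
// has no recorded stack trace (as in the unavailable-shard case), emit just
// its type and message; otherwise keep the full printed trace as before.
final class FailureReason {
    static String of(Throwable t) {
        if (t.getStackTrace().length == 0) {
            return t.getClass().getSimpleName() + ": " + t.getMessage();
        }
        StringWriter out = new StringWriter();
        t.printStackTrace(new PrintWriter(out));
        return out.toString();
    }
}
```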
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Jan 13, 2023
…ailable (elastic#91365)

elasticsearchmachine pushed a commit that referenced this issue Jan 13, 2023
…ailable (#91365) (#92907)
