
Searches against a large number of unavailable shards result in very large responses #90622

Closed
Tracked by #77466
original-brownbear opened this issue Oct 4, 2022 · 3 comments · Fixed by #91365
Assignees: benwtrent
Labels: >bug, :Search/Search (Search-related issues that do not fall into other categories), Team:Search (Meta label for search team)

Comments

@original-brownbear
Member

Searching across a large number of unavailable shards, e.g. via the `*` pattern while a large cluster is recovering from a full restart, leads to extremely large responses containing an exception for each unavailable shard.

As an example, a response for ~25k unavailable shards came to 375M on heap (roughly 15K per shard failure). It serializes to a similar size over the wire, so a single search request can drive 700M+ of peak heap usage, while the valid search response in a green cluster might be much smaller than that.
It seems to me that we could mostly resolve this by not returning the stack trace in this specific case of the shard-not-available exception (it doesn't seem valuable for users, and we made a similar fix around unavailable shards in snapshot status responses)?
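
To make the suggestion concrete, here is a minimal Java sketch of the general technique, not the actual Elasticsearch class (the class name and message below are hypothetical): an exception can opt out of stack trace capture entirely via the four-argument `Throwable` constructor with `writableStackTrace = false`, so each per-shard failure carries only its short message.

```java
// Hypothetical sketch, not Elasticsearch code: an exception that never records
// a stack trace, so serializing one instance per unavailable shard only ships
// the short message rather than a full (and identical) trace.
public class NoShardAvailableSketchException extends RuntimeException {

    public NoShardAvailableSketchException(String shardDescription) {
        // cause = null, enableSuppression = true, writableStackTrace = false:
        // the JVM skips capturing stack frames, which are the bulky part of
        // every unavailable-shard failure.
        super("no shard available for [" + shardDescription + "]", null, true, false);
    }
}
```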

@original-brownbear original-brownbear added the >bug and :Search/Search labels Oct 4, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search label Oct 4, 2022
@benwtrent benwtrent self-assigned this Oct 14, 2022
@benwtrent
Member

@original-brownbear

It seems that originally we did not record these failures at all, but the behavior was changed in #64337.

So, as you said, a good middle ground seems to be not returning the trace for this class of failures.

Sounds fair?

@benwtrent
Member

benwtrent commented Oct 14, 2022

Well, keeping unavailable shards as shard failures but adjusting their exception serialization is a bit more complicated.

Will need to discuss the best way to approach this.

Some silly ideas:

@benwtrent benwtrent removed their assignment Oct 14, 2022
@benwtrent benwtrent self-assigned this Nov 7, 2022
benwtrent added a commit that referenced this issue Jan 13, 2023
…ailable (#91365)

When there are many shards unavailable, we repeatedly store the exact same stack trace and exception; the only difference is the exception message.

This commit fixes that by slightly modifying the created exception so that it neither provides a stack trace nor prints its stack trace as the "reason" when a shard is unavailable.


closes #90622
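
As a rough sketch of the rendering side (an illustration under assumptions, not the actual change in #91365; the helper name is made up): when a failure carries no recorded stack trace, the "reason" string can fall back to just the exception type and message instead of a full trace dump.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Sketch only: build the "reason" text for a shard failure. If the exception
// has no recorded stack trace (as in the unavailable-shard case), emit just
// its type and message; otherwise keep the full printed trace as before.
final class FailureReason {
    static String of(Throwable t) {
        if (t.getStackTrace().length == 0) {
            return t.getClass().getSimpleName() + ": " + t.getMessage();
        }
        StringWriter out = new StringWriter();
        t.printStackTrace(new PrintWriter(out));
        return out.toString();
    }
}
```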
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Jan 13, 2023
…ailable (elastic#91365)

elasticsearchmachine pushed a commit that referenced this issue Jan 13, 2023
…ailable (#91365) (#92907)
