Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Concurrent Segment Search] SimpleNestedIT tests using nested sort are flaky for concurrent segment search #11187

Closed
jed326 opened this issue Nov 13, 2023 · 6 comments · Fixed by #11377
Assignees
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run

Comments

@jed326
Copy link
Collaborator

jed326 commented Nov 13, 2023

Describe the bug
The following tests are flaky for concurrent segment search:

  • testSortNestedWithNestedFilter
  • testNestedSortWithMultiLevelFiltering
  • testLeakingSortValues
  • testNestedSortingWithNestedFilterAsFilter
  • testSimpleNestedSortingWithNestedFilterMissing
  • testSimpleNestedSorting

See: https://build.ci.opensearch.org/job/gradle-check/29891

java.lang.AssertionError: Count is 0 hits but 3 was expected.  Total shards: 2 Successful shards: 1 & 1 shard failures:
 shard [[5Kj_N13LQg2fToy_tNYNEg][test][0]], reason [RemoteTransportException[[node_s1][127.0.0.1:39609][indices:data/read/search[phase/query/id]]]; nested: QueryPhaseExecutionException[Query Failed [Failed to execute main query]]; nested: NotSerializableExceptionWrapper[no_such_element_exception: null]; ], cause [NotSerializableExceptionWrapper[no_such_element_exception: null]
	at java.util.LinkedList.removeFirst(LinkedList.java:281)
	at java.util.LinkedList.pop(LinkedList.java:812)
	at org.opensearch.index.query.support.NestedScope.previousLevel(NestedScope.java:69)
	at org.opensearch.search.sort.SortBuilder.resolveNestedQuery(SortBuilder.java:250)
	at org.opensearch.search.sort.SortBuilder.resolveNested(SortBuilder.java:200)
	at org.opensearch.search.sort.SortBuilder.resolveNested(SortBuilder.java:196)
	at org.opensearch.search.sort.FieldSortBuilder.nested(FieldSortBuilder.java:583)
	at org.opensearch.search.sort.FieldSortBuilder.build(FieldSortBuilder.java:409)
	at org.opensearch.search.sort.SortBuilder.buildSort(SortBuilder.java:166)
	at org.opensearch.search.sort.FieldSortBuilder.getMinMaxOrNullInternal(FieldSortBuilder.java:629)
	at org.opensearch.search.sort.FieldSortBuilder.getMinMaxOrNullForSegment(FieldSortBuilder.java:624)
	at org.opensearch.search.internal.ContextIndexSearcher.canMatchSearchAfter(ContextIndexSearcher.java:512)
	at org.opensearch.search.internal.ContextIndexSearcher.canMatch(ContextIndexSearcher.java:504)
	at org.opensearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:295)
	at org.opensearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:280)
	at org.apache.lucene.search.IndexSearcher.lambda$search$1(IndexSearcher.java:715)
	at java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1583)

All tests show the same shard failure

These tests have all been muted for concurrent segment search case in #11181 so we do not expect to see any test failures in Gradle Check

@jed326 jed326 added bug Something isn't working untriaged labels Nov 13, 2023
@jed326 jed326 added flaky-test Random test failure that succeeds on second run and removed untriaged labels Nov 13, 2023
@neetikasinghal
Copy link
Contributor

can u add the stacktrace for the failing tests, the gradle check url just has failure for testSimpleNestedSortingWithNestedFilterMissing

@jed326
Copy link
Collaborator Author

jed326 commented Nov 14, 2023

It's the same query phase exception for all of the tests, will add it to the issue description

@jed326 jed326 self-assigned this Nov 14, 2023
@jed326 jed326 changed the title [BUG] SimpleNestedIT tests using nested sort are flaky for concurrent segment search [Concurrent Segment Search] SimpleNestedIT tests using nested sort are flaky for concurrent segment search Nov 14, 2023
@jed326
Copy link
Collaborator Author

jed326 commented Nov 14, 2023

Previous sort related issue #8510 looks like it was related to some race condition in collect. From stack trace error is coming from canMatch here which should happen before.

@jed326
Copy link
Collaborator Author

jed326 commented Nov 15, 2023

The problem here is that the NestedScope() object is being shared across slices so the stack traversal via LinkedList is not working correctly. Specifically the issue comes from it being used during canMatchon each slice however the NestedScope object is used throughout other parts of search as well such as during build query.

Will follow-up with some potential solutions.

@jed326
Copy link
Collaborator Author

jed326 commented Nov 28, 2023

I think these tests themselves have actually been fixed by #11249, however the underlying race condition issue still exists.

Now the tests do not go down the canMatchSearchAfter path so the NestedScope object does not get used on the segment slice. Will need to do some more research on said change. I think we will also need to introduce a new test case that uses search_after as well.

@reta
Copy link
Collaborator

reta commented Nov 30, 2023

@jed326 another (new) variation here #11413

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants