
[BUG] The thread context is not properly cleared and messes up the traces #10789

Closed
reta opened this issue Oct 20, 2023 · 14 comments · Fixed by #10873
Labels: bug (Something isn't working), v2.12.0 (Issues and PRs related to version 2.12.0), v3.0.0 (Issues and PRs related to version 3.0.0)

Comments


reta commented Oct 20, 2023

Describe the bug
There is an issue with cleaning up the thread context after certain transport action invocations: the thread context keeps holding spans from previous invocations, messing up the traces.

To Reproduce
Consider this simple PUT request to create an index:

  curl -X PUT -H "Content-Type: application/json"     http://localhost:9200/test51     -d '{
          "mappings": {
              "properties": {
                  "field": { "type": "date", "format": "epoch_second" }
              }
          },
          "settings": {
              "number_of_shards": 2,
              "number_of_replicas": 2
          }
      }'

It generates the following trace:

Screenshot from 2023-10-20 11-09-24

Now wait just a bit and observe the same trace is growing:

Screenshot from 2023-10-20 11-09-50

And growing:

Screenshot from 2023-10-20 11-21-49

The reason is that the thread context was not cleaned up, so the background tasks keep picking the last span as the parent and attach more and more spans to it.

Expected behavior
The thread context must be properly cleaned up.

Plugins
OpenTelemetry


Host/Environment (please complete the following information):

  • Any

Additional context

opensearch.experimental.feature.telemetry.enabled: true
telemetry.tracer.sampler.probability: 1.0
telemetry.otel.tracer.span.exporter.class: io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter
telemetry.tracer.enabled: true 
telemetry.feature.tracer.enabled: true

CC @Gaganjuneja, this is a serious one.

@reta added the bug (Something isn't working) and untriaged labels and removed the untriaged label on Oct 20, 2023
@Gaganjuneja (Contributor) commented:

Let me take a look.

@reta added the v3.0.0 (Issues and PRs related to version 3.0.0) and v2.12.0 (Issues and PRs related to version 2.12.0) labels and removed the untriaged label on Oct 20, 2023

reta commented Oct 20, 2023

> Let me take a look.

@Gaganjuneja please let me know if you need help

@Gaganjuneja (Contributor) commented:

@reta This issue seems to be happening because of the ThreadContext::stashContext calls below, which keep the state when triggered from the create-index cluster operation. It's happening at a couple of other places as well.

try (ThreadContext.StoredContext ignore = threadContext.stashContext()) {

try (ThreadContext.StoredContext ignore = threadContext.stashContext()) {

try (ThreadContext.StoredContext ignore = threadContext.stashContext()) {
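
For readers less familiar with this pattern, here is a minimal, hedged sketch of how such a try-with-resources stash behaves when the tracing framework keeps the active span as a transient in the context (the key name below is hypothetical; the actual key used by the tracing code may differ):

    ThreadContext threadContext = threadPool.getThreadContext();
    // Assume the active span was put into the context earlier, e.g.
    // threadContext.putTransient("current_span", span);  // hypothetical key
    try (ThreadContext.StoredContext ignore = threadContext.stashContext()) {
        // Inside the block the context is fresh: no headers, no transients,
        // so the span is not visible here.
        // ... run the cluster/transport work ...
    }
    // close() restores the pre-stash context verbatim, so the old span (and any
    // other captured state) is back on the thread - and anything scheduled from
    // this thread afterwards can pick it up as a stale parent.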


reta commented Oct 23, 2023

Thanks @Gaganjuneja, so it seems the thread context and the tracing state are in conflict. One of the options to explore: the stash should move the trace context from the current context to the new one (but that could cause other issues). I will be looking into that this week.
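
For illustration only, here is a hedged sketch of what that option could look like, assuming the span lives in the context as a transient under a hypothetical key (not the actual OpenSearch tracing key or API):

    // Hypothetical helper: stash the context but carry the trace context over.
    static ThreadContext.StoredContext stashKeepingTrace(ThreadContext threadContext, String spanKey) {
        Object currentSpan = threadContext.getTransient(spanKey);          // read the span before stashing
        ThreadContext.StoredContext stored = threadContext.stashContext(); // fresh, empty context
        if (currentSpan != null) {
            threadContext.putTransient(spanKey, currentSpan);              // re-attach it to the new context
        }
        return stored; // closing it still restores the pre-stash context as before
    }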


reta commented Oct 26, 2023

@Gaganjuneja it took me a lot of time, but I think I now clearly understand what is happening. The culprit is #10291: the local transport hands off the request using the thread pool, and at one of the places (still hunting for the exact one) it captures the span from the current thread and never cleans it up. That is what causes this issue to manifest.

The fix I made for now: please see #10873.

@Gaganjuneja (Contributor) commented:

@reta, thanks for putting this up. I am on vacation and will take a deeper look once back. I have taken a glance, and it looks like we need to remove the ThreadContext state in most of the cases, except headers.


reta commented Oct 26, 2023

> @reta, thanks for putting this up. I am on vacation and will take a deeper look once back. I have taken a glance, and it looks like we need to remove the ThreadContext state in most of the cases, except headers.

Thanks @Gaganjuneja, the thread context itself is not necessarily the problem (I think); its management is: it is based on thread locals, so we would do similar things anyway. I think it will become clearer when we pick up #10291.

@Gaganjuneja (Contributor) commented:

@reta, were you able to find the place where the span is not getting cleaned up? I want to understand better why you think the issue is caused by the local transport. If a similar hand-off happens in the non-local transport, won't we end up in the same situation (even after #10873) if we are still storing the span inside the ThreadContext?


reta commented Nov 9, 2023

> @reta, were you able to find the place where the span is not getting cleaned up? I want to understand better why you think the issue is caused by the local transport.

@Gaganjuneja I think I know the suspect (TransportService::sendLocalRequest), and it looks to me that, in the case of the local transport, the callbacks we expect to be called are not called (I suspect this is because of DirectResponseChannel, but I have left the investigation until #10291).

> If a similar hand-off happens in the non-local transport, won't we end up in the same situation (even after #10873) if we are still storing the span inside the ThreadContext?

I don't think this is a generic hand-off problem; it is really specific to the local transport (at least, I haven't seen any messed-up traces after the fix, though admittedly we are instrumenting less now).

@Gaganjuneja (Contributor) commented:

@reta, I dug deeper and found the issue. I still need to find the fix for it.

This is happening during index creation, particularly during createShard. The flow looks like this:

IndexShard::syncRetentionLeases -> RetentionLeaseSyncer::backgroundSync -> RetentionLeaseBackgroundSyncAction::backgroundSync

Here,

final ThreadContext threadContext = threadPool.getThreadContext();

  1. Stashes the ThreadContext.
  2. Executes the TransportAction (backgroundSyncAction), which attaches/detaches the span and clears up the ThreadContext.
  3. The stashed context's restore() brings the previous context back, and the span is back in the ThreadContext.
  4. After this there is a call to AbstractAsyncTask::rescheduleIfNecessary from the code path below.

Here it schedules the task:

cancellable = threadPool.schedule(this, interval, getThreadPool());

The scheduled task gets executed in the current thread's context, which copies the stale span and keeps it. These scheduled tasks run at a fixed interval and keep using the stale span as their parent span.

public ScheduledCancellable schedule(Runnable command, TimeValue delay, String executor) {

I think this is a specific case where the hand-off is not working as expected, but we should definitely handle this in the framework itself. Looking forward to your thoughts on this.
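
The mechanism is easy to reproduce outside OpenSearch: a plain ThreadLocal stands in for the thread context, and a periodic task captures it once at scheduling time. A minimal, self-contained sketch (not the actual OpenSearch code):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class StaleContextDemo {
        // Stand-in for the thread-context transient holding the active span.
        private static final ThreadLocal<String> CURRENT_SPAN = new ThreadLocal<>();

        public static void main(String[] args) throws Exception {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

            // The "create index" request thread still has its span when the task is scheduled.
            CURRENT_SPAN.set("span-of-create-index-request");
            String captured = CURRENT_SPAN.get(); // captured once, like the stored thread context

            scheduler.scheduleAtFixedRate(() -> {
                // Every periodic run replays the captured context, so the long-finished
                // request span keeps showing up as the parent of new child spans.
                CURRENT_SPAN.set(captured);
                System.out.println("background sync runs under parent: " + CURRENT_SPAN.get());
            }, 0, 1, TimeUnit.SECONDS);

            Thread.sleep(3_000);
            scheduler.shutdownNow();
        }
    }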


reta commented Nov 15, 2023

Thanks @Gaganjuneja

> @reta, I dug deeper and found the issue. I still need to find the fix for it.

I believe you are seeing the consequences of the problem: the RetentionLeaseBackgroundSyncAction is, as its name says, a background task run periodically. This task should not see any traces (in other words, the ThreadContext it stashes should have no spans).

Have you tried to see what happens when using the code from #10873?

@Gaganjuneja (Contributor) commented:

RetentionLeaseBackgroundSyncAction is started by a createShard call, and right at the start it takes the calling thread's context and stores it for all further scheduled executions. The following code creates the AsyncRetentionLeaseSyncTask, which internally schedules this action, and at that point the current thread's context holds the span from the incoming indexing request.

this.retentionLeaseSyncTask = new AsyncRetentionLeaseSyncTask(this);

> Have you tried to see what happens when using the code from #10873?

Yes, I tried this fix. With it the issue is not visible, but if I debug the ThreadContext state, it still holds the stale state; since we are no longer instrumenting the local transport, it is simply not visible.


reta commented Nov 16, 2023

> Yes, I tried this fix. With it the issue is not visible, but if I debug the ThreadContext state, it still holds the stale state; since we are no longer instrumenting the local transport, it is simply not visible.

Cool, thank you. So I think we could merge it (since at least it does not mess up the traces) and work on the proper fix as part of #10291; right now the feature is unusable.

@Gaganjuneja (Contributor) commented:

Yes, in the meantime we can go ahead with #10873, as it contains some good code refactoring and isolation.
