
[BUG] The thread context is not properly cleared and messes up the traces #10789

Closed
reta opened this issue Oct 20, 2023 · 14 comments · Fixed by #10873
Labels: bug (Something isn't working), v2.12.0 (Issues and PRs related to version 2.12.0), v3.0.0 (Issues and PRs related to version 3.0.0)

Comments


reta commented Oct 20, 2023

Describe the bug
There is an issue with cleaning up the thread context after certain transport action invocations: the thread context keeps holding spans from previous invocations, messing up the traces.

To Reproduce
Consider this simple PUT request to create an index:

  curl -X PUT -H "Content-Type: application/json"     http://localhost:9200/test51     -d '{
          "mappings": {
              "properties": {
                  "field": { "type": "date", "format": "epoch_second" }
              }
          },
          "settings": {
              "number_of_shards": 2,
              "number_of_replicas": 2
          }
      }'

It generates the following trace:

Screenshot from 2023-10-20 11-09-24

Now wait just a bit and observe the same trace is growing:

Screenshot from 2023-10-20 11-09-50

And growing:

Screenshot from 2023-10-20 11-21-49

The reason is that the thread context was not cleaned up, so the background tasks keep picking the last span as the parent and attach more and more spans to it.

Expected behavior
The thread context must be properly cleaned up.

Plugins
OpenTelemetry


Host/Environment (please complete the following information):

  • Any

Additional context

opensearch.experimental.feature.telemetry.enabled: true
telemetry.tracer.sampler.probability: 1.0
telemetry.otel.tracer.span.exporter.class: io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter
telemetry.tracer.enabled: true 
telemetry.feature.tracer.enabled: true

CC @Gaganjuneja, this is a serious one.

@reta added the bug (Something isn't working) and untriaged labels and removed the untriaged label on Oct 20, 2023
@Gaganjuneja (Contributor) commented:

Let me take a look.

@reta added the v3.0.0 (Issues and PRs related to version 3.0.0) and v2.12.0 (Issues and PRs related to version 2.12.0) labels and removed the untriaged label on Oct 20, 2023

reta commented Oct 20, 2023

> Let me take a look.

@Gaganjuneja please let me know if you need help

@Gaganjuneja (Contributor) commented:

@reta This issue seems to be happening because of the ThreadContext::stashContext calls below, which keep the state when triggered from the create-index cluster operation. It's happening at a couple of other places as well.

try (ThreadContext.StoredContext ignore = threadContext.stashContext()) {

try (ThreadContext.StoredContext ignore = threadContext.stashContext()) {

try (ThreadContext.StoredContext ignore = threadContext.stashContext()) {
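
For readers less familiar with this pattern, here is a minimal, hedged sketch of how such a try-with-resources stash behaves when the tracing framework keeps the active span as a transient in the context (the key name below is hypothetical; the actual key used by the tracing code may differ):

    ThreadContext threadContext = threadPool.getThreadContext();
    // Assume the active span was put into the context earlier, e.g.
    // threadContext.putTransient("current_span", span);  // hypothetical key
    try (ThreadContext.StoredContext ignore = threadContext.stashContext()) {
        // Inside the block the context is fresh: no headers, no transients,
        // so the span is not visible here.
        // ... run the cluster/transport work ...
    }
    // close() restores the pre-stash context verbatim, so the old span (and any
    // other captured state) is back on the thread - and anything scheduled from
    // this thread afterwards can pick it up as a stale parent.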


reta commented Oct 23, 2023

Thanks @Gaganjuneja, so it seems the thread context and the tracing state are in conflict. One of the options to explore: the stash should move the trace context from the current context to the new one (but that could cause other issues). I will be looking into that this week.
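
For illustration only, here is a hedged sketch of what that option could look like, assuming the span lives in the context as a transient under a hypothetical key (not the actual OpenSearch tracing key or API):

    // Hypothetical helper: stash the context but carry the trace context over.
    static ThreadContext.StoredContext stashKeepingTrace(ThreadContext threadContext, String spanKey) {
        Object currentSpan = threadContext.getTransient(spanKey);          // read the span before stashing
        ThreadContext.StoredContext stored = threadContext.stashContext(); // fresh, empty context
        if (currentSpan != null) {
            threadContext.putTransient(spanKey, currentSpan);              // re-attach it to the new context
        }
        return stored; // closing it still restores the pre-stash context as before
    }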


reta commented Oct 26, 2023

@Gaganjuneja it took me a lot of time, but I think I now clearly understand what is happening. The culprit is #10291: the local transport hands off the request using the thread pool, and at one of the places (still hunting for the exact one) it captures the span from the current thread and never cleans it up. That is what causes this issue to manifest.

The fix I made for now: please see #10873.

@Gaganjuneja (Contributor) commented:

@reta, thanks for putting this up. I am on vacation and will take a deeper look once back. I have taken a glance, and it looks like we need to remove the ThreadContext state in most of the cases, except headers.


reta commented Oct 26, 2023

> @reta, thanks for putting this up. I am on vacation and will take a deeper look once back. I have taken a glance, and it looks like we need to remove the ThreadContext state in most of the cases, except headers.

Thanks @Gaganjuneja, the thread context itself is not necessarily the problem (I think); its management is: it is based on thread locals, so we would do similar things anyway. I think it will become clearer when we pick up #10291.

@Gaganjuneja (Contributor) commented:

@reta, were you able to find the place where the span is not getting cleaned up? I want to understand better why you think the issue is caused by the local transport. If a similar hand-off happens in the non-local transport, won't we end up in the same situation (even after #10873) if we are still storing the span inside the ThreadContext?


reta commented Nov 9, 2023

> @reta, were you able to find the place where the span is not getting cleaned up? I want to understand better why you think the issue is caused by the local transport.

@Gaganjuneja I think I know the suspect (TransportService::sendLocalRequest), and it looks to me that, in the case of the local transport, the callbacks we expect to be called are not called (I suspect this is because of DirectResponseChannel, but I have left the investigation until #10291).

> If a similar hand-off happens in the non-local transport, won't we end up in the same situation (even after #10873) if we are still storing the span inside the ThreadContext?

I don't think this is a generic hand-off problem; it is really specific to the local transport (at least, I haven't seen any messed-up traces after the fix, though admittedly we are instrumenting less now).

@Gaganjuneja (Contributor) commented:

@reta, I dug deeper and found the issue. I still need to find the fix for it.

This is happening during index creation, particularly during createShard. The flow looks like this:

IndexShard::syncRetentionLeases -> RetentionLeaseSyncer::backgroundSync -> RetentionLeaseBackgroundSyncAction::backgroundSync

Here,

final ThreadContext threadContext = threadPool.getThreadContext();

  1. Stashes the ThreadContext.
  2. Executes the TransportAction (backgroundSyncAction), which attaches/detaches the span and clears up the ThreadContext.
  3. The stashed context's restore() brings the previous context back, and the span is back in the ThreadContext.
  4. After this there is a call to AbstractAsyncTask::rescheduleIfNecessary from the code path below.

Here it schedules the task:

cancellable = threadPool.schedule(this, interval, getThreadPool());

The scheduled task gets executed in the current thread's context, which copies the stale span and keeps it. These scheduled tasks run at a fixed interval and keep using the stale span as their parent span.

public ScheduledCancellable schedule(Runnable command, TimeValue delay, String executor) {

I think this is a specific case where the hand-off is not working as expected, but we should definitely handle this in the framework itself. Looking forward to your thoughts on this.
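
The mechanism is easy to reproduce outside OpenSearch: a plain ThreadLocal stands in for the thread context, and a periodic task captures it once at scheduling time. A minimal, self-contained sketch (not the actual OpenSearch code):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class StaleContextDemo {
        // Stand-in for the thread-context transient holding the active span.
        private static final ThreadLocal<String> CURRENT_SPAN = new ThreadLocal<>();

        public static void main(String[] args) throws Exception {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

            // The "create index" request thread still has its span when the task is scheduled.
            CURRENT_SPAN.set("span-of-create-index-request");
            String captured = CURRENT_SPAN.get(); // captured once, like the stored thread context

            scheduler.scheduleAtFixedRate(() -> {
                // Every periodic run replays the captured context, so the long-finished
                // request span keeps showing up as the parent of new child spans.
                CURRENT_SPAN.set(captured);
                System.out.println("background sync runs under parent: " + CURRENT_SPAN.get());
            }, 0, 1, TimeUnit.SECONDS);

            Thread.sleep(3_000);
            scheduler.shutdownNow();
        }
    }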


reta commented Nov 15, 2023

Thanks @Gaganjuneja

> @reta, I dug deeper and found the issue. I still need to find the fix for it.

I believe you are seeing the consequences of the problem: the RetentionLeaseBackgroundSyncAction is, as its name says, a background task run periodically. This task should not see any traces (in other words, the ThreadContext it stashes should have no spans).

Have you tried to see what happens when using the code from #10873?

@Gaganjuneja (Contributor) commented:

RetentionLeaseBackgroundSyncAction is started by a createShard call, and right at the start it takes the calling thread's context and stores it for all further scheduled executions. The following code creates the AsyncRetentionLeaseSyncTask, which internally schedules this action, and at that point the current thread's context holds the span from the incoming indexing request.

this.retentionLeaseSyncTask = new AsyncRetentionLeaseSyncTask(this);

> Have you tried to see what happens when using the code from #10873?

Yes, I tried this fix. With it the issue is not visible, but if I debug the ThreadContext state, it still holds the stale state; since we are no longer instrumenting the local transport, it is simply not visible.


reta commented Nov 16, 2023

> Yes, I tried this fix. With it the issue is not visible, but if I debug the ThreadContext state, it still holds the stale state; since we are no longer instrumenting the local transport, it is simply not visible.

Cool, thank you. So I think we could merge it (since at least it does not mess up the traces) and work on the proper fix as part of #10291; right now the feature is unusable.

@Gaganjuneja (Contributor) commented:

Yes, in the meantime we can go ahead with #10873, as it contains some good code refactoring and isolation.
