feat: Adding queue latency metrics for ActiveJob #414

Conversation

jdlubrano commented Apr 13, 2023

Hello 👋 . I'm looking for feedback on whether OpenTelemetry would be willing to support some metrics for ActiveJobs that I have found useful throughout my career. If the maintainers have no interest in supporting these or think they are generally a bad idea, then no hard feelings ☺️ .

I think the best way that I can explain these metrics is through a scenario. So, suppose a User needs to reset their password. Typically they click a link or button or whatever and eventually a job gets enqueued to send the User an email to reset their password. Now, we don't want our User waiting around very long to receive that email because that's an awfully frustrating user experience. Typically my teams have come up with SLOs like, "We should send a User a reset password email within 5 minutes of them requesting one."

Sometimes things can go wrong when we try to send an email. For example:

  • Our email provider could be having a bad day and our job may need to get retried a few times
  • Our job queue could be backed up such that our job spends a lot of time waiting for a worker to pick it up

(Note: It can also take an email service provider an unexpectedly long time to send an email even after our application submits a request, but I don't know of a reasonable way to measure that...I consider that to be well outside of the scope of this PR 😀.)

So, essentially, when the User requests a password reset, how long does it take for our system to initiate that email? There's a (hopefully small) amount of time for the User's request to reach our application server, there's a (hopefully negligible) amount of time for the background job to get enqueued, there's the time the job spends enqueued, the time the job spends waiting to be retried (if necessary), and the time it takes to run the job. The metrics introduced here allow us to measure the last three legs of that journey.

For my own use, I'm trying to use the attributes introduced in this PR to enable SLOs around enqueued_and_retry_duration_ms + duration_ms for my ActiveJobs. I'm curious how much or how little of that would be of interest to the OTel community.
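
To make that concrete with illustrative numbers (not taken from this PR): suppose a job is first enqueued at 12:00:00, the first attempt is picked up at 12:00:10 and fails after 1 second, and the retry (re-enqueued with wait: 0) is picked up at 12:00:30 and succeeds in 1 second. For the retry's perform span, measured when that span starts:

enqueued_duration_ms           ≈ 19,000   (re-enqueued at ~12:00:11, picked up at 12:00:30)
first_enqueued_at              = 12:00:00
enqueued_and_retry_duration_ms ≈ 30,000   (12:00:00 → 12:00:30)
duration_ms                    ≈  1,000   (the retry's own execution)

SLO quantity: enqueued_and_retry_duration_ms + duration_ms ≈ 31 seconds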

Thank you in advance for any feedback that you have!

Shortcomings

  • Elapsed times captured in this PR rely on Ruby's Time.now, which is neither guaranteed to be monotonic nor guaranteed to be consistent between web and worker processes.

Example Trace Attributes from Honeycomb

[image: example trace attributes shown in Honeycomb]

(Please note that the messaging.active_job.enqueue_to_performed_duration_ms is a derived column and is not included in this PR. It can be if that's of interest!)

linux-foundation-easycla bot commented Apr 13, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@@ -81,6 +86,10 @@ def job_attributes(job)
otel_attributes.compact
end

def milliseconds_since(timestamp)
jdlubrano (Author)

I'm a little concerned about "polluting" the ActiveJob method namespace with this method, so I'm open to alternatives. It would also seem okay to me to inline this logic within the job_attributes method 🤷‍♂️
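
For reference, a minimal sketch of what such a helper might look like; the hunk above only shows the signature, so the body below is an illustrative guess rather than the PR's actual code:

# Illustrative only; assumes `timestamp` is a Time (or nil), e.g. job.enqueued_at.
# Uses wall-clock Time.now, which is the non-monotonic behavior noted under "Shortcomings".
def milliseconds_since(timestamp)
  return nil if timestamp.nil?

  ((Time.now - timestamp) * 1000).round
end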

Comment on lines +29 to +32
retry_on StandardError, wait: 0, attempts: 2

def perform
sleep(0.01)
jdlubrano (Author)

I initially tried setting wait on this job, but then the retry attempt was not running in the tests. I think that makes sense, as the retry would go back into the queue. I don't love throwing a sleep in here, and I am open to suggestions, but this seemed like the simplest way to simulate the passage of time for the first_enqueued_at tests.
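
For readers without the full diff, the test job under discussion presumably looks something like the sketch below; only the retry_on line and the sleep appear in the quoted hunk, and the fail-once behavior is an assumption added so the retry path is exercised:

# Hypothetical reconstruction; not the PR's exact test job.
class RetryJob < ActiveJob::Base
  retry_on StandardError, wait: 0, attempts: 2

  def perform
    sleep(0.01) # simulate the passage of time between enqueue and execution
    raise StandardError, 'fail the first attempt' if executions < 2
  end
end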

Comment on lines +77 to +79
'messaging.active_job.enqueued_duration_ms' => milliseconds_since(job.enqueued_at),
'messaging.active_job.first_enqueued_at' => job.first_enqueued_at,
'messaging.active_job.enqueued_and_retry_duration_ms' => milliseconds_since(job.first_enqueued_at),
jdlubrano (Author)

I have no strong opinion on the naming of enqueued_duration_ms or enqueued_and_retry_duration_ms. My other thought was enqueued_wait_ms and enqueued_and_retry_wait_ms. I chose duration because OTel already captures a duration_ms metric for spans, so I opted for consistency with the naming here.

@@ -72,6 +73,10 @@ def job_attributes(job)
'messaging.destination' => job.queue_name,
'messaging.message_id' => job.job_id,
'messaging.active_job.provider_job_id' => job.provider_job_id,
'messaging.active_job.enqueued_at' => job.enqueued_at,
'messaging.active_job.enqueued_duration_ms' => milliseconds_since(job.enqueued_at),
jdlubrano (Author)

The time that the job spent enqueued before this execution.

'messaging.active_job.enqueued_at' => job.enqueued_at,
'messaging.active_job.enqueued_duration_ms' => milliseconds_since(job.enqueued_at),
'messaging.active_job.first_enqueued_at' => job.first_enqueued_at,
'messaging.active_job.enqueued_and_retry_duration_ms' => milliseconds_since(job.first_enqueued_at),
jdlubrano (Author), Apr 14, 2023

The time since the job was first enqueued. This captures:

  • Time spent enqueued
  • Time spent waiting to be retried
  • Duration of prior executions

This should not capture:

  • Time elapsed because the Job was enqueued with a wait or wait_until parameter
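
(first_enqueued_at is not a stock ActiveJob attribute, so this PR presumably adds it. One way to carry it across retries, sketched purely for illustration and not necessarily how this PR does it, is to persist it in the serialized job data:)

require 'time' # for Time#iso8601 / Time.iso8601 outside of Rails

# Hypothetical sketch, intended as a Rails initializer; module name and mechanism are assumptions.
module FirstEnqueuedAt
  attr_accessor :first_enqueued_at

  def serialize
    # On the first enqueue there is no value yet; on retries the deserialized value is kept.
    super.merge('first_enqueued_at' => (first_enqueued_at || Time.now.utc).iso8601(9))
  end

  def deserialize(job_data)
    super
    self.first_enqueued_at = Time.iso8601(job_data['first_enqueued_at']) if job_data['first_enqueued_at']
  end
end

ActiveJob::Base.prepend(FirstEnqueuedAt)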

arielvalentin (Collaborator) left a comment

@jdlubrano Thank you for taking the time to contribute to OTel Ruby instrumentation.

We are not currently accepting instrumentation changes that add metric data as span attributes. Normally I would recommend that we capture these as OTel Metrics, but our implementation is incomplete at this time.

Some options that I am able to recommend:

  • Represent latency metrics as spans or span events. It would be symmetric with how we represent this data in other instrumentations, e.g. Sidekiq [1]. The challenge here, I think, is that there are few established semantics around representing intermediaries, e.g. retries or time spent in the queue: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md
  • Enrich or create custom spans: trickier to do, but in your ApplicationJob code perhaps you can grab the current span and add these attributes. The challenge I see here is that your custom implementation may not be compatible with new features your vendor may build around intermediary messaging semconv and instrumentation.
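
(A rough sketch of that second option, assuming the ActiveJob instrumentation's perform span is the current span while the job runs; the attribute name and handling here are illustrative, not part of this PR or any semantic convention:)

require 'time'

# Hypothetical ApplicationJob enrichment; not an official instrumentation feature.
class ApplicationJob < ActiveJob::Base
  around_perform do |job, block|
    span = OpenTelemetry::Trace.current_span
    enqueued = job.enqueued_at
    enqueued = Time.iso8601(enqueued) if enqueued.is_a?(String) # older Rails serializes this as a String
    if span.recording? && enqueued
      span.set_attribute('messaging.active_job.enqueued_duration_ms', ((Time.now - enqueued) * 1000).round)
    end
    block.call
  end
end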

Footnotes

  1. https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/sidekiq/lib/opentelemetry/instrumentation/sidekiq/middlewares/server/tracer_middleware.rb#L44

jdlubrano (Author)

Thanks for the suggestions, @arielvalentin! Closing this to pursue the options that you recommended within my own application 👍

jdlubrano closed this Apr 14, 2023
arielvalentin (Collaborator)

Thank you for understanding. I really hope this does not discourage you from submitting future contributions!

stevenharman (Contributor)

Firstly, thank you both for this discussion. I'm once again on my semi-regular journey to figure out a semantic and consistent way to extract latency data from traces for background jobs/messaging. 😆

Regarding the first alternative proposed above: latency by way of Span Events. That is currently how Sidekiq works, but it still might not be enough to be usable for SLOs, triggers, etc… depending on your OTel vendor. I haven't checked in a while, but to my knowledge, Honeycomb, for example, cannot extract those event timestamps into something you can use to calculate latency (e.g., a derived column).

Given that, I'm now considering the second option you proposed, @arielvalentin, wherein we enrich the current span (in Sidekiq-land, via a Server Middleware) to add the attributes ourselves. I agree that a downside of this is drift between our custom attributes/code and potential features our vendor (i.e., Honeycomb) might build around the messaging Semantic Convention. But that seems a worthy tradeoff given that those features don't exist right now. If they ever do show up, we could adapt our custom instrumentation to support them, or remove it in favor of whatever SemConv support eventually lands.

The other downside, of course, is that we're all left to build this bespoke custom attributes machinery ourselves. Which might be preferable to an opentelemetry-ruby-contrib-contrib Gem. 😄
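
For concreteness, a sketch of that middleware approach, assuming recent Sidekiq's Sidekiq::ServerMiddleware API, that job['enqueued_at'] is an epoch-seconds Float (as in Sidekiq 6/7), and that the OTel Sidekiq process span is current when this middleware runs (which depends on middleware ordering); the attribute name is made up for illustration:

# Hypothetical server middleware; not an official OTel instrumentation.
class JobLatencyMiddleware
  include Sidekiq::ServerMiddleware # optional helper module in recent Sidekiq versions

  def call(_worker, job, _queue)
    span = OpenTelemetry::Trace.current_span
    if span.recording? && job['enqueued_at']
      latency_ms = ((Time.now.to_f - job['enqueued_at']) * 1000).round
      span.set_attribute('messaging.sidekiq.job.latency_ms', latency_ms)
    end
    yield
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add JobLatencyMiddleware
  end
end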

robbkidd (Member) commented Aug 24, 2023

I just fiddled with using OTel Collector Contrib (v0.83.0) to hoist times from these span events onto the parent span with the transform processor. Maybe more experimentation can happen there to see what fits and to inform suggestions for semantic conventions around jobs and/or messaging.

It went something like this:

A Rails app with these bits inside:

class HomeController < ApplicationController
  def home
    HardJob.perform_in(2.seconds) if params[:run_job] == 'true'
  end
end

class HardJob
  include Sidekiq::Job

  def perform(*args); sleep(0.03); end
end

Run ApacheBench to abuse it:

❯ ab -n 10000 -c 20 "http://localhost:3000/?run_job=true"

OTel Collector Contrib running with …

# collector config snippet
processors:
  
  transform:
    error_mode: ignore
    trace_statements:
      - context: spanevent
        statements:
          - set(span.attributes["job.created_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "created_at"
          - set(span.attributes["job.enqueued_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
          - set(span.attributes["job.latency_ms"], ((span.start_time_unix_nano - time_unix_nano) / 1000000)) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
👈🏻 (and if you'd like to remove those span events after getting what data you want off of them:)
processors:
  batch:
  transform/joblatency:
    error_mode: ignore
    trace_statements:
      - context: spanevent
        statements:
          - set(span.attributes["job.created_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "created_at"
          - set(span.attributes["job.enqueued_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
          - set(span.attributes["job.latency_ms"], ((span.start_time_unix_nano - time_unix_nano) / 1000000)) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
  filter/dropspanevents:
    error_mode: ignore
    traces:
      spanevent:
        - 'IsMatch(instrumentation_scope.name, "Sidekiq") and name == "created_at"'
        - 'IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"'

service:
  pipelines:
    traces:
      receivers: […] 
      processors: [transform/joblatency, filter/dropspanevents, batch]
      exporters: […]

🎉

[screenshot of the result]
