feat: Adding queue latency metrics for ActiveJob #414

Conversation

jdlubrano commented Apr 13, 2023

Hello 👋 . I'm looking for feedback on whether OpenTelemetry would be willing to support some metrics for ActiveJobs that I have found useful throughout my career. If the maintainers have no interest in supporting these or think they are generally a bad idea, then no hard feelings ☺️ .

I think the best way that I can explain these metrics is through a scenario. So, suppose a User needs to reset their password. Typically they click a link or button or whatever and eventually a job gets enqueued to send the User an email to reset their password. Now, we don't want our User waiting around very long to receive that email because that's an awfully frustrating user experience. Typically my teams have come up with SLOs like, "We should send a User a reset password email within 5 minutes of them requesting one."

Sometimes things can go wrong when we try to send an email. For example:

  • Our email provider could be having a bad day and our job may need to get retried a few times
  • Our job queue could be backed up such that our job spends a lot of time waiting for a worker to pick it up

(Note: It can also take an email service provider an unexpectedly long time to send an email even after our application submits a request, but I don't know of a reasonable way to measure that...I consider that to be well outside of the scope of this PR 😀.)

So, essentially, when the User requests a password reset, how long does it take for our system to initiate that email? There's a (hopefully small) amount of time for the User's request to reach our application server, there's a (hopefully negligible) amount of time for the background job to get enqueued, there's the time the job spends enqueued, the time the job spends waiting to be retried (if necessary), and the time it takes to run the job. The metrics introduced here allow us to measure the last three legs of that journey.

For my own use, I'm trying to use the attributes introduced in this PR to enable SLOs around enqueued_and_retry_duration_ms + duration_ms for my ActiveJobs. I'm curious how much or how little of that would be of interest to the OTel community.
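
To make that concrete with illustrative numbers (not taken from this PR): suppose a job is first enqueued at 12:00:00, the first attempt is picked up at 12:00:10 and fails after 1 second, and the retry (re-enqueued with wait: 0) is picked up at 12:00:30 and succeeds in 1 second. For the retry's perform span, measured when that span starts:

enqueued_duration_ms           ≈ 19,000   (re-enqueued at ~12:00:11, picked up at 12:00:30)
first_enqueued_at              = 12:00:00
enqueued_and_retry_duration_ms ≈ 30,000   (12:00:00 → 12:00:30)
duration_ms                    ≈  1,000   (the retry's own execution)

SLO quantity: enqueued_and_retry_duration_ms + duration_ms ≈ 31 seconds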

Thank you in advance for any feedback that you have!

Shortcomings

  • Elapsed times captured in this PR rely on Ruby's Time.now, which is neither guaranteed to be monotonic nor guaranteed to be consistent between web and worker processes.

Example Trace Attributes from Honeycomb

[image: example trace attributes shown in Honeycomb]

(Please note that the messaging.active_job.enqueue_to_performed_duration_ms is a derived column and is not included in this PR. It can be if that's of interest!)

linux-foundation-easycla bot commented Apr 13, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@@ -81,6 +86,10 @@ def job_attributes(job)
otel_attributes.compact
end

def milliseconds_since(timestamp)
jdlubrano (Author)

I'm a little concerned about "polluting" the ActiveJob method namespace with this method, so I'm open to alternatives. It would also seem okay to me to inline this logic within the job_attributes method 🤷‍♂️
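
For reference, a minimal sketch of what such a helper might look like; the hunk above only shows the signature, so the body below is an illustrative guess rather than the PR's actual code:

# Illustrative only; assumes `timestamp` is a Time (or nil), e.g. job.enqueued_at.
# Uses wall-clock Time.now, which is the non-monotonic behavior noted under "Shortcomings".
def milliseconds_since(timestamp)
  return nil if timestamp.nil?

  ((Time.now - timestamp) * 1000).round
end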

Comment on lines +29 to +32
retry_on StandardError, wait: 0, attempts: 2

def perform
sleep(0.01)
jdlubrano (Author)

I initially tried setting wait on this job, but then the retry attempt was not running in the tests. I think that makes sense, as the retry would go back into the queue. I don't love throwing a sleep in here, and I am open to suggestions, but this seemed like the simplest way to simulate the passage of time for the first_enqueued_at tests.
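
For readers without the full diff, the test job under discussion presumably looks something like the sketch below; only the retry_on line and the sleep appear in the quoted hunk, and the fail-once behavior is an assumption added so the retry path is exercised:

# Hypothetical reconstruction; not the PR's exact test job.
class RetryJob < ActiveJob::Base
  retry_on StandardError, wait: 0, attempts: 2

  def perform
    sleep(0.01) # simulate the passage of time between enqueue and execution
    raise StandardError, 'fail the first attempt' if executions < 2
  end
end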

Comment on lines +77 to +79
'messaging.active_job.enqueued_duration_ms' => milliseconds_since(job.enqueued_at),
'messaging.active_job.first_enqueued_at' => job.first_enqueued_at,
'messaging.active_job.enqueued_and_retry_duration_ms' => milliseconds_since(job.first_enqueued_at),
jdlubrano (Author)

I have no strong opinion on the naming of enqueued_duration_ms or enqueued_and_retry_duration_ms. My other thought was enqueued_wait_ms and enqueued_and_retry_wait_ms. I chose duration because OTel already captures a duration_ms metric for spans, so I opted for consistency with the naming here.

@@ -72,6 +73,10 @@ def job_attributes(job)
'messaging.destination' => job.queue_name,
'messaging.message_id' => job.job_id,
'messaging.active_job.provider_job_id' => job.provider_job_id,
'messaging.active_job.enqueued_at' => job.enqueued_at,
'messaging.active_job.enqueued_duration_ms' => milliseconds_since(job.enqueued_at),
jdlubrano (Author)

The time that the job spent enqueued before this execution.

'messaging.active_job.enqueued_at' => job.enqueued_at,
'messaging.active_job.enqueued_duration_ms' => milliseconds_since(job.enqueued_at),
'messaging.active_job.first_enqueued_at' => job.first_enqueued_at,
'messaging.active_job.enqueued_and_retry_duration_ms' => milliseconds_since(job.first_enqueued_at),
jdlubrano (Author), Apr 14, 2023

The time since the job was first enqueued. This captures:

  • Time spent enqueued
  • Time spent waiting to be retried
  • Duration of prior executions

This should not capture:

  • Time elapsed because the Job was enqueued with a wait or wait_until parameter
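
(first_enqueued_at is not a stock ActiveJob attribute, so this PR presumably adds it. One way to carry it across retries, sketched purely for illustration and not necessarily how this PR does it, is to persist it in the serialized job data:)

require 'time' # for Time#iso8601 / Time.iso8601 outside of Rails

# Hypothetical sketch, intended as a Rails initializer; module name and mechanism are assumptions.
module FirstEnqueuedAt
  attr_accessor :first_enqueued_at

  def serialize
    # On the first enqueue there is no value yet; on retries the deserialized value is kept.
    super.merge('first_enqueued_at' => (first_enqueued_at || Time.now.utc).iso8601(9))
  end

  def deserialize(job_data)
    super
    self.first_enqueued_at = Time.iso8601(job_data['first_enqueued_at']) if job_data['first_enqueued_at']
  end
end

ActiveJob::Base.prepend(FirstEnqueuedAt)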

arielvalentin (Collaborator) left a comment

@jdlubrano Thank you for taking the time to contribute to OTel Ruby instrumentation.

We are not currently accepting instrumentation changes that add metric data as span attributes. Normally I would recommend that we capture these as OTel Metrics, but our implementation is incomplete at this time.

Some options that I am able to recommend:

  • Represent latency metrics as spans or span events. It would be symmetric with how we represent this data in other instrumentations, e.g. Sidekiq [1]. The challenge here, I think, is that there are few established semantics around representing intermediaries, e.g. retries or time spent in the queue: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md
  • Enrich or create custom spans: trickier to do, but in your ApplicationJob code perhaps you can grab the current span and add these attributes. The challenge I see here is that your custom implementation may not be compatible with new features your vendor may build around intermediary messaging semconv and instrumentation.
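
(A rough sketch of that second option, assuming the ActiveJob instrumentation's perform span is the current span while the job runs; the attribute name and handling here are illustrative, not part of this PR or any semantic convention:)

require 'time'

# Hypothetical ApplicationJob enrichment; not an official instrumentation feature.
class ApplicationJob < ActiveJob::Base
  around_perform do |job, block|
    span = OpenTelemetry::Trace.current_span
    enqueued = job.enqueued_at
    enqueued = Time.iso8601(enqueued) if enqueued.is_a?(String) # older Rails serializes this as a String
    if span.recording? && enqueued
      span.set_attribute('messaging.active_job.enqueued_duration_ms', ((Time.now - enqueued) * 1000).round)
    end
    block.call
  end
end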

Footnotes

  1. https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/sidekiq/lib/opentelemetry/instrumentation/sidekiq/middlewares/server/tracer_middleware.rb#L44

jdlubrano (Author)

Thanks for the suggestions, @arielvalentin! Closing this to pursue the options that you recommended within my own application 👍

jdlubrano closed this Apr 14, 2023
arielvalentin (Collaborator)

Thank you for understanding. I really hope this does not discourage you from submitting future contributions!

stevenharman (Contributor)

Firstly, thank you both for this discussion. I'm once again on my semi-regular journey to figure out a semantic and consistent way to extract latency data from traces for background jobs/messaging. 😆

Regarding the first alternative proposed above: latency by way of Span Events. That is currently how Sidekiq works, but it still might not be enough to be usable for SLOs, triggers, etc… depending on your OTel vendor. I haven't checked in a while, but to my knowledge, Honeycomb, for example, cannot extract those event timestamps into something you can use to calculate latency (e.g., a derived column).

Given that, I'm now considering the second option you proposed, @arielvalentin, wherein we enrich the current span (in Sidekiq-land, via a Server Middleware) to add the attributes ourselves. I agree that a downside of this is drift between our custom attributes/code and potential features our vendor (i.e., Honeycomb) might build around the messaging Semantic Convention. But that seems a worthy tradeoff given that those features don't exist right now. If they ever do show up, we could adapt our custom instrumentation to support them, or remove it in favor of whatever SemConv support eventually lands.

The other downside, of course, is that we're all left to build this bespoke custom attributes machinery ourselves. Which might be preferable to an opentelemetry-ruby-contrib-contrib Gem. 😄
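
For concreteness, a sketch of that middleware approach, assuming recent Sidekiq's Sidekiq::ServerMiddleware API, that job['enqueued_at'] is an epoch-seconds Float (as in Sidekiq 6/7), and that the OTel Sidekiq process span is current when this middleware runs (which depends on middleware ordering); the attribute name is made up for illustration:

# Hypothetical server middleware; not an official OTel instrumentation.
class JobLatencyMiddleware
  include Sidekiq::ServerMiddleware # optional helper module in recent Sidekiq versions

  def call(_worker, job, _queue)
    span = OpenTelemetry::Trace.current_span
    if span.recording? && job['enqueued_at']
      latency_ms = ((Time.now.to_f - job['enqueued_at']) * 1000).round
      span.set_attribute('messaging.sidekiq.job.latency_ms', latency_ms)
    end
    yield
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add JobLatencyMiddleware
  end
end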

robbkidd (Member) commented Aug 24, 2023

I just fiddled with using OTel Collector Contrib (v0.83.0) to hoist times from these span events onto the parent span with the transform processor. Maybe more experimentation can happen there to see what fits and to inform suggestions for semantic conventions around jobs and/or messaging.

It went something like this:

A Rails app with these bits inside:

class HomeController < ApplicationController
  def home
    HardJob.perform_in(2.seconds) if params[:run_job] == 'true'
  end
end

class HardJob
  include Sidekiq::Job

  def perform(*args); sleep(0.03); end
end

Run ApacheBench to abuse it:

❯ ab -n 10000 -c 20 "http://localhost:3000/?run_job=true"

OTel Collector Contrib running with …

# collector config snippet
processors:
  
  transform:
    error_mode: ignore
    trace_statements:
      - context: spanevent
        statements:
          - set(span.attributes["job.created_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "created_at"
          - set(span.attributes["job.enqueued_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
          - set(span.attributes["job.latency_ms"], ((span.start_time_unix_nano - time_unix_nano) / 1000000)) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
👈🏻 (and if you'd like to remove those span events after getting what data you want off of them:)
processors:
  batch:
  transform/joblatency:
    error_mode: ignore
    trace_statements:
      - context: spanevent
        statements:
          - set(span.attributes["job.created_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "created_at"
          - set(span.attributes["job.enqueued_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
          - set(span.attributes["job.latency_ms"], ((span.start_time_unix_nano - time_unix_nano) / 1000000)) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
  filter/dropspanevents:
    error_mode: ignore
    traces:
      spanevent:
        - 'IsMatch(instrumentation_scope.name, "Sidekiq") and name == "created_at"'
        - 'IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"'

service:
  pipelines:
    traces:
      receivers: […] 
      processors: [transform/joblatency, filter/dropspanevents, batch]
      exporters: […]

🎉

[screenshot of the result]
