
SMTP worker can become slow when too many invalid emails are queued up #402

Closed
koenmtb1 opened this issue May 10, 2020 · 12 comments
Labels: feat New feature or request. good first issue A good issue to tackle when being a novice to the project.
@koenmtb1
Contributor

Describe the bug
When the server is unable to connect to the SMTP server, it will spam the error log and use a lot of resources by retrying very frequently.

Reproducing the bug

Steps to reproduce the behavior:

  1. Set up Kratos with incorrect SMTP details.
  2. Add the verify hook to after registration.
  3. Do a registration through the self-service API.

Server logs
(Sensitive details have been changed)

5/10/2020 2:47:00 PM time="2020-05-10T12:47:00Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:01 PM time="2020-05-10T12:47:01Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:02 PM time="2020-05-10T12:47:02Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:03 PM time="2020-05-10T12:47:03Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:04 PM time="2020-05-10T12:47:04Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:05 PM time="2020-05-10T12:47:05Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true

This continues on and on.

Server configuration

dsn: OMITTED
identity:
  traits:
    default_schema_url: file://home/ory/user.schema.json
selfservice:
  strategies:
    password:
      enabled: true
  logout:
    redirect_to: OMITTED
  settings:
    request_lifespan: 1h
    privileged_session_max_age: 1h
    after:
      password:
        default_return_to: OMITTED
      profile:
        default_return_to: OMITTED
  verify:
    request_lifespan: 1h
    link_lifespan: 24h
  login:
    request_lifespan: 1h
    after:
      default_return_to: OMITTED
  registration:
    request_lifespan: 10m
    after:
      default_return_to: OMITTED
      password:
        hooks:
        - hook: verify
        - hook: session
courier:
  smtp:
    connection_uri: OMITTED
    from_address: OMITTED
serve:
  admin:
    host: 0.0.0.0
    port: 4434
  public:
    host: 0.0.0.0
    port: 4433
urls:
  self:
    public: OMITTED
    admin: OMITTED
  mfa_ui: OMITTED
  login_ui: OMITTED
  settings_ui: OMITTED
  registration_ui: OMITTED
  error_ui: OMITTED
  verify_ui: OMITTED
  default_return_to: OMITTED
  whitelisted_return_to_urls: OMITTED
secrets:
  session:
    - OMITTED
hashers:
  argon2:
    OMITTED

Expected behavior
It should retry a few times, increasing the delay between attempts each time, and perhaps give up after a few tries. The administrator should somehow be notified that it's not working correctly, but without spamming the console.

Environment

  • Version: v0.2.1-alpha.1
  • Environment: Kubernetes using oryd/kratos:latest docker image.

Additional context

@aeneasr
Member

aeneasr commented May 10, 2020

Yeah I think we could introduce exponential backoff here!

@aeneasr aeneasr added this to the v0.5.0-alpha.1 milestone May 25, 2020
@aeneasr aeneasr added feat New feature or request. good first issue A good issue to tackle when being a novice to the project. labels May 25, 2020
@jakhog
Contributor

jakhog commented Jul 12, 2020

I started working on this, and enabling the already in-place exponential back-off when Kratos is not able to connect to the SMTP server is fairly easy. However, I started thinking about other per-message errors (not sure what they would be, maybe malformed emails or too-large bodies), and started playing with adding an exponential backoff to each message, storing state in the DB. Then I suddenly realised that Kratos is supposed to scale horizontally by sharing the DB, and that this could cause all sorts of race conditions and strange behaviour.

So, my question becomes: is it necessary to have a courier service in Kratos that queues outgoing messages before sending them to a 3rd party service?

With my current understanding, it would make more sense for the service to send it directly to the 3rd party service, and if it fails - return the error back to the flow that triggered it and subsequently fail the flow. This would probably also improve the UX if sending an email fails (right now the user would not know that no email was sent).

If the purpose of the courier queue is to keep track of the messages that were sent, most services (e.g. SendGrid) offer this feature by default. We could of course keep the log in the DB for audit purposes as well, if that is of interest.

@aeneasr
Member

aeneasr commented Jul 14, 2020

Thank you for your thoughtful comment and for working towards a contribution!

With my current understanding, it would make more sense for the service to send it directly to the 3rd party service, and if it fails - return the error back to the flow that triggered it and subsequently fail the flow. This would probably also improve the UX if sending an email fails (right now the user would not know that no email was sent).

Message delivery is asynchronous; waiting for message delivery in the UX is bad because we might end up waiting 60 seconds or more to get an acknowledgement that the message was delivered (or we will never know).

Because we don't want to depend on other 3rd party services such as RabbitMQ, to keep things easy we'll have to deal with this in Kratos (which is already the case). I don't think there are possible race conditions because we use transactions. In the rare case where a race condition might still occur (e.g. a simultaneous, down-to-the-nanosecond SELECT) we'll just end up sending two emails, which isn't great, but it's also not a dealbreaker!

@jakhog
Contributor

jakhog commented Jul 14, 2020

Oh ok, I didn't know that SMTP message posting was that slow. But then it makes total sense to do it in the background.

However, if the sending takes more than a second, then we should expect that all Kratos instances send a copy of the same email, since they are polling the same queue every second, and the message is left as queued until the send operation is finished. The occasional double email isn't that bad, but consistently sending duplicates looks a bit strange to an end user. I guess the easiest solution for now is just to document that only one of the instances should be configured with the SMTP courier, and it's up to the admin to make sure of that? Then in the future one could look at some sort of leader election using the DB so that the config can be the same for all instances.

Another problem, however, is if sending fails for a particular message - so not the unable-to-connect failure, but a message that fails for whatever reason (email too big?). In the current implementation, that message would be retried every second, which spams the logs. The bigger problem though is that if 8 of these un-sendable messages end up at the front of the queue, Kratos would never send any more emails, always retrying those 8 every time. I have a few suggestions for a fix (or a combination):

  1. Store retry state per message, with a backoff strategy and retry count, to reduce the log spam and give up at some point.
  2. Continue fetching groups of 8 messages until there are no more in the queue (essentially paging the queue).
  3. Mark messages that are rejected by the SMTP server as failed, and never retry.

The third one is definitely the easiest, and I don't know if there is that much point in retrying per email?
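Suggestions 1 and 3 can be combined into simple per-message bookkeeping. A hypothetical sketch (type and field names are made up for illustration, not Kratos' actual schema): each failed send increments a counter, and once the counter reaches a limit the message is marked abandoned and stops blocking the queue.

```go
package main

import "fmt"

type Status int

const (
	StatusQueued Status = iota
	StatusAbandoned
)

type Message struct {
	ID      int
	Status  Status
	Retries int
}

const maxRetries = 5 // assumed limit, for illustration

// recordFailure updates a message after a failed send attempt:
// bump the retry counter, and give up once the limit is reached.
func recordFailure(m *Message) {
	m.Retries++
	if m.Retries >= maxRetries {
		m.Status = StatusAbandoned // never retried again
	}
}

func main() {
	m := &Message{ID: 1, Status: StatusQueued}
	for i := 0; i < maxRetries; i++ {
		recordFailure(m)
	}
	fmt.Println(m.Status == StatusAbandoned, m.Retries) // true 5
}
```

In a real deployment this state would live in the messages table so that every instance sees the same counter.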

@aeneasr
Member

aeneasr commented Jul 15, 2020

However, if the sending takes more than a second, then we should expect that all Kratos instances send a copy of the same email, since they are polling the same queue every second, and the message is left as queued until the send operation is finished

Nice find! This is definitely an issue. I think that we originally planned to have this as its own sub-command (e.g. kratos courier watch) but we abandoned that idea in favor of a simpler set-up. In that scenario, you would have only one cron job that watches the queue, but that's definitely no longer the case.

In my opinion, we should have three message states:

  1. queued
  2. processing (goes back to queue on error, goes to sent on success)
  3. sent

In the future, we might also keep track of errors and put the message in a permanent error state if continued delivery fails.
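The point of the intermediate processing state is that a worker claims a message before sending it, so a second instance polling the same queue skips it. A rough in-memory illustration of those state transitions (the real implementation would claim messages inside a database transaction; all names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

type Status int

const (
	Queued Status = iota
	Processing
	Sent
)

// Queue is a toy stand-in for the messages table; the mutex plays
// the role the database transaction would play in practice.
type Queue struct {
	mu     sync.Mutex
	status map[int]Status
}

// Claim atomically moves a queued message to processing.
// It returns false if another worker already claimed it.
func (q *Queue) Claim(id int) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.status[id] != Queued {
		return false
	}
	q.status[id] = Processing
	return true
}

// Finish moves a processing message to sent on success,
// or back to queued on error (as described above).
func (q *Queue) Finish(id int, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if ok {
		q.status[id] = Sent
	} else {
		q.status[id] = Queued
	}
}

func main() {
	q := &Queue{status: map[int]Status{1: Queued}}
	first := q.Claim(1)
	second := q.Claim(1) // a second worker loses the race
	fmt.Println(first, second) // true false
	q.Finish(1, true)
	fmt.Println(q.status[1] == Sent) // true
}
```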

@ro-tex

ro-tex commented Jan 13, 2021

Are there any workarounds for this issue while we wait for a permanent fix?

We are looking to start using Kratos in production very soon but we have 8 machines right now and we're looking to double that in the next 6 months. Sending more than a dozen duplicates of each email is not really an option for us.

We are considering removing the email credentials from all but one machine. This will make it a single point of failure, so it's definitely not a great choice but it might just make it possible for us to operate. Can you advise on whether that's a viable strategy? I am not familiar with the inner workings of Kratos, so it's hard for me to anticipate the possible problems this might create e.g. multiple nodes failing to send the email and marking it as "bad" in the DB, so it never gets sent, etc.

@aeneasr
Member

aeneasr commented Jan 14, 2021

The best thing would be to create a PR with a fix! That way you have a long-term solution, help out the community, and don't spend time on hacks which potentially mess up your system.

My idea here was to remove the background worker and create a new CLI command called e.g. kratos watch courier or kratos worker which basically executes the linked function in the foreground.

This would then need to be properly documented in several places - e.g. quickstart, guides, concepts. We would also need to update the docker-compose example.

The last thing is: if DSN=memory, the mail server should actually run as part of kratos serve (as it does today), and kratos watch courier should refuse to start in that case. This is because the data is kept in the kratos process memory, which other processes don't have access to!
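A minimal sketch of that guard (function name and error wording are assumptions, not Kratos' actual implementation): with DSN=memory the queue lives inside the kratos serve process, so a standalone worker process could never see the queued messages and must refuse to start.

```go
package main

import (
	"errors"
	"fmt"
)

// validateCourierDSN is a hypothetical startup check for a
// standalone courier worker command: reject in-memory storage,
// because its queue is invisible to other processes.
func validateCourierDSN(dsn string) error {
	if dsn == "memory" {
		return errors.New("refusing to start the courier worker: DSN=memory keeps the message queue in-process")
	}
	return nil
}

func main() {
	fmt.Println(validateCourierDSN("memory") != nil) // true: rejected
	fmt.Println(validateCourierDSN("postgres://localhost/kratos") == nil) // true: allowed
}
```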

@aeneasr aeneasr changed the title SMTP Error spams the server logs SMTP worker can become slow when too many invalid emails are queued up Jul 21, 2022
@aeneasr
Member

aeneasr commented Jul 21, 2022

@jonas-jonas this might be a good next issue for you, let's discuss it when you're in Munich :)

@aeneasr
Member

aeneasr commented Jul 21, 2022

See also #1598

@jonas-jonas
Member

@aeneasr sounds great, really looking forward to that!

@jonas-jonas
Member

I think this issue is already solved, coincidentally by me, 5 months ago: #2257

The PR introduced a new configuration key (courier.message_ttl), which sets a "time-to-live" for failed (read: unsent) messages, so that they are removed from the queue after the specified time.
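As a hedged sketch of what that key might look like in the configuration from this issue (the duration format is an assumption, following the lifespan settings used elsewhere in the config above):

```yaml
courier:
  smtp:
    connection_uri: OMITTED
    from_address: OMITTED
  # Assumed value format: a duration string, like the
  # request_lifespan settings elsewhere in this config.
  message_ttl: 24h
```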

An endpoint to query these could be added by querying all messages with the status MessageStatusAbandoned. Should we do that in a follow-up ticket?

@aeneasr
Member

aeneasr commented Aug 1, 2022

Oh, indeed! I'm wondering what a good value would be for message_ttl? Could we also use a retry count?

Also, it seems like this setting is currently not used in Ory Cloud - if you search for the package "dogfood" in the backoffice you will find the kratos configuration - could you please set a reasonable message TTL there?

@aeneasr aeneasr closed this as completed in b341756 Aug 4, 2022
peturgeorgievv pushed a commit to senteca/kratos-fork that referenced this issue Jun 30, 2023
This PR replaces the `courier.message_ttl` configuration option with a `courier.message_retries` option that limits how often the sending of a message is retried before it is marked as `abandoned`.

BREAKING CHANGES: This is a breaking change, as it removes the `courier.message_ttl` config key and replaces it with a counter `courier.message_retries`.

Closes ory#402
Closes ory#1598