
SMTP worker can become slow when too many invalid emails are queued up #402

Closed
koenmtb1 opened this issue May 10, 2020 · 12 comments
Labels: feat New feature or request. good first issue A good issue to tackle when being a novice to the project.
@koenmtb1
Contributor

Describe the bug
When the server is unable to connect to the SMTP server, it will spam the error log and use a lot of resources by retrying very frequently.

Reproducing the bug

Steps to reproduce the behavior:

  1. Set up Kratos with incorrect SMTP details.
  2. Add the verify hook to after registration.
  3. Do a registration through the self-service API.

Server logs
(Sensitive details have been changed)

5/10/2020 2:47:00 PM time="2020-05-10T12:47:00Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:01 PM time="2020-05-10T12:47:01Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:02 PM time="2020-05-10T12:47:02Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:03 PM time="2020-05-10T12:47:03Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:04 PM time="2020-05-10T12:47:04Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:05 PM time="2020-05-10T12:47:05Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" [email protected] smtp_server="mail.example.com:465" smtp_ssl_enabled=true

This continues on and on.

Server configuration

dsn: OMITTED
identity:
  traits:
    default_schema_url: file://home/ory/user.schema.json
selfservice:
  strategies:
    password:
      enabled: true
  logout:
    redirect_to: OMITTED
  settings:
    request_lifespan: 1h
    privileged_session_max_age: 1h
    after:
      password:
        default_return_to: OMITTED
      profile:
        default_return_to: OMITTED
  verify:
    request_lifespan: 1h
    link_lifespan: 24h
  login:
    request_lifespan: 1h
    after:
      default_return_to: OMITTED
  registration:
    request_lifespan: 10m
    after:
      default_return_to: OMITTED
      password:
        hooks:
        - hook: verify
        - hook: session
courier:
  smtp:
    connection_uri: OMITTED
    from_address: OMITTED
serve:
  admin:
    host: 0.0.0.0
    port: 4434
  public:
    host: 0.0.0.0
    port: 4433
urls:
  self:
    public: OMITTED
    admin: OMITTED
  mfa_ui: OMITTED
  login_ui: OMITTED
  settings_ui: OMITTED
  registration_ui: OMITTED
  error_ui: OMITTED
  verify_ui: OMITTED
  default_return_to: OMITTED
  whitelisted_return_to_urls: OMITTED
secrets:
  session:
    - OMITTED
hashers:
  argon2:
    OMITTED

Expected behavior
It should retry a few times, increasing the delay between attempts each time, and perhaps give up after a few tries. The administrator should somehow be notified that it's not working correctly, but without spamming the console.

Environment

  • Version: v0.2.1-alpha.1
  • Environment: Kubernetes using oryd/kratos:latest docker image.

Additional context

@aeneasr
Member

aeneasr commented May 10, 2020

Yeah I think we could introduce exponential backoff here!

@aeneasr aeneasr added this to the v0.5.0-alpha.1 milestone May 25, 2020
@aeneasr aeneasr added feat New feature or request. good first issue A good issue to tackle when being a novice to the project. labels May 25, 2020
@jakhog
Contributor

jakhog commented Jul 12, 2020

I started working on this, and enabling the already in-place exponential back-off when Kratos is not able to connect to the SMTP server is fairly easy. However, I started thinking about other per-message errors (not sure what they would be, maybe malformed emails or too-large bodies), and started playing with adding an exponential backoff to each message, storing state in the DB. Then I suddenly realised that Kratos is supposed to scale horizontally by sharing the DB, and that this could cause all sorts of race conditions and strange behaviour.

So, my question becomes: is it necessary to have a courier service in Kratos that queues outgoing messages before sending them to a 3rd party service?

With my current understanding, it would make more sense for the service to send it directly to the 3rd party service, and if it fails - return the error back to the flow that triggered it and subsequently fail the flow. This would probably also improve the UX if sending an email fails (right now the user would not know that no email was sent).

If the purpose of the courier queue is to keep track of the messages that were sent, most services (e.g. SendGrid) offer this feature by default. We could of course keep the log in the DB for audit purposes as well, if that is of interest.

@aeneasr
Member

aeneasr commented Jul 14, 2020

Thank you for your thoughtful comment and for working towards a contribution!

With my current understanding, it would make more sense for the service to send it directly to the 3rd party service, and if it fails - return the error back to the flow that triggered it and subsequently fail the flow. This would probably also improve the UX if sending an email fails (right now the user would not know that no email was sent).

Message delivery is asynchronous; waiting for message delivery in the UX is bad because we might end up waiting 60 seconds or more to get an acknowledgement that the message was delivered (or we will never know).

Because we don't want to depend on other 3rd party services such as RabbitMQ, to keep things easy we'll have to deal with this in Kratos (which is already the case). I don't think there are possible race conditions because we use transactions. In the rare case where a race condition might still occur (e.g. a simultaneous, down-to-the-nanosecond SELECT) we'll just end up sending two emails, which isn't great, but it's also not a dealbreaker!

@jakhog
Contributor

jakhog commented Jul 14, 2020

Oh ok, I didn't know that SMTP message posting was that slow. But then it makes total sense to do it in the background.

However, if the sending takes more than a second, then we should expect that all Kratos instances send a copy of the same email, since they are polling the same queue every second, and the message is left as queued until the send operation is finished. The occasional double email isn't that bad, but consistently sending duplicates looks a bit strange to an end user. I guess the easiest solution for now is just to document that only one of the instances should be configured with the SMTP courier, and it's up to the admin to make sure of that? Then in the future one could look at some sort of leader election using the DB so that the config can be the same for all instances.

Another problem, however, is if sending fails for a particular message - so not the unable-to-connect failure, but a message that fails for whatever reason (email too big?). In the current implementation, that message would be retried every second, which spams the logs. The bigger problem though is that if 8 of these un-sendable messages end up at the front of the queue, Kratos would never send any more emails, always retrying those 8 every time. I have a few suggestions for a fix (or a combination):

  1. Store retry state per message, with a backoff strategy and retry count, to reduce the log spam and give up at some point.
  2. Continue fetching groups of 8 messages until there are no more in the queue (essentially paging the queue).
  3. Mark messages that are rejected by the SMTP server as failed, and never retry.

The third one is definitely the easiest, and I don't know if there is that much point in retrying per email?
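Suggestions 1 and 3 can be combined into simple per-message bookkeeping. A hypothetical sketch (type and field names are made up for illustration, not Kratos' actual schema): each failed send increments a counter, and once the counter reaches a limit the message is marked abandoned and stops blocking the queue.

```go
package main

import "fmt"

type Status int

const (
	StatusQueued Status = iota
	StatusAbandoned
)

type Message struct {
	ID      int
	Status  Status
	Retries int
}

const maxRetries = 5 // assumed limit, for illustration

// recordFailure updates a message after a failed send attempt:
// bump the retry counter, and give up once the limit is reached.
func recordFailure(m *Message) {
	m.Retries++
	if m.Retries >= maxRetries {
		m.Status = StatusAbandoned // never retried again
	}
}

func main() {
	m := &Message{ID: 1, Status: StatusQueued}
	for i := 0; i < maxRetries; i++ {
		recordFailure(m)
	}
	fmt.Println(m.Status == StatusAbandoned, m.Retries) // true 5
}
```

In a real deployment this state would live in the messages table so that every instance sees the same counter.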

@aeneasr
Member

aeneasr commented Jul 15, 2020

However, if the sending takes more than a second, then we should expect that all Kratos instances send a copy of the same email, since they are polling the same queue every second, and the message is left as queued until the send operation is finished

Nice find! This is definitely an issue. I think that we originally planned to have this as its own sub-command (e.g. kratos courier watch) but we abandoned that idea in favor of a simpler set-up. In that scenario, you would have only one cron job that watches the queue, but that's definitely no longer the case.

In my opinion, we should have three message states:

  1. queued
  2. processing (goes back to queue on error, goes to sent on success)
  3. sent

In the future, we might also keep track of errors and put the message in a permanent error state if continued delivery fails.
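The point of the intermediate processing state is that a worker claims a message before sending it, so a second instance polling the same queue skips it. A rough in-memory illustration of those state transitions (the real implementation would claim messages inside a database transaction; all names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

type Status int

const (
	Queued Status = iota
	Processing
	Sent
)

// Queue is a toy stand-in for the messages table; the mutex plays
// the role the database transaction would play in practice.
type Queue struct {
	mu     sync.Mutex
	status map[int]Status
}

// Claim atomically moves a queued message to processing.
// It returns false if another worker already claimed it.
func (q *Queue) Claim(id int) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.status[id] != Queued {
		return false
	}
	q.status[id] = Processing
	return true
}

// Finish moves a processing message to sent on success,
// or back to queued on error (as described above).
func (q *Queue) Finish(id int, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if ok {
		q.status[id] = Sent
	} else {
		q.status[id] = Queued
	}
}

func main() {
	q := &Queue{status: map[int]Status{1: Queued}}
	first := q.Claim(1)
	second := q.Claim(1) // a second worker loses the race
	fmt.Println(first, second) // true false
	q.Finish(1, true)
	fmt.Println(q.status[1] == Sent) // true
}
```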

@ro-tex

ro-tex commented Jan 13, 2021

Are there any workarounds for this issue while we wait for a permanent fix?

We are looking to start using Kratos in production very soon but we have 8 machines right now and we're looking to double that in the next 6 months. Sending more than a dozen duplicates of each email is not really an option for us.

We are considering removing the email credentials from all but one machine. This will make it a single point of failure, so it's definitely not a great choice but it might just make it possible for us to operate. Can you advise on whether that's a viable strategy? I am not familiar with the inner workings of Kratos, so it's hard for me to anticipate the possible problems this might create e.g. multiple nodes failing to send the email and marking it as "bad" in the DB, so it never gets sent, etc.

@aeneasr
Member

aeneasr commented Jan 14, 2021

The best thing would be to create a PR with a fix! That way you have a long-term solution, help out the community, and don't spend time on hacks which potentially mess up your system.

My idea here was to remove the background worker and create a new CLI command called e.g. kratos watch courier or kratos worker which basically executes the linked function in the foreground.

This would then need to be properly documented in several places - e.g. quickstart, guides, concepts. We would also need to update the docker-compose example.

The last thing is: if DSN=memory, the mail server should actually run as part of kratos serve (as it does today), and kratos watch courier should refuse to start in that case. This is because the data is kept in the kratos process memory, which other processes don't have access to!
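A minimal sketch of that guard (function name and error wording are assumptions, not Kratos' actual implementation): with DSN=memory the queue lives inside the kratos serve process, so a standalone worker process could never see the queued messages and must refuse to start.

```go
package main

import (
	"errors"
	"fmt"
)

// validateCourierDSN is a hypothetical startup check for a
// standalone courier worker command: reject in-memory storage,
// because its queue is invisible to other processes.
func validateCourierDSN(dsn string) error {
	if dsn == "memory" {
		return errors.New("refusing to start the courier worker: DSN=memory keeps the message queue in-process")
	}
	return nil
}

func main() {
	fmt.Println(validateCourierDSN("memory") != nil) // true: rejected
	fmt.Println(validateCourierDSN("postgres://localhost/kratos") == nil) // true: allowed
}
```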

@aeneasr aeneasr changed the title SMTP Error spams the server logs SMTP worker can become slow when too many invalid emails are queued up Jul 21, 2022
@aeneasr
Member

aeneasr commented Jul 21, 2022

@jonas-jonas this might be a good next issue for you, let's discuss it when you're in Munich :)

@aeneasr
Member

aeneasr commented Jul 21, 2022

See also #1598

@jonas-jonas
Member

@aeneasr sounds great, really looking forward to that!

@jonas-jonas
Member

I think this issue is already solved, coincidentally by me, 5 months ago: #2257

The PR introduced a new configuration key (courier.message_ttl), which sets a "time-to-live" for failed (read: unsent) messages, so that they are removed from the queue after the specified time.
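As a hedged sketch of what that key might look like in the configuration from this issue (the duration format is an assumption, following the lifespan settings used elsewhere in the config above):

```yaml
courier:
  smtp:
    connection_uri: OMITTED
    from_address: OMITTED
  # Assumed value format: a duration string, like the
  # request_lifespan settings elsewhere in this config.
  message_ttl: 24h
```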

An endpoint to query these could be added by querying all messages with the status MessageStatusAbandoned. Should we do that in a follow-up ticket?

@aeneasr
Member

aeneasr commented Aug 1, 2022

Oh, indeed! I'm wondering what a good value would be for message_ttl? Could we also use a retry count?

Also, it seems like this setting is currently not used in Ory Cloud - if you search for the package "dogfood" in the backoffice you will find the kratos configuration - could you please set a reasonable message TTL there?

@aeneasr aeneasr closed this as completed in b341756 Aug 4, 2022
peturgeorgievv pushed a commit to senteca/kratos-fork that referenced this issue Jun 30, 2023
This PR replaces the `courier.message_ttl` configuration option with a `courier.message_retries` option that limits how often the sending of a message is retried before it is marked as `abandoned`.

BREAKING CHANGES: This is a breaking change, as it removes the `courier.message_ttl` config key and replaces it with a counter `courier.message_retries`.

Closes ory#402
Closes ory#1598