SMTP worker can become slow when too many invalid emails are queued up #402
Comments
Yeah, I think we could introduce exponential backoff here!
I started working on this, and enabling the already in-place exponential back-off when Kratos is not able to connect to the SMTP server is fairly easy. However, I started thinking about other per-message errors (not sure what they would be, maybe malformed emails or too-large bodies), and started playing with adding an exponential backoff to each message, storing state in the DB. Then I suddenly realised that Kratos is supposed to be scaled horizontally by sharing the DB, and that this could cause all sorts of race conditions and strange behaviour.

So, my question becomes: is it necessary to have a courier service in Kratos that queues outgoing messages before sending them to a 3rd-party service? With my current understanding, it would make more sense for the service to send the message directly to the 3rd-party service, and if that fails, return the error to the flow that triggered it and subsequently fail the flow. This would probably also improve the UX if sending an email fails (right now the user would not know that no email was sent). If the purpose of the courier queue is to keep track of the messages that were sent, most services (e.g. SendGrid) offer this feature by default. We could of course keep the log in the DB for audit purposes as well, if that is of interest.
Thank you for your thoughtful comment and for working towards a contribution!
Message delivery is asynchronous; waiting for message delivery in the UX is bad because we might end up waiting 60 seconds or more to get an acknowledgement that the message was delivered (or we might never know). Because we don't want to depend on other 3rd-party services such as RabbitMQ, to keep things easy we'll have to deal with this in Kratos (which is already the case). I don't think there are possible race conditions, because we use transactions. In the rare case where a race condition might still occur (e.g. a simultaneous, down-to-the-nanosecond SELECT), we'll just end up sending two emails, which isn't great, but it's also not a dealbreaker!
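To illustrate why claiming a message inside a lock prevents duplicate sends, here is a miniature, in-memory sketch in Go, with a mutex-guarded flag standing in for the row-level lock a database transaction would take (all names are hypothetical, not Kratos internals):

```go
package main

import (
	"fmt"
	"sync"
)

// message mimics a row in the courier queue; claimed stands in for the
// row lock a transactional SELECT would take in the real database.
type message struct {
	mu      sync.Mutex
	claimed bool
	sent    int
}

// trySend atomically claims the message before delivering it, so
// concurrent workers send it at most once.
func (m *message) trySend() bool {
	m.mu.Lock()
	if m.claimed {
		m.mu.Unlock()
		return false // another worker already picked it up
	}
	m.claimed = true
	m.mu.Unlock()
	m.sent++ // "deliver" the message
	return true
}

func main() {
	m := &message{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // eight workers race for the same message
		wg.Add(1)
		go func() { defer wg.Done(); m.trySend() }()
	}
	wg.Wait()
	fmt.Println("sent", m.sent, "time(s)") // sent 1 time(s)
}
```

In the database version, the "claim" and the status update happen in one transaction, which is why a true double-send requires the near-simultaneous SELECT described above.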
Oh OK, I didn't know that SMTP message posting was that slow. But then it makes total sense to do it in the background. However, if sending takes more than a second, then we should expect that all Kratos instances send a copy of the same email, since they are polling the same queue every second, and the message is left as queued until the send operation is finished. Occasional duplicate emails aren't that bad, but consistently sending duplicates looks a bit strange to an end user. I guess the easiest solution for now is just to document that only one of the instances should be configured with the SMTP courier, and it's up to the admin to make sure of that? Then, in the future, one could look at some sort of leader election using the DB, so that the config can be the same for all instances.

Another problem, however, is if sending fails for a single message: so not the unable-to-connect failure, but a message that fails for whatever reason (email too big?). In the current implementation, that message would be retried every second, which spams the logs. The bigger problem, though, is that if 8 of these un-sendable messages end up at the front of the queue, Kratos would never send any more emails, always retrying those 8 every time. I have a few suggestions for a fix (or a combination):
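The interim workaround mentioned above, configuring the SMTP courier on a single instance only, could look roughly like this (a sketch only; the `courier.smtp.connection_uri` key follows the documented Kratos config schema, but verify it against your version):

```yaml
# Instance A: the only node that actually delivers mail.
courier:
  smtp:
    connection_uri: smtps://user:password@smtp.example.com:465

# Instances B..N: omit the courier SMTP block entirely, so they never
# pick messages off the queue.
```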
Nice find! This is definitely an issue. I think that we originally planned to have this as its own sub-command (e.g. `…`). In my opinion, we should have four message states:
In the future, we might also keep track of errors and put the message in a permanent error state if continued delivery fails.
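As a sketch, such a state set could be modelled in Go like this; `queued` and `abandoned` come up elsewhere in this thread, while the other state names are assumptions of mine:

```go
package main

import "fmt"

// messageStatus enumerates the lifecycle of a courier message.
type messageStatus int

const (
	statusQueued     messageStatus = iota // waiting to be picked up
	statusProcessing                      // a worker has claimed it
	statusSent                            // handed off to the SMTP server
	statusAbandoned                       // gave up after repeated failures
)

// String renders the status for logs and API responses.
func (s messageStatus) String() string {
	return [...]string{"queued", "processing", "sent", "abandoned"}[s]
}

func main() {
	fmt.Println(statusQueued, statusAbandoned) // queued abandoned
}
```

A distinct `processing` state is what lets other workers (and a recovery job) tell an in-flight message apart from a stuck one.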
Are there any workarounds for this issue while we wait for a permanent fix? We are looking to start using Kratos in production very soon, but we have 8 machines right now and we're looking to double that in the next 6 months. Sending more than a dozen duplicates of each email is not really an option for us. We are considering removing the email credentials from all but one machine. This will make it a single point of failure, so it's definitely not a great choice, but it might just make it possible for us to operate. Can you advise on whether that's a viable strategy? I am not familiar with the inner workings of Kratos, so it's hard for me to anticipate the possible problems this might create (e.g. multiple nodes failing to send the email and marking it as "bad" in the DB, so it never gets sent, etc.).
The best thing would be to create a PR with a fix! That way you have a long-term solution, help out the community, and don't spend time on hacks which could potentially mess up your system. My idea here was to remove the background worker and create a new CLI command called e.g. `…`. This would then need to be properly documented in several places, e.g. the quickstart, guides, and concepts. We would also need to update the docker-compose example. The last thing is, if …
@jonas-jonas this might be a good next issue for you, let's discuss it when you're in Munich :)
See also #1598
@aeneasr sounds great, really looking forward to that!
I think this issue is already solved, coincidentally by me, 5 months ago: #2257. The PR introduced a new configuration key (`…`). An endpoint to query these could be added by querying all messages with the status `…`.
Oh, indeed! I'm wondering what a good value for `message_ttl` would be? Could we also use a retry count? Also, it seems like this setting is currently not used in Ory Cloud: if you search for the package "dogfood" in the backoffice, you will find the Kratos configuration. Could you please set a reasonable message TTL there?
This PR replaces the `courier.message_ttl` configuration option with a `courier.message_retries` option to limit how often the sending of a message is retried before it is marked as `abandoned`.

BREAKING CHANGES: This is a breaking change, as it removes the `courier.message_ttl` config key and replaces it with a counter, `courier.message_retries`.

Closes ory#402
Closes ory#1598
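A rough sketch of the retry-counter behaviour this PR describes (the types and field names here are illustrative, not the actual Kratos implementation; only the `courier.message_retries` key and the `abandoned` status come from the PR text):

```go
package main

import "fmt"

// maxRetries mirrors the courier.message_retries setting.
const maxRetries = 5

type msg struct {
	sendCount int
	status    string
}

// recordFailure bumps the retry counter and abandons the message once
// the limit is reached, so one broken email can no longer block the
// queue indefinitely.
func recordFailure(m *msg) {
	m.sendCount++
	if m.sendCount >= maxRetries {
		m.status = "abandoned"
	} else {
		m.status = "queued" // put it back for another attempt
	}
}

func main() {
	m := &msg{status: "queued"}
	for i := 0; i < maxRetries; i++ {
		recordFailure(m)
	}
	fmt.Println(m.status) // abandoned
}
```

Counting attempts instead of tracking a TTL makes the limit independent of how fast or slow the worker polls, which is why it composes cleanly with a backoff.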
Describe the bug
When the server is unable to connect to the SMTP server, it will spam the error log and use a lot of resources by retrying very frequently.
Reproducing the bug
Steps to reproduce the behavior:
Server logs
(Sensitive details have been changed)
This continues on and on.
Server configuration
Expected behavior
For it to retry a few times, increasing the time between attempts each time, and maybe give up after a few failures. It should somehow let the administrator know that it's not working correctly, but without spamming the console.
Environment
Additional context