Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of messages with same MessageGroupId in SQS Batch processor #1140

Closed
MartinMitro opened this issue Apr 25, 2023 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@MartinMitro
Copy link

MartinMitro commented Apr 25, 2023

Key information

  • RFC PR: (leave this empty)
  • Related issue(s), if known:
  • Area: (i.e. Tracer, Metrics, Logger, etc.)
  • Meet tenets: (Yes/no)

Summary

Once the first message of Batch fails and we know it will be returned back to queue, rest of the messages with same message group id should no be processed

Motivation

We want to be complaint with the FIFO SQS principle and process messages within same group in order.

Proposal

If the processing of the batch message fails with retryable exception and message attributes contain message group id, then the id should be stored and subsequent message of batch should be checked against it. If there is a match, message processing should be skipped and message also returned to queue.

Drawbacks

As in FIFO queues, next messages of the same message group will be processed only when the failing was in removed from queue.

Rationale and alternatives

  • What other designs have been considered? Why not them?

  • What is the impact of not doing this?

Messages are not processed in order. Example: Lets say we have entity, which with id 1, and that ID is also message group ID. Now we have stack of updates in FIFO queue, first one updates value of entity to A, second one to value B. First one fails lets say on connection problem and will be retried later, second one succeeds and value is updated to B. Once the visibility timeout of first message expires, it will be processed again and value of entity with id 1 will be updated to A. Entity ends with invalid state.

Unresolved questions

@scottgerring
Copy link
Contributor

Hey @MartinMitro thanks for raising this!

Looking to the python powertools for inspiration, I see:

When using SQS FIFO queues, we will stop processing messages after the first failure, and return all failed and unprocessed messages in batchItemFailures. This helps preserve the ordering of messages in your queue.

The user then picks either the BatchProcessor or the SqsFifioPartialProcessor to choose the behavior. As far a I can see this logic is not yet implemented in Lambda powertools for Java.

I believe this would solve your problem and maintain consistency with the other powertools implementations. In your example, the failed update to entity A would cause the batch to be returned as unprocessed, and the update to entity B would not be passed to the handler until A had succeeded.

What do you think?

@MartinMitro
Copy link
Author

Hey @scottgerring,

thanks for the response. If I understand correctly, Lambda can receive messages with different MessageGroupIds. If the implementation would fail on the first message, regardless of message group id, that would just postpone processing of not related message. That could repeat also multiple times - maybe causing messages to be sent to DLQ or be discarded. I would therefore suggest to fail only messages with same MessageGroupId and process the rest.

I look forward to hearing from you.

@scottgerring
Copy link
Contributor

Hey @MartinMitro, I've dug into this a bit deeper to try and understand how python powertools arrived at its implementation. The initial PR adding support for the feature links to the SQS documentation page Implementing partial batch responses which advises the following:

If you're using this feature with a FIFO queue, your function should stop processing messages after the first failure and return all failed and unprocessed messages in batchItemFailures. This helps preserve the ordering of messages in your queue.

Do you think this will be a workable solution in your case?

@scottgerring scottgerring self-assigned this May 24, 2023
@MartinMitro
Copy link
Author

Hey @scottgerring, let me verify that statement with Lambda / SQS team. Once I have response I will come back to you. Thank you

@scottgerring
Copy link
Contributor

Hey @MartinMitro , i've started sketching out a fix for this in #1183 . I have also tried to reach out to the SQS team to determine the correct way of handling this situation.

@MartinMitro
Copy link
Author

Hey @scottgerring, I've received feedback to my support case and it seems lambda can receive only one message group id per batch. It references this documentation: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#events-sqs-scaling

So in fact python implementation is correct and we can short circuit rest of the processing if one message fails.

@scottgerring
Copy link
Contributor

scottgerring commented Jun 14, 2023

Hey @MartinMitro - great - and reassuring to see they've come to the same documentation. I can change the implementation in the PR to do this easily enough.

I've been hunting internally for some more canonical wording here but have come up empty handed - if you don't mind forwarding me the messaging from support ( gerrings at amazon ) it might help scratch that itch!

@scottgerring scottgerring changed the title RFC: Improve handling of messages with same MessageGroupId in SQS Batch processor Improve handling of messages with same MessageGroupId in SQS Batch processor Jul 13, 2023
@github-actions
Copy link

This is now released under 1.16.1 version!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: Shipped
Development

No branches or pull requests

2 participants