Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: Prevent misaligned segment reuploads #12966

Merged
merged 2 commits into from
Aug 23, 2023

Conversation

Lazin
Copy link
Contributor

@Lazin Lazin commented Aug 23, 2023

For normal segment uploads and reuploads we have strict checks that disallow any modifications if they're not safe. The check is done by the partition_manifest::safe_segment_meta_to_add method. This method checks if the new segment "connects" with the previous one in case of new segment upload. For reuploads it checks if the segment is aligned with the manifest.

For compacted reuploads this check was simplified because originally we had a feature that allowed compacted segment reupload to fill the gap in the manifest. So the reupload could potentially break the rule if the segment begins or ends in a gap. This feature is not needed now because the manifest can't have gaps anymore due to consistency checks described above. But because the check is not strict enough we may end up with the following situation:

manifest segments
[1000 1999] [2000 2999] [3000 3999] [4000 4999] [5000 5999]
first compacted reupload
[1000                         3999]
manifest after first reupload
[1000                         3999] [4000 4999] [5000 5999]
second comopacted reupload
            [2000                         4999]

When the first reupload is added to the manifest the second reupload is no longer aligned and can't be added. The new binary manifest format can't handle this situation and triggers assertion.

This PR removes reupload gap tests since they're no longer needed. It also updates the safe_segment_meta_to_add method to support compacted segments.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Bug Fixes

  • Fix consistency violation caused by compacted segment reupload

The tests are checking compacted reupload that starts in a gap. The
situation is impossible in redpanda since we can't have gaps in the
manifest. The tests are failing when strict checks on segment metadata
are enabled. Only alinged segment reuploads are allowed by these checks
so the reupload that starts in the gap can't really be registred by
the archival STM.
Validate compacted segments inside the 'safe_segment_meta_to_add'
method. Do not allow segment reuploads which are not aligned with
existing segments in the manifest.
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If safe_segment_meta_to_add returns false for a compacted re-upload, does this result in the "re-upload loop" getting stuck? The last compacted uploaded offset is not advanced, so we can end up picking the same compacted re-upload and being refused over and over.

@Lazin
Copy link
Contributor Author

Lazin commented Aug 23, 2023

If safe_segment_meta_to_add returns false for a compacted re-upload, does this result in the "re-upload loop" getting stuck? The last compacted uploaded offset is not advanced, so we can end up picking the same compacted re-upload and being refused over and over.

The loop won't stuck. The upload could be rejected only if it's not safe to replicate the metadata. This can be caused by race condition in the code or two racing ntp_archiver instances. In this case we will have two conflicting uploads and one of them will win and the second one will be rejected in a deterministic manner. The manifest will be updated so the next loop iteration will chose different upload candidate.

In the worst case we will have single compacted upload which gets rejected continuously. I don't know why this could happen but imagine that it did. In this case when the retention will kick in the ntp_archiver will chose different upload candidate.

@Lazin Lazin merged commit a73623e into redpanda-data:dev Aug 23, 2023
27 of 28 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v23.2.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants