Implement record based Crucible reference counting #6805

Open
jmpesp wants to merge 10 commits into main

Conversation

@jmpesp (Contributor) commented Oct 9, 2024

Crucible volumes are created by layering read-write regions over a hierarchy of read-only resources. Originally only a region snapshot could be used as a read-only resource for a volume. With the introduction of read-only regions (created during the region snapshot replacement process) this is no longer true!

Read-only resources can be used by many volumes, and because of this they need a reference count so they can be deleted when they're no longer referenced. The region_snapshot table uses a `volume_references` column, which counts how many uses there are. The region table does not have this column, and moreover, while a simple integer works for reference counting, it does not tell you _what_ volume each use comes from. This can be determined (see omdb's validate volume references command), but that information is tossed out, as Nexus knows what volumes use what resources! Instead, record what read-only resources a volume uses in a new table.

As part of the schema change to add the new volume_resource_usage table, a migration is included that will create the appropriate records for all region snapshots.
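To make this concrete, here is a minimal sketch (illustrative names only, not the PR's actual Nexus models) of what a usage record captures: one row ties one volume to one read-only resource, so a resource's reference count is simply the number of rows naming it, and those rows also say which volumes hold the references.

```rust
use uuid::Uuid;

// Sketch only: stand-in types, not the PR's actual models.
#[derive(Clone, PartialEq)]
enum ReadOnlyResource {
    Region { region_id: Uuid },
    RegionSnapshot { dataset_id: Uuid, region_id: Uuid, snapshot_id: Uuid },
}

// One row in the new usage table: "this volume uses this read-only resource".
struct UsageRecord {
    volume_id: Uuid,
    resource: ReadOnlyResource,
}

// Unlike a bare counter, the rows identify every volume holding a reference.
fn referencing_volumes(records: &[UsageRecord], resource: &ReadOnlyResource) -> Vec<Uuid> {
    records
        .iter()
        .filter(|record| &record.resource == resource)
        .map(|record| record.volume_id)
        .collect()
}
```

A read-only resource is then safe to clean up exactly when no usage rows reference it.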

In testing, a few bugs were found: the worst being that read-only regions did not have their read_only column set to true. This would have been a problem if any read-only regions had been created, but they are currently only created during region snapshot replacement. To detect whether any such regions were created, find all regions that were allocated for a snapshot volume:

SELECT id FROM region
WHERE volume_id IN (SELECT volume_id FROM snapshot);

A similar bug was found in the simulated Crucible agent.

This commit also reverts #6728, enabling region snapshot replacement again: it was disabled due to the lack of read-only region reference counting, and with that now in place it can be turned back on.

@jmpesp requested review from smklein and leftwo on October 9, 2024 00:29
@leftwo (Contributor) left a comment:

I may have some more questions after I finish going through nexus/tests/integration_tests/volume_management.rs but I thought I should give you what I have so far.

/// this column, and more over a simple integer works for reference counting but
/// does not tell you _what_ volume that use is from. This can be determined
/// (see omdb's validate volume references command) but it's information that is
/// tossed out, as Nexus knows what volumes use what resources! Instead, record
Contributor:

Regarding the wording of the comment here: I'm having trouble tracking what the "Instead" is referencing.

We are recording what read-only resources a volume uses instead of doing what?

Contributor Author:

In this context I mean "instead of tossing out information, record it in the usage table", where "tossing out information" means just incrementing and decrementing the volume_references integer in the region_snapshot table.

Contributor:

Cool, could we update the comment with that?

Contributor Author:

done in 221751a

VolumeResourceUsage::ReadOnlyRegion {
    region_id: record
        .region_id
        .expect("valid read-only region usage record"),
Contributor:

For this and the .expects below, this message is printed when we panic, right? If so, should it be saying that we did not find a valid region usage record?

Collaborator:

Conservatively, this should maybe be a TryFrom implementation. As it exists today, someone could modify a column in the database and cause Nexus to panic with these .expects
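For illustration, a minimal sketch of the fallible conversion being suggested, so a malformed row surfaces as an error rather than a panic (the record and enum types here are stand-ins, not the PR's actual models, and the usage-type strings are placeholders):

```rust
use uuid::Uuid;

// Stand-in for the flattened database row quoted above.
struct VolumeResourceUsageRecord {
    usage_type: String,
    region_id: Option<Uuid>,
    region_snapshot_dataset_id: Option<Uuid>,
    region_snapshot_region_id: Option<Uuid>,
    region_snapshot_snapshot_id: Option<Uuid>,
}

enum VolumeResourceUsage {
    ReadOnlyRegion { region_id: Uuid },
    RegionSnapshot { dataset_id: Uuid, region_id: Uuid, snapshot_id: Uuid },
}

impl TryFrom<VolumeResourceUsageRecord> for VolumeResourceUsage {
    type Error = String;

    fn try_from(record: VolumeResourceUsageRecord) -> Result<Self, Self::Error> {
        match record.usage_type.as_str() {
            // A row edited out from under Nexus becomes an error, not a panic.
            "read_only_region" => Ok(VolumeResourceUsage::ReadOnlyRegion {
                region_id: record
                    .region_id
                    .ok_or("read-only region usage record has no region_id")?,
            }),
            "region_snapshot" => Ok(VolumeResourceUsage::RegionSnapshot {
                dataset_id: record
                    .region_snapshot_dataset_id
                    .ok_or("region snapshot usage record has no dataset_id")?,
                region_id: record
                    .region_snapshot_region_id
                    .ok_or("region snapshot usage record has no region_id")?,
                snapshot_id: record
                    .region_snapshot_snapshot_id
                    .ok_or("region snapshot usage record has no snapshot_id")?,
            }),
            other => Err(format!("unrecognized usage type {other}")),
        }
    }
}
```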

Contributor Author:

agreed, that's in 0a45763

for read_only_target in crucible_targets.read_only_targets {
    let sub_err = OptionalError::new();

    let maybe_usage = Self::read_only_target_to_volume_resource_usage(
Contributor:

Nit: This line and a few others here seem to have escaped cargo fmt as they are super long.

@@ -2409,24 +2956,188 @@ impl DataStore {
read_only_parent: None,
};

let volume_data = serde_json::to_string(&vcr)
let volume_data = serde_json::to_string(&vcr)
Contributor:

Man, there is a lot going on inside this transaction :)

)
})
})?
// XXX be smart enough to .filter the above query
Contributor:

make this a TODO, you can get smart enough!

'region_snapshot'
);

CREATE TABLE IF NOT EXISTS omicron.public.volume_resource_usage (
Collaborator:

I know it exists elsewhere in this PR, but I think this table would benefit from some text explaining what it is.

Contributor Author:

you're right, done in bd2ef84

Comment on lines +54 to +58
pub region_id: Option<Uuid>,

pub region_snapshot_dataset_id: Option<Uuid>,
pub region_snapshot_region_id: Option<Uuid>,
pub region_snapshot_snapshot_id: Option<Uuid>,
Collaborator:

I had no idea reading the DB schema that these were two groups of columns, and "either one or the other is non-null".

If that's the intent -- and it seems to be, based on VolumeResourceUsage as an enum -- maybe we could add a CHECK on the table validating this?

Something like:

CONSTRAINT exactly_one_usage_source CHECK (
  (
    (usage_type = 'readonlyregion') AND
    (region_id IS NOT NULL) AND
    (region_snapshot_dataset_id IS NULL AND region_snapshot_region_id IS NULL AND region_snapshot_snapshot_id IS NULL)
  ) OR
  (
    (usage_type = 'regionsnapshot') AND
    (region_id IS NULL) AND
    (region_snapshot_dataset_id IS NOT NULL AND region_snapshot_region_id IS NOT NULL AND region_snapshot_snapshot_id IS NOT NULL)
  )
)

Contributor Author:

nice, done in bd2ef84

@@ -264,7 +264,7 @@ impl DataStore {
        block_size,
        blocks_per_extent,
        extent_count,
-       read_only: false,
+       read_only: maybe_snapshot_id.is_some(),
Collaborator:

Was this a bug before this PR?

Contributor Author:

Unfortunately yes

Contributor Author:

Note though that currently the only thing in Nexus that creates read-only regions is region snapshot replacement, so this wasn't a bug that was hit anywhere.

enum VolumeCreationError {
    #[error("Error from Volume creation: {0}")]
    Public(Error),

let maybe_volume: Option<Volume> = dsl::volume
Collaborator:

Volume has a time_deleted column -- do we not care about that here?

Contributor Author:

Not here, no - if the caller is trying to call volume_create with a volume that has the same ID as one that already exists or is soft-deleted, then we should disallow that.

Comment on lines +3060 to +3064
// This function may be called with a replacement volume
// that is completely blank, to be filled in later by this
// function. `volume_create` will have been called but will
// not have added any volume resource usage records, because
// it was blank!
Collaborator:

I'm confused about this comment, in this location - namely, I'm not sure who is calling volume_create in this situation where we're saying "volume_create will have been called". Is this something we're doing within the body of this function, and I'm not seeing it? Or is this something someone else could concurrently be doing?

// not have added any volume resource usage records, because
// it was blank!
//
// The indention leaving this transaction is that the
Collaborator:

Suggested change
// The indention leaving this transaction is that the
// The intention leaving this transaction is that the

// We don't support a pure Region VCR at the volume
// level in the database, so this choice should
// never be encountered.
panic!("Region not supported as a top level volume");
Collaborator:

This seems like a footgun to me, to have a pub fn with an undocumented panic?
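For context, the usual way to defuse that, short of returning an error, is a `# Panics` section in the rustdoc. A hypothetical illustration (not the PR's actual function):

```rust
/// Hypothetical example of documenting the panic for callers of a pub fn.
///
/// # Panics
///
/// Panics if the top-level layer of the VCR is a `Region`, which is not
/// supported as a top-level volume in the database.
pub fn assert_not_top_level_region(top_level_is_region: bool) {
    if top_level_is_region {
        panic!("Region not supported as a top level volume");
    }
}
```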

OVERLAY(
    OVERLAY(
        MD5(volume.id::TEXT || dataset_id::TEXT || region_id::TEXT || snapshot_id::TEXT || snapshot_addr || volume_references::TEXT)
        PLACING '4' from 13
Collaborator:

What is this doing? Why '4'? Why from 13?

Contributor Author:

This code creates a deterministic V4 UUID from the other columns, which has the shape of

xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
12345678 9111 1111 1112 222222222333
          012 3456 7890 123456789012

where the first hexadecimal digit in the third group always being 4 means that M = 4, and I put a "random" value (read: volume_references) in for N (the variant field).

But from https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-version-4:

Alternatively, an implementation MAY choose to randomly generate the exact required number of bits for random_a, random_b, and random_c (122 bits total) and then concatenate the version and variant in the required position.

I don't think I'm doing this right - position 17 shouldn't be "random":

var:
The 2-bit variant field as defined by Section 4.1, set to 0b10.

I can fix this tomorrow
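For reference, a small std-only sketch (hypothetical helper, not the migration's SQL) of what the RFC asks for: take 16 otherwise-arbitrary bytes, force the version nibble at hex digit 13 to 4, and force the top two bits at hex digit 17 to the 0b10 variant, which is why that digit can't carry arbitrary data.

```rust
/// Stamp version-4 and RFC-variant bits onto 16 arbitrary bytes and format
/// the result as a UUID string.
fn stamp_v4(mut bytes: [u8; 16]) -> String {
    // Byte 6 holds hex digits 13-14: the high nibble becomes 0b0100 (version 4).
    bytes[6] = (bytes[6] & 0x0f) | 0x40;
    // Byte 8 holds hex digits 17-18: the top two bits become 0b10 (variant),
    // so digit 17 can only be 8, 9, a, or b rather than arbitrary data.
    bytes[8] = (bytes[8] & 0x3f) | 0x80;

    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}
```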

    MD5(volume.id::TEXT || dataset_id::TEXT || region_id::TEXT || snapshot_id::TEXT || snapshot_addr || volume_references::TEXT)
    PLACING '4' from 13
)
PLACING TO_HEX(volume_references) from 17
Collaborator:

Why from 17?

Contributor Author:

See above comment.

The garbage collection of read/write regions must be separate from
read-only regions:

- read/write regions are garbage collected by either being deleted
  during a volume soft delete, or by appearing later during the "find
  deleted volume regions" section of the volume delete saga

- read-only regions are garbage collected only in the volume soft delete
  code, when there are no more references to them

`find_deleted_volume_regions` was changed to only operate on read/write
regions, and no longer returns the optional RegionSnapshot object: that
check was moved from the volume delete saga into the function, as it
didn't make sense that it was separated.

This commit also adds checks to validate that invariants related to
volumes are not violated during tests. One invalid test was deleted
(regions will never be deleted when they're in use!)

In order to properly test the separate region deletion routines, the
first part of the fixes for dealing with deleted volumes during region
snapshot replacement were brought in from that branch: these are the
changes to region_snapshot_replacement_step.rs and
region_snapshot_replacement_start.rs.
@jmpesp (Contributor Author) commented Oct 16, 2024

@smklein @leftwo have a look at 2ff34d5, which contains a fix for the region deletion bug I've been talking about.
