Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run full Cadence 1.0 migration on TN state #3096

Closed
10 tasks done
turbolent opened this issue Feb 12, 2024 · 21 comments
Closed
10 tasks done

Run full Cadence 1.0 migration on TN state #3096

turbolent opened this issue Feb 12, 2024 · 21 comments
Assignees

Comments

@turbolent
Copy link
Member

turbolent commented Feb 12, 2024

Tasks

  1. Bug
  2. 1 of 2
    turbolent
  3. 2 of 2
    turbolent
  4. Improvement
    turbolent
  5. Improvement
    turbolent
@turbolent
Copy link
Member Author

For investigating the concurrency issues, I tried building the util binary with the race detector on, but it reported no issues:

CGO_ENABLED=1 \
GOOS=linux \
GOARCH=amd64 \
CC="zig cc -target x86_64-linux-musl" \
CXX="zig c++ -target x86_64-linux-musl" \
go build -v --tags "netgo,osusergo" -race -ldflags "-extldflags=-static"

@turbolent
Copy link
Member Author

For the empty intersection type loading issue, I opened #3138, which so far has a reproducer for one particular case where the type loading occurs (DictionaryValue.Insert in StorageMigration.MigrateNestedValue).

I'm not really sure how to proceed here. Disabling type checks, like checkContainerMutation isn't sufficient, because e.g. CompositeValue.Transfer calls CompositeValue.IsResourceKinded, which isn't just a run-time check, but required to distinguish between the struct vs resource code path of transferring (copy vs move).

The best idea I came up so far is rewriting the static type of a container value (e.g. dictionary, array), before migrating the nested value inside of it, i.e. before calling Remove/Insert, and before the related Transfer calls occur, which convert the static type.

@turbolent
Copy link
Member Author

I wonder if we can integrate the intersection type rewriting and entitlements migration code directly into interpreter.ConvertStaticToSemaType.

I tried just performing the intersection type rewriting there, and none of the tests break, except for the entitlements migration, which looses the necessary information it needs (legacy intersection type). Maybe also adding the entitlements migration there will fix that last issue?

@janezpodhostnik
Copy link
Contributor

About Errors when worker count >1

not 100% sure they are all related to the same thing (its likely), but one of the problems is that the migration code currently doesn't support creating slab indexes from multiple goroutines. This can be fixed by either adding a locking mechanism or some channel magic.

I will take a look.

@fxamacker
Copy link
Member

About Errors when worker count >1

not 100% sure they are all related to the same thing (its likely), but one of the problems is that the migration code currently doesn't support creating slab indexes from multiple goroutines. This can be fixed by either adding a locking mechanism or some channel magic.

Unless I'm mistaken, worker count > 1 means n workers processing n accounts concurrently.

Slab indexes are per account. I think we might only have a problem to generate slab index if multiple goroutines migrate the same account.

@turbolent
Copy link
Member Author

I had a look at AtreeRegisterMigrator, and the function that generates the new payloads from the snapshot's, validateChangesAndCreateNewRegisters, looks different from MergeRegisterChanges in the Cadence 1.0 migration. Are there maybe some cases we're not properly handling yet?

@turbolent
Copy link
Member Author

turbolent commented Feb 28, 2024

Re: Burner deployment is too expensive: I think @fxamacker you mentioned that creating a snapshot for all payloads is potentially intractable.

I opened onflow/flow-go#5470, to show that for the deployment migration, we can filter the payloads to the payloads of just the accounts that are needed for the deployment.

However, how can we merge the resulting write set into the original payloads without having to construct a full payload snapshot?

cc @onflow/flow-cadence-execution

@turbolent
Copy link
Member Author

Cross posting from Discord, Re: Does the migration pipeline properly migrate nested containers?

The example I thought of was:
Given an outer container (e.g. dictionary) and and inner one, both their static types must be updated (by the static type migration, and again by the entitlements migration).
Both the static type migration (https:/onflow/cadence/pull/3141/files#diff-1b50282851963b6af7e8c0baf33a448c9b6fdb4d405de6182626484310c8b5ffR151-R168) and the entitlements migration (

case *interpreter.DictionaryValue:
elementType := v.Type
entitledElementType, err := ConvertToEntitledType(inter, elementType)
if err != nil {
return nil, err
}
if entitledElementType != nil {
var keysAndValues []interpreter.Value
iterator := v.Iterator()
for {
keyValue, value := iterator.Next(inter)
if keyValue == nil {
break
}
keysAndValues = append(keysAndValues, keyValue)
keysAndValues = append(keysAndValues, value)
}
return interpreter.NewDictionaryValueWithAddress(
inter,
interpreter.EmptyLocationRange,
entitledElementType.(*interpreter.DictionaryStaticType),
v.GetOwner(),
keysAndValues...,
), nil
}
) create/return a new/replacement value, and do not recursively call the migration.

From what I can see (but I might be missing something), the "outer" storage migration (StorageMigration) only mutates the existing value to be migrated (

case *interpreter.DictionaryValue:
dictionary := value
type keyValuePair struct {
key, value interpreter.Value
}
// Read the keys first, so the iteration wouldn't be affected
// by the modification of the nested values.
var existingKeysAndValues []keyValuePair
iterator := dictionary.Iterator()
for {
key, value := iterator.Next(nil)
if key == nil {
break
}
existingKeysAndValues = append(
existingKeysAndValues,
keyValuePair{
key: key,
value: value,
},
)
}
for _, existingKeyAndValue := range existingKeysAndValues {
existingKey := existingKeyAndValue.key
existingValue := existingKeyAndValue.value
newKey := m.MigrateNestedValue(
storageKey,
storageMapKey,
existingKey,
valueMigrations,
reporter,
)
newValue := m.MigrateNestedValue(
storageKey,
storageMapKey,
existingValue,
valueMigrations,
reporter,
)
if newKey == nil && newValue == nil {
continue
}
// We only reach here at least one of key or value has been migrated.
var keyToSet, valueToSet interpreter.Value
if newKey == nil {
keyToSet = existingKey
} else {
// Key was migrated.
// Remove the old value at the old key.
// This old value will be inserted again with the new key, unless the value is also migrated.
existingKey = legacyKey(existingKey)
oldValue := dictionary.Remove(
m.interpreter,
emptyLocationRange,
existingKey,
)
if _, ok := oldValue.(*interpreter.SomeValue); !ok {
panic(errors.NewUnexpectedError(
"failed to remove old value for migrated key: %s",
existingKey,
))
}
keyToSet = newKey
}
if newValue == nil {
valueToSet = existingValue
} else {
// Value was migrated
valueToSet = newValue
}
dictionary.Insert(
m.interpreter,
emptyLocationRange,
keyToSet,
valueToSet,
)
}
).
I'd assume to be that the outer storage migration code either
performs a post-order "traversal"/migration, i.e. migrates the child values of the container first, and then updates the outer replacement (if any); or
performs a pre-order "traversal"/migration, i.e. migrates the value (in this case container) first, then the children

I wasn't quite sure if we properly handle it and it would be good to have a test case for it (maybe we have one already)

@SupunS
Copy link
Member

SupunS commented Feb 28, 2024

Added some tests and improved the comments here: #3142

@turbolent turbolent mentioned this issue Feb 29, 2024
6 tasks
@turbolent
Copy link
Member Author

@janezpodhostnik @fxamacker I reran the util with race detector enabled on the machine and this time it reported data races while grouping accounts. Opened onflow/flow-go#5485. Could you please have a look?

@turbolent
Copy link
Member Author

turbolent commented Feb 29, 2024

Adding atree storage health checks to the tests of the migrations in #3144 shows errors. Also opened onflow/flow-go#5486 in flow-go and it reports problems as well.

My hunch is that the container re-creation, as done by the static type and entitlement migration, is incorrect: We're just creating a new dictionary, and insert the elements of the container directly into it – we probably need to properly remove them from the old container, which transfers them to the "stack", and then insert them into the new container

@fxamacker
Copy link
Member

@janezpodhostnik @fxamacker I reran the util with race detector enabled on the machine and this time it reported data races while grouping accounts. Opened onflow/flow-go#5485. Could you please have a look?

@turbolent I opened PR to fix the data race and added a test to reproduce it:

The PR targets main branch since it is a bugfix.

@turbolent
Copy link
Member Author

Running the migration with the latest fixes also reports a "Go panic" for an out-of-bound index:

ERR failed to run StorageMigration in account 9eca2b38b18b5dfe, domain contract, key FlowDKG: runtime error: index out of range [0] with length 0 migration=cadence-value-migration migration_index=0

@SupunS Could we maybe improve the storage migration to include a stack trace when we recover from panics, so we can pinpoint the problem?

The migration also runs in to "slab not found" errors, but those are probably due to the bugs we have already reported by the atree storage health checks:

ERR failed to run StorageMigration in account 9eca2b38b18b5dfe, domain contract, key FlowIDTableStaking: slab (0x9eca2b38b18b5dfe.49614) not found: slab not found for stored value migration=cadence-capability-value-migration migration_index=0

Finally, the progress logs for storing the final trie seemed odd, cc @onflow/flow-cadence-execution:

INF storing 9-th sub trie roots progress 6906506/5755421 (120.0%) elapsed: 4m17s, eta 0s checkpoint_file=/var/flow/c1/root.checkpoint trie_count=1 version=6

@turbolent
Copy link
Member Author

turbolent commented Mar 4, 2024

@fxamacker In the migrations, we need to often update the static types of values, including containers (array, dictionaries, composites), i.e. the type info of atree arrays and ordered maps.

Do you think we could add support for changing the type info of atree arrays and ordered maps?

I saw that OrderedMap has Type() to get the TypeInfo, which in turn uses MapSlab.ExtraData(). There is also a MapSlab.SetExtraData(), but it's not (yet) exposed via a OrderedMap.SetType().

@fxamacker
Copy link
Member

Do you think we could add support for changing the type info of atree arrays and ordered maps?

@turbolent Yes, I can add Array.SetType() and OrderedMap.SetType() in atree.

Are these functions needed for both atree-inlining branch and master branch in atree repo?

@turbolent
Copy link
Member Author

turbolent commented Mar 5, 2024

@fxamacker Awesome! That would be extremely useful 🙏

Are these functions needed for both atree-inlining branch and master branch in atree repo?

Yes, those functions will be needed for both the current atree version used on master (higher priority), and in the register-inlining version of atree used in the atree register-inlining branch (we can port the functions there later)

@fxamacker
Copy link
Member

Do you think we could add support for changing the type info of atree arrays and ordered maps?

@turbolent I opened PRs at onflow/atree to add SetType for atree containers (without inlining):

Yes, those functions will be needed for both the current atree version used on master (higher priority)

Since Cadence master branch is using atree feature/stable-cadence branch, I made the new PRs target feature/stable-cadence branch (instead of master).

github.com/onflow/atree v0.6.1-0.20230711151834-86040b30171f

in the register-inlining version of atree used in the atree register-inlining branch (we can port the functions there later)

The same functions probably can't be ported to atree-inlining version because inlined container needs to notify parent container so parent container slab can be persisted with new inlined static type information.

I will open new PRs to add SetType implemented differently for atree-inlining version.

@turbolent
Copy link
Member Author

@fxamacker Awesome! Thank you for adding these 👏 👌

@turbolent
Copy link
Member Author

The TN state service account is massive (5GB!), because of the random beacon history, see https://discord.com/channels/613813861610684416/1215015019260158022. I upgraded my machine at home to 64GB of memory and that still doesn't seem to be enough.

I'll try to add a migration to the beginning of the migration pipeline which prunes the random beacon history.

@turbolent
Copy link
Member Author

turbolent commented Mar 10, 2024

All items so far have been addressed, and this issue is getting long.

Closing and opened #3162 for follow-up work that was discovered

@fxamacker
Copy link
Member

Nice! 👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants