This repository has been archived by the owner on May 24, 2022. It is now read-only.

[RFC] Schema migration tool #148

Closed
wants to merge 1 commit into from

Conversation

marcoguerri
Contributor

Proposal for a schema migration tool, based on `golang-migrate/migrate`.

Each migration is represented by a Task structure, which dictates
which schema changes should be applied and defines in code how
data should be migrated from the previous to the new schema. So, the migration
process consists of two parts:

* Schema migration: this is done via the `golang-migrate/migrate` library.
SQL for forward and backward migration is defined in Task structures.
The schema is versioned based on the version number returned by the
First/Next/Prev functions implemented by `Task` structures. Migrations
are applied in ascending order by `golang-migrate/migrate`, which keeps
track of the current version of the schema, and whether it was
applied cleanly, in a dedicated table (`schema_migrations`).
* Data migration: after a schema has been migrated, data might need
to be migrated as well. This is also done in code by `Task` structs.
The same tracking approach used for schema migrations will be
implemented for data migrations: there will be a `data_migrations` table which
keeps track of which schema version the data has been migrated to.
Schema migration and data migration versions need to be aligned,
even though it is not mandatory to define code to implement a data
migration.
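
For illustration only, a minimal Go sketch of what a `Task` along these lines could look like. This is not the PR's actual code; the names and signatures are hypothetical, loosely modeled on the version-ordering interface of `golang-migrate/migrate`:

```go
package migration

import "database/sql"

// Task is a hypothetical shape for a single migration step: it carries the
// forward/backward SQL used by golang-migrate/migrate and, optionally,
// Go code to backfill data once the schema change has been applied.
type Task interface {
	// First, Next and Prev expose the version sequence so that
	// migrations can be applied in ascending order.
	First() (version uint, err error)
	Next(version uint) (next uint, err error)
	Prev(version uint) (prev uint, err error)

	// UpSQL and DownSQL return the raw SQL for the forward and
	// backward schema migration of the given version.
	UpSQL(version uint) (string, error)
	DownSQL(version uint) (string, error)

	// MigrateData backfills data for the given version; it may be a
	// no-op when only the schema changes.
	MigrateData(version uint, db *sql.DB) error
}
```
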
@marcoguerri
Contributor Author

marcoguerri commented Oct 7, 2020

This is still very much a work in progress and it's meant to be only an RFC for the approach adopted. For what it's worth, example output of the tool:

```
@debian [Exit 1, 44s] ~/contest ~ at 11:23 [Wed 7] (schema_migration) $ go run tools/migration/rdbms/main.go --from 1 --to 3
[2020-10-07T11:24:30+02:00]  INFO contest: beginning schema migration from 1 to 3
[2020-10-07T11:24:30+02:00]  INFO contest: beginning migration v1 -> v3
[2020-10-07T11:24:30+02:00]  WARN contest: could not get the currrent active migration version: no migration
Task: add test_name_modified to test_events [====================================================================] 100%
[2020-10-07T11:25:14+02:00]  INFO contest: migration v1 completed!
Task: add event_name_modified to test_events [====================================================================] 100%
[2020-10-07T11:25:57+02:00]  INFO contest: migration v2 completed!
[2020-10-07T11:25:57+02:00] FATAL contest: could not run migration v2: no next version for v2
exit status 1
```

Note that the `data_migrations` table has not been implemented yet.

I was planning to keep it simple and not record a full history of past migrations. My priority is to have something basic in place to unblock #118, #119 and #121. If we reach consensus on the approach above, it can be extended later in my opinion, but I am open to implementing full history support from the beginning if this is considered necessary.

@xaionaro
Contributor

xaionaro commented Oct 7, 2020

Schema migration: this is done via the golang-migrate/migrate library.

Just a useless thought:

In my practice the tool https://github.com/golang-migrate/migrate/tree/master/cmd/migrate sometimes reported very obscure errors and it was quite time-consuming to understand what I was doing wrong. Though maybe this time we will find time to help improve this library :)
I'm not saying we should not use this library. Just thinking out loud.

Proposal for a schema migration tool

First thought:
Heh, you decided to implement a tool from scratch? Out of curiosity: why can't we encode the data migration in SQL (why do we need to do it in Golang)?

Second thought:
We should explicitly return an error on an attempt to go "down" if the migration does not support it. If I understand the draft correctly, such cases are currently just ignored.
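
A small, purely illustrative sketch of how an unsupported "down" step could fail loudly instead of being ignored (the `Task` shape and the names here are hypothetical, not from the PR):

```go
package migration

import (
	"errors"
	"fmt"
)

// ErrDownNotSupported signals that a migration cannot be reverted, rather
// than silently returning an empty backward step.
var ErrDownNotSupported = errors.New("down migration not supported")

// addExtendedDescriptor is a made-up migration with no clean way back.
type addExtendedDescriptor struct{}

func (t *addExtendedDescriptor) DownSQL(version uint) (string, error) {
	return "", fmt.Errorf("v%d: %w", version, ErrDownNotSupported)
}
```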

Basically the main question is actually: why do we need to implement a tool from scratch? Just in case: it is not a criticism, just asking for more context. I think I initially misunderstood and I thought you were thinking about applying an existing solution.

@marcoguerri
Contributor Author

marcoguerri commented Oct 8, 2020

Thank you for looking into this :)

This is very much an attempt to better understand the requirements for our use case. In fact, I should have started with requirements, but I have limited experience in this space and with the available tooling, so I needed to go through this experiment first.

Heh, you decided to implement a tool from scratch? Out of curiosity: why can't we encode the data migration in SQL (why do we need to do it in Golang)?

This is the most important point to address, so I'll focus on this one only. My intention was to support both schema migration and data backfill in a single Go module/plugin. For #118 (and I imagine for many other cases), we will need to backfill the newly created column, and that can happen only with custom logic which is aware of ConTest structures.

In order to bundle schema migration and data backfill logic, I implemented them together as part of the `Task` interface.

However, I am not happy with the result: in this implementation, the library is not aware in any way of the data backfill process (so basically, it does not support writing the migration step in Go). It is added to the `Task` interface as a `MigrateData` method, but it's not being tracked in any way by the library (e.g. I would need to track whether data has been migrated or not with a dedicated table). Furthermore, the data backfill should happen in most cases when the schema has already been migrated, and the new version is already running: so, it should be modeled as a separate migration.

I looked into https://github.com/pressly/goose, and I believe it is better suited for this task, as it natively supports both SQL and Go migrations. So the process to "backfill column B by taking column A and applying custom transformation" can be modeled as a migration step tracked by the library. A problem I see with that library is that Go migrations at the moment can run only within a transaction, which might cause downtime (which should be just fine for ConTest, though if we can avoid it, why not; it should be relatively easy to add support for non-transactional updates).
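
For context, this is roughly how a Go data migration registers with `pressly/goose`, following goose's documented `AddMigration` pattern as of 2020; the function names and bodies here are placeholders, not the actual backfill logic:

```go
package migrations

import (
	"database/sql"

	"github.com/pressly/goose"
)

func init() {
	// goose records this step in its version table, exactly like a plain
	// SQL migration, so schema and data steps share one history.
	goose.AddMigration(upBackfillExtendedDescriptor, downBackfillExtendedDescriptor)
}

// upBackfillExtendedDescriptor receives the *sql.Tx that goose opens for Go
// migrations: this is the "runs only within a transaction" constraint
// mentioned above.
func upBackfillExtendedDescriptor(tx *sql.Tx) error {
	// Placeholder: the real step would read rows, transform them in Go
	// and write them back.
	_, err := tx.Exec(`UPDATE jobs SET extended_descriptor = descriptor WHERE extended_descriptor IS NULL`)
	return err
}

func downBackfillExtendedDescriptor(tx *sql.Tx) error {
	// The backfill is additive; dropping the column reverts it.
	return nil
}
```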

An additional comment: I have not considered online schema change tools, as I assumed most of the schema migrations we would need to support can natively happen online. Anything more complex for which this is not the case might be tackled with an ad-hoc process.

@xaionaro
Contributor

xaionaro commented Oct 8, 2020

we will need to backfill the newly created column, and that can happen only with custom logic which is aware of ConTest structures.

Still, not an objection, just asking for more details:

I still do not see what we gain by writing Go code instead of SQL code in this example. I would see the point if:

  • We just use SQL (without Go) to describe a migration: START TRANSACTION, ALTER, UPDATE, COMMIT.
  • We describe the migration purely in Go in a dialect-agnostic way.

But we currently have dialect-dependent code which still requires Go, while in a simple SQL file we could be aware of ConTest structures without Go as well. Yes, it would be two sources of truth about the structures, but you have two of them anyway in the proposed variant, AFAICS. If you want to have one, then you need to use ORM-based migrations or something like that.

In total: could you rephrase the point?

backfill column B by taking column A and applying custom transformation

Why can't we do this custom transformation in pure MySQL?

Furthermore, the data backfill should happen in most cases when the schema has already been migrated, and the new version is already running: so, it should be modeled as a separate migration.

I think schema and data migration should go together in one transaction: to transition the state of the DB from one good state to another good state (without intermediate states).

A problem I see with that library is that Go migrations at the moment can run only within a transaction, which might cause downtime (which should be just fine for ConTest, though if we can avoid it, why not; it should be relatively easy to add support for non-transactional updates).

What do you mean by non-transactional updates? In InnoDB every update is a transaction, AFAIK. And in MyISAM it's even worse: every update is a global lock. @dimkup : do you have any comments here?

@marcoguerri
Contributor Author

marcoguerri commented Oct 8, 2020

but you have two of them anyway in the proposed variant, AFAICS

Not sure I follow here: why do I have two sources of truth? I would use core ConTest structs to support the data backfill part of the migration.

In total: could you rephrase the point?

In #118 we would need to:

  • Create the extended_descriptor column
  • For each entry in jobs table:
    • Read jobs.descriptor field, de-serialize into the corresponding golang struct
    • Read jobs.testssteps field, de-serialize into the corresponding golang struct
    • Build an ExtendedDescriptor golang object based on the two deserializations above (basically, merge them), and serialize it into the new extended_descriptor column (see the sketch after this list).
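
A rough, hypothetical sketch of that backfill step in Go; the struct and column names are taken from the discussion above and may not match the actual code in #118:

```go
package backfill

import (
	"database/sql"
	"encoding/json"
)

// JobDescriptor and TestStepDescriptors stand in for the real ConTest
// structs that jobs.descriptor and jobs.testssteps de-serialize into.
type JobDescriptor struct{ /* ... */ }
type TestStepDescriptors struct{ /* ... */ }

// ExtendedDescriptor is the merge of the two, serialized into the new
// jobs.extended_descriptor column.
type ExtendedDescriptor struct {
	JobDescriptor
	TestStepDescriptors TestStepDescriptors
}

func backfillExtendedDescriptor(db *sql.DB) error {
	rows, err := db.Query(`SELECT job_id, descriptor, testssteps FROM jobs`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var descriptorJSON, stepsJSON []byte
		if err := rows.Scan(&id, &descriptorJSON, &stepsJSON); err != nil {
			return err
		}
		var ext ExtendedDescriptor
		// De-serialize both legacy fields and merge them into one struct.
		if err := json.Unmarshal(descriptorJSON, &ext.JobDescriptor); err != nil {
			return err
		}
		if err := json.Unmarshal(stepsJSON, &ext.TestStepDescriptors); err != nil {
			return err
		}
		merged, err := json.Marshal(ext)
		if err != nil {
			return err
		}
		if _, err := db.Exec(`UPDATE jobs SET extended_descriptor = ? WHERE job_id = ?`, merged, id); err != nil {
			return err
		}
	}
	return rows.Err()
}
```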

The reasons why these changes are necessary are described in https://github.com/facebookincubator/contest/pull/118. I can explain the rationale again, but probably a better approach is to agree that the above is a reasonable migration flow that we need to support.

This migration is probably not even a complex one; more complex transformations might be necessary.

Why can't we do this custom transformation in pure MySQL?

Would you do the migration above in SQL?

I think schema and data migration should go together in one transaction: to transition the state of the DB from one good state to another good state (without intermediate states).

Take for example the case above. ConTest v2 will use jobs.extended_descriptor, ConTest v1 will use jobs.descriptor (e.g. for Status call). ConTest v2 can start running only when jobs.extended_descriptor is populated. At the same time, if you migrate jobs.descriptor -> jobs.extended_descriptor on ConTest v1 online, the running v1 instance will keep on populating jobs.descriptor, which you would need to migrate afterwards.

In my opinion, this can be solved with an ad-hoc migration (e.g. double write in the transition phase). But I don't think we want to go down this route. So the alternative for a clean migration in my opinion is:

  • Shutdown v1
  • Migrate transactionally
  • Bring up v2

The alternative without downtime:

  • Run v1
  • Migrate schema
  • Bring up v2 (Status calls for old jobs will fail)
  • Migrate data

So, schema migration + data backfill would happen at two separate times in the alternative approach.

What do you mean by non-transactional updates? In InnoDB every update is a transaction, AFAIK. And in MyISAM it's even worse: every update is a global lock. @dimkup : do you have any comments here?

Here I was referring specifically to the data backfill flow. The golang library enforces that a `tx *sql.Tx` object is used to run any manipulation against the database. In the case above, where you need to backfill a possibly large number of records, one can design the migration code so that it does not need to run within a transaction.
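
For what it's worth, a hypothetical sketch of such a non-transactional backfill: the table is processed in id ranges, with each batch committed on its own, so a failed run can simply be resumed (the column names and batching scheme are assumptions):

```go
package backfill

import "database/sql"

// backfillWithoutTransaction walks the jobs table in id ranges. Each batch
// is an ordinary auto-committed statement, so there is no single
// long-running transaction and an interrupted run can be restarted safely.
func backfillWithoutTransaction(db *sql.DB, batchSize int64) error {
	var maxID sql.NullInt64
	if err := db.QueryRow(`SELECT MAX(job_id) FROM jobs`).Scan(&maxID); err != nil {
		return err
	}
	if !maxID.Valid {
		return nil // empty table, nothing to backfill
	}
	for from := int64(0); from <= maxID.Int64; from += batchSize {
		if err := backfillRange(db, from, from+batchSize); err != nil {
			return err
		}
	}
	return nil
}

// backfillRange stands in for the per-batch work; in practice it would apply
// the Go-side merge of descriptor and test steps sketched earlier.
func backfillRange(db *sql.DB, from, to int64) error {
	_, err := db.Exec(
		`UPDATE jobs SET extended_descriptor = descriptor
		 WHERE job_id > ? AND job_id <= ? AND extended_descriptor IS NULL`,
		from, to,
	)
	return err
}
```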

At this point I think we need to clarify whether we need to support a data backfill mechanism as part of the migration (I think we do) and how we would model that.

@xaionaro
Contributor

xaionaro commented Oct 8, 2020

but you have two of them anyway in the proposed variant, AFAICS

Not sure I follow here: why do I have two sources of truth? I would use core ConTest structs to support the data backfill part of the migration.

I meant that the schema migration itself is presented as a static SQL query, therefore it is the second source of truth about the structure.

In total: could you rephrase the point?

[ ... ]

  • Read jobs.descriptor field, de-serialize into the corresponding golang struct
  • Read jobs.testssteps field, de-serialize into the corresponding golang struct

OK, now I see the point :)
Yeah. Indeed, in ConTest our data is deeply not-normalized and therefore it might not be that easy to handle the data with pure SQL. So the point is not about sources of truth, but just about handy tooling to handle highly non-normalized data (and even worse: with dynamic structures). Makes sense.

Why can't we do this custom transformation in pure MySQL?

Would you do the migration above in SQL?

I could have, but I already see: it seems Go-based implementations of data migration are indeed the least-bad solution. It was just a lack of context in my head, sorry.

I think schema and data migration should go together in one transaction: to transition the state of the DB from one good state to another good state (without intermediate states).

Take for example the case above. ConTest v2 will use jobs.extended_descriptor, ConTest v1 will use jobs.descriptor (e.g. for Status call). ConTest v2 can start running only when jobs.extended_descriptor is populated. At the same time, if you migrate jobs.descriptor -> jobs.extended_descriptor on ConTest v1 online, the running v1 instance will keep on populating jobs.descriptor, which you would need to migrate afterwards.

In my opinion, this can be solved with an ad-hoc migration (e.g. double write in the transition phase). But I don't think we want to go down this route. So the alternative for a clean migration in my opinion is:

  • Shutdown v1
  • Migrate transactionally
  • Bring up v2

The alternative without downtime:

  • Run v1
  • Migrate schema
  • Bring up v2 (Status calls for old jobs will fail)
  • Migrate data

So, schema migration + data backfill would happen at two separate times in the alternative approach.

In my opinion, guarantees of a clean state and predictable behavior are much more important than avoiding a few seconds of downtime. What if "Migrate data" fails? What if something happens between "Migrate schema" and "Migrate data", and "Migrate data" does not fail but does the job wrongly? We cannot guarantee anything while "migrate schema" and "migrate data" are not performed in a single transaction. I understand these are unlikely scenarios, but it still makes me feel bad about this approach :)

Migrating non-breaking changes (which will be the vast majority of the cases):

  • Migrate transactionally (usually just add a field with a DEFAULT value).
  • Upgrade ConTest version: v1 → v2

Migrating breaking changes:

  • Upgrade ConTest version: v1 → v2 and, roughly speaking, answer with HTTP code "503" (so the client should try again in a few seconds) until the schema revision is high enough.
  • Migrate transactionally.

To avoid having two flows, we can always use the second one.
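
A toy sketch of the second flow's "answer 503 until the schema revision is high enough" idea; the HTTP layer and the names here are assumptions for illustration, not ConTest's actual API:

```go
package server

import (
	"fmt"
	"net/http"
)

// requireSchemaVersion refuses to serve requests until the schema has been
// migrated at least to minVersion, asking clients to retry in a few seconds.
func requireSchemaVersion(minVersion uint, currentVersion func() (uint, error), next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		v, err := currentVersion()
		if err != nil || v < minVersion {
			w.Header().Set("Retry-After", "5")
			http.Error(w, fmt.Sprintf("schema at v%d, need at least v%d", v, minVersion),
				http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```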

The problem is that we do not rotate/partition the data, and the longer ConTest works, the longer each migration will take. But, I think, in the future this problem will be solved (we will need to solve this problem anyway).

What do you mean by non-transactional updates? In InnoDB every update is a transaction, AFAIK. And in MyISAM it's even worse: every update is a global lock. @dimkup : do you have any comments here?

Here I was referring specifically to the data backfill flow. The golang library enforces that a `tx *sql.Tx` object is used to run any manipulation against the database. In the case above, where you need to backfill a possibly large number of records, one can design the migration code so that it does not need to run within a transaction.

I do not follow here, but it does not seem to be important, so let's skip the discussion about this :)

At this point I think we need to clarify whether we need to support a data backfill mechanism as part of the migration (I think we do) and how we would model that.

Personally I have no strong opinion:
The proposed approach looks good, though my personal (and probably terribly wrong) taste would like to use another approach. I think if my response above hasn't changed your mind anywhere then we need to go with your approach (since personally I see no essential reasons why not).

@xaionaro
Contributor

xaionaro commented Oct 8, 2020

@kshvakov : Hello! You are the best person I know in this area. So I'm humbly wondering if you have any extra time to express any opinion you might have :)

The discussion is about DB migrations tooling (in Go).

@dimkup
Contributor

dimkup commented Oct 8, 2020

2 cents from DBA in my head, because he is too loud:)

  • Any DB migration should start from a full database backup and ideally offer a simple way to restore from it
  • Online migrations are hard. If we want to keep ConTest "simple", it's probably the wrong way to go.

That's it, much quieter now :)

@marcoguerri
Contributor Author

marcoguerri commented Oct 9, 2020

OK, now I see the point :)
Yeah. Indeed, in ConTest our data is deeply not-normalized and therefore it might not be that easy to handle the data with pure SQL. So the point is not about sources of truth, but just about handy tooling to handle highly non-normalized data (and even worse: with dynamic structures). Makes sense.

Among the de-normalization issues we have, this is the one that concerns me the least. Lack of referential integrity in various other places concerns me more, and I am hoping we'll be able to fix this soon. Storing the full JSON document here made sense in my opinion to make the job ingestion control path simpler. As long as the structure evolves in a backwards compatible way (I know, I am talking about backwards compatibility in a PR which breaks it. Almost idiotic), and we preserve a schema migration mechanism for when backwards compatibility needs to be broken, it won't be a major source of problems in my opinion.

In my opinion, guarantees of a clean state and predictable behavior are much more important than avoiding a few seconds of downtime. What if "Migrate data" fails? What if something happens between "Migrate schema" and "Migrate data", and "Migrate data" does not fail but does the job wrongly? We cannot guarantee anything while "migrate schema" and "migrate data" are not performed in a single transaction. I understand these are unlikely scenarios, but it still makes me feel bad about this approach :)

I agree with the transactional approach. If ConTest goes down for n minutes, the world will go on. I also think that a backfill of this type could be implemented non-transactionally if the use case required it. It's tempting to solve a problem we don't have; as long as our approach to migrations leaves the way open for non-transactional migrations, with all the necessary safety nets, if we ever need them, I think we can focus on the current problem and move on.

The problem is that we do not rotate/partition the data, and the longer ConTest works, the longer each migration will take. But, I think, in the future this problem will be solved (we will need to solve this problem anyway).

Agree, maybe we should open an issue for this.

would like to use another approach
I need to make sure I understand correctly: would you do this purely in SQL (without Go)?

Now, I think we need to forget about this PR for a moment, as I am not happy with this approach either. I want to put forward another proposal using https://github.com/pressly/goose that would allow the following:

  • Express the migration purely in SQL. This is the preferred approach, and whenever possible it's the recommended one.
  • Express the migration in .go, possibly mixing in raw SQL statements. This becomes necessary in cases like the above where we need to backfill data. By default, it would happen transactionally.

Does this sound good? The library used in this PR is less flexible and uses a custom source to work around the fact that the raw SQL and the related golang backfill logic needed to be decoupled, which was very confusing. The result is still not good anyway. So if we agree at least that we need to support a way to backfill data with pure Go statements, I think the next proposal will be more flexible and better aligned with your preferences (which are reasonable, it's just this PR that sucks :) )

@marcoguerri
Contributor Author

marcoguerri commented Oct 9, 2020

2 cents from DBA in my head, because he is too loud:)

A place where I would not like to be, your head 😜

Thank you for sharing your experience. It's tempting to go YOLO when you haven't been burnt yet 😄

@xaionaro
Contributor

xaionaro commented Oct 9, 2020

OK, now I see the point :)
Yeah. Indeed, in ConTest our data is deeply not-normalized and therefore it might not be that easy to handle the data with pure SQL. So the point is not about sources of truth, but just about handy tooling to handle highly non-normalized data (and even worse: with dynamic structures). Makes sense.

Among the de-normalization issues we have, this is the one that concerns me the least. Lack of referential integrity in various other places concerns me more, and I am hoping we'll be able to fix this soon. Storing the full JSON document here made sense in my opinion to make the job ingestion control path simpler. As long as the structure evolves in a backwards compatible way (I know, I am talking about backwards compatibility in a PR which breaks it. Almost idiotic), and we preserve a schema migration mechanism for when backwards compatibility needs to be broken, it won't be a major source of problems in my opinion.

I agree.

I agree with the transactional approach. If ConTest goes down for n minutes, the world will go on. I also think that a backfill of this type could be implemented non-transactionally if the use case required it. It's tempting to solve a problem we don't have; as long as our approach to migrations leaves the way open for non-transactional migrations, with all the necessary safety nets, if we ever need them, I think we can focus on the current problem and move on.

Makes sense.

The problem is that we do not rotate/partition the data, and the longer ConTest works, the longer each migration will take. But, I think, in the future this problem will be solved (we will need to solve this problem anyway).

Agree, maybe we should open an issue for this.

Just opened an issue: https://github.com/facebookincubator/contest/issues/149

would like to use another approach
I need to make sure I understand correctly: would you do this purely in SQL (without Go)?

Now, I think we need to forget about this PR for a moment, as I am not happy with this approach either. I want to put forward another proposal using https://github.com/pressly/goose that would allow the following:

  • Express the migration purely in SQL. This is the preferred approach, and whenever possible it's the recommended one.
  • Express the migration in .go, possibly mixing in raw SQL statements. This becomes necessary in cases like the above where we need to backfill data. By default, it would happen transactionally.

Does this sound good?

Sounds great. Thank you for the inclusion!

@marcoguerri
Contributor Author

Final implementation merged with #154.
