
nvs: possibility of losing data #34722

Closed · Laczen opened this issue Apr 30, 2021 · 9 comments · Fixed by #34957
Labels: area: Storage · bug · priority: high

Comments

Laczen (Collaborator) commented Apr 30, 2021

Describe the bug
When NVS performs garbage collection to free up space for new writes, it finishes by erasing a flash sector. If that erase is interrupted by a power failure, the subsequent NVS startup will detect data in the partially erased sector and restart a GC operation on it (erasing the already copied data). Because the erase was in progress, the data found in that sector may no longer be valid.

The result is the loss of data that was valid before the interrupted erase.

Expected behavior
No data should be lost even if a sector erase is interrupted.

Impact
This could result in a variety of malfunctions, depending on how the NVS file system is used.
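
For context, here is a minimal sketch of one possible mount-time guard against this failure mode, applied to a sector that is expected to be empty. It is an illustration, not necessarily the approach taken in the eventual fix (#34957). `flash_read()` and `flash_erase()` are the standard Zephyr flash driver calls; the helper name, buffer size, and `erase_value` parameter are assumptions.

```c
/* Illustrative sketch: detect an interrupted sector erase at mount time
 * by checking that the sector reads back entirely as the erase value,
 * and redo the erase otherwise. Assumes sector_size is a multiple of
 * sizeof(buf). Helper name and parameters are hypothetical. */
#include <drivers/flash.h>

static int complete_interrupted_erase(const struct device *dev, off_t start,
				      size_t sector_size, uint8_t erase_value)
{
	uint8_t buf[32];

	for (size_t off = 0; off < sector_size; off += sizeof(buf)) {
		int rc = flash_read(dev, start + off, buf, sizeof(buf));

		if (rc < 0) {
			return rc;
		}
		for (size_t i = 0; i < sizeof(buf); i++) {
			if (buf[i] != erase_value) {
				/* Erase was interrupted: finish it now,
				 * before any recovery logic trusts the
				 * sector's contents. */
				return flash_erase(dev, start, sector_size);
			}
		}
	}
	return 0; /* sector is fully erased */
}
```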

@Laczen added the bug label on Apr 30, 2021
@carlescufi added the priority: high and area: Storage labels on Apr 30, 2021
carlescufi (Member) commented:

@Laczen do you have a fix in the works?

Laczen (Collaborator, Author) commented Apr 30, 2021

> @Laczen do you have a fix in the works?

To tell you the truth, I don't. I don't even know if it's possible (I've been challenging myself all morning).

lemrey (Collaborator) commented May 3, 2021

It'll be hard without storing metadata for the sectors. If you had that, you could record what state each sector is in and how it is being used.

Laczen (Collaborator, Author) commented May 3, 2021

@lemrey, I agree.

One possibility I see is to add a special ATE (id = 0xfffe) indicating that GC of the previous sector has been completed (see the sketch after this comment).

Another possibility would be to change the storage method so that the first ATE written (at the end of a sector) is an opening ATE containing a number that indicates the sector order. In that case the erase could be delayed until the new sector is needed. This could also speed up startup, because only one entry per sector needs to be read.

This is something I have done in a non-volatile storage system (sfcb: small/simple flash circular buffer) that I created some time ago, but I never proposed it for Zephyr. sfcb probably still needs some improvements (e.g. simultaneous writes of multiple data items, reverse traversal of the file system), but it has some nice features that are missing in NVS (optional GC, single-flash-sector file systems, streaming writes, ...). Rewriting NVS to use sfcb as its flash storage would be trivial.
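
To make the first option concrete, here is a rough sketch of such a marker, using the nvs_ate layout from subsys/fs/nvs/nvs_priv.h. The constant and helper names are illustrative, not identifiers from the eventual patch.

```c
/* Sketch of the proposed "gc done" marker: after GC has copied all live
 * data out of the old sector, write a special ATE before starting the
 * erase. At mount, a missing marker reveals an interrupted GC. The id
 * value follows the proposal above; helper names are hypothetical. */
#include <stdint.h>
#include <fs/nvs.h> /* struct nvs_fs */

#define NVS_GC_DONE_ID 0xFFFE

struct nvs_ate {
	uint16_t id;     /* data id; 0xFFFE reserved for the gc-done marker */
	uint16_t offset; /* data offset within the sector */
	uint16_t len;    /* data length; 0 for the marker */
	uint8_t part;    /* part of multipart data - future extension */
	uint8_t crc8;    /* CRC-8 over the entry */
};

static int nvs_add_gc_done_ate(struct nvs_fs *fs)
{
	struct nvs_ate gc_done_ate = {
		.id = NVS_GC_DONE_ID,
		.len = 0U,
	};

	nvs_ate_crc8_update(&gc_done_ate);          /* hypothetical helper */
	return nvs_flash_ate_wrt(fs, &gc_done_ate); /* hypothetical helper */
}
```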

nvlsianpu (Collaborator) commented:

> One possibility I see is to add a special ATE (id = 0xfffe) indicating that GC of the previous sector has been completed.

That is actually quite common behavior among storage systems and will certainly work. The open-ATE-counter option also sounds reasonable.

nvlsianpu (Collaborator) commented:

If we add a new metadata record for marking a sector as active, this can be made backward compatible with existing NVS volumes (sectors written before the patch simply will not have such a record). We just need to add this record before the first data_wra; a minimal sketch of the corresponding mount check follows this comment.

Regarding porting a new storage backend: probably a good idea, especially as it adds missing features. But that should be a separate topic; we need to patch what we already have.
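
The sketch below builds on the gc-done marker sketched earlier; nvs_sector_find_ate() is a hypothetical helper, not part of the NVS sources.

```c
/* Sketch: at mount, look for the gc-done marker in the sector that was
 * last garbage-collected. Pre-patch volumes never contain the marker,
 * so its absence must fall back to the old (conservative) recovery
 * path rather than being treated as corruption. */
#include <stdbool.h>

static bool nvs_sector_has_gc_done(struct nvs_fs *fs, uint32_t sector)
{
	struct nvs_ate ate;

	/* nvs_sector_find_ate() is assumed to scan the sector's ATE
	 * region for a valid entry with the given id. */
	if (nvs_sector_find_ate(fs, sector, NVS_GC_DONE_ID, &ate) == 0) {
		return true; /* marker present: previous GC completed */
	}
	return false; /* pre-patch sector, or GC/erase was interrupted */
}
```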

nvlsianpu self-assigned this on May 4, 2021
Laczen (Collaborator, Author) commented May 4, 2021

I see an opportunity to improve nvs. Zephyr has changed a lot since nvs was introduced.

Regarding the start of a new sector, it is possible to replace the sector close item with a sector open item while maintaining compatibility (the sector close ATE is effectively optional, as we already needed a way to recover when the close ATE was badly written).

In the rework I would introduce a lower-level API (nvs_ll_xxx) that would enable streaming writes, optional GC, etc., and work more like FCB. Nothing would change in the on-flash layout, so compatibility with existing NVS systems would be preserved. A sketch of the sector open idea follows below.
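
To illustrate the sector open idea, here is a speculative sketch of an opening entry carrying an ordering counter. This is not code from NVS or sfcb; the field names and the wrap-safe comparison are assumptions.

```c
/* Speculative sketch: the first ATE written in a fresh sector carries a
 * monotonically increasing sequence number, so mount can order sectors
 * by reading one entry per sector, and erasing the oldest sector can be
 * deferred until its space is actually needed. */
#include <stdbool.h>
#include <stdint.h>

struct nvs_sector_open_ate {
	uint16_t id;   /* reserved id marking a sector-open entry */
	uint16_t seq;  /* sector ordering counter */
	uint16_t len;  /* 0: the entry carries no payload */
	uint8_t part;
	uint8_t crc8;  /* CRC-8 over the entry */
};

/* Wrap-safe ordering: the active sector at mount is the one whose valid
 * open ATE has the newest sequence number. */
static bool seq_newer(uint16_t a, uint16_t b)
{
	return (int16_t)(a - b) > 0;
}
```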

There is only one thing that still bothers me: it seems that a write can be interrupted with nothing written to flash (the bytes remain 0xff). I don't understand how this is possible. Could this be something in the HAL layer?

carlescufi (Member) commented:

> I see an opportunity to improve nvs. Zephyr has changed a lot since nvs was introduced.

NVS is currently used in production in many projects, so a backwards-compatible fix to the current NVS needs to happen; a file system must provide that level of certainty and maintenance over time. Perhaps the best approach would be to fix this issue in the current NVS and then start a new "nvs2" project, similar to what was done with "tcp2", that can be developed in parallel. Does that sound reasonable @Laczen?

Laczen (Collaborator, Author) commented May 5, 2021

> NVS is currently used in production in many projects, so a backwards-compatible fix to the current NVS needs to happen; a file system must provide that level of certainty and maintenance over time. Perhaps the best approach would be to fix this issue in the current NVS and then start a new "nvs2" project, similar to what was done with "tcp2", that can be developed in parallel. Does that sound reasonable @Laczen?

OK, I will work on a fix for the present NVS and (slowly) start working on nvs2.

Laczen added a commit to Laczen/zephyr that referenced this issue May 7, 2021
Fix the possibility of losing data after startup as a result of a badly
erased sector.

Fixes zephyrproject-rtos#34722.

Signed-off-by: Laczen JMS <[email protected]>
galak pushed a commit that referenced this issue May 10, 2021
Fix the possibility of losing data after startup as a result of a badly
erased sector.

Fixes #34722.

Signed-off-by: Laczen JMS <[email protected]>