
nvs: possibility of losing data #34722

Closed · Laczen opened this issue Apr 30, 2021 · 9 comments · Fixed by #34957
Labels: area: Storage · bug · priority: high

Comments

Laczen (Collaborator) commented Apr 30, 2021

Describe the bug
When NVS performs garbage collection to free up space for new writes, it finishes by erasing a flash sector. If that erase is interrupted by a power failure, the subsequent NVS startup will detect data in the partially erased sector and restart a GC operation on it (erasing the already copied data). Because the erase was in progress, the data found in that sector may no longer be valid.

The result is the loss of data that was valid before the interrupted erase.

Expected behavior
No data should be lost even if a sector erase is interrupted.

Impact
This could result in a variety of malfunctions, depending on how the NVS file system is used.
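
For context, here is a minimal sketch of one possible mount-time guard against this failure mode, applied to a sector that is expected to be empty. It is an illustration, not necessarily the approach taken in the eventual fix (#34957). `flash_read()` and `flash_erase()` are the standard Zephyr flash driver calls; the helper name, buffer size, and `erase_value` parameter are assumptions.

```c
/* Illustrative sketch: detect an interrupted sector erase at mount time
 * by checking that the sector reads back entirely as the erase value,
 * and redo the erase otherwise. Assumes sector_size is a multiple of
 * sizeof(buf). Helper name and parameters are hypothetical. */
#include <drivers/flash.h>

static int complete_interrupted_erase(const struct device *dev, off_t start,
				      size_t sector_size, uint8_t erase_value)
{
	uint8_t buf[32];

	for (size_t off = 0; off < sector_size; off += sizeof(buf)) {
		int rc = flash_read(dev, start + off, buf, sizeof(buf));

		if (rc < 0) {
			return rc;
		}
		for (size_t i = 0; i < sizeof(buf); i++) {
			if (buf[i] != erase_value) {
				/* Erase was interrupted: finish it now,
				 * before any recovery logic trusts the
				 * sector's contents. */
				return flash_erase(dev, start, sector_size);
			}
		}
	}
	return 0; /* sector is fully erased */
}
```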

@Laczen added the bug label on Apr 30, 2021
@carlescufi added the priority: high and area: Storage labels on Apr 30, 2021
carlescufi (Member) commented:

@Laczen do you have a fix in the works?

Laczen (Collaborator, Author) commented Apr 30, 2021

> @Laczen do you have a fix in the works?

To tell you the truth, I don't. I don't even know if it's possible (I've been challenging myself all morning).

lemrey (Collaborator) commented May 3, 2021

It'll be hard without storing metadata for the sectors. If you had that, you could record what state each sector is in and how it is being used.

Laczen (Collaborator, Author) commented May 3, 2021

@lemrey, I agree.

One possibility I see is to add a special ATE (id = 0xfffe) indicating that GC of the previous sector has been completed (see the sketch after this comment).

Another possibility would be to change the storage method so that the first ATE written (at the end of a sector) is an opening ATE containing a number that indicates the sector order. In that case the erase could be delayed until the new sector is needed. This could also speed up startup, because only one entry per sector needs to be read.

This is something I have done in a non-volatile storage system (sfcb: small/simple flash circular buffer) that I created some time ago, but I never proposed it for Zephyr. sfcb probably still needs some improvements (e.g. simultaneous writes of multiple data items, reverse traversal of the file system), but it has some nice features that are missing in NVS (optional GC, single-flash-sector file systems, streaming writes, ...). Rewriting NVS to use sfcb as its flash storage would be trivial.
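
To make the first option concrete, here is a rough sketch of such a marker, using the nvs_ate layout from subsys/fs/nvs/nvs_priv.h. The constant and helper names are illustrative, not identifiers from the eventual patch.

```c
/* Sketch of the proposed "gc done" marker: after GC has copied all live
 * data out of the old sector, write a special ATE before starting the
 * erase. At mount, a missing marker reveals an interrupted GC. The id
 * value follows the proposal above; helper names are hypothetical. */
#include <stdint.h>
#include <fs/nvs.h> /* struct nvs_fs */

#define NVS_GC_DONE_ID 0xFFFE

struct nvs_ate {
	uint16_t id;     /* data id; 0xFFFE reserved for the gc-done marker */
	uint16_t offset; /* data offset within the sector */
	uint16_t len;    /* data length; 0 for the marker */
	uint8_t part;    /* part of multipart data - future extension */
	uint8_t crc8;    /* CRC-8 over the entry */
};

static int nvs_add_gc_done_ate(struct nvs_fs *fs)
{
	struct nvs_ate gc_done_ate = {
		.id = NVS_GC_DONE_ID,
		.len = 0U,
	};

	nvs_ate_crc8_update(&gc_done_ate);          /* hypothetical helper */
	return nvs_flash_ate_wrt(fs, &gc_done_ate); /* hypothetical helper */
}
```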

nvlsianpu (Collaborator) commented:

> One possibility I see is to add a special ATE (id = 0xfffe) indicating that GC of the previous sector has been completed.

That is actually quite common behavior among storage systems and will certainly work. The open-ATE-counter option also sounds reasonable.

nvlsianpu (Collaborator) commented:

If we add a new metadata record for marking a sector as active, this can be made backward compatible with existing NVS volumes (sectors written before the patch simply will not have such a record). We just need to add this record before the first data_wra; a minimal sketch of the corresponding mount check follows this comment.

Regarding porting a new storage backend: probably a good idea, especially as it adds missing features. But that should be a separate topic; we need to patch what we already have.
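
The sketch below builds on the gc-done marker sketched earlier; nvs_sector_find_ate() is a hypothetical helper, not part of the NVS sources.

```c
/* Sketch: at mount, look for the gc-done marker in the sector that was
 * last garbage-collected. Pre-patch volumes never contain the marker,
 * so its absence must fall back to the old (conservative) recovery
 * path rather than being treated as corruption. */
#include <stdbool.h>

static bool nvs_sector_has_gc_done(struct nvs_fs *fs, uint32_t sector)
{
	struct nvs_ate ate;

	/* nvs_sector_find_ate() is assumed to scan the sector's ATE
	 * region for a valid entry with the given id. */
	if (nvs_sector_find_ate(fs, sector, NVS_GC_DONE_ID, &ate) == 0) {
		return true; /* marker present: previous GC completed */
	}
	return false; /* pre-patch sector, or GC/erase was interrupted */
}
```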

nvlsianpu self-assigned this on May 4, 2021
Laczen (Collaborator, Author) commented May 4, 2021

I see an opportunity to improve nvs. Zephyr has changed a lot since nvs was introduced.

Regarding the start of a new sector, it is possible to replace the sector close item with a sector open item while maintaining compatibility (the sector close ATE is effectively optional, as we already needed a way to recover when the close ATE was badly written).

In the rework I would introduce a lower-level API (nvs_ll_xxx) that would enable streaming writes, optional GC, etc., and work more like FCB. Nothing would change in the on-flash layout, so compatibility with existing NVS systems would be preserved. A sketch of the sector open idea follows below.
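
To illustrate the sector open idea, here is a speculative sketch of an opening entry carrying an ordering counter. This is not code from NVS or sfcb; the field names and the wrap-safe comparison are assumptions.

```c
/* Speculative sketch: the first ATE written in a fresh sector carries a
 * monotonically increasing sequence number, so mount can order sectors
 * by reading one entry per sector, and erasing the oldest sector can be
 * deferred until its space is actually needed. */
#include <stdbool.h>
#include <stdint.h>

struct nvs_sector_open_ate {
	uint16_t id;   /* reserved id marking a sector-open entry */
	uint16_t seq;  /* sector ordering counter */
	uint16_t len;  /* 0: the entry carries no payload */
	uint8_t part;
	uint8_t crc8;  /* CRC-8 over the entry */
};

/* Wrap-safe ordering: the active sector at mount is the one whose valid
 * open ATE has the newest sequence number. */
static bool seq_newer(uint16_t a, uint16_t b)
{
	return (int16_t)(a - b) > 0;
}
```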

There is only one thing that still bothers me: it seems that a write can be interrupted with nothing written to flash (the bytes remain 0xff). I don't understand how this is possible. Could this be something in the HAL layer?

carlescufi (Member) commented:

> I see an opportunity to improve nvs. Zephyr has changed a lot since nvs was introduced.

NVS is currently used in production in many projects, so a backwards-compatible fix to the current NVS needs to happen; a file system must provide that level of certainty and maintenance over time. Perhaps the best approach would be to fix this issue in the current NVS and then start a new "nvs2" project, similar to what was done with "tcp2", that can be developed in parallel. Does that sound reasonable @Laczen?

Laczen (Collaborator, Author) commented May 5, 2021

> NVS is currently used in production in many projects, so a backwards-compatible fix to the current NVS needs to happen; a file system must provide that level of certainty and maintenance over time. Perhaps the best approach would be to fix this issue in the current NVS and then start a new "nvs2" project, similar to what was done with "tcp2", that can be developed in parallel. Does that sound reasonable @Laczen?

OK, I will work on a fix for the present NVS and (slowly) start working on nvs2.

Laczen added a commit to Laczen/zephyr that referenced this issue May 7, 2021
Fix the possibility of losing data after startup as a result of a badly
erased sector.

Fixes zephyrproject-rtos#34722.

Signed-off-by: Laczen JMS <[email protected]>
galak pushed a commit that referenced this issue May 10, 2021
Fix the possibility of losing data after startup as a result of a badly
erased sector.

Fixes #34722.

Signed-off-by: Laczen JMS <[email protected]>