New non-volatile storage system #77929
Comments
Zephyr platforms have a maximum write size of up to 512 bytes.
@butok I saw this RFC #77576
Architecture WG:
I was thinking about one thing when learning about how NVS works - is separating the data ATE from the actual data worth it? We could make the data more dense by storing ATE-data pairs from the start of the sector.
The advantages would be that the ATE and data could be placed right next to each other, so we waste less space in the case of a larger write block size. On the other hand, the disadvantage is that we would need to do some address calculation to find every data ATE except the first one. But I do not think it would cause a noticeable slowdown - just calculate the ATE start address + ATE size + data length and align it to the next start of a write block.
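The address calculation proposed above can be sketched as a small helper. This is illustrative only; `next_pair_offset` and its arguments are not part of NVS/ZMS, and it assumes the write block size is a power of two.

```c
#include <stdint.h>

/* Illustrative helper (not part of NVS/ZMS): given the flash offset of
 * one ATE, the ATE size and the length of its data, compute where the
 * next interleaved ATE-data pair would start. wbs must be a power of 2. */
static uint32_t next_pair_offset(uint32_t ate_off, uint32_t ate_size,
                                 uint32_t data_len, uint32_t wbs)
{
    uint32_t end = ate_off + ate_size + data_len;

    /* align up to the next write-block boundary */
    return (end + wbs - 1U) & ~(wbs - 1U);
}
```

For example, a 16-byte ATE with 10 bytes of data on a 32-byte write block means the next pair starts one block later.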
Yes it is. It is easier to recover if something happens; otherwise you may write something to the data area that looks like an ATE and glitch the device into attempting to read the storage as that data mandates, or into a loop. Also, if you write in a loop, you may basically wrap the ATE/data storage without a way to figure out where it really ends. It is much easier to keep things working if you keep users out of the area where the metadata of your storage is stored. The same happens with any block-device-oriented FS, where metadata is separated from data streams. Of course there is also a way around that, for example introducing different alphabets for metadata and data, but this means that you end up with some 8-to-N encodings (N > 8) and have to make sure that user data will not get encoded to look like metadata.
OK, I understand the reason now. For the purpose of saving space I really like the small-data-inside-an-ATE feature. For data that is a little larger than 4 bytes, would it be acceptable to write it right after the ATE if the block is large enough? So maybe there could be another rule: if there is enough space in a block right after the ATE for the data, then it would be stored there. Technically this is also mixing ATE and data, but we would search for ATEs only at the start of the blocks anyway. What do you think about such a feature?
This could be done once the multiple format entries feature is added, which means that you could have a different, larger format that holds N bytes of data.
The original design of NVS, and of ZMS here, intends to work with devices with relatively small write block sizes (wbs) that can be appended to without altering other data (unless the area is overwritten); this allows placing metadata in small chunks of constant size and data at variable size, with no mandated boundaries (except the write block size) between data. Because ATEs have the same size, if the wbs becomes large it should be possible to start placing some data in it: for example, if you have a 32-byte wbs and 16 bytes of ATE, then any data of size <= 16 bytes can go into the ATE's write block, and it would not be a problem, as the wbs and the size of the ATE set the boundaries, which means that the ATE and the data in the ATE's write block are still separated. Eventually you may have to erase some part of the storage, but that happens because the device, for example flash, requires it before it can be written. Using a magnetic tape analogy: the erase head has to erase data before the r/w head can write to an area previously used.
I understand that what you are trying to solve in your case, @andrisk-dev, is a problem of a relatively big write block size that equals the erase block size - so you basically have a block device. You can see the difference here: you cannot really append data directly on storage, you basically have to replace entire block contents, unless you are willing to append data at a wbs of sector size. In your case, the scheme you have presented in comment #77929 (comment) could work, if you decide to divide your sector into an ATE part and a data part, assuming that you always write both as a single sector; every sector, even if it carries a continuation of data from the previous sector, has that ATE part reserved and not available for users. Still, you will probably have some unused space wasted.
What I understand is that you are trying to provide your users with small, reliable storage for basic data or settings, but I do not think that this PR will effectively solve your problem, at least not without significant complexity being introduced, as it is basically based on the ability to freely append data at the small granularity of xRAM and small-wbs flash devices, something your device does not provide. We can try to bend it your way, but I would rather focus first on making it a solid solution for the devices it has been originally designed for.
Thanks for your replies @rghaddab @de-nordic, I understand that the first version is to be as simple as possible. I think one solution that would enable us at NXP to make the most of the 512-byte write block size is to have an ATE in a different format - maybe we can call it a long ATE here - which could store information about multiple data records in one place. The format would include the number of data items stored in that ATE, followed by a list of metadata about all of them. That way, even if individually stored data would still be sparse in flash, when relocating the data from an erased sector to a new one we could pack the data much more densely. As this is more of a future release thing, I think the main question for now is how the filesystem would distinguish between the normal entry format and an entry in a different format. I think that should be decided now to make sure the "Support for entries in multiple formats" is possible in the future.
This change is planned as follows:
@rghaddab, this needs to be a requirement on ZMS to not artificially limit the size. Optimize later if needed. Add warnings to make sure the users are aware of the impacts.
Introduction
In recent years, advances in process nodes for embedded hardware have made it necessary to support non-volatile technologies different from the classical on-chip NOR flash, which is written in words but erased in pages. These new technologies do not require a separate erase operation at all, and data can be overwritten directly at any time.
On top of that, the complexity of firmware has not stopped growing, making it necessary to ensure that a solid, scalable storage mechanism is available to all applications. This storage needs to support millions of entries with solid CRC protection and multiple advanced features.
Problem description
In Zephyr, there are currently a few alternatives for non-volatile memory storage:
None of them are optimal for the current new wave of solid-state non-volatile memory technologies, including resistive (RRAM) and magnetic (MRAM) random-access, non-volatile memory, because they rely on the "page erase" abstraction whereas these devices do not require an erase operation at all, and data can be overwritten directly.
Additionally, none of the storage systems above is a good match for the widely used settings subsystem, given that they were never designed to operate as a backend for it.
The closest one is NVS, and an analysis of why it is not suitable can be found in the Alternatives section of this issue.
Proposed change
Create a new storage mechanism that fulfills the following requirements:
no-erase-required flash drivers (i.e. RRAM, MRAM, etc.)
Potential names
Detailed RFC
Proposed change (Detailed)
General behavior:
ZMS divides the memory space into sectors (minimum 2). Each sector is filled with key/value pairs until it is full; the sector is then closed and the storage system moves on to the next sector. When it reaches the end, it wraps back to the first sector after garbage collecting it and erasing its content.
Mounting the FS:
Mounting the filesystem starts by getting the flash parameters and checking that the file system properties are correct (sector_size, sector_count, ...), and then initializes the file system.
Initialization of ZMS:
As ZMS has a fast-forward write mechanism, we must find the last sector and the pointer to the last entry where writing stopped the last time.
It looks for a closed sector followed by an open one; then, within the open sector, it finds (recovers) the last written ATE (Allocation Table Entry).
After that, it checks that the sector following this one is empty; if not, it erases it.
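The recovery scan in the steps above can be sketched over an abstract array of sector states. The names are illustrative, not the real ZMS API; real code would derive each sector's state from its close and empty ATEs rather than from an in-memory array.

```c
#include <stddef.h>

enum sector_state { SECTOR_EMPTY, SECTOR_OPEN, SECTOR_CLOSED };

/* Minimal sketch (illustrative, not the real ZMS API): find the
 * currently open sector by looking for a closed sector followed by a
 * non-closed one, scanning circularly. Returns the index of the open
 * sector, or 0 for fresh storage where no sector is closed yet. */
static size_t find_open_sector(const enum sector_state *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i] == SECTOR_CLOSED && s[(i + 1) % n] != SECTOR_CLOSED) {
            return (i + 1) % n;
        }
    }
    return 0; /* fresh storage: start writing at the first sector */
}
```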
Composition of a sector:
A sector is organized in this form:
The close ATE is used to close a sector when it is full
The empty ATE is used to erase a sector
ATEn entries describe where the data is stored, its size and its crc32
Data is the written value
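Given the layout above, the space left for user records in a fresh sector can be sketched as follows. This assumes the empty ATE and the close ATE each consume one ATE-sized slot (16 bytes each, per the ATE size this RFC states later); the exact placement inside the sector is not specified here.

```c
#include <stdint.h>

#define ATE_SIZE 16U /* per this RFC, an ATE occupies 16 bytes */

/* Sketch: payload bytes available in a fresh sector, assuming the empty
 * ATE and the close ATE each consume one ATE-sized slot. */
static uint32_t sector_capacity(uint32_t sector_size)
{
    return sector_size - 2U * ATE_SIZE;
}
```

Each record written then additionally consumes one ATE plus its data.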
ZMS key/value write:
To avoid rewriting the same data with the same ID, ZMS first looks through all the sectors for an entry with the same ID and compares its data; if the data is identical, no write is performed.
If a write must be performed, an ATE and the data (unless it is a delete) are written in a sector.
If the sector is full (cannot hold the current data + ATE), we move to the next sector, garbage collect the sector after the newly opened one, and then erase it.
Data whose size is smaller than or equal to 4 bytes is written within the ATE.
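The placement rules above can be condensed into a minimal sketch. The helpers `write_cost` and `write_fits` are illustrative, not the real ZMS API, and the assumption that out-of-ATE data is rounded up to the write block size is mine.

```c
#include <stdbool.h>
#include <stdint.h>

#define ATE_SIZE   16U /* per this RFC */
#define INLINE_MAX 4U  /* data <= 4 bytes is stored inside the ATE */

static uint32_t align_up(uint32_t v, uint32_t wbs)
{
    return (v + wbs - 1U) & ~(wbs - 1U);
}

/* Bytes a write consumes in the open sector: one ATE, plus the data
 * rounded up to the write block size when it cannot ride inline. */
static uint32_t write_cost(uint32_t data_len, uint32_t wbs)
{
    uint32_t cost = ATE_SIZE;

    if (data_len > INLINE_MAX) {
        cost += align_up(data_len, wbs);
    }
    return cost;
}

/* A write fits only if room also remains for the closing ATE. */
static bool write_fits(uint32_t free_space, uint32_t data_len, uint32_t wbs)
{
    return write_cost(data_len, wbs) + ATE_SIZE <= free_space;
}
```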
ZMS read (with history):
By default, it looks for the last entry with the given ID and retrieves its data.
If a history count different from 0 is provided, older data with the same ID is retrieved.
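The history lookup can be sketched over a flat, oldest-to-newest array of ATE IDs (illustrative only; real ZMS walks ATEs across sectors on flash):

```c
#include <stdint.h>

/* Sketch: scan newest-to-oldest and return the index of the
 * (hist_count + 1)-th most recent entry with this ID, or -1 if there is
 * no such entry. hist_count == 0 yields the latest value. */
static int find_with_history(const uint32_t *ids, int n_ates,
                             uint32_t id, unsigned int hist_count)
{
    for (int i = n_ates - 1; i >= 0; i--) {
        if (ids[i] == id) {
            if (hist_count == 0) {
                return i;
            }
            hist_count--;
        }
    }
    return -1;
}
```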
ZMS: how does the cycle counter work?
Each sector has a lead cycle counter, a uint8_t that is used to validate all the other ATEs.
The lead cycle counter is stored in the empty ATE.
To become valid, an ATE must have the same cycle counter as the one stored in the empty ATE.
Each time an ATE is moved from one sector to another, it must get the cycle counter of the destination sector.
To erase a sector, the cycle counter of the empty ATE is incremented; all the ATEs in that sector then become invalid.
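The cycle-counter rules above reduce to two tiny helpers (illustrative names, not the real ZMS API; note that the uint8_t counter wraps naturally from 255 back to 0):

```c
#include <stdbool.h>
#include <stdint.h>

/* An ATE is valid only if its cycle counter matches the sector's lead
 * cycle counter (stored in the empty ATE). */
static bool ate_valid(uint8_t ate_cycle, uint8_t lead_cycle)
{
    return ate_cycle == lead_cycle;
}

/* Logical erase: bump the lead counter, invalidating every ATE in the
 * sector with a single small write. Wraps at 255 -> 0. */
static uint8_t next_cycle(uint8_t lead_cycle)
{
    return (uint8_t)(lead_cycle + 1U);
}
```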
ZMS: how to close a sector?
To close a sector, a close ATE is added at the end of the sector; it must have the same cycle counter as the empty ATE.
When closing a sector, all the remaining space that has not been used is filled with garbage data to avoid having old ATEs with a valid cycle counter.
ZMS: triggering the garbage collector
Some applications need to make sure that storage writes have a maximum defined latency.
When calling a ZMS write, the current sector could be almost full, requiring the GC to be triggered in order to switch to the next sector.
This operation is time-consuming and could cause some applications to miss their real-time constraints.
ZMS adds an API for the application to get the current remaining free space in a sector. The application can then decide, when needed, to switch to the next sector if the current one is almost full; this will of course trigger the garbage collection on the next sector. This guarantees the application that the next write won't trigger the garbage collection.
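The intended application-side policy can be sketched like this. The free-space query API is proposed in this RFC but not named, so the helper below is hypothetical: outside its time-critical path, the application pre-triggers the sector switch (and hence garbage collection) whenever the open sector can no longer absorb the worst-case write.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy helper: decide to switch sectors early when the
 * remaining free space cannot absorb the worst-case upcoming write. */
static bool should_switch_sector(uint32_t free_space,
                                 uint32_t worst_case_write_cost)
{
    return free_space < worst_case_write_cost;
}
```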
ZMS structure of ATE (Allocation Table Entries)
An entry has 16 bytes divided among these variables:
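The variable list did not survive in this copy of the issue. Purely as an illustration of how 16 bytes could be divided, one plausible packing is shown below; the field names and ordering are assumptions, not the actual ZMS layout (the RFC only fixes the total size, the uint8_t cycle counter, the crc32 on the data, and the <=4-byte inline-data feature).

```c
#include <stdint.h>

/* Illustrative 16-byte packing only - not the actual ZMS ATE layout. */
struct zms_ate_sketch {
    uint32_t id;            /* record identifier */
    union {
        uint32_t offset;    /* data location when stored outside the ATE */
        uint8_t  data[4];   /* data <= 4 bytes stored inline */
    } loc;
    uint16_t len;           /* data length in bytes */
    uint8_t  cycle_cnt;     /* must match the sector's lead cycle counter */
    uint8_t  crc8;          /* integrity check over the ATE itself */
    uint32_t data_crc;      /* crc32 of the data */
};
```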
ZMS wear leveling feature
This storage system is optimized for devices that do not require an erase.
Storage systems that rely on an erase value (NVS, for example) need to emulate the erase with write operations. This significantly decreases the life expectancy of these devices and adds delays to write operations and to initialization.
ZMS introduces a cycle count mechanism that avoids emulating the erase operation for these devices.
It also guarantees that every memory location is written only once for each cycle of sector write.
Dependencies
Only on flash drivers.
Concerns and Unresolved Questions
The first draft of this new storage system will not include all the features listed in the proposed change section.
This is intended to minimize the effort of reviewing this new storage system for developers that are familiar with the NVS filesystem.
More changes will come in future patches.
Alternatives
The one alternative we considered was to expand the existing NVS codebase in order to remove its described shortcomings. This is in fact how this new proposal was born, once expanding NVS was identified as suboptimal.
Among other issues, we identified the following:
More info in these Pull Requests: