
[Draft][RFC] Multi-tiered file cache #8891

Open
abiesps opened this issue Jul 26, 2023 · 5 comments
Labels: discuss (brainstorming and decision making) · enhancement (improvement to existing feature or request) · RFC (major changes) · Roadmap:Cost/Performance/Scale

Comments

abiesps commented Jul 26, 2023

Introduction
As an extension to the writable warm tier and the fetch-on-demand composite directory feature in OpenSearch, we propose a new abstraction that lets the composite directory read files seamlessly, without worrying about where a file is located.
Today in OpenSearch (with remote store), a shard must download its full data from the remote store to local disk before it can accept reads or writes.
With the file cache (and the fetch-on-demand composite directory), shards can accept reads and writes without keeping the whole data set on local storage.

Proposed Solution
We are proposing a new FileCache abstraction that manages the lifecycle of all committed (segment) files for the shards present on a node.

FileCache provides three storage tiers for files (a minimal sketch follows the list):

  • Tier1: entries in tier1 are memory-mapped (open) files.
  • Tier2: entries in tier2 are pointers to files present on local storage.
  • Tier3: entries in tier3 are pointers to files in remote storage.
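To make the shape of this concrete, here is a minimal Java sketch of the tiered abstraction. All names here (FileCacheTier, TieredFileCache, CachedFileValue) are hypothetical illustrations of the description above, not the actual OpenSearch API.

```java
// Hypothetical illustration of the three-tier layout described above;
// none of these names are part of the actual proposal or the OpenSearch API.
enum FileCacheTier { TIER_1, TIER_2, TIER_3 }

// Marker for tier-specific values: an open mmap'd block (tier1), a closed
// local file reference (tier2), or remote file metadata (tier3).
interface CachedFileValue { }

interface TieredFileCache {
    // Look up a file (or file block) by key; implementations may promote the
    // entry to a hotter tier as a side effect of the lookup.
    CachedFileValue get(String key);

    // Register a file in a given tier (e.g. in tier2 and tier3 after a remote upload).
    void put(String key, FileCacheTier tier, CachedFileValue value);

    // Total size in bytes of the entries currently held in a tier.
    long usageInBytes(FileCacheTier tier);
}
```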

When a node boots up, the file cache is initialised with maximum capacities for tier1 and tier2, where capacity is defined as the total size in bytes of the files associated with entries in the file cache.
Each shard's primary persists its working set as part of the commit; the working set contains the segment file names and the file cache tiers those files are present in. During shard recovery, the shard's working set is used to populate the file cache with segment files across the different tiers.
After every upload to the remote store, an entry is created in tier2 and tier3 for each uploaded file. Newly written files that are not yet committed are not managed by the FileCache.
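For illustration, the persisted working set could be as simple as a list of (segment file name, tier) pairs written alongside the commit. The records below are a hypothetical sketch (reusing the illustrative FileCacheTier enum from the earlier sketch), not the actual on-disk format.

```java
import java.util.List;

// Hypothetical shape of the working set a primary persists with each commit;
// the real serialization format is an implementation detail of the proposal.
record WorkingSetEntry(String segmentFileName, FileCacheTier tier) { }

record ShardWorkingSet(String shardId, long commitGeneration, List<WorkingSetEntry> entries) { }
```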

The file cache internally runs a periodic lifecycle manager that evaluates the lifecycle policies for each tier and acts on them; a sketch of such a manager follows the per-tier policies below.
Lifecycle policies for the file cache can be set through cluster settings. We are defining the following major cluster settings for this purpose (illustrated after the list):

  • Capacity of tier1 and tier2
  • TTLs for tier1 and tier2.
  • Tier3 stale commit retention [TBD]
  • Configuring tier2 to be in sync with tier3 (This will keep all data as hot)
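As an illustration of what these settings might look like, the sketch below lists hypothetical setting keys and example values; the final names, defaults, and wiring through OpenSearch's settings infrastructure are not part of this proposal.

```java
// Hypothetical setting keys and example values, for illustration only.
final class FileCacheSettings {
    static final String TIER1_CAPACITY_KEY  = "node.file_cache.tier1.capacity";               // e.g. "10gb"
    static final String TIER2_CAPACITY_KEY  = "node.file_cache.tier2.capacity";               // e.g. "100gb"
    static final String TIER1_TTL_KEY       = "node.file_cache.tier1.ttl";                    // e.g. "15m"
    static final String TIER2_TTL_KEY       = "node.file_cache.tier2.ttl";                    // e.g. "12h"
    static final String TIER3_RETENTION_KEY = "node.file_cache.tier3.stale_commit_retention"; // TBD
    static final String TIER2_SYNC_KEY      = "node.file_cache.tier2.sync_with_tier3";        // true keeps all data hot

    private FileCacheSettings() { }
}
```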

Tier1 lifecycle policies

Files in tier1 that are not actively used (0 reference count) are eligible for movement from tier1 to tier2 based on the following policies:

  • Every entry in tier1 has a TTL associated with it, which resets on every access. Entries that have not been accessed within tier1_ttl are marked for movement from tier1 to tier2.
  • If tier1 usage has reached a threshold (x% of the maximum tier1 capacity) and no entries are marked for movement by the TTL policy, entries are moved to tier2 in LRU fashion.

Tier2 lifecycle policies

Files in tier2 that are not actively used (0 reference count) are eligible for movement out of tier2 based on the following policies:

  • Every entry in tier2 has a TTL associated with it. Entries that have not been accessed within tier2_ttl are marked for movement from tier2 to tier3.
  • If tier2 usage has reached a threshold (x% of the maximum tier2 capacity) and no entries are marked for movement by the TTL policy, entries are evicted from tier2 in LFU fashion.

Tier3 lifecycle policies

Files belonging to stale commits older than X days are eligible for eviction from tier3. This can be taken up as later work.
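A minimal sketch of how the periodic lifecycle manager could evaluate the tier1 and tier2 policies above; the entry view, method names, and threshold handling are hypothetical, and the caller would demote or evict entries from the head of the returned lists until usage drops below the configured threshold.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical view of a cache entry that the periodic lifecycle manager can inspect.
record CacheEntryView(String fileName, int refCount, long lastAccessMillis, long accessCount, long sizeInBytes) { }

final class TierPolicyEvaluator {

    // Tier1: unreferenced entries not accessed within tier1_ttl are demotion candidates;
    // under capacity pressure with no TTL-expired entries, fall back to LRU order.
    static List<CacheEntryView> tier1DemotionCandidates(List<CacheEntryView> tier1, long nowMillis,
                                                        long tier1TtlMillis, long usedBytes,
                                                        long maxBytes, double threshold) {
        List<CacheEntryView> expired = tier1.stream()
            .filter(e -> e.refCount() == 0 && nowMillis - e.lastAccessMillis() > tier1TtlMillis)
            .toList();
        if (!expired.isEmpty() || usedBytes < threshold * maxBytes) {
            return expired;
        }
        return tier1.stream()
            .filter(e -> e.refCount() == 0)
            .sorted(Comparator.comparingLong(CacheEntryView::lastAccessMillis)) // least recently used first
            .toList();
    }

    // Tier2: TTL-expired entries are marked for movement to tier3; under capacity
    // pressure, evict unreferenced entries in least-frequently-used (LFU) order.
    static List<CacheEntryView> tier2EvictionCandidates(List<CacheEntryView> tier2, long nowMillis,
                                                        long tier2TtlMillis, long usedBytes,
                                                        long maxBytes, double threshold) {
        List<CacheEntryView> expired = tier2.stream()
            .filter(e -> e.refCount() == 0 && nowMillis - e.lastAccessMillis() > tier2TtlMillis)
            .toList();
        if (!expired.isEmpty() || usedBytes < threshold * maxBytes) {
            return expired;
        }
        return tier2.stream()
            .filter(e -> e.refCount() == 0)
            .sorted(Comparator.comparingLong(CacheEntryView::accessCount)) // least frequently used first
            .toList();
    }

    private TierPolicyEvaluator() { }
}
```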

File Cache Stats

The cache also exposes metrics around its usage, evictions, hit rate, miss rate, etc., at both shard and node granularity.
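For illustration, a per-tier stats holder could look like the hypothetical record below; the exact fields and how they surface through the stats APIs are not finalised.

```java
// Hypothetical per-tier stats holder; in practice these numbers would be
// aggregated and reported at both shard and node granularity.
record FileCacheTierStats(long usedBytes, long maxBytes, long hits, long misses, long evictions) {
    double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```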

File Cache Admission Control and throttling

We are also going to add admission control on the file cache, based on its current usage, maximum configured capacity, and watermark settings; breaching these limits can result in a read/write block on the node.
We are also proposing to track the total bytes downloaded from the remote store and the total bytes removed from local storage by the lifecycle manager, and to throttle downloads of new files from the remote store based on the total bytes downloaded to disk as measured against the disk watermark settings.
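A minimal sketch of how such admission control and download throttling could be checked; the watermark values and all names here are hypothetical.

```java
// Hypothetical admission/throttle check based on file cache usage and disk watermarks.
class FileCacheAdmissionController {
    private final long maxCacheBytes;
    private final double blockWatermark;     // e.g. 0.95: block reads/writes above 95% usage
    private final double throttleWatermark;  // e.g. 0.85: slow down remote downloads above 85%

    FileCacheAdmissionController(long maxCacheBytes, double blockWatermark, double throttleWatermark) {
        this.maxCacheBytes = maxCacheBytes;
        this.blockWatermark = blockWatermark;
        this.throttleWatermark = throttleWatermark;
    }

    boolean shouldBlockReadsAndWrites(long usedBytes) {
        return usedBytes >= blockWatermark * maxCacheBytes;
    }

    boolean shouldThrottleDownloads(long bytesDownloaded, long bytesFreedByLifecycleManager) {
        // Throttle when the net bytes brought onto disk approach the throttle watermark.
        long netOnDisk = bytesDownloaded - bytesFreedByLifecycleManager;
        return netOnDisk >= throttleWatermark * maxCacheBytes;
    }
}
```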

Potential Issues

  • We may see higher latencies on updates.
  • Segment merging can have more impact on read/write latencies.

Future work

  • Marking stale commits for deletion in tier3.
    • File cache lifecycle manager can potentially mark the stale commits in tier3 for deletion.
  • Tier3 capacity management
    • File cache lifecycle manager can define bounds on number of days to store a stale commit in remote store.
  • Offload segment merges from the primary writer to avoid polluting the shard's working set.
    • Segment merges have a tendency to pollute the working set of a shard and the file cache. We can potentially offload merges away from the primary writer.
  • File cache lifecycle based on IOContext.
    • The file cache API can potentially take an IOContext as input, specifying the operation for which the file is accessed (e.g. merge, read). This context can be used to define lifecycle policies based on the operation.
@abiesps abiesps added enhancement Enhancement or improvement to existing feature or request untriaged labels Jul 26, 2023
@shwetathareja shwetathareja changed the title [Draft Feature Request] Multi-tiered file cache [Draft][RFC] Multi-tiered file cache Jul 26, 2023
Bukhtawar (Collaborator) commented Jul 26, 2023

Thanks for the proposal.
Honestly, it's hard to understand where these tiers will be hosted: memory, disk, or just file pointers. Have we evaluated a block-level cache that caches data (maybe 512KB blocks) across shards and that could be backed both entirely on local disk and on remote store? Having the cache sharded limits the risk of random assignments thrashing one shard.

Curious how we are deciding to limit the files getting mmapped in Tier-1; is it based on the total system memory available? One thing that is not very clear from the Tier-1 lifecycle management is the memory management and page caching. Are files in Tier-1 guaranteed to be served from memory, or can the kernel decide to page out files to disk?

While I understand the lifecycle policy is based on TTL and LFU policies, I couldn't find whether files hosted on the remote store can make their way to Tier-1, what the quotas are, and what the criteria are for choosing these sizes.

It's also important to understand how downloads from the remote store would be done; if we aren't doing Direct I/O, for instance, data from Tier-2 could end up thrashing Tier-1 mmapped files.

@anasalkouz anasalkouz added discuss Issues intended to help drive brainstorming and decision making RFC Issues requesting major changes and removed untriaged labels Jul 26, 2023
andrross (Member) commented Jul 27, 2023

Thanks @abiesps!

I agree with @Bukhtawar's question about blocks. How do you see the block-based partial file fetch fitting into this scheme?

Also agree that it's not clear how files move up the tiers. Presumably, at the moment a file is needed by an indexing, merge, or search operation and it is in tier-2 or 3, it gets promoted to tier-1? If the file was in tier-3 in this scenario, then this is where the block-based partial fetch could make a lot of sense.

Is tier-1 strictly mmap or would it reuse the logic in the FSDirectoryFactory to use mmap, NIO, or a mix of the two based on the index configuration?

dblock (Member) commented Jul 27, 2023

I like the spirit of this, but I think these tiers are hard-coding behavior that makes assumptions about what storage is slower and which is faster, which is more available or which is not, which is strongly consistent and which is not. As an example, until very recently S3 was not strongly consistent across reads and writes, but now is. An S3 bucket in the same AZ may be blazingly fast, while an S3 bucket on the other side of the world may be very slow. Or, should we really always assume that slow local storage (tier 1) is faster than really fast remote storage (tier 3)? This makes me wonder, why do we need tiers at all? Maybe it would be simpler for a store implementation to advertise a caching strategy, and an administrator can override that strategy if needed. Then one can fine-tune depending on their requirements with some sensible defaults. Finally, I think we need to make sure anything we insert in the stack isn't mandatory overhead - the minimum version of a cache is pass-through.

andrross (Member)
This makes me wonder, why do we need tiers at all? Maybe it would be simpler for a store implementation to advertise a caching strategy, and an administrator can override that strategy if needed. Then one can fine-tune depending on their requirements with some sensible defaults.

@dblock I think this is actually a key building block for a tier-less user experience. This issue is discussing a pretty low-level component, but I would see this as a part of the "sensible default" for the common scenario of fast local storage (e.g. SSD) and a relatively slower remote object store (e.g. same-region S3).

abiesps (Author) commented Aug 1, 2023

Thank you for the feedback @Bukhtawar, @andrross, @dblock. Please find the answers to the questions below.

Honestly, it's hard to understand where these tiers will be hosted: memory, disk, or just file pointers?

  • The file cache is an in-memory key-value data structure.
  • Values in tier1 are always block-level IndexInputs, which are kept open and cloned for consecutive reads.
  • Values in tier2 are closed IndexInputs. Tier2 can hold block files as well as complete files.
  • Values in tier3 are file metadata (file name, remote location, etc.). A sketch of these value types follows this list.
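For illustration, these per-tier values could be modelled as variants of the hypothetical CachedFileValue marker from the earlier sketch; the names below are illustrative only, not the actual implementation.

```java
import java.nio.file.Path;
import org.apache.lucene.store.IndexInput;

// A refined version of the hypothetical CachedFileValue marker from the earlier
// sketch, with one variant per tier.
sealed interface CachedFileValue permits Tier1Block, Tier2LocalFile, Tier3RemoteFile { }

// Tier1: an open (e.g. mmap'd) block of a file; readers clone the IndexInput for consecutive reads.
record Tier1Block(String fileName, long blockOffset, IndexInput openInput) implements CachedFileValue { }

// Tier2: a pointer to a block file or complete file on local disk; its IndexInput is closed.
record Tier2LocalFile(String fileName, Path localPath, boolean isBlockFile) implements CachedFileValue { }

// Tier3: metadata for a file that currently lives only in the remote store.
record Tier3RemoteFile(String fileName, String remoteLocation, long lengthInBytes) implements CachedFileValue { }
```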

Have we evaluated a block-level cache that caches data (maybe 512KB blocks) across shards and that could be backed both entirely on local disk and on remote store? Having the cache sharded limits the risk of random assignments thrashing one shard.
Curious how we are deciding to limit the files getting mmapped in Tier-1; is it based on the total system memory available?
One thing that is not very clear from the Tier-1 lifecycle management is the memory management and page caching.
Are files in Tier-1 guaranteed to be served from memory, or can the kernel decide to page out files to disk?

  • The current implementation of the cache is shared across shards. With a sharded cache we would have to make sure there are minimum and maximum space reservations per shard, and such reservations may not work well when the working sets of shards vary a lot.
  • It is still possible that reading files from tier2 causes thrashing in the page cache. Ideally we would want to read files on local disk as block files as well. We will have a common abstraction for reading blocks of a file whether the file is on the remote store or on local disk. The abstraction for reading blocks of a local file can be developed incrementally to mmap only the block of the file rather than the complete file; for now we will remove complete files after the TTL has expired and only bring back blocks of the file on demand.
  • We will configure the capacity of tier1 as some X% of system memory. We will also have to keep some buffer for reads from tier1 (until we complete block reads of files from local disk).
  • The kernel can decide to page out files to disk.

I couldn't find whether files hosted on the remote store can make their way to Tier-1, what the quotas are, and what the criteria are for choosing these sizes.

  • We missed adding details on tier promotion; a sketch of the read/promotion path follows this answer.

    Tier3 to Tier1
    We are starting with a simple policy: if a file is needed for a read and is not present in tier1 or tier2, it is downloaded from tier3 as blocks. For the first phase we are going to keep the block size constant. These block files are added to tier1.

    Tier2 to Tier1
    If tier1 has free capacity (above a threshold), we will move files from tier2 to tier1 based on frequency of access.
    We are only going to move block files from tier2 to tier1.
    (We can add statistics on which block of a file is accessed more frequently and move that block even if the file in tier2 is not a block file.)
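A minimal sketch of that read/promotion path, assuming a fixed block size and the hypothetical TieredFileCache abstraction from the earlier sketch; none of these names or values are part of the actual implementation.

```java
// Hypothetical read path: serve from tier1/tier2, otherwise download the needed
// block from the remote store (tier3) and add it to tier1.
class FileCacheReadPath {
    private static final long BLOCK_SIZE_BYTES = 8 * 1024 * 1024; // constant block size for the first phase (assumed value)

    private final TieredFileCache cache;        // hypothetical cache abstraction from the earlier sketch
    private final RemoteBlockFetcher fetcher;   // hypothetical component that downloads a byte range from the remote store

    FileCacheReadPath(TieredFileCache cache, RemoteBlockFetcher fetcher) {
        this.cache = cache;
        this.fetcher = fetcher;
    }

    CachedFileValue readBlock(String fileName, long fileOffset) {
        long blockStart = (fileOffset / BLOCK_SIZE_BYTES) * BLOCK_SIZE_BYTES;
        String blockKey = fileName + "@" + blockStart;

        CachedFileValue hit = cache.get(blockKey);   // tier1 or tier2 hit; tier2 hits may be promoted by the cache
        if (hit != null) {
            return hit;
        }
        // Tier3: fetch only the block covering the requested offset, then add it to tier1.
        CachedFileValue block = fetcher.fetchBlock(fileName, blockStart, BLOCK_SIZE_BYTES);
        cache.put(blockKey, FileCacheTier.TIER_1, block);
        return block;
    }

    // Hypothetical fetcher interface for downloading a byte range from the remote store.
    interface RemoteBlockFetcher {
        CachedFileValue fetchBlock(String fileName, long offset, long length);
    }
}
```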

It's also important to understand how downloads from the remote store would be done; if we aren't doing Direct I/O, for instance, data from Tier-2 could end up thrashing Tier-1 mmapped files.

  • All files that are not present in tier1 or tier2 are expected to be downloaded as block files. As part of future work, files that are present locally in tier2 will also be read through the block abstraction. For now we are going to evict complete files from tier2 based on TTL (regardless of frequency of access) and only bring back blocks of the file on demand.

Is tier-1 strictly mmap or would it reuse the logic in the FSDirectoryFactory to use mmap, NIO, or a mix of the two based on the index configuration?

  • It reuses the FSDirectoryFactory (the inputs can be ByteBufferIndexInput or NIOFSIndexInput). In tier1 the IndexInput is in an open state, while in tier2 it is closed. Tier1 will only hold blocks of files.
