Skip to content

Sharding: Reducing Load on the Filesystem

William Silversmith edited this page Aug 22, 2020 · 28 revisions

Motivation

In the initial implementation of Precomputed, the primary format CloudVolume and Neuroglancer support, very large datasets generate a commensurately large numbers of files. Image datasets, depending on how they are divided up into a regular grid of files ("chunked"), can produce tens or hundreds of millions of files per a layer. Data derived from the labels in the image, such as meshes and skeletons, range into the billions of files as they create a file for every label in every processed region at minimum. This becomes quite taxing on filesystems and object storage systems. In object storage systems, the file metadata are frequently replicated three to five times to ensure integrity via voting.

With cloud storage vendors, this complication is passed onto customers in the form of a cost per file created. It is not immediately clear how closely this cost directly relates to the actual cost of servicing requests, but it is typical as of this writing to cost $5 per million files created and $4 per ten million files read. For file creation, the costs become substantial at around the 100 million file mark ($500), and serious at the 1 billion file mark ($5,000). The dataset size where these considerations start to become important is somewhere around 100 TB of uint8 images.

Therefore, to remain cost efficient when producing these large datasets, we need a way of compacting files together while retaining random access to individual files.

The Shard File

The shard file format is a container for arbitrary files, but typically used for images, meshes, and skeletons. It works by grouping together a set of files based on a hash of their file name and storing them such that they can be read in isolation using partial file reads using byte offsets. The individual file is located by consulting two levels of indices. The first index is located at the beginning of the file and is of fixed size so that a reader will know beforehand how to access it. This index read points to a second index containing an arbitrarily sized listing of segment ids and byte ranges. A third read based on the second index accesses the data required directly.

Figure 7. The sharded file format. Shard files are containers that concatenate  files including meshes, skeletons, and images while allowing for random access. Three HTTP requests are needed to retrieve data from a shard. First, a header index of predefined size that lists the byte ranges of mini-shard indicies is queried for the mini-shard that the data is contained in. Secondly, this mini-shard index is used to determine the byte range of the data for a given segment. Lastly, the data is loaded from that byte range.

Image Credit: Oluwaseun Ogedengbe

See the sharding specification for implementation details. [1]

The "sharding" Info Key

Sharding meta information is located in the "sharding" key of the main info file for images, a special mesh info file, or a special skeleton info file. It has the following format:

"sharding": {
      "@type": "neuroglancer_uint64_sharded_v1",
      "preshift_bits": 4,
      "hash": "murmurhash3_x86_128", # also could be "identity"
      "minishard_bits": 9,
      "shard_bits": 3, 
      "minishard_index_encoding": "gzip", # gzip or raw
      "data_encoding": "raw" # gzip or raw
},

Note that as the various bits point to different sections of a uint64, preshift_bits + minishard_bits + shard_bits <= 64

The json above can be used to synthesize a ShardingSpecification, which is a CloudVolume class necessary to locate and read shard files, as follows:

from cloudvolume.datasource.precomputed.sharding import ShardingSpecification

spec = ShardingSpecification.from_dict(info['sharding'])

Reading a Shard File

There are two aspects to reading a shard file:

  1. Locating the shard file.
  2. Reading data from it.

In order to locate the file, you must know what kind of data you're looking for. For Precomputed images, the name of the file is based on the bounding box the shard represents. For meshes and labels, it is based on a manually specified number of bits ("shard bits") as well as an adjustment factor ("preshift bits") that are extracted from a hash of the segment ID and looks like (as in the diagram above) o6e.shard. The hash function typically used is the non-cryptographically secure murmurhash3_x86_128. [2][3]

Once you know which file the data is located in, it is necessary to know the number of "minishard bits" and "preshift bits" in order to extract the minishard number from the segment ID's hash. This allows you to read the correct byte range of the minishard index from the corresponding row of the fixed width index as in the figure above. You then read the minishard index, extract the byte range of the segment you care about, and read the data from there. Both the minishard index and the data themselves may be gzip compressed (other compressions schemes are possible, but not supported in Neuroglancer).

As of this writing, CloudVolume implements this using the ShardReader class. [4] The ShardReader class is integrated into the sharded versions of accessors that have been implemented. As of this writing, sharded readers have been incorporated into Precomputed Images and Skeletons. We are designing support for Graphene Meshes [5].

In order to read a shard, ShardReader absolutely requires a ShardingSpecification instance. As implemented, it currently also requires a PrecomputedMetadata instance and CacheService instance to correctly treat path joining in the cloud and on the local filesytem and also to make caching easy to implement. The ShardReader class could have those two things stripped out.

Once created, data can be read from shards like so:

reader = ShardReader(...)
binary = reader.get_data(label)

res = reader.exists(label) # returns filepath or None
filepath, byte_start, num_bytes = reader.exists(label, return_byte_range=True)
result_set = reader.exists(labels) # { label: res }, supports return_byte_range

filename = reader.get_filename(label) # helps locate shards
labels = reader.list_labels(reader.get_filename(label)) # list all labels in shard
content = reader.disassemble_shard(binary) # if you download an entire shard file, parse it like this

For a less artificial example, you can access a ShardReader instance via CloudVolume on a volume with sharded skeletons and reset cache sizes from defaults like so:

cv.skeleton.reader
cv.skeleton.reader.shard_index_cache.resize(N) # default 512
cv.skeleton.reader.minishard_index_cache.resize(N) # default 128

As a naive user, in order to read a shard file, nothing special need be done. Just use the download functions normally. However, there is a caveat in the current implementation: if sharding is turned on or off, a new CloudVolume object must be created to properly construct certain accessor classes.

Synthesizing a Shard File

Synthesizing a shard file is more tricky than the more typical approach. All the data needed for the file must be gathered in one place. In the case of meshes and skeletons, this means all labels that hash to a particular value must be retrieved. Then, a CloudVolume function that will perform the synthesis:

from cloudvolume.datasource.precomputed.sharding import ShardingSpecification

spec = ShardingSpecification(...)
binary = spec.synthesize_shard(labels, progress=True)

Designing a ShardingSpecification

  • shard_bits: 2^shard_bits is the number of shards files. i.e. 0 is one shard file, and 2 is 4 files. If you have too few files, the minishard index will grow in size and make downloads slower. If you have too many files, it caching the shard index will be less effective.
  • minishard_bits: 2^minishard_bits is the number of minishards per file. Too few, and the minishard index read request will become large. Too many, and the size of the shard index will become large.
  • preshift_bits: Helps generate additional hash collisions by removing some of the lower significant bits from the uint64.
  • minishard_index_encoding: 'gzip' or 'raw'. What you use depends on your connection speed and the predicted size of the minishard index.
  • data_encoding: 'gzip' or 'raw', i.e. should the data itself be gzip compressed?

References

  1. J. Maitin-Shepard. "Sharded Format Specification". Github. Accessed April 11, 2020. (link)
  2. F. Kihlander et al. "PYMMH3 - a pure python MurmurHash3 implementation". Github. Accessed April 11, 2020. https:/wc-duck/pymmh3
  3. A. Appleby. "MurmurHash3.cpp". Github. Accessed April 11, 2020. (link)
  4. W. Silversmith. "ShardReader". Github. Accessed April 11, 2020. ([link])(https:/seung-lab/cloud-volume/blob/ae6e2194c92ab205068fdd1372011d5973f4421f/cloudvolume/datasource/precomputed/sharding.py#L178-L297))
  5. W. Silversmith. "Graphene: Format for Large Scale Proofreading". Accessed April 11, 2020. (link) (Note that this is a simple description of Graphene as relates to CloudVolume. Graphene is the work of many people and my involvement has been peripheral until now.)