Advanced Topic: Shared Memory

# LISTING 1: How to Use Raw CloudVolume Shared Memory
from cloudvolume.sharedmemory import ndarray, unlink
import numpy as np

# Create a shared memory numpy array
fh, array = ndarray(shape=(10,10,10), dtype=np.uint8, location="my-example")
array[:] = 1

# Delete the backing file for it.
# You need to do this to keep your RAM free.
unlink("my-example")

# Close mmap file handle to really free memory
fh.close()

Docker containers require special treatment; see below.

The Problem

  1. We want to fully utilize our hardware to upload and download as fast as possible via multiprocessing.
  2. We want to share 3D arrays between Python and other languages that sometimes don't support direct sharing cleanly (it may be impossible or prone to memory leaks).
  3. We might even want to share memory between otherwise unconnected Python processes.

Threading vs. Shared Memory

Python (at least CPython) has a restrictive threading model due to the global interpreter lock. Nonetheless, CloudVolume does use multiple threads within one process. However, we found experimentally that a single process cannot scale much beyond about one core's worth of processing. By using multiple processes, we can get around this limitation.

However, the memory sharing model processes use is different from the threading model. The threading model allows multiple threads within a single process to share the virtual memory within that process. Ordinarily, processes can't share memory. However, operating systems define an inter-process communication (IPC) protocol called "Shared Memory".

How CloudVolume Shared Memory Works

At least on Ubuntu 14.04, the right way to think about Shared Memory is that it's a RAM file that lives in /dev/shm/. Any process can create a shared memory segment, and if the permissions allow, any other process can access that same segment if it knows the name. The "files" in /dev/shm/ are so file-like that you can even use command-line tools on them, as in rm /dev/shm/my-shared-memory-name to delete one or ls -lh to see how big they are. The big difference is that you need to use special functions to open these files in code.
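
For instance, after running Listing 1 (but before the unlink call), the segment from that listing is visible to ordinary file tooling. A minimal sketch, assuming Linux and the segment name from Listing 1:

# Sketch: a shared memory segment is just a file on Linux (assumes Listing 1 ran).
import os

# 10 x 10 x 10 uint8 voxels = 1000 bytes
print(os.path.getsize("/dev/shm/my-example"))  # prints: 1000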

Luckily, the CloudVolume sharedmemory module in combination with posix_ipc papers over that for you. On macOS, the shared memory model is limited to 4 MB without hacking your OS configuration and restarting your computer, so rather than fight it, I emulated it via regular on-disk files stored in /tmp/cloudvolume-shm/. I haven't experimented with getting Windows to work with shared memory, so it defaults to using file emulation as well.

If a shared memory segment is just a file on RAM or disk, then we can create numpy arrays out of them via mmap. The most critical code in the sharedmemory module is just four lines.

# LISTING 2: How to mmap a Shared Numpy Array
import mmap
import os

import numpy as np
import posix_ipc

# location, flags, size, dtype, shape, and order are supplied by the caller
shared = posix_ipc.SharedMemory(location, flags=flags, size=size)
array_like = mmap.mmap(shared.fd, shared.size)
os.close(shared.fd)
renderbuffer = np.ndarray(buffer=array_like, dtype=dtype, shape=shape, order=order, **kwargs)

What's going on? Let's walk through it line-by-line.

  1. A shared memory file called /dev/shm/$location is created (or opened) and sized to $size bytes.
  2. Use the file descriptor (fd) to mmap the file into an array-like entity.
  3. Close the shared memory file handle; the mmap opened its own, so there's no need to track two of them.
  4. Create a numpy.ndarray backed by the raw array buffer we created.

This numpy array will remain writable until the mmap file handle is closed (array_like.close()) at which point further writes will crash the program.

The shared memory will be deallocated when all file handles are closed and the underlying file is unlinked (deleted). The underlying file will outlive the process that created it.
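
To make the lifecycle concrete, here's a matching teardown sketch, reusing the variable names from Listing 2 (posix_ipc.unlink_shared_memory is the relevant posix_ipc call):

# Teardown sketch for Listing 2 (same variable names assumed).
array_like.close()  # invalidate the numpy view; further writes will crash
posix_ipc.unlink_shared_memory(location)  # segment is freed once all handles close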

Parallel Download

When CloudVolume is used without any special parameters, it behaves as you might expect, though it uses threads to accelerate its download speed. When parallel is set to a number greater than one, shared memory comes into play. By default, CloudVolume attempts to make the use of shared memory transparent to the user at the cost of double the memory usage.

In parallel operation, CloudVolume paints a numpy array termed the renderbuffer by allocating a shared memory array and evenly distributing lists of chunks to download to each process (each of which uses threads to fully utilize its core). When the download is complete and the array is fully painted, CloudVolume makes a copy of the result into ordinary memory, unlinks the underlying file, and closes the mmap file descriptor. So long as the process doesn't die mid-download (which would prevent file deletion), the user will be unaware that shared memory was even used.

However, if alternative means of IPC are desired, you can use vol.download_to_shared_memory(np.s_[...]) to accept responsibility for cleaning up the allocated segment. In this case, no copy will be made, and the underlying segment will not be automatically unlinked. By default, the name of the segment will be vol.shared_memory_id.
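
Putting the two modes side by side, a hedged sketch (the cloudpath and slice bounds are placeholders):

# Sketch: transparent vs. manual shared memory download (hypothetical cloudpath).
import numpy as np
from cloudvolume import CloudVolume
from cloudvolume.sharedmemory import unlink

vol = CloudVolume("gs://bucket/dataset/layer", parallel=8)

# Transparent: shared memory is used internally, then copied and cleaned up.
image = vol[0:512, 0:512, 0:64]

# Manual: no copy is made; the caller must unlink the segment when finished.
image = vol.download_to_shared_memory(np.s_[0:512, 0:512, 0:64])
unlink(vol.shared_memory_id)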

Parallel Upload

Uploading works similarly to downloading, but the order of operations is inverted. At parallel 1, no shared memory is used. Above parallel 1, vol[...] = image causes the allocation of shared memory, and a copy is made of the input image. The child processes then read from the shared memory segment rather than the original image and perform their upload duties. At the end of the process, the shared memory segment is unlinked.

However, if memory is tight, or the image was generated by another process, it is possible to use vol.upload_from_shared_memory($location, $bbox), which will read directly from $location without making a copy. This means that unlike download, the user needs to generate the shared memory segment themselves (though they can make use of the cloudvolume.sharedmemory module to make things easier).
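
As a hedged illustration of this zero-copy path (the cloudpath, segment name, and bounds are placeholders; Bbox is cloudvolume's bounding box type):

# Sketch: upload from a user-managed shared memory segment (hypothetical values).
from cloudvolume import CloudVolume, Bbox
from cloudvolume.sharedmemory import ndarray, unlink

vol = CloudVolume("gs://bucket/dataset/layer", parallel=8)

# Allocate the segment ourselves and paint it; no extra copy is made on upload.
fh, image = ndarray(shape=(512, 512, 64, 1), dtype=vol.dtype, location="my-upload")
image[:] = 1

vol.upload_from_shared_memory("my-upload", bbox=Bbox((0, 0, 0), (512, 512, 64)))

# The user is responsible for cleanup.
unlink("my-upload")
fh.close()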

Emulation vs. Shared Memory

While the cloudvolume.sharedmemory module defaults to file-backed shared memory on OS X, it's possible to use file backing on any platform. Though disk files are obviously much slower to create and have lower bandwidth, this can be desirable under memory pressure.

To toggle emulation globally, set cloudvolume.sharedmemory.EMULATE_SHM = True. Alternatively, for boutique use cases, you can directly use cloudvolume.sharedmemory.ndarray_fs (for file emulation) or cloudvolume.sharedmemory.ndarray_shm (for real shared memory). The standard ndarray function just decides which of those to use.
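
For example, a minimal sketch of forcing emulation (the segment name is hypothetical):

# Sketch: force file-backed emulation globally, then allocate as in Listing 1.
import numpy as np
import cloudvolume.sharedmemory as shm

shm.EMULATE_SHM = True  # segments now live in /tmp/cloudvolume-shm/ instead of /dev/shm/

fh, arr = shm.ndarray(shape=(10, 10, 10), dtype=np.uint8, location="emulated-example")
arr[:] = 1
shm.unlink("emulated-example")
fh.close()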

Docker Containers and Kubernetes

If you are using parallel > 1, you are using shared memory. By default, Docker containers allocate 64 MB of shared memory, which is insufficient for most connectomics applications. Pick an appropriate size for the images you are manipulating. For instance, a 1024x1024x128 voxel 8-bit image is 128 MB, but float32 xyz affinities of the same dimensions (12 bytes per voxel) would be 1.5 GB per copy.

docker run --shm-size="4g" ...

In your Kubernetes deployment.yml, you should add the following clauses to enable shared memory. If you do not do so, you may see a "Bus Error" when memory allocation fails.

      containers:
      - ...
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
          readOnly: false
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory

Sharing is caring.