Load tiles in parallel on workers and add options to TissueDetectionHE
#336
This contains two separate improvements:

1. `drop_empty_tiles` and `keep_mask` options for the `TissueDetectionHE` transform, to skip saving tiles with no detected H&E tissue and to skip saving masks.
2. `dask.delayed` to avoid loading images on the main thread.

The first part is both for convenience and performance. It's possible to generate all tiles and then filter out the empty tiles and remove masks before writing the h5path to disk, but that requires all the tiles to be added to the `Tiles`, which takes IO time. If these tiles and masks are never saved, even to in-memory objects, processing can finish faster.
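A minimal sketch of how the new options might be used inside a pipeline; the option names come from this PR, but the exact signature and defaults shown here are assumptions:

```python
from pathml.preprocessing import Pipeline, TissueDetectionHE

# Sketch only: drop_empty_tiles / keep_mask are the new options from this PR,
# but their defaults and exact placement in the signature are assumptions.
pipeline = Pipeline([
    TissueDetectionHE(
        mask_name="tissue",
        drop_empty_tiles=True,   # don't save tiles with no detected H&E tissue
        keep_mask=False,         # don't save the tissue mask with each tile
    )
])
```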
The second part addresses a core performance issue with distributed processing. I believe it's relevant to #211 and #299. When processing tiles, I've found that loading time >> processing time, and currently tile image data is loaded on the main thread, which then scatters the loaded tile to workers. This prevents any parallelism, as all but one worker are always waiting for the main thread to load data and send them a tile.

Additionally, because all tiles have to be loaded on the main thread, the block that generates the futures has to load all tiles and send them all to workers before ANY tile can be added to the `Tiles` and the memory can be freed in the next block, causing the dramatic memory leaks seen in #211.
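For context, the bottleneck looks roughly like the following sketch; `run_pipeline` and `load_tile_image` are hypothetical stand-ins, not the actual PathML code:

```python
import numpy as np
from dask.distributed import Client

def run_pipeline(image):
    # stand-in for the per-tile transform work
    return image.mean()

def load_tile_image(i):
    # stand-in for the backend read; in practice this is the slow part
    return np.full((256, 256, 3), i % 255, dtype=np.uint8)

if __name__ == "__main__":
    client = Client()

    # Eager pattern: every tile image is read on the client process first,
    # then shipped to a worker, so all but one worker sit idle waiting for data.
    futures = []
    for i in range(100):
        image = load_tile_image(i)          # file read happens on the client
        scattered = client.scatter(image)   # serialized and sent out to a worker
        futures.append(client.submit(run_pipeline, scattered))

    # Nothing can be gathered (and no memory freed) until this loop has read
    # and scattered every tile, which drives the memory growth described above.
    results = client.gather(futures)
```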
I've used `dask.delayed` to prevent reading from the input file until the image is accessed on the worker. The code that accesses the file and loads the image can now be run by each worker in parallel. To preserve the parallelism, we have to take care not to access and load `tile.image` on the main thread before loading it on the worker, or to at least wrap accesses in `dask.delayed`, as in `SlideData.generate_tiles`.
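The core idea, stripped down to a minimal self-contained sketch (the class and function names here are hypothetical, not the actual PathML implementation):

```python
import dask
import numpy as np

def read_region(path, coords, size):
    # stand-in for the backend call that reads pixel data from disk;
    # a real reader would open the slide file here
    return np.zeros((size, size, 3), dtype=np.uint8)

class LazyTile:
    """Tile whose image is loaded lazily, only when .image is first accessed."""

    def __init__(self, path, coords, size):
        # building the delayed object does not touch the file
        self._delayed_image = dask.delayed(read_region)(path, coords, size)
        self._image = None

    @property
    def image(self):
        # the file read happens here, on whichever process accesses .image,
        # so workers can each load their own tiles in parallel
        if self._image is None:
            self._image = self._delayed_image.compute()
        return self._image
```

A worker that receives such a tile only touches the slide file when its transform reads `.image`, so many workers can read in parallel instead of waiting for the client.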
I had some issues with the backends not being picklable. The `Backend` has to be sent to each worker so it has access to the code that interfaces with the filesystem. I changed `Backend` file-like attributes to be lazily evaluated with the `@property` decorator.
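Roughly, the lazy-evaluation pattern looks like this sketch (simplified, with `openslide` as an example file interface; not the actual backend code):

```python
import openslide

class SlideBackend:
    """Simplified sketch of a backend whose file handle is opened lazily."""

    def __init__(self, filename):
        self.filename = filename
        self._slide = None   # no handle opened yet, so the object pickles cleanly

    @property
    def slide(self):
        # the unpicklable file handle is created on first access, which on a
        # worker happens locally after the backend has been deserialized
        if self._slide is None:
            self._slide = openslide.OpenSlide(self.filename)
        return self._slide
```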