
Feature Request: Allow multipolygons for masking and return sparse arrays when loading. #139

Open
GeraldIr opened this issue Jan 22, 2024 · 4 comments


@GeraldIr

As far as I am aware, there is currently no support for sparse arrays or for masking with more than one polygon. Would there be a way to submit a geometry made up of many disjoint polygons and receive back a lazy sparse array?

If this is out of scope, not feasible, or simply not worth implementing (or already possible), I'd also love to know.

@Kirill888
Member

@GeraldIr is there support for a single polygon for masking? There is only computation of a bounding box from the supplied geometry, so multi-polygons are treated the same as any other geometry: a bounding box, slightly expanded to pixel boundaries, is used to decide what region to load, and no actual masking is applied to the pixel data.
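In rough terms, that snapping is just a floor/ceil of the geometry's overall bounds onto the pixel grid. A simplified sketch, not odc-stac's actual code, and the resolution value here is made up:

```python
import math

def snap_bbox(bounds, resolution):
    """Expand (minx, miny, maxx, maxy) outward to the nearest pixel
    boundary for a given (hypothetical) square pixel size."""
    minx, miny, maxx, maxy = bounds
    return (
        math.floor(minx / resolution) * resolution,
        math.floor(miny / resolution) * resolution,
        math.ceil(maxx / resolution) * resolution,
        math.ceil(maxy / resolution) * resolution,
    )

# A multi-polygon's overall bounds are just the union of its parts'
# bounds, so two small polygons far apart still yield one big region:
print(snap_bbox((12.3, 4.1, 87.9, 55.2), 0.5))  # → (12.0, 4.0, 88.0, 55.5)
```

This is why disjoint polygons don't save you anything today: the load region is driven entirely by the combined bounds.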

@GeraldIr
Author

@Kirill888 oh, my bad, I must have imagined that capability then. I think I might have seen something similar in another use case, potentially unrelated to odc-stac.

Regardless, the feature request stands, just with all geometries then.

Again, though, if this is out of scope, don't hesitate to close this issue. This is something I would personally find useful for loading small, far-apart AoIs, or AoIs with peculiar shapes, without blowing up memory or the number of chunks.

@Kirill888
Member

@GeraldIr most likely this is out of scope, but I am curious: how exactly do you envision the result looking?

Say you have a global dataset of moderate/low-resolution data using the same projection for all tiles, something like MODIS, and you are querying 100 different smallish polygons (100x100 px sort of thing) scattered across the globe. What comes out of odc.stac.load in your ideal scenario, for both Dask and direct load?

Is it just one xr.Dataset covering the entire globe that is "somehow sparse with irregular dense sections", or is it 100 different xr.Datasets?

Dask arrays do not really support "sparse" or irregular chunks. An individual chunk can be a numpy sparse array, but all chunks have to be accounted for; also, all chunks in the same column have to be the same width, and all chunks in the same row have to be the same height. So just computing a "reasonable" chunking regime from a collection of bounding boxes where valid data might be is a non-trivial exercise with a lot of context-dependent decisions.
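To illustrate the constraint, a plain-Python sketch of Dask's per-axis chunk spec (not Dask itself): the chunk grid is always the full Cartesian product of the row heights and column widths, with no way to omit "empty" chunks.

```python
import itertools

def chunk_grid(chunks):
    """Yield (row, col, y0, y1, x0, x1) for every chunk of a Dask-style
    per-axis chunk spec -- the grid is always a full Cartesian product."""
    ys = [0]
    for h in chunks[0]:
        ys.append(ys[-1] + h)
    xs = [0]
    for w in chunks[1]:
        xs.append(xs[-1] + w)
    for r, c in itertools.product(range(len(chunks[0])), range(len(chunks[1]))):
        yield r, c, ys[r], ys[r + 1], xs[c], xs[c + 1]

# 3 row bands x 2 column bands = 6 chunks, every one of which must be
# tracked, even if only a couple contain any polygon of interest:
grid = list(chunk_grid(((512, 512, 256), (1024, 256))))
print(len(grid))  # → 6
```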

In general, a "small polygon gather" operation is tricky to optimize for load efficiency. One would need to cluster "nearby" regions into bigger regions that are loaded together for more efficient reading. There are a lot of choices between "load every region independently" and "load all the pixels first, then extract all the regions of interest from that".
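A toy sketch of the clustering idea, with a hypothetical merge_close helper that odc-stac does not provide:

```python
def merge_close(bboxes, gap):
    """Greedily merge bounding boxes (x0, y0, x1, y1) that lie within
    `gap` of each other, so nearby AoIs share one load region.
    A hypothetical helper for illustration -- not part of odc-stac."""
    merged = []
    for box in bboxes:
        x0, y0, x1, y1 = box
        for i, (mx0, my0, mx1, my1) in enumerate(merged):
            if (x0 - gap <= mx1 and mx0 - gap <= x1 and
                    y0 - gap <= my1 and my0 - gap <= y1):
                merged[i] = (min(mx0, x0), min(my0, y0),
                             max(mx1, x1), max(my1, y1))
                break
        else:
            merged.append(box)
    return merged

# Two nearby boxes collapse into one load region; the distant one stays:
print(merge_close([(0, 0, 10, 10), (12, 0, 20, 10), (500, 500, 510, 510)], gap=5))
# → [(0, 0, 20, 10), (500, 500, 510, 510)]
```

Even this naive version shows the context-dependent decisions involved: the right `gap` depends on pixel size, read block size, and how expensive wasted pixels are relative to extra requests.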

@GeraldIr
Author

@Kirill888 Ideally, I would get back a single combined Dataset whose underlying data is stored in the sparse.COO format (https://sparse.pydata.org/en/stable/), lazy or non-lazy depending on Dask/direct load.
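For illustration, the COO ("coordinate") idea in numpy-only form (the sparse library itself isn't needed for the sketch): parallel coordinate arrays plus a values array, so memory scales with the number of valid pixels rather than the full raster extent.

```python
import numpy as np

# Four valid pixels in a 20x30 raster, stored as COO triplets:
rows = np.array([1, 1, 2, 18])    # y indices of valid pixels
cols = np.array([3, 4, 3, 28])    # x indices of valid pixels
data = np.array([7.0, 8.0, 9.0, 10.0])

def coo_to_dense(shape, rows, cols, data, fill=0.0):
    """Materialise COO triplets into a dense array, for comparison."""
    out = np.full(shape, fill, dtype=data.dtype)
    out[rows, cols] = data
    return out

dense = coo_to_dense((20, 30), rows, cols, data)
print(data.nbytes, dense.nbytes)  # the dense values array is 150x larger here
```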

As far as chunking goes, you could potentially just not allow spatial chunking, as the chunk size should be manageable regardless, given the sparsity of the underlying array.

I have experimented with this as a second layer on top of odc.stac.load: loading a small area of interest for each given polygon (or the union of polygons that are next to each other), using xr.combine_by_coords to fit them into a single xr.Dataset, and then using map_blocks to convert all of it to sparse arrays. The eventual problem I run into is that combining/merging the Datasets becomes non-trivial or outright impossible with xr.combine_by_coords, because when combining them iteratively like this you inevitably run into overlaps between the combined and the combinee arrays. xr.merge would work, but it's not lazy. The approach does seem to work reliably for very small numbers of polygons, though.
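The overlap issue can be seen with 1-D extents alone (toy helpers for illustration, not the xarray API):

```python
def extent_union(a, b):
    """Union of two (lo, hi) coordinate extents."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def overlaps(a, b):
    """True if two (lo, hi) extents intersect."""
    return a[0] < b[1] and b[0] < a[1]

# Boxes 1 and 2 are far apart; box 3 sits between them and overlaps
# neither. But once 1 and 2 are combined, the result spans (0, 100)
# and box 3 suddenly "overlaps" it -- the situation that trips up the
# iterative combine-by-coordinates approach described above.
box1, box2, box3 = (0, 10), (90, 100), (40, 50)
combined = extent_union(box1, box2)
print(overlaps(box3, box1), overlaps(box3, box2), overlaps(box3, combined))
# → False False True
```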

Loading all the pixels first and then extracting the regions of interest was also tried, but led to immediate issues with the size of the mask and the number and size of the chunks needed for the huge regions you quickly end up with when polygons are very far apart.

I just thought you might have some insight I was overlooking, but I don't want to hold you up with issues that are out of scope or unplanned.

You can close this issue if you want.
