
Feature Request: Allow multipolygons for masking and return sparse arrays when loading. #139

Open
GeraldIr opened this issue Jan 22, 2024 · 4 comments


@GeraldIr

As far as I am aware, there is currently no support for sparse arrays or for masking with more than one polygon. Would there be a way to submit a geometry made up of many disjoint polygons and receive back a lazy sparse array?

If this is out of scope, not feasible, or simply not worth implementing (or already possible), I'd also love to know.

@Kirill888
Member

@GeraldIr is there support for a single polygon for masking? There is only computation of a bounding box from the supplied geometry, so multi-polygons are treated the same as any other geometry: a bounding box, slightly expanded to pixel boundaries, is used to decide what region to load, and no actual masking is applied to the pixel data.
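In rough terms, that snapping is just a floor/ceil of the geometry's overall bounds onto the pixel grid. A simplified sketch, not odc-stac's actual code, and the resolution value here is made up:

```python
import math

def snap_bbox(bounds, resolution):
    """Expand (minx, miny, maxx, maxy) outward to the nearest pixel
    boundary for a given (hypothetical) square pixel size."""
    minx, miny, maxx, maxy = bounds
    return (
        math.floor(minx / resolution) * resolution,
        math.floor(miny / resolution) * resolution,
        math.ceil(maxx / resolution) * resolution,
        math.ceil(maxy / resolution) * resolution,
    )

# A multi-polygon's overall bounds are just the union of its parts'
# bounds, so two small polygons far apart still yield one big region:
print(snap_bbox((12.3, 4.1, 87.9, 55.2), 0.5))  # → (12.0, 4.0, 88.0, 55.5)
```

This is why disjoint polygons don't save you anything today: the load region is driven entirely by the combined bounds.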

@GeraldIr
Author

@Kirill888 oh, my bad, I must have imagined that capability then. I think I might have seen something similar in another use case, potentially unrelated to odc-stac.

Regardless, the feature request stands, just with all geometries then.

Again, though, if this is out of scope, don't hesitate to close this issue. This is something I would personally find useful for loading small, far-apart AoIs, or AoIs with peculiar shapes, without blowing up memory or the number of chunks.

@Kirill888
Member

@GeraldIr most likely this is out of scope, but I am curious: how exactly do you envision the result looking?

Say you have a global dataset of moderate/low-resolution data using the same projection for all tiles, something like MODIS, and you are querying 100 different smallish polygons (100x100 px sort of thing) scattered across the globe. What comes out of odc.stac.load in your ideal scenario, for both Dask and direct load?

Is it just one xr.Dataset covering the entire globe that is "somehow sparse with irregular dense sections", or is it 100 different xr.Datasets?

Dask arrays do not really support "sparse" or irregular chunks. An individual chunk can be a numpy sparse array, but all chunks have to be accounted for; also, all chunks in the same column have to be the same width, and all chunks in the same row have to be the same height. So just computing a "reasonable" chunking regime from a collection of bounding boxes where valid data might be is a non-trivial exercise with a lot of context-dependent decisions.
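To illustrate the constraint, a plain-Python sketch of Dask's per-axis chunk spec (not Dask itself): the chunk grid is always the full Cartesian product of the row heights and column widths, with no way to omit "empty" chunks.

```python
import itertools

def chunk_grid(chunks):
    """Yield (row, col, y0, y1, x0, x1) for every chunk of a Dask-style
    per-axis chunk spec -- the grid is always a full Cartesian product."""
    ys = [0]
    for h in chunks[0]:
        ys.append(ys[-1] + h)
    xs = [0]
    for w in chunks[1]:
        xs.append(xs[-1] + w)
    for r, c in itertools.product(range(len(chunks[0])), range(len(chunks[1]))):
        yield r, c, ys[r], ys[r + 1], xs[c], xs[c + 1]

# 3 row bands x 2 column bands = 6 chunks, every one of which must be
# tracked, even if only a couple contain any polygon of interest:
grid = list(chunk_grid(((512, 512, 256), (1024, 256))))
print(len(grid))  # → 6
```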

In general, a "small polygon gather" operation is tricky to optimize for load efficiency. One would need to cluster "nearby" regions into bigger regions that are loaded together for more efficient reading. There are a lot of choices between "load every region independently" and "load all the pixels first, then extract all the regions of interest from that".
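A toy sketch of the clustering idea, with a hypothetical merge_close helper that odc-stac does not provide:

```python
def merge_close(bboxes, gap):
    """Greedily merge bounding boxes (x0, y0, x1, y1) that lie within
    `gap` of each other, so nearby AoIs share one load region.
    A hypothetical helper for illustration -- not part of odc-stac."""
    merged = []
    for box in bboxes:
        x0, y0, x1, y1 = box
        for i, (mx0, my0, mx1, my1) in enumerate(merged):
            if (x0 - gap <= mx1 and mx0 - gap <= x1 and
                    y0 - gap <= my1 and my0 - gap <= y1):
                merged[i] = (min(mx0, x0), min(my0, y0),
                             max(mx1, x1), max(my1, y1))
                break
        else:
            merged.append(box)
    return merged

# Two nearby boxes collapse into one load region; the distant one stays:
print(merge_close([(0, 0, 10, 10), (12, 0, 20, 10), (500, 500, 510, 510)], gap=5))
# → [(0, 0, 20, 10), (500, 500, 510, 510)]
```

Even this naive version shows the context-dependent decisions involved: the right `gap` depends on pixel size, read block size, and how expensive wasted pixels are relative to extra requests.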

@GeraldIr
Author

@Kirill888 Ideally, I would get back a single combined Dataset whose underlying data is stored in the sparse.COO format (https://sparse.pydata.org/en/stable/), lazy or non-lazy depending on Dask/direct load.
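For illustration, the COO ("coordinate") idea in numpy-only form (the sparse library itself isn't needed for the sketch): parallel coordinate arrays plus a values array, so memory scales with the number of valid pixels rather than the full raster extent.

```python
import numpy as np

# Four valid pixels in a 20x30 raster, stored as COO triplets:
rows = np.array([1, 1, 2, 18])    # y indices of valid pixels
cols = np.array([3, 4, 3, 28])    # x indices of valid pixels
data = np.array([7.0, 8.0, 9.0, 10.0])

def coo_to_dense(shape, rows, cols, data, fill=0.0):
    """Materialise COO triplets into a dense array, for comparison."""
    out = np.full(shape, fill, dtype=data.dtype)
    out[rows, cols] = data
    return out

dense = coo_to_dense((20, 30), rows, cols, data)
print(data.nbytes, dense.nbytes)  # the dense values array is 150x larger here
```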

As far as chunking goes, you could potentially just not allow spatial chunking, as the chunk size should be manageable regardless, given the sparsity of the underlying array.

I have experimented with this as a second layer on top of odc.stac.load: loading a small area of interest for each given polygon (or the union of polygons that are next to each other), using xr.combine_by_coords to fit them into a single xr.Dataset, and then using map_blocks to convert all of it to sparse arrays. The eventual problem I run into is that combining/merging the Datasets becomes non-trivial or outright impossible with xr.combine_by_coords, because when combining them iteratively like this you inevitably run into overlaps between the combined and the combinee arrays. xr.merge would work, but it's not lazy. The approach does seem to work reliably for very small numbers of polygons, though.
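The overlap issue can be seen with 1-D extents alone (toy helpers for illustration, not the xarray API):

```python
def extent_union(a, b):
    """Union of two (lo, hi) coordinate extents."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def overlaps(a, b):
    """True if two (lo, hi) extents intersect."""
    return a[0] < b[1] and b[0] < a[1]

# Boxes 1 and 2 are far apart; box 3 sits between them and overlaps
# neither. But once 1 and 2 are combined, the result spans (0, 100)
# and box 3 suddenly "overlaps" it -- the situation that trips up the
# iterative combine-by-coordinates approach described above.
box1, box2, box3 = (0, 10), (90, 100), (40, 50)
combined = extent_union(box1, box2)
print(overlaps(box3, box1), overlaps(box3, box2), overlaps(box3, combined))
# → False False True
```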

Loading all the pixels first and then extracting the regions of interest was also tried, but led to immediate issues with the size of the mask and the number and size of the chunks needed for the huge regions you quickly end up with when polygons are very far apart.

I just thought you might have some insight I was overlooking, but I don't want to hold you up with issues that are out of scope or unplanned.

You can close this issue if you want.
