Add support for zarr #316
Go figure, this is a bit trickier than I initially expected. Most importantly, @rabernat, do you know why …
Thanks for pinging this again @spencerahill; sorry for neglecting to respond earlier. You bring up some good points.
My initial thought was actually to create a separate …

I think if we focus on single-zarr-store datasets for now (just as a proof of concept) it would be OK that …

I'm not sure if there is a strong case for adding a …
Yes, that would be a much easier first step. I guess my long-term vision was for users not even having to worry about whether their data was zarr or netCDF, but I suppose that's getting too far ahead of things. So for this current, proof-of-concept stage, I think @spencerkclark you're right that something like a simple ZarrDataLoader is the way to proceed.

(Actually, that leads to a new idea: could we (eventually) separate the logic of what type of data store it is (zarr vs. netCDF) from the description of how the files are organized? Then we could use composition to specify any combination, e.g. a NestedDictDataLoader that uses zarr stores vs. the same but that uses netCDF files.)
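To make the composition idea concrete, here is a minimal sketch. All names here (`NestedDictDataLoader`, `open_netcdf`, `open_zarr_store`) are hypothetical illustrations, not aospy's actual API: the file-organization logic is one class, and the storage backend is an injected opener function.

```python
# Sketch: separate *how files are organized* from *how a store is opened*.
# These names are hypothetical, for illustration only.

def open_netcdf(paths):
    import xarray as xr  # deferred so the dict logic is testable without xarray
    return xr.open_mfdataset(paths)

def open_zarr_store(path):
    import xarray as xr
    return xr.open_zarr(path)

class NestedDictDataLoader:
    """Maps nested dict keys to file locations; the opener is injected."""

    def __init__(self, file_map, opener=open_netcdf):
        self.file_map = file_map
        self.opener = opener

    def load(self, *keys):
        target = self.file_map
        for key in keys:
            target = target[key]  # descend through the nested dict
        return self.opener(target)
```

The same `NestedDictDataLoader` could then be instantiated with `opener=open_zarr_store` for zarr data, with no change to the file-organization logic.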
OK, that's fine by me. We should be able to replicate this logic ourselves within aospy for zarr data, because I do think we need it.
Sorry for the slow reply here.
There are some points related to this topic on the pangeo website. The reason that … The way we are using zarr, however, generally makes that sort of function obsolete. To produce zarr datasets, we commonly do something like

```python
ds = xr.open_mfdataset('*.nc')
ds.to_zarr('big_dataset.zarr')
```

In other words, datasets that were originally stored in hundreds or thousands of netCDF files are now stored in a single zarr store (which may contain many files, but zarr handles that part).
This sounds a lot like what intake does. You might get more mileage out of first refactoring around intake; then you would be able to outsource all of the file-loading stuff. The pangeo intake catalog, for example, contains both multi-netCDF-file datasets and zarr datasets. The user doesn't ever have to care what the underlying driver is. Speaking of intake, have you seen this?
Thanks much @rabernat.
Ah, duh. If the use case arises for us for an …
Good point; I'll open a separate issue for us to discuss this. It's been on our radar for a while, but we haven't had a compelling reason to switch so far. Now that it's getting more and more adoption, including through pangeo (including intake-cmip5... very cool!), perhaps that's no longer the case. So all that said, I think @spencerkclark's idea of starting with a simple ZarrDataLoader as a proof of concept is the best way to proceed.
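A minimal proof-of-concept along these lines might look like the following. The class name `ZarrDataLoader` comes from the discussion above, but the interface shown is an assumption: one loader wrapping exactly one zarr store.

```python
class ZarrDataLoader:
    """Hypothetical proof-of-concept loader for a single zarr store."""

    def __init__(self, store_path):
        # Path (or URL) to a single zarr store, e.g. 'big_dataset.zarr'.
        self.store_path = store_path

    def load(self):
        # One zarr store yields one Dataset, so loading is a single call.
        import xarray as xr  # deferred import; requires xarray with zarr support
        return xr.open_zarr(self.store_path)
```

Since `xr.open_zarr` returns an ordinary Dataset, everything downstream of `load()` would be unchanged from the netCDF path.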
Zarr is becoming the format of choice for N-D data on the cloud, with heavy usage in e.g. pangeo. @spencerkclark also has found some compelling use cases for it over netCDF on the GFDL analysis cluster, i.e. not on the cloud. And xarray has a very clean interface for zarr IO: `Dataset.to_zarr` and `xr.open_zarr`.

As such, and based on offline conversations with @rabernat and @spencerkclark regarding using aospy within pangeo, I think it makes sense for aospy to provide zarr support. So it remains to decide how to do that.

The `open_zarr` method returns a Dataset object just as `open_mfdataset` and `open_dataset` do. So it's really purely a matter of I/O: once a Dataset is loaded from a zarr store, we can proceed with the rest of our pipeline just as if it were a netCDF file (hooray for clean interfaces!)

For input, then, insofar as users save their zarr stores with the `.zarr` extension, we can simply use that to choose which method we use to load. There are most likely some additional complications I haven't thought of yet, but this seems like a reasonable approach at this stage. And/or we could allow users to specify the filetype via a flag rather than relying on the extension. I've thought less about output, but my impression is a similar approach would do the trick.
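The extension-based dispatch described above could be sketched roughly like this. The helper names `is_zarr_path` and `load_dataset` are hypothetical, and the sketch assumes users follow the `.zarr` suffix convention:

```python
def is_zarr_path(path):
    # Assumption: users save zarr stores with a '.zarr' suffix.
    return str(path).rstrip('/').endswith('.zarr')

def load_dataset(path, **kwargs):
    # Hypothetical dispatch helper: pick the xarray opener by extension.
    import xarray as xr  # deferred so the dispatch logic is testable alone
    if is_zarr_path(path):
        return xr.open_zarr(path, **kwargs)
    return xr.open_mfdataset(path, **kwargs)
```

A user-specified filetype flag, as suggested above, could override `is_zarr_path` for cases where the extension convention doesn't hold.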
@spencerkclark any initial thoughts? And CCing @rabernat for any thoughts in case I'm leading us astray here.