Per-node DataTree chunking #9634

Open
sjperkins opened this issue Oct 16, 2024 · 1 comment
Labels
enhancement topic-DataTree Related to the implementation of a DataTree class


@sjperkins

Is your feature request related to a problem?

In xarray-ms, which is specific to the radio astronomy domain, we construct a DataTree representing partitions of a legacy data format, where each partition contains regular data cubes. As currently implemented, the custom backend supports a partition_chunks kwarg in the BackendEntrypoint.open_datatree method, so that different chunking schemas can be specified per partition:

The chunking specification above is specific to a radio astronomy legacy format, but it may be more generally useful to be able to specify per-DataTree node chunking.

Describe the solution you'd like

Currently, BackendEntrypoint.open_datatree passes its chunks kwarg to each Dataset constructor in the DataTree. This is quite coarse-grained, as it applies the same chunking schema to all Datasets in the DataTree.

I propose that the chunks kwarg in BackendEntrypoint.open_datatree support a chunking dictionary per path (i.e. DataTree Node). For example:

import xarray

xdt = xarray.open_datatree(..., chunks={
  "/path/to/node1": {"time": 20, "frequency": 16},
  "/path/to/a/node2": {"time": 10, "frequency": 4},
})

Then, when constructing Datasets in the DataTree, the chunking schema appropriate to the node can be applied.
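To illustrate the lookup this implies, here is a minimal sketch in plain Python. It is not xarray's API: `is_per_node_chunks` and `chunks_for_node` are hypothetical helpers, and two things are assumptions: that a per-node mapping can be recognised by its keys being absolute paths, and that a node without an entry falls back to a caller-supplied default.

```python
from collections.abc import Mapping
from typing import Any


def is_per_node_chunks(chunks: Any) -> bool:
    """Return True if `chunks` looks like {node_path: {dim: size}}.

    Assumption: node paths are absolute (start with "/"), which is what
    distinguishes them from ordinary dimension names.
    """
    return (
        isinstance(chunks, Mapping)
        and len(chunks) > 0
        and all(isinstance(k, str) and k.startswith("/") for k in chunks)
    )


def chunks_for_node(chunks: Any, node_path: str, default: Any = None) -> Any:
    """Pick the chunking schema to use for the Dataset at `node_path`."""
    if not is_per_node_chunks(chunks):
        # Existing behaviour: one schema shared by every node.
        return chunks
    # Proposed behaviour: exact match on the node path, else the default.
    return chunks.get(node_path, default)
```

A backend could call `chunks_for_node(chunks, path)` once per node while building the tree and pass the result on to whatever chunks the Dataset for that node.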

An entry in the above dictionary does not necessarily need to apply to only a single node. It could also apply the chunking schema to each subtree below the node. But it may be better to make this more explicit:

xdt = xarray.open_datatree(..., chunks={
  # Apply to node1 and any node below
  "/path/to/node1/...": {"time": 20, "frequency": 16},
})
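One way the explicit subtree form could be resolved is sketched below in plain Python. The helper name `match_node_chunks` is hypothetical, and the precedence rules are assumptions rather than anything decided in this issue: an exact path wins over a subtree entry, and among subtree entries the one with the longest prefix wins.

```python
from collections.abc import Mapping
from typing import Any, Optional

SUBTREE_SUFFIX = "/..."  # marks "this node and every node below it"


def match_node_chunks(
    node_chunks: Mapping[str, Any], node_path: str
) -> Optional[Any]:
    """Return the chunk schema that applies to `node_path`, or None."""
    # An exact entry always takes precedence (assumed rule).
    if node_path in node_chunks:
        return node_chunks[node_path]

    best_prefix = None
    best_schema = None
    for key, schema in node_chunks.items():
        if not key.endswith(SUBTREE_SUFFIX):
            continue
        prefix = key[: -len(SUBTREE_SUFFIX)]
        # Compare whole path segments so "/a/node1" does not cover "/a/node10".
        covers = node_path == prefix or node_path.startswith(prefix + "/")
        if covers and (best_prefix is None or len(prefix) > len(best_prefix)):
            best_prefix, best_schema = prefix, schema
    return best_schema
```

Matching on whole path segments keeps sibling nodes with a shared name prefix from picking up each other's schema.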

Describe alternatives you've considered

We've implemented a custom partition_chunks kwarg in the BackendEntrypoint.open_datatree method for our legacy data format.

Additional context

No response

@headtr1ck headtr1ck added the topic-DataTree Related to the implementation of a DataTree class label Oct 16, 2024
@TomNicholas
Member

Really cool to see you using xarray for radio astronomy data! I didn't know we had users in that field.

I propose that the chunks kwarg in BackendEntrypoint.open_datatree support a chunking dictionary per path (i.e. DataTree Node)

Good idea! We would be happy to take a PR if you want to generalize this.

An entry in the above dictionary does not necessarily need to only apply to a single node. It could also apply the chunking schema to each subtree below the node. But it may be better to make this more explicit

I think we should avoid the temptation to make this overly clever, at least initially, because the chunks kwarg type is already heavily overloaded. Per-node and per-variable chunking would be sufficiently expressive for all use cases. The only other subtlety that the chunk dict validation code would need to watch out for is duplicated coordinates.
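As a rough sketch of what such validation might look like, the plain-Python helper below checks a per-node chunk mapping and reports problems as strings. It is not xarray code: `validate_node_chunks` is a hypothetical name, and the duplicated-coordinate check encodes only one possible reading of that concern, namely that a node and one of its ancestors should not request different chunk sizes for the same dimension name.

```python
from collections.abc import Mapping
from pathlib import PurePosixPath
from typing import Any


def validate_node_chunks(node_chunks: Mapping[Any, Any]) -> list[str]:
    """Return human-readable problems found in {node_path: {dim: size}}."""
    problems = []
    valid = {}
    for path, dims in node_chunks.items():
        if not isinstance(path, str) or not path.startswith("/"):
            problems.append(f"{path!r}: node paths must be absolute strings")
            continue
        if not isinstance(dims, Mapping):
            problems.append(f"{path!r}: expected a mapping of dimension to chunk size")
            continue
        valid[path] = dims

    # Assumed interpretation of "duplicated coordinates": the same dimension
    # chunked differently on a node and on one of its ancestors.
    for path, dims in valid.items():
        for ancestor in PurePosixPath(path).parents:
            inherited = valid.get(str(ancestor))
            if inherited is None:
                continue
            for dim in sorted(dims.keys() & inherited.keys()):
                if dims[dim] != inherited[dim]:
                    problems.append(
                        f"{path!r}: dimension {dim!r} is chunked as {dims[dim]!r} "
                        f"but ancestor {str(ancestor)!r} uses {inherited[dim]!r}"
                    )
    return problems
```

Returning a list of messages rather than raising on the first problem is a design choice made for the sketch; a real implementation might prefer to raise.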
