Per-node DataTree chunking #9634

sjperkins · 2024-10-16T07:20:31Z

Is your feature request related to a problem?

In xarray-ms, which is specific to the radio astronomy domain, we construct a DataTree representing partitions of a legacy data format, where each partition contains regular data cubes. As currently implemented, the custom backend supports a partition_chunks kwarg in the BackendEntrypoint.open_datatree method, which makes it possible to specify a different chunking schema per partition:

https://xarray-ms.readthedocs.io/en/latest/tutorial.html#per-partition-chunking

The chunking specification above is specific to a radio astronomy legacy format, but it may be more generally useful to be able to specify per-DataTree node chunking.

Describe the solution you'd like

Currently, BackendEntrypoint.open_datatree passes its chunks kwarg to each Dataset constructor in the DataTree. This is quite coarse-grained, as it applies the same chunking schema to all Datasets in the DataTree.

I propose that the chunks kwarg in BackendEntrypoint.open_datatree support a chunking dictionary per path (i.e. DataTree Node). For example:

import xarray

xdt = xarray.open_datatree(..., chunks={
  "/path/to/node1": {"time": 20, "frequency": 16},
  "/path/to/a/node2": {"time": 10, "frequency": 4},
})

Then, when constructing Datasets in the DataTree, the chunking schema appropriate to the node can be applied.
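
As a rough illustration of the lookup this implies (a sketch, not xarray's actual implementation), a backend could resolve the chunk spec for each node from its path. `chunks_for_node` is a hypothetical helper, and falling back to an empty spec for nodes without an entry is an assumption:

```python
from typing import Mapping

# Hypothetical type alias: dimension name -> chunk size
ChunkSpec = Mapping[str, int]


def chunks_for_node(path: str, chunks: Mapping[str, ChunkSpec]) -> dict[str, int]:
    """Return the chunk spec registered for exactly this node path.

    Assumption: nodes without an entry get an empty spec, leaving the
    choice of chunking to the backend's default behaviour.
    """
    return dict(chunks.get(path, {}))


chunks = {
    "/path/to/node1": {"time": 20, "frequency": 16},
    "/path/to/a/node2": {"time": 10, "frequency": 4},
}

# A node listed in the mapping gets its own schema ...
node1_chunks = chunks_for_node("/path/to/node1", chunks)
# ... while an unlisted node gets the (assumed) empty default.
other_chunks = chunks_for_node("/path/to/node3", chunks)
```

The backend would then pass the resolved spec to whatever it uses to build the Dataset for that node.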

An entry in the above dictionary does not necessarily need to only apply to a single node. It could also apply the chunking schema to each subtree below the node. But it may be better to make this more explicit:

xdt = xarray.open_datatree(..., chunks={
  # Apply to node1 and any node below
  "/path/to/node1/...": {"time": 20, "frequency": 16},
})
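
One way the explicit subtree form could be matched against node paths is sketched below. This is only an illustration: `match_chunks` is a hypothetical helper, and the precedence rules (an exact path beats a subtree pattern, and the deepest subtree root wins) are assumptions rather than anything settled in this proposal.

```python
from typing import Mapping, Optional

SUBTREE_SUFFIX = "/..."  # the explicit "this node and everything below" marker


def match_chunks(
    path: str, chunks: Mapping[str, Mapping[str, int]]
) -> Optional[dict[str, int]]:
    """Find the chunk spec for a node path, or None if nothing matches."""
    # Assumption: an exact entry takes precedence over any subtree pattern.
    if path in chunks:
        return dict(chunks[path])

    best_root: Optional[str] = None
    for key in chunks:
        if not key.endswith(SUBTREE_SUFFIX):
            continue
        root = key[: -len(SUBTREE_SUFFIX)]
        # Match the root node itself or any descendant. The trailing "/"
        # prevents "/path/to/node1" from matching "/path/to/node10".
        if path == root or path.startswith(root + "/"):
            # Assumption: the deepest (longest) matching root wins.
            if best_root is None or len(root) > len(best_root):
                best_root = root

    if best_root is None:
        return None
    return dict(chunks[best_root + SUBTREE_SUFFIX])


spec = {"/path/to/node1/...": {"time": 20, "frequency": 16}}

root_match = match_chunks("/path/to/node1", spec)
child_match = match_chunks("/path/to/node1/child", spec)
no_match = match_chunks("/path/to/node10", spec)
```

Keeping the suffix explicit means a plain path still refers to exactly one node, so the two forms cannot be confused.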

Describe alternatives you've considered

We've implemented a custom partition_chunks kwarg in the BackendEntrypoint.open_datatree method for our legacy data format.

Additional context

No response


TomNicholas · 2024-10-16T19:59:35Z

Really cool to see you using xarray for radio astronomy data! I didn't know we had users in that field.

> I propose that the chunks kwarg in BackendEntrypoint.open_datatree support a chunking dictionary per path (i.e. DataTree Node)

Good idea! We would be happy to take a PR if you want to generalize this.

> An entry in the above dictionary does not necessarily need to only apply to a single node. It could also apply the chunking schema to each subtree below the node. But it may be better to make this more explicit

I think we should avoid the temptation to make this overly clever, at least initially, because the chunks kwarg type is already heavily overloaded. Per-node and per-variable chunking would be sufficiently expressive for all use cases. The only other subtlety that the chunk dict validation code would need to watch out for is duplicated coordinates.

sjperkins added the enhancement label Oct 16, 2024

headtr1ck added the topic-DataTree label Oct 16, 2024
