Consistency of names of stored children / dataarrays with keys in wrapping object #9447

TomNicholas · 2024-09-07T22:09:21Z

What is your issue?

In xarray-contrib/datatree#309 @marcel-goldschen-ohm pointed out some counterintuitive behaviour of the DataTree.name setter.

# using xarray main
In [2]: from xarray.core.datatree import DataTree

In [3]: root = DataTree.from_dict(
   ...:     {
   ...:         'child': DataTree(),
   ...:     },
   ...:     name='root',
   ...: )

In [4]: root['child'].name = 'childish'

In [5]: root
Out[5]: 
<xarray.DataTree 'root'>
Group: /
└── Group: /childish

this looks fine, but

In [6]: root.children
Out[6]: 
Frozen({'child': <xarray.DataTree 'childish'>
Group: /childish})

which means that

In [14]: root['childish']

KeyError: 'Could not find node at childish'

Okay, so we just fix it so that setting .name changes the key in the parent node right? But as @etienneschalk pointed out in xarray-contrib/datatree#309 (comment), that wouldn't be consistent with DataArray and Dataset names behave: they don't update the key in the wrapping object either. Instead they just completely ignore that you asked to update it at all:

In [15]: ds = xr.Dataset({'a': 0})

In [17]: ds['a'].name = 'b'

In [18]: ds
Out[18]: 
<xarray.Dataset> Size: 8B
Dimensions:  ()
Data variables:
    a        int64 8B 0

In [19]: ds.variables
Out[19]: 
Frozen({'a': <xarray.Variable ()> Size: 8B
array(0)})

This happens because Dataset objects do not actually store DataArray objects, they only store Variables (which are un-named), and reconstruct a new DataArray upon __getitem__ calls. So if you use .name to alter the name of that newly-constructed DataArray you're not actually touching the wrapping Dataset at all, and therefore have no way to update its keys.

Indeed the same thing happens (for the same reason) when altering the name of a Variable that is part of a DataTree node:

In [8]: import xarray as xr

In [9]: root['a'] = xr.DataArray(0)

In [10]: root
Out[10]: 
<xarray.DataTree 'root'>
Group: /
│   Dimensions:  ()
│   Data variables:
│       a        int64 8B 0
└── Group: /child

In [11]: root['a'].name
Out[11]: 'a'

In [12]: root['a'].name = 'b'

In [13]: root['a']
Out[13]: 
<xarray.DataArray 'a' ()> Size: 8B
array(0)

Is this ignoring of the name setter intended, or a bug? And should DataTree follow the behaviour of Dataset here or not? If it doesn't follow the behaviour of Dataset then we're going to end up with a situation where

root['child'].name = 'new_name'

actually makes a change to root, but

root['a'].name = 'new_name'

does not, so the behaviour of .name depends on whether or not your key corresponds to a DataTree value or a Variable value. 🫤

In the DataTree case I think it is a lot easier to update the key in the wrapping object when .name is set because there is a bi-directional linkage back to the parent node, which doesn't exist to link the DataArray instance back to Dataset.

cc @shoyer - I would appreciate your thoughts here.

The text was updated successfully, but these errors were encountered:

marcel-goldschen-ohm · 2024-09-09T15:03:38Z

In my opinion, XXX.name = 'blah' should work as intuitively expected (the name should be reset) for all objects with a name property. Anything else is a nightmare of debugging until this counterintuitive behavior is realized.

I would also note that from a UIX perspective having persistent references to DataTree objects that allow changing their name (or other data) inplace is much better than the alternative of rebuilding hierarchies of references for simple name changes or manually updating Frozen lists. I'm not sure if this perspective has been considered?

If due to the design of the library the name property simply cannot be used this way, then I would still argue that some other method that allows renaming inplace is needed.

shoyer · 2024-09-09T15:37:07Z

We should definitely have a DataTree.rename method that works similarly to Dataset.rename, allowing for renaming a collection of nodes all of once.

I don't think we should allow setting names on a child node, like root['child'].name = 'childish'. The only way to make this work reasonably would be to also update root.children. That feels like quite a distant update, especially if done via an intermediate variable, which unfortunately Python cannot syntactically distinguish, e.g.,

child = root['child']
child.name = 'childish'

As noted, it would also be inconsistent with the long standing behavior of Dataset.

This would be a relatively straightforward change, adding an error to the .name setter if parent is not None. The error mesage could suggest calling .copy() on the child or rename on the parent.

Alternatively, we might consider re-implementing DataTree more like Dataset, to re-create the DataTree objects corresponding to nodes whenever they are accessed. I wonder if this would break any critical use-cases? In practice it would look like automatically calling .copy() on child nodes whenever they are accessed. (Edit: one serious issue that occurs to me is that we would loose the reference to inherited coordinates/dimensions from the parents. We would need another way to handle this.)

marcel-goldschen-ohm · 2024-09-09T16:26:50Z

Just to clarify, as far as I understand Dataset.rename returns a new Dataset object, whereas I am suggesting that what is needed is an option to rename "inplace" so that the original DataTree reference remains valid along with all descendent node references.

shoyer · 2024-09-09T16:39:14Z

Just to clarify, as far as I understand Dataset.rename returns a new Dataset object, whereas I am suggesting that what is needed is an option to rename "inplace" so that the original DataTree reference remains valid along with all descendent node references.

What is your use-case for persistent DataTree references?

Elsewhere in Xarray, we have removed most "inplace" methods, in favor of methods that return modified objects.

marcel-goldschen-ohm · 2024-09-09T20:30:32Z

I have a GUI interface to a DataTree hierarchy that currently stores the references to each node. A user clicks on a node and renames it in the GUI. At the moment, this requires either repopulating all the descendent node references in the GUI tree representation or manually editing Frozen children lists. This does not seem to me like a good way forward. The "inplace" method would solve this.

Now that I have thought about this more and better understand the issue, I could potentially rewrite the GUI interface to use paths instead of references which would maybe be more inline with how you are thinking DataTree should be used. So maybe I can achieve what I want with a rename method that still returns a new object, will need to explore. That said, I probably won't be the last person to take the node reference approach above, for which an "inplace" method would solve the issue.

TomNicholas · 2024-09-10T15:26:24Z

I could potentially rewrite the GUI interface to use paths instead of references which would maybe be more inline with how you are thinking DataTree should be used.

You also might be able to use the upcoming .move method (see #9442). If you have a reference to the root of the tree, calling

copied_root = root.move(origin='/path/old_name', destinatiom='/path/new_name')

would return a new tree root with everything the same except that your one node had been renamed. This usage would be like using mv /path/old_dir_name/ /path/new_dir_name/ on a unix filesystem.

marcel-goldschen-ohm · 2024-09-10T18:21:29Z

The more I think about it, the more I am convinced that it will be possible to alter the GUI design so that all tree operations use a singular root reference and paths. In that case, a move method or a rename method that returns a new node instance should suffice. Thanks for entertaining my confusion on this ;)

It may be worth bringing up in the docs that a tree of node references is really not the most ideal way to work with DataTree (if that is already described, my apologies), especially given that this is probably a common initial design for trees. In any case, I think it is important that the name property setter throws an understandable error message as that is a massive confusion.

Cheers!

This ensures that the tree cannot be in an inconsistent state. Fixes pydata#9447

TomNicholas added bug API design topic-DataTree Related to the implementation of a DataTree class labels Sep 7, 2024

This was referenced Sep 9, 2024

setting node name breaks tree linkage xarray-contrib/datatree#309

Closed

Setting node name keeps tree linkage xarray-contrib/datatree#310

Closed

shoyer added a commit to shoyer/xarray that referenced this issue Sep 13, 2024

Forbid modifying names of DataTree objects with parents

7bdfb75

This ensures that the tree cannot be in an inconsistent state. Fixes pydata#9447

shoyer mentioned this issue Sep 13, 2024

Forbid modifying names of DataTree objects with parents #9494

Merged

2 tasks

shoyer closed this as completed in #9494 Sep 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistency of names of stored children / dataarrays with keys in wrapping object #9447

Consistency of names of stored children / dataarrays with keys in wrapping object #9447

TomNicholas commented Sep 7, 2024

marcel-goldschen-ohm commented Sep 9, 2024

shoyer commented Sep 9, 2024 •

edited

Loading

marcel-goldschen-ohm commented Sep 9, 2024 •

edited

Loading

shoyer commented Sep 9, 2024

marcel-goldschen-ohm commented Sep 9, 2024

TomNicholas commented Sep 10, 2024

marcel-goldschen-ohm commented Sep 10, 2024

Consistency of names of stored children / dataarrays with keys in wrapping object #9447

Consistency of names of stored children / dataarrays with keys in wrapping object #9447

Comments

TomNicholas commented Sep 7, 2024

What is your issue?

marcel-goldschen-ohm commented Sep 9, 2024

shoyer commented Sep 9, 2024 • edited Loading

marcel-goldschen-ohm commented Sep 9, 2024 • edited Loading

shoyer commented Sep 9, 2024

marcel-goldschen-ohm commented Sep 9, 2024

TomNicholas commented Sep 10, 2024

marcel-goldschen-ohm commented Sep 10, 2024

shoyer commented Sep 9, 2024 •

edited

Loading

marcel-goldschen-ohm commented Sep 9, 2024 •

edited

Loading