Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistency of names of stored children / dataarrays with keys in wrapping object #9447

Closed
TomNicholas opened this issue Sep 7, 2024 · 7 comments · Fixed by #9494
Closed
Labels
API design bug topic-DataTree Related to the implementation of a DataTree class

Comments

@TomNicholas
Copy link
Member

What is your issue?

In xarray-contrib/datatree#309 @marcel-goldschen-ohm pointed out some counterintuitive behaviour of the DataTree.name setter.

# using xarray main
In [2]: from xarray.core.datatree import DataTree

In [3]: root = DataTree.from_dict(
   ...:     {
   ...:         'child': DataTree(),
   ...:     },
   ...:     name='root',
   ...: )

In [4]: root['child'].name = 'childish'

In [5]: root
Out[5]: 
<xarray.DataTree 'root'>
Group: /
└── Group: /childish

this looks fine, but

In [6]: root.children
Out[6]: 
Frozen({'child': <xarray.DataTree 'childish'>
Group: /childish})

which means that

In [14]: root['childish']
KeyError: 'Could not find node at childish'

Okay, so we just fix it so that setting .name changes the key in the parent node right? But as @etienneschalk pointed out in xarray-contrib/datatree#309 (comment), that wouldn't be consistent with DataArray and Dataset names behave: they don't update the key in the wrapping object either. Instead they just completely ignore that you asked to update it at all:

In [15]: ds = xr.Dataset({'a': 0})

In [17]: ds['a'].name = 'b'

In [18]: ds
Out[18]: 
<xarray.Dataset> Size: 8B
Dimensions:  ()
Data variables:
    a        int64 8B 0

In [19]: ds.variables
Out[19]: 
Frozen({'a': <xarray.Variable ()> Size: 8B
array(0)})

This happens because Dataset objects do not actually store DataArray objects, they only store Variables (which are un-named), and reconstruct a new DataArray upon __getitem__ calls. So if you use .name to alter the name of that newly-constructed DataArray you're not actually touching the wrapping Dataset at all, and therefore have no way to update its keys.

Indeed the same thing happens (for the same reason) when altering the name of a Variable that is part of a DataTree node:

In [8]: import xarray as xr

In [9]: root['a'] = xr.DataArray(0)

In [10]: root
Out[10]: 
<xarray.DataTree 'root'>
Group: /Dimensions:  ()
│   Data variables:
│       a        int64 8B 0
└── Group: /child

In [11]: root['a'].name
Out[11]: 'a'

In [12]: root['a'].name = 'b'

In [13]: root['a']
Out[13]: 
<xarray.DataArray 'a' ()> Size: 8B
array(0)

Is this ignoring of the name setter intended, or a bug? And should DataTree follow the behaviour of Dataset here or not? If it doesn't follow the behaviour of Dataset then we're going to end up with a situation where

root['child'].name = 'new_name'

actually makes a change to root, but

root['a'].name = 'new_name'

does not, so the behaviour of .name depends on whether or not your key corresponds to a DataTree value or a Variable value. 🫤

In the DataTree case I think it is a lot easier to update the key in the wrapping object when .name is set because there is a bi-directional linkage back to the parent node, which doesn't exist to link the DataArray instance back to Dataset.

cc @shoyer - I would appreciate your thoughts here.

@TomNicholas TomNicholas added bug API design topic-DataTree Related to the implementation of a DataTree class labels Sep 7, 2024
@marcel-goldschen-ohm
Copy link

In my opinion, XXX.name = 'blah' should work as intuitively expected (the name should be reset) for all objects with a name property. Anything else is a nightmare of debugging until this counterintuitive behavior is realized.

I would also note that from a UIX perspective having persistent references to DataTree objects that allow changing their name (or other data) inplace is much better than the alternative of rebuilding hierarchies of references for simple name changes or manually updating Frozen lists. I'm not sure if this perspective has been considered?

If due to the design of the library the name property simply cannot be used this way, then I would still argue that some other method that allows renaming inplace is needed.

@shoyer
Copy link
Member

shoyer commented Sep 9, 2024

We should definitely have a DataTree.rename method that works similarly to Dataset.rename, allowing for renaming a collection of nodes all of once.

I don't think we should allow setting names on a child node, like root['child'].name = 'childish'. The only way to make this work reasonably would be to also update root.children. That feels like quite a distant update, especially if done via an intermediate variable, which unfortunately Python cannot syntactically distinguish, e.g.,

child = root['child']
child.name = 'childish'

As noted, it would also be inconsistent with the long standing behavior of Dataset.

This would be a relatively straightforward change, adding an error to the .name setter if parent is not None. The error mesage could suggest calling .copy() on the child or rename on the parent.

Alternatively, we might consider re-implementing DataTree more like Dataset, to re-create the DataTree objects corresponding to nodes whenever they are accessed. I wonder if this would break any critical use-cases? In practice it would look like automatically calling .copy() on child nodes whenever they are accessed. (Edit: one serious issue that occurs to me is that we would loose the reference to inherited coordinates/dimensions from the parents. We would need another way to handle this.)

@marcel-goldschen-ohm
Copy link

marcel-goldschen-ohm commented Sep 9, 2024

Just to clarify, as far as I understand Dataset.rename returns a new Dataset object, whereas I am suggesting that what is needed is an option to rename "inplace" so that the original DataTree reference remains valid along with all descendent node references.

@shoyer
Copy link
Member

shoyer commented Sep 9, 2024

Just to clarify, as far as I understand Dataset.rename returns a new Dataset object, whereas I am suggesting that what is needed is an option to rename "inplace" so that the original DataTree reference remains valid along with all descendent node references.

What is your use-case for persistent DataTree references?

Elsewhere in Xarray, we have removed most "inplace" methods, in favor of methods that return modified objects.

@marcel-goldschen-ohm
Copy link

I have a GUI interface to a DataTree hierarchy that currently stores the references to each node. A user clicks on a node and renames it in the GUI. At the moment, this requires either repopulating all the descendent node references in the GUI tree representation or manually editing Frozen children lists. This does not seem to me like a good way forward. The "inplace" method would solve this.

Now that I have thought about this more and better understand the issue, I could potentially rewrite the GUI interface to use paths instead of references which would maybe be more inline with how you are thinking DataTree should be used. So maybe I can achieve what I want with a rename method that still returns a new object, will need to explore. That said, I probably won't be the last person to take the node reference approach above, for which an "inplace" method would solve the issue.

@TomNicholas
Copy link
Member Author

I could potentially rewrite the GUI interface to use paths instead of references which would maybe be more inline with how you are thinking DataTree should be used.

You also might be able to use the upcoming .move method (see #9442). If you have a reference to the root of the tree, calling

copied_root = root.move(origin='/path/old_name', destinatiom='/path/new_name')

would return a new tree root with everything the same except that your one node had been renamed. This usage would be like using mv /path/old_dir_name/ /path/new_dir_name/ on a unix filesystem.

@marcel-goldschen-ohm
Copy link

The more I think about it, the more I am convinced that it will be possible to alter the GUI design so that all tree operations use a singular root reference and paths. In that case, a move method or a rename method that returns a new node instance should suffice. Thanks for entertaining my confusion on this ;)

It may be worth bringing up in the docs that a tree of node references is really not the most ideal way to work with DataTree (if that is already described, my apologies), especially given that this is probably a common initial design for trees. In any case, I think it is important that the name property setter throws an understandable error message as that is a massive confusion.

Cheers!

shoyer added a commit to shoyer/xarray that referenced this issue Sep 13, 2024
This ensures that the tree cannot be in an inconsistent state.

Fixes pydata#9447
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design bug topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants