Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: size returns nan for Dask array with unknown shape #175

Closed
lucascolley opened this issue Aug 8, 2024 · 7 comments
Closed

BUG: size returns nan for Dask array with unknown shape #175

lucascolley opened this issue Aug 8, 2024 · 7 comments

Comments

@lucascolley
Copy link
Contributor

>>> xp
<module 'scipy._lib.array_api_compat.dask.array' from ...
>>> x = xp.asarray([1, 2, 3])            
>>> x
dask.array<array, shape=(3,), dtype=int64, chunksize=(3,), chunktype=numpy.ndarray>
>>> from scipy._lib.array_api_compat import size
>>> size(xp.unique_values(x))
nan
>>> size(xp.unique_values(x).compute())
3

I think size should call compute() if it has to?

@lucascolley
Copy link
Contributor Author

Okay, this was already mentioned for shape in scikit-learn/scikit-learn#26724 (comment)

@lithomas1
Copy link
Contributor

This is technically allowed I think
https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.shape.html#shape

(technically speaking, it should be None instead of nan, but I don't think the difference is too important here)

@lucascolley
Copy link
Contributor Author

Yeah, the shape is allowed to be None, but I think the array-api-compat helper size should trigger the computation if needed to return the actual size.

@lithomas1
Copy link
Contributor

I think it might be worth adding a version of the wrapped dask.array that calls compute, but I think it's probably better to wait a bit to better understand when/why downstream libraries need the shape.

It might also be possible to return a dask.delayed shape for dask (advantage here is that we wouldn't break the dask graph by computing), which is why I want to wait on doing this, but not 100% sure of this.

@lucascolley
Copy link
Contributor Author

The example I've bumped into in SciPy is n_clusters = int(xp.unique_values(T).shape[0])

@asmeurer
Copy link
Member

asmeurer commented Aug 8, 2024

It's not clear to me whether the wrappers here should be automatically calling compute(). I'd like to get some better understanding of the use-cases before doing that.

Note that the original purpose of the size() helper was to work around PyTorch not making size an attribute. Outside of that the behavior should just match x.size. The standard does say size = None is allowed https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.size.html#array_api.array.size (as @lithomas1 points out, dask uses nan instead of None, which we could fix here, but that doesn't really do anything about your problem). So as far as I can tell, the dask wrapper is standard compliant.

Maybe you should move the discussion to the array API repo to get a better understanding of how scipy should be handling unmaterialized sizes (for instance, at data-apis/array-api#748, or in a new issue).

@lucascolley
Copy link
Contributor Author

closed in favour of data-apis/array-api#834. There certainly isn't consensus yet that it is right for consuming libraries to force computation.

@lucascolley lucascolley closed this as not planned Won't fix, can't repro, duplicate, stale Aug 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants