Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_common: make as_file return the real file on os.PathLike #232

Closed
wants to merge 1 commit into from

Conversation

FFY00
Copy link
Member

@FFY00 FFY00 commented Jul 29, 2021

We currently only have custom handling for pathlib.Path, but that means
that custom readers that do not using pathlib.Path will have their
traversables copied to a temporary directory in as_file, which can be an
expensive operation.
By adding a dispatch for os.PathLike, we extend this behavior for all
path-like objects.

Signed-off-by: Filipe Laíns [email protected]

We currently have custom handling for pathlib.Path, but that means
that custom readers that do not using pathlib.Path will have their
traversables copied to a temporary directory in as_file, which can be an
expensive operation.
By adding a dispatch for os.PathLike, we extend this behavior for all
path-like objects.

Signed-off-by: Filipe Laíns <[email protected]>
@jaraco
Copy link
Member

jaraco commented Sep 18, 2021

I know I had some comments on this elsewhere. Can you provide an example of what use-case this benefits? In what situations will a resource provider have a path-like object that's not a pathlib.Path object?

@FFY00
Copy link
Member Author

FFY00 commented Sep 18, 2021

CompabilityFiles for example could benefit. Custom loaders that are not aware of this too. os.PathLike includes pathlib.Path so we could drop that code path if complexity/surface is an issue, creating a new pathlib.Path shouldn't have much performance impact, I would probably keep it though.

@jaraco
Copy link
Member

jaraco commented Sep 18, 2021

I just don't see where CompatbilityFiles exposes a PathLike. Can you show me a demo from an actual loader where the file would end up being temporary due to the current implementation, but would be a Path object after this change?

@FFY00
Copy link
Member Author

FFY00 commented Sep 18, 2021

#233 😄

@FFY00 FFY00 closed this Sep 18, 2021
@FFY00 FFY00 reopened this Sep 18, 2021
@FFY00
Copy link
Member Author

FFY00 commented Sep 18, 2021

Sorry, on phone, misclicked.

That PR makes CompatibilityFiles expose os.PathLike traversables if the loader provides the path in resource_path.

@jaraco
Copy link
Member

jaraco commented Sep 18, 2021

Okay. I see that now. Thanks for the explanation. I can see the value in these changes in principle.

I'm still unconvinced this change is worth the additional complexity. The approach is a clean one, but it still adds new expectations and increases cognitive and maintenance burden.

I'd like to see a use-case that demonstrates the value of this change in practice. Is there actually a loader that implements resource_path that would benefit from this change? And is that use-case worth the extra trouble of adding this complexity, especially since this code is a compatibility layer? Another way to think about it - by leaving the code simpler, legacy loaders get a sub-optimal but adequate experience and are incentivized to implement files natively.

I'm reluctant to add code that's likely never going to be used except maybe in compatibility code, especially if that code doesn't expire with that compatibility code. Once CompatibilityFiles is deprecated and removed, this behavior would remain, likely unused by anybody, for future generations to wonder if it's needed.

What about a more surgical approach where as_file had an override only for CompatibilityFiles, something like:

@singledispatch(CompatibilityFiles.ChildPath)
def _(path):
    return path.actual_path or _tempfile(path.read_bytes, suffix=path.name)

At least that would obviously go away when CompatibilityFiles does. Still I wonder if it's needed.

@FFY00
Copy link
Member Author

FFY00 commented Sep 18, 2021

Okay, before we get into all that, do you have any reasoning for not wanting to give preferential treatment to pathlib.Path? By this I mean, why choose pathlib.Path instead of os.PathLike, that is what I don't really understand. os.PathLike works for all objects that implement the protocol, including pathlib.Path, so I don't understand what is special about it other than being used in the stdlib readers. Does it have any drawbacks I am not aware of?
Currently, if I try to use something like https:/bitranox/pathlib3x, we would not get this preferred behavior.

@jaraco
Copy link
Member

jaraco commented Sep 22, 2021

I don't understand what is special about it other than being used in the stdlib readers.
why choose pathlib.Path instead of os.PathLike

What's special about it is that as_file promises implicitly to return a pathlib.Path object given a Traversable, and it's providing special pass-through when the traversable happens to already be a pathlib.Path object.

If there were a resource provider out there whose files object returned a pathlib3x object or a path pie object, that would be a compelling reason to consider casting those file-like objects to a pathlib.Path object. But until that's the case, those objects will still work (as they meet the required traversable protocol); they'll just be converted to a temporary file first.

I do worry that the tests don't adequately test the intended behavior. When I remove the new behavior, the tests still pass as long as the object passed to as_file implements the Traversable protocol (as required):

diff --git a/importlib_resources/_common.py b/importlib_resources/_common.py
index 5d5e7a723a..91f04f9e34 100644
--- a/importlib_resources/_common.py
+++ b/importlib_resources/_common.py
@@ -114,12 +114,3 @@ def _as_file_pathlib(path):
     Degenerate behavior for pathlib.Path objects.
     """
     yield path
-
-
-@as_file.register(os.PathLike)
[email protected]
-def _as_file_pathlike(path):
-    """
-    Degenerate behavior for os.PathLike objects.
-    """
-    yield pathlib.Path(os.fspath(path))
diff --git a/importlib_resources/tests/test_path.py b/importlib_resources/tests/test_path.py
index 63f575aba3..2f3423fdcb 100644
--- a/importlib_resources/tests/test_path.py
+++ b/importlib_resources/tests/test_path.py
@@ -67,6 +67,13 @@ class PathLikeTests(PathTests, unittest.TestCase):
         def __init__(self, *args, **kwargs):
             self._path = pathlib.Path(*args, **kwargs)
 
+        def read_bytes(self):
+            return self._path.read_bytes()
+
+        @property
+        def name(self):
+            return self._path.name
+
         def __fspath__(self):
             return os.fspath(self._path)
 

It's not obvious to me that PathLikeTraversable needs to be a broken traversable to capture the failed expectation. Moreover, test_path is for testing a legacy API and will likely go away with the deprecation of _legacy.path.

Probably instead, the test should explicitly construct a non-Path PathLike object and pass it to as_file and assert that the file is the same filename as the passed object. Maybe in test_files.

But I'm still unconvinced this functionality is needed. Is there a use-case where a resource provider other than CompatibilityFiles would supply a non-Path PathLike object for files()?

@FFY00
Copy link
Member Author

FFY00 commented Sep 27, 2021

Okay, so I guess you and me have different opinions on this.

I dislike this bypass is done using a type check instead of a protocol that anyone can implement. And that is especially relevant considering this code is in the standard library, meaning this decision will affect people for several years, so I think we should try to make sure the behavior we put in now will be as future-proof as it can. There may be use-cases where using pathlib.Path is undesirable, some of which we may predict, and some we may not, so I am afraid that we may be cornering ourselves into an issue we might not be able to fix later.

For eg. I have some code locally where I was playing with a custom loader for editable packages. The goal was to minimize and delay rebuilding/compilation steps until the very last moment, and only do them if absolutely necessary (eg. I have an auto generated 3GB data file, I only want to re-generate it if anyone actually tries to access it).

That alone is not enough for me to tell you that we should have this functionality, as I am not actually shipping any code, but it does demonstrate the sort use-cases that may arise and we are not considering.

But anyway, this is a fairly minor issue, so I don't want to be wasting more of your time with it. Feel free to close this PR if you feel like that is the best decision.

I do worry that the tests don't adequately test the intended behavior. When I remove the new behavior, the tests still pass as long as the object passed to as_file implements the Traversable protocol (as required):

Oh, yes, I forgot to add the fs path tests. I only added the PathTests tests, which will of course still pass if you provide a valid traversable.

Probably instead, the test should explicitly construct a non-Path PathLike object and pass it to as_file and assert that the file is the same filename as the passed object. Maybe in test_files.

I would probably just add a new test in PathLikeTests directly.

@jaraco
Copy link
Member

jaraco commented Oct 3, 2021

Okay, so I guess you and me have different opinions on this.

I don't want you to be discouraged. I do think we are converging on a solution.

I dislike this bypass is done using a type check instead of a protocol that anyone can implement. And that is especially relevant considering this code is in the standard library, meaning this decision will affect people for several years, so I think we should try to make sure the behavior we put in now will be as future-proof as it can.

That's a compelling point, but I believe the converse is true as well - if this behavior turns out to be difficult to support or have unintended consequences, it's difficult to remove. I used to be of the opinion that an interface should be lenient and support as many plausible use-cases as possible. After working on CPython for a long time, I've found the opposite to be true, that it's preferable to limit the interface surface and specifically deny unintended uses. That's why I feel strongly that this behavior should not be added until there is at least one use-case that strongly calls for this behavior.

I do sympathize with the argument that requiring a use-case could lead to a chicken/egg problem, that by not supporting the interface, it incentivizes alternate implementations, preventing otherwise elegant use-cases from emerging.

For eg. I have some code locally where I was playing with a custom loader for editable packages.

This example is helpful.

I think I can imagine why this custom loader would not be able to present a pathlib.Path object due to the desire to have the individual resources resolved late. In particular, if a call to files() would return a pathlib.Path, that would demand that all resources within that path would have already been generated to the file system.

So then the question becomes, could this use-case be served by a Traversable that generates the content on demand? Could this resource reader return a traversable that never puts the content on disk? In other words, could the loader rely on the temporary file generation of as_file to generate the file on demand, in which case the file gets generated exactly once and only on demand? Is there something important about the file being generated at a particular location? Is it expected that the file will be accessed multiple times by different consumers?

I'm still leaning toward the two presented options (file already exists on disk and thus can be a pathlib.Path or file can be generated on demand from a Traversable) are sufficient.

The thing that would push me to change my position to supporting this change would be to have a docstring in the test that describes the conditions under which this behavior is useful. Something like:

def test_as_file_pathlke():
    """
    as_file may also accept a non-Path PathLike object resolved from a
    ResourceReader.files() in cases where the content is not on the
    file system at the time ``files()`` is called but is present on the file
    system by the time :func:`as_file` is called on one of the children. For
    example, consider `FFY00's custom loader <https:/...>`_
    in which files are only generated on demand at the last minute
    at their canonical location and re-used for subsequent calls and
    interpreter instances. See python/importlib_resources#232 for more
    details.
    """

@FFY00
Copy link
Member Author

FFY00 commented May 2, 2022

That's a compelling point, but I believe the converse is true as well - if this behavior turns out to be difficult to support or have unintended consequences, it's difficult to remove. I used to be of the opinion that an interface should be lenient and support as many plausible use-cases as possible. After working on CPython for a long time, I've found the opposite to be true, that it's preferable to limit the interface surface and specifically deny unintended uses. That's why I feel strongly that this behavior should not be added until there is at least one use-case that strongly calls for this behavior.

I do sympathize with the argument that requiring a use-case could lead to a chicken/egg problem, that by not supporting the interface, it incentivizes alternate implementations, preventing otherwise elegant use-cases from emerging.

That makes sense, and I respect your opinion here. The chicken/egg issue does concern me a bit.

I'm still leaning toward the two presented options (file already exists on disk and thus can be a pathlib.Path or file can be generated on demand from a Traversable) are sufficient.

That'd trigger the content generation when files is called, instead of when we actually need to access some data.

The thing that would push me to change my position to supporting this change would be to have a docstring in the test that describes the conditions under which this behavior is useful. Something like:

FWIW, I am working on such a loader for https:/FFY00/mesonpy. We want to use it for scipy and numpy, which are heavy projects, so triggering unneeded rebuilds should be avoided. I don't have any code pushed yet, but I hopefully I will soon.

@jaraco
Copy link
Member

jaraco commented Oct 5, 2022

I'm going to close this for now, but happy to revisit at any time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants