Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify input requirement parsing #7019

Open
chrahunt opened this issue Sep 14, 2019 · 10 comments
Open

Simplify input requirement parsing #7019

chrahunt opened this issue Sep 14, 2019 · 10 comments
Labels
state: needs discussion This needs some more discussion type: refactor Refactoring code

Comments

@chrahunt
Copy link
Member

chrahunt commented Sep 14, 2019

What's the problem this feature will solve?

Currently pip accepts several types of input as "requirements":

  1. name-based requirements (PEP 508)
  2. direct references (PEP 440 - unsupported currently per pip fails to install remote dependency when it is a .whl and follows PEP 440 #6202, but should be accepted)
  3. file paths (no PEP, just current behavior)
  4. URLs (no PEP, just current behavior)

The parsing for these is ad-hoc and pretty complicated, with lots of code paths (see here). This makes it hard to understand:

  1. the error that a user may see given some invalid input
  2. the possible initial states of InstallRequirement given a user input

It is also impossible to re-use the current code to initialize any other kind of type than an InstallRequirement (so this is a prereq for some of the build refactoring).

Describe the solution you'd like

At a high level we need to map any arbitrary input to one of the 4 categories mentioned above. This is difficult to do unambiguously because we accept file paths, so I think we should make some assumptions and then users that want to use weird file paths can feel free to use an explicit file:// URL.

The primary standards-based constraints are:

  1. a PEP 440 direct reference contains a @ followed by <scheme>:// followed optionally by ; and markers which can have any content
  2. a name-based requirement will consist of non-@ characters followed optionally by extras and specifiers and then by ; and markers which can have any content

Simplifying assumptions:

  1. A file path provided by the user must have at least one of ., /, or \ (on Windows), followed optionally by something that looks like extras and something that looks like markers
  2. The URI_reference part of a direct reference will contain ://
  3. URLs passed by the user will contain ://

That leads to the following rules for deciding how to process input:

  1. if the input contains "@" followed eventually by "://" (with no preceding ";") then we treat it like a direct reference - pass it to Requirement and derive all fields of RequirementInfo from that
  2. If the input contains "://" (with no preceding ";") then we treat it like a URL - we manually extract markers and optional package name and extras from and #egg= fragment, which are used to instantiate a Requirement if present. Any missing fields get derived from the Requirement if set.
  3. If the input contains os.pathsep or os.altsep or starts with '.' then we treat it like a path, convert it to an absolute file URL and process the same as 2.
  4. Otherwise, we treat it like a name-based requirement - pass it to Requirement and derive all fields of RequirementInfo from that

Other details:

The module to be added is pip._internal.req.parsing with a function parse_requirement_text that takes a string as would be input by a user or in dependency metadata and returns a RequirementInfo. RequirementInfo would contain:

  1. markers: Set[Marker]
  2. link: Optional[Link] - if None then it's a name-based requirement
  3. requirement: Optional[Requirement] - if None then it's an "unnamed" requirement
  4. extras: Set[str]

parse_requirement_text would do the steps as described above.

parse_requirement_text would not do any filesystem operations or logging and it should map any expected exceptions to RequirementParsingError with an indication of how we were trying to process the text (direct reference, url, path, or name-based).

Once implemented, we should refactor req.constructors.install_req_from_* to delegate parsing to parse_requirement_text and just do operations on the returned RequirementInfo.

Alternative Solutions

  1. Refactor the existing code while preserving all possible existing behaviors. Having just tried that, it's a big pain and the result doesn't look very good.

Additional context

@triage-new-issues triage-new-issues bot added the S: needs triage Issues/PRs that need to be triaged label Sep 14, 2019
@chrahunt chrahunt added state: needs discussion This needs some more discussion type: refactor Refactoring code labels Sep 14, 2019
@triage-new-issues triage-new-issues bot removed the S: needs triage Issues/PRs that need to be triaged label Sep 14, 2019
@cjerdonek
Copy link
Member

How does this relate to PR #6203?

@cjerdonek
Copy link
Member

Also, in your write-up of the proposed rules, can you distinguish between choices that are forced by / follow logically from PEP’s, and rules that are more heuristics of your choosing? It seems like PR #6203 uses different heuristics (though I’m not certain). I think it would be helpful for people to know if / where there might be any ambiguity in interpreting and applying any of the PEP’s, and if we are making any choices here.

@chrahunt
Copy link
Member Author

How does this relate to PR #6203?

Unrelated. The changes there are improvements on the existing code, but keep most of the logic locked up in install_req_from_line. The purpose here is to extract all of the logic for turning a "string that pip accepts" into the pieces of data needed to instantiate InstallRequirement (and different types in the future).

@chrahunt
Copy link
Member Author

can you distinguish between choices that are forced by / follow logically from PEP’s, and rules that are more heuristics of your choosing?

I have updated the original issue to make it more explicitl.

@cjerdonek
Copy link
Member

Unrelated.

Okay, I was asking because PR #7020, which is described as progressing this issue, also describes making behavior changes. So I wasn't sure if there was any overlap with the behavior changes in PR #6203.

@chrahunt
Copy link
Member Author

I think that #7020 or a followup will likely supersede the functional change in #6203, but we would need to incorporate the tests.

@cjerdonek
Copy link
Member

cjerdonek commented Sep 16, 2019

Also, PR #6203 introduces its own heuristics for resolving the ambiguities and deciding whether something should be interpreted as a path (something this issue uses different techniques for). It adds a _looks_like_path() function and also checks for existence on the file system for helping to make the determination. (So it feels like there is some overlap.)

@cjerdonek
Copy link
Member

Okay, so it is related then (it would subsume it).. I think it would be worth comparing the heuristics used there with the approach proposed here then, as there was lots of discussion around that. (I think the reason that PR was never merged is because of the trickiness here and lack of clarity on the correct way to proceed.) This is another reason why I think it's worth separating the behavior changes from refactoring in the discussion. The behavior changes alone are subtle and a bit tricky.

@chrahunt
Copy link
Member Author

chrahunt commented Sep 16, 2019

In that PR only a single branch within install_req_from_line is being refactored, so the checks are in that context and very limited in scope (i.e. direct references are getting treated incorrectly, so what needs to be done so they are less likely to be treated incorrectly). The changes proposed here are much broader, but IMO easier to reason about since we are considering if from the perspective of what we would like to support as opposed to behavior that has accumulated over several years of changes.

@cjerdonek
Copy link
Member

A couple other things that would help in the description of the proposed rules (the "leads to the following rules" part of the original issue comment) are distinguishing between the parts encoding pip's current behavior with the new logic being introduced. In other words, how much of this is new versus describing what pip already does. Something else that would help is to know if what's being proposed is backwards compatible or what, if anything, might break for people.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
state: needs discussion This needs some more discussion type: refactor Refactoring code
Projects
None yet
Development

No branches or pull requests

2 participants