Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add an example with / in namespace to README and SPEC #176

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

maxhbr
Copy link
Contributor

@maxhbr maxhbr commented Jun 29, 2022

I think it is not clear from the text in the README and the examples whether a / in the namespace needs to be escaped. This adds an example from /test-suite-data.json to the list of examples, to make that clear, that escaping should not be done.

A follow-up question, that is not scope of this PR, would be:

given the PURL pkg:swift/github.com%2FAlamofire/[email protected] is pkg:swift/github.com/Alamofire/[email protected] its canonical form?
If yes, it should be added to the test cases.

Or, for having more fun and with looking at #63 : what is the canonical form of pkg:golang/github.com%2Frussross%2Fblackfriday%[email protected]?

I think it is not clear from the text in the README and the examples
whether a `/` in the namespace needs to be escaped. This adds an example
from `/test-suite-data.json` to the list of examples, to make that
clear, that escaping should not be done.

A follow-up question, that is not scope of this PR, would be:
=============================================================
given the PURL `pkg:swift/github.com%2FAlamofire/[email protected]`
is `pkg:swift/github.com/Alamofire/[email protected]` its canonical form?
If yes, it should be added to the test cases.

Or, for having more fun and with looking at package-url#63 : what is the canonical
form of `pkg:golang/github.com%2Frussross%2Fblackfriday%[email protected]`?

Signed-off-by: Maximilian Huber <[email protected]>
@sschuberth
Copy link
Member

I think it is not clear from the text in the README and the examples whether a / in the namespace needs to be escaped.

I think it's rather clear from the specification which states:

  • Each namespace segment must be a percent-encoded string
  • A name must be a percent-encoded string

However, the examples (e.g. for golang) seem to be contradicting that:

pkg:golang/github.com/gorilla/context@234fd47e07d1004f0aed9c
pkg:golang/google.golang.org/genproto#googleapis/api/annotations
pkg:golang/github.com/gorilla/context@234fd47e07d1004f0aed9c#api

Looking at the component specification of

scheme:type/namespace/name@version?qualifiers#subpath

it's obvious that the namespace/name part matches github.com/gorilla/context in the first example. And it does not really matter which parts of the latter string you regard to be part of the name or namespace; in any case there is more than one /, so any other would have to be escaped IMO.

@matt-phylum
Copy link
Contributor

However, the examples (e.g. for golang) seem to be contradicting that:

There is no contradiction.

  1. "percent-encoded string" does not specify whether / is supposed to be encoded or not.
  2. github.com/gorilla is the namespace of pkg:golang/github.com/gorilla/context@234fd47e07d1004f0aed9c. It contains two "segments", both of which are properly encoded, whether you expect / to be encoded or not. I would remove "segment" from the description because AFAIK all PURL implementations implemented namespaces without segments (ie it's github.com/gorilla, not ['github.com','gorilla']) and just don't escape / when emitting a namespace.

The pkg:golang examples show namespaces with slashes, and the test suite contains the correct encoding of pkg:swift/github.com/Alamofire/[email protected].

RE the linked vulnerablecode issue: this is an annoying thing about pkg:golang that other people have complained about with no viable solutions as long as PURL enforces a distinction between namespaces and names for Go. When converting a Go package ID to a PURL, you need to split on the last / and use the value before the slash as the namespace and the value after the slash as the name. Don't forget that it's possible for a Go package ID to contain no slashes, in which case the namespace needs to be omitted (eg pkg:golang/v.io).

@sschuberth
Copy link
Member

  1. "percent-encoded string" does not specify whether / is supposed to be encoded or not.

Then your definition of "percent-encoded" seems to differ from mime, and what Wikipedia says: It lists / as a reserved character, and says "If [...] / needs to be in a path segment, then the three characters %2F or %2f must be used in the segment instead of a raw /."

So if github.com/gorilla is the namespace of a Go package, it is used as part if the PURL's path segment, and thus needs to be encoded, no?

@jkowalleck jkowalleck added the PURL core specification Format and syntax that define PURL (excludes PURL type definitions) label Oct 17, 2024
@matt-phylum
Copy link
Contributor

Percent encoding lists no reserved characters. / is a reserved character in a URI, but even then it is not always encoded (in fact, Python's percent encoding function by default excludes /).

github.com/gorilla is the namespace of a Go package PURL and it is not used as the PURL's path. It is either the namespace component, in which / is not escaped, or two namespace segments which must not contain /. Including %2F in the namespace component of a PURL is effectively forbidden by the spec.

@sschuberth
Copy link
Member

Percent encoding lists no reserved characters

Ok, let me see:

image

image

/ is a reserved character in a URI

Right, and a PURL actually is a URI.

and it is not used as the PURL's path.

The term "path" refers to the URI equivalent of a PURL. An URI's path component corresponds to a PURL's namespace and name part as a whole.

@matt-phylum
Copy link
Contributor

You're apparently looking at a different Wikipedia. Mine does not list any reserved characters for percent encoding in general, and also clarifies that "reserved" does not mean escaped:

When a character from the reserved set (a "reserved character") has a special meaning (a "reserved purpose") in a certain context, and a URI scheme says that it is necessary to use that character for some other purpose, then the character must be percent-encoded.

The real RFC also says that slashes are allowed in the path:

path-absolute = "/" [ segment-nz *( "/" segment ) ]

A path consists of a sequence of path segments separated by a slash ("/") character.

If you parse a PURL as a URL, the URL's path is the namespace and name joined with a slash. Encoding that delimiting slash or any slashes contained within the namespace is forbidden by the PURL spec. Only slashes within the name of the package can be encoded, but for Go PURLs the name must not contain slashes because of the way PURL creates a namespace and name for an ecosystem that only has package IDs. Encoding the slashes in a path would also be an error when building a URL.

@sschuberth
Copy link
Member

You're apparently looking at a different Wikipedia.

Well, it would be easier to tell if you provided any links like I did. Unless there was some irony involved from your side 🙄

Only slashes within the name of the package can be encoded

That again contradicts what's written in the spec. Let me quote it again:

  • Each namespace segment must be a percent-encoded string
  • A name must be a percent-encoded string

So name and namespace are encoded in the same way.

In the end, what I'm after is that it must be possible to unambiguously map "round-trip" like:

*package native coordinates* -> PURL -> *package native coordinates*

Where package native coordinates is e.g. Maven GAV (group, artifact, version), or NPM scope, name, version. If in the mapping to PURL somehow the information would get lost what's the namespace and what's the name, mapping back to package native coordinates would not be possible and we'd have a problem.

@matt-phylum
Copy link
Contributor

  • Each namespace segment must be a percent-encoded string
  • A name must be a percent-encoded string

So name and namespace are encoded in the same way.

That is not true. As I said before, "percent-encoded" does not mean that particular characters are or are not encoded. PURL is very particular about which characters are encoded when and how PURLs are formatted because it has a concept of a canonical form. To best understand the encoding requirements you should check the test suite and read the parsing spec.

There is no inherent name/namespace ambiguity converting values to PURL and back.

For Go, the PURL namespace and name are formed by splitting on the last / character. Therefore, the PURL name never contains /, and you don't even need to think about whether slashes are escaped or not.

For Maven, the PURL namespace is the Maven group ID and the PURL name is the Maven artifact ID. Neither of them can contain slashes because of Maven's rules.

For NPM, the PURL namespace is the NPM scope and the PURL name is the NPM unscoped name. Neither of them can contain slashes because of NPM's rules.

For a theoretical ecosystem that does allow slashes in both the PURL namespace and the PURL name, you would end up with something like namespace a/b namec/d -> pkg:generic/a/b/c%2Fd. However, slashes in the name are not properly specified in the character encoding section of the spec, AFAIK there is currently no package type that uses slashes in the names, and if you report bugs against PURL implementations because they don't handle escaping slashes in the name correctly the bugs may be rejected.

@sschuberth
Copy link
Member

sschuberth commented Oct 17, 2024

As I said before, "percent-encoded" does not mean that particular characters are or are not encoded.

Ok, but when talking about "percent-encoded" in PURL-context it should always mean the same thing. So namespace-percent-encoding should be exactly the same "percent-encoding" as name-percent-encoding. I can't be that "percent-encoding" is supposed to work differently depending on whether you encode the namespace or name. That would be awfully confusing. I hope we do agree here.

Regarding examples, let's focus on the Go case here, as you correctly state that other ecosystems do not allow / as part of their coordinates.

The character-encoding says

the '/' used as type/namespace/name and subpath segments separator does not need to and must NOT be percent-encoded. It is unambiguous unencoded everywhere

So if namespace == "github.com/gorilla", it's clear to me that this / should be encoded to not be confused with the / to the left and right of namespace. This also seems to comply with

It is OK to percent-encode purl components otherwise except for the type.

However, I agree that this is not in line with the "how to build" section which says to first split the namespace into segments on /, percent-encode segments, and then again join with /.

All in all, I feel docs, formal spec, tests and actual implementations to not at all be aligned here.

@matt-phylum
Copy link
Contributor

Percent encoding is supposed to work differently depending on whether it's the namespace or the name, and this is something that is often confusing.

It'd be nice if the spec said exactly which characters were supposed to be escaped when instead of saying which sometimes-escaped characters should not be escaped. The old RFC URL spec is similar with its list of "reserved" characters which might need to be encoded and then listing the characters that do not need to be encoded in different contexts. The newer WHATWG URL spec gives a constructive set of characters that must be encoded and that's much easier to understand.

If you have a PURL pkg:golang/github.com/gorilla/context and work through the parsing part of the spec you can see that it's not ambiguous because the number of slash-delimited items is known ahead of time.

  1. Split once from the right on # (no change)
  2. Split once from the right on ? (no change)
  3. Split once from the left on : -> scheme: pkg remainder: golang/github.com/gorilla/context
  4. Strip leading and trailing / (no change)
  5. Split once from the left on / -> type: golang remainder: github.com/gorilla/context
  6. Split once from the right on @ (no change) (at this point, the remainder should be the Go module ID)
  7. Split once from the right on / -> remainder: github.com/gorilla name: context (at this point, remainder is normally used as the namespace and the next step is skipped)
  8. Split on / -> namespace: ['github.com', 'gorilla']

@sschuberth
Copy link
Member

It'd be nice if the spec said exactly which characters were supposed to be escaped when instead of saying which sometimes-escaped characters should not be escaped.

Absolutely! Just calling everything "percent-encoding" if in fact it's different algorithms is highly confusing indeed. We should name different things differently.

If you have a PURL pkg:golang/github.com/gorilla/context and work through the parsing part of the spec you can see that it's not ambiguous because the number of slash-delimited items is known ahead of time.

I agree that your interpretation of the spec can be implemented in a way so that coordinate <-> PURL mapping is unambiguous.

Still I'm wondering what your main sources of your interpretation of the spec are. I guess mostly the "How to build" pseudo-algorithm description and the test suite, as the written texts are less clear.

I'm starting to get convinced that I should change our PURL implementation then to not encode / as %2F as part of the namespace.

sschuberth added a commit to oss-review-toolkit/ort that referenced this pull request Oct 18, 2024
Implement the pseudo-algorithm described at [1]. Most importantly, '/'
in namespaces are now not escaped anymore (also see the lengthy
discussion at [2]), key names are lower-cased, and qualifiers are sorted
for comparability.

[1]: https:/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#how-to-build-purl-string-from-its-components
[2]: package-url/purl-spec#176

Signed-off-by: Sebastian Schuberth <[email protected]>
sschuberth added a commit to oss-review-toolkit/ort that referenced this pull request Oct 18, 2024
Implement the pseudo-algorithm described at [1]. Most importantly, '/'
in namespaces are now not escaped anymore (also see the lengthy
discussion at [2]), key names are lower-cased, and qualifiers are sorted
for comparability.

[1]: https:/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#how-to-build-purl-string-from-its-components
[2]: package-url/purl-spec#176

Signed-off-by: Sebastian Schuberth <[email protected]>
@matt-phylum
Copy link
Contributor

The steps in my comment are adapted from the parsing section of the spec: https:/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#how-to-parse-a-purl-string-in-its-components

sschuberth added a commit to oss-review-toolkit/ort that referenced this pull request Oct 21, 2024
Implement the pseudo-algorithm described at [1]. Most importantly, '/'
in namespaces are now not escaped anymore (also see the lengthy
discussion at [2]), key names are lower-cased, and qualifiers are sorted
for comparability.

[1]: https:/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#how-to-build-purl-string-from-its-components
[2]: package-url/purl-spec#176

Signed-off-by: Sebastian Schuberth <[email protected]>
sschuberth added a commit to oss-review-toolkit/ort that referenced this pull request Oct 22, 2024
Implement the pseudo-algorithm described at [1]. Most importantly, '/'
in namespaces are now not escaped anymore (also see the lengthy
discussion at [2]), key names are lower-cased, and qualifiers are sorted
for comparability.

[1]: https:/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#how-to-build-purl-string-from-its-components
[2]: package-url/purl-spec#176

Signed-off-by: Sebastian Schuberth <[email protected]>
sschuberth added a commit to oss-review-toolkit/ort that referenced this pull request Oct 22, 2024
Implement the pseudo-algorithm described at [1]. Most importantly, '/'
in namespaces are now not escaped anymore (also see the lengthy
discussion at [2]), key names are lower-cased, and qualifiers are sorted
for comparability.

[1]: https:/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#how-to-build-purl-string-from-its-components
[2]: package-url/purl-spec#176

Signed-off-by: Sebastian Schuberth <[email protected]>
sschuberth added a commit to oss-review-toolkit/ort that referenced this pull request Oct 22, 2024
Implement the pseudo-algorithm described at [1]. Most importantly, '/'
in namespaces are now not escaped anymore (also see the lengthy
discussion at [2]), key names are lower-cased, and qualifiers are sorted
for comparability.

[1]: https:/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#how-to-build-purl-string-from-its-components
[2]: package-url/purl-spec#176

Signed-off-by: Sebastian Schuberth <[email protected]>
@johnmhoran
Copy link
Member

fwiw my interpretation re slash encoding is that a slash / should not be percent-encoded. This came up for me this morning looking at this go PURL: pkg:golang/github.com%2Fquic-go%[email protected]

  • the namespace rules provide that "Each namespace segment must be a percent-encoded string" and are silent on whether a slash can or must or must not be percent-encoded. References there to the slash do not mention any encoding, which I take to mean the slash should not be encoded.

  • This is consistent with the language in the Character encoding section of the spec:

    • "the '/' used as type/namespace/name and subpath segments separator does not need to and must NOT be percent-encoded. It is unambiguous [sic] unencoded everywhere".

I'm doing some work on a PURL-validate UI using the Python implementation, which seems to allow percent-encoding of the slashes -- my current w-i-p UI as well as the public PurlDB validate endpoint (both of which use packageurl-python) find pkg:golang/github.com%2Fquic-go%[email protected] to be valid, and my UI also incorrectly parses the namespace segments when the slashes are encoded:

image

image

@matt-phylum
Copy link
Contributor

my UI also incorrectly parses the namespace segments when the slashes are encoded

These screenshots show the namespace being parsed correctly according to the generic PURL rules. pkg:golang/github.com%2Fquic-go%[email protected] is a valid PURL with no namespace. It is not equivalent to pkg:golang/github.com/quic-go/[email protected].

The spec for Go PURLs unfortunately only implies that pkg:golang/github.com%2Fquic-go%[email protected] is incorrect by giving examples of pkg:golang/github.com/quic-go/[email protected] being the correct form.

@sschuberth
Copy link
Member

pkg:golang/github.com%2Fquic-go%[email protected] is a valid PURL with no namespace. It is not equivalent to pkg:golang/github.com/quic-go/[email protected].

Exactly. While both are valid purls syntax-wise, the former interprets the whole string "github.com/quic-go/quic-go" as the name (and thus gets encoded), which IMO is the correct thing to do as the Go language does not really have the concept of namespaces for its packages.

Unfortunately, the purl spec decided otherwise and artificially splits "github.com/quic-go/quic-go" into a "github.com/quic-go" namespace and a "quic-go" name, which I believe to be the source of all confusion.

And pretty much the same issue applies to Swift packages as well, which also do not have a native namespace concept.

@matt-phylum
Copy link
Contributor

I think the unfortunate thing is that PURL has a general concept of namespaces at all. pkg:golang/github.com/quic-go/[email protected] is an objectively better PURL-ification of golang/github.com/quic-go/quic-go v0.40.0 than pkg:golang/github.com%2Fquic-go%[email protected]. From my perspective, package types with meaningful namespaces are the exception, and PURL would be better off removing the concept of namespaces from the spec and just treating everything after the first slash as the name or path. Go and Swift are not the only package ecosystems where PURL splits the ecosystem package ID on the last slash to form a namespace and name which get joined back together to result in the same string (eg NPM, Composer).

Moving the namespace concept from PURL into the few types that actually have special namespace behavior (eg Maven, apt) can be done without breaking compatibility with any existing PURLs, but keeping the namespace and redefining the Go and Swift types to move the namespace part into the name breaks compatibility with nearly all existing PURLs of those types, and implementations producing and consuming PURLs of those types will need to agree on the correct form in order to interoperate.

See also #63 and #308.

@sschuberth
Copy link
Member

sschuberth commented Oct 22, 2024

I think the unfortunate thing is that PURL has a general concept of namespaces at all.

Also see #14.

From my perspective, package types with meaningful namespaces are the exception

With the large Maven ecosystem (incl. Gradle, Sbt etc.) and its groups in mind, I beg to differ.

Go and Swift are not the only package ecosystems where PURL splits the ecosystem package ID on the last slash [...] (eg NPM, Composer).

Note that NPM does have a native concept of namespaces (they're called scopes there), and Composer also suggests that the part before the slash is "your-vendor-name" and the part after the slash is "package-name", so it makes sense to treat "your-vendor-name" as a namespace.

keeping the namespace and redefining the Go and Swift types to move the namespace part into the name breaks compatibility with nearly all existing PURLs of those types

Agreed. It is what it is now, and that's why I also proposed over here that the best we can probably do is to properly document that purl treats Go and Swift as if they had namespaces, although they don't have any.

@matt-phylum
Copy link
Contributor

Maven and Gradle and Sbt are only one case. It doesn't matter if it's a big case or not. The package type is only documented once.

NPM does have scopes, but the scope is part of the package ID, and Composer is the same. Unless somebody is trying to assign meaning to the PURL namespace without understanding what it means for that package type, it's inconsequential whether the PURL contains the scope+name or it just has the ID. It only make sense to treat it as the namespace because PURL defines that there is a namespace, and otherwise it would just be the package ID.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PURL core specification Format and syntax that define PURL (excludes PURL type definitions)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants