Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect handling of empty host in absolute URL #821

Closed
1 task done
kenballus opened this issue Feb 6, 2023 · 14 comments · Fixed by #1136
Closed
1 task done

Incorrect handling of empty host in absolute URL #821

kenballus opened this issue Feb 6, 2023 · 14 comments · Fixed by #1136
Labels
bug Hacktoberfest We think it's good for https://hacktoberfest.digitalocean.com/ help wanted

Comments

@kenballus
Copy link
Contributor

Describe the bug

Absolute URLs are permitted to have empty hosts in RFC 3986.

Relevant grammar rules:

   host          = IP-literal / IPv4address / reg-name
   reg-name      = *( unreserved / pct-encoded / sub-delims )

Thus, a URL like a://:1 conforms to the standard.
However, yarl rejects this URL.

urllib3, CPython urllib, rfc3986, furl, and hyperlink all correctly handle this situation.

To Reproduce

Try running the following snippet:

>>> yarl.URL("a://:1")

Expected behavior

The parse should have succeeded, resulting in URL('a://:1').

Logs/tracebacks

This is the output from the snippet:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.10/site-packages/yarl/_url.py", line 163, in __new__
    raise ValueError("Invalid URL: host is required for absolute urls")
ValueError: Invalid URL: host is required for absolute urls


### Python Version

```console
$ python --version
Python 3.10.9

multidict Version

$ python -m pip show multidict
Name: multidict
Version: 6.0.4
Summary: multidict implementation
Home-page: https:/aio-libs/multidict
Author: Andrew Svetlov
Author-email: [email protected]
License: Apache 2
Location: /home/bkallus/fuzzing/url_differential_fuzzing/url_fuzz_env/lib/python3.10/site-packages
Requires:
Required-by: yarl

yarl Version

$ python -m pip show yarl
Name: yarl
Version: 1.8.2
Summary: Yet another URL library
Home-page: https:/aio-libs/yarl/
Author: Andrew Svetlov
Author-email: [email protected]
License: Apache-2.0
Location: /home/bkallus/fuzzing/url_differential_fuzzing/url_fuzz_env/lib/python3.10/site-packages
Requires: idna, multidict
Required-by:

OS

Arch Linux

Additional context

No response

Code of Conduct

  • I agree to follow the aio-libs Code of Conduct
@kenballus kenballus added the bug label Feb 6, 2023
@webknjaz
Copy link
Member

webknjaz commented Feb 6, 2023

Hi, if you could send a PR with just a test that could then be marked as xfail, it'd be useful.

@kenballus
Copy link
Contributor Author

Hi, if you could send a PR with just a test that could then be marked as xfail, it'd be useful.

Done!

I'm not too familiar with the codebase, but I imagine the fix is to delete lines 187 and 188 in yarl/_url.py:

187:                if host is None:
188:                    raise ValueError("Invalid URL: host is required for absolute urls")

A host is always permitted to be empty, unless scheme-specific rules say otherwise.

@webknjaz webknjaz added help wanted Hacktoberfest We think it's good for https://hacktoberfest.digitalocean.com/ labels Nov 19, 2023
@bdraco
Copy link
Member

bdraco commented Sep 7, 2024

This is a bit more nuanced because http says empty host is invalid, but other schemes may allow it

https://datatracker.ietf.org/doc/html/rfc3986#section-3.2.2

whereas the "http" scheme considers a missing authority or empty host invalid.

@kenballus
Copy link
Contributor Author

This is a bit more nuanced because http says empty host is invalid, but other schemes may allow it

Yep! See my previous message:

A host is always permitted to be empty, unless scheme-specific rules say otherwise.

The "http" scheme is an example of a case in which scheme-specific rules do indeed say otherwise.

So a correct fix for this issue would ensure that both

  • empty hosts are rejected for schemes that require it (e.g., "http"), and
  • empty hosts are accepted for all other schemes.

@bdraco
Copy link
Member

bdraco commented Sep 8, 2024

The big challenge is having a list of each scheme and which ones support no host, and which ones don’t

@bdraco
Copy link
Member

bdraco commented Sep 8, 2024

What’s the real world use case you are solving for here. I’m assuming a is an example scheme and not one you are actually using

@kenballus
Copy link
Contributor Author

The big challenge is having a list of each scheme and which ones support no host, and which ones don’t

This library already special-cases a few schemes to give them default port numbers. This seems like it could be handled in the same way.

What’s the real world use case you are solving for here.

Minimization of programmer surprise :)

@bdraco
Copy link
Member

bdraco commented Sep 8, 2024

Let me ask the question another way: Which schemes do you care about?

@kenballus
Copy link
Contributor Author

All of them. I care about the parser being correct, not any individual scheme or schemes.

@bdraco
Copy link
Member

bdraco commented Sep 8, 2024

We can't realistically add support for every conceivable scheme or hope to maintain a list of which ones require a host and which ones don't unless there is some external data source that can provide this for us.

@kenballus
Copy link
Contributor Author

kenballus commented Sep 9, 2024

I think we're all in agreement that the current behavior is not correct for a few reasons:

  1. There exist URLs with empty authorities that are erroneously accepted, such as http://
  2. There exist URLs with empty hosts that are erroneously rejected, such as whatever://:80

Fixing either one of these problems requires enumerating which URL schemes disallow empty authorities or hosts. We already special-case certain schemes for default port numbers, so maintaining another list of schemes seems to be within scope.

unless there is some external data source that can provide this for us.

The WHATWG URL spec is this data source. It specifies that "ftp", "http", "https", "ws", and "wss" (i.e., all special schemes except "file") require nonempty hosts, and everything else allows empty hosts.

This is the rule that Ada and libcurl enforce, and is the most reasonable position in my opinion.

@bdraco
Copy link
Member

bdraco commented Sep 9, 2024

If the answer is that we only care about enforcing non-empty host on "ftp", "http", "https", "ws", and "wss", I think thats a reasonable position.

@bdraco
Copy link
Member

bdraco commented Sep 9, 2024

Let me know if #1136 is what you were looking for

@kenballus
Copy link
Contributor Author

Looks good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Hacktoberfest We think it's good for https://hacktoberfest.digitalocean.com/ help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants