http: Reject paths containing non-ASCII characters #3062

Closed
wants to merge 3 commits

Conversation

@Flimm (Contributor) commented Sep 25, 2015

http would previously accept paths with non-ASCII characters. This
proved problematic, because multi-byte characters were encoded as
'binary', that is, the first byte was taken and the remaining bytes were
dropped for that character.

There is no sensible way to fix this without breaking backwards
compatibility for paths containing U+0080 to U+00FF characters.

We already reject paths with unescaped spaces with an exception. This
commit does the same for paths with non-ASCII characters too.

The alternative would have been to encode paths in UTF-8, but this would
cause the behaviour to silently change for paths with single-byte
non-ASCII characters (e.g. the copyright character U+00A9 ©). I find it
preferable to add to the existing prohibition of bad paths with
spaces.

Fixes #2114
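
For illustration only (this snippet is not part of the PR), here is a minimal sketch of the data loss being described, assuming the request line is serialised with Node's 'binary'/'latin1' encoding:

```js
// Hedged sketch: how 'binary'/'latin1' encoding mangles non-ASCII path
// characters by keeping only one byte per character.
const path = '/caf\u00e9/\u65e5\u672c';   // U+00E9 plus U+65E5 U+672C ("日本")

console.log(Buffer.from(path, 'latin1'));
// <Buffer 2f 63 61 66 e9 2f e5 2c>
// U+00E9 survives as 0xE9, but U+65E5 and U+672C collapse to their low
// bytes (0xE5, 0x2C); the original characters are unrecoverable.
```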

@jasnell (Member) commented Sep 25, 2015

Hmmm... this is a step in the right direction, but it doesn't address the full problem. Strings like "a\nb" are still passed through. Can you please update the regexp to catch all whitespace (newlines, tabs, etc.)?

@mscdex added the http label (Issues or PRs related to the http subsystem) on Sep 25, 2015
@Flimm (Contributor, Author) commented Sep 25, 2015

Done.

@jasnell (Member) commented Sep 25, 2015

If we're throwing against invalid whitespace, might as well make it all invalid whitespace, tabs included. They may not be as much of a risk as newlines, but they're still invalid.

@Flimm (Contributor, Author) commented Oct 1, 2015

Done.

```diff
@@ -41,13 +41,16 @@ function ClientRequest(options, cb) {
   if (self.agent && self.agent.protocol)
     expectedProtocol = self.agent.protocol;

-  if (options.path && / /.test(options.path)) {
+  if (options.path && ! /^[\x00-\x08\x0E-\x1F\x21-\x7F]*$/.test(options.path)) {
```
A contributor commented on the diff:
I wonder if it wouldn't be better to use a negated character set?

@Flimm (Contributor, Author) replied:
The highest Unicode code point is U+10FFFF, but a higher one could be introduced in the future. That, along with the complications of non-BMP code points in JavaScript, makes writing a correct, future-proof regex with a negated class tricky. In any case, both this regex and one using a negated character class should be O(n), so I see no need for a change.
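
For reference, here is a hedged sketch of the two equivalent formulations being discussed (the helper name is illustrative, not from the patch):

```js
// Whitelist form used in the patch: the path may contain only these bytes.
const ALLOWED = /^[\x00-\x08\x0E-\x1F\x21-\x7F]*$/;

// Negated-class form suggested above: match any character outside that set.
const DISALLOWED = /[^\x00-\x08\x0E-\x1F\x21-\x7F]/;

// Both checks accept and reject exactly the same strings.
function pathIsInvalid(path) {
  return !ALLOWED.test(path);   // equivalently: DISALLOWED.test(path)
}

console.log(pathIsInvalid('/index.html'));  // false
console.log(pathIsInvalid('/a b'));         // true (unescaped space)
console.log(pathIsInvalid('/\u00a9'));      // true (non-ASCII)
```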

@jasnell added the stalled label (Issues and PRs that are stalled) on Mar 22, 2016
@jasnell added the url label (Issues and PRs related to the legacy built-in url module) on Jun 7, 2016
@rvagg force-pushed the master branch 2 times, most recently from c133999 to 83c7a88, on October 18, 2016
@evanlucas (Contributor) commented:
Was this fixed by #8923?

@bnoordhuis (Member) commented:
No, #8923 only rejects characters <= U+0020.

@jasnell (Member) commented Feb 28, 2017

ping @nodejs/http
@Flimm ... is this still something you'd like to pursue?

@Flimm (Contributor, Author) commented Mar 1, 2017

@jasnell Absolutely. I can rebase it or whatever is preferred now that there are merge conflicts. I can take the time to make sure the bug is still present in newer Node versions. If there is anything I can do differently this time to make sure that the pull request gets merged or rejected, let me know.

@mscdex (Contributor) commented Mar 1, 2017

FWIW I think using a lookup table will probably yield the best performance, instead of a regexp.
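
A hedged sketch of what such a lookup-table check might look like (illustrative only, not an implementation from this thread):

```js
// Mark the byte values the patch's regexp allows:
// \x00-\x08, \x0E-\x1F and \x21-\x7F.
const validChar = new Uint8Array(256);
for (let i = 0x00; i <= 0x08; i++) validChar[i] = 1;
for (let i = 0x0e; i <= 0x1f; i++) validChar[i] = 1;
for (let i = 0x21; i <= 0x7f; i++) validChar[i] = 1;

function pathIsValid(path) {
  for (let i = 0; i < path.length; i++) {
    const c = path.charCodeAt(i);
    if (c > 0xff || !validChar[c]) return false;  // non-ASCII or forbidden byte
  }
  return true;
}
```

The idea being discussed is simply to trade the regexp engine for a per-character array lookup.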

@fhinkel (Member) commented Mar 26, 2017

ping @Flimm, can you rebase this? Thanks.

@bnoordhuis (Member) commented:
I don't know if this can land as-is, even when rebased. Backwards compatibility is a concern - UTF-8 is used in the wild - and I wouldn't want to vouch it works with different combinations of header/body encodings.

@Flimm (Contributor, Author) commented Mar 27, 2017

@bnoordhuis Can someone help me out in knowing what the process is for getting this approach approved? Do I just need to convince one person with commit rights to merge something like this?

This patch does not behave identically to previous versions, so in that sense, it is not backwards compatible. But the way it used to behave is completely broken for characters greater than U+00FF, I hope you can agree. And the way it behaves for characters U+0080 to U+00FF is also weird: it behaves leniently, even though the same code already throws an exception when it comes to spaces. An exception is already thrown when the invalid space character is given as input; all this change does is make sure an exception is thrown for other invalid characters too, instead of irreversibly throwing away data (by only keeping the first byte of multi-byte characters).

@bnoordhuis If this approach is not the best one, which approach would you take instead?

@jasnell added the semver-major label (PRs that contain breaking changes and should be released in the next major version) on Mar 27, 2017
@jasnell (Member) commented Mar 27, 2017

@Flimm ... this has been an ongoing issue with the current HTTP/1 implementation and is a difficult problem to address. As @bnoordhuis points out, there is a significant amount of existing code that uses UTF8 in the path that would be broken if we started rejecting such values outright. Our policy has been to avoid such breaking changes when possible unless the changes are necessary to address security concerns. Personally, I'm a big fan of strict spec compliance, in which case rejecting is technically the right thing to do, but the backwards compatibility concerns cannot be ignored and we'll need to weigh those carefully.

Another possible approach that we can take is to perform additional pct-encoding on those characters rather than throwing. Doing so would come at a performance cost and would likely also need to be carefully evaluated to ensure it wouldn't break existing code.
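
For context, a hedged sketch of that pct-encoding alternative, assuming UTF-8 as the target encoding (the function name and the set of characters left untouched are illustrative, not from this thread):

```js
// Percent-encode anything outside printable ASCII instead of throwing.
// Non-ASCII characters are first encoded as UTF-8, then pct-encoded.
function pctEncodePath(path) {
  return path.replace(/[^\x21-\x7e]/gu, (ch) =>
    Array.from(Buffer.from(ch, 'utf8'), (b) =>
      '%' + b.toString(16).toUpperCase().padStart(2, '0')
    ).join('')
  );
}

console.log(pctEncodePath('/caf\u00e9 menu'));  // '/caf%C3%A9%20menu'
```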

In terms of our process for getting things landed, however... this change qualifies as a semver-major, which requires sign-off from at least two members of the @nodejs/ctc before it can land.

@bnoordhuis (Member) commented:
If this approach is not the best one, which approach would you take instead?

It's complicated. The set of characters to reject depends on the encoding used for the request headers. That in turn is influenced by the encoding of the request body because node.js tries hard to pack the headers and the body into a single outgoing packet.

An example: U+010A ('Ċ') is fine with encoding="utf8"; it decodes to bytes C4 8A. The same codepoint should be rejected with encoding="binary" (or "latin1") because it decodes to byte 0A, a newline.
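
A small illustration of that point (hedged; assuming Node's Buffer encodings behave as described above):

```js
Buffer.from('\u010a', 'utf8');    // <Buffer c4 8a> - harmless in a UTF-8 request line
Buffer.from('\u010a', 'latin1');  // <Buffer 0a>    - a bare LF, i.e. header injection
```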

It was arguably unwise to truncate codepoints > U+FF like that but it goes back all the way to node.js v0.1.x - hard to change now.

@fhinkel (Member) commented May 23, 2017

@Flimm Sorry that we haven't landed this yet. The general process is that a PR needs at least two approvals, and no objections. Seems like it will be very hard to find consensus on this PR. How invested are you in the change?

@Flimm (Contributor, Author) commented May 30, 2017

I'll be honest, I'm not very confident that any more effort on my side is going to help. We need a core contributor to approve or disapprove the idea of this fix. It looks like everyone is focussing on the proposed fix (throwing an exception instead of silently corrupting data), but no one is focussing on the fact that Node is currently silently corrupting data. We need a fix, even if it's not this one.

I've created a separate issue for the fact that Node is silently corrupting Unicode paths in requests (#13296), and a pull request with a test case that illustrates the bug (#13297).

@Flimm (Contributor, Author) commented May 30, 2017

I'm OK with closing this pull request. I want the bug to be fixed; it doesn't have to be through throwing an exception on Unicode input. Thanks to @fhinkel and others who made sure this pull request didn't completely fall through the cracks :)

Labels
http: Issues or PRs related to the http subsystem.
semver-major: PRs that contain breaking changes and should be released in the next major version.
stalled: Issues and PRs that are stalled.
url: Issues and PRs related to the legacy built-in url module.