Regression in unicode handling in url.parse #1149

jeremys · 2011-06-03T17:20:45Z

Hi there,

I've detected a regression since Node 0.4.6 regarding the handling of unicode characters when parsing url. A test case is worth a thousand words so here we go:

Here's the gist: https://gist.github.com/1006680
The code:

var url = require('url'),
    assert = require('assert');
var test = 'http://➡.ws/pageloads';
var parsed = url.parse(test);
assert.equal(parsed.host, '➡.ws');
console.log('URL parsed correctly.');

With Node 0.4.6:

j:Code jeremy$ node -v
v0.4.6
j:Code jeremy$ node hostunicode.js 
URL parsed correctly.

With Node 0.4.8:

j:Code jeremy$ node -v
v0.4.8
j:Code jeremy$ node hostunicode.js 

node.js:134
        throw e; // process.nextTick error, or 'error' event on first tick
        ^
AssertionError: "➡.ws" == ""
    at Object.<anonymous> (/Users/jeremy/Code/hostunicode.js:5:8)
    at Module._compile (module.js:407:26)
    at Object..js (module.js:413:10)
    at Module.load (module.js:339:31)
    at Function._load (module.js:298:12)
    at Array.<anonymous> (module.js:426:10)
    at EventEmitter._tickCallback (node.js:126:26)

And I also think node does not deal very well with IDNA (see http://tools.ietf.org/html/rfc3490.html) but I will fill that seperatly with a test case.

The text was updated successfully, but these errors were encountered:

isaacs · 2011-06-03T18:40:58Z

Yeah, this happened because of a fix to prevent delimiters and other characters from getting in there.

This is one area where JavaScript's lack of proper unicode support on RegExps is really painful. It seems like we can maybe just blacklist known-disallowed characters in the 0x00-0x80 range, since even punctuation outside of ASCII seems to be allowed in a domain name.

jeremys · 2011-06-06T10:56:29Z

Or maybe a solution is to implement IDNA encoding, and process the encoding first, then apply the current blacklist on the result?

Because we have another issue regarding unicode handling in http. Even with node 0.4.6 when I request http://➡.ws, it end up requesting http://â�¡.ws which of course fail because we should encode the domain.

What would be nice is that url.parse idna-encode the hostname before applying validation. But idna encoding is a big deal. Not sure if Node can leverage this from somewhere else (Python or something).

isaacs · 2011-06-06T16:13:50Z

Yeah, I think if we're going to do IDNA (which seems necessary for http to properly request utf hostnames), we'll need to do it at the url.parse layer. I'm thinking that url.parse should automatically convert to punycode, and thus, url.format will never output a non-ascii hostname. Since browsers treat punycode and utf hostnames the same, it should be the safest approach, if not the prettiest output.

jeremys · 2011-06-06T16:34:13Z

I don't know how do you plan to do IDNA but there is this interesting lib by Google around URL parsing: http://code.google.com/p/google-url/ — It's used in Chromium. But, I'm no expert, I don't know if it can be used within Node.

isaacs · 2011-06-06T17:13:36Z

Yeah, I don't know if we want to pull the url parsing/formatting logic into the C layer. That's a pretty big library, and does a lot of stuff that node doesn't really need.

ry · 2011-06-13T16:58:03Z

See #1174

@bnoordhuis

Using @bnoordhuis's punycode lib. Close nodejs#1174 also

isaacs closed this as completed in 2a848fa Jul 6, 2011

isaacs pushed a commit to isaacs/node-v0.x-archive that referenced this issue Jul 19, 2011

Close nodejs#1149 IDNA and Punycode support in url.parse

786c714

Using @bnoordhuis's punycode lib. Close nodejs#1174 also

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression in unicode handling in url.parse #1149

Regression in unicode handling in url.parse #1149

jeremys commented Jun 3, 2011

isaacs commented Jun 3, 2011

jeremys commented Jun 6, 2011

isaacs commented Jun 6, 2011

jeremys commented Jun 6, 2011

isaacs commented Jun 6, 2011

ry commented Jun 13, 2011

Regression in unicode handling in url.parse #1149

Regression in unicode handling in url.parse #1149

Comments

jeremys commented Jun 3, 2011

isaacs commented Jun 3, 2011

jeremys commented Jun 6, 2011

isaacs commented Jun 6, 2011

jeremys commented Jun 6, 2011

isaacs commented Jun 6, 2011

ry commented Jun 13, 2011