Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion of implementing the Encoding Standard with ICU, and associated web platform tests #13646

Closed
domenic opened this issue Jun 13, 2017 · 14 comments
Labels
buffer Issues and PRs related to the buffer subsystem. i18n-api Issues and PRs related to the i18n implementation.

Comments

@domenic
Copy link
Contributor

domenic commented Jun 13, 2017

I originally wrote this as a comment on #13644, but I realized it was getting a bit far afield, so I wanted to split it out into a separate issue.

/cc @inexorabletash @ricea @jungkees as my Chromium encoding peeps, @hsivonen from Mozilla implementations, and @annevk as spec editor; this thread is about Node.js implementing the Encoding Standard, via ICU, and also about web platform tests. Feel free to unsubscribe if not interested.

Anyway:

I'd be curious about the web platform test pass rates for #13644 with full-icu, although I understand that might reflect more on ICU's conformance to the Encoding Standard than on the code in #13644. I have heard Chromium has a fork of some kind around ICU to better comply with the Encoding Standard, although I vaguely remember it being mostly around label processing, not around actual encoding.

@ricea also wrote a bunch of web platform tests a while back that never ended up landing, see https:/whatwg/encoding/labels/tests; maybe we should make a renewed push for those (although I see @hsivonen has some corrections based on his work). I am happy to help land, although I am not useful for reviewing from an encodings perspective.

@jasnell
Copy link
Member

jasnell commented Jun 13, 2017

first step will be to get the WPT encoding tests working. Because full-icu will be required, those will need to go into their own special bucket. Once set up, however, we can really start to get into verifying that things actually do work as spec'd. Currently, this is definitely going to be entirely limited by the extent of ICU's coverage of the encoding standard details.

@inexorabletash
Copy link

@jungshik can probably comment most authoritatively on Chromium's ICU changes. Patches are at https://cs.chromium.org/chromium/src/third_party/icu/patches/ - most changes to handle Encoding including labels have been upstreamed (behind a compile flag).

@mscdex mscdex added buffer Issues and PRs related to the buffer subsystem. i18n-api Issues and PRs related to the i18n implementation. labels Jun 13, 2017
@hsivonen
Copy link

I pushed encoding_rs to mozilla-inbound moments ago, so hopefully a Firefox nightly with an Encoding Standard-compliant back end will be available for experimentation soon. (I say compliant back end, because there are known integration bugs around BOM and EOF handling that need to be fixed as follow-ups.)

As part of the landing, I adjusted some Web Platform Tests in mozilla-central. My understanding is that they will get upstreamed in due course, but @jgraham would know on what schedule. As for the pending Web Platform Tests from the W3C i18n Core WG, I believe that to the cases that Firefox with encoding_rs doesn't pass are test suite bugs (so commented on each relevant issue of the Encoding Standard repo).

encoding_rs has some CC0 test cases baked into it. Many of those should be easy enough to extract mechanically for use with other implementations, but I don't currently have the available cycles to do it myself. (Example of Rust-emdebbed tests. Example of separate-file input and reference.)

The obvious case where Chromium is compliant but Node using ICU only probably won't be is UTF-8 error handling, where Chromium is compliant by using WTF's UTF-8 decoder instead of ICU's.

@hsivonen
Copy link

As part of the landing, I adjusted some Web Platform Tests in mozilla-central.

I changed a KOI8-U test and an ISO-2022-JP test.

@ricea
Copy link

ricea commented Jun 13, 2017

@r12a should get credit for those tests, not me.

@domenic
Copy link
Contributor Author

domenic commented Jun 13, 2017

Oh goodness, my apologies for mixing up peoples' names! I guess I skipped the stuff beyond the first/last letter.

@r12a
Copy link

r12a commented Jun 13, 2017

@domenic no worries :) Btw, i've been a little distracted recently by other things, but i know the tests need a little attention - it's just a question of finding the time. But if others want to suggest specific changes i'd be happy to consider them.

@hsivonen
Copy link

mozilla-central has worker-compatible non-WPT tests that might be of interest.

I pushed encoding_rs to mozilla-inbound moments ago, so hopefully a Firefox nightly with an Encoding Standard-compliant back end will be available for experimentation soon.

encoding_rs in now on Nightly.

@domenic
Copy link
Contributor Author

domenic commented Jun 16, 2017

We've started landing the WPTs, although they are not in a nice JSON-consumable format like URL were, so getting them running in Node is going to be much harder :-/.

@jasnell
Copy link
Member

jasnell commented Jun 16, 2017

We'll get that part figured out once the tests stablize a bit more. As I mentioned previously, the other thing we're going to need to figure out before we can test everything properly is the whole full-icu requirement. Currently, the only decoders supported out of the box by the experimental implementation are UTF-8, UTF-16le and UTF-16be.

@jungshik
Copy link

@jungshik can probably comment most authoritatively on Chromium's ICU changes. Patches are at https://cs.chromium.org/chromium/src/third_party/icu/patches/ - most changes to handle Encoding including labels have been upstreamed (behind a compile flag).

Chrome's ICU encoding changes have not been upstreamed. What's upstreamed is to exclude the ICU's encoding related C/C++ codes NOT relevant to web.

ICU encoding table changes in Chromium's copy are not just labels. Encoding tables are changed a lot. See tables whose names end with html.ucm
in https://cs.chromium.org/chromium/src/third_party/icu/source/data/mappings/ (or
https://cs.chromium.org/search/?q=file:html.ucm&type=cs )

See also eucjp-gen.sh, euckr-gen.sh, sjis-gen.sh, single_byte_gen.sh and ibm866_gen.sh in
https://cs.chromium.org/chromium/src/third_party/icu/scripts

@Trott
Copy link
Member

Trott commented Jun 7, 2018

I'm going to try to loop @TimothyGu into this issue, mostly because encodings seem to be something he cares about and WPT is something he's implemented in Node.js for the URL object.

Feel absolutely free to close this if you don't think it's likely to get any traction any time in the next year. It's been dormant for the last year already. That doesn't mean it's not worthwhile, of course.

@hsivonen
Copy link

hsivonen commented Jun 7, 2018

ICU encoding table changes in Chromium's copy are not just labels. Encoding tables are changed a lot.

FWIW, encoding_rs has C and C++17 bindings (and writing other kind of C++ bindings is an easy copy-paste job), so using it for Node might be easier than dealing with importing the ICU patches from Chromium if you can accept a Rust dependency.

The obvious case where Chromium is compliant but Node using ICU only probably won't be is UTF-8 error handling, where Chromium is compliant by using WTF's UTF-8 decoder instead of ICU's.

ICU has become compliant on this particular point (though I haven't tested to verify) during the past year.

@jasnell
Copy link
Member

jasnell commented Jun 19, 2020

There's been no further discussion on this in nearly 2 years. We have the TextEncoder and TextDecoder implementations in place. There's likely additional stuff we can do here but without a concrete pull request proposing changes, there's not going to be any value in keeping this issue open. Closing but we can reopen if someone wants to revisit this.

@jasnell jasnell closed this as completed Jun 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. i18n-api Issues and PRs related to the i18n implementation.
Projects
None yet
Development

No branches or pull requests

9 participants