Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

html output break utf-8 #275

Closed
Dushistov opened this issue May 5, 2023 · 2 comments
Closed

html output break utf-8 #275

Dushistov opened this issue May 5, 2023 · 2 comments
Labels
C-upstream-bug Category: This is a bug of compiler or dependencies (the fix may require action in the upstream)

Comments

@Dushistov
Copy link

Dushistov commented May 5, 2023

I have lines like this in my codebase:

 if !matches!(it.next(), Some('A'..='Z') | Some('А'..='Я')) {
        return false;
 }

Generating html with such command cargo llvm-cov --html test.

For letter 'Я' resulted html code looks like this:

=&apos;\320<span class='tooltip-content'>5</span></div>\257&apos;

so html generator breaks valid utf-8 sequence for 'Я' (\320\257) into two parts and insert html code between,
making it invalid.

@taiki-e taiki-e added the C-upstream-bug Category: This is a bug of compiler or dependencies (the fix may require action in the upstream) label May 5, 2023
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jan 8, 2024
coverage: `llvm-cov` expects column numbers to be bytes, not code points

Normally the compiler emits column numbers as a 1-based number of Unicode code points.

But when we embed coverage mappings for `-Cinstrument-coverage`, those mappings will ultimately be read by the `llvm-cov` tool. That tool assumes that column numbers are 1-based numbers of *bytes*, and relies on that assumption when slicing up source code to apply highlighting (in HTML reports, and in text-based reports with colour).

For the very common case of all-ASCII source code, bytes and code points are the same, so the difference isn't noticeable. But for code that contains non-ASCII characters, emitting column numbers as code points will result in `llvm-cov` slicing strings in the wrong places, producing mangled output or fatal errors.

(See taiki-e/cargo-llvm-cov#275 as an example of what can go wrong.)
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jan 8, 2024
coverage: `llvm-cov` expects column numbers to be bytes, not code points

Normally the compiler emits column numbers as a 1-based number of Unicode code points.

But when we embed coverage mappings for `-Cinstrument-coverage`, those mappings will ultimately be read by the `llvm-cov` tool. That tool assumes that column numbers are 1-based numbers of *bytes*, and relies on that assumption when slicing up source code to apply highlighting (in HTML reports, and in text-based reports with colour).

For the very common case of all-ASCII source code, bytes and code points are the same, so the difference isn't noticeable. But for code that contains non-ASCII characters, emitting column numbers as code points will result in `llvm-cov` slicing strings in the wrong places, producing mangled output or fatal errors.

(See taiki-e/cargo-llvm-cov#275 as an example of what can go wrong.)
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jan 8, 2024
coverage: `llvm-cov` expects column numbers to be bytes, not code points

Normally the compiler emits column numbers as a 1-based number of Unicode code points.

But when we embed coverage mappings for `-Cinstrument-coverage`, those mappings will ultimately be read by the `llvm-cov` tool. That tool assumes that column numbers are 1-based numbers of *bytes*, and relies on that assumption when slicing up source code to apply highlighting (in HTML reports, and in text-based reports with colour).

For the very common case of all-ASCII source code, bytes and code points are the same, so the difference isn't noticeable. But for code that contains non-ASCII characters, emitting column numbers as code points will result in `llvm-cov` slicing strings in the wrong places, producing mangled output or fatal errors.

(See taiki-e/cargo-llvm-cov#275 as an example of what can go wrong.)
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jan 8, 2024
coverage: `llvm-cov` expects column numbers to be bytes, not code points

Normally the compiler emits column numbers as a 1-based number of Unicode code points.

But when we embed coverage mappings for `-Cinstrument-coverage`, those mappings will ultimately be read by the `llvm-cov` tool. That tool assumes that column numbers are 1-based numbers of *bytes*, and relies on that assumption when slicing up source code to apply highlighting (in HTML reports, and in text-based reports with colour).

For the very common case of all-ASCII source code, bytes and code points are the same, so the difference isn't noticeable. But for code that contains non-ASCII characters, emitting column numbers as code points will result in `llvm-cov` slicing strings in the wrong places, producing mangled output or fatal errors.

(See taiki-e/cargo-llvm-cov#275 as an example of what can go wrong.)
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Jan 9, 2024
Rollup merge of rust-lang#119033 - Zalathar:unicode, r=davidtwco

coverage: `llvm-cov` expects column numbers to be bytes, not code points

Normally the compiler emits column numbers as a 1-based number of Unicode code points.

But when we embed coverage mappings for `-Cinstrument-coverage`, those mappings will ultimately be read by the `llvm-cov` tool. That tool assumes that column numbers are 1-based numbers of *bytes*, and relies on that assumption when slicing up source code to apply highlighting (in HTML reports, and in text-based reports with colour).

For the very common case of all-ASCII source code, bytes and code points are the same, so the difference isn't noticeable. But for code that contains non-ASCII characters, emitting column numbers as code points will result in `llvm-cov` slicing strings in the wrong places, producing mangled output or fatal errors.

(See taiki-e/cargo-llvm-cov#275 as an example of what can go wrong.)
@Zalathar
Copy link

Zalathar commented Jan 9, 2024

This should now be fixed upstream in nightly as of rust-lang/rust#119033.

@taiki-e
Copy link
Owner

taiki-e commented Jan 9, 2024

@Zalathar Great! Thanks for fixing this!

@taiki-e taiki-e closed this as completed Jan 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-upstream-bug Category: This is a bug of compiler or dependencies (the fix may require action in the upstream)
Projects
None yet
Development

No branches or pull requests

3 participants