Better handling of non-ascii filepaths #434

jennybc · 2022-05-05T02:09:03Z

Closes #394, closes #402, closes #403 (previous explorations)

There's a bit more going on here than we originally thought and the solution isn't quite as simple as "UTF-8 everywhere" (but it's close).

First, the file from the original issue (tidyverse/readr#1345) has a path containing non-ascii characters, but what's easier to miss is that the file also lacks a trailing newline. The failure cascade looks like this:

has_trailing_newline() implicitly expects the path to be in the native encoding and, if it's not, can fail to find an existing file. When that happens, it unconditionally reports TRUE (yes, there is a trailing newline), which is wrong in this case. A file that lacks a trailing newline should be routed through the connection logic but, in this case, is read as a "regular file", which fails and results in an empty tibble.

So yes, the problem is the file path encoding, but it first rears its head via has_trailing_newline(), not the main file-reading step.
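
A minimal sketch of that failure mode, written in base R purely for illustration. The function name `has_trailing_newline_sketch()` is hypothetical and this is not vroom's actual implementation; it only mirrors the behaviour described above, where a file that cannot be found is reported as having a trailing newline.

```r
# Hypothetical re-creation of the check described above; not vroom's code.
has_trailing_newline_sketch <- function(path) {
  # If the file cannot be found (for example, because the path is in an
  # encoding the lookup does not expect), report TRUE, as described above.
  if (!file.exists(path)) {
    return(TRUE)
  }
  size <- file.size(path)
  if (is.na(size) || size == 0) {
    # Empty file: returning TRUE here is an arbitrary choice for this sketch.
    return(TRUE)
  }
  # Read the raw bytes and test whether the last one is a line feed (0x0a).
  bytes <- readBin(path, what = "raw", n = size)
  identical(bytes[length(bytes)], as.raw(0x0a))
}

tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b\n1,2"), tmp)  # deliberately no trailing newline

has_trailing_newline_sketch(tmp)                       # FALSE
has_trailing_newline_sketch(paste0(tmp, "-not-there")) # TRUE, although no file exists
unlink(tmp)
```

The second call shows why the fallback is risky: "file not found" and "file ends with a newline" become indistinguishable to the caller.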

Second, by exploring various solutions, I discovered that blindly applying enc2utf8() is too simple, since vroom tests itself on Linux in a non-UTF-8 locale. Hence we do enc2utf8() on Windows and enc2native() otherwise. This effort did reveal a pre-existing file path problem on that build, in the "file ends with a newline" case, that I ultimately decided not to fix. I think this OS-dependent approach may currently be irrelevant in vroom, because of the mio problem I show below, but it might be worth it in other settings.

For normal files, on Linux, in a non-UTF-8 locale, with a multi-byte file path, we get an error from mio itself, i.e. the memory mapping fails:
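
The OS-dependent re-encoding can be sketched in a few lines of base R. The helper name `reencode_path()` is an assumption made for this illustration and may not match what the PR actually adds; only `enc2utf8()`, `enc2native()` and `.Platform$OS.type` are real base R.

```r
# Hypothetical helper illustrating the approach; the name is an assumption.
reencode_path <- function(path) {
  if (.Platform$OS.type == "windows") {
    # On Windows, hand the path over as UTF-8.
    enc2utf8(path)
  } else {
    # Elsewhere, respect the session's native encoding, which may not be UTF-8.
    enc2native(path)
  }
}

p <- "d\u00e9j\u00e0-vu.csv"  # a path containing non-ASCII characters
Encoding(p)                   # declared encoding before re-encoding
Encoding(reencode_path(p))    # result depends on the OS and the locale
```

Pure ASCII paths pass through unchanged on every platform, so a helper like this only matters for paths such as the one in the original issue.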

https://github.com/tidyverse/vroom/runs/6315284023?check_suite_focus=true#step:7:182

> test_check("vroom")
mapping error: No such file or directory
[ FAIL 2 | WARN 0 | SKIP 2 | PASS 1065 ]

I am content to skip() here and re-consider if an actual user ever needs to do this. There is also evidence that vroom's writing functions are not entirely prepared for this scenario, but the same posture applies.
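
One way such a skip condition could be expressed is a small locale predicate in base R. This is a sketch under assumptions: the function name `in_utf8_locale()` is hypothetical, and the actual condition used in vroom's tests is not shown here.

```r
# Hypothetical predicate; l10n_info() is base R and reports locale properties.
in_utf8_locale <- function() {
  isTRUE(l10n_info()[["UTF-8"]])
}

# Assumed combination: not Windows and not a UTF-8 locale, i.e. the setting
# where memory mapping a multi-byte file path was seen to fail.
should_skip_multibyte_path <- function() {
  .Platform$OS.type != "windows" && !in_utf8_locale()
}

should_skip_multibyte_path()

# Inside a testthat test, this could feed skip_if(), for example:
#   testthat::skip_if(should_skip_multibyte_path(), "multi-byte path, non-UTF-8 locale")
```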

DavisVaughan

Such a hard fought battle

jennybc force-pushed the iss394 branch 4 times, most recently from fbc524b to b3d8a0b May 5, 2022 20:27

jennybc added 3 commits May 5, 2022 13:46

Add a job on Windows + R 4.1

3121838

The encoding improvements in R 4.2 on Windows mean that I really need to include 4.1 to make sure I can see the problem and that I have fixed it. Also make the GHA config more obvious in other ways.

Rename this function, because I need this function name for something else

0b0dabb

Add failing tests

d39724b

jennybc force-pushed the iss394 branch from b3d8a0b to 3121838 May 5, 2022 20:47

jennybc added 2 commits May 5, 2022 14:08

Reencode filepaths to UTF-8 or native encoding, depending on OS

9e86fe4

Same motivations as r-lib/tzdb#13 (comment). It's too simplistic to always apply `enc2utf8()`, given that we have an existing ambition to work in a non-UTF-8 locale on unix, as evidenced by the GHA config.

Add NEWS bullet

cce8cac

DavisVaughan approved these changes May 5, 2022

jennybc added 5 commits May 5, 2022 14:55

Start with natively encoded paths on en_US.ISO-8859-1

f47bf2c

Make the failure more informative

3788e12

What if vroom_write_lines() has its own problems?

ec536ca

Revert "Reencode the path in more places"

dc28ede

This reverts commit 4cd8056.

Decide not to solve the linux + ISO-8859-1 + multi-byte character filepath problem

e4c532c

Doesn't seem like a good use of more time

jennybc force-pushed the iss394 branch from 4cd8056 to e4c532c May 6, 2022 05:19

Possible simplification

36b0cc4

jennybc marked this pull request as ready for review May 6, 2022 06:02

jennybc merged commit f124fcc into main May 6, 2022

jennybc mentioned this pull request May 9, 2022

Rework filepath (re-)encoding #438

Merged
