Problems reading filenames with accents on Windows #1345

wilsonfreitas · 2022-01-02T03:43:19Z

I get an empty dataframe when I read a csv file with accents in its name.
Once I remove the accent the function works correctly.

Because the problem happens on my local file system, I didn't know how to generate a reprex.
I have attached a screenshot with an example.

Here is the code.

readr::read_csv("Renda Fixa Pré.csv")

readr::read_csv("Renda Fixa Pre.csv")

fs::dir_ls(".", regexp = "csv$")

The csv file can be downloaded here.

jennybc · 2022-01-03T19:20:36Z

Sidebar re: this

Because the problem happens on my local file system, I didn't know how to generate a reprex.

reprex(wd = ".") is a good option when you really must demo something using local files.

wilsonfreitas · 2022-01-03T20:18:24Z

Thanks a lot, @jennybc.
I have now generated the reprex.

We see that the filename with an accent returns an empty data frame, while the same file under a name without the accent is read correctly.

Is there something I am missing?

readr::read_csv("Renda Fixa Pré.csv")
#> Rows: 0 Columns: 0
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 0 x 0

readr::read_csv("Renda Fixa Pre.csv")
#> Rows: 15 Columns: 5
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (2): Nome, Tipo
#> dbl (2): Prazo, InvestimentoInicial
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 15 x 5
#>    Nome                                       Taxa Tipo  Prazo InvestimentoInic~
#>    <chr>                                     <dbl> <chr> <dbl>             <dbl>
#>  1 CDB Caruana Pre                              22 PRÉ     365              1000
#>  2 CDB Fator Pré-fixado                         22 PRÉ     365              5000
#>  3 CDB NBC Pré-fixado                           22 PRÉ     365              1000
#>  4 CDB BDMG Pré                                229 PRÉ     365             10000
#>  5 CDB BRPartners Pre                          233 PRÉ     365             20000
#>  6 CDB BCG Brasil Pré-fixado                   251 PRÉ     365            100000
#>  7 CDB Pine Pré-fixado                         258 PRÉ     365              5000
#>  8 CDB Modal Pré-fixado                        279 PRÉ     365              1000
#>  9 CDB MAXINVEST-RNX PRE                        29 PRÉ     365              1000
#> 10 CDB Agibank Pré-Fixado                      301 PRÉ     365              1000
#> 11 CDB Banco Industrial do Brasil Pré-fixado   309 PRÉ     365              5000
#> 12 CDB Luso Pre                                 31 PRÉ     365              5000
#> 13 CDB Omni Pré-Fixado                          32 PRÉ     365              5000
#> 14 CDB Avista Pre                              369 PRÉ     365              1000
#> 15 LC Avista Pre                               369 PRÉ     365              1000

fs::dir_ls(".", regexp = "csv$")
#> Renda Fixa Pre.csv  Renda Fixa PrÃ©.csv

Created on 2022-01-03 by the reprex package (v2.0.1)

jennybc · 2022-01-03T20:38:12Z

I'm still looking into the path issue, to see if I can reproduce it. I have a lot of upgrading to do on my Windows VM....

jennybc · 2022-01-03T21:30:02Z

I see what you see, FYI, on Windows. I'm not really working on readr at the moment, but I might add a bit more analysis while I'm here.

I note that a workaround is to explicitly call fs::path_tidy(). (I also note there's evidence of something fishy here with fs, as the result of fs::dir_ls() contains some mojibake.)

library(readr)
library(fs)

read_csv(path_tidy("C:/Users/jenny/Downloads/Renda Fixa Pré.csv"))
#> Rows: 15 Columns: 5
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (2): Nome, Tipo
#> dbl (2): Prazo, InvestimentoInicial
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 15 x 5
#>    Nome                                       Taxa Tipo  Prazo InvestimentoInic~
#>    <chr>                                     <dbl> <chr> <dbl>             <dbl>
#>  1 CDB Caruana Pre                              22 PRÉ     365              1000
#>  2 CDB Fator Pré-fixado                         22 PRÉ     365              5000
#>  3 CDB NBC Pré-fixado                           22 PRÉ     365              1000
#>  4 CDB BDMG Pré                                229 PRÉ     365             10000
#>  5 CDB BRPartners Pre                          233 PRÉ     365             20000
#>  6 CDB BCG Brasil Pré-fixado                   251 PRÉ     365            100000
#>  7 CDB Pine Pré-fixado                         258 PRÉ     365              5000
#>  8 CDB Modal Pré-fixado                        279 PRÉ     365              1000
#>  9 CDB MAXINVEST-RNX PRE                        29 PRÉ     365              1000
#> 10 CDB Agibank Pré-Fixado                      301 PRÉ     365              1000
#> 11 CDB Banco Industrial do Brasil Pré-fixado   309 PRÉ     365              5000
#> 12 CDB Luso Pre                                 31 PRÉ     365              5000
#> 13 CDB Omni Pré-Fixado                          32 PRÉ     365              5000
#> 14 CDB Avista Pre                              369 PRÉ     365              1000
#> 15 LC Avista Pre                               369 PRÉ     365              1000

Created on 2022-01-03 by the reprex package (v2.0.1)
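
For readers unfamiliar with the term: the mojibake in the earlier fs::dir_ls() output ("PrÃ©" instead of "Pré") is the pattern you would expect when the UTF-8 bytes of "é" are interpreted as latin1. The base R sketch below only reproduces that symptom; it is an illustration, not a diagnosis of what fs does internally, and the variable names are made up for the example.

```r
# "é" written as a Unicode escape so the example does not depend on
# the encoding of the script file itself.
name <- "Pr\u00e9"

# In UTF-8, "é" is stored as the two bytes c3 a9.
bytes <- charToRaw(name)

# Misread those bytes as latin1 (one character per byte) and convert
# the result to UTF-8 for display: c3 becomes "Ã" and a9 becomes "©".
misread <- iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")

print(misread)
```

Seeing this pattern in a file listing is a hint that somewhere a UTF-8 string is being treated as if it were in a single-byte encoding.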

DavisVaughan · 2022-01-03T21:36:33Z

vroom is probably missing a call to enc2utf8(file), possibly in vroom:::standardise_path()

(That happens in path_tidy() through fs:::new_fs_path())
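
To make the suggestion concrete, here is a minimal base R sketch of what enc2utf8() does to a path string. It assumes a path that is marked as latin1, which is simulated here with iconv(); it is not vroom's actual code or its eventual fix, and the variable names are invented for illustration.

```r
# The same file name in two encodings. The "\u00e9" escape yields a
# string marked as UTF-8; iconv() produces a latin1-marked copy, which
# stands in for a path typed in a non-UTF-8 Windows session.
path_utf8 <- "Renda Fixa Pr\u00e9.csv"
path_latin1 <- iconv(path_utf8, from = "UTF-8", to = "latin1")

Encoding(path_utf8)    # "UTF-8"
Encoding(path_latin1)  # "latin1"

# Same text, different bytes: "é" takes 2 bytes in UTF-8, 1 in latin1.
nchar(path_utf8, type = "bytes")    # 19
nchar(path_latin1, type = "bytes")  # 18

# enc2utf8() re-encodes the string so that code working on the raw
# bytes (for example, C or C++ code opening the file) sees UTF-8.
normalised <- enc2utf8(path_latin1)
identical(charToRaw(normalised), charToRaw(path_utf8))  # TRUE
```

If the path reaches lower-level code without this normalisation, the bytes it receives depend on the session's native encoding, which is consistent with the problem only showing up for non-ASCII file names.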

jennybc · 2022-01-03T22:57:39Z

Yeah this is a vroom issue. Moving it there. Manually, I guess.

@wilsonfreitas Another workaround for you, until this is fixed in vroom, is to specifically request the first edition of readr.

library(readr)

with_edition(
  1,
  read_csv("C:/Users/jenny/Downloads/Renda Fixa Pré.csv")
)
#> 
#> -- Column specification --------------------------------------------------------
#> cols(
#>   Nome = col_character(),
#>   Taxa = col_number(),
#>   Tipo = col_character(),
#>   Prazo = col_double(),
#>   InvestimentoInicial = col_double()
#> )
#> # A tibble: 15 x 5
#>    Nome                                       Taxa Tipo  Prazo InvestimentoInic~
#>    <chr>                                     <dbl> <chr> <dbl>             <dbl>
#>  1 CDB Caruana Pre                              22 PRÉ     365              1000
#>  2 CDB Fator Pré-fixado                         22 PRÉ     365              5000
#>  3 CDB NBC Pré-fixado                           22 PRÉ     365              1000
#>  4 CDB BDMG Pré                                229 PRÉ     365             10000
#>  5 CDB BRPartners Pre                          233 PRÉ     365             20000
#>  6 CDB BCG Brasil Pré-fixado                   251 PRÉ     365            100000
#>  7 CDB Pine Pré-fixado                         258 PRÉ     365              5000
#>  8 CDB Modal Pré-fixado                        279 PRÉ     365              1000
#>  9 CDB MAXINVEST-RNX PRE                        29 PRÉ     365              1000
#> 10 CDB Agibank Pré-Fixado                      301 PRÉ     365              1000
#> 11 CDB Banco Industrial do Brasil Pré-fixado   309 PRÉ     365              5000
#> 12 CDB Luso Pre                                 31 PRÉ     365              5000
#> 13 CDB Omni Pré-Fixado                          32 PRÉ     365              5000
#> 14 CDB Avista Pre                              369 PRÉ     365              1000
#> 15 LC Avista Pre                               369 PRÉ     365              1000

Created on 2022-01-03 by the reprex package (v2.0.1)

jennybc closed this as completed Jan 3, 2022

This was referenced Jan 3, 2022

File path encoding problem on Windows tidyverse/vroom#394

Closed

Add a test with non-ascii filename #1346

Open

dir_ls() trips over non-ascii file names when native encoding isn't UTF-8 r-lib/fs#366

Open

jennybc mentioned this issue May 5, 2022

Better handling of non-ascii filepaths tidyverse/vroom#434

Merged
