File path encoding problem on Windows #394

Closed
jennybc opened this issue Jan 3, 2022 · 4 comments · Fixed by #434
jennybc commented Jan 3, 2022

Manual transfer of tidyverse/readr#1345

library(vroom)

vroom("C:/Users/jenny/Downloads/Renda Fixa Pré.csv")
#> Rows: 0 Columns: 0
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 0 x 0

# this should be happening automatically
vroom(
  iconv(
    "C:/Users/jenny/Downloads/Renda Fixa Pré.csv",
    to = "UTF-8"
  )
)
#> Rows: 15 Columns: 5
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (2): Nome, Tipo
#> dbl (2): Prazo, InvestimentoInicial
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 15 x 5
#>    Nome                                       Taxa Tipo  Prazo InvestimentoInic~
#>    <chr>                                     <dbl> <chr> <dbl>             <dbl>
#>  1 CDB Caruana Pre                              22 PRÉ     365              1000
#>  2 CDB Fator Pré-fixado                         22 PRÉ     365              5000
#>  3 CDB NBC Pré-fixado                           22 PRÉ     365              1000
#>  4 CDB BDMG Pré                                229 PRÉ     365             10000
#>  5 CDB BRPartners Pre                          233 PRÉ     365             20000
#>  6 CDB BCG Brasil Pré-fixado                   251 PRÉ     365            100000
#>  7 CDB Pine Pré-fixado                         258 PRÉ     365              5000
#>  8 CDB Modal Pré-fixado                        279 PRÉ     365              1000
#>  9 CDB MAXINVEST-RNX PRE                        29 PRÉ     365              1000
#> 10 CDB Agibank Pré-Fixado                      301 PRÉ     365              1000
#> 11 CDB Banco Industrial do Brasil Pré-fixado   309 PRÉ     365              5000
#> 12 CDB Luso Pre                                 31 PRÉ     365              5000
#> 13 CDB Omni Pré-Fixado                          32 PRÉ     365              5000
#> 14 CDB Avista Pre                              369 PRÉ     365              1000
#> 15 LC Avista Pre                               369 PRÉ     365              1000

Created on 2022-01-03 by the reprex package (v2.0.1)

jennybc commented Jan 3, 2022

Given that I see this:

https://github.com/r-lib/vroom/blob/c5b115b1b5f21852886d6407317386b215706239/R/path.R#L22-L23

I suspect this change needs to be brought over from readr:

tidyverse/readr@df90fb9

We should also add a test. Adapt basic approach from this test in readxl:

https://github.com/tidyverse/readxl/blob/4540dff849ce723016d0129f1ba85524faf5952b/tests/testthat/test-read-excel.R#L96-L116
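
A minimal sketch of what such a test could look like, assuming testthat and a temporary CSV whose file name contains a non-ASCII character. The helper `make_non_ascii_csv()` and the file contents are hypothetical and are not taken from vroom or readxl; the readxl test linked above remains the actual reference.

```r
# Hypothetical helper: write a tiny CSV into tempdir() under a non-ASCII name.
# "\u00e9" is used instead of a literal accented character so the snippet does
# not depend on the encoding of the source file.
make_non_ascii_csv <- function(dir = tempdir()) {
  path <- file.path(dir, "Renda Fixa Pr\u00e9.csv")
  writeLines(c("Nome,Prazo", "CDB Caruana Pre,365", "CDB Luso Pre,365"), path)
  path
}

path <- make_non_ascii_csv()

# Only run the expectations when both packages are available.
if (requireNamespace("testthat", quietly = TRUE) &&
    requireNamespace("vroom", quietly = TRUE)) {
  testthat::test_that("non-ASCII file paths can be read", {
    # This exercises a UTF-8 path; a natively encoded variant is only
    # meaningful in a non-UTF-8 locale (for example, on Windows).
    df <- vroom::vroom(enc2utf8(path), delim = ",", show_col_types = FALSE)
    testthat::expect_equal(nrow(df), 2L)
    testthat::expect_equal(names(df), c("Nome", "Prazo"))
  })
}
```

Note that `writeLines()` ends the file with a newline, so this sketch does not cover the no-trailing-newline case.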

jennybc commented Feb 8, 2022

One important feature of OP's original file (the one from the readr issue) is that it doesn't end in a newline.

For some reason, the very important message "Files must end with a newline" isn't captured by reprex (we should follow up on this!):

> vroom("C:/Users/jenny/Downloads/Renda Fixa Pré.csv")
Files must end with a newline
Rows: 0 Columns: 0

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 0 x 0

I also don't understand why re-encoding the path helps with that, but it does:

vroom::vroom(enc2utf8("C:/Users/jenny/Downloads/Renda Fixa Pré.csv"), n_max = 3)
#> Rows: 3 Columns: 5
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (2): Nome, Tipo
#> dbl (2): Prazo, InvestimentoInicial
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 3 x 5
#>   Nome                  Taxa Tipo  Prazo InvestimentoInicial
#>   <chr>                <dbl> <chr> <dbl>               <dbl>
#> 1 CDB Caruana Pre         22 PRÉ     365                1000
#> 2 CDB Fator Pré-fixado    22 PRÉ     365                5000
#> 3 CDB NBC Pré-fixado      22 PRÉ     365                1000

Created on 2022-02-08 by the reprex package (v2.0.1)

We probably still need to work on path handling and testing here, but the whole situation is more confusing than originally thought.
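
To make the re-encoding step easier to reason about, here is a small base R sketch that inspects how the same path is marked and stored. It only illustrates `Encoding()`, `enc2native()`, and `enc2utf8()`; it is an assumption that this is the relevant difference for vroom, not a confirmed diagnosis.

```r
# The path from the reprex, with the accented character written as an escape
# so the snippet does not depend on the encoding of the source file.
path <- "C:/Users/jenny/Downloads/Renda Fixa Pr\u00e9.csv"

# Declared encoding mark of the string as R sees it.
Encoding(path)

# Two spellings of the same path: native encoding vs. UTF-8.
path_native <- enc2native(path)
path_utf8 <- enc2utf8(path)

# The marks and the underlying bytes can differ between the two spellings
# (for example, in a Latin-1 or CP1252 session). In a UTF-8 session they are
# typically the same.
Encoding(c(path_native, path_utf8))
identical(charToRaw(path_native), charToRaw(path_utf8))
```

Running this in the affected Windows session would show whether the path that reaches `vroom()` is natively encoded or UTF-8.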

jennybc commented Feb 8, 2022

In the maintenance document, there are some relevant points (bold is mine):

https://github.com/r-lib/vroom/blame/e1020f6b843dc1c882a37a6d6bb66e8f70c6a02f/MAINTENANCE.md#L12-L24

Particular points that tend to crop up is there are different code paths for the following things

  • reading from normal files or connections
  • line endings ('\r\n', '\r', or '\n')
  • files ending with a trailing newline or not.
  • Use of ALTREP or not

...

Files without a trailing newline are automatically detected and always sent down the code that reads from a connection.

The only code path that is multi-threaded is normal files, connections are read asynchronously and written to a temporary file, which is then read as normal.

This is starting to shed a little light on how the lack of a trailing newline changes things. It is still hard to see how path encoding interacts with this, but it sure seems to.
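
A small sketch for isolating the trailing-newline factor from the path-encoding factor, using plain ASCII temp paths. The helpers `write_csv_bytes()` and `ends_with_newline()` are hypothetical names introduced here for illustration; the `vroom()` calls run only if vroom is installed.

```r
# Write raw bytes so we control exactly whether the file ends in "\n".
write_csv_bytes <- function(path, text) {
  writeBin(charToRaw(text), path)
  invisible(path)
}

# TRUE if the last byte of the file is a line feed (0x0a).
ends_with_newline <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  length(bytes) > 0 && identical(bytes[length(bytes)], as.raw(10L))
}

with_nl <- write_csv_bytes(tempfile(fileext = ".csv"), "a,b\n1,2\n")
without_nl <- write_csv_bytes(tempfile(fileext = ".csv"), "a,b\n1,2")

ends_with_newline(with_nl)
ends_with_newline(without_nl)

if (requireNamespace("vroom", quietly = TRUE)) {
  # Per MAINTENANCE.md, the file without a trailing newline should be sent
  # down the connection code path.
  print(vroom::vroom(with_nl, delim = ",", show_col_types = FALSE))
  print(vroom::vroom(without_nl, delim = ",", show_col_types = FALSE))
}
```

Repeating the same pair of files under a non-ASCII file name would then show whether the two factors interact.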

jennybc added the "bug" label (an unexpected problem or unintended behavior) on Mar 25, 2022
jennybc self-assigned this on Mar 25, 2022