read_* extremely slow when using guess_max parameter in v2.0 #1267

Closed
JoshuaSturm opened this issue Aug 6, 2021 · 2 comments

JoshuaSturm commented Aug 6, 2021

Hi, readr team.

Apologies if this is the wrong place to report this bug, since it's likely a vroom issue.
Reading large dataframes in readr 2.0.0 is extremely slow. I narrowed it down to the guess_max parameter; omitting it in the call would eliminate the performance degradation. However, this can cause parsing issues for large files, so I tend to keep it.
Below is a reprex with benchmarks.

I'm almost certain I used vroom in the last month or two (prior to readr 2.0) with no issues, so I think it's a recent problem.

edited to fix reprex

library(readr)
library(reprex)
library(bench)

options(
  readr.show_col_types = FALSE
)

f <- file.path(tempdir(), "tempdf.csv")

sampleData <- do.call(data.frame, replicate(100L, rep(paste0(sample(c(LETTERS, 0L:9L), size = 9L, replace = TRUE), collapse = ""), 250000L), simplify = FALSE)) |>
  write_csv(file = f)

mark(
  old  = with_edition(1, read_csv(f)),
  old2 = with_edition(1, read_csv(f, guess_max = 250000L))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old           2.04s    2.04s     0.490     216MB    0.980
#> 2 old2          3.79s    3.79s     0.264     402MB    0.264

mark(
  new  = read_csv(f),
  new2 = read_csv(f, guess_max = 250000L)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new           1.56s    1.56s   0.640      3.68MB    0.640
#> 2 new2          5.13m    5.13m   0.00325  191.03MB    0.980

Created on 2021-08-06 by the reprex package (v2.0.1)
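
The parsing issues mentioned above, which are the reason for keeping guess_max, can be illustrated with a minimal sketch. It assumes a hypothetical one-column file (not part of the original report) in which the only text value sits in the last of 5000 rows; the file and the column name `x` are made up for illustration.

```r
library(readr)

# Hypothetical data: 4999 numeric-looking values followed by one text value.
f_small <- tempfile(fileext = ".csv")
write_csv(data.frame(x = c(as.character(1:4999), "abc")), f_small)

# Default guess_max (1000 rows): the text value should not be seen while
# guessing, so the column is typed as numeric and the last value cannot be parsed.
narrow <- suppressWarnings(read_csv(f_small, show_col_types = FALSE))

# guess_max covering every row: the text value is seen, so the column stays character.
wide <- read_csv(f_small, guess_max = 5000L, show_col_types = FALSE)

class(narrow$x)
class(wide$x)
```

Calling `problems(narrow)` after the first read should list the value that could not be parsed.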

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.1.0 (2021-05-18)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/New_York            
#>  date     2021-08-06                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source                       
#>  backports     1.2.1      2020-12-09 [1] CRAN (R 4.1.0)               
#>  bench       * 1.1.1      2020-01-13 [1] CRAN (R 4.1.0)               
#>  bit           4.0.4      2020-08-04 [1] CRAN (R 4.1.0)               
#>  bit64         4.0.5      2020-08-30 [1] CRAN (R 4.1.0)               
#>  cli           3.0.1      2021-07-17 [1] CRAN (R 4.1.0)               
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.1.0)               
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.1.0)               
#>  ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.1.0)               
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.1.0)               
#>  fansi         0.5.0      2021-05-25 [1] CRAN (R 4.1.0)               
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.1.0)               
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.1.0)               
#>  highr         0.9        2021-04-16 [1] CRAN (R 4.1.0)               
#>  hms           1.1.0      2021-05-17 [1] CRAN (R 4.1.0)               
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.1.0)               
#>  knitr         1.33       2021-04-24 [1] CRAN (R 4.1.0)               
#>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.1.0)               
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.1.0)               
#>  pillar        1.6.2      2021-07-29 [1] CRAN (R 4.1.0)               
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.1.0)               
#>  profmem       0.6.0      2020-12-13 [1] CRAN (R 4.1.0)               
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.1.0)               
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.1.0)               
#>  readr       * 2.0.0      2021-07-20 [1] CRAN (R 4.1.0)               
#>  reprex      * 2.0.1      2021-08-05 [1] CRAN (R 4.1.0)               
#>  rlang         0.4.11     2021-04-30 [1] CRAN (R 4.1.0)               
#>  rmarkdown     2.10       2021-08-06 [1] CRAN (R 4.1.0)               
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.1.0)               
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.1.0)               
#>  stringi       1.7.3      2021-07-16 [1] CRAN (R 4.1.0)               
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.1.0)               
#>  styler        1.5.1.9000 2021-08-03 [1] Github (r-lib/styler@a8ec068)
#>  tibble        3.1.3      2021-07-23 [1] CRAN (R 4.1.0)               
#>  tidyselect    1.1.1      2021-04-30 [1] CRAN (R 4.1.0)               
#>  tzdb          0.1.2      2021-07-20 [1] CRAN (R 4.1.0)               
#>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.1.0)               
#>  vctrs         0.3.8      2021-04-29 [1] CRAN (R 4.1.0)               
#>  vroom         1.5.4      2021-08-05 [1] CRAN (R 4.1.0)               
#>  withr         2.4.2      2021-04-18 [1] CRAN (R 4.1.0)               
#>  xfun          0.25       2021-08-06 [1] CRAN (R 4.1.0)               
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.1.0)               
#> 
@jimhester
Collaborator

I am not sure that guessing with the entire file is a great strategy overall; you are basically parsing the whole file at least twice to do this. That said, this was clearly a major performance regression.

Thank you for opening the issue and for supplying a reproducible example. It is a big help and made tracking down the cause much more straightforward!

f <- file.path(tempdir(), "tempdf.csv")

sampleData <- do.call(data.frame, replicate(100L, rep(paste0(sample(c(LETTERS, 0L:9L), size = 9L, replace = TRUE), collapse = ""), 250000L), simplify = FALSE)) |>
  vroom::vroom_write(file = f)

bench::mark(
  new  = vroom::vroom(f),
  new2 = vroom::vroom(f, guess_max = 250000L)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new        224.88ms 224.97ms     3.99     6.84MB    1.33 
#> 2 new2          3.78s    3.78s     0.265  191.05MB    0.265

Created on 2021-08-06 by the reprex package (v2.0.0)

The performance for this use case is still not ideal, but it should be greatly improved from the current release.
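
One way to avoid guessing types from the whole file, offered here as an assumption rather than something proposed in this thread, is to read every column as character and then let readr's `type_convert()` guess types from the values already in memory. The file and the column names `x` and `y` below are hypothetical, and no performance claim is made for this pattern.

```r
library(readr)

# Hypothetical data: x holds one text value in the last row, y is all numeric.
f_chr <- tempfile(fileext = ".csv")
write_csv(data.frame(x = c(rep("1", 4999), "abc"), y = seq_len(5000)), f_chr)

# Step 1: read the file with every column declared as character,
# so no rows are used for type guessing.
raw <- read_csv(f_chr, col_types = cols(.default = col_character()))

# Step 2: re-guess the type of each character column from all of its values.
converted <- type_convert(raw)

class(converted$x)
class(converted$y)
```

Because `type_convert()` looks at every value in each column, the single text value in `x` keeps that column as character, while `y` is converted to a numeric type.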

@JoshuaSturm
Author

Great point - I will start to explicitly define column types when possible.
Thanks for the quick resolution!
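
Defining column types explicitly, as described above, might look like the following minimal sketch. The file and the column names (`id`, `code`, `score`) are hypothetical and chosen only for illustration; with a full `col_types` specification, readr has nothing left to guess, so `guess_max` is not needed.

```r
library(readr)

# Hypothetical three-column file.
f_typed <- tempfile(fileext = ".csv")
write_csv(
  data.frame(id = 1:3, code = c("A1B2", "C3D4", "E5F6"), score = c(0.5, 1.5, 2.5)),
  f_typed
)

# Declare every column type up front with cols().
types <- cols(
  id = col_integer(),
  code = col_character(),
  score = col_double()
)
df <- read_csv(f_typed, col_types = types)

# Equivalent compact form: i = integer, c = character, d = double.
df2 <- read_csv(f_typed, col_types = "icd")

sapply(df, class)
```

For a wide file where every column has the same type, as in the reprex above, `cols(.default = col_character())` avoids listing each column by name.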

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue May 1, 2022
# vroom 1.5.7

* Jenny Bryan is now the official maintainer.

* Fix uninitialized bool detected by CRAN's UBSAN check
  (tidyverse/vroom#386)

* Fix buffer overflow when trying to parse an integer field that is
  over 64 characters long
  (tidyverse/readr#1326)

* Fix subset indexing when indexes span a file boundary multiple times
  (#383)

# vroom 1.5.6

* `vroom(col_select=)` now works if `col_names = FALSE` as intended (#381)

* `vroom(n_max=)` now correctly handles cases when reading from a
  connection and the file does _not_ end with a newline
  (tidyverse/readr#1321)

* `vroom()` no longer issues a spurious warning when the parsing needs
  to be restarted due to the presence of embedded newlines
  (tidyverse/readr#1313)

* `vroom_format()` now uses the same internal multi-threaded code as
  `vroom_write()`, improving its performance in most cases (#377)

* `vroom_fwf()` no longer omits the last line if it does _not_ end
  with a newline (tidyverse/readr#1293)

* Empty files or files with only a header line and no data no longer
  cause a crash if read with multiple files
  (tidyverse/readr#1297)

* Files with a header but no contents, or an empty file if
  `col_names = FALSE`, no longer cause a hang when `progress = TRUE`
  (tidyverse/readr#1297)

* Commented lines with comments at the end of lines no longer hang R
  (tidyverse/readr#1309)

* Comment lines containing unpaired quotes are no longer treated as
  unterminated quotations
  (tidyverse/readr#1307)

* Values with only an `Inf` or `NaN` prefix but additional data
  afterwards, like `Inform`, are no longer inappropriately guessed as
  doubles (tidyverse/readr#1319)

* Time types now support `%h` format to denote hour durations greater
  than 24, like readr (tidyverse/readr#1312)

* Fix performance issue when materializing subsetted vectors (#378)


# vroom 1.5.5

* `vroom()` now supports files with only carriage return newlines
  (`\r`). (#360, tidyverse/readr#1236)

* `vroom()` now parses single digit datetimes more consistently as
  readr has done (tidyverse/readr#1276)

* `vroom()` now parses `Inf` values as doubles
  (tidyverse/readr#1283)

* `vroom()` now parses `NaN` values as doubles
  (tidyverse/readr#1277)

* `VROOM_CONNECTION_SIZE` is now parsed as a double, which supports
  scientific notation (#364)

* `vroom()` now works around specifying a `\n` as the delimiter (#365,
  tidyverse/dplyr#5977)

* `vroom()` no longer crashes if given `col_names` and `col_types`
  that are both shorter than the number of columns
  (tidyverse/readr#1271)

* `vroom()` no longer hangs if given an empty value for
  `locale(grouping_mark=)`
  (tidyverse/readr#1241)

* Fix performance regression when guessing with large numbers of rows
  (tidyverse/readr#1267)