readr::read_csv() throws confusing warning on correctly-parsed csv files with quoted newline characters #1313

djnavarro · 2021-10-08T20:10:39Z

Following up from the twitter thread 🙂

The underlying issue arises when the csv file contains newline characters in quoted fields. If the user calls readr::read_csv() it is possible to have cases where readr correctly parses the file (presumably indicating that the problem has been auto-detected) but it still throws a warning saying that problems have been detected. Seems likely to confuse some users?

Notes:

the issue isn't related to lazy loading: same behaviour occurs with lazy = TRUE and lazy = FALSE
as the reprex illustrates, the confusing warning does vanish when the user sets num_threads = 1 manually
I'm sure it's possible to strip this down to an even more minimal example, but I figured it makes most sense to reproduce the core features of the "wild caught" data that produced the problem (3 columns with quoted newlines, 1200 rows)

# setup -------------------------------------------------------------------

set.seed(1)

sample_values <- function(n, p_safe) {
  val <- rep("safe", n)
  val[runif(n) > p_safe] <- "UNSAFE\n"
  return(val)
}

# the wild-caught data had this structure
old_df <- tibble::tibble(
  a = sample_values(1200, p_safe = .99),
  b = sample_values(1200, p_safe = .01),
  c = sample_values(1200, p_safe = .01)
)

# write to temp file
path <- tempfile(pattern = "quoted_newlines_", fileext = ".csv")
write.csv(old_df, path, row.names = FALSE)


# throws confusing warning ------------------------------------------------

# read data without setting num_threads
new_df <- readr::read_csv(path, lazy = FALSE)
#> Warning: One or more parsing issues, see `problems()` for details
#> Rows: 1200 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): a, b, c
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# parses correctly...
waldo::compare(old_df, new_df)
#> ✓ No differences

# ...but problems are logged (:confused_user_face:)
readr::problems(new_df)
#> # A tibble: 3 × 5
#>     row   col expected  actual    file                                          
#>   <int> <int> <chr>     <chr>     <chr>                                         
#> 1   451     1 3 columns 1 columns /tmp/RtmpO7Ly1l/quoted_newlines_1d0a7dfb0923.…
#> 2   601     1 3 columns 1 columns /tmp/RtmpO7Ly1l/quoted_newlines_1d0a7dfb0923.…
#> 3  1051     1 3 columns 1 columns /tmp/RtmpO7Ly1l/quoted_newlines_1d0a7dfb0923.…


# works -------------------------------------------------------------------

# read data setting num_threads = 1
new_df <- readr::read_csv(path, lazy = FALSE, num_threads = 1)
#> Rows: 1200 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): a, b, c
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# still parses correctly...
waldo::compare(old_df, new_df)
#> ✓ No differences

# ...and no problems logged (:happy_user_face:)
readr::problems(new_df)
#> # A tibble: 0 × 5
#> # … with 5 variables: row <int>, col <int>, expected <chr>, actual <chr>,
#> #   file <chr>


# session info ------------------------------------------------------------

# to save time:
packageVersion("readr")
#> [1] '2.0.2'

# long form
sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
#>  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
#>  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] pillar_1.6.3      compiler_4.1.1    highr_0.9         R.methodsS3_1.8.1
#>  [5] R.utils_2.11.0    tools_4.1.1       digest_0.6.28     bit_4.0.4        
#>  [9] evaluate_0.14     lifecycle_1.0.1   tibble_3.1.4      R.cache_0.15.0   
#> [13] pkgconfig_2.0.3   rlang_0.4.11      reprex_2.0.1      rstudioapi_0.13  
#> [17] cli_3.0.1.9000    parallel_4.1.1    yaml_2.2.1        xfun_0.26        
#> [21] fastmap_1.1.0     withr_2.4.2       styler_1.6.2      stringr_1.4.0    
#> [25] knitr_1.36        fs_1.5.0          vctrs_0.3.8       hms_1.1.1        
#> [29] tidyselect_1.1.1  bit64_4.0.5       glue_1.4.2        R6_2.5.1         
#> [33] fansi_0.5.0       vroom_1.5.5       rmarkdown_2.10    waldo_0.3.0      
#> [37] purrr_0.3.4       readr_2.0.2       tzdb_0.1.2        magrittr_2.0.1   
#> [41] backports_1.2.1   ellipsis_0.3.2    htmltools_0.5.2   utf8_1.2.2       
#> [45] stringi_1.7.4     crayon_1.4.1      R.oo_1.24.0

^{Created on 2021-10-09 by the reprex package (v2.0.1)}

The text was updated successfully, but these errors were encountered:

eheinzen · 2021-10-26T14:56:44Z

+1 also had this issue today--num_threads = 1 solved it, as did reverting to the older parsing engine. Would love to see a fix, though.

jimhester · 2021-11-09T12:57:49Z

Thank you for opening the issue and for supplying a reproducible example, it is a big help!

You correctly diagnosed the issue, vroom / readr was correctly parsing these files, but the warning message was spurious.

The way vroom handles embedded newlines is when we detect there is an embedded newline we restart the parsing with only a single thread. However we were not also clearing the (spurious) problems before restarting, so you ended up with a false warning as you observed.

This should now be fixed and will be in the next release of vroom.

# vroom 1.5.7 * Jenny Bryan is now the official maintainer. * Fix uninitialized bool detected by CRAN's UBSAN check (tidyverse/vroom#386) * Fix buffer overflow when trying to parse an integer field that is over 64 characters long (tidyverse/readr#1326) * Fix subset indexing when indexes span a file boundary multiple times (#383) # vroom 1.5.6 * `vroom(col_select=)` now works if `col_names = FALSE` as intended (#381) * `vroom(n_max=)` now correctly handles cases when reading from a connection and the file does _not_ end with a newline (tidyverse/readr#1321) * `vroom()` no longer issues a spurious warning when the parsing needs * to be restarted due to the presence of embedded newlines * (tidyverse/readr#1313) Fix performance * issue when materializing subsetted vectors (#378) * `vroom_format()` now uses the same internal multi-threaded code as `vroom_write()`, improving its performance in most cases (#377) * `vroom_fwf()` no longer omits the last line if it does _not_ end with a newline (tidyverse/readr#1293) * Empty files or files with only a header line and no data no longer cause a crash if read with multiple files (tidyverse/readr#1297) * Files with a header but no contents, or a empty file if `col_names = FALSE` no longer cause a hang when `progress = TRUE` (tidyverse/readr#1297) * Commented lines with comments at the end of lines no longer hang R (tidyverse/readr#1309) * Comment lines containing unpaired quotes are no longer treated as unterminated quotations (tidyverse/readr#1307) * Values with only a `Inf` or `NaN` prefix but additional data afterwards, like `Inform` or no longer inappropriately guessed as doubles (tidyverse/readr#1319) * Time types now support `%h` format to denote hour durations greater than 24, like readr (tidyverse/readr#1312) * Fix performance issue when materializing subsetted vectors (#378) # vroom 1.5.5 * `vroom()` now supports files with only carriage return newlines (`\r`). (#360, tidyverse/readr#1236) * `vroom()` now parses single digit datetimes more consistently as readr has done (tidyverse/readr#1276) * `vroom()` now parses `Inf` values as doubles (tidyverse/readr#1283) * `vroom()` now parses `NaN` values as doubles (tidyverse/readr#1277) * `VROOM_CONNECTION_SIZE` is now parsed as a double, which supports scientific notation (#364) * `vroom()` now works around specifying a `\n` as the delimiter (#365, tidyverse/dplyr#5977) * `vroom()` no longer crashes if given a `col_name` and `col_type` both less than the number of columns (tidyverse/readr#1271) * `vroom()` no longer hangs if given an empty value for `locale(grouping_mark=)` (tidyverse/readr#1241) * Fix performance regression when guessing with large numbers of rows (tidyverse/readr#1267)

jimhester closed this as completed in tidyverse/vroom@323d639 Nov 9, 2021

This was referenced Nov 9, 2021

Parsing problems with a well-formed (?) CSV #1256

Closed

Tweak problems hint #1322

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

readr::read_csv() throws confusing warning on correctly-parsed csv files with quoted newline characters #1313

readr::read_csv() throws confusing warning on correctly-parsed csv files with quoted newline characters #1313

djnavarro commented Oct 8, 2021 •

edited

Loading

eheinzen commented Oct 26, 2021

jimhester commented Nov 9, 2021

readr::read_csv() throws confusing warning on correctly-parsed csv files with quoted newline characters #1313

readr::read_csv() throws confusing warning on correctly-parsed csv files with quoted newline characters #1313

Comments

djnavarro commented Oct 8, 2021 • edited Loading

eheinzen commented Oct 26, 2021

jimhester commented Nov 9, 2021

djnavarro commented Oct 8, 2021 •

edited

Loading