[bug] fread handling embedded NUL characters #3400

Closed
mdavy86 opened this issue Feb 13, 2019 · 1 comment · Fixed by #3505
Comments

mdavy86 commented Feb 13, 2019

This issue is similar to several previous issues about loading a file that contains the ASCII NUL character (the byte as.raw(0)), except that I have a minimal reproducible example, which appears to cause a segfault at fread.R@146.

This example is based on simulation software output where very rarely there can be NUL characters in the body of the file (issue #2485 has already resolved NUL characters at the end of a file). It appears NUL characters at the beginning of a file are acceptable as well.

The header field consists of key=value pairs, and the data field is to be read into a data.table. In the example, NUL characters have been inserted into the body; a single one is enough to cause an error that cannot be caught with error handling.

library(data.table)
## environment
sessionInfo()
## example #1
n <- 1
bytes <- c(charToRaw("a=b\nA  B  C\n1  2  3\n"), rep(as.raw(0), n), charToRaw("4  5  6\n"))
writeBin(bytes, "test.txt")
## fread hangs
try(fread("test.txt", skip=1, header=TRUE, verbose=TRUE))

A verbose trace log is produced using the file test1.R:

## bash here doc...
cat - > test1.R << 'EOF'
library(data.table)
## environment
sessionInfo()
## example #1
n <- 1
bytes <- c(charToRaw("a=b\nA  B  C\n1  2  3\n"), rep(as.raw(0), n), charToRaw("4  5  6\n"))
writeBin(bytes, "test.txt")
## fread hangs
try(fread("test.txt", skip=1, header=TRUE, verbose=TRUE))
EOF

Running test1.R;

$ Rscript test1.R 
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /software/statistical/R-3.5.2/lib64/R/lib/libRblas.so
LAPACK: /software/statistical/R-3.5.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_NZ.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_NZ.UTF-8        LC_COLLATE=en_NZ.UTF-8
 [5] LC_MONETARY=en_NZ.UTF-8    LC_MESSAGES=en_NZ.UTF-8
 [7] LC_PAPER=en_NZ.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.12.1 RLinuxModules_0.3

loaded via a namespace (and not attached):
[1] compiler_3.5.2    R.methodsS3_1.7.1 R.utils_2.7.0     R.oo_1.22.0
omp_get_max_threads() = 16
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 16 threads (omp_get_max_threads()=16, nth=16)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  skip num lines = 1
  show progress = 0
  0/1 column will be read as integer
[02] Opening the file
  Opening file test.txt
  File opened, size = 29 bytes.
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is co
[05] Skipping initial rows if needed
  Skipped to line 2 in the file  Positioned on line 2 starting: <<A  B  C>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=' '  with 2 lines of 3 fields using quote rule 0
  Detected 3 columns on line 2. This line is either column names or first data row. Line starts as: <<A  B  C>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 3
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 1 because (24 bytes from row 1 to eof) / (2 * 16 jump0size) == 0
  A line with too-few fields (1/3) was found on line 2 of sample jump 0.
  Type codes (jump 000)    : 555  Quote rule 0
  All rows were sampled since file is small so we know nrow=1 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 555
[10] Allocate memory for the datatable
  Allocating 3 column slots (3 - 0 dropped) with 1 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=16

## suspending with control-z

## Remove unwanted suspended jobs
$ jobs -l | cut -d' ' -f2 |  xargs -I{} kill -9 {}

A different error occurs when NUL characters are inserted at the beginning of the data field (after the header field), as in test2.R:

## bash here doc...
cat - > test2.R << 'EOF'
library(data.table)
## example #2
n <- 1
bytes <- c(charToRaw("a=b\n"), rep(as.raw(0), n), charToRaw("A  B  C\n1  2  3\n4  5  6\n"))
writeBin(bytes, "test.txt")
## fread
try(fread("test.txt", skip=1, header=TRUE, verbose=FALSE))
EOF

Running test2.R;

$ Rscript test2.R
Empty data.table (0 rows and 1 cols): V1
Warning message:
In fread("test.txt", skip = 1, header = TRUE, verbose = FALSE) :
  Stopped early on line 3. Expected 1 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<>>

The data.table above loads with a warning; however, it is not the correct size (0 rows and 1 column, where the data field has 2 rows and 3 columns).

If you change the inserted byte to any non-zero value (1 to 255), fread works fine: the byte is simply included in one of the data.table elements.

Member

kiwiroy commented Feb 27, 2019

AFAICT the following line causes an "infinite" loop (nrowLimit is very large by default).

if (*tch=='\0') continue; // empty last line

Adding nrowLimit = myNrow to that branch appears to be a fix: it forces the current line to be treated as the last one. See #3433
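
To make the mechanism concrete, here is a minimal sketch of how a `continue` on `'\0'` can spin when the pointer is not advanced, and how capping the row limit ends the loop. This is a simplified model, not the actual fread.c code: the function `read_rows`, the `apply_fix` switch, the `max_iter` safety cap, and the `eof` pointer are all hypothetical. Only the names `tch`, `nrowLimit`, and `myNrow` are taken from the comment above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a row-reading loop (NOT the real fread.c).
 * Returns the number of loop iterations performed. max_iter is an artificial
 * cap so that the unfixed variant can be observed without hanging. */
static size_t read_rows(const char *buf, size_t len, int apply_fix,
                        size_t max_iter, size_t *rows_out)
{
    assert(buf != NULL && rows_out != NULL);

    const char *tch = buf;
    const char *eof = buf + len;
    size_t myNrow = 0;
    size_t nrowLimit = (size_t)-1;   /* "very large by default" */
    size_t iter = 0;

    while (tch < eof && myNrow < nrowLimit && iter < max_iter) {
        iter++;
        if (*tch == '\0') {
            /* Treated as an empty last line. tch is not advanced, so
             * without the fix every later iteration lands here again. */
            if (apply_fix)
                nrowLimit = myNrow;  /* makes the loop condition false */
            continue;
        }
        /* Consume one line, stopping at a newline or an embedded NUL. */
        while (tch < eof && *tch != '\n' && *tch != '\0')
            tch++;
        if (tch < eof && *tch == '\n')
            tch++;
        myNrow++;
    }

    *rows_out = myNrow;
    return iter;
}
```

Calling `read_rows` on a buffer shaped like the data field of example #1 (one row, a NUL byte, then another row) with `apply_fix = 0` runs until `max_iter` is exhausted, while `apply_fix = 1` stops as soon as the NUL is reached. In this sketch the row after the NUL is not read, which matches the idea of forcing the NUL line to be the last line; whether that is the desired behaviour for embedded NULs is a separate question.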
