fread: need more flexible behavior when encountering a broken line. #2263

st-pasha opened this issue Jul 7, 2017 · 4 comments

@st-pasha
Contributor

st-pasha commented Jul 7, 2017

A new parameter bad.lines (or similar) is proposed. This parameter adjusts fread's strategy when dealing with lines that are "broken" (i.e. have fewer or more fields than the expected number of columns). This parameter may take the following values:

  • "error" (default) -- stop scanning the file and raise an exception.
  • "fill" (currently achieved with fill=TRUE) -- any lines having too few fields are padded with NAs. Here "too few" means fewer than the maximum number of fields observed across all rows in the file.
  • "skip" -- broken lines are simply ignored.
  • "extract" -- any broken lines are placed into a separate datatable, whereas the "main" datatable retains empty rows in their place. The extra datatable will have at least the following fields: lineno (line number in the original data file), rowno (corresponding row number in the "main" datatable), line (the text of the line), and nfields (number of fields detected on that line).

Additionally, there should be a parameter report (default FALSE), which is used with the "fill" and "skip" strategies and instructs fread to report to the user the line numbers that were filled/skipped.
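To make the proposal concrete, here is a minimal Python sketch (not data.table's implementation; the function name and signature are illustrative only) of how the "error", "fill", and "skip" strategies could behave on a small CSV:

```python
import csv

# Hypothetical sketch of the proposed bad.lines strategies applied to
# in-memory text.  ncol defaults to the maximum field count observed,
# matching the "fill" semantics described above.
def read_lines(text, bad_lines="error", ncol=None):
    rows = [next(csv.reader([line])) for line in text.splitlines() if line]
    if ncol is None:
        ncol = max(len(r) for r in rows)
    out = []
    for i, r in enumerate(rows, start=1):
        if len(r) == ncol:
            out.append(r)
        elif bad_lines == "error":
            raise ValueError(f"line {i}: expected {ncol} fields, got {len(r)}")
        elif bad_lines == "fill":
            out.append(r + [None] * (ncol - len(r)))  # pad short rows with NA
        elif bad_lines == "skip":
            continue  # drop the broken line entirely
    return out

data = "a,b,c\n1,2,3\n4,5\n6,7,8\n"
print(read_lines(data, bad_lines="skip"))  # the short line "4,5" is dropped
print(read_lines(data, bad_lines="fill"))  # the short line is padded instead
```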

@MichaelChirico
Member

the 'extract' table will use fill = TRUE by default? it's often the case that the bashed-up lines have differing #s of fields among themselves.

also, how would 'extract' determine which lines are the intended lines, vs. which are the scrap lines? just by majority rules?

lastly, i hope fill = TRUE remains since it's convenient and covers most cases.

@st-pasha
Contributor Author

st-pasha commented Jul 7, 2017

Yes, there is no reason to remove the fill parameter; it could just be a shortcut for the bad.lines="fill" strategy.

I think it would be useful to think from the use-case perspective. What are the possible reasons to have a file with an incorrect number of fields in some lines? I can think of several:

  1. The file inherently has no fixed number of fields. Maybe it's poetry. Maybe it's a binary file. Maybe something else.
  2. The file is a data file, but has some intro / tail sections that aren't really data.
  3. The file uses space/tab as a delimiter, but it was edited in a text editor and all trailing whitespace was removed.
  4. The file is a concatenation from multiple sources, and they had different schemas with different numbers of fields. One can only hope that new fields were only added, and nothing was removed/rearranged.
  5. Similar to (4): several datasets appear in a single file, with the second dataset possibly having its own headers and being separated from the first by at least one blank line.
  6. The file was corrupted somehow (e.g. bad sector on a disk, or a small kid "helping" his dad while he stepped away), and now some portions of the file contain garbage. Also, sometimes the last line in the file may be truncated (transmission failed, or the file was not closed properly).
  7. Values written to the file were not serialized properly (in particular quotes/delimiters were not escaped), and as a result some rows appear to have more columns than necessary. Similarly, if a field contained a newline and wasn't quoted, then the row will be broken into 2 incomplete rows (which might also hurt type detection...)

Anything I missed?
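Case 7 is easy to reproduce. A quick Python illustration (using the stdlib csv module, standing in for any CSV parser) of how an unescaped delimiter inflates the field count, and how an unquoted newline splits one logical row in two:

```python
import csv
import io

# Properly quoted field vs. the same field with an unescaped comma.
good = 'id,comment\n1,"fine, thanks"\n'
bad = 'id,comment\n1,fine, thanks\n'

rows_good = list(csv.reader(io.StringIO(good)))
rows_bad = list(csv.reader(io.StringIO(bad)))
print(len(rows_good[1]))  # quoted: parsed as 2 fields
print(len(rows_bad[1]))   # unescaped comma: parsed as 3 fields

# An unquoted newline inside a field breaks one row into two
# incomplete rows (2 fields, then 1 field).
nl_rows = list(csv.reader(io.StringIO("1,first half\nsecond half\n")))
print([len(r) for r in nl_rows])
```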

@MichaelChirico
Member

That sounds pretty thorough. I believe 6. is the most common.

This falls under 4., but may be worth considering separately:

More than one dataset is contained in a single file, probably separated internally by some YAML/metadata header

@st-pasha
Contributor Author

st-pasha commented Jul 8, 2017

So let's consider how we would tackle each of these situations. What would fread's ideal behavior be? What settings would allow the user to load such a file?

  1. When the file has no stable field structure, then I think the best thing for fread is to load it as a single-column file (e.g. like readLines()). This should be the default behavior, but can also be forced via sep="".
  2. Ideally fread should remove the intro/trailing garbage and just read the data, telling the user that some portions of the file were skipped. It's already doing that. But it should give the user the ability to access the chopped-off data. If fread's heuristic is insufficient to handle the file, there are the options skip and nrows to the rescue.
  3. If trailing whitespace was removed, then it means there were NA values there (which usually get written out as empty strings). In this case option fill=TRUE is most appropriate, and will return the file to its original shape.
  4. Again, fill=TRUE seems to be appropriate here. Can fread detect that there are sections of the file with different field counts? Possibly.
  5. When there are 2+ distinct datasets in a file, the user probably wants to read one of them. Ideally, fread would tell the user how to do that, i.e. which skip/nrows parameters to use to extract each part. fill=TRUE is not appropriate here.
  6. The user may want to either remove the corrupted entries, or somehow fix them manually. This should be achievable using either strategy "skip" or "extract".
  7. If data.table is unable to fix these problems, then the user may either want to skip them altogether (strategy "skip"), or to "extract" them and try to figure out what's going on. In the latter case it is convenient to have placeholders in the original dataset where the user may ultimately write their parsed data.
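The "extract" strategy with placeholders can be sketched roughly like this (a Python mock-up, not data.table code; the function and field names follow the proposal above):

```python
import csv

# Sketch of the proposed "extract" strategy: well-formed lines go into
# the main table, broken lines leave an empty placeholder row behind and
# are collected, with lineno/rowno/line/nfields, into a separate table.
def read_extract(text, ncol):
    main, extra = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = next(csv.reader([line]))
        if len(fields) == ncol:
            main.append(fields)
        else:
            main.append([None] * ncol)  # placeholder row in the main table
            extra.append({"lineno": lineno,
                          "rowno": len(main),  # 1-based row in the main table
                          "line": line,
                          "nfields": len(fields)})
    return main, extra

main, extra = read_extract("1,2,3\n4,5\n6,7,8\n", ncol=3)
print(extra)  # one record describing the broken line "4,5"
```

The placeholder row keeps the row numbering of the main table stable, so the user can repair the extracted lines and write them back by rowno.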
