fread: need more flexible behavior when encountering a broken line. #2263

st-pasha opened this issue Jul 7, 2017 · 4 comments

@st-pasha
Contributor

st-pasha commented Jul 7, 2017

A new parameter bad.lines (or similar) is proposed. This parameter adjusts fread's strategy when dealing with lines that are "broken" (i.e. have fewer or more fields than the expected number of columns). This parameter may take the following values:

  • "error" (default) -- stop scanning the file and raise an exception.
  • "fill" (currently achieved with fill=TRUE) -- any lines having too few fields are padded with NAs. Here "too few" means fewer than the maximum number of fields observed across all rows in the file.
  • "skip" -- broken lines are simply ignored.
  • "extract" -- any broken lines are placed into a separate datatable, whereas the "main" datatable retains empty rows in their place. The extra datatable will have at least the following fields: lineno (line number in the original data file), rowno (corresponding row number in the "main" datatable), line (the text of the line), and nfields (number of fields detected on that line).

Additionally, there should be a parameter report (default FALSE), which is used with the "fill" and "skip" strategies and instructs fread to report to the user the line numbers that were filled/skipped.
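To make the proposal concrete, here is a minimal Python sketch (not data.table's implementation; the function name and signature are illustrative only) of how the "error", "fill", and "skip" strategies could behave on a small CSV:

```python
import csv

# Hypothetical sketch of the proposed bad.lines strategies applied to
# in-memory text.  ncol defaults to the maximum field count observed,
# matching the "fill" semantics described above.
def read_lines(text, bad_lines="error", ncol=None):
    rows = [next(csv.reader([line])) for line in text.splitlines() if line]
    if ncol is None:
        ncol = max(len(r) for r in rows)
    out = []
    for i, r in enumerate(rows, start=1):
        if len(r) == ncol:
            out.append(r)
        elif bad_lines == "error":
            raise ValueError(f"line {i}: expected {ncol} fields, got {len(r)}")
        elif bad_lines == "fill":
            out.append(r + [None] * (ncol - len(r)))  # pad short rows with NA
        elif bad_lines == "skip":
            continue  # drop the broken line entirely
    return out

data = "a,b,c\n1,2,3\n4,5\n6,7,8\n"
print(read_lines(data, bad_lines="skip"))  # the short line "4,5" is dropped
print(read_lines(data, bad_lines="fill"))  # the short line is padded instead
```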

@MichaelChirico
Member

the 'extract' table will use fill = TRUE by default? it's often the case that the bashed-up lines have differing #s of fields among themselves.

also, how would 'extract' determine which lines are the intended lines, vs. which are the scrap lines? just by majority rules?

lastly, i hope fill = TRUE remains since it's convenient and covers most cases.

@st-pasha
Contributor Author

st-pasha commented Jul 7, 2017

Yes, there is no reason to remove the fill parameter; it could just be a shortcut for the bad.lines="fill" strategy.

I think it would be useful to think from the use-case perspective. What are the possible reasons to have a file with an incorrect number of fields in some lines? I can think of several:

  1. The file inherently has no fixed number of fields. Maybe it's poetry. Maybe it's a binary file. Maybe something else.
  2. The file is a data file, but has some intro / tail sections that aren't really data.
  3. The file uses space/tab as a delimiter, but it was edited in a text editor and all trailing whitespace was removed.
  4. The file is a concatenation from multiple sources, and they had different schemas with different numbers of fields. One can only hope that new fields were only added, and nothing was removed/rearranged.
  5. Similar to (4): several datasets appear in a single file, with the second dataset possibly having its own headers and being separated from the first by at least one blank line.
  6. The file was corrupted somehow (e.g. bad sector on a disk, or a small kid "helping" his dad while he stepped away), and now some portions of the file contain garbage. Also, sometimes the last line in the file may be truncated (transmission failed, or the file was not closed properly).
  7. Values written to the file were not serialized properly (in particular quotes/delimiters were not escaped), and as a result some rows appear to have more columns than necessary. Similarly, if a field contained a newline and wasn't quoted, then the row will be broken into 2 incomplete rows (which might also hurt type detection...)

Anything I missed?
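Case 7 is easy to reproduce. A quick Python illustration (using the stdlib csv module, standing in for any CSV parser) of how an unescaped delimiter inflates the field count, and how an unquoted newline splits one logical row in two:

```python
import csv
import io

# Properly quoted field vs. the same field with an unescaped comma.
good = 'id,comment\n1,"fine, thanks"\n'
bad = 'id,comment\n1,fine, thanks\n'

rows_good = list(csv.reader(io.StringIO(good)))
rows_bad = list(csv.reader(io.StringIO(bad)))
print(len(rows_good[1]))  # quoted: parsed as 2 fields
print(len(rows_bad[1]))   # unescaped comma: parsed as 3 fields

# An unquoted newline inside a field breaks one row into two
# incomplete rows (2 fields, then 1 field).
nl_rows = list(csv.reader(io.StringIO("1,first half\nsecond half\n")))
print([len(r) for r in nl_rows])
```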

@MichaelChirico
Member

That sounds pretty thorough. I believe 6. is the most common.

This falls under 4., but may be worth considering separately:

More than one dataset is contained in a single file, probably separated internally by some YAML/metadata header

@st-pasha
Contributor Author

st-pasha commented Jul 8, 2017

So let's consider how we would tackle each of these situations. What would fread's ideal behavior be? What settings would allow the user to load such a file?

  1. When the file has no stable field structure, then I think the best thing for fread is to load it as a single-column file (e.g. like readLines()). This should be the default behavior, but can also be forced via sep="".
  2. Ideally fread should remove the intro/trailing garbage and just read the data, telling the user that some portions of the file were skipped. It's already doing that. But it should give the user the ability to access the chopped-off data. If fread's heuristic is insufficient to handle the file, there are the options skip and nrows to the rescue.
  3. If trailing whitespace was removed, then it means there were NA values there (which usually get written out as empty strings). In this case option fill=TRUE is most appropriate, and will return the file to its original shape.
  4. Again, fill=TRUE seems to be appropriate here. Can fread detect that there are sections of the file with different field counts? Possibly.
  5. When there are 2+ distinct datasets in a file, the user probably wants to read one of them. Ideally, fread would tell the user how to do that, i.e. which skip/nrows parameters to use to extract each part. fill=TRUE is not appropriate here.
  6. The user may want to either remove the corrupted entries, or somehow fix them manually. This should be achievable using either strategy "skip" or "extract".
  7. If data.table is unable to fix these problems, then the user may either want to skip them altogether (strategy "skip"), or to "extract" them and try to figure out what's going on. In the latter case it is convenient to have placeholders in the original dataset where the user may ultimately write their parsed data.
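The "extract" strategy with placeholders can be sketched roughly like this (a Python mock-up, not data.table code; the function and field names follow the proposal above):

```python
import csv

# Sketch of the proposed "extract" strategy: well-formed lines go into
# the main table, broken lines leave an empty placeholder row behind and
# are collected, with lineno/rowno/line/nfields, into a separate table.
def read_extract(text, ncol):
    main, extra = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = next(csv.reader([line]))
        if len(fields) == ncol:
            main.append(fields)
        else:
            main.append([None] * ncol)  # placeholder row in the main table
            extra.append({"lineno": lineno,
                          "rowno": len(main),  # 1-based row in the main table
                          "line": line,
                          "nfields": len(fields)})
    return main, extra

main, extra = read_extract("1,2,3\n4,5\n6,7,8\n", ncol=3)
print(extra)  # one record describing the broken line "4,5"
```

The placeholder row keeps the row numbering of the main table stable, so the user can repair the extracted lines and write them back by rowno.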
