fread for directories #2582

Open
MichaelChirico opened this issue Jan 22, 2018 · 12 comments

@MichaelChirico
Member

Some file I/O APIs I've worked with have a simple idiom for reading full directories:

path/to/
  file1.csv
  file2.csv
  file3.csv

would be read as a whole; e.g., in Spark:

spark.read.option("header", "true").csv("/path/to/*.csv")

A basic idiom has developed for fread to do this by adding bells and whistles to the following:

rbindlist(lapply(list.files('path/to', full.names = TRUE), fread))

It would be simple enough to wedge directory reading into the fread API by changing:

if (isTRUE(file.info(input)$isdir)) {
  stop("'input' can not be a directory name, but must be a single character string containing a file name, a command, full path to a file, a URL starting 'http[s]://', 'ftp[s]://' or 'file://', or the input data itself.")
}

to (pseudocode around the match.call() part)

if (isTRUE(file.info(input)$isdir)) {
  files = list.files(input, pattern = "\\.csv$", full.names = TRUE)
  # pseudocode: the original fread arguments (via match.call()) would
  # need to be forwarded to each per-file call
  return(rbindlist(lapply(files, fread)))
}

However, it might be nice to build in some flexibility here, e.g.:

  - allowing the list.files call to optionally be recursive;
  - implementing some API for automatic source naming (if there are subdirectories whose names carry information, the manual version allows a bit more flexibility);
  - specifying idcol or fill; etc.

So there are two questions here:

  1. Is this worth implementing?
  2. What's the desired API for doing so? New named arguments to fread? Just add ... and post-process if the is.dir branch is reached? Separate function call altogether?
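For concreteness, the bells-and-whistles version of the idiom could be sketched as follows. This is illustrative only: `fread_dir` is a hypothetical name, and the `pattern`/`recursive` defaults are assumptions, not a proposed final API.

```r
library(data.table)

# Hypothetical directory wrapper; name and arguments are illustrative only
fread_dir <- function(path, pattern = "\\.csv$", recursive = FALSE, ...) {
  files <- list.files(path, pattern = pattern,
                      full.names = TRUE, recursive = recursive)
  tables <- lapply(files, fread, ...)
  names(tables) <- basename(files)          # automatic source naming
  rbindlist(tables, use.names = TRUE, idcol = "source")
}
```

Extra arguments pass through `...` to each per-file fread call, and the file names become an idcol, which covers the "source naming" case for flat directories.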
@st-pasha
Contributor

My thinking was that whenever fread's input resolves to multiple files, each of the files should be read in turn and the result returned as a list of DataTables. An attribute can be set on each of these DataTables to specify the name of its particular source. If one of the sources cannot be read, it should be represented as an exception object in the list, while the other sources continue to parse (there could be an option to control whether to throw an error immediately, or perhaps to skip the bad files).

However, I don't think that automatically rbinding all of the DataTables into a single result is a good idea. In practice some of them may have irregularities (e.g. different column names), some files may be picked up that are not csv data at all, a field may need to be added based on the name of the source file, etc. Adding options to support all of these intricacies would complicate the interface unnecessarily, would be time-consuming (if you get one of the options wrong, you need to rescan all the files), and would be potentially fragile (new use cases may demand new options). Whereas if you just return a list of DataTables, the user is free to do whatever they want using familiar language constructs.
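A minimal sketch of that behavior, assuming a hypothetical helper (`fread_all` is not a real data.table function): each file is read under tryCatch, failures stay in the list as error objects, and successes carry their source path as an attribute.

```r
library(data.table)

# Hypothetical helper, not part of data.table
fread_all <- function(files, ...) {
  lapply(setNames(files, files), function(f) {
    res <- tryCatch(fread(f, ...), error = identity)
    # failed reads remain in the list as error objects,
    # so the other sources can still be parsed
    if (!inherits(res, "error")) setattr(res, "source", f)
    res
  })
}
```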

@MichaelChirico
Member Author

MichaelChirico commented Jan 23, 2018 via email

@HughParsonage
Member

One use case where I've found R wanting is when the directory contains a very large number of small files (i.e. 100,000 to 1,000,000 files of 1-10 kB). In such cases, fread+rbindlist is not faster than read.csv+rbindlist, and both are orders of magnitude slower than using the command line: copy /b *.csv out.csv. Difficulties arise with the command-line option when the columns are not in the same order (i.e. use.names = TRUE would help) or when column headers are present in each of the files (because concatenation results in a file with 100,000 headers interspersed throughout), but it's still much faster than the R alternatives I know.

@st-pasha
Contributor

Interesting use case. Did you try to investigate where the bottleneck is?
Is it the constant overhead fread spends detecting the format of each file? (Setting some of the parameters explicitly might reduce that time.)
Or is it in the rbinding itself?
Or maybe there is significant overhead from R itself reading the directory?

@HughParsonage
Member

It looks something like this, which is not as bad as I remember. (Matt, can you stop improving the package? It's ruining my anecdotes.)

[System.IO.Directory]::GetFiles("address", "*.*").Count
# 463716
system.time(list.files(path = "address",
                       pattern = "\\.csv$",
                       full.names = TRUE))
# user  system elapsed 
# 7.66    1.32    8.99
Files <- list.files(path = "address",
                    pattern = "\\.csv$",
                    full.names = TRUE)

system.time(lapply(Files[1:100], fread, fill = TRUE))
# user  system elapsed 
# 2.38    0.08    2.59 
system.time(lapply(Files[1:1e2], fread, sep = ",", colClasses = "character", fill = TRUE))
# user  system elapsed 
# 1.58    0.10    1.67 

system.time(lapply(Files[1:1e4], fread, sep = ",", colClasses = "character", fill = TRUE))
# user  system elapsed 
# 6.97    5.00   22.67

@MichaelChirico
Member Author

MichaelChirico commented Jan 23, 2018

@st-pasha I assume it's fread overhead; I seem to recall running a benchmark where read.csv is faster for very small files (like <20 rows).

Also, relating to your first comment: returning the sources as names would highlight the utility of #1948 as well, for manipulating these objects in post-processing.

@franknarf1
Contributor

Another use case I'm running into (.. not sure how different it is from the preceding):

I wrote a helper function to read a csv inside a tar.gz, like

fread("7z -so mycsv.tar.gz | 7z x -si -so -ttar")

but now I have a tar.gz containing multiple csvs (which should have identical column names and classes), and it seems I'll need to go another way (I guess: run the 7z call, then lapply fread over the files it drops, confirm the columns match, then rbindlist).

@MichaelChirico
Member Author

Just to have some thoughts written down:

There's a pretty simple version of this where we just map fread('/path/to/dir', ...) to lapply(list.files('/path/to/dir'), fread, ...).

There's also a substantially more involved version where directory-level fread is all done in C: the preamble work (ncol/nrow/type detection) is done first in a loop over files, then we allocate all the memory once and either (1) fill the table in parallel over files, with nthread=1 within each file, or (2) fill the table serially over files, in parallel within each file.

The first version should definitely use the simple approach, but it almost surely won't be faster (for many use cases) than using the terminal to cat the files to a tempfile() first and reading that. If we understand well when this latter approach is preferred, we might leverage file.append (possibly excluding headers?).
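The file.append idea could be sketched like this (`concat_then_fread` is a hypothetical name; it assumes every file shares one identical header row, and it stages headerless copies rather than shelling out to cat):

```r
# Hypothetical sketch: concatenate at the file level, then parse once.
# Assumes all files share the same single header row.
concat_then_fread <- function(files) {
  out <- tempfile(fileext = ".csv")
  writeLines(readLines(files[1L], n = 1L), out)   # keep the header once
  bodies <- vapply(files, function(f) {
    tmp <- tempfile()
    writeLines(readLines(f)[-1L], tmp)            # drop each per-file header
    tmp
  }, character(1L))
  file.append(out, bodies)
  data.table::fread(out)
}
```

Note this sketch still re-reads each file in R to strip its header, so it demonstrates the interface rather than the performance win; a real implementation would need to skip headers at the byte level.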

@jangorecki
Member

IMO it is not good if fread returns a list rather than a data.table or data.frame, unless we provide an extra argument. I mean that changing "dir" to "dir/file1.csv" should not change the class of the returned object. On the other hand, when providing a non-scalar input c("dir/file1.csv","dir/file2.csv"), it makes sense to return a list of data.tables.
lapply(, fread) already seems quite good for returning a list.
If we want to fread a directory, maybe we could expect all files to have a similar schema, and then an extra argument for how to merge/bind those files could be useful, so it can still return just a data.table.

@MichaelChirico
Member Author

A simple how = c('list', 'rbindlist', 'cbindlist', 'mergelist') (or similar) could be good once #4370 is done
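Dispatch on such a how argument might look roughly like this (`fread_many` and the two implemented options are illustrative only; cbindlist/mergelist would hinge on #4370):

```r
# Illustrative only; fread_many is not a real data.table function
fread_many <- function(files, how = c("list", "rbindlist"), ...) {
  how <- match.arg(how)
  tables <- lapply(files, data.table::fread, ...)
  switch(how,
         list      = tables,
         rbindlist = data.table::rbindlist(tables, use.names = TRUE, fill = TRUE))
}
```

With how = "list" the class of the result changes with the input's cardinality, which is the scalar-vs-vector distinction raised above.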

@MichaelChirico
Member Author

Idle musing: if how='rbindlist', we should probably do something like read the schema from the first file, then supply that as colClasses for the subsequent files, for efficiency. As inspired here
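A sketch of that schema-reuse trick, assuming all files share the first file's layout (`fread_uniform` is a hypothetical name):

```r
library(data.table)

# Hypothetical: detect column types once, then reuse them as colClasses
fread_uniform <- function(files) {
  first <- fread(files[1L])
  classes <- vapply(first, function(col) class(col)[1L], character(1L))
  rest <- lapply(files[-1L], fread, colClasses = classes)
  rbindlist(c(list(first), rest), use.names = TRUE)
}
```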

@jangorecki
Member

Unless fill=TRUE is expected.
