fread for directories #2582

Open
MichaelChirico opened this issue Jan 22, 2018 · 12 comments

@MichaelChirico
Member

Some file I/O APIs I've worked with have a simple idiom for reading full directories:

path/to/
  file1.csv
  file2.csv
  file3.csv

would be read as a whole; e.g., in Spark:

spark.read.option("header", "true").csv("/path/to/*.csv")

A basic idiom has developed for fread to do this by adding bells and whistles to the following:

rbindlist(lapply(list.files('path/to', full.names = TRUE), fread))

It would be simple enough to wedge directory reading into the fread API by changing:

if (isTRUE(file.info(input)$isdir)) {
  stop("'input' can not be a directory name, but must be a single character string containing a file name, a command, full path to a file, a URL starting 'http[s]://', 'ftp[s]://' or 'file://', or the input data itself.")
}

to (pseudocode around the match.call() part)

if (isTRUE(file.info(input)$isdir)) {
  files = list.files(input, pattern = "\\.csv$", full.names = TRUE)
  # pseudocode: the original fread arguments (via match.call()) would
  # need to be forwarded to each per-file call
  return(rbindlist(lapply(files, fread)))
}

However, it might be nice to build in some flexibility here, e.g.:

  - allowing the list.files call to optionally be recursive;
  - implementing some API for automatic source naming (if there are subdirectories whose names carry information, the manual version allows a bit more flexibility);
  - specifying idcol or fill; etc.

So there are two questions here:

  1. Is this worth implementing?
  2. What's the desired API for doing so? New named arguments to fread? Just add ... and post-process if the is.dir branch is reached? Separate function call altogether?
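For concreteness, the bells-and-whistles version of the idiom could be sketched as follows. This is illustrative only: `fread_dir` is a hypothetical name, and the `pattern`/`recursive` defaults are assumptions, not a proposed final API.

```r
library(data.table)

# Hypothetical directory wrapper; name and arguments are illustrative only
fread_dir <- function(path, pattern = "\\.csv$", recursive = FALSE, ...) {
  files <- list.files(path, pattern = pattern,
                      full.names = TRUE, recursive = recursive)
  tables <- lapply(files, fread, ...)
  names(tables) <- basename(files)          # automatic source naming
  rbindlist(tables, use.names = TRUE, idcol = "source")
}
```

Extra arguments pass through `...` to each per-file fread call, and the file names become an idcol, which covers the "source naming" case for flat directories.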
@st-pasha
Contributor

My thinking was that whenever fread's input resolves to multiple files, each of the files should be read in turn and the result returned as a list of DataTables. An attribute can be set on each of these DataTables to specify the name of its particular source. If one of the sources cannot be read, it should be represented as an exception object in the list, while the other sources continue to parse (there could be an option to control whether to throw an error immediately, or perhaps to skip the bad files).

However, I don't think that automatically rbinding all of the DataTables into a single result is a good idea. In practice some of them may have irregularities (e.g. different column names), some files may be picked up that are not csv data at all, a field may need to be added based on the name of the source file, etc. Adding options to support all of these intricacies would complicate the interface unnecessarily, would be time-consuming (if you get one of the options wrong, you need to rescan all the files), and would be potentially fragile (new use cases may demand new options). Whereas if you just return a list of DataTables, the user is free to do whatever they want using familiar language constructs.
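A minimal sketch of that behavior, assuming a hypothetical helper (`fread_all` is not a real data.table function): each file is read under tryCatch, failures stay in the list as error objects, and successes carry their source path as an attribute.

```r
library(data.table)

# Hypothetical helper, not part of data.table
fread_all <- function(files, ...) {
  lapply(setNames(files, files), function(f) {
    res <- tryCatch(fread(f, ...), error = identity)
    # failed reads remain in the list as error objects,
    # so the other sources can still be parsed
    if (!inherits(res, "error")) setattr(res, "source", f)
    res
  })
}
```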

@MichaelChirico
Member Author

MichaelChirico commented Jan 23, 2018 via email

@HughParsonage
Member

One use case where I've found R wanting is when the directory contains a very large number of small files (i.e. 100,000 to 1,000,000 files of 1-10 kB). In such cases, fread+rbindlist is not faster than read.csv+rbindlist, and both are orders of magnitude slower than using the command line: copy /b *.csv out.csv. Difficulties arise with the command-line option when the columns are not in the same order (i.e. use.names = TRUE would help) or when column headers are present in each of the files (because concatenation results in a file with 100,000 headers interspersed throughout), but it's still much faster than the R alternatives I know.

@st-pasha
Contributor

Interesting use case. Did you try to investigate where the bottleneck is?
Is it the constant overhead fread spends detecting the format of each file? (Setting some of the parameters explicitly might reduce that time.)
Or is it in the rbinding itself?
Or maybe there is significant overhead from R itself reading the directory?

@HughParsonage
Member

It looks something like this, which is not as bad as I remember. (Matt, can you stop improving the package? It's ruining my anecdotes.)

[System.IO.Directory]::GetFiles("address", "*.*").Count
# 463716
system.time(list.files(path = "address",
                       pattern = "\\.csv$",
                       full.names = TRUE))
# user  system elapsed 
# 7.66    1.32    8.99
Files <- list.files(path = "address",
                    pattern = "\\.csv$",
                    full.names = TRUE)

system.time(lapply(Files[1:100], fread, fill = TRUE))
# user  system elapsed 
# 2.38    0.08    2.59 
system.time(lapply(Files[1:1e2], fread, sep = ",", colClasses = "character", fill = TRUE))
# user  system elapsed 
# 1.58    0.10    1.67 

system.time(lapply(Files[1:1e4], fread, sep = ",", colClasses = "character", fill = TRUE))
# user  system elapsed 
# 6.97    5.00   22.67

@MichaelChirico
Member Author

MichaelChirico commented Jan 23, 2018

@st-pasha I assume it's fread overhead; I seem to recall running a benchmark where read.csv is faster for very small files (like <20 rows).

Also, relating to your first comment: returning the sources as names would highlight the utility of #1948 as well, for manipulating these objects in post-processing.

@franknarf1
Contributor

Another use case I'm running into (.. not sure how different it is from the preceding):

I wrote a helper function to read a csv inside a tar.gz, like

fread("7z -so mycsv.tar.gz | 7z x -si -so -ttar")

but now I have a tar.gz containing multiple csvs (which should have identical column names and classes), and it seems I'll need to go another way (I guess: run the 7z call, then lapply fread over the files it drops, confirm the columns match, then rbindlist).

@MichaelChirico
Member Author

Just to have some thoughts written down:

There's a pretty simple version of this where we just map fread('/path/to/dir', ...) to lapply(list.files('/path/to/dir'), fread, ...).

There's also a substantially more involved version where directory-level fread is all done in C: the preamble work (ncol/nrow/type detection) is done first in a loop over files, then we allocate all the memory once and either (1) fill the table in parallel over files, with nthread=1 within each file, or (2) fill the table serially over files, in parallel within each file.

The first version should definitely use the simple approach, but it almost surely won't be faster (for many use cases) than using the terminal to cat the files to a tempfile() first and reading that. If we understand well when this latter approach is preferred, we might leverage file.append (possibly excluding headers?).
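The file.append idea could be sketched like this (`concat_then_fread` is a hypothetical name; it assumes every file shares one identical header row, and it stages headerless copies rather than shelling out to cat):

```r
# Hypothetical sketch: concatenate at the file level, then parse once.
# Assumes all files share the same single header row.
concat_then_fread <- function(files) {
  out <- tempfile(fileext = ".csv")
  writeLines(readLines(files[1L], n = 1L), out)   # keep the header once
  bodies <- vapply(files, function(f) {
    tmp <- tempfile()
    writeLines(readLines(f)[-1L], tmp)            # drop each per-file header
    tmp
  }, character(1L))
  file.append(out, bodies)
  data.table::fread(out)
}
```

Note this sketch still re-reads each file in R to strip its header, so it demonstrates the interface rather than the performance win; a real implementation would need to skip headers at the byte level.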

@jangorecki
Member

IMO it is not good if fread returns a list rather than a data.table or data.frame, unless we provide an extra argument. I mean that changing "dir" to "dir/file1.csv" should not change the class of the returned object. On the other hand, when providing a non-scalar input c("dir/file1.csv","dir/file2.csv"), it makes sense to return a list of data.tables.
lapply(, fread) already seems quite good for returning a list.
If we want to fread a directory, maybe we could expect all files to have a similar schema, and then an extra argument for how to merge/bind those files could be useful, so it can still return just a data.table.

@MichaelChirico
Member Author

A simple how = c('list', 'rbindlist', 'cbindlist', 'mergelist') (or similar) could be good once #4370 is done
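Dispatch on such a how argument might look roughly like this (`fread_many` and the two implemented options are illustrative only; cbindlist/mergelist would hinge on #4370):

```r
# Illustrative only; fread_many is not a real data.table function
fread_many <- function(files, how = c("list", "rbindlist"), ...) {
  how <- match.arg(how)
  tables <- lapply(files, data.table::fread, ...)
  switch(how,
         list      = tables,
         rbindlist = data.table::rbindlist(tables, use.names = TRUE, fill = TRUE))
}
```

With how = "list" the class of the result changes with the input's cardinality, which is the scalar-vs-vector distinction raised above.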

@MichaelChirico
Member Author

Idle musing: if how='rbindlist', we should probably do something like read the schema from the first file, then supply that as colClasses for the subsequent files, for efficiency. As inspired here
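A sketch of that schema-reuse trick, assuming all files share the first file's layout (`fread_uniform` is a hypothetical name):

```r
library(data.table)

# Hypothetical: detect column types once, then reuse them as colClasses
fread_uniform <- function(files) {
  first <- fread(files[1L])
  classes <- vapply(first, function(col) class(col)[1L], character(1L))
  rest <- lapply(files[-1L], fread, colClasses = classes)
  rbindlist(c(list(first), rest), use.names = TRUE)
}
```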

@jangorecki
Member

Unless fill=TRUE is expected.
