Fix fwrite length for gzip output #6393
* gzip length and crc are manually computed in each thread and then added/combined
* gzip header is minimal
* remove some old debug code
Generated via commit cdf4277. Download link for the artifact containing the test results: atime-results.zip. Time taken to finish the standard R installation steps: 3 minutes and 29 seconds. Time taken to run |
You're right: this PR version stores the length modulo 2**32 as requested, but that is not the real size.
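To make the modulo 2**32 point concrete, here is a minimal sketch (not the code in this PR) of how per-thread lengths could be summed in 64 bits and then truncated when the gzip trailer is written. Per RFC 1952 the trailer is CRC32 followed by ISIZE, both little-endian, and ISIZE is the uncompressed length modulo 2^32. The helper names `total_length` and `put_gzip_trailer` are hypothetical, and the CRC is assumed to have been combined already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add up the uncompressed length of every chunk.
   A 64-bit accumulator keeps the true size even above 4 GiB. */
static uint64_t total_length(const uint64_t *chunk_len, size_t nchunks) {
    uint64_t total = 0;
    for (size_t i = 0; i < nchunks; i++) {
        total += chunk_len[i];
    }
    return total;
}

/* Hypothetical helper: fill the 8-byte gzip trailer (RFC 1952).
   Bytes 0..3 hold CRC32, bytes 4..7 hold ISIZE, least significant byte first.
   ISIZE is the total length modulo 2^32, so it wraps for inputs >= 4 GiB. */
static void put_gzip_trailer(unsigned char out[8], uint32_t crc, uint64_t total_len) {
    assert(out != NULL);
    uint32_t isize = (uint32_t)(total_len & 0xFFFFFFFFu);
    for (int i = 0; i < 4; i++) {
        out[i]     = (unsigned char)((crc   >> (8 * i)) & 0xFFu);
        out[4 + i] = (unsigned char)((isize >> (8 * i)) & 0xFFu);
    }
}
```

With chunk lengths that sum to 2^32 + 5, the stored ISIZE is 5, which is why a tool that only reads the trailer reports a size that is valid for the format but not the real one.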
|
I put PR #5513 into this PR, with the new param compressLevel. |
It definitely is appropriate!
For me, this is the main point of the regression tests (i.e., not strictly aiming to target historical cases, but rather future-proofing the current performance or ensuring things do not become slower in the long run) so not having a 'Slow' version/label at present sounds acceptable |
OK, I think that's where I was stuck -- which labels are appropriate for such a use case. I guess I thought some of them were required, but looking at https:/tdhock/atime/blob/main/man/atime_versions.Rd What labels are best here, then, |
I suppose we'll have to see if this makes it faster or just retains the same performance - if it's the former, then clearly 'Slow'/'Fast', but if it's the latter (which is what we're at right now), then I'd say maybe 'Before' and 'Current'?
Yup 'After' will become redundant soon in context |
Currently, I'm thinking 'Baseline' and 'Post-gzip refactor' |
I documented the current process here, https:/Rdatatable/data.table/wiki/Performance-testing So in this case, since there is no regression, Before/Regression/Fixed should not be used, to avoid confusion with other cases that actually are real historical regressions. I would suggest Fast for the old commit before this PR. (no Slow necessary) |
Sounds reasonable for this PR (to compare 'Fast' and 'HEAD'), but then will we be only running the old commit for the regression test we make for this case after this PR has been merged? |
Thanks Toby, I had looked at the .ci/atime/tests.R script and some {atime} documentation directly and didn't think to check the Wiki. Should we maybe (1) migrate that documentation into .ci/atime directly (2) add .ci/atime/README.md pointing to the Wiki (3) Point to the Wiki from the first line of .ci/atime/tests.R? |
@philippechataignon do you want to have a go at adding an atime performance regression test? Totally fine if not -- what would help at least would be to write a simple benchmark of gzipped fwrite that you think would capture the important pieces of what's changed here, does that make sense? |
Yes, it would be great to point to the Wiki from the first line of .ci/atime/tests.R. I would suggest keeping the docs on the wiki, which is easier to update and can include screenshots/graphics. |
OK for testing regression, but notice that the core of fwrite hasn't changed: same buffer sizes, same number of jobs, same number of rows per job. Personally, I observe timings similar to the previous version. One point of discussion: I notice that #2020 introduced a change that I had never noticed before this PR. By default For testing impact, I have this little program:
With scipen = 0
With scipen = 999
In the last case the real mean line length is ~5000 but it is estimated at 761026. The compression ratio is higher because the buffers are very little used. Surprisingly, the timing is better despite the overhead from the number of OpenMP threads. I use this little bench for the scipen impact and I think it can be used for atime. I've tried to add this:
but I'm not sure that /dev/null is portable, and if we write a real file, the write itself becomes part of the timing. I'm fine with someone else continuing from here and testing that there is no time regression. |
2 and 3 sounds good to me
Should I go ahead and make a PR for this quick addition?
I agree, both for being able to include images and in case we miss out on something that other people notice, they should be able to fill in points quickly |
This only has to run on the GitHub Actions Ubuntu VM, so /dev/null should be OK in principle, but I changed it to tempfile(), which should be fine too. Thanks for sharing your code for scipen benchmarking. I adapted it to get the following atime result, which indicates little to no impact on computation time, but a small constant factor increase in memory usage.
edit.data.table = function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
out.csv <- tempfile()
issue6393 <- atime::atime_versions(
  "~/R/data.table",
  N = 2^seq(1, 20),
  pkg.edit.fun=edit.data.table,
  setup = {
    set.seed(1)
    NC = 10
    L <- data.table(i=1:N)
    L[, paste0("V", 1:NC) := replicate(NC, rnorm(N), simplify=FALSE)]
  },
  expr = {
    data.table::fwrite(L, out.csv, compress="gzip")
  },
  Fast="f339aa64c426a9cd7cf2fcb13d91fc4ed353cd31", # Parent of the first commit https:/Rdatatable/data.table/commit/fcc10d73a20837d0f1ad3278ee9168473afa5ff1 in the PR https:/Rdatatable/data.table/pull/6393/commits with major change to fwrite with gzip.
  PR = "117ab45674f1e56304abca83f9f0df50ab0274be") # Close-to-last merge commit in the PR.
plot(issue6393) |
Co-authored-by: Michael Chirico <[email protected]>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6393      +/-   ##
==========================================
- Coverage   98.62%   98.55%   -0.08%
==========================================
  Files          79       79
  Lines       14448    14503      +55
==========================================
+ Hits        14249    14293      +44
- Misses        199      210      +11
View full report in Codecov by Sentry. |
Closes #6356. Closes #5506.
This PR is an attempt to create a better gzip file with fwrite. It's a substantial rewrite because it includes some refactoring of the existing code.
zlib
C code
#pragma omp parallel for for the chunk loop and #pragma omp ordered for the writing and summarizing part.
malloc calls occur early and there is no need for a header buffer.
=- or =* . A lot of work remains. Use of the indent command?
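As an illustration of the chunk-loop structure described above, here is a minimal sketch, not the fwrite implementation: each iteration fills a private buffer in parallel, and an ordered region appends buffers to the output in chunk order while updating a running total. `process_chunk`, `write_chunks` and `MAX_CHUNK` are hypothetical names; the stand-in just copies bytes where real code would format and compress rows. Note that the loop directive needs the `ordered` clause for the inner `#pragma omp ordered` to be valid. Without OpenMP enabled the pragmas are ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CHUNK 256  /* assumed upper bound on chunk size for this sketch */

/* Hypothetical stand-in for per-chunk work (formatting plus compression):
   here it only copies the input into the thread-private buffer. */
static size_t process_chunk(const char *src, size_t len, char *dst) {
    memcpy(dst, src, len);
    return len;
}

/* Process `n` bytes of `in` in chunks of `chunk` bytes.
   The work runs in parallel; writing and summarizing run in chunk order. */
static uint64_t write_chunks(const char *in, size_t n, size_t chunk, char *out) {
    assert(chunk > 0 && chunk <= MAX_CHUNK);
    long nchunks = (long)((n + chunk - 1) / chunk);
    uint64_t written = 0;  /* shared, but only touched inside the ordered region */

    #pragma omp parallel for ordered schedule(dynamic)
    for (long i = 0; i < nchunks; i++) {
        char buf[MAX_CHUNK];  /* private to each iteration */
        size_t start = (size_t)i * chunk;
        size_t len = (n - start < chunk) ? (n - start) : chunk;
        size_t m = process_chunk(in + start, len, buf);

        #pragma omp ordered
        {
            /* Executed one iteration at a time, in increasing i. */
            memcpy(out + written, buf, m);
            written += m;
        }
    }
    return written;
}
```

Keeping the shared counter inside the ordered region avoids any need for atomics or locks, at the cost of serializing the write step.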