Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set automatically allocates new column slots if needed #5269

Open
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

ben-schwen
Copy link
Member

@ben-schwen ben-schwen commented Nov 21, 2021

Closes #1831
Closes #4100 (what's left is the duplicate of #496)
Towards #496 establishing consistent behavior between set and :=
Towards #678 closing oldtncol = TRUELENGTH(dt); // TO DO: oldtncol can be just called tl now, as we won't realloc here any more.

@codecov
Copy link

codecov bot commented Nov 21, 2021

Codecov Report

Merging #5269 (f95c1c4) into master (473d8b8) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #5269   +/-   ##
=======================================
  Coverage   99.50%   99.51%           
=======================================
  Files          77       77           
  Lines       14660    14667    +7     
=======================================
+ Hits        14588    14596    +8     
+ Misses         72       71    -1     
Impacted Files Coverage Δ
R/data.table.R 99.89% <100.00%> (+<0.01%) ⬆️
src/assign.c 100.00% <100.00%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 473d8b8...f95c1c4. Read the comment docs.

@mattdowle
Copy link
Member

mattdowle commented Dec 3, 2021

Looks good in that this solution works and it's a good issue to fix. Just wondering about any performance decrease, particularly when updating existing columns by name inside a loop. When set() was enhanced to allow adding new columns, that was done at C level inside Cassign, iirc. I realize the over-allocation might be tricky to achieve at C level though.
First step I guess then is a benchmark of looped set(). When j is integer I wouldn't expect there to be a measurable difference as the extra is.character(j) should be insignificant and the && will short-circuit. Good to confirm though. Then a benchmark of looped set() when j is an existing column name and there are a lot of columns too, say 10,000. In that case chmatch will run once here at R level and then a second time inside Cassign. I'd expect that to be significant, and if so it could be more efficient (I'd guess twice as fast since chmatch would take most of the time) by adding the over-allocation capability at C level in Cassign so that chmatch only runs once.

@mattdowle mattdowle added the WIP label Dec 3, 2021
@ben-schwen
Copy link
Member Author

ben-schwen commented Dec 14, 2021

Benchmark - Updating existing columns

Setup

I used two different installations of R but both with version 4.1.1. And put everything into a script I call twice.

First line is always the benchmark with an integer set, second one with a character set.

Script

library(data.table)
# ensure that if branch hits
options(datatable.alloccol=1024L)
set.seed(373)
DT = as.data.table(matrix(rnorm(1e4*1e4), ncol=1e4))
cols = lapply(1:100, function(i) sample(ncol(DT), 2e3))
colsns = lapply(1:100, function(i) colnames(DT)[sample(ncol(DT), 2e3)])
R.home()
microbenchmark::microbenchmark(times=10,
  for (i in cols) {set(DT, j=i, value=DT$V1)},
  for (i in colsns) {set(DT, j=i, value=DT$V1)}
)

100 set / 2000 cols each

Current (update.dev.pkg())

[1] "/opt/R-4.1.1"
Unit: seconds
                                                    expr      min       lq     mean   median       uq      max neval
   for (i in cols) {     set(DT, j = i, value = DT$V1) } 1.276253 1.279709 1.285705 1.282165 1.285725 1.306588    10
 for (i in colsns) {     set(DT, j = i, value = DT$V1) } 1.301202 1.302665 1.336406 1.305066 1.310549 1.617736    10

Suggested Version

[1] "/usr/lib/R"
Unit: seconds
                                                    expr      min       lq     mean   median       uq      max neval
   for (i in cols) {     set(DT, j = i, value = DT$V1) } 1.356582 1.362484 1.368112 1.365384 1.368992 1.392386    10
 for (i in colsns) {     set(DT, j = i, value = DT$V1) } 1.377221 1.380865 1.415789 1.384860 1.390200 1.700783    10

10000 set / 10 cols each

options(datatable.alloccol=1L) so we run into edgy case

Current (update.dev.pkg())

[1] "/opt/R-4.1.1"
Unit: milliseconds
                                                    expr       min        lq     mean    median        uq       max neval
   for (i in cols) {     set(DT, j = i, value = DT$V1) }  633.9781  638.1574  660.568  643.1093  676.7702  723.2221    10
 for (i in colsns) {     set(DT, j = i, value = DT$V1) } 1673.4164 1691.8153 1746.374 1711.6982 1779.9412 1979.1221    10

Suggested Version

[1] "/usr/lib/R"
Unit: milliseconds
                                                    expr       min        lq     mean    median        uq       max neval
   for (i in cols) {     set(DT, j = i, value = DT$V1) }  667.0488  669.4836  681.223  676.7265  697.8435  700.7359    10
 for (i in colsns) {     set(DT, j = i, value = DT$V1) } 1590.1081 1614.8530 1644.618 1621.1307 1624.6613 1893.9807    10

Adding 10k columns to empty data.table

library(data.table)
options(datatable.alloccol=1e4)
set.seed(373)
cols = paste0("V", 1:1e4)
R.home()
microbenchmark::microbenchmark(times=10,
  { DT = data.table(); for (i in cols) {set(DT, j=i, value=i)} }
)

Current (update.dev.pkg())

[1] "/opt/R-4.1.1"
Unit: milliseconds
                                                                                    expr      min       lq     mean   median       uq      max neval
 {     DT = data.table()     for (i in cols) {         set(DT, j = i, value = i)     } } 152.2915 154.1718 157.0469 154.8296 158.0875 167.0684    10

Suggested Version

[1] "/usr/lib/R"
Unit: milliseconds
                                                                                    expr     min       lq     mean   median       uq      max neval
 {     DT = data.table()     for (i in cols) {         set(DT, j = i, value = i)     } } 186.834 189.5498 192.6968 192.3136 194.6215 201.9198    10

Naturally we loose some ms for the check of is.character(j) and due to the substitute, assign procedure.

R/data.table.R Outdated Show resolved Hide resolved
@ben-schwen
Copy link
Member Author

ben-schwen commented Dec 27, 2021

@mattdowle I tried to move everything down to the C level, hence, we save the 2nd chmatch call and added some benchmarks. We also simplify the Cassign logic here and remove some edgy cases as e.g. with readRDS(), load() or structure()

What is still open is the question of whether we can also save the substitute/assign environment assignment at the R level or move it down to C.

In case this is possible, we should change it also at :=, setDT, etc.

R/data.table.R Outdated
name = substitute(x)
x = .Call(Cassign,x,i,j,NULL,value)
if (is.name(name))
assign(as.character(name),x,parent.frame(),inherits=TRUE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assign() looks pretty complicated to reason about, I would like to see more tests of robustness here (setDT() within a function, places where inherits=TRUE is important, etc.).

At a glance I can't tell why this is needed though -- it definitely warrants a comment explaining why we need to branch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified now the whole code. I also added a comment why we need the assign.

Ultimately it would be nice to let set work also in the cases where it needs to overallocate and is inside a function, howeover, for making this to work we would need to find the right environment where to assign (might not be the first where x is present) and it is not clear whether it would be the last due to scoping

Copy link

github-actions bot commented Sep 11, 2024

Comparison Plot

Generated via commit 26cdf8f

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 3 minutes and 35 seconds

Time taken to run atime::atime_pkg on the tests: 6 minutes and 45 seconds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants