set automatically allocates new column slots if needed #5269

ben-schwen · 2021-11-21T10:46:13Z

Closes #1831
Closes #4100 (what's left is the duplicate of #496)
Towards #496 establishing consistent behavior between set and :=
Towards #678 closing oldtncol = TRUELENGTH(dt); // TO DO: oldtncol can be just called tl now, as we won't realloc here any more.

codecov · 2021-11-21T10:58:16Z

Codecov Report

Merging #5269 (f95c1c4) into master (473d8b8) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #5269   +/-   ##
=======================================
  Coverage   99.50%   99.51%           
=======================================
  Files          77       77           
  Lines       14660    14667    +7     
=======================================
+ Hits        14588    14596    +8     
+ Misses         72       71    -1

Impacted Files	Coverage Δ
R/data.table.R	`99.89% <100.00%> (+<0.01%)`	⬆️
src/assign.c	`100.00% <100.00%> (+0.14%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 473d8b8...f95c1c4. Read the comment docs.

mattdowle · 2021-12-03T18:36:57Z

Looks good in that this solution works and it's a good issue to fix. Just wondering about any performance decrease, particularly when updating existing columns by name inside a loop. When set() was enhanced to allow adding new columns, that was done at C level inside Cassign, iirc. I realize the over-allocation might be tricky to achieve at C level though.
First step I guess then is a benchmark of looped set(). When j is integer I wouldn't expect there to be a measurable difference as the extra is.character(j) should be insignificant and the && will short-circuit. Good to confirm though. Then a benchmark of looped set() when j is an existing column name and there are a lot of columns too, say 10,000. In that case chmatch will run once here at R level and then a second time inside Cassign. I'd expect that to be significant, and if so it could be more efficient (I'd guess twice as fast since chmatch would take most of the time) by adding the over-allocation capability at C level in Cassign so that chmatch only runs once.

ben-schwen · 2021-12-14T23:18:03Z

Benchmark - Updating existing columns

Setup

I used two different installations of R but both with version 4.1.1. And put everything into a script I call twice.

First line is always the benchmark with an integer set, second one with a character set.

Script

library(data.table)
# ensure that if branch hits
options(datatable.alloccol=1024L)
set.seed(373)
DT = as.data.table(matrix(rnorm(1e4*1e4), ncol=1e4))
cols = lapply(1:100, function(i) sample(ncol(DT), 2e3))
colsns = lapply(1:100, function(i) colnames(DT)[sample(ncol(DT), 2e3)])
R.home()
microbenchmark::microbenchmark(times=10,
  for (i in cols) {set(DT, j=i, value=DT$V1)},
  for (i in colsns) {set(DT, j=i, value=DT$V1)}
)

100 set / 2000 cols each

Current (`update.dev.pkg()`)

[1] "/opt/R-4.1.1"
Unit: seconds
                                                    expr      min       lq     mean   median       uq      max neval
   for (i in cols) {     set(DT, j = i, value = DT$V1) } 1.276253 1.279709 1.285705 1.282165 1.285725 1.306588    10
 for (i in colsns) {     set(DT, j = i, value = DT$V1) } 1.301202 1.302665 1.336406 1.305066 1.310549 1.617736    10

Suggested Version

[1] "/usr/lib/R"
Unit: seconds
                                                    expr      min       lq     mean   median       uq      max neval
   for (i in cols) {     set(DT, j = i, value = DT$V1) } 1.356582 1.362484 1.368112 1.365384 1.368992 1.392386    10
 for (i in colsns) {     set(DT, j = i, value = DT$V1) } 1.377221 1.380865 1.415789 1.384860 1.390200 1.700783    10

10000 set / 10 cols each

options(datatable.alloccol=1L) so we run into edgy case

Current (`update.dev.pkg()`)

[1] "/opt/R-4.1.1"
Unit: milliseconds
                                                    expr       min        lq     mean    median        uq       max neval
   for (i in cols) {     set(DT, j = i, value = DT$V1) }  633.9781  638.1574  660.568  643.1093  676.7702  723.2221    10
 for (i in colsns) {     set(DT, j = i, value = DT$V1) } 1673.4164 1691.8153 1746.374 1711.6982 1779.9412 1979.1221    10

Suggested Version

[1] "/usr/lib/R"
Unit: milliseconds
                                                    expr       min        lq     mean    median        uq       max neval
   for (i in cols) {     set(DT, j = i, value = DT$V1) }  667.0488  669.4836  681.223  676.7265  697.8435  700.7359    10
 for (i in colsns) {     set(DT, j = i, value = DT$V1) } 1590.1081 1614.8530 1644.618 1621.1307 1624.6613 1893.9807    10

Adding 10k columns to empty data.table

library(data.table)
options(datatable.alloccol=1e4)
set.seed(373)
cols = paste0("V", 1:1e4)
R.home()
microbenchmark::microbenchmark(times=10,
  { DT = data.table(); for (i in cols) {set(DT, j=i, value=i)} }
)

Current (`update.dev.pkg()`)

[1] "/opt/R-4.1.1"
Unit: milliseconds
                                                                                    expr      min       lq     mean   median       uq      max neval
 {     DT = data.table()     for (i in cols) {         set(DT, j = i, value = i)     } } 152.2915 154.1718 157.0469 154.8296 158.0875 167.0684    10

Suggested Version

[1] "/usr/lib/R"
Unit: milliseconds
                                                                                    expr     min       lq     mean   median       uq      max neval
 {     DT = data.table()     for (i in cols) {         set(DT, j = i, value = i)     } } 186.834 189.5498 192.6968 192.3136 194.6215 201.9198    10

Naturally we loose some ms for the check of is.character(j) and due to the substitute, assign procedure.

R/data.table.R

inst/tests/tests.Rraw

ben-schwen · 2021-12-27T13:30:47Z

@mattdowle I tried to move everything down to the C level, hence, we save the 2nd chmatch call and added some benchmarks. We also simplify the Cassign logic here and remove some edgy cases as e.g. with readRDS(), load() or structure()

What is still open is the question of whether we can also save the substitute/assign environment assignment at the R level or move it down to C.

In case this is possible, we should change it also at :=, setDT, etc.

MichaelChirico · 2024-09-09T16:01:23Z

R/data.table.R

+ name = substitute(x)
+ x = .Call(Cassign,x,i,j,NULL,value)
+ if (is.name(name))
+ assign(as.character(name),x,parent.frame(),inherits=TRUE)


This assign() looks pretty complicated to reason about, I would like to see more tests of robustness here (setDT() within a function, places where inherits=TRUE is important, etc.).

At a glance I can't tell why this is needed though -- it definitely warrants a comment explaining why we need to branch.

I simplified now the whole code. I also added a comment why we need the assign.

Ultimately it would be nice to let set work also in the cases where it needs to overallocate and is inside a function, howeover, for making this to work we would need to find the right environment where to assign (might not be the first where x is present) and it is not clear whether it would be the last due to scoping

github-actions · 2024-09-11T09:53:07Z

Generated via commit 26cdf8f

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 3 minutes and 35 seconds

Time taken to run atime::atime_pkg on the tests: 6 minutes and 45 seconds

ben-schwen added 7 commits November 21, 2021 10:37

add re-overallocation to set

53d5160

refine test

a035fbe

avoid double substitute by calling Calloccolwrapper directly

315eb3c

add length check to avoid chmatch

b1db2ab

add news

7146873

alter old test

67a0218

merge master

6b35f8e

ben-schwen added 2 commits November 21, 2021 12:48

add coverage

cda7455

remove old error which is not reachable anymore

b61e7ff

mattdowle added the WIP label Dec 3, 2021

grammar

517c62a

MichaelChirico reviewed Dec 19, 2021

View reviewed changes

R/data.table.R Outdated Show resolved Hide resolved

MichaelChirico reviewed Dec 19, 2021

View reviewed changes

inst/tests/tests.Rraw Outdated Show resolved Hide resolved

ben-schwen added 10 commits December 19, 2021 17:43

add issue number to tests

7469494

remove 2nd eval

e9d0e9c

merge master

ca86485

move set to c level

592853d

proper use of alloccol

86dfc42

update test

3eebecf

rewrite R call

1aa5534

fix tests

249a471

exchange alloccol with shallow

f65a8cc

add test for TRUELENGTH=0 set

f95c1c4

ben-schwen removed the WIP label Dec 27, 2021

ben-schwen mentioned this pull request May 5, 2022

Error on modifying by reference with data.table::set() in the context of future.apply::future_apply() or furrr::future_map() #5376

Open

MichaelChirico reviewed Sep 9, 2024

View reviewed changes

ben-schwen added 2 commits September 11, 2024 11:37

rewrite set

e4521a1

Merge branch 'master' into set_re-overallocate

03833c9

ben-schwen requested a review from HughParsonage as a code owner September 11, 2024 09:42

fix typo

ad8c97d

ben-schwen added 4 commits September 11, 2024 11:55

add tests

6fd0907

change test numbering

4a96f80

add test

074b805

fix inherits

26cdf8f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

set automatically allocates new column slots if needed #5269

set automatically allocates new column slots if needed #5269

ben-schwen commented Nov 21, 2021 •

edited

Loading

codecov bot commented Nov 21, 2021 •

edited

Loading

mattdowle commented Dec 3, 2021 •

edited

Loading

ben-schwen commented Dec 14, 2021 •

edited

Loading

ben-schwen commented Dec 27, 2021 •

edited

Loading

MichaelChirico Sep 9, 2024

ben-schwen Sep 11, 2024

github-actions bot commented Sep 11, 2024 •

edited

Loading

set automatically allocates new column slots if needed #5269

Are you sure you want to change the base?

set automatically allocates new column slots if needed #5269

Conversation

ben-schwen commented Nov 21, 2021 • edited Loading

codecov bot commented Nov 21, 2021 • edited Loading

Codecov Report

mattdowle commented Dec 3, 2021 • edited Loading

ben-schwen commented Dec 14, 2021 • edited Loading

Benchmark - Updating existing columns

Setup

Script

100 set / 2000 cols each

Current (update.dev.pkg())

Suggested Version

10000 set / 10 cols each

Current (update.dev.pkg())

Suggested Version

Adding 10k columns to empty data.table

Current (update.dev.pkg())

Suggested Version

ben-schwen commented Dec 27, 2021 • edited Loading

MichaelChirico Sep 9, 2024

Choose a reason for hiding this comment

ben-schwen Sep 11, 2024

Choose a reason for hiding this comment

github-actions bot commented Sep 11, 2024 • edited Loading

ben-schwen commented Nov 21, 2021 •

edited

Loading

codecov bot commented Nov 21, 2021 •

edited

Loading

mattdowle commented Dec 3, 2021 •

edited

Loading

ben-schwen commented Dec 14, 2021 •

edited

Loading

Current (`update.dev.pkg()`)

Current (`update.dev.pkg()`)

Current (`update.dev.pkg()`)

ben-schwen commented Dec 27, 2021 •

edited

Loading

github-actions bot commented Sep 11, 2024 •

edited

Loading