
Delete rows by reference #635

Open
arunsrinivasan opened this issue Jun 8, 2014 · 28 comments
Labels
feature request · top request (One of our most-requested issues)

Comments

@arunsrinivasan
Member

Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link

Since deleting 1 column is DT[,colname:=NULL], and deleting rows is the same as deleting all columns for those rows, and we wish to use hierarchical indexes to find the rows to delete by reference, we just need an LHS to indicate "all" columns, leading to:

 DT[i,.:=NULL]   # delete rows by reference

 DT[,.:=NULL]    # error("must specify i to delete rows. To delete all rows from a table use DT[TRUE,.:=NULL], or DT=DT[0]. This is deliberately a little harder, to avoid accidents such as 'delete from table', a common accident in SQL.")

We can also add a "read only" or "protect" attribute to a data.table; if the user has protected the data.table in that way, .:= would not work on it.
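For context, a minimal sketch contrasting today's copy-based idiom with the proposed (not implemented) syntax; the table and column names here are invented for illustration:

```r
library(data.table)

DT <- data.table(a = 1:5, b = letters[1:5])

# Today's idiom: subsetting allocates a new table and rebinds the name,
# briefly holding both copies in memory.
DT <- DT[a != 3]

# The proposal above (not implemented) would delete matching rows in place:
# DT[a == 3, . := NULL]
```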


@zx8754

zx8754 commented Oct 1, 2015

Subset data table without using <-

@mattdowle
Member

http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-r-data-table

@mattdowle
Member

Just deleting by reference is not that hard. The benefit would mainly be memory efficiency rather than speed.

@mattdowle
Member

How about adding both
delete(DT, b>=8 | a<=3)
and
DT[b>=8 | a<=3, .ROW:=NULL]
The advantage of the latter would be combining with other features of [] such as row numbers in i, joins in i, and roll, all benefiting from [i,j,by] optimization.
As per : http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-r-data-table/10791729?noredirect=1#comment54633906_10791729

@mattdowle
Member

More advanced example :

DT[ b>=8, .SD[1, .ROW:=NULL], by=group]
# remove by reference the 1st observation in each group within a subset

Is .ROW the right name for this new symbol?

@mattdowle changed the title from "[R-Forge #2092] Delete rows by reference" to "Delete rows by reference" Oct 29, 2015
@eantonya
Contributor

Re right name: doesn't .SD already carry the right meaning for that (instead of introducing a new name a la .ROW)?

@franknarf1
Contributor

I think syntax for selecting rows to keep (which just deletes their complement) would be convenient.

delete(DT, b >= 8 | a <= 3) # or
keep(  DT, b <  8 & a >  3)

I don't know that there's a sensible way to extend this logic to work inside j. I'd just as well have Matt's second example only work via

badrows = DT[b >= 8, .I[1], by=g]$V1
delete(DT, badrows)

Just as new columns cannot be created by set (last I checked), it could be that row modifications cannot be done inside [.data.table.
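To make the `.I` idiom above concrete, a minimal runnable sketch (table and column names invented for illustration):

```r
library(data.table)

DT <- data.table(g = c(1, 1, 2, 2), b = c(9, 1, 8, 2))

# Original-row numbers of the first observation per group, within the subset:
badrows <- DT[b >= 8, .I[1], by = g]$V1   # c(1L, 3L) here

# Until a by-reference delete() exists, remove them with a copy-based subset:
DT <- DT[-badrows]
```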

@andrewrech

andrewrech commented Aug 23, 2016

If anyone needs a quick-and-dirty solution, as I did, here is a memory-efficient function that copies the kept rows column by column and then deletes each original column by reference, based on an SO answer by vc273.

## ---- Deleting rows by reference using data.table*
## ---- *not exactly!

library(data.table)

# Example DT: 100 integer columns, 1e6 rows
DT = data.table(col1 = 1:1e6)
cols = paste0('col', 2:100)
for (col in cols) set(DT, j = col, value = 1:1e6)
keep.idxs = sample(1e6, 9e4, FALSE) # keep 9% of rows

delete <- function(DT, keep.idxs){
  cols <- copy(names(DT))
  DT_subset <- data.table(DT[[1]][keep.idxs])  # one-column table to grow
  setnames(DT_subset, cols[1])
  for (col in cols){
    DT_subset[, (col) := DT[[col]][keep.idxs]] # copy kept rows of this column
    set(DT, NULL, col, NULL)                   # then drop it from DT, freeing memory
  }
  return(DT_subset)
}

str(delete(DT, keep.idxs))
str(DT)

@vinhdizzo

@andrewrech I can't get your code to work. I'm on the dev version of data.table, and when I run your code, I end up with an empty data.table:

> dim(d1)
[1] 0 0

@jarppiko

jarppiko commented Nov 18, 2016

To complement @andrewrech's answer, here is the code as a function together with an example of its usage.

delete <- function(DT, del.idxs) {            # pls note 'del.idxs' vs. 'keep.idxs'
  keep.idxs <- setdiff(DT[, .I], del.idxs)    # row indexes to keep
  cols = names(DT)
  DT.subset <- data.table(DT[[1]][keep.idxs]) # this is the subsetted table
  setnames(DT.subset, cols[1])
  for (col in cols[2:length(cols)]) {
    DT.subset[, (col) := DT[[col]][keep.idxs]]
    DT[, (col) := NULL]  # delete
  }
  return(DT.subset)
}

And an example of its usage:

dat <- delete(dat, del.idxs)

Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.

> dim(dat)
[1] 1419393      25
> system.time(dat <- delete(dat,del.idxs))
   user  system elapsed 
   0.23    0.02    0.25 
> dim(dat)
[1] 1404715      25
> 

This is my very first GitHub post, btw.

@skanskan

This comment has been minimized.

@vikram-rawat

This comment has been minimized.

@MiloParigi

What kind of work needs to be done to add this functionality to data.table? I'd be glad to help, but I'm not totally sure where to start!

The delete function could be added using @jarno-p's answer and later modified to be more efficient and to work with [] references, don't you think?

@MichaelChirico
Member

I think the open question is the best API. data.table-like syntax would suggest the following should "work":

DT[rows_to_delete := NULL]

The functional approach of @jarno-p would be a change from this, where row deletion would become functional & require DT <- f(DT) constructions. This may be best since := usages are truly by reference, whereas row deletions as exemplified thus far are only fast (compared to full copies), and not truly by reference.

@jarppiko

jarppiko commented Jan 9, 2018

Although I am hardly qualified to comment, should the syntax, from the user's perspective, be more like:

DT[ i , .SR := NULL ]

where "i" is a DT-expression selecting rows? .SR would be similar to .SD, except that it is always defined within DT and refers to all the rows selected by i. But such an approach may add overhead to expressions that do not intend to delete rows.

An alternative is to change the behavior of .SD so that it is also defined when no by-expression is used; without "by", .SD would then refer to whole rows instead (currently .SD excludes grouping columns).

@matthiaskaeding

An approach to bypass X <- f(X) might be to find out the name of X via deparse + substitute and then use the assign function, e.g. like this (adjusting the function of @jarno-p):

del_rows <- function(X, delete) {
  keep <- -delete                       # negative indices: rows to keep
  name_of_X <- deparse(substitute(X))
  X_names <- copy(names(X))
  X_new <- X[keep, X_names[1L], with = FALSE]
  set(X, i = NULL, j = 1L, value = NULL)  # drop the copied column from X

  for (j in seq_len(ncol(X))) {
    # X's first column is now original column j+1
    set(X_new, i = NULL, j = X_names[1L + j], value = X[[1L]][keep])
    set(X, i = NULL, j = 1L, value = NULL)
  }
  assign(name_of_X, value = X_new, envir = .GlobalEnv)
}

You would need to find out the environment of X for general cases.
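One way to handle the general case is to assign into the caller's frame rather than hard-coding .GlobalEnv. A minimal sketch (function and variable names invented for illustration, using a plain copy-based subset for brevity):

```r
library(data.table)

# Sketch only: write the result back into the caller's frame via parent.frame().
del_rows2 <- function(X, delete) {
  name_of_X <- deparse(substitute(X))
  X_new <- X[-delete]                             # copy-based subset
  assign(name_of_X, X_new, envir = parent.frame())
  invisible(X_new)
}

d <- data.table(a = 1:5)
del_rows2(d, 2L)
d$a   # 1 3 4 5
```

This covers only the immediate caller; nested calls and promises would still need care, as noted above.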


@UweBlock
Contributor

UweBlock commented Aug 6, 2019

There is an interesting question on SO:

Subsetting a large vector uses unnecessarily large amounts of memory

Not directly related to data.table but a potential use case for deleting rows by reference.


@jangorecki
Member

I provided one design idea to address this issue in #4345 (comment).
Are there any other ideas? If not, it should be safe to start working on an implementation of that idea.

@jangorecki
Member

jangorecki commented Apr 7, 2020

Proof of concept based on #4345 (comment)

setsubset = function(x, i) {
  stopifnot(is.data.table(x), is.integer(i))
  if (!length(i)) return(x)
  if (anyNA(i) || anyDuplicated(i) || any(i < 1L) || any(i > nrow(x)) || is.unsorted(i))
    stop("i must be non-NA, without duplicates, within 1:nrow(x), and sorted")
  drop = setdiff(seq_len(nrow(x)), i)
  last_ii = drop[1L] - 1L
  do_i = i[i > last_ii]
  for (ii in do_i) {
    last_ii = last_ii + 1L
    set(x, last_ii, names(x), as.list(x[ii]))
  }
  ## we need to set true length here but this needs C
  invisible(x)
}

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(1:2, 6:7)
address(x)
sapply(x, address)
setsubset(x, i)
address(x)
sapply(x, address)
all.equal(x[seq_along(i)], X[i])

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(3L, 5L, 7L)
address(x)
sapply(x, address)
setsubset(x, i)
address(x)
sapply(x, address)
all.equal(x[seq_along(i)], X[i])

@MichaelChirico added the "top request" label and removed the "High" label Jun 7, 2020
@jangorecki
Member

Working example using the setsubset branch:

library(data.table)
setsubset = data.table:::setsubset

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(1:2, 6:7)
x
#       a     b
#   <int> <int>
#1:     1     8
#2:     2     7
#3:     3     6
#4:     4     5
#5:     5     4
#6:     6     3
#7:     7     2
#8:     8     1
mem = c(address(x), sapply(x, address))
setsubset(x, i)
x
#       a     b
#   <int> <int>
#1:     1     8
#2:     2     7
#3:     6     3
#4:     7     2
all.equal(x, X[i])
#[1] TRUE
all.equal(c(address(x), sapply(x, address)), mem)
#[1] TRUE

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(3L, 5L, 7L)
x
#       a     b
#   <int> <int>
#1:     1     8
#2:     2     7
#3:     3     6
#4:     4     5
#5:     5     4
#6:     6     3
#7:     7     2
#8:     8     1
mem = c(address(x), sapply(x, address))
setsubset(x, i)
x
#       a     b
#   <int> <int>
#1:     3     6
#2:     5     4
#3:     7     2
all.equal(x, X[i])
#[1] TRUE
all.equal(c(address(x), sapply(x, address)), mem)
#[1] TRUE

@franknarf1
Contributor

Cool!

I'm wondering if the set loop can be avoided along these lines:

x[i, .keep := TRUE]
setorder(x, .keep, na.last=TRUE)
# set truelength, drop .keep

Or if setorder isn't reliably order-preserving (the documentation doesn't advertise it):

x[, .old_index := .I]
x[i, .keep := TRUE]
setorder(x, .keep, .old_index, na.last=TRUE)
# set truelength, drop .keep, .old_index

It looks like memory addresses survive these operations (?).

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(1:2, 6:7)

mem = c(address(x), sapply(x, address))
x[i, .keep := TRUE]
setorder(x, .keep, na.last=TRUE)
x[, .keep := NULL]
all.equal(x[seq_along(i)], X[i])
all.equal(c(address(x), sapply(x, address)), mem)

If the current approach is necessary, I think you could swap...

  drop = setdiff(seq_len(nrow(x)), i)
  last_ii = drop[1L]-1L

for something like

  first_drop = match(FALSE, seq_along(i) == i, nomatch = tail(i, 1L)+1L)
  last_ii = first_drop - 1L

Why? Speed (avoiding setdiff and scaling with nrow(x)), and handling the edge case where all rows are kept.

@jangorecki
Member

Wonderful idea!
Would you like to take over the branch?
Ultimately we want this functionality to be inside [ rather than just a new function...


@jangorecki
Member

In case of updates, you will see comments here or, if one exists, in a linked PR.

BTW, I have the impression that many readers misunderstand the benefits of this function. As it has to be implemented now, it will most likely be slower than making a copy. It will not have to allocate memory for an extra copy of the data.table, but that allocated memory is released right after assignment, so this really matters only when you run into an OOM error that could have been avoided if your data were half the size.
