
Delete rows by reference #635

Open
arunsrinivasan opened this issue Jun 8, 2014 · 28 comments
Labels
feature request · top request (One of our most-requested issues)

Comments

@arunsrinivasan
Member

Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link

Since deleting 1 column is DT[,colname:=NULL], and deleting rows is the same as deleting all columns for those rows, and we wish to use hierarchical indexes to find the rows to delete by reference, we just need an LHS to indicate "all" columns, leading to:

 DT[i,.:=NULL]   # delete rows by reference

 DT[,.:=NULL]    # error("must specify i to delete rows. To delete all rows from a table use DT[TRUE,.:=NULL], or DT=DT[0]. This is deliberately a little harder, to avoid accidents such as 'delete from table', a common accident in SQL.")

We can also add a "read only" or "protect" attribute to a data.table; if the user has protected the data.table in that way, .:= would not work on it.
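For context, a minimal sketch contrasting today's copy-based idiom with the proposed (not implemented) syntax; the table and column names here are invented for illustration:

```r
library(data.table)

DT <- data.table(a = 1:5, b = letters[1:5])

# Today's idiom: subsetting allocates a new table and rebinds the name,
# briefly holding both copies in memory.
DT <- DT[a != 3]

# The proposal above (not implemented) would delete matching rows in place:
# DT[a == 3, . := NULL]
```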


@zx8754

zx8754 commented Oct 1, 2015

Subset data table without using <-

@mattdowle
Member

http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-r-data-table

@mattdowle
Member

Just deleting by reference is not that hard. The benefit would mainly be memory efficiency rather than speed.

@mattdowle
Member

How about adding both
delete(DT, b>=8 | a<=3)
and
DT[b>=8 | a<=3, .ROW:=NULL]
The advantage of the latter would be combining with other features of [] such as row numbers in i, joins in i, and roll, all benefiting from [i,j,by] optimization.
As per : http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-r-data-table/10791729?noredirect=1#comment54633906_10791729

@mattdowle
Member

More advanced example :

DT[ b>=8, .SD[1, .ROW:=NULL], by=group]
# remove by reference the 1st observation in each group within a subset

Is .ROW the right name for this new symbol?

@mattdowle changed the title from "[R-Forge #2092] Delete rows by reference" to "Delete rows by reference" Oct 29, 2015
@eantonya
Contributor

Re right name: doesn't .SD already carry the right meaning for that (instead of introducing a new name a la .ROW)?

@franknarf1
Contributor

I think syntax for selecting rows to keep (which just deletes their complement) would be convenient.

delete(DT, b >= 8 | a <= 3) # or
keep(  DT, b <  8 & a >  3)

I don't know that there's a sensible way to extend this logic to work inside j. I'd just as well have Matt's second example only work via

badrows = DT[b >= 8, .I[1], by=g]$V1
delete(DT, badrows)

Just as new columns cannot be created by set (last I checked), it could be that row modifications cannot be done inside [.data.table.
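To make the `.I` idiom above concrete, a minimal runnable sketch (table and column names invented for illustration):

```r
library(data.table)

DT <- data.table(g = c(1, 1, 2, 2), b = c(9, 1, 8, 2))

# Original-row numbers of the first observation per group, within the subset:
badrows <- DT[b >= 8, .I[1], by = g]$V1   # c(1L, 3L) here

# Until a by-reference delete() exists, remove them with a copy-based subset:
DT <- DT[-badrows]
```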

@andrewrech

andrewrech commented Aug 23, 2016

If anyone needs a quick-and-dirty solution, as I did, here is a memory-efficient function that copies the kept rows column by column and then deletes each original column by reference, based on an SO answer by vc273.

## ---- Deleting rows by reference using data.table*
## ---- *not exactly!

library(data.table)

# Example DT: 100 integer columns, 1e6 rows
DT = data.table(col1 = 1:1e6)
cols = paste0('col', 2:100)
for (col in cols) set(DT, j = col, value = 1:1e6)
keep.idxs = sample(1e6, 9e4, FALSE) # keep 9% of rows

delete <- function(DT, keep.idxs){
  cols <- copy(names(DT))
  DT_subset <- data.table(DT[[1]][keep.idxs])  # one-column table to grow
  setnames(DT_subset, cols[1])
  for (col in cols){
    DT_subset[, (col) := DT[[col]][keep.idxs]] # copy kept rows of this column
    set(DT, NULL, col, NULL)                   # then drop it from DT, freeing memory
  }
  return(DT_subset)
}

str(delete(DT, keep.idxs))
str(DT)

@vinhdizzo

@andrewrech I can't get your code to work. I'm on the dev version of data.table, and when I run your code, I end up with an empty data.table:

> dim(d1)
[1] 0 0

@jarppiko

jarppiko commented Nov 18, 2016

To complement @andrewrech's answer, here is the code as a function together with an example of its usage.

delete <- function(DT, del.idxs) {            # pls note 'del.idxs' vs. 'keep.idxs'
  keep.idxs <- setdiff(DT[, .I], del.idxs)    # row indexes to keep
  cols = names(DT)
  DT.subset <- data.table(DT[[1]][keep.idxs]) # this is the subsetted table
  setnames(DT.subset, cols[1])
  for (col in cols[2:length(cols)]) {
    DT.subset[, (col) := DT[[col]][keep.idxs]]
    DT[, (col) := NULL]  # delete
  }
  return(DT.subset)
}

And an example of its usage:

dat <- delete(dat, del.idxs)

Where "dat" is a data.table. Removing 14k rows from 1.4M rows takes 0.25 sec on my laptop.

> dim(dat)
[1] 1419393      25
> system.time(dat <- delete(dat,del.idxs))
   user  system elapsed 
   0.23    0.02    0.25 
> dim(dat)
[1] 1404715      25
> 

This is my very first GitHub post, btw.

@skanskan

This comment has been minimized.

@vikram-rawat

This comment has been minimized.

@MiloParigi

What kind of work needs to be done to add this functionality to data.table? I'd be glad to help, but I'm not totally sure where to start!

The delete function could be added using @jarno-p's answer and later modified to be more efficient and to work with [] references, don't you think?

@MichaelChirico
Member

I think the open question is the best API. data.table-like syntax would suggest the following should "work":

DT[rows_to_delete := NULL]

The functional approach of @jarno-p would be a change from this, where row deletion would become functional & require DT <- f(DT) constructions. This may be best since := usages are truly by reference, whereas row deletions as exemplified thus far are only fast (compared to full copies), and not truly by reference.

@jarppiko

jarppiko commented Jan 9, 2018

Although I am hardly qualified to comment, should the syntax, from the user's perspective, be more like:

DT[ i , .SR := NULL ]

where "i" is a DT-expression selecting rows? .SR would be similar to .SD, except that it is always defined within DT and refers to all the rows selected by i. But such an approach may add overhead to expressions that do not intend to delete rows.

An alternative is to change the behavior of .SD so that it is also defined when no by-expression is used; without "by", .SD would then refer to whole rows instead (currently .SD excludes grouping columns).

@matthiaskaeding

An approach to bypass X <- f(X) might be to find out the name of X via deparse + substitute and then use the assign function, e.g. like this (adjusting the function of @jarno-p):

del_rows <- function(X, delete) {
  keep <- -delete                       # negative indices: rows to keep
  name_of_X <- deparse(substitute(X))
  X_names <- copy(names(X))
  X_new <- X[keep, X_names[1L], with = FALSE]
  set(X, i = NULL, j = 1L, value = NULL)  # drop the copied column from X

  for (j in seq_len(ncol(X))) {
    # X's first column is now original column j+1
    set(X_new, i = NULL, j = X_names[1L + j], value = X[[1L]][keep])
    set(X, i = NULL, j = 1L, value = NULL)
  }
  assign(name_of_X, value = X_new, envir = .GlobalEnv)
}

You would need to find out the environment of X for general cases.
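One way to handle the general case is to assign into the caller's frame rather than hard-coding .GlobalEnv. A minimal sketch (function and variable names invented for illustration, using a plain copy-based subset for brevity):

```r
library(data.table)

# Sketch only: write the result back into the caller's frame via parent.frame().
del_rows2 <- function(X, delete) {
  name_of_X <- deparse(substitute(X))
  X_new <- X[-delete]                             # copy-based subset
  assign(name_of_X, X_new, envir = parent.frame())
  invisible(X_new)
}

d <- data.table(a = 1:5)
del_rows2(d, 2L)
d$a   # 1 3 4 5
```

This covers only the immediate caller; nested calls and promises would still need care, as noted above.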


@UweBlock
Contributor

UweBlock commented Aug 6, 2019

There is an interesting question on SO:

Subsetting a large vector uses unnecessarily large amounts of memory

Not directly related to data.table but a potential use case for deleting rows by reference.


@jangorecki
Member

I provided one design idea to address this issue in #4345 (comment).
Are there any other ideas? If not, it should be safe to start working on an implementation of that idea.

@jangorecki
Member

jangorecki commented Apr 7, 2020

Proof of concept based on #4345 (comment)

setsubset = function(x, i) {
  stopifnot(is.data.table(x), is.integer(i))
  if (!length(i)) return(x)
  if (anyNA(i) || anyDuplicated(i) || any(i < 1L) || any(i > nrow(x)) || is.unsorted(i))
    stop("i must be non-NA, without duplicates, within 1:nrow(x), and sorted")
  drop = setdiff(seq_len(nrow(x)), i)
  last_ii = drop[1L] - 1L
  do_i = i[i > last_ii]
  for (ii in do_i) {
    last_ii = last_ii + 1L
    set(x, last_ii, names(x), as.list(x[ii]))
  }
  ## we need to set true length here but this needs C
  invisible(x)
}

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(1:2, 6:7)
address(x)
sapply(x, address)
setsubset(x, i)
address(x)
sapply(x, address)
all.equal(x[seq_along(i)], X[i])

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(3L, 5L, 7L)
address(x)
sapply(x, address)
setsubset(x, i)
address(x)
sapply(x, address)
all.equal(x[seq_along(i)], X[i])

@MichaelChirico added the "top request" label and removed the "High" label Jun 7, 2020
@jangorecki
Member

Working example using the setsubset branch:

library(data.table)
setsubset = data.table:::setsubset

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(1:2, 6:7)
x
#       a     b
#   <int> <int>
#1:     1     8
#2:     2     7
#3:     3     6
#4:     4     5
#5:     5     4
#6:     6     3
#7:     7     2
#8:     8     1
mem = c(address(x), sapply(x, address))
setsubset(x, i)
x
#       a     b
#   <int> <int>
#1:     1     8
#2:     2     7
#3:     6     3
#4:     7     2
all.equal(x, X[i])
#[1] TRUE
all.equal(c(address(x), sapply(x, address)), mem)
#[1] TRUE

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(3L, 5L, 7L)
x
#       a     b
#   <int> <int>
#1:     1     8
#2:     2     7
#3:     3     6
#4:     4     5
#5:     5     4
#6:     6     3
#7:     7     2
#8:     8     1
mem = c(address(x), sapply(x, address))
setsubset(x, i)
x
#       a     b
#   <int> <int>
#1:     3     6
#2:     5     4
#3:     7     2
all.equal(x, X[i])
#[1] TRUE
all.equal(c(address(x), sapply(x, address)), mem)
#[1] TRUE

@franknarf1
Contributor

Cool!

I'm wondering if the set loop can be avoided along these lines:

x[i, .keep := TRUE]
setorder(x, .keep, na.last=TRUE)
# set truelength, drop .keep

Or if setorder isn't reliably order-preserving (the documentation doesn't advertise it):

x[, .old_index := .I]
x[i, .keep := TRUE]
setorder(x, .keep, .old_index, na.last=TRUE)
# set truelength, drop .keep, .old_index

It looks like memory addresses survive these operations (?).

x = data.table(a = 1:8, b = 8:1)
X = copy(x)
i = c(1:2, 6:7)

mem = c(address(x), sapply(x, address))
x[i, .keep := TRUE]
setorder(x, .keep, na.last=TRUE)
x[, .keep := NULL]
all.equal(x[seq_along(i)], X[i])
all.equal(c(address(x), sapply(x, address)), mem)

If the current approach is necessary, I think you could swap...

  drop = setdiff(seq_len(nrow(x)), i)
  last_ii = drop[1L]-1L

for something like

  first_drop = match(FALSE, seq_along(i) == i, nomatch = tail(i, 1L)+1L)
  last_ii = first_drop - 1L

Why? Speed (avoiding setdiff and scaling with nrow(x)), and handling the edge case where all rows are kept.

@jangorecki
Member

Wonderful idea!
Would you like to take over the branch?
Ultimately we want this functionality to be inside [ rather than just a new function...


@jangorecki
Member

In case of updates, you will see comments here or, if one exists, in a linked PR.

BTW, I have the impression that many readers misunderstand the benefits of this function. As it has to be implemented now, it will most likely be slower than making a copy. It will not have to allocate memory for an extra copy of the data.table, but that allocated memory is released right after assignment, so this really matters only when you run into an OOM error that could have been avoided if your data were half the size.
