Column Select Helper #4248

Closed
wants to merge 10 commits into from

ColeMiller1
Contributor

@ColeMiller1 ColeMiller1 commented Feb 18, 2020

Closes #4115
Closes #4231.
Towards #852.

33 errors until by is included...

See also @jangorecki's very detailed approach in C in #4174. I see this as a stepping stone towards Jan's C solution as we solve some of the API issues.

The change is mostly internal: the .SDcols evaluation is replaced with a new helper function, with slight refactoring.

Functionality differences:

  1. If .SDcols evaluates to a negative integer such as (-1L) and the integer is within the column range of dt, the set difference is returned (e.g. .SDcols = (-3L) on a 5-column data.table returns c(1L, 2L, 4L, 5L)).
  2. If .SDcols is a logical vector with length greater than 1 but less than the number of columns in dt, a warning is displayed. The long term solution will be for it to throw an error and later on silently fail.
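
A minimal base R sketch of the two behaviours listed above, for illustration only. The name `resolve_cols_sketch` and its warning text are hypothetical and are not the helper added in this PR.

```r
# Hypothetical helper (not the PR's code): resolve an integer or logical
# column selector against a table with `n_cols` columns.
resolve_cols_sketch = function(cols, n_cols) {
  if (is.logical(cols)) {
    if (length(cols) == 1L) {
      # a single TRUE/FALSE is recycled silently
      cols = rep(cols, length.out = n_cols)
    } else if (length(cols) != n_cols) {
      warning("logical selector of length ", length(cols),
              " recycled to ", n_cols, " columns")
      cols = rep(cols, length.out = n_cols)
    }
    return(which(cols))
  }
  if (is.numeric(cols) && length(cols) > 0L && all(cols < 0)) {
    # negative indices inside the column range: return the set difference
    stopifnot(all(abs(cols) <= n_cols))
    return(setdiff(seq_len(n_cols), abs(as.integer(cols))))
  }
  as.integer(cols)
}

resolve_cols_sketch(-3L, 5L)             # 1 2 4 5
resolve_cols_sketch(c(TRUE, FALSE), 5L)  # warns, then returns 1 3 5
```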

Internal Differences

  1. The way 1:3 or V1:V3 is evaluated was changed to be more performant.
x = data.table(V1 = 1, V2 = 2, V3 = 1, V4= 5, V5 = 5)
colsub = quote(V1:V3)
microbenchmark::microbenchmark(
  new_way = {
    rnge = data.table:::chmatch(c(as.character(colsub[[2L]]), as.character(colsub[[3L]])), names(x))
    cols = rnge[1L]:rnge[2L]
  },
  old_way =  eval(colsub, data.table:::setattr(as.list(seq_along(x)), 'names', names(x)), parent.frame())
)
##Unit: microseconds
##    expr  min    lq   mean median    uq   max neval
## new_way 20.8 22.05 31.324   23.0 28.85 145.4   100
## old_way 27.8 28.90 35.059   29.7 30.70 137.2   100

colsub = quote(1:3)
microbenchmark::microbenchmark(
  new_way =  eval(colsub),
  old_way =  eval(colsub, data.table:::setattr(as.list(seq_along(x)), 'names', names(x)), parent.frame())
)
##Unit: microseconds
##    expr  min   lq   mean median    uq  max neval
## new_way 10.6 11.4 12.689   11.8 12.20 67.4   100
## old_way 27.2 28.0 30.421   28.5 29.15 70.9   100
  2. The cool new %iscall% is used less frequently because the parsing is evaluated differently.
  3. The evaluation of patterns does not actually include do_patterns. In the context of .SDcols, j, or by, the argument of patterns(..., cols = SOMETHING) does not make sense. Instead, grep is used directly in the context of names(x).
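
A rough sketch of the "grep directly on names(x)" idea, in base R. The function name `pattern_cols_sketch` is hypothetical, and taking the intersection when several patterns are given is an assumption of this sketch, not something stated in the PR.

```r
# Hypothetical sketch: resolve a quoted patterns(...) call against column
# names with grep(), without going through do_patterns.
pattern_cols_sketch = function(colsub, nms, env = parent.frame()) {
  force(env)
  stopifnot(is.call(colsub), identical(colsub[[1L]], as.name("patterns")))
  # evaluate each argument of patterns(...) to a single regex string
  pats = vapply(as.list(colsub)[-1L], function(p) eval(p, env), character(1L))
  # assumption: several patterns are combined by intersection
  Reduce(intersect, lapply(pats, grep, x = nms))
}

nms = c("V1", "V2", "x1", "x2")  # stand-in for names(x)
pattern_cols_sketch(quote(patterns("^V")), nms)        # 1 2
pattern_cols_sketch(quote(patterns("^V", "1$")), nms)  # 1
```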

Random note

  1. At the end of the tests, there is a new variable DT in the global environment. Both 2036.1 and 2036.2 produce the new variable. The tests are below:
setup = c('DT = data.table(a = 1)')
writeLines(c(setup, 'DT[ , a := 1]'), tmp<-tempfile())
test(2036.1, !any(grepl("1:     1", capture.output(source(tmp, echo = TRUE)), fixed = TRUE)))
## test force-printing still works
writeLines(c(setup, 'DT[ , a := 1][]'), tmp)
test(2036.2, source(tmp, echo = TRUE), output = "1:\\s+1")

To Do:

  • See if any more tests are needed

Long Term:

  • Include by vars and j vars
  • Consolidate other column select which would likely include duplicated, unique, setcolorder, setnames, and probably more.
  • See what vignettes could be updated.

@codecov

codecov bot commented Feb 18, 2020

Codecov Report

Merging #4248 into master will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #4248   +/-   ##
=======================================
  Coverage   99.61%   99.61%           
=======================================
  Files          72       72           
  Lines       13916    13937   +21     
=======================================
+ Hits        13862    13883   +21     
  Misses         54       54           

Last update b1b1832...4961257.

R/utils.R Outdated

if (is.call(colsub)){
# fix for #1216, make sure the parentheses are peeled from expr of the form (((1:4)))
while(colsub %iscall% "(") colsub = as.list(colsub)[[-1L]]
Member

should include { here too maybe?
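
For reference, a small sketch of what peeling both `(` and `{` could look like on a quoted expression. `peel_wrappers` is a hypothetical name, and this version only unwraps `{` when it holds a single expression.

```r
# Hypothetical sketch: strip any nesting of ( ... ) and single-expression
# { ... } wrappers from a quoted expression.
peel_wrappers = function(e) {
  while (is.call(e) && length(e) == 2L &&
         (identical(e[[1L]], as.name("(")) || identical(e[[1L]], as.name("{")))) {
    e = e[[2L]]
  }
  e
}

peel_wrappers(quote((((1:4)))))    # 1:4
peel_wrappers(quote(({(V1:V3)})))  # V1:V3
```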

if (length(colsub) == 3L && colsub[[1L]] == ":") {
if (is.name(colsub[[2L]])){
# cols is of the format a:c
rnge = chmatch(c(as.character(colsub[[2L]]), as.character(colsub[[3L]])), names(x))
Member

what about !is.name(colsub[[3L]])?

Member

exactly, some recent commit in my branch was addressing cases like var1:5 or 1:var1

Contributor Author

Now both need to be names. If either is a character, a new error is raised. Otherwise, it evaluates in the parent frame. I largely followed Jan's work so now V2 -V1 errors as well.
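
One way to picture the rule described in this reply, as a base R sketch. `range_cols_sketch` and its error text are hypothetical, and the fallback order (match column names first, otherwise evaluate in the calling frame) is an assumption rather than the PR's exact logic.

```r
# Hypothetical sketch: resolve a quoted a:b range against column names.
range_cols_sketch = function(colsub, nms, env = parent.frame()) {
  force(env)
  stopifnot(is.call(colsub), length(colsub) == 3L,
            identical(colsub[[1L]], as.name(":")))
  lhs = colsub[[2L]]
  rhs = colsub[[3L]]
  if (is.character(lhs) || is.character(rhs))
    stop("character endpoints are not supported in a range; use names or numbers")
  if (is.name(lhs) && is.name(rhs)) {
    # both endpoints are names: try them as column names first
    idx = match(c(as.character(lhs), as.character(rhs)), nms)
    if (!anyNA(idx)) return(idx[1L]:idx[2L])
  }
  # otherwise (e.g. 1:3 or var1:5) evaluate in the calling frame
  eval(colsub, env)
}

nms = paste0("V", 1:5)  # stand-in for names(x)
range_cols_sketch(quote(V1:V3), nms)  # 1 2 3
range_cols_sketch(quote(1:3), nms)    # 1 2 3
```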

@MichaelChirico
Member

Looks great overall! A much needed refactor

@jangorecki
Member

My general advice would be to split j handling and .SDcols handling into separate functions that eventually share some internal functions.

@jangorecki
Member

jangorecki commented Feb 21, 2020

.SDcols now accepts a single name as an argument such as data.table(V1 = 1)[, .SD, .SDcols = V1]

I would say this is unexpected. Unless of course V1 is a character vector specifying columns, or a numeric, or a logical of length ncol(DT), or a function.

See my inline comment as well.

@jangorecki
Member

jangorecki commented Feb 21, 2020

If .SDcols is TRUE, all columns will be returned which is a step towards group_by_all or just a generic way to select all columns.

IMO .SDcols when being logical should always be equal length to ncol, otherwise raise exception.

See my inline comment as well.

R/utils.R Outdated
Comment on lines 215 to 222
if (is.logical(cols)) {
if ((col_len <- length(cols)) == 1L) {
cols = rep(cols, length.out = x_len)
} else if (col_len != x_len) {
## TODO change to error in 2022
warning(gettextf("When %s is a logical vector, each column should have a TRUE or FALSE entry. The current logical vector of length %d will be repeated to length of data.table. Warning will change to error in the next version.", mode, col_len, domain = "R-data.table"))
cols = rep(cols, length.out = x_len)
}
Contributor Author

RE: recycling logical vectors. Base recycles vectors to the length of the data.frame. See iris[, TRUE] or iris[, c(TRUE, FALSE)]. I guess I'm leaning towards allowing vectors to repeat.

But more broadly, I am interested in some way to select all the variables. colsub = TRUE seemed like a quick and easy way. names(.SD) could work but if this function was ever applied to duplicated or melt or any other functions that allow column selection, I am not sure names(.SD) would make sense. Maybe a made up call all_cols() that we could evaluate?
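
The base R recycling mentioned above can be checked directly on the built-in iris data.frame, which has 5 columns:

```r
# A length-1 logical selects every column of a data.frame
all_cols = names(iris[, TRUE])
all_cols      # all 5 column names

# A shorter logical vector is recycled across the 5 columns: T F T F T
recycled = names(iris[, c(TRUE, FALSE)])
recycled      # "Sepal.Length" "Petal.Length" "Species"
```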

Member

@jangorecki jangorecki Feb 23, 2020

Your use of colsub=TRUE can be achieved by not passing SDcols, or passing names, or rep(TRUE, ncol(DT)) or seq_along. Most of which require you to refer to DT, which is not chaining friendly. You can always use a function (...) TRUE to work around this limitation.
In the issue related to logical vectors in SDcols we discussed recycling, and so far the conclusion was to not recycle.
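
The alternatives listed in this comment, written out as a short example. It assumes a data.table version in which .SDcols accepts a function; the table `DT` here is made up for illustration.

```r
library(data.table)

DT = data.table(a = 1:2, b = 3:4)

r_default = DT[, .SD]                                     # .SDcols not passed
r_names   = DT[, .SD, .SDcols = names(DT)]                # refers to DT
r_logical = DT[, .SD, .SDcols = rep(TRUE, ncol(DT))]      # refers to DT
r_index   = DT[, .SD, .SDcols = seq_along(DT)]            # refers to DT
r_fun     = DT[, .SD, .SDcols = function(...) TRUE]       # chaining friendly
```

Only the first and last forms avoid naming DT inside the call, which is the point about chaining.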

@ColeMiller1 ColeMiller1 removed the WIP label Feb 23, 2020
@ColeMiller1
Contributor Author

I think everything has been addressed. Rolled back the two semi-big changes where .SDcols = TRUE would return all columns and .SDcols = V1 where V1 was a variable name within dt would return the V1 column.

I did not include { brackets. (({(1:3)})) will currently work and (((V1:V3))) will currently work. I am unsure of the use case of ({V1:V3}) but if there is still desire, it would be easy to add.

For additional scrutiny, please take a second look at the warnings / stops.

Thanks for all the comments and time you two spent reviewing.

@ColeMiller1
Contributor Author

Also, I am more than happy to change this to C. Based on my initial benchmarks, Jan's C method is generally faster - I have a "negate" attribute while my timings for Jan's include :::. I think the R could be refactored to improve performance (e.g., if V1:V3 matches, there's no need to do additional checks) but then I would have to repeat similar code multiple times.

Similarly, I could start work on the by or j aspects but that would increase the reach of this PR further than ideal. After merging, I'd start work on by as I think it's a better candidate than j.

# remotes::install_github("Rdatatable/data.table", ref = "colselect")
library(data.table)

x = as.data.table(lapply(1:5, c))

e = quote(1:3)
microbenchmark::microbenchmark(
col_helper(x, e, ".SDcols"),
data.table:::exprCols(x, 1:3, ".SDcols", TRUE, environment())
)
Unit: microseconds
                                                          expr  min    lq   mean median    uq   max neval
                                   col_helper(x, e, ".SDcols") 17.1 18.10 20.966  19.80 20.40 104.9   100
 data.table:::exprCols(x, 1:3, ".SDcols", TRUE, environment()) 16.5 17.45 19.264  19.05 19.85  56.6   100

e = quote(V1:V3)
microbenchmark::microbenchmark(
  col_helper(x, e, ".SDcols"),
  data.table:::exprCols(x, V1:V3, ".SDcols", TRUE, environment())
)
Unit: microseconds
                                                            expr  min   lq   mean median   uq   max neval
                                     col_helper(x, e, ".SDcols") 21.4 22.4 25.375   23.7 24.3 104.9   100
 data.table:::exprCols(x, V1:V3, ".SDcols", TRUE, environment()) 17.2 18.1 20.082   20.4 20.9  56.2   100

cols = c("V1", "V2", "V3")
e = quote(cols) 
microbenchmark::microbenchmark(
  col_helper(x, e, ".SDcols"),
  data.table:::exprCols(x, cols, ".SDcols", TRUE, environment())
)
Unit: microseconds
                                                           expr  min   lq   mean median   uq  max neval
                                    col_helper(x, e, ".SDcols") 11.1 12.0 13.740  13.20 13.9 44.7   100
 data.table:::exprCols(x, cols, ".SDcols", TRUE, environment()) 17.5 18.7 21.313  19.35 20.2 92.0   100

@ColeMiller1 ColeMiller1 reopened this Feb 27, 2020
@jangorecki
Member

jangorecki commented Feb 27, 2020

If j is going to be handled in this PR, then it is probably best to do it now. Once the PR is ready to merge, we need to run revdep checks against this branch. It could also be useful to provide an option to escape it, so users who might be affected by this change can easily turn it off.

@ColeMiller1
Contributor Author

Just saw the edit. I will work on including by. The only way I would start work on j is if you would be cool with a select.j(...) helper function in j. Otherwise, keeping a global option for j wouldn't really get us towards a modular data.table and would only increase the complexity of the code.

@jangorecki
Member

Of course the option would be only for a while, to ensure users are not affected.

@ColeMiller1
Contributor Author

The strangest part of by is that it allows vectors in the parent.frame to be used for grouping. Even stranger is that we also allow these vectors to be names or even arguments in lists.

While I don't mind having a slow deprecation of names being in the parent.frame, it would make this PR easier if we could break the use of names in list referring to variables out of frame. Otherwise, it is a lot of checks for a use case that should be discouraged - we should have never let out-of-frame variables be in the by!

library(data.table)
n = 5L
set.seed(1L) ## seed before any random draws so the example is reproducible
dt = data.table(V1 = rnorm(n))
out_var = sample(n, n, replace = TRUE)

dt[, sum(V1), by = out_var] ##slow deprecation
dt[, sum(V1), by = list(out_var)] ##break

@ColeMiller1
Contributor Author

The silence is deafening for breaking by = list(parent_frame_variable) :). I have fixed it but I still plan on trying to start a slow deprecation process.

I assume changing errors / warnings is OK. I also assume new behavior is OK (e.g., by = is.factor). Is the new naming convention OK, assuming the reverse dependency checks are fine?

DT = data.table(a = 1:10)
DT[ , b := 10:1]
## test 1984.04
## current:
data.table(expression = c(1, 0), V1 = c(6, 5))
##    expression    V1
##         <num> <num>
##1:          1     6
##2:          0     5

## proposed:
DT[ , mean(b), by = eval(expression(a %% 2))]
##       a    V1
##   <num> <num>
##1:     1     6
##2:     0     5

@jangorecki
Member

jangorecki commented Mar 14, 2020

New behaviour, like providing a function to 'by', is better avoided. If no one requested it, then there is not really a need (yet) to have it and maintain it. Changed messages are fine. Changed behaviour is fine as long as there is an issue for it where there is agreement on the change. Changed behaviour that is not really a fix, but a change to the API, has to be optional, like list(parent_scope_var). Then affected users can migrate their code more easily.

@ColeMiller1
Contributor Author

This PR is complete for .SDcols. My goal towards a modular [data.table was to introduce consistency for column selection and reuse code where possible. I am closing as that goal does not seem possible.

Successfully merging this pull request may close these issues.

internal error when specifying .SDcols
.SDcols with logical vector
3 participants