Date and POSIXct coerced to numeric when calculating median by group #3079

Closed
Henrik-P opened this issue Sep 27, 2018 · 8 comments · Fixed by #3564
Labels
GForce issues relating to optimized grouping calculations (GForce) idate/itime

@Henrik-P

I have data with columns of class Date ('date') and POSIXct ('time'), and a grouping variable 'g':

d <- data.table(
   date = as.Date(c("2018-01-01", "2018-01-03", "2018-01-08",
                 "2018-01-10", "2018-01-25", "2018-01-30")),
   g = rep(letters[1:2], each = 3))
d[ , time := as.POSIXct(date)]

When calculating median of 'date' and 'time' by group, the result is coerced to numeric:

d[ , median(date), by = g]
#    g    V1
# 1: a 17534
# 2: b 17556

d[ , median(time), by = g]
#    g         V1
# 1: a 1514937600
# 2: b 1516838400
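
For readers comparing these numbers with the expected dates: the values appear to be the underlying storage of each class (days since 1970-01-01 for Date, seconds since 1970-01-01 UTC for POSIXct) with the class attribute dropped. A minimal base R sketch, assuming that interpretation, converts them back; the variable names are illustrative only.

```r
# Numeric results from the grouped medians above
date_num <- c(17534, 17556)            # assumed: days since 1970-01-01
time_num <- c(1514937600, 1516838400)  # assumed: seconds since 1970-01-01 UTC

# Re-attach the classes by converting from the epoch
dates <- as.Date(date_num, origin = "1970-01-01")
times <- as.POSIXct(time_num, origin = "1970-01-01", tz = "UTC")

format(dates)
# [1] "2018-01-03" "2018-01-25"
format(times, "%Y-%m-%d %H:%M:%S", tz = "UTC")
# [1] "2018-01-03 00:00:00" "2018-01-25 00:00:00"
```

The recovered dates match the per-group medians that aggregate() reports further down, which suggests the median values themselves are right and only the class is lost.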

However, 'date' and 'time' are not coerced when calculating the median without grouping:

d[ , median(date)]
# [1] "2018-01-09"

d[ , median(time)]
# [1] "2018-01-09 01:00:00 CET"

Other things I've tried which don't coerce:

Mean 'date' and 'time' by group:

d[ , mean(date), by = g]
#    g         V1
# 1: a 2018-01-04
# 2: b 2018-01-21

d[ , mean(time), by = g]
#    g                  V1
# 1: a 2018-01-04 01:00:00
# 2: b 2018-01-21 17:00:00

Median 'date' and 'time' by group using aggregate:

aggregate(date ~ g, data = d, median)
#   g       date
# 1 a 2018-01-03
# 2 b 2018-01-25

aggregate(time ~ g, data = d, median)
#   g                time
# 1 a 2018-01-03 01:00:00
# 2 b 2018-01-25 01:00:00

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
data.table_1.11.6

@MichaelChirico
Member

MichaelChirico commented Sep 27, 2018 via email

@Henrik-P
Author

Thanks Michael. Here we go:

d[ , median(date), by = g, verbose = TRUE]

Detected that j uses these columns: date
Finding groups using forderv ... 0.000sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
lapply optimization is on, j unchanged as 'median(date)'
GForce optimized j to 'gmedian(date)'
Making each group and running j (GForce TRUE) ... 0.020sec
   g    V1
1: a 17534
2: b 17556


d[ , median(time), by = g, verbose = TRUE]

Detected that j uses these columns: time
Finding groups using forderv ... 0.000sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
lapply optimization is on, j unchanged as 'median(time)'
GForce optimized j to 'gmedian(time)'
Making each group and running j (GForce TRUE) ... 0.000sec
   g         V1
1: a 1514937600
2: b 1516838400

@MichaelChirico
Member

Thanks... I suspected GForce was at the root here, but I'm still surprised that mean works (verbose = TRUE shows GForce is still activated for those)....

For now, if you're just interested in moving on, you can temporarily disable GForce:

old = getOption('datatable.optimize')
options(datatable.optimize = 1L)
d[ , median(time), by = g]
#    g                  V1
# 1: a 2018-01-03 08:00:00
# 2: b 2018-01-25 08:00:00
options(datatable.optimize = old)
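
If the option needs to be toggled in several places, the same workaround can be wrapped so that the old setting is restored even when the query errors. This is only a sketch: `with_optimize` is a hypothetical helper, not part of data.table, and it relies on the same `datatable.optimize = 1L` setting used above.

```r
library(data.table)

# Hypothetical helper (not part of data.table): evaluate `expr` with
# datatable.optimize set to `level`, then restore the previous value.
with_optimize <- function(level, expr) {
  old <- options(datatable.optimize = level)
  on.exit(options(old), add = TRUE)  # restore even if `expr` fails
  force(expr)  # `expr` is a lazy promise, so it is evaluated only here
}

d <- data.table(
  date = as.Date(c("2018-01-01", "2018-01-03", "2018-01-08",
                   "2018-01-10", "2018-01-25", "2018-01-30")),
  g = rep(letters[1:2], each = 3))

# GForce is off for this one query, so base median() runs per group
res <- with_optimize(1L, d[ , median(date), by = g])
class(res$V1)
```

Because the query is passed as an unevaluated argument, it runs only after the option has been changed, and `on.exit` puts the previous value back afterwards.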

@Henrik-P
Author

Thanks a lot for your rapid response. I think this may be a regression - as far as I recall this used to work in earlier versions.

@MichaelChirico
Member

It certainly sounds familiar... there may be an outstanding issue...

@MichaelChirico
Member

Very similar to #1876

@franknarf1
Contributor

franknarf1 commented Sep 27, 2018

Another bug, maybe related -- gmedian is coercing integers to reals, in contrast with base median:

> d[, idate := as.IDate(date)]
> d[, dput(median(idate)), by=g]
structure(17534L, class = c("IDate", "Date"))
structure(17556L, class = c("IDate", "Date"))
   g         V1
1: a 2018-01-03
2: b 2018-01-25
> d[, median(idate), by=g][, dput(V1)]
c(17534, 17556)
[1] 17534 17556

I guess one could argue that the base median behavior is wrong (since the return type is unpredictable).
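
To illustrate the point about the return type, here is a small base R sketch (no data.table involved; variable names are illustrative). For integer input, base median returns the middle element unchanged when the length is odd, but takes the mean of the two middle elements when the length is even, which yields a double.

```r
odd  <- median(c(1L, 3L, 5L))      # odd length: the middle element is returned as-is
even <- median(c(1L, 3L, 5L, 7L))  # even length: mean of the two middle elements

typeof(odd)
# [1] "integer"
typeof(even)
# [1] "double"
```

So with base median the storage type of a grouped result can depend on group sizes, whereas always returning double gives a predictable column type.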

@jangorecki
Member

Very rarely I'm against consistency with base R but in this particular case of gmedian I prefer to have double returned always.
