1.13.0 slow down in a repeated loop on list column #4658

Closed
sandoronodi opened this issue Aug 3, 2020 · 2 comments · Fixed by #4655

@sandoronodi

I have noticed a huge performance drop in data.table loop operations, possibly caused by the upgrade to 1.13.0:

library(data.table)
library(microbenchmark)

dt <- data.table('id'=1:20000,
                 'list_col'=sample(c('', '', 'a', 'a:b', 'a:b:c'), 20000, TRUE))
feature <- 'list_col'

microbenchmark(
  long_dt <- dt[, c("id", feature), with = FALSE][
    , feature_names := {
      x <- get(feature)
      stringr::str_split(x, ':')
    }][
      , .(
        feature_names = paste0(feature, "_", unlist(feature_names))
      )
      , by = "id"]
  , times = 10
  , unit = 'ms'
)

data.table 1.12.8, default settings, using 6 threads:

      min       lq     mean   median       uq      max neval
 122.2447 149.6991 173.3268 183.5777 193.9876 201.7234    10

data.table 1.13.0, default settings, using 6 threads:

      min      lq     mean   median      uq      max neval
 12820.75 12913.1 12989.59 13007.94 13065.1 13097.85    10

Also, I have tried several different thread and throttle combinations, but have seen no improvement at all.
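
A minimal sketch of how those settings can be adjusted with setDTthreads() (the values below are illustrative only, and the throttle argument may not be available in every data.table release):

library(data.table)

getDTthreads(verbose = TRUE)      # report the current thread configuration
setDTthreads(threads = 2)         # try a lower thread count
# setDTthreads(throttle = 65536)  # raise the rows-per-thread throttle, if your version supports it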

@tdeenes
Member

tdeenes commented Aug 3, 2020

Probably a duplicate of #4646; see also #4655.

Notes to @sandoronodi

  • if possible, do not use irrelevant packages in a minimal reproducible example (MRE): base::strsplit is available, so stringr::str_split is unnecessary (and the latter is just a wrapper around stringi::stri_split, so it is better to use that package directly if you need it)
  • split your processing pipeline into minimal steps and profile each step: in this case, dt[, c("id", feature), with = FALSE] just returns the original table, the splitting could be done directly when the example table is created, and the performance degradation occurs in the unlist-by-group step, so only that last step is relevant (see the sketch after this list)
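
To illustrate that last point, here is a minimal sketch that isolates the unlist-by-group step, with the splitting done up front via base::strsplit (the column name split_col is illustrative, not from the original report):

library(data.table)
library(microbenchmark)

dt <- data.table(id = 1:20000,
                 split_col = strsplit(sample(c('', '', 'a', 'a:b', 'a:b:c'), 20000, TRUE), ':'))

# only the unlist-by-group step, which is where the slowdown shows up
microbenchmark(
  dt[, .(feature_names = paste0('list_col_', unlist(split_col))), by = id],
  times = 10, unit = 'ms'
)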

@ColeMiller1
Contributor

@sandoronodi thanks for the report; I can reproduce it. It is closely related to #4646. I tried the #4655 workaround and it appears to address the performance issue. Could you check whether it solves your actual use case as well?

remotes::install_github("Rdatatable/data.table", ref = "extract_performance")

Also, I am not sure this is a more readable approach, but it is a faster way to get the same result. You could probably tweak it further if this is a bottleneck, and it would let you move forward with 1.13.0.

library(data.table)

dt <- data.table('id'=1:20000,
                 'list_col'=sample(c('', '', 'a', 'a:b', 'a:b:c'), 20000, TRUE))
feature <- 'list_col'

dt[, {
    
    x = get(feature)
    l = strsplit(x, ":")
    
    lens = lengths(l)
    lens[lens == 0L] = 1L ## rows with no match should still yield `list_col_` (as in the OP's output), so keep them with length 1
    
    partial_text = paste0(feature, "_")
    
    list(id = rep(id, lens),
         feature_names = unlist(Map(function(y) if (length(y)) paste0(partial_text, y) else partial_text, l), use.names = FALSE)
    )}
]

## A tibble: 2 x 13
##  expression               min  median `itr/sec` mem_alloc
##  <bch:expr>             <bch> <bch:t>     <dbl> <bch:byt>
##1 potential_solution      50ms  56.5ms     18.0     1.33MB
##2 OP_extract_perf_branch 156ms 158.3ms      6.24       2MB
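
For completeness, a sketch of how a comparison like the one above could be run with bench::mark (the expression names, check = FALSE and iterations are assumptions of this sketch; the figures above are from the original comparison against the extract_performance branch, not produced by this code):

library(data.table)
library(bench)

dt <- data.table(id = 1:20000,
                 list_col = sample(c('', '', 'a', 'a:b', 'a:b:c'), 20000, TRUE))
feature <- 'list_col'

bench::mark(
  potential_solution = dt[, {
    x = get(feature)
    l = strsplit(x, ":")
    lens = lengths(l)
    lens[lens == 0L] = 1L
    partial_text = paste0(feature, "_")
    list(id = rep(id, lens),
         feature_names = unlist(Map(function(y) if (length(y)) paste0(partial_text, y) else partial_text, l),
                                use.names = FALSE))
  }],
  OP = dt[, c("id", feature), with = FALSE][
    , feature_names := strsplit(get(feature), ":")][
    , .(feature_names = paste0(feature, "_", unlist(feature_names))), by = "id"],
  check = FALSE,    # the two approaches differ for rows with an empty list_col
  iterations = 10
)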

@mattdowle changed the title from "Performance issues in a repeated loop at v1.30.0" to "1.13.0 slow down in a repeated loop on list column" on Aug 4, 2020
@mattdowle added this to the 1.13.1 milestone on Aug 4, 2020