progress indicator/bar for "by" operations #3060

Closed
eantonya opened this issue Sep 21, 2018 · 13 comments · Fixed by #6228
Labels
feature request, top request

Comments

@eantonya
Contributor

eantonya commented Sep 21, 2018

I frequently run by operations that take minutes and sometimes hours to complete. To deal with the uncertainty of what's going on, I often resort to printing .BY in the i-expression:

dt = data.table(a = 1:5)
dt[, {print(.BY)
      Sys.sleep(10)
      5}
   , by = a]

Would be great to have an automatic progress bar for these, similar to fread, if the expected run time is greater than e.g. 30s. Expected run time can be estimated from actual average time spent running the i-expression for each by item so far * number of remaining by items.
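
The estimate described here can be sketched in a few lines of base R. The function name estimate_remaining and its arguments are hypothetical, chosen only for illustration; nothing like it is exported by data.table.

```r
# Hypothetical helper (not part of data.table): estimate the remaining run time
# as (average time per completed group) * (number of groups still to process).
estimate_remaining <- function(elapsed_secs, groups_done, groups_total) {
  stopifnot(groups_done > 0, groups_done <= groups_total)
  avg_per_group <- elapsed_secs / groups_done
  avg_per_group * (groups_total - groups_done)
}

# Two of five groups finished after 20 seconds: 10 s per group, three groups left.
estimate_remaining(elapsed_secs = 20, groups_done = 2, groups_total = 5)
# [1] 30
```

The estimate is only as good as the assumption that the remaining groups cost about as much as the ones already processed.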

@franknarf1
Contributor

Dupe of closed #1409

I manually add progress-printers when I see an operation going slowly, but it'd be nice if it were automatic whenever time > 30 s as you suggest.

@mattdowle
Member

Related: #3050

@jangorecki
Member

@franknarf1 I agree with Frank, but let's print it only when interactive() is TRUE.

@eantonya
Contributor Author

eantonya commented Sep 25, 2018

I've been thinking about this, and ideally I'd like to see the following information:

  • number of groups processed
  • total number of groups, or groups remaining
  • time elapsed
  • total predicted time, or time remaining
  • maybe current group (but I'm not sure how you'd do this without creating a mess)
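
As a rough sketch, a status line with these fields can be assembled by hand from .GRP and .NGRP, assuming a data.table version recent enough to provide .NGRP. The helper format_status is a hypothetical name for illustration, not a data.table function, and Sys.sleep stands in for real work in j.

```r
library(data.table)

# Hypothetical helper: format one status line from the counters listed above.
format_status <- function(done, total, elapsed_secs) {
  remaining_secs <- elapsed_secs / done * (total - done)
  sprintf("group %d of %d | elapsed %.0fs | est. remaining %.0fs",
          done, total, elapsed_secs, remaining_secs)
}

dt <- data.table(a = 1:5)
start <- proc.time()[["elapsed"]]

res <- dt[, {
  Sys.sleep(0.1)                                # stand-in for slow work in j
  elapsed <- proc.time()[["elapsed"]] - start
  # "\r" rewrites the same console line instead of printing one line per group
  cat("\r", format_status(.GRP, .NGRP, elapsed), sep = "")
  5
}, by = a]
cat("\n")
```

Overwriting a single line with "\r" is one way to avoid the mess that printing the current group on every iteration would create.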

@MichaelChirico
Member

MichaelChirico commented Sep 26, 2018 via email

@Kodiologist

Kodiologist commented Oct 1, 2019

I too would like this as an opt-in feature of data.table. In the meantime, here's a code snippet to manually add a progress bar with a percentage, given a data table d. Replace GROUP_EXPRESSION and J_EXPRESSION to taste.

# One tick per group: count the distinct groups up front.
bar = txtProgressBar(style = 3, min = 0,
    max = nrow(unique(d[, GROUP_EXPRESSION])))
d[, by = GROUP_EXPRESSION,
   {setTxtProgressBar(bar, .GRP)  # .GRP is the index of the current group
    J_EXPRESSION}]
close(bar)

@jangorecki
Member

> if the expected run time is greater than e.g. 30s. Expected run time can be estimated from actual average time spent running the i-expression for each by item so far * number of remaining by items.

i is not run for each item in by. Even if it were, such an estimate would be quite inaccurate because it says nothing about j. To do this automatically, we would have to time the iterations over groups as j is evaluated, and then decide to print once some threshold is passed. Unfortunately, this would impose an overhead, so it may be better to enable this feature explicitly rather than automatically.

@MichaelChirico
Member

The same can be said of the progress bar for fread/fwrite, right? At least w.r.t. the overhead concern.

@Kodiologist

You could use pbapply as a dependency, since it already implements time estimation. In my example above, you'd replace txtProgressBar with startpb and setTxtProgressBar with setpb.

@jangorecki
Member

jangorecki commented Jun 28, 2020

Time estimation, once we start to evaluate j, is not something we should be worried about. My previous comment was about time estimation before we start evaluating j. A recent PR by @MichaelChirico provides an efficient way to know, from inside j, how many groups there are, so .GRP/.NGRP is not a problem. The problem is that to handle this automatically, printing only when the run time exceeds a threshold, we have to accumulate timings on every iteration, and this will add overhead. The R C API alone already imposes quite a big overhead there, so adding this one by default does not look right to me.
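
To make the overhead concern concrete, here is a minimal base R sketch of a reporter that stays silent until a threshold has passed. The names make_reporter and threshold_secs are hypothetical, and this is an R-level illustration only; it does not reflect how data.table handles grouping internally.

```r
# Hypothetical sketch: a closure that is called once per group and only starts
# printing once `threshold_secs` have elapsed since it was created.
make_reporter <- function(total, threshold_secs = 30) {
  start <- proc.time()[["elapsed"]]
  function(done) {
    # The clock is read on every call, printed or not: this is the overhead.
    elapsed <- proc.time()[["elapsed"]] - start
    if (elapsed < threshold_secs) return(invisible(FALSE))
    message(sprintf("%d/%d groups, %.0fs elapsed", done, total, elapsed))
    invisible(TRUE)
  }
}

report <- make_reporter(total = 3, threshold_secs = 0.3)
printed <- logical(3)
for (g in 1:3) {
  Sys.sleep(0.2)             # stand-in for evaluating j on one group
  printed[g] <- report(g)    # TRUE once the threshold has been crossed
}
printed
```

Even when nothing is printed, every group pays for one clock read and one comparison, which is why an explicit opt-in avoids charging that cost to every grouped query.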

@Kodiologist

Having to turn it on explicitly with an argument, like progress = TRUE, makes sense to me.

MichaelChirico added the "top request" label Apr 14, 2024
@joshhwuu
Member

What would be the best way to opt-in to this feature? I was wondering what something like showProgress = TRUE would look like within the square bracket syntax.

@Kodiologist

I imagine a typical call would look like dt[, by = ..., progress = TRUE, ...], where the last ... is j.
