Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New vignette for sim_fixed_n() in parallel #113

Merged
merged 4 commits into from
Sep 25, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions _pkgdown.yml
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ articles:
contents:
- arbitraryhazard
- modestWLRTVignette
- parallel
- pvalue_maxcomboVignette
- simtrialroutines
- workflow
278 changes: 278 additions & 0 deletions vignettes/parallel.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,278 @@
---
title: "Simulating time-to-event trials in parallel"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Simulating time-to-event trials in parallel}
%\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

## Overview

This vignette demonstrates the ability to implement `sim_fixed_n()` using user-defined backends to parallelize simulations. We will consider the backends supported by the [future](https://future.futureverse.org/) framework.

The backends supported by the future package include:

* `sequential` - default and non-parallel backend.
* `multisession` - uses multiple background R sessions on a single machine.
* `multicore` - uses multiple forked R processes on a single non-Windows machine outside of RStudio.
* `cluster` - supports external R sessions across multiple machines.

You can also choose other backend types supported by additional future extension packages, such as the HPC job scheduler backends from future.batchtools.

The function `sim_fixed_n()` provides a simulation workflow for a two-arm trial with a single endpoint.
We can vary the parameters of the trial using different functions outlined in the documentation.
This function now provides users the opportunity to implement their simulations using the previously described parallel backends to accelerate the computation.

## Background

Without specifying a backend, `sim_fixed_n()` will execute sequentially.
The sequential execution will run all `n_sim` iterations within the same process or session of R.
In order to execute in parallel, we must define the environment prior to calling the function.
Setting your seed prior to calling the function will ensure that the results are reproducible.

Suppose you want to investigate the duration of a trial under two possible enrollments strategies.
Both enrollments are piecewise, but have varying durations and rates.

```{r dependencies, message=FALSE, warning=FALSE}
library(simtrial)
library(tibble)
library(future)
library(doFuture)
```

```{r set-plan-sequential, echo=FALSE}
plan("sequential") # Ensure that the backend is sequential
```

```{r enrollments, fig.height=4, fig.width=6, fig.align="center"}
set.seed(1)

n <- 5000
enroll_rate1 <- tibble(rate = c(5, 20, 10), duration = c(100, 150, 150))
enroll_rate2 <- tibble(rate = c(10, 15, 30), duration = c(150, 175, 75))
x1 <- rpw_enroll(n = n, enroll_rate = enroll_rate1)
x2 <- rpw_enroll(n = n, enroll_rate = enroll_rate2)

plot(
x1, 1:n,
type = "l",
col = "blue",
xlim = c(0, max(x1, x2)),
main = "Piecewise enrollments",
xlab = "Time",
ylab = "Enrollment"
)
lines(x2, 1:n, col = "orange")
legend(
250, 1500,
legend = c("Enrollment 1", "Enrollment 2"),
col = c("blue", "orange"),
lty = c(1, 1)
)
```

We see that *Enrollment 2* enrolls individuals more quickly than *Enrollment 1* at onset.
Later, *Enrollment 1* will outpace *Enrollment 2* before eventually being overtaken again.
As such, we want to consider how the duration of the study changes under these enrollments.

## The sequential run

Naively, we can execute these simulations sequentially.
We set the target of a total enrollment of 3000 individuals with the trial ending after observing 700 events.
We use `timing_type = 2` to return the correct trial duration.

```{r confirm-sequential}
set.seed(1)

n_sim <- 200

start_sequential <- proc.time()

seq_result1 <- sim_fixed_n(
n_sim = n_sim,
sample_size = 3000,
target_event = 700,
enroll_rate = enroll_rate1,
timing_type = 2 # Time until targeted event count achieved
)

seq_result2 <- sim_fixed_n(
n_sim = n_sim,
sample_size = 3000,
target_event = 700,
enroll_rate = enroll_rate2,
timing_type = 2 # Time until targeted event count achieved
)

duration_sequential <- proc.time() - start_sequential
```

A message automatically appears in the console that indicates what backend is being used for processing.

The calls to `proc.time()` allow us to evaluate the computation time of these procedures.
This function provides three outputs, we will focus on user and elapsed time.
User time represents the CPU time spent evaluating the function and elapsed time represents the "wall clock" time spent by the end user waiting for the results.

```{r sequential-time}
print(duration_sequential)
```

We can see that the CPU time is `r sprintf("%.2f", duration_sequential[[1]])` and the elapsed time is `r sprintf("%.2f", duration_sequential[[3]])` seconds.
These provide our baseline for the computation time.

As you may have anticipated, we see that for a lower number of events, enrollment 2 has a shorter average duration of
`r sprintf("%.1f", mean(seq_result2$duration))`
over that of enrollment 1, which is
`r sprintf("%.1f", mean(seq_result1$duration))`.

We also see that there is a distinction between the duration of the study under the proposed enrollment strategies.

```{r sequential-display-results, eval=FALSE, echo=FALSE}
seq_result1 %>%
head(5) %>%
kable(digits = 2)
seq_result2 %>%
head(5) %>%
kable(digits = 2)
```

## Setting up a parallel backend

If we instead, wanted to run more simulations for each enrollment, we can expect the time to run our simulations to increase.
As we vary and increase the number of parameter inputs that we consider, we expect the simulation process to continue to increase in duration.
To help combat the growing computational burden, we can run these simulations in parallel using the `multisession` backend available to us in `plan()`.

We can adjust the default number of cores with the function `parallelly::availableCores()`.
The multisession backend will automatically use all available cores by default.
To initialize our backend, we change our plan.

```{r multisession}
plan(multisession, workers = availableCores())
```

## Execution in parallel

Once we have configured the backend details, we can execute the same code as before to automatically distribute the `n_sim` simulations across the available cores.

```{r confirm-multisession}
set.seed(1)

start_sequential <- proc.time()

seq_result1m <- sim_fixed_n(
n_sim = n_sim,
sample_size = 3000,
target_event = 700,
enroll_rate = enroll_rate1,
timing_type = 2 # Time until targeted event count achieved
)

seq_result2m <- sim_fixed_n(
n_sim = n_sim,
sample_size = 3000,
target_event = 700,
enroll_rate = enroll_rate2,
timing_type = 2 # Time until targeted event count achieved
)

duration_sequential <- proc.time() - start_sequential
```

```{r time-parallel}
print(duration_sequential)
```

We can see that the CPU time is `r sprintf("%.2f", duration_sequential[[1]])` and the elapsed time is `r sprintf("%.2f", duration_sequential[[3]])` seconds.
The user time here appears to be drastically reduced because of how R keeps track of time; the time used by the parent process and not the children processes are reported for the user time.
Therefore, we compare the elapsed time to see the real-world impact of the parallelization.

To change the implementation back to a sequential backend, we simply use what is below.

```{r plan-sequential}
plan(sequential)
```

We can also verify that the simulation results are identical because of setting a seed and that the backend type will not affect the results. Below, it is clear that the results from our sequential and multisession backends match completely.

```{r compare-results}
sum(seq_result1 != seq_result1m)
sum(seq_result2 != seq_result2m)
```

*Note:* A parallel implementation may not always be faster than a serial implementation.
If there is substantial overhead associated with executing in parallel, sequential evaluation may be faster.
For a low number of simulations or available cores, it may be preferable to continue computation in serial rather than parallel.
We leave it to the end user to determine this difference based on the resources available to them.

## A nested parallel example

We provide an additional example using a nested parallel structure for users with more extensive resources, such as high-performance computing clusters, available to them.
Because these resources are not commonly available, we will not execute the below code herein.
Consider that you have two accessible nodes, each with three cores (shown in the diagram below).

```{r schema, echo=FALSE, fig.cap="Available resource schematic.", fig.align="center", out.width="90%"}
knitr::include_graphics("schema.png")
```

Ideally, all available resources will be used when executing the simulations.
To do this, we need to correctly define our backend using `plan()` and run the same code as previously.
The different structures, or topologies, for a backend can be changed with a more in depth explanation given in the [future topologies vignette](https://future.futureverse.org/articles/future-3-topologies.html).
Our below example follows closely their example.

In our below snippet, we consider the two nodes named `n1` and `n2` and create a function to select the number of cores to use on those named nodes.
While trivial here, a courteous user of shared machines would specify fewer than all available cores and can do such using a modification of the below code.
We then implement our backend using a list that follows the hierarchy of the available resources.

```{r nested-topology, eval=FALSE}
nodes <- c("n1", "n2")
customCores <- function() {
switch(Sys.info()[["nodename"]],
"n1" = 3L, # Modify here for number of cores on node1
"n2" = 3L, # Modify here for number of cores on node2
## Default:
availableCores()
)
}
plan(list(
tweak(cluster, workers = nodes),
tweak(multisession, workers = customCores)
))
```

The function `tweak()` is necessary to override the inherent protection of nested parallelism, meant to help avoid overloading one's resources by errantly starting too many processes.
Because of the need to tweak backends, the message echoed to the console for nested backends reflects the highest level of the nested hierarchy.

With the backend in place, we then can run the identical code from before using all available resources and return the same results as before.

```{r confirm-cluster, eval=FALSE}
set.seed(1)

enroll_rates <- list(enroll_rate1, enroll_rate2)

seq_resultc <- foreach::foreach(
i = 1:2,
.combine = "list",
.options.future = list(seed = TRUE)
) %dofuture% {
sim_fixed_n(
n_sim = n_sim,
sample_size = 3000,
target_event = 700,
enroll_rate = enroll_rates[[i]],
timing_type = 2 # Time until targeted event count achieved
)
}
```

Then, we reset the `plan` to sequential to avoid accidentally continuing to execute later calls within these resources.

```{r plan-sequential2, eval=FALSE}
plan(sequential)
```
Binary file added vignettes/schema.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.