
Support for GPU scheduling with Slurm #4308

Closed
adamnovak opened this issue Jan 5, 2023 · 9 comments

@adamnovak
Member

adamnovak commented Jan 5, 2023

As noted in ComparativeGenomicsToolkit/cactus#887, people want to use Cactus with GPU support on Slurm, but Toil doesn't yet know how to ask for GPUs on Slurm, and we don't have a GPU Slurm cluster to test with yet.

We can probably just try throwing --gres=gpu:<count> into the submission commands, and hope that all Slurm clusters with GPUs use that name. I think they might, because despite the "generic resource" name of GRES, the documentation talks about some pretty tight integration that Slurm has with e.g. NVIDIA's CUDA.

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1257

@oneillkza

It might be worth looking at the discussions the Nextflow folks had around this four years ago: nextflow-io/nextflow#997

@oneillkza

And yep, I think it may just be as simple as throwing --gres=gpu:<count> into the submission commands.

@oneillkza

I'd be happy to give this a try on our cluster -- I have a test run of Cactus all ready to go.

I guess in the meantime I'll have to try submitting to a chunk of a node and having Cactus/Toil use the singleMachine batchSystem.

@oneillkza

I believe @thiagogenez also has a GPU cluster at the EBI, and is interested in this functionality.

@thiagogenez

thiagogenez commented Jan 9, 2023

thanks @oneillkza for letting me know about this issue.

Yes, I'm interested to see this functionality working with Cactus.

So far, I have run Cactus on a Slurm cluster without Toil's scheduling capabilities. I set Toil to use the singleMachine approach and schedule GPU jobs onto GPU nodes using a script acting as an external job scheduler.

@thiagogenez

> And yep, I think it may just be as simple as throwing --gres=gpu:<count> into the submission commands.

Hi @adamnovak
Same here with me. Just adding --gres=gpu:4 to grab a GPU-enabled worker.

Ex: srun --gres=gpu:4 --mem 200gb -t 30 --pty bash

@adamnovak
Member Author

It sounds like there's a lot of appetite to get this working outside UC.

If someone wanted to do a PR for this I could make sure to review it and get it merged.

To implement this, the SlurmBatchSystem would need an implementation of _check_accelerator_request() that overrides the default and rejects (by raising InsufficientSystemResources) any job that requests an accelerator with a type other than "gpu". It would work a lot like the Kubernetes version:

def _check_accelerator_request(self, requirer: Requirer) -> None:
    for accelerator in requirer.accelerators:
        if accelerator['kind'] != 'gpu' and 'model' not in accelerator:
            # We can only provide GPUs or things with a model right now
            raise InsufficientSystemResources(requirer, 'accelerators', details=[
                f'The accelerator {accelerator} could not be provided.',
                'The Toil Kubernetes batch system only knows how to request gpu accelerators or accelerators with a defined model.'
            ])
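
For concreteness, a minimal sketch of what the Slurm version might look like (the names mirror the Kubernetes snippet above; this is an assumption about the eventual implementation, not merged code):

def _check_accelerator_request(self, requirer: Requirer) -> None:
    # Hypothetical Slurm version: --gres=gpu:<count> only covers GPUs,
    # so reject anything else up front.
    for accelerator in requirer.accelerators:
        if accelerator['kind'] != 'gpu':
            raise InsufficientSystemResources(requirer, 'accelerators', details=[
                f'The accelerator {accelerator} could not be provided.',
                'The Toil Slurm batch system only knows how to request gpu accelerators.'
            ])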

Then we'd have to change the SlurmBatchSystem.Worker's prepareSbatch() to take an argument reflecting the number of GPUs to request, and make it generate the --gres flag in the command line it prepares.
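
As a rough illustration of the prepareSbatch() side (the function and argument names here are made up for the sketch, not the real Toil signature):

from typing import List, Optional

def prepare_sbatch_line(cpus: int, mem_mb: int, gpus: Optional[int] = None) -> List[str]:
    # Hypothetical stand-in for SlurmBatchSystem.Worker.prepareSbatch(): build the
    # sbatch argument list and tack on --gres when GPUs were requested.
    args = ['sbatch', f'--cpus-per-task={cpus}', f'--mem={mem_mb}']
    if gpus:
        # Assumes the cluster exposes GPUs under the generic "gpu" GRES name.
        args.append(f'--gres=gpu:{gpus}')
    return args

# prepare_sbatch_line(4, 200000, gpus=2)
# -> ['sbatch', '--cpus-per-task=4', '--mem=200000', '--gres=gpu:2']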

Then we'd need to manage to actually supply that argument to prepareSbatch(). We'd need to thread the argument through prepareSubmission(), and because that is a method from the base AbstractBatchSystem.Worker class, we'd need to change its interface to allow the GPU information to come through it. We'd also need to change the place where prepareSubmission() is called so that it can pass the GPU information through, which means we'd need to extend the tuples we store in the AbstractGridEngineBatchSystem.Worker.waitingJobs list and in the inter-thread newJobs queue that appears at AbstractGridEngineBatchSystem.newJobs and AbstractGridEngineBatchSystem.Worker.newJobs. That could be accomplished by pulling out the right information from jobDesc.accelerators when we put a tuple into the inter-thread queue.
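
The "pulling out the right information" step could be as small as something like this (assuming accelerators are dicts with 'kind' and 'count' keys, as in the check above; the helper name is hypothetical):

from typing import Any, Dict, List

def count_requested_gpus(accelerators: List[Dict[str, Any]]) -> int:
    # Hypothetical helper: total up the GPU counts from jobDesc.accelerators so a
    # single integer can ride along in the tuple pushed onto the newJobs queue.
    return sum(acc.get('count', 1) for acc in accelerators if acc.get('kind') == 'gpu')

# count_requested_gpus([{'kind': 'gpu', 'count': 2}])  -> 2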

Then we'd just need to get other AbstractGridEngineBatchSystem.Worker implementations to tolerate the new argument to their prepareSubmission() implementations, and everything ought to start working.
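
One low-friction way to keep those other batch systems happy would be to give the new parameter a default so non-GPU-aware workers can simply ignore it (sketch only; the surrounding signature is simplified and not the real one):

from typing import List, Optional

class ExampleGridEngineWorker:
    # Hypothetical, simplified worker: prepareSubmission() accepts the new gpus
    # argument, but a batch system with no GPU support just drops it.
    def prepareSubmission(self, cpu: int, memory: int, jobID: int, command: str,
                          jobName: str, gpus: Optional[int] = None) -> List[str]:
        return [command]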

@thiagogenez

thanks @adamnovak

I'm interested in proposing a PR to solve this issue. I'll have a look. Cheers

@adamnovak
Member Author

We fixed this in #4350.
