
Jia multi gpu eval #16428

Merged · 20 commits into huggingface:main from jia_multi_gpu_eval · Apr 11, 2022

Conversation

@liyongsea (Contributor) commented Mar 26, 2022

What does this PR do?

This PR provides a script to run the CodeParrot HumanEval benchmark in multi-GPU mode.
The code is an enhanced version of human_eval.py.

Performance data

Performance was checked using the following command:

accelerate launch scripts/human_eval.py --model_ckpt lvwerra/codeparrot-small --do_sample True --temperature 0.2 --top_p 0.95 --n_samples=40 --HF_ALLOW_CODE_EVAL="1" --device_int=0 --num_tasks=8 --batch_size=10

One should obtain something similar to

Results: {'pass@1': 0.1375, 'pass@10': 0.21251641317430792}

Time benchmark

First configure accelerate

accelerate config

Remember to configure more than 1 process if you want to leverage multiple GPUs.
Then run:

accelerate launch scripts/human_eval.py --model_ckpt lvwerra/codeparrot-small --do_sample True --temperature 0.2 --top_p 0.95 --n_samples=200 --HF_ALLOW_CODE_EVAL="1" --batch_size=10

One should obtain something similar to

Results: {'pass@1': 0.03807926829268292, 'pass@10': 0.056158491636524525, 'pass@100': 0.0739473539338153}                                                                                                  

Here is the time performance benchmark on a GPU VM with 4x T4:

| Number of processes | Evaluation time for lvwerra/codeparrot-small |
|---|---|
| 1 | 2:10:00 |
| 4 | 35:00 |

You can also test the large model by setting --model_ckpt lvwerra/codeparrot; you will probably need a smaller batch size (1 usually works).
Warning: when testing the large model, accelerate loads the model on every process, so you need about 60 GB of RAM on your VM. (I am not sure there is an alternative.)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev commented Mar 26, 2022

The documentation is not available anymore as the PR was closed or merged.

@lvwerra self-requested a review on March 28, 2022 08:15
@LysandreJik (Member):

Hey @liyongsea, thanks for your PR! Could you rebase/merge on main so that the CI passes? Thank you!

@lvwerra (Member) commented Mar 28, 2022

There also seems to be a minor issue with the code style. The CI enforces some standards, and you can fix any formatting issues with:

make fixup

For more information on both rebasing/merging and quality checks you can check out the contributing guide.

@lvwerra (Member) left a comment:

Hi @liyongsea

Thank you so much for implementing this. I think this will be a great addition for everybody working on these models as evaluation is definitely a bottleneck at the moment - I am really excited about this feature!

I left a few comments. My main concern is the current distribution of the data to multiple GPUs, which requires the number of generations to be a multiple of the number of GPUs. I think we should remove that constraint for better generalisability. E.g. if we wanted to run a temperature sweep for the large CodeParrot model, I would likely use a machine with 16 GPUs. Another use case is running the AlphaCode evaluation with O(100k-1M) generations, which would likely need even larger infra. Happy to help figure out how to make it work with padding across processes. I left a few suggestions on how it might work.

Another note: I think we should replace human_eval.py with your script, since your script also allows evaluating on a single GPU, and I would like to avoid unnecessary code duplication that would just confuse users. We probably just need to add a note in the README about setting the number of workers.
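
For reference, a minimal sketch of what the padding across processes mentioned above could look like, assuming an existing `accelerator`, `tokenizer`, and a batch of variable-length `generated_tokens` (an illustration, not the PR's actual code):

```python
# Sketch only: variable-length generations must be padded to a common length on
# every process before accelerator.gather() can concatenate them.
generated_tokens = accelerator.pad_across_processes(
    generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
)
generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
```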

@liyongsea (Contributor, Author) commented Mar 28, 2022

> There also seems to be a minor issue with the code style. The CI enforces some standards, and you can fix any formatting issues with `make fixup`. For more information on both rebasing/merging and quality checks you can check out the contributing guide.

Thank you for your feedback @lvwerra, here is what I am going to do:

  • Rebase and fix the formatting issues
  • Use the task id to reorganize the output (see the sketch after this list)
  • Remove padding before generation to avoid the 'n_copies' trick
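
As an illustration of the second point, regrouping gathered generations by task id could look roughly like the following, assuming `generated_tasks` and `generated_tokens` are the gathered arrays (a sketch, not necessarily the PR's exact code):

```python
from collections import defaultdict

# Collect every generated completion under the task id it belongs to, so the
# order in which workers produced them no longer matters.
gen_token_dict = defaultdict(list)
for task_id, tokens in zip(generated_tasks, generated_tokens):
    gen_token_dict[task_id].append(tokens)
```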

@liyongsea force-pushed the jia_multi_gpu_eval branch 2 times, most recently from c24e6ae to 0a3877d on April 1, 2022 at 18:20
@liyongsea (Contributor, Author) commented Apr 2, 2022

Hi @lvwerra,
I finished the implementation. Please start the review. Here are some areas that might need extra attention:

  • I kept the behavior of sending copies of the prompts, so that when generating e.g. 1M predictions per prompt, the same task is also distributed across workers (sketched below).
  • Padding is removed before generation. However, the code still requires n_tasks x n_samples to be a multiple of n_processes, because the length of the dataloader should be a multiple of n_processes (do you know how to easily avoid this?). Do you think that is OK, or shall I change the behavior?
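
To make the first point concrete, here is a rough sketch of the copy strategy; `prompts`, `tokenizer`, `accelerator`, `n_samples`, and `batch_size` are assumed to exist, and this is an illustration rather than the PR's exact code:

```python
from torch.utils.data import DataLoader

# Repeat each prompt n_copies times; once accelerate shards the dataloader,
# the copies of a single task are spread across the available processes.
# Each copy later yields `batch_size` completions via num_return_sequences,
# so n_copies * batch_size == n_samples.
n_copies = n_samples // batch_size
tokenized = [
    {"task_id": task_id, "ids": tokenizer(prompt, return_tensors="pt").input_ids[0]}
    for task_id, prompt in enumerate(prompts)
    for _ in range(n_copies)
]
dataloader = accelerator.prepare(DataLoader(tokenized, batch_size=1))
```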

@lvwerra (Member) left a comment:

Hi @liyongsea

Thanks for making the changes - I think it is much cleaner now! I left a few small comments and nits. The main change is about the tokenization of the prompts which should simplify the code even further.

I think your first point is fine, that shouldn't be an issue. I am sure generation is the current bottleneck and not sending a few copies of the prompts.

Regarding your second point: I don't think this is true - accelerate should be able to handle that! It just means that during the last batch of jobs not all GPUs will be utilized but I think that is fine. Did you run into any problems? If not we could remove the ValueError.

@liyongsea (Contributor, Author):

@lvwerra Thank you for your suggestions, I learnt a lot! They are integrated into the PR; I will test it.

> Regarding your second point: I don't think this is true - accelerate should be able to handle that! It just means that during the last batch of jobs not all GPUs will be utilized but I think that is fine. Did you run into any problems? If not we could remove the ValueError.

I get an error message when doing so; I will dig into it.

for step, batch in tqdm(enumerate(dataloader)):
    with torch.no_grad():
        # do not generate if the batch is empty, it could be the case when multi GPU is used
        if len(batch["ids"]):
@liyongsea (Contributor, Author) commented on the lines above:

@lvwerra by doing this, I removed the division constraint. It surprises me a bit, because I thought I would at least need to assign generated_tokens to an empty tensor. I think this is the last point you need to pay attention to.

@lvwerra (Member) replied:

I am curious - is it actually empty sometimes? I don't remember needing to do any of that in the training script. I think the dataloader would just stop for that worker earlier and never enter the loop in the last iteration.

@liyongsea (Contributor, Author):

Hi @lvwerra, I updated the code according to your suggestions, and I removed the exception regarding the division issue (please see my comment above).
I really appreciate your time. Tell me if you spot anything.

@lvwerra (Member) left a comment:

Awesome, we are close to the finish line. We'll make a test run on our side as well (cc @loubnabnl).

In the meantime, @muellerzr or @sgugger, would you mind having a quick look too? The goal is to move the old evaluation script to accelerate so we can make the time-consuming generations faster with parallelization. In this script we need to generate k code completions for each problem in HumanEval, which are then evaluated at the end (that part is accelerate-agnostic as it runs on the CPU). Let us know if we are using accelerate correctly for that task. Thanks!
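
For background on the evaluation step: pass@k is the unbiased estimator from the Codex paper, which, as far as I understand, is what the code_eval metric computes from the n generations per task. A small sketch of the formula (the function name is illustrative, not the metric's source):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k given n samples per task, of which c pass the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 40 samples per task, 6 of them passing
print(pass_at_k(40, 6, 1))   # ~0.15
print(pass_at_k(40, 6, 10))  # higher, since more draws per task
```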

from arguments import HumanEvalArguments
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    StoppingCriteria,
    StoppingCriteriaList,
    pipeline,
    set_seed,
)
@lvwerra (Member) commented on the lines above:

It just occurred to me that we should probably switch to the set_seed function from accelerate:
https://github.com/huggingface/accelerate/blob/2c554b056cf10f7ae26e5d04cb60bf9dc35812e4/src/accelerate/utils.py#L120-L137
Otherwise what could happen is that each worker has the same seed and thus generates the same sequences even when sampling. We should set device_specific=True such that each worker gets a different seed.
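
A minimal sketch of what that change could look like in the script, assuming the seed comes from an args.seed CLI argument (not the PR's exact diff):

```python
from accelerate.utils import set_seed

# device_specific=True offsets the seed by the process index, so each worker
# samples different completions instead of duplicating the same ones.
set_seed(args.seed, device_specific=True)
```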

@liyongsea (Contributor, Author) replied:

Good idea, I was wondering about the same issue as well. I will change it.

@liyongsea (Contributor, Author):

You can go to the branch empty_tensor; it is the same as this PR but with the if condition and the exception removed.
Then configure accelerate for 4 processes and run:

accelerate launch scripts/human_eval.py --model_ckpt lvwerra/codeparrot-small --do_sample True --temperature 0.2 --top_p 0.95 --n_samples=10 --HF_ALLOW_CODE_EVAL="1" --num_tasks=6 --batch_size=10

You will notice something like

TypeError: only integer tensors of a single element can be converted to an index

where accelerate is sending an empty tensor to some processes during the last iteration.

I will also take a look into accelerate to see if I find anything

@muellerzr (Contributor) left a comment:

It all looks pretty good to me! There was one small nit I found where we could speed things up slightly.

(Also unsure of your empty tensor issue)

Comment on lines 133 to 134
generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
generated_tasks = accelerator.gather(generated_tasks).cpu().numpy()
@muellerzr (Contributor):

You should be able to reduce this to:

generated_tokens, generated_tasks = accelerator.gather((generated_tokens, generated_tasks))
generated_tokens = generated_tokens.cpu().numpy()
generated_tasks = generated_tasks.cpu().numpy()

And quick playing on my own machine reduced this by roughly half (it's in microseconds, but it's a small, quick efficiency boost).

@sgugger (Collaborator):

Oh, interesting that it gives a boost, as it's not doing anything different than gathering the two in a list comprehension.

@muellerzr (Contributor):

Yeah, was surprised as well.

Here's what I toyed with:

from accelerate import Accelerator
import torch

accelerator = Accelerator()
a = torch.tensor([1, 2, 3])
b = torch.tensor([[1], [2], [3]])

And then for timing:

%%timeit -n 10
c, d = accelerator.gather((a, b))
c = c.cpu().numpy()
d = d.cpu().numpy()

%%timeit -n 10
c = accelerator.gather(a).cpu().numpy()
d = accelerator.gather(b).cpu().numpy()

The first on Colab took 25 µs at best, while the latter took 48.9 µs.

@liyongsea (Contributor, Author):

Thank you for sharing! I will do the update

@sgugger (Collaborator) left a comment:

LGTM as long as @lvwerra is happy :-)


@liyongsea (Contributor, Author):

@lvwerra for the empty tensor issue, I found this in the documentation:
https://github.com/huggingface/accelerate/blob/1d95ebdaa40ed42d0bb2b41ba5531b49eb8948b1/src/accelerate/data_loader.py#L204
I am really not sure if it is related. Tomorrow I will make a smaller example to verify this.

@loubnabnl (Contributor):

Hi @liyongsea, thank you for your contributions. I made a test run on my side and the code works fine!
On an NVIDIA Tesla A100 machine with 2 GPUs and the parameters you're using, I get the following results:

| # processes | Time | pass@1 | pass@10 | pass@100 |
|---|---|---|---|---|
| 1 | 3:07:26 | 3.41% | 5.39% | 7.01% |
| 2 | 1:36:34 | 3.51% | 5.63% | 6.67% |

@lvwerra (Member) commented Apr 8, 2022

So as the numbers seem to add up I think we can merge this once the set_seed is added. The check for empty batches is simple enough so we can leave it like that for now in my opinion - no reason to spend much time on that. Thanks again for adding this!

@liyongsea (Contributor, Author) commented Apr 8, 2022

> So as the numbers seem to add up I think we can merge this once the set_seed is added. The check for empty batches is simple enough so we can leave it like that for now in my opinion - no reason to spend much time on that. Thanks again for adding this!

Hi @lvwerra,
I have finished the modifications:

  • accelerate.utils.set_seed is used with device_specific=True
  • I have updated the accelerate version, which solves the empty tensor issue

Feel free to run the whole script again and merge if you are OK with it!

@lvwerra merged commit 4868a83 into huggingface:main on Apr 11, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
* add simple multi gpu complet

* add human_eval_multi_gpu

* use copy strategy to distribute across gpu, to avoid padding

* add doc string

* update code style

* use task id to arrange output

* truncate input to avoid zero pad

* Stop the copy mechanism

* update style

* restore copies to scale better in distributed mode

* update style

* replace human eval

* Apply suggestions from code review

1. Tokenize all input at the same time
2. use attention_mask to get the input length
3. other small fixes

Co-authored-by: Leandro von Werra <[email protected]>

* correct typo and update docstring

* update code style

* remove num sample division constraint

* remove max len calculation

* use accelerator.gather once to speed up

* use accelerate set_seed; update accelerate version

* correct gather bug

Co-authored-by: Leandro von Werra <[email protected]>