Jia multi gpu eval #16428
Conversation
The documentation is not available anymore as the PR was closed or merged.
Hey @liyongsea, thanks for your PR! Could you rebase/merge on the upstream branch?
There also seems to be a minor issue with the code style. The CI enforces some standards, and you can fix any formatting issues with `make fixup`. For more information on both rebasing/merging and quality checks you can check out the contributing guide.
Hi @liyongsea
Thank you so much for implementing this. I think this will be a great addition for everybody working on these models, as evaluation is definitely a bottleneck at the moment - I am really excited about this feature!
I left a few comments. My main concern is the current distribution of the data to multiple GPUs, which requires the number of generations to be a multiple of the number of GPUs. I think we should remove that constraint for better generalisability. E.g. if we would like to run a temperature sweep for the large CodeParrot model, I would likely use a machine with 16 GPUs. Another use case is running AlphaCode evaluation with O(100k-1M) generations, which we would likely need to run on even larger infra. Happy to help figure out how to make it work with the padding across processes - I left a few suggestions for how it might work (see the sketch below for the general idea).
Another note: I think we should replace human_eval.py with your script, as your script also allows evaluating on a single GPU, and I would like to avoid unnecessary code duplication, which would just make things more confusing for the user. We probably just need to add a note in the README instructions about setting the number of workers.
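For readers following the thread, here is a minimal sketch of the "copy strategy" being discussed: each prompt is repeated several times so the work can be sharded across processes without a divisibility constraint on the number of GPUs. The class name, dictionary keys, and the `n_copies` default are illustrative assumptions, not the PR's exact code.

```python
import torch
from torch.utils.data import IterableDataset


class TokenizedPromptDataset(IterableDataset):
    """Yields each tokenized prompt n_copies times so accelerate can shard the stream."""

    def __init__(self, tokenizer, prompts, n_copies=4):
        self.tokenizer = tokenizer
        self.prompts = prompts
        self.n_copies = n_copies

    def __iter__(self):
        # tokenize all prompts at once; attention_mask gives each prompt's true length
        outputs = self.tokenizer(self.prompts, padding=True, return_tensors="pt")
        for task_id in range(len(self.prompts)):
            for _ in range(self.n_copies):
                yield {
                    "ids": outputs.input_ids[task_id],
                    "task_id": torch.tensor(task_id),
                    "input_len": outputs.attention_mask[task_id].sum(),
                }
```

With this layout the total number of samples is simply `len(prompts) * n_copies`, and the prepared DataLoader hands different slices of that stream to each process.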
Thank you for your feedback @lvwerra, here is what I am going to do:
Force-pushed from c24e6ae to 0a3877d.
Hi @lvwerra
Hi @liyongsea
Thanks for making the changes - I think it is much cleaner now! I left a few small comments and nits. The main change is about the tokenization of the prompts which should simplify the code even further.
I think your first point is fine, that shouldn't be an issue. I am sure generation is the current bottleneck and not sending a few copies of the prompts.
Regarding your second point: I don't think this is true - `accelerate` should be able to handle that! It just means that during the last batch of jobs not all GPUs will be utilized, but I think that is fine. Did you run into any problems? If not, we could remove the `ValueError`.
@lvwerra Thank you for your suggestions, I learnt a lot! It is integrated into the PR; I will test it.
I get an error message when doing so; I will dig into it.
Force-pushed from 2f19e79 to ec365ed.
for step, batch in tqdm(enumerate(dataloader)):
    with torch.no_grad():
        # do not generate if the batch is empty, it could be the case when multi GPU is used
        if len(batch["ids"]):
@lvwerra by doing this, I removed the division constraint. It surprises me a bit, because I thought I would at least need to assign `generated_tokens` to an empty tensor. I think this is the last point that you need to pay attention to.
I am curious - is it actually empty sometimes? I don't remember needing to do any of that in the training script. I think the dataloader would just stop for that worker earlier and never enter the loop in the last iteration.
Hi @lvwerra, I updated the code according to your suggestion, and I removed the exception regarding the division issue (please see my comment above).
Awesome, we are close to the finish line. We'll make a test run on our side as well (cc @loubnabnl).
In the meantime, @muellerzr or @sgugger, would you mind having a quick look, too? The goal is to move the old evaluation script to `accelerate` so we can make the time-consuming generations faster with parallelization. In this script we need to generate k code completions for each problem in HumanEval, which are then evaluated at the end (that part is `accelerate`-agnostic as it runs on the CPU). Let us know if we are using `accelerate` correctly for that task. Thanks!
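To make the requested review easier to follow, here is a rough, self-contained sketch of the pattern in question: generate on each process, pad across processes, then gather everything for the CPU-side evaluation. The prompts, model checkpoint, and generation arguments below are placeholders rather than the script's exact values.

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()
tokenizer = AutoTokenizer.from_pretrained("lvwerra/codeparrot-small")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("lvwerra/codeparrot-small")

# stand-ins for HumanEval prompts; each is repeated n_copies times (the copy strategy above)
prompts = ["def add(a, b):", "def fibonacci(n):"]
n_copies = 2
dataset = [
    {"ids": tokenizer(p, return_tensors="pt").input_ids[0], "task_id": torch.tensor(i)}
    for i, p in enumerate(prompts)
    for _ in range(n_copies)
]
dataloader = DataLoader(dataset, batch_size=1)

model, dataloader = accelerator.prepare(model, dataloader)

gen_token_dict = {}  # task_id -> list of generated token sequences
for batch in dataloader:
    with torch.no_grad():
        generated = accelerator.unwrap_model(model).generate(
            input_ids=batch["ids"],
            do_sample=True,
            max_new_tokens=32,
            pad_token_id=tokenizer.eos_token_id,
        )
        # pad to a common length so tensors from every process can be gathered together
        generated = accelerator.pad_across_processes(
            generated, dim=1, pad_index=tokenizer.pad_token_id
        )
        generated, tasks = accelerator.gather((generated, batch["task_id"]))
        for task, tokens in zip(tasks.cpu().numpy(), generated.cpu().numpy()):
            gen_token_dict.setdefault(int(task), []).append(tokens)

# the pass@k evaluation of the decoded completions then runs on the CPU, outside accelerate
```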
from arguments import HumanEvalArguments
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    StoppingCriteria,
    StoppingCriteriaList,
    pipeline,
    set_seed,
It just occurred to me that we should probably switch to the `set_seed` function from `accelerate`:
https://github.com/huggingface/accelerate/blob/2c554b056cf10f7ae26e5d04cb60bf9dc35812e4/src/accelerate/utils.py#L120-L137
Otherwise what could happen is that each worker has the same seed and thus generates the same sequences even when sampling. We should set `device_specific=True` such that each worker gets a different seed.
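As a small illustration of the suggestion (assuming the `set_seed(seed, device_specific=...)` signature shown at the linked lines), the change would look roughly like this; the literal seed value is a stand-in for the one from `HumanEvalArguments`.

```python
from accelerate import Accelerator
from accelerate.utils import set_seed

accelerator = Accelerator()
# with device_specific=True each process offsets the seed by its process index,
# so sampled generations differ across workers instead of being identical copies
set_seed(0, device_specific=True)
```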
Good idea, I was wondering about the same issue as well. I will change it.
You can go to the branch empty_tensor; it is the same as this PR, but with the if condition and the exception removed.
You will notice something like
where `accelerate` is sending an empty tensor to some processes during the last iteration. I will also take a look into `accelerate` to see if I find anything.
It all looks pretty good to me! There was one small nit I found where we could speed things up slightly.
(I'm also unsure about your empty tensor issue.)
generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
generated_tasks = accelerator.gather(generated_tasks).cpu().numpy()
You should be able to reduce this to:
generated_tokens, generated_tasks = accelerator.gather((generated_tokens, generated_tasks))
generated_tokens = generated_tokens.cpu().numpy()
generated_tasks = generated_tasks.cpu().numpy()
And quick playing on my own machine reduced this by roughly half. (it's in microseconds, but it's a small quick efficiency boost)
Oh, interesting that it gives a boost, as it's not doing anything different than gathering the two in a list comprehension.
Yeah, was surprised as well.
Here's what I toyed with:
from accelerate import Accelerator
import torch
accelerator = Accelerator()
a = torch.tensor([1,2,3])
b = torch.tensor([[1],[2],[3]])
And then for timing:
%%timeit -n 10
c,d = accelerator.gather((a,b))
c = c.cpu().numpy()
d = d.cpu().numpy()
%%timeit -n 10
c = accelerator.gather(a).cpu().numpy()
d = accelerator.gather(b).cpu().numpy()
The first on Colab took 25µs at best, while the latter took 48.9µs.
Thank you for sharing! I will do the update
LGTM as long as @lvwerra is happy :-)
@lvwerra for the empty tensor issue, I found this in the documentation:
Force-pushed from 6ba3cb4 to 843612b.
Hi @liyongsea, thank you for your contributions. I made a test run on my side and the code works fine!
So as the numbers seem to add up, I think we can merge this once the
1. Tokenize all input at the same time
2. use attention_mask to get the input length
3. other small fixes
Co-authored-by: Leandro von Werra <[email protected]>
Force-pushed from 843612b to bad2991.
Hi @lvwerra, feel free to run the whole script again and merge if you are OK with it!
* add simple multi gpu complet
* add human_eval_multi_gpu
* use copy strategy to distribute across gpu, to avoid padding
* add doc string
* update code style
* use task id to arrange output
* truncate input to avoid zero pad
* Stop the copy mechanism
* update style
* restore copies to scale better in distributed mode
* update style
* replace human eval
* Apply suggestions from code review
  1. Tokenize all input at the same time
  2. use attention_mask to get the input length
  3. other small fixes
  Co-authored-by: Leandro von Werra <[email protected]>
* correct typo and update docstring
* update code style
* remove num sample division constraint
* remove max len calculation
* use accelerator.gather once to speed up
* use accelerate set_seed; update accelerate version
* correct gather bug

Co-authored-by: Leandro von Werra <[email protected]>
What does this PR do?
This PR provides a script to run the CodeParrot HumanEval evaluation in multi-GPU mode.
The code is an enhanced version of human_eval.py.
Performance data
The performance is checked using the following command:
One should obtain something similar to:
Time benchmark
First configure accelerate.
Remember to set more than 1 process if you want to leverage multiple GPUs.
Then run:
One should obtain something similar to:
Here is the time performance benchmark on a GPU VM with 4xT4.
You could also test the large model by setting `--model_ckpt lvwerra/codeparrot`; you will probably need to set a small batch size (1 is usually suitable). Warning: when testing with the large model, accelerate loads the model in every process, so you would need about 60 GB of RAM on your VM. (I am not sure there is an alternative.)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.