
Jia multi gpu eval #16428

Merged · 20 commits into huggingface:main from jia_multi_gpu_eval · Apr 11, 2022

Conversation

@liyongsea (Contributor) commented Mar 26, 2022

What does this PR do?

This PR provides a script to run the CodeParrot HumanEval benchmark in multi-GPU mode.
The code is an enhanced version of human_eval.py.

Performance data

Performance was checked using the following command:

accelerate launch scripts/human_eval.py --model_ckpt lvwerra/codeparrot-small --do_sample True --temperature 0.2 --top_p 0.95 --n_samples=40 --HF_ALLOW_CODE_EVAL="1" --device_int=0 --num_tasks=8 --batch_size=10

One should obtain something similar to

Results: {'pass@1': 0.1375, 'pass@10': 0.21251641317430792}

Time benchmark

First configure accelerate

accelerate config

Remember to configure more than 1 process if you want to leverage multiple GPUs.
Then run:

accelerate launch scripts/human_eval.py --model_ckpt lvwerra/codeparrot-small --do_sample True --temperature 0.2 --top_p 0.95 --n_samples=200 --HF_ALLOW_CODE_EVAL="1" --batch_size=10

One should obtain something similar to

Results: {'pass@1': 0.03807926829268292, 'pass@10': 0.056158491636524525, 'pass@100': 0.0739473539338153}                                                                                                  

Here is the time performance benchmark on a GPU VM with 4x T4:

| Number of processes | Evaluation time for lvwerra/codeparrot-small |
|---|---|
| 1 | 2:10:00 |
| 4 | 35:00 |

You can also test the large model by setting --model_ckpt lvwerra/codeparrot; you will probably need a smaller batch size (1 usually works).
Warning: when testing the large model, accelerate loads the model on every process, so you need about 60 GB of RAM on your VM. (I am not sure there is an alternative.)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev commented Mar 26, 2022

The documentation is not available anymore as the PR was closed or merged.

@lvwerra self-requested a review on March 28, 2022 08:15
@LysandreJik (Member):

Hey @liyongsea, thanks for your PR! Could you rebase/merge on main so that the CI passes? Thank you!

@lvwerra (Member) commented Mar 28, 2022

There also seems to be a minor issue with the code style. The CI enforces some standards, and you can fix any formatting issues with:

make fixup

For more information on both rebasing/merging and quality checks you can check out the contributing guide.

@lvwerra (Member) left a comment:

Hi @liyongsea

Thank you so much for implementing this. I think this will be a great addition for everybody working on these models as evaluation is definitely a bottleneck at the moment - I am really excited about this feature!

I left a few comments. My main concern is the current distribution of the data to multiple GPUs, which requires the number of generations to be a multiple of the number of GPUs. I think we should remove that constraint for better generalisability. E.g. if we wanted to run a temperature sweep for the large CodeParrot model, I would likely use a machine with 16 GPUs. Another use case is running the AlphaCode evaluation with O(100k-1M) generations, which would likely need even larger infra. Happy to help figure out how to make it work with padding across processes. I left a few suggestions on how it might work.

Another note: I think we should replace human_eval.py with your script, since your script also allows evaluating on a single GPU, and I would like to avoid unnecessary code duplication that would just confuse users. We probably just need to add a note in the README about setting the number of workers.
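
For reference, a minimal sketch of what the padding across processes mentioned above could look like, assuming an existing `accelerator`, `tokenizer`, and a batch of variable-length `generated_tokens` (an illustration, not the PR's actual code):

```python
# Sketch only: variable-length generations must be padded to a common length on
# every process before accelerator.gather() can concatenate them.
generated_tokens = accelerator.pad_across_processes(
    generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
)
generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
```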

@liyongsea (Contributor, Author) commented Mar 28, 2022

> There also seems to be a minor issue with the code style. The CI enforces some standards, and you can fix any formatting issues with `make fixup`. For more information on both rebasing/merging and quality checks you can check out the contributing guide.

Thank you for your feedback @lvwerra, here is what I am going to do:

  • Rebase and fix the formatting issues
  • Use the task id to reorganize the output (see the sketch after this list)
  • Remove padding before generation to avoid the 'n_copies' trick
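
As an illustration of the second point, regrouping gathered generations by task id could look roughly like the following, assuming `generated_tasks` and `generated_tokens` are the gathered arrays (a sketch, not necessarily the PR's exact code):

```python
from collections import defaultdict

# Collect every generated completion under the task id it belongs to, so the
# order in which workers produced them no longer matters.
gen_token_dict = defaultdict(list)
for task_id, tokens in zip(generated_tasks, generated_tokens):
    gen_token_dict[task_id].append(tokens)
```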

@liyongsea force-pushed the jia_multi_gpu_eval branch 2 times, most recently from c24e6ae to 0a3877d on April 1, 2022 at 18:20
@liyongsea (Contributor, Author) commented Apr 2, 2022

Hi @lvwerra,
I finished the implementation. Please start the review. Here are some areas that might need extra attention:

  • I kept the behavior of sending copies of the prompts, so that when generating e.g. 1M predictions per prompt, the same task is also distributed across workers (sketched below).
  • Padding is removed before generation. However, the code still requires n_tasks x n_samples to be a multiple of n_processes, because the length of the dataloader should be a multiple of n_processes (do you know how to easily avoid this?). Do you think that is OK, or shall I change the behavior?
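
To make the first point concrete, here is a rough sketch of the copy strategy; `prompts`, `tokenizer`, `accelerator`, `n_samples`, and `batch_size` are assumed to exist, and this is an illustration rather than the PR's exact code:

```python
from torch.utils.data import DataLoader

# Repeat each prompt n_copies times; once accelerate shards the dataloader,
# the copies of a single task are spread across the available processes.
# Each copy later yields `batch_size` completions via num_return_sequences,
# so n_copies * batch_size == n_samples.
n_copies = n_samples // batch_size
tokenized = [
    {"task_id": task_id, "ids": tokenizer(prompt, return_tensors="pt").input_ids[0]}
    for task_id, prompt in enumerate(prompts)
    for _ in range(n_copies)
]
dataloader = accelerator.prepare(DataLoader(tokenized, batch_size=1))
```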

@lvwerra (Member) left a comment:

Hi @liyongsea

Thanks for making the changes - I think it is much cleaner now! I left a few small comments and nits. The main change is about the tokenization of the prompts which should simplify the code even further.

I think your first point is fine, that shouldn't be an issue. I am sure generation is the current bottleneck and not sending a few copies of the prompts.

Regarding your second point: I don't think this is true - accelerate should be able to handle that! It just means that during the last batch of jobs not all GPUs will be utilized but I think that is fine. Did you run into any problems? If not we could remove the ValueError.

@liyongsea (Contributor, Author):

@lvwerra Thank you for your suggestions, I learnt a lot! They are integrated into the PR; I will test it.

> Regarding your second point: I don't think this is true - accelerate should be able to handle that! It just means that during the last batch of jobs not all GPUs will be utilized but I think that is fine. Did you run into any problems? If not we could remove the ValueError.

I get an error message when doing so; I will dig into it.

for step, batch in tqdm(enumerate(dataloader)):
    with torch.no_grad():
        # do not generate if the batch is empty, it could be the case when multi GPU is used
        if len(batch["ids"]):
@liyongsea (Contributor, Author) commented on the lines above:

@lvwerra by doing this, I removed the division constraint. It surprises me a bit, because I thought I would at least need to assign generated_tokens to an empty tensor. I think this is the last point you need to pay attention to.

@lvwerra (Member) replied:

I am curious - is it actually empty sometimes? I don't remember needing to do any of that in the training script. I think the dataloader would just stop for that worker earlier and never enter the loop in the last iteration.

@liyongsea (Contributor, Author):

Hi @lvwerra, I updated the code according to your suggestions, and I removed the exception regarding the division issue (please see my comment above).
I really appreciate your time. Tell me if you spot anything.

@lvwerra (Member) left a comment:

Awesome, we are close to the finish line. We'll make a test run on our side as well (cc @loubnabnl).

In the meantime, @muellerzr or @sgugger, would you mind having a quick look too? The goal is to move the old evaluation script to accelerate so we can make the time-consuming generations faster with parallelization. In this script we need to generate k code completions for each problem in HumanEval, which are then evaluated at the end (that part is accelerate-agnostic as it runs on the CPU). Let us know if we are using accelerate correctly for that task. Thanks!
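
For background on the evaluation step: pass@k is the unbiased estimator from the Codex paper, which, as far as I understand, is what the code_eval metric computes from the n generations per task. A small sketch of the formula (the function name is illustrative, not the metric's source):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k given n samples per task, of which c pass the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 40 samples per task, 6 of them passing
print(pass_at_k(40, 6, 1))   # ~0.15
print(pass_at_k(40, 6, 10))  # higher, since more draws per task
```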

from arguments import HumanEvalArguments
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    StoppingCriteria,
    StoppingCriteriaList,
    pipeline,
    set_seed,
)
@lvwerra (Member) commented on the lines above:

It just occurred to me that we should probably switch to the set_seed function from accelerate:
https://github.com/huggingface/accelerate/blob/2c554b056cf10f7ae26e5d04cb60bf9dc35812e4/src/accelerate/utils.py#L120-L137
Otherwise what could happen is that each worker has the same seed and thus generates the same sequences even when sampling. We should set device_specific=True such that each worker gets a different seed.
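
A minimal sketch of what that change could look like in the script, assuming the seed comes from an args.seed CLI argument (not the PR's exact diff):

```python
from accelerate.utils import set_seed

# device_specific=True offsets the seed by the process index, so each worker
# samples different completions instead of duplicating the same ones.
set_seed(args.seed, device_specific=True)
```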

@liyongsea (Contributor, Author) replied:

Good idea, I was wondering about the same issue as well. I will change it.

@liyongsea (Contributor, Author):

You can go to the branch empty_tensor; it is the same as this PR but with the if condition and the exception removed.
Then configure accelerate for 4 processes and run:

accelerate launch scripts/human_eval.py --model_ckpt lvwerra/codeparrot-small --do_sample True --temperature 0.2 --top_p 0.95 --n_samples=10 --HF_ALLOW_CODE_EVAL="1" --num_tasks=6 --batch_size=10

You will notice something like

TypeError: only integer tensors of a single element can be converted to an index

where accelerate is sending an empty tensor to some processes during the last iteration.

I will also take a look into accelerate to see if I find anything

@muellerzr (Contributor) left a comment:

It all looks pretty good to me! There was one small nit I found where we could speed things up slightly.

(Also unsure of your empty tensor issue)

Comment on lines 133 to 134
generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
generated_tasks = accelerator.gather(generated_tasks).cpu().numpy()
@muellerzr (Contributor):

You should be able to reduce this to:

generated_tokens, generated_tasks = accelerator.gather((generated_tokens, generated_tasks))
generated_tokens = generated_tokens.cpu().numpy()
generated_tasks = generated_tasks.cpu().numpy()

And quick playing on my own machine reduced this by roughly half (it's in microseconds, but it's a small, quick efficiency boost).

@sgugger (Collaborator):

Oh, interesting that it gives a boost, as it's not doing anything different than gathering the two in a list comprehension.

@muellerzr (Contributor):

Yeah, was surprised as well.

Here's what I toyed with:

from accelerate import Accelerator
import torch

accelerator = Accelerator()
a = torch.tensor([1, 2, 3])
b = torch.tensor([[1], [2], [3]])

And then for timing:

%%timeit -n 10
c, d = accelerator.gather((a, b))
c = c.cpu().numpy()
d = d.cpu().numpy()

%%timeit -n 10
c = accelerator.gather(a).cpu().numpy()
d = accelerator.gather(b).cpu().numpy()

The first on Colab took 25 µs at best, while the latter took 48.9 µs.

@liyongsea (Contributor, Author):

Thank you for sharing! I will do the update

@sgugger (Collaborator) left a comment:

LGTM as long as @lvwerra is happy :-)


@liyongsea (Contributor, Author):

@lvwerra for the empty tensor issue, I found this in the documentation:
https://github.com/huggingface/accelerate/blob/1d95ebdaa40ed42d0bb2b41ba5531b49eb8948b1/src/accelerate/data_loader.py#L204
I am really not sure if it is related. Tomorrow I will make a smaller example to verify this.

@loubnabnl (Contributor):

Hi @liyongsea, thank you for your contributions. I made a test run on my side and the code works fine!
On an NVIDIA Tesla A100 machine with 2 GPUs and the parameters you're using, I get the following results:

| # processes | Time | pass@1 | pass@10 | pass@100 |
|---|---|---|---|---|
| 1 | 3:07:26 | 3.41% | 5.39% | 7.01% |
| 2 | 1:36:34 | 3.51% | 5.63% | 6.67% |

@lvwerra (Member) commented Apr 8, 2022

So as the numbers seem to add up I think we can merge this once the set_seed is added. The check for empty batches is simple enough so we can leave it like that for now in my opinion - no reason to spend much time on that. Thanks again for adding this!

@liyongsea (Contributor, Author) commented Apr 8, 2022

> So as the numbers seem to add up I think we can merge this once the set_seed is added. The check for empty batches is simple enough so we can leave it like that for now in my opinion - no reason to spend much time on that. Thanks again for adding this!

Hi @lvwerra,
I have finished the modifications:

  • accelerate.utils.set_seed is used with device_specific=True
  • I have updated the accelerate version, which solves the empty tensor issue

Feel free to run the whole script again and merge if you are OK with it!

@lvwerra merged commit 4868a83 into huggingface:main on Apr 11, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
* add simple multi gpu complet

* add human_eval_multi_gpu

* use copy strategy to distribute across gpu, to avoid padding

* add doc string

* update code style

* use task id to arrange output

* truncate input to avoid zero pad

* Stop the copy mechanism

* update style

* restore copies to scale better in distributed mode

* update style

* replace human eval

* Apply suggestions from code review

1. Tokenize all input at the same time
2. use attention_mask to get the input length
3. other small fixes

Co-authored-by: Leandro von Werra <[email protected]>

* correct typo and update docstring

* update code style

* remove num sample division constraint

* remove max len calculation

* use accelerator.gather once to speed up

* use accelerate set_seed; update accelerate version

* correct gather bug

Co-authored-by: Leandro von Werra <[email protected]>