【Hackathon 4 No52】add fp16 for dist #51669

Closed
wants to merge 31 commits into from

enkilee
Contributor

@enkilee enkilee commented Mar 14, 2023

PR types

Others

PR changes

Others

Describe

Add fp16 support for the dist op.
Docs update: PaddlePaddle/docs#5730

@paddle-bot

paddle-bot bot commented Mar 14, 2023

Your PR has been submitted. Thanks for your contribution to the open-source project!
Please wait for the automated CI results first. See the Paddle CI Manual for details.

@enkilee enkilee changed the title from "test dist support fp16" to "【Hackathon 4 No52】add fp16 for dist" Mar 15, 2023
@enkilee
Contributor Author

enkilee commented Mar 22, 2023

Traceback (most recent call last):
  File "test_dist_op.py", line 119, in test_check_grad
    self.check_grad(
  File "/mnt/paddle/build/python/paddle/fluid/tests/unittests/eager_op_test.py", line 2211, in check_grad
    self.check_grad_with_place(
  File "/mnt/paddle/build/python/paddle/fluid/tests/unittests/eager_op_test.py", line 2387, in check_grad_with_place
    self._assert_is_close(
  File "/mnt/paddle/build/python/paddle/fluid/tests/unittests/eager_op_test.py", line 2184, in _assert_is_close
    self.assertLessEqual(max_diff, max_relative_error, err_msg())
AssertionError: 0.04123 not less than or equal to 0.005 : Operator dist error, Gradient Check On Place(gpu:0) variable X (shape: (4, 1, 4, 8), dtype: float16) max gradient diff 4.122925e-02 over limit 5.000000e-03, the first error element is 36, expected 2.960205e-03, but got 2.838135e-03.

How should I analyze a grad check failure like this one, where the gradient difference exceeds the limit?
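To put the numbers in the assertion message in perspective, here is a small sketch that recomputes the relative difference for the one element the message reports. The metric used here, |expected - got| / |expected|, is an assumption about how the check compares gradients; it is not taken from `eager_op_test.py`. Under that assumption the element is off by about 4%, well above the 0.5% limit.

```python
# Values copied from the assertion message above.
expected = 2.960205e-03  # reference gradient reported for element 36
actual = 2.838135e-03    # gradient the op actually produced
limit = 5.0e-03          # the "limit" printed in the message

# Assumed metric (not taken from the test harness):
# relative difference with respect to the expected value.
rel_diff = abs(expected - actual) / abs(expected)

print(f"relative diff = {rel_diff:.4f}, limit = {limit}")
print("over limit" if rel_diff > limit else "within limit")
```

The absolute gap between the two values is only about 1.2e-04, so the failure comes from the gradient itself being small, which makes any fp16 rounding look large in relative terms.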

@zhangting2020
Contributor

Check the forward and backward implementations, and see which computation step loses the most precision.
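One way to follow this advice is to replay the computation stage by stage in float16 and compare each intermediate against a float64 reference. The sketch below does that with NumPy rather than Paddle, so it only approximates what the GPU kernel does. It assumes dist is the p-norm of `x - y`, and the helper names `dist_stages` and `max_rel_err` are hypothetical, introduced only for this illustration.

```python
import numpy as np


def dist_stages(x, y, p, dtype):
    """Compute ||x - y||_p and its gradient w.r.t. x, keeping every
    intermediate in `dtype` so the stages can be compared later."""
    x = x.astype(dtype)
    y = y.astype(dtype)
    diff = x - y
    powered = np.abs(diff) ** dtype(p)
    total = powered.sum(dtype=dtype)
    norm = total ** dtype(1.0 / p)
    # Analytic gradient of the p-norm: sign(d) * |d|^(p-1) / norm^(p-1).
    grad_x = np.sign(diff) * np.abs(diff) ** dtype(p - 1) / norm ** dtype(p - 1)
    return {"diff": diff, "powered": powered, "sum": total,
            "norm": norm, "grad_x": grad_x}


def max_rel_err(approx, ref):
    """Largest element-wise relative error, evaluated in float64."""
    approx = np.asarray(approx, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    denom = np.maximum(np.abs(ref), 1e-12)  # guard against division by zero
    return float(np.max(np.abs(approx - ref) / denom))


rng = np.random.default_rng(0)
shape = (4, 1, 4, 8)  # same shape as variable X in the failing test
x = rng.random(shape)
y = rng.random(shape)

ref = dist_stages(x, y, 2.0, np.float64)  # high-precision reference
low = dist_stages(x, y, 2.0, np.float16)  # same steps in float16

report = {name: max_rel_err(low[name], ref[name]) for name in ref}
for name, err in report.items():
    print(f"{name:8s} max relative error = {err:.3e}")
```

The stage whose error jumps the most is the first place to look in the real forward or backward kernel, for example to decide whether that step should accumulate in float32.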

@paddle-ci-bot

paddle-ci-bot bot commented Apr 24, 2023

Sorry to inform you that the CIs for 8eae1b5 passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

@luotao1
Contributor

luotao1 commented Apr 28, 2023

Closing because the following PR has been merged:

@luotao1 luotao1 closed this Apr 28, 2023
@enkilee enkilee deleted the Hackathon4-no52-dist branch July 6, 2023 01:26
Labels
contributor External developers