Fix expert grad scaling problem with ZeRO optimizer #6546
Conversation
@microsoft-github-policy-service agree
Thank you @wyooyw! This looks good to me.
@inkcherry Do you have any suggestion? You worked on a similar issue in #5259.
@wyooyw It seems that you should also delete or comment out https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L1072 when you delete https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L1079
Thank you for your suggestion. This redundant line of code has been removed.
@@ -1115,8 +1114,7 @@ def average_tensor(self, tensor):
            curr_size += numel
            prev_id, prev_process_group = partition_id, process_group

        if not self.ipg_bucket_has_moe_params:
            tensor.div_(dist.get_world_size(group=self.dp_process_group) / float(self.sequence_parallel_size))
If only the expert gradients are incorrect, we would only need to change 'grad_reduc' to divide by dp_world_size instead of edp_world_size. Why do we need to divide 'tensor' instead? It may contain more data than just that gradient. I'm just confused about this part.
In my understanding, 'tensor' contains only gradients that are waiting for the all-reduce.
From the code, 'tensor' may be a buffer in 'self.ipg_buffer' or the gradient of 'self.extra_large_param_to_reduce'. So 'tensor' is composed of the data of one or more weight gradients, and the data pointer of 'grad_reduc' points to an address within 'tensor'.
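As a rough illustration (a minimal PyTorch sketch, not DeepSpeed code; names like `ipg_buffer`, `grad_a`, and `grad_reduc_a` are made up), the bucket is one flat tensor and each per-parameter 'grad_reduc' is a view into it, so dividing the whole buffer scales every gradient it holds:

```python
import torch

# Hypothetical flat IPG bucket holding two parameters' gradients back-to-back.
ipg_buffer = torch.zeros(10)
grad_a = torch.full((4,), 8.0)   # 4 elements for param A
grad_b = torch.full((6,), 2.0)   # 6 elements for param B
ipg_buffer[0:4].copy_(grad_a)
ipg_buffer[4:10].copy_(grad_b)

# The per-parameter "grad_reduc" is a view that shares storage with the buffer.
grad_reduc_a = ipg_buffer.narrow(0, 0, 4)

# Dividing the whole buffer therefore scales every gradient inside it.
ipg_buffer.div_(4.0)
print(grad_reduc_a)      # tensor([2., 2., 2., 2.]) -- the view sees the change
print(ipg_buffer[4:10])  # tensor([0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
```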
According to the comments in the code, the logic of the old version is:
- Average gradients at the parameter level if the IPG bucket has an MoE param, i.e., do the average on 'grad_reduc'.
- Otherwise, averaging is done on the entire buffer at the end of the loop, i.e., do the average on 'tensor'.
The old code did this because it needed to divide the expert gradients by edp_size and the non-expert gradients by dp_size, so it had to do the average at the parameter level whenever there was an MoE param. In our PR, we divide all weight gradients by dp_size, so we can do the average directly on the entire buffer.
In addition, I should probably also delete those old comments.
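A minimal sketch of the two scaling strategies under discussion (illustrative only: plain integers stand in for `dist.get_world_size(...)` on the real process groups, `sequence_parallel_size` is assumed to be 1, and the values are made up):

```python
import torch

dp_size = 8    # data-parallel world size (assumed)
edp_size = 2   # expert-data-parallel world size (assumed)

# A bucket holding one expert gradient (first 2 elems) and one dense gradient.
bucket_old = torch.tensor([8.0, 8.0, 8.0, 8.0])
bucket_new = bucket_old.clone()

# Old logic: average at the parameter level when the bucket has MoE params,
# dividing the expert 'grad_reduc' by edp_size and the dense one by dp_size.
bucket_old.narrow(0, 0, 2).div_(edp_size)   # expert gradient view
bucket_old.narrow(0, 2, 2).div_(dp_size)    # non-expert gradient view

# Logic described in this PR: divide every gradient by dp_size, so the
# average can be done once on the whole buffer.
bucket_new.div_(dp_size)

print(bucket_old)  # tensor([4., 4., 1., 1.])
print(bucket_new)  # tensor([1., 1., 1., 1.])
```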
Thank you for the clarification. I agree with deleting those old comments.
Force-pushed from 6e1e90c to b1231c4 (compare)
Fix #6545
work: