bug/feature: Fixing mixed precision training #290
base: main
Conversation
@deltheil PTAL. @isamu-isozaki could you rebase?
See runtime bugs found
@limiteinductive Ok, a slight update. For bfloat16, grad scaling doesn't seem necessary because it has exactly the same number of exponent bits as float32; for fp16, though, it is necessary. Traditionally, the steps needed to make this work for float16 look something like the sketch below.
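A minimal sketch of that traditional GradScaler recipe, assuming a CUDA device; `compute_loss`, `model`, `optimizer`, and `dataloader` are placeholder names:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = compute_loss(model, batch)  # placeholder forward + loss
    scaler.scale(loss).backward()  # scale the loss so fp16 gradients don't underflow
    scaler.step(optimizer)         # unscales the gradients and skips the step on inf/NaN
    scaler.update()                # adjusts the scale factor for the next iteration
```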
In accelerate, on the other hand, the loss scaling happens inside `accelerator.backward`, and the scaler step and update happen inside the wrapped optimizer's `step`, roughly as in the second sketch below.
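A rough sketch of that accelerate pattern, using the same placeholder names as above:

```python
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # wrapping happens here

for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)  # placeholder forward + loss
    accelerator.backward(loss)         # applies GradScaler.scale() internally when fp16 is active
    optimizer.step()                   # the wrapped optimizer calls scaler.step() and scaler.update()
```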
I think I'll try to find a simpler way, since these approaches need the model to be "wrapped".
Now that I think about it, even in bfloat16 the gradients should be fp32.
Ok, fixed. For this training, the models being trained have to be in fp32, while all other models can be `self.dtype`.
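A minimal sketch of that arrangement, assuming a CUDA device, with `trained_model`, `frozen_model`, `loss_fn`, and `batch` as placeholder names and bfloat16 standing in for `self.dtype`:

```python
import torch

low_dtype = torch.bfloat16             # stands in for self.dtype
trained_model.to(dtype=torch.float32)  # parameters being optimized stay in float32
frozen_model.to(dtype=low_dtype)       # other models can live in the lower precision

with torch.autocast(device_type="cuda", dtype=low_dtype):
    loss = loss_fn(trained_model(frozen_model(batch)))
loss.backward()                        # gradients of the float32 parameters come out in float32
```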
@deltheil PTAL
Some comments, PTAL cc @limiteinductive
```diff
@@ -105,6 +106,9 @@ def wrapper(self: Trainer[BaseConfig, Any], config: ModelConfigT) -> fl.Module:
         if config.requires_grad is not None:
             model.requires_grad_(requires_grad=config.requires_grad)
         learnable_parameters = [param for param in model.parameters() if param.requires_grad]
+        if self.config.training.automatic_mixed_precision:
+            for learnable_parameter in learnable_parameters:
+                learnable_parameter.to(dtype=float32)
```
I guess this is because with AMP some layers/ops behave better (range) with float32, right? Doesn't this deserve a comment?
```python
    automatic_mixed_precision: bool = (
        True  # Enables automatic mixed precision which allows float32 gradients while working with lower precision.
    )
    dtype: str = "float32"
```
Since AMP is on by default (opt-out), and given that this `dtype` config is used jointly with it via the `autocast(dtype=self.dtype, enabled=self.config.training.automatic_mixed_precision)` call, any reason not to pick a sane default here? Namely: either `float16` or `bfloat16`. Might also deserve a comment.
(accelerate even offers to "Choose from ‘no’, ‘fp16’, ‘bf16’ or ‘fp8’.")
We could set it to False by default, and even though float32 + amp doesn't do much, I don't see this as a big issue; I would rather keep the API straightforward.
```python
        if self.scaler is None:
            backward(tensors=scaled_loss)
            return
        self.scaler.scale(scaled_loss).backward()  # type: ignore
```
Note: as discussed offline, a follow-up PR will be needed to properly unscale the gradients before gradient clipping (which happens between the backward step and the optimizer step, as per https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-unscaled-gradients).
But also, what about gradient accumulation? https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-unscaled-gradients
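For reference, continuing the GradScaler sketch above, the unscale-before-clip pattern from the linked PyTorch notes looks roughly like this (`max_norm=1.0` is just an illustrative value):

```python
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # bring the gradients back to their true (unscaled) values
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)      # does not unscale again; still skips the step on inf/NaN
scaler.update()
```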
I found that if we do

```python
# For all parameters we train in automatic mixed precision we want them to be in float32.
for learnable_parameter in learnable_parameters:
    learnable_parameter.to(dtype=float32)
```

then, while it's clearer, it doesn't actually set the dtype of all the parameters to float32, because `Tensor.to` is not an in-place operation. So the correct way to do it is to set the parameter dtype in the model before `register_model` (when `requires_grad` is set), and then, if not AMP, set the grad to `self.dtype`, I think.
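A quick illustration of why `Tensor.to` alone doesn't do it (throwaway example names):

```python
import torch

param = torch.nn.Parameter(torch.zeros(4, dtype=torch.float16))

param.to(dtype=torch.float32)                    # returns a new tensor; `param` is untouched
print(param.dtype)                               # torch.float16

param.data = param.data.to(dtype=torch.float32)  # converting the stored data does stick
print(param.dtype)                               # torch.float32

# At the module level, nn.Module.to() converts parameters in place:
# model.to(dtype=torch.float32)
```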
This makes it possible to do lower-precision training. @piercus might be interested