[BUG] Distributed training randomly stuck in training loop #6524
Comments
Hi @raeudigerRaeffi,
Hi @tohtana, thanks for your reply. Sadly this did not fix my issue; the training still gets stuck at random. The GPU usage is also odd this time, with all cards except one at 100%. Interestingly, I just looked through the stacks and noticed that this time all active threads are stuck in the same function (partition_grads), which was never the case before. To add to this, we are able to run our code successfully on A100 GPUs; the issue only occurs when we switch to P5 machines on AWS, which run H100s.
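One way to make this kind of hang fail loudly instead of blocking forever is to bound the process-group timeout and enable NCCL's error handling. A minimal sketch, not taken from our script, assuming the Accelerate + NCCL setup described in the report below (the exact env var names depend on the torch version):

```python
# Minimal sketch (not the original training script): bound collective timeouts and
# enable NCCL error handling so a stuck all_gather raises instead of hanging forever.
import os
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# NCCL debug / error-handling env vars; these are usually exported in the launch
# environment, shown here only for illustration. Older torch versions use
# NCCL_ASYNC_ERROR_HANDLING instead of the TORCH_-prefixed name.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")

# Give every collective a hard upper bound instead of a silent indefinite wait.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(minutes=30))]
)
```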
Hi, I have a script that runs data-parallel training with DeepSpeed on a machine with 8 H100 GPUs (an AWS p5 VM). When we run it, it randomly gets stuck forever at some iteration relatively late in the run (between the 2000th and 4000th iteration).
We start the script with the following command:
accelerate launch src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
While stuck, the GPUs are only at about 30% memory usage and utilization is at 0%.
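For context, the hang always happens inside the backward call of our training loop. A reduced, hypothetical sketch of the loop shape (the real training_loop in fine_tune_accelerate.py contains more logic than this):

```python
from accelerate import Accelerator

def training_loop(accelerator: Accelerator, model, optimizer, lr_scheduler, train_dataloader):
    # Hypothetical, reduced sketch of the loop shape around the hang; not the
    # actual code from fine_tune_accelerate.py.
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        # Ranks block here: the backward pass triggers ZeRO-3 parameter all_gather
        # calls (see the fetch_sub_module frames in the dumps below).
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```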
The stack traces of the relevant processes look as follows:
pgrep -P $(pgrep -o accelerate) | xargs -I {} py-spy dump --pid {}
Process 39: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)
Thread 39 (idle): "MainThread"
backward (torch/autograd/__init__.py:266)
backward (torch/_tensor.py:522)
backward (deepspeed/runtime/fp16/loss_scaler.py:63)
backward (deepspeed/runtime/zero/stage3.py:2213)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (deepspeed/runtime/engine.py:1976)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (accelerate/utils/deepspeed.py:166)
backward (accelerate/accelerator.py:2126)
training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:410)
training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:540)
main (src/model_back/healing/scripts/fine_tune_accelerate.py:583)
(src/model_back/healing/scripts/fine_tune_accelerate.py:587)
Thread 930 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 4067 (active)
all_gather_into_tensor (torch/distributed/distributed_c10d.py:2709)
wrapper (torch/distributed/c10d_logger.py:72)
all_gather_into_tensor (deepspeed/comm/torch.py:219)
_fn (torch/_dynamo/eval_frame.py:489)
all_gather_into_tensor (deepspeed/comm/comm.py:305)
log_wrapper (deepspeed/comm/comm.py:117)
allgather_fn (deepspeed/comm/comm.py:320)
wrapped_fn (deepspeed/utils/nvtx.py:15)
_dist_allgather_fn (deepspeed/runtime/zero/partition_parameters.py:93)
all_gather_coalesced (deepspeed/runtime/zero/partition_parameters.py:1217)
wrapped_fn (deepspeed/utils/nvtx.py:15)
_all_gather_params (deepspeed/runtime/zero/partitioned_param_coordinator.py:463)
__all_gather_params (deepspeed/runtime/zero/partitioned_param_coordinator.py:434)
wrapped_fn (deepspeed/utils/nvtx.py:15)
fetch_sub_module (deepspeed/runtime/zero/partitioned_param_coordinator.py:385)
decorate_context (torch/utils/_contextlib.py:115)
wrapped_fn (deepspeed/utils/nvtx.py:15)
_fn (torch/_dynamo/eval_frame.py:489)
pre_sub_module_backward_function (deepspeed/runtime/zero/parameter_offload.py:474)
decorate_context (torch/utils/_contextlib.py:115)
_run_before_backward_function (deepspeed/runtime/zero/parameter_offload.py:339)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (deepspeed/runtime/zero/parameter_offload.py:358)
apply (torch/autograd/function.py:289)
backward (torch/autograd/__init__.py:266)
backward (torch/utils/checkpoint.py:320)
apply (torch/autograd/function.py:289)
Thread 4069 (idle)
Thread 4070 (idle)
Thread 4071 (idle)
Thread 4072 (idle)
Thread 4073 (idle)
Thread 4074 (idle)
Thread 4075 (idle)
Process 40: /usr/bin/python3.10 -u src/model_back/healing/scripts/fine_tune_accelerate.py --config_file src/model_back/healing/configs/mixtral_8x7b/config.yaml
Python v3.10.12 (/usr/bin/python3.10)
Thread 40 (idle): "MainThread"
backward (torch/autograd/__init__.py:266)
backward (torch/_tensor.py:522)
backward (deepspeed/runtime/fp16/loss_scaler.py:63)
backward (deepspeed/runtime/zero/stage3.py:2213)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (deepspeed/runtime/engine.py:1976)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (accelerate/utils/deepspeed.py:166)
backward (accelerate/accelerator.py:2126)
training_loop (src/model_back/healing/scripts/fine_tune_accelerate.py:410)
training_function (src/model_back/healing/scripts/fine_tune_accelerate.py:540)
main (src/model_back/healing/scripts/fine_tune_accelerate.py:583)
(src/model_back/healing/scripts/fine_tune_accelerate.py:587)
Thread 924 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 4040 (idle)
Thread 4044 (active)
all_gather_into_tensor (torch/distributed/distributed_c10d.py:2709)
wrapper (torch/distributed/c10d_logger.py:72)
all_gather_into_tensor (deepspeed/comm/torch.py:219)
_fn (torch/_dynamo/eval_frame.py:489)
all_gather_into_tensor (deepspeed/comm/comm.py:305)
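The dumps above were taken manually with py-spy after the hang was noticed. A minimal in-process alternative sketch (not part of our script) that would dump every thread's Python stack automatically once a step takes longer than a threshold, using the standard-library faulthandler module:

```python
# Hypothetical helper, not part of fine_tune_accelerate.py: dump all Python stacks
# of this rank to stderr if the next training step takes longer than `timeout` seconds.
import faulthandler
import sys

def arm_hang_watchdog(timeout: float = 1800.0) -> None:
    # Calling this again before the timeout expires resets the timer, so re-arming
    # at the top of every iteration means a dump only appears when a step stalls.
    # exit=False keeps the process alive so py-spy / nvidia-smi still work afterwards.
    faulthandler.dump_traceback_later(timeout, repeat=True, file=sys.stderr, exit=False)

def disarm_hang_watchdog() -> None:
    faulthandler.cancel_dump_traceback_later()
```

Arming the watchdog at the start of each iteration and disarming it after optimizer.step() would give per-rank stack dumps of a stalled step without attaching an external profiler.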