Issues: microsoft/DeepSpeed
Issues list
[BUG] any clue for MFU drop?
bug
Something isn't working
training
#6727
opened Nov 8, 2024 by
SeunghyunSEO
[BUG] [ROCm] Fine-tuning DeepSeek-Coder-V2-Lite-Instruct with 8 MI300X GPUs results in c10::DistBackendError
bug
Something isn't working
rocm
AMD/ROCm/HIP issues
training
#6725
opened Nov 8, 2024 by
nikhil-tensorwave
CUBLAS_STATUS_NOT_SUPPORTED
bug
Something isn't working
training
#6723
opened Nov 7, 2024 by
niebowen666
[BUG] Zero3 for torch.compile with compiled_autograd when running LayerNorm
bug
Something isn't working
training
#6719
opened Nov 6, 2024 by
yitingw1
[BUG] DeepSpeed accuracy issue for torch.compile if activation checkpoint function not compiler disabled
bug
Something isn't working
training
#6718
opened Nov 6, 2024 by
jerrychenhf
[BUG] The problem of using Deepspeed to start training
bug
Something isn't working
training
#6715
opened Nov 5, 2024 by
sanxiaojijiaben
[BUG] Issue with Zero Optimization for Llama-2-7b Fine-Tuning on Intel GPUs
bug
Something isn't working
training
#6713
opened Nov 5, 2024 by
molang66
[BUG] NCCL Timeout When Pre-training "ds_train_bert_nvidia_data_bsz32k_seq512".
bug
Something isn't working
training
#6705
opened Nov 3, 2024 by
always-H
[BUG] Universal Checkpoint Conversion: Resumed Training Behaves as If Model Initialized from Scratch
bug
Something isn't working
training
#6691
opened Oct 30, 2024 by
purefall
[BUG] While submodule forward process in different gpu is not same, loss.backward get stuck
bug
Something isn't working
training
#6667
opened Oct 25, 2024 by
fuzuoyi
[BUG] Deepspeed launcher not picking up virtualenv --system-site-packages
bug
Something isn't working
build
Improvements to the build and testing systems.
#6664
opened Oct 24, 2024 by
VRehnberg
[BUG] ZeRO++ sharding small parameter raise IndexError
bug
Something isn't working
training
#6659
opened Oct 23, 2024 by
wuxibin89
[BUG] Training batch size is not consistent with train_batch_size
bug
Something isn't working
training
#6657
opened Oct 23, 2024 by
tnnandi
[BUG] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
bug
Something isn't working
training
#6643
opened Oct 20, 2024 by
RickoNoNo3
[BUG] failed to find frozen {param} in named params
bug
Something isn't working
training
#6620
opened Oct 11, 2024 by
ssklzx
[BUG] Non-Deterministic Model Responses when the Input Prompt Order Changes
bug
Something isn't working
inference
#6612
opened Oct 8, 2024 by
zcakzhuu
[BUG] MOE: Loading experts parameters error when using expert parallel.
bug
Something isn't working
training
#6589
opened Sep 29, 2024 by
kakaxi-liu
[BUG] DeepSpeed Ulysses zero3 compatibility
bug
Something isn't working
training
#6582
opened Sep 27, 2024 by
Xirid
[BUG] AttributeError: 'NoneType' object has no attribute 'set_moe'
bug
Something isn't working
inference
#6572
opened Sep 25, 2024 by
zhanwenchen
[BUG] ValueError: Tensors must be contiguous when using deepspeed.initialize
bug
Something isn't working
training
#6571
opened Sep 25, 2024 by
shadow150519
[BUG] AssertionError: Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops.
bug
Something isn't working
training
#6568
opened Sep 24, 2024 by
umarbutler
[BUG] Expert gradient scaling problem with ZeRO optimizer
bug
Something isn't working
training
#6545
opened Sep 17, 2024 by
wyooyw
[BUG] Distributed Training randomly stuck in trainings loop
bug
Something isn't working
training
#6524
opened Sep 11, 2024 by
raeudigerRaeffi
[BUG] error :past_key, past_value = layer_past,how to solve this ?
bug
Something isn't working
deepspeed-chat
Related to DeepSpeed-Chat
#6522
opened Sep 11, 2024 by
lovychen
[BUG] RuntimeError: Error building extension 'inference_core_ops'
bug
Something isn't working
build
Improvements to the build and testing systems.
#6519
opened Sep 10, 2024 by
Chetan3200