
Request: support graph-mode data parallelism, pipeline parallelism, and model parallelism in libai MAE #259

Open
KellyZhang2020 opened this issue Apr 12, 2022 · 5 comments

Comments

@KellyZhang2020
Contributor

No description provided.
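For context, "graph mode" refers to oneflow's static-graph execution via nn.Graph, where an eager module is wrapped in a Graph subclass and compiled on first call. A minimal sketch of the pattern, illustrative only and not LiBai's actual MAE implementation:

import oneflow as flow
import oneflow.nn as nn

class InferenceGraph(nn.Graph):
    # Wrap an eager nn.Module for static-graph execution.
    def __init__(self, model):
        super().__init__()
        self.model = model

    def build(self, x):
        # Traced on the first call; subsequent calls run the compiled graph.
        return self.model(x)

model = nn.Linear(4, 4)
graph = InferenceGraph(model)
y = graph(flow.randn(2, 4))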

@rentainhe
Contributor

Sounds good, we will push this forward step by step.

@rentainhe
Contributor

rentainhe commented Apr 15, 2022

Summary of missing interfaces for porting MAE-pytorch to MAE-oneflow (in sync with the operator-compatibility plan):

  • torch.cuda.synchronize
  • torch.cuda.max_memory_allocated
  • torch.nn.parallel.DistributedDataParallel() arguments are not aligned
  • the tensor.median() method is missing
  • oneflow.nn.utils.clip_grad_norm_ does not support passing None
@BBuf
Contributor

BBuf commented Apr 18, 2022

Could you write this up in a bit more detail? For example, paste an example of a misaligned argument or of the error it raises.

@rentainhe

@rentainhe
Contributor

Could you write this up in a bit more detail? For example, paste an example of a misaligned argument or of the error it raises.

@rentainhe

Sure, I will work with the user to put this together.

@rentainhe
Contributor

Minimal reproduction examples

  • tensor.median()
import torch
x = torch.randn(1, 2, 4)
print(x.median())  # works in PyTorch

import oneflow as flow
y = flow.randn(1, 2, 4)
print(y.median())  # fails: oneflow.Tensor has no median() method
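A possible stopgap until Tensor.median() lands (a sketch, assuming flow.sort returns a (values, indices) tuple as in current oneflow): sort the flattened tensor and take the lower of the two middle elements, which matches torch.median's convention for even-sized inputs.

import oneflow as flow

x = flow.randn(1, 2, 4)
values, _ = flow.sort(x.flatten())
# Lower median: index (n - 1) // 2 of the sorted values.
print(values[(values.numel() - 1) // 2])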
  • torch.cuda.synchronize
  • torch.cuda.max_memory_allocated

These two appear to have no corresponding interfaces in oneflow.
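For reference, a typical PyTorch benchmarking pattern that relies on both calls (an illustrative sketch, not the exact MAE call site):

import torch

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")
y = model(x)
# Block until all queued CUDA kernels finish so the stats are accurate.
torch.cuda.synchronize()
# Peak GPU memory allocated by tensors, in bytes.
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2 ** 20:.1f} MiB")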

  • torch.nn.parallel.DistributedDataParallel() arguments are not aligned
import torch

# Arguments accepted by torch.nn.parallel.DistributedDataParallel:
"""
Args:
    module,
    device_ids=None,
    output_device=None,
    dim=0,
    broadcast_buffers=True,
    process_group=None,
    bucket_cap_mb=25,
    find_unused_parameters=False,
    check_reduction=False,
    gradient_as_bucket_view=False,
    static_graph=False,
"""

import oneflow.nn.parallel as parallel

# Arguments accepted by oneflow.nn.parallel.DistributedDataParallel:
"""
Args:
    module: "flow.nn.Module"
    broadcast_buffers: bool = True,
    bucket_size: int = 10
"""