GPT2 pretraining: with the same configuration, LiBai's GPU memory usage is significantly higher than Megatron-LM's #349

Closed
Sakura-gh opened this issue Aug 9, 2022 · 11 comments

Comments

@Sakura-gh

Sakura-gh commented Aug 9, 2022

OneFlow version: 0.8.0+cu102; LiBai version: latest commit; environment: 8× Tesla P100-PCIE-16GB

I am currently comparing the performance of oneflow-libai and megatron-lm when pretraining GPT2. During these experiments I found that, under the same configuration, oneflow-libai uses noticeably more GPU memory than megatron-lm. Is this a problem with my configuration or with the framework itself?

With the same GPT2 model configuration (see the config file below), I set data_parallel to 8 and both tensor_parallel and pipeline_parallel to 1, on a server with 8× 16GB Tesla P100 GPUs:

  • With Megatron-LM I can run micro_batch_size=8 and global_batch_size=64 per iteration, with GPU memory utilization around 70%.
  • With oneflow-libai, micro_batch_size=4 and global_batch_size=32 already pushes GPU memory utilization to about 95%; using the same batch size as Megatron-LM results in an out-of-memory error (the batch-size arithmetic is sketched below).
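
For reference, a minimal sketch of the batch-size arithmetic behind these numbers, using the same relation that appears in the config file further down (global_batch_size = data_parallel_size × micro_batch_size × num_accumulation_steps):

# Batch-size arithmetic, mirroring train.global_batch_size in the config below.
data_parallel_size = 8          # train.dist.data_parallel_size
num_accumulation_steps = 1      # train.num_accumulation_steps

for micro_batch_size in (8, 4):  # the Megatron-LM run vs. the oneflow-libai run above
    global_batch_size = data_parallel_size * micro_batch_size * num_accumulation_steps
    print(micro_batch_size, global_batch_size)  # prints: 8 64, then 4 32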

My configuration file with the model, training, and parallelism settings, libai/configs/gpt2_pretrain.py:

from libai.config import LazyCall
from libai.evaluation import PPLEvaluator
from .common.models.gpt import pretrain_model as model
from .common.train import train
from .common.optim import optim
from .common.data.gpt_dataset import dataloader, tokenization

from .common.models.graph import graph

merge_files = "/home/gehao/dataset/gpt/hf-GPT2Data/merges.txt"
vocab_file = "/home/gehao/dataset/gpt/hf-GPT2Data/vocab.json"
data_prefix = "/home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document"

tokenization.tokenizer.vocab_file = vocab_file
tokenization.tokenizer.merges_file = merge_files
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix

# GPT-2 model config
model.cfg.num_layers = 24
model.cfg.vocab_size = 50257
model.cfg.hidden_size = 1024
model.cfg.ffn_hidden_size = 4 * model.cfg.hidden_size 
model.cfg.num_attention_heads = 16
model.cfg.max_seq_length = 1024
model.cfg.embedding_dropout_prob = 0.1
model.cfg.attention_dropout_prob = 0.1

optim.lr = 1.5e-4

train.train_iter = 100
train.warmup_ratio = 0.01
train.zero_optimization.enabled = True
train.zero_optimization.stage = 2
train.checkpointer.period = 1000
train.test_micro_batch_size = 8
train.evaluation.eval_period = 100
train.evaluation.eval_iter = 10
train.evaluation.evaluator = LazyCall(PPLEvaluator)()
train.log_period = 1
train.amp.enabled = True

# train.input_placement_device = "cpu"
train.input_placement_device = "cuda"
train.rdma_enabled = False

train.dist.data_parallel_size = 8
train.dist.tensor_parallel_size = 1
train.dist.pipeline_parallel_size = 1
train.dist.pipeline_num_layers = model.cfg.num_layers  # only takes effect when pipeline_parallel_size > 1

train.train_micro_batch_size = 4
train.num_accumulation_steps = 1
train.global_batch_size = train.dist.data_parallel_size * train.train_micro_batch_size * train.num_accumulation_steps

for ds in dataloader.train.dataset:
    ds.max_seq_length = model.cfg.max_seq_length

train.output_dir = f"./output/oneflow_libai_perf_gpt2_pretrain"

The corresponding libai/tools/train.sh:

#!/usr/bin/env bash

FILE=$1
CONFIG=$2
GPUS=8
NODE=1
NODE_RANK=0
ADDR=127.0.0.1
PORT=60075

export ONEFLOW_FUSE_OPTIMIZER_UPDATE_CAST=true

python3 -m oneflow.distributed.launch \
--nproc_per_node $GPUS --nnodes $NODE --node_rank $NODE_RANK --master_addr $ADDR --master_port $PORT \
$FILE --config-file $CONFIG ${@:4}

The command used:

bash tools/train.sh tools/train_net.py configs/gpt2_pretrain.py
@xyn1201

xyn1201 commented Aug 10, 2022

Hi, I ran the experiment above with your configuration.

OneFlow version: 0.8.0+cu102; LiBai version: latest commit

The OneFlow and LiBai versions you are using may not match. Could you specify which LiBai commit you are on?
If you are using the latest LiBai, we recommend pairing it with a nightly build of OneFlow, e.g. python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/cu102

My reproduction environment:

You can check the current OneFlow version with the command python3 -m oneflow --doctor.
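
Alternatively, a minimal Python check (assuming a standard OneFlow installation that exposes oneflow.__version__):

# Quick sanity check of the installed OneFlow build.
import oneflow as flow

print(flow.__version__)  # e.g. 0.8.0+cu102 for the release wheel mentioned above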

Results reproduced with your configuration on 8× 16GB V100 GPUs:

  • LiBai
    • mb2_gb16: 10143 MiB
    • mb4_gb32: OOM
  • Megatron-LM
    • mb2_gb16: 13432 MiB
    • mb4_gb32: OOM

You can also reproduce the release benchmark results with our test scripts.

I could not reproduce the corresponding results with your configuration and the commit you mentioned. Could you run the scripts below and see whether you get results similar to mine?
From https://github.com/Oneflow-Inc/OneAutoTest/tree/dev_display/libai:

  • Copy args_libai_gpt2.sh into the tools directory of the libai repo
  • Copy gpt2_nl24_nah16_hs1024.py into the configs directory of the libai repo
  • Run bash tools/args_libai_gpt2.sh configs/gpt2_nl24_nah16_hs1024.py 1 8 0 127.0.0.1 1 1 true false 2 16
  • This configuration changes input_placement_device from 'cuda' to 'cpu', which further reduces memory usage; my reproduction results:
    • mb2_gb16: 9942 MiB
    • mb4_gb32: OOM
  • To reproduce Megatron-LM, copy https://github.com/Oneflow-Inc/OneAutoTest/blob/dev_display/libai/megatron/megatron_args_pretrain_gpt2.sh into the examples directory of the Megatron repo and run bash examples/megatron_args_pretrain_gpt2.sh 1 8 0 127.0.0.1 1 1 true false 2 16; the results:
    • mb2_gb16: 13432 MiB
    • mb4_gb32: OOM

@Sakura-gh

@Sakura-gh
Author

Sakura-gh commented Aug 10, 2022

Hi, here are some additional details:

My OneFlow version: 0.8.0+cu102
git_commit: a6d4cb80
cmake_build_type: Release
rdma: True
mlir: True
My LiBai commit: 622cff9

In addition, I reproduced the experiments with the scripts you provided, and the results are fairly close:

  1. libai:
  • mbs=2, gbs=16, script bash tools/args_libai_gpt2.sh configs/gpt2_nl24_nah16_hs1024.py 1 8 0 127.0.0.1 1 1 true false 2 16, result: mb2_gb16: 8913 MiB, gpu_rate=54%, total_throughput: 16.55 samples/s
  • mbs=4, gbs=32, script bash tools/args_libai_gpt2.sh configs/gpt2_nl24_nah16_hs1024.py 1 8 0 127.0.0.1 1 1 true false 4 32, result: mb4_gb32: 15375 MiB, gpu_rate=94%, throughput: 18.95 samples/s
  • Probably due to differences between our machines, I did not hit OOM at mbs=4, gbs=32, but it is close, so the gap is small
  2. megatron-lm:

From the experiments above, I can roughly reproduce what you described on my machine, so I suspect the problem lies in my own Megatron-LM training script. However, after carefully comparing my script with the one you provided, the network parameters, parallelism settings, and so on are essentially identical, yet the results differ significantly. Here are the results produced by my script:

For easier comparison, I have also attached the log files produced by both scripts.

Attached: my Megatron-LM GPT2 pretraining script pretrain_gpt_dp_mp_pp.sh

#! /bin/bash

# Runs the "345M" parameter model
GPUS_PER_NODE=${1:-8}
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=60075
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

DATA_PATH=/home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document
VOCAB_FILE_PATH=/home/gehao/dataset/gpt/hf-GPT2Data/vocab.json
MERGE_FILE_PATH=/home/gehao/dataset/gpt/hf-GPT2Data/merges.txt

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

M_P=${2:-1}
P_P=${3:-1}
D_P=$(($WORLD_SIZE/$M_P/$P_P))  

MICRO_BATCH_SIZE=${4:-8}
GLOBAL_BATCH_SIZE=${5:-64}

TRAIN_ITERS=${6:-100}

CHECKPOINT_PATH=checkpoints/gpt2_gpus${GPUS_PER_NODE}_dp${D_P}_mp${M_P}_pp${P_P}_mbs${MICRO_BATCH_SIZE}_gbs${GLOBAL_BATCH_SIZE}_iters${TRAIN_ITERS}
LOGFILE=./log/megatron_lm_perf_gpt_pretrain_gpus${GPUS_PER_NODE}_dp${D_P}_mp${M_P}_pp${P_P}_mbs${MICRO_BATCH_SIZE}_gbs${GLOBAL_BATCH_SIZE}_iters${TRAIN_ITERS}.log
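
# Note: --checkpoint-activations (in the arguments below) enables activation recomputation,
# i.e. activations are recomputed during the backward pass, trading extra compute for lower GPU memory.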

python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       pretrain_gpt.py \
       --tensor-model-parallel-size $M_P \
       --pipeline-model-parallel-size $P_P \
       --num-layers 24 \
       --hidden-size 1024 \
       --num-attention-heads 16 \
       --micro-batch-size $MICRO_BATCH_SIZE \
       --global-batch-size $GLOBAL_BATCH_SIZE \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --train-iters $TRAIN_ITERS \
       --lr-decay-iters 320000 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --vocab-file $VOCAB_FILE_PATH \
       --merge-file $MERGE_FILE_PATH \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --lr-warmup-fraction .01 \
       --checkpoint-activations \
       --log-interval 1 \
       --save-interval 1000 \
       --eval-interval 100 \
       --eval-iters 10 \
       --fp16 2>&1 | tee ${LOGFILE}

echo "Writting log to ${LOGFILE}"     

The command used (mbs=8, gbs=64, gpu_rate=53%):

bash examples/pretrain_gpt_dp_mp_pp.sh 8 1 1 8 64 100

Could we discuss why these two Megatron-LM scripts produce such different results?

@xyn1201

@yuanms2

yuanms2 commented Aug 10, 2022

So even with Megatron-LM, Sakura-gh's configuration can run a much larger batch size, while our configuration can only run half of it.

@xyn1201

xyn1201 commented Aug 10, 2022

I checked the parameters printed at the top of the two logs. The difference is checkpoint_activations: yours is True, mine is False.
The tests you previously ran with us on LiBai all used checkpoint_activations=false.
In LiBai this is enabled via train.activation_checkpoint.enabled=true (see the sketch below).
Could you try reproducing it again?
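
For reference, a minimal sketch of the corresponding change in the LazyConfig above (configs/gpt2_pretrain.py), using the option named in this comment:

# Enable activation checkpointing in LiBai: recompute activations in the backward
# pass to reduce memory at the cost of extra compute (the LiBai counterpart of
# Megatron-LM's --checkpoint-activations flag).
train.activation_checkpoint.enabled = True
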
@Sakura-gh

@yuanms2

yuanms2 commented Aug 10, 2022

Oh, no wonder. Activation checkpointing trades time for memory.

@Sakura-gh
Author

Sakura-gh commented Aug 10, 2022

I checked the parameters printed at the top of the two logs. The difference is checkpoint_activations: yours is True, mine is False. The tests you previously ran with us on LiBai all used checkpoint_activations=false. In LiBai this is enabled via train.activation_checkpoint.enabled=true. Could you try reproducing it again? @Sakura-gh

Thanks a lot! After enabling checkpointing in LiBai, performance is on par with Megatron-LM, and LiBai still has an advantage in memory usage. Reproduction results:

  • libai: mbs=8, gbs=64, throughput=15.82 samples/s, gpu_rate=34%
  • megatron-lm: mbs=8, gbs=64, throughput=15.93 samples/s, gpu_rate=68%
@yuanms2

yuanms2 commented Aug 10, 2022

I see that with checkpointing enabled, LiBai's throughput is slightly lower than Megatron's. @xyn1201, is this consistent with the earlier test results?

@xyn1201

xyn1201 commented Aug 11, 2022

I see that with checkpointing enabled, LiBai's throughput is slightly lower than Megatron's. @xyn1201, is this consistent with the earlier test results?

@PussyCat0700

data_prefix = "/home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document"

Is there any guidance for downloading or preprocessing the data behind this data_prefix? I searched around but could not find any. Thanks!

@PussyCat0700

data_prefix = "/home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document"

Is there any guidance for downloading or preprocessing the data behind this data_prefix? I searched around but could not find any. Thanks!

Also, which dataset is this? I couldn't tell.

@PussyCat0700

data_prefix = "/home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document"

Is there any guidance for downloading or preprocessing the data behind this data_prefix? I searched around but could not find any. Thanks!

Also, which dataset is this? I couldn't tell.

It seems to be this one.
