
[Question] The performance for Hopper-v3 doesn't converge with PPO #376

Open
5 tasks done
cx441000319 opened this issue Apr 30, 2023 · 5 comments
Labels: "more information needed" (Please fill the issue template completely), "question" (Further information is requested)

Comments

@cx441000319

❓ Question

Hi there,

I ran into the following issue when training an agent with PPO on Hopper-v3. Here is the performance over 5 seeds, obtained by running: python3 scripts/all_plots.py -a ppo --env Hopper-v3 -f logs/downloaded

[plot: training curves for the 5 seeds]

The training commands look like: python train.py --algo ppo --env Hopper-v3 --seed 500X

The seeds range from 5000 to 5004 and the default hyperparameters are used. The return quickly converges to about 1K, drops dramatically to under 100, then climbs back to 1K, and so on.

I only encountered this issue with Hopper-v3 (A2C suffers from it as well); other environments work well.

Is there anything I did wrong? Any help is appreciated!

cx441000319 added the "question" (Further information is requested) label on Apr 30, 2023
araffin added the "more information needed" (Please fill the issue template completely) label on Apr 30, 2023
@cx441000319
Author

Sorry, what more information is needed?

@qgallouedec
Collaborator

qgallouedec commented Apr 30, 2023

This is a fairly common result with PPO; here is one thread among many that discusses it:

https://www.reddit.com/r/reinforcementlearning/comments/bqh01v/having_trouble_with_ppo_rewards_crashing/

You can try decreasing the clipping parameter and stopping the experiment early.
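
For reference, a minimal sketch of both ideas using SB3's built-in callbacks could look like the following (the clip range, reward threshold, and evaluation frequency are illustrative values, not tuned, and it assumes Hopper-v3 is registered in your Gymnasium/MuJoCo setup):

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import (
    EvalCallback,
    StopTrainingOnRewardThreshold,
)

env = gym.make("Hopper-v3")
eval_env = gym.make("Hopper-v3")

# Stop training as soon as the mean evaluation reward reaches the threshold,
# before a later performance collapse can occur.
stop_on_reward = StopTrainingOnRewardThreshold(reward_threshold=2500, verbose=1)
eval_callback = EvalCallback(
    eval_env,
    callback_on_new_best=stop_on_reward,
    eval_freq=10_000,
    n_eval_episodes=5,
)

# Smaller clip range than the zoo default of 0.2 (illustrative value).
model = PPO("MlpPolicy", env, clip_range=0.1)
model.learn(total_timesteps=1_000_000, callback=eval_callback)
```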

@cx441000319
Author

cx441000319 commented Apr 30, 2023

> This is a fairly common result with PPO; here is one thread among many that discusses it:
>
> https://www.reddit.com/r/reinforcementlearning/comments/bqh01v/having_trouble_with_ppo_rewards_crashing/
>
> You can try decreasing the clipping parameter and stopping the experiment early.

Thank you for your suggestions. I will try them.

Based on my experiment results, there are two more points worth noting:

  1. PPO works well for Walker2d-v3 and HalfCheetah-v3 (by "working well" I mean I can reproduce performance comparable to the benchmark with the default hyperparameters).
  2. A2C also fails on Hopper-v3, as shown below:

[plot: A2C training curves on Hopper-v3]

In this case, can I regard it as an issue with Hopper-v3 rather than an issue with PPO?

@araffin
Member

araffin commented May 1, 2023

> Sorry, what more information is needed?

The hyperparameters used and your system/library information (OS, gym version, MuJoCo version, SB3 version, ...).

@cx441000319
Author

cx441000319 commented May 5, 2023

Sorry for my late reply.

Hyperparameters:
Hopper-v3:
normalize: "dict(norm_obs=True, norm_reward=False)"
n_envs: 1
policy: 'MlpPolicy'
n_timesteps: !!float 1e6
batch_size: 32
n_steps: 512
gamma: 0.999
learning_rate: 9.80828e-05
ent_coef: 0.00229519
clip_range: 0.2
n_epochs: 5
gae_lambda: 0.99
max_grad_norm: 0.7
vf_coef: 0.835671
policy_kwargs: "dict(
log_std_init=-2,
ortho_init=False,
activation_fn=nn.ReLU,
net_arch=dict(pi=[256, 256], vf=[256, 256])
)"

os: Ubuntu 20.04 LTS
gym: 0.26.2
mujoco_py: 2.1.2.14
sb3: 2.0.0a5
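
For completeness, this is roughly how I understand those hyperparameters translate into a direct SB3 call (just a sketch; the zoo's train.py also handles evaluation and saving the VecNormalize statistics, among other things):

```python
import gymnasium as gym
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Single environment with observation normalization only, matching the config above.
env = DummyVecEnv([lambda: gym.make("Hopper-v3")])
env = VecNormalize(env, norm_obs=True, norm_reward=False)

model = PPO(
    "MlpPolicy",
    env,
    batch_size=32,
    n_steps=512,
    gamma=0.999,
    learning_rate=9.80828e-05,
    ent_coef=0.00229519,
    clip_range=0.2,
    n_epochs=5,
    gae_lambda=0.99,
    max_grad_norm=0.7,
    vf_coef=0.835671,
    policy_kwargs=dict(
        log_std_init=-2,
        ortho_init=False,
        activation_fn=nn.ReLU,
        net_arch=dict(pi=[256, 256], vf=[256, 256]),
    ),
)
model.learn(total_timesteps=1_000_000)
```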
