
Multi-GPU Parameter Server Strategy with the new Optimizer #18624

Open
hmc-cs-mdrissi opened this issue Oct 16, 2023 · 4 comments
Assignees: sachinprasadhs
Labels: stat:awaiting keras-eng (Awaiting response from Keras engineer), type:feature (The user is asking for a new feature)

Comments


hmc-cs-mdrissi commented Oct 16, 2023

Summary

The new optimizer API (the default in TF 2.11, tf.optimizers.experimental.Optimizer) does not support multi-GPU-per-worker parameter server training. The old optimizer API, tf.optimizers.legacy.Optimizer, does support it.

The issue is the lack of aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA on the iterations and learning-rate variables, which leads to a crash on this line. The iterations variable had its aggregation method explicitly specified in tf.optimizers.legacy.Optimizer, while the learning rate was not represented as a variable at all in the legacy Optimizer.

I'm unsure how to fix this now that Variable has changed to support multiple backends. Before that change, the fix was small: add aggregation to the lines that create the tf.Variable. Now the options I see are:

  1. Add a TensorFlow-backend check in the optimizer's variable creation logic for iterations/lr and pass aggregation there.
  2. Add an aggregation argument to KerasVariable for all backends, even though it may do nothing on non-TensorFlow backends. Do PyTorch/JAX have a similar variable aggregation concept? (A rough sketch of option 1 follows below.)
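
For illustration only, a minimal sketch of what option 1 might look like; the helper name make_iterations_variable is hypothetical and not part of any Keras API, and an actual fix would live inside the optimizer's own variable-creation code:

```python
import tensorflow as tf
import keras

# Hypothetical helper (not a real Keras API): create the optimizer's
# iterations counter, passing the TF-specific aggregation argument only
# when the TensorFlow backend is active (option 1 above).
def make_iterations_variable():
    kwargs = {}
    if keras.backend.backend() == "tensorflow":
        # ONLY_FIRST_REPLICA matches what the legacy optimizer used, so the
        # counter is updated once per step instead of once per replica.
        kwargs["aggregation"] = tf.VariableAggregation.ONLY_FIRST_REPLICA
    return tf.Variable(0, dtype=tf.int64, trainable=False,
                       name="iteration", **kwargs)
```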
fchollet (Member) commented

Parameter server training is not supported in Keras 3, generally speaking. The feature had very low usage. If you need it, I recommend you stick to tf.keras and use the legacy optimizers.

hmc-cs-mdrissi (Author) commented

For context on usage: I've seen it at a couple of companies for large recommender-system models, where embedding tables may be several GB; the largest I've worked with had hundreds of gigabytes in a single embedding table and relied on a variable partitioner to split the variable across PS servers. In that scenario, most other strategies that mirror variables are difficult to use.
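
For reference, a minimal sketch of the kind of setup being described, assuming a cluster already configured via TF_CONFIG; the partitioner choice, shard sizes, and table size here are purely illustrative:

```python
import tensorflow as tf

# Assumes a TF_CONFIG environment with chief/worker/ps tasks already set up.
resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

# Shard very large (multi-GB) embedding variables across the PS tasks.
strategy = tf.distribute.experimental.ParameterServerStrategy(
    resolver,
    variable_partitioner=tf.distribute.experimental.partitioners.MinSizePartitioner(
        min_shard_bytes=256 << 20,  # illustrative: 256 MB minimum per shard
        max_shards=4,               # illustrative: split across up to 4 PS tasks
    ),
)

with strategy.scope():
    # A large embedding table created here is automatically partitioned
    # across the parameter servers.
    embedding = tf.keras.layers.Embedding(input_dim=100_000_000, output_dim=64)
```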

I mostly haven't seen PS used outside of very large embedding tables. If PS is deprecated and unsupported, it's reasonable to close the ticket, although it would be helpful to document that somewhere (unless I missed it).

@SuryanarayanaY SuryanarayanaY added the type:feature The user is asking for a new feature. label Oct 17, 2023
@sachinprasadhs sachinprasadhs self-assigned this Oct 18, 2023
sachinprasadhs (Collaborator) commented

@hmc-cs-mdrissi, since Keras is now a multi-backend framework, the code base has migrated to the new Keras 3 code base.
PS is not available in the Keras 3 code base, but if you want to use it, you can still do so through the legacy Keras code via tf.keras.


hmc-cs-mdrissi commented Oct 18, 2023

My main question: in TF 2.11, tf.keras.optimizers refers to the new/experimental optimizers. Even if I use tf.keras explicitly, those optimizers hit this bug when used with parameter server strategy. Will improvements to make tf.keras.optimizers support PS be accepted (conditional on the TF backend where needed)?

This is one PS bug, but I've also found a couple more PS bugs when using the new optimizers.

As for tf.keras.optimizers.legacy: my understanding is that the legacy optimizers don't get additional features like weight decay. The other issue is that variable-saving behavior with checkpoints was improved in a key way that fixes a bug specific to my usage. The new optimizers are autotrackable while the legacy optimizers are not, leading to different checkpoint behavior.

Both the legacy and experimental optimizers have different bugs. If I encounter optimizer bugs, often related to PS strategy (legacy or new) and TF-specific, which ones should be reported, and where?

edit: If tf.keras.optimizers.legacy still accepts bug fixes, that can work. I'm mainly unsure because the legacy bug is also one significant difference versus the tf.keras.optimizers.experimental optimizers (slot variable checkpointing).

edit 2: Another aspect: this bug is the "nicer" kind, in that it crashes with a relevant error message. There is one tf.keras.optimizers.experimental.Optimizer bug with PS that silently produces incorrect gradient update steps, which can degrade model quality and was very hard to notice. At the moment that path can run and appear to work, but surprise the user. And if the user doesn't explicitly pick a class and just does

model.compile(optimizer="adam")

it will automatically pick keras.optimizers.Adam / tf.keras.optimizers.experimental.Adam regardless of the strategy being used (see the sketch below).
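
A minimal sketch of that behavior and a possible workaround under PS training, assuming TF 2.11+; the model and hyperparameters are purely illustrative:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

# The "adam" string resolves to the new tf.keras.optimizers.Adam,
# regardless of the distribution strategy in use.
model.compile(optimizer="adam", loss="mse")

# Workaround sketch: construct the legacy class explicitly so PS training
# keeps the ONLY_FIRST_REPLICA aggregation on the iterations counter.
model.compile(optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=1e-3),
              loss="mse")
```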
