
Multi-GPU Parameter Server Strategy with the new Optimizer #18624

Open
hmc-cs-mdrissi opened this issue Oct 16, 2023 · 4 comments
Assignees: sachinprasadhs
Labels: stat:awaiting keras-eng (Awaiting response from Keras engineer), type:feature (The user is asking for a new feature)

Comments


hmc-cs-mdrissi commented Oct 16, 2023

Summary

The new optimizer API (the default in TF 2.11, tf.optimizers.experimental.Optimizer) does not support multi-GPU-per-worker parameter server training. The old optimizer API, tf.optimizers.legacy.Optimizer, does support it.

The issue is the lack of aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA on the iterations and learning-rate variables, which leads to a crash on this line. The iterations variable had its aggregation method explicitly specified in tf.optimizers.legacy.Optimizer, while the learning rate was not represented as a variable at all in the legacy Optimizer.

I'm unsure how to fix this now that Variable has changed to support multiple backends. Before that change, the fix was small: add aggregation to the lines that create the tf.Variable. Now the options I see are:

  1. Add a TensorFlow-backend check in the optimizer's variable creation logic for iterations/lr and pass aggregation there.
  2. Add an aggregation argument to KerasVariable for all backends, even though it may do nothing on non-TensorFlow backends. Do PyTorch/JAX have a similar variable aggregation concept? (A rough sketch of option 1 follows below.)
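
For illustration only, a minimal sketch of what option 1 might look like; the helper name make_iterations_variable is hypothetical and not part of any Keras API, and an actual fix would live inside the optimizer's own variable-creation code:

```python
import tensorflow as tf
import keras

# Hypothetical helper (not a real Keras API): create the optimizer's
# iterations counter, passing the TF-specific aggregation argument only
# when the TensorFlow backend is active (option 1 above).
def make_iterations_variable():
    kwargs = {}
    if keras.backend.backend() == "tensorflow":
        # ONLY_FIRST_REPLICA matches what the legacy optimizer used, so the
        # counter is updated once per step instead of once per replica.
        kwargs["aggregation"] = tf.VariableAggregation.ONLY_FIRST_REPLICA
    return tf.Variable(0, dtype=tf.int64, trainable=False,
                       name="iteration", **kwargs)
```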
fchollet (Member) commented

Parameter server training is not supported in Keras 3, generally speaking. The feature had very low usage. If you need it, I recommend you stick to tf.keras and use the legacy optimizers.

hmc-cs-mdrissi (Author) commented

For context on usage: I've seen it at a couple of companies for large recommender-system models, where embedding tables may be several GB; the largest I've worked with had hundreds of gigabytes in a single embedding table and relied on a variable partitioner to split the variable across PS servers. In that scenario, most other strategies that mirror variables are difficult to use.
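
For reference, a minimal sketch of the kind of setup being described, assuming a cluster already configured via TF_CONFIG; the partitioner choice, shard sizes, and table size here are purely illustrative:

```python
import tensorflow as tf

# Assumes a TF_CONFIG environment with chief/worker/ps tasks already set up.
resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

# Shard very large (multi-GB) embedding variables across the PS tasks.
strategy = tf.distribute.experimental.ParameterServerStrategy(
    resolver,
    variable_partitioner=tf.distribute.experimental.partitioners.MinSizePartitioner(
        min_shard_bytes=256 << 20,  # illustrative: 256 MB minimum per shard
        max_shards=4,               # illustrative: split across up to 4 PS tasks
    ),
)

with strategy.scope():
    # A large embedding table created here is automatically partitioned
    # across the parameter servers.
    embedding = tf.keras.layers.Embedding(input_dim=100_000_000, output_dim=64)
```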

I mostly haven't seen PS used outside of very large embedding tables. If PS is deprecated and unsupported, it's reasonable to close the ticket, although it would be helpful to document that somewhere (unless I missed it).

@SuryanarayanaY SuryanarayanaY added the type:feature The user is asking for a new feature. label Oct 17, 2023
@sachinprasadhs sachinprasadhs self-assigned this Oct 18, 2023
sachinprasadhs (Collaborator) commented

@hmc-cs-mdrissi, since Keras is now a multi-backend framework, the code base has migrated to the new Keras 3 code base.
PS is not available in the Keras 3 code base, but if you want to use it, you can still do so through the legacy Keras code via tf.keras.


hmc-cs-mdrissi commented Oct 18, 2023

My main question: in TF 2.11, tf.keras.optimizers refers to the new/experimental optimizers. Even if I use tf.keras explicitly, those optimizers hit this bug when used with parameter server strategy. Will improvements to make tf.keras.optimizers support PS be accepted (conditional on the TF backend where needed)?

This is one PS bug, but I've also found a couple more PS bugs when using the new optimizers.

As for tf.keras.optimizers.legacy: my understanding is that the legacy optimizers don't get additional features like weight decay. The other issue is that variable-saving behavior with checkpoints was improved in a key way that fixes a bug specific to my usage. The new optimizers are autotrackable while the legacy optimizers are not, leading to different checkpoint behavior.

Both the legacy and experimental optimizers have different bugs. If I encounter optimizer bugs, often related to PS strategy (legacy or new) and TF-specific, which ones should be reported, and where?

edit: If tf.keras.optimizers.legacy still accepts bug fixes, that can work. I'm mainly unsure because the legacy bug is also one significant difference versus the tf.keras.optimizers.experimental optimizers (slot variable checkpointing).

edit 2: Another aspect: this bug is the "nicer" kind, in that it crashes with a relevant error message. There is one tf.keras.optimizers.experimental.Optimizer bug with PS that silently produces incorrect gradient update steps, which can degrade model quality and was very hard to notice. At the moment that path can run and appear to work, but surprise the user. And if the user doesn't explicitly pick a class and just does

model.compile(optimizer="adam")

it will automatically pick keras.optimizers.Adam / tf.keras.optimizers.experimental.Adam regardless of the strategy being used (see the sketch below).
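
A minimal sketch of that behavior and a possible workaround under PS training, assuming TF 2.11+; the model and hyperparameters are purely illustrative:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

# The "adam" string resolves to the new tf.keras.optimizers.Adam,
# regardless of the distribution strategy in use.
model.compile(optimizer="adam", loss="mse")

# Workaround sketch: construct the legacy class explicitly so PS training
# keeps the ONLY_FIRST_REPLICA aggregation on the iterations counter.
model.compile(optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=1e-3),
              loss="mse")
```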
