TensorFlow GPU - Fix keras/layers/merging/merging_test.py #18567

Open
sampathweb opened this issue Oct 6, 2023 · 14 comments

@sampathweb
Collaborator

Fix the failing test keras/layers/merging/merging_test.py::MergingLayersTest::test_sparse_dot_2d (Fatal Python error: Aborted) and update the TODO in https://github.com/keras-team/keras/blob/master/keras/kokoro/github/ubuntu/gpu/build.sh#L39

https://source.cloud.google.com/results/invocations/9df9ee7e-5666-4644-abd2-01a10771faeb/targets/keras%2Fgithub%2Fubuntu%2Fgpu%2Ftensorflow%2Fpresubmit/log

keras/layers/merging/merging_test.py::MergingLayersTest::test_sparse_dot_2d Fatal Python error: Aborted

Current thread 0x00007f51610f0740 (most recent call first):
  File "/tmpfs/venv/lib/python3.9/site-packages/tensorflow/python/ops/linalg/sparse/gen_sparse_csr_matrix_ops.py", line 1114 in sparse_matrix_sparse_mat_mul
  File "/tmpfs/src/github/keras/keras/backend/tensorflow/numpy.py", line 119 in sparse_sparse_matmul
  File "/tmpfs/src/github/keras/keras/backend/tensorflow/numpy.py", line 156 in matmul
  File "/tmpfs/src/github/keras/keras/ops/numpy.py", line 3431 in matmul
  File "/tmpfs/src/github/keras/keras/layers/merging/dot.py", line 171 in batch_dot
  File "/tmpfs/src/github/keras/keras/layers/merging/dot.py", line 320 in _merge_function
  File "/tmpfs/src/github/keras/keras/layers/merging/base_merge.py", line 189 in call
  File "/tmpfs/src/github/keras/keras/ops/operation.py", line 47 in __call__
  File "/tmpfs/src/github/keras/keras/utils/traceback_utils.py", line 114 in error_handler
  File "/tmpfs/src/github/keras/keras/layers/layer.py", line 810 in __call__
  File "/tmpfs/src/github/keras/keras/utils/traceback_utils.py", line 114 in error_handler
  File "/tmpfs/src/github/keras/keras/testing/test_case.py", line 380 in run_layer_test
  File "/tmpfs/src/github/keras/keras/layers/merging/merging_test.py", line 240 in test_sparse
  File "/tmpfs/venv/lib/python3.9/site-packages/absl/testing/parameterized.py", line 319 in bound_param_test
  File "/usr/lib/python3.9/unittest/case.py", line 550 in _callTestMethod
  File "/usr/lib/python3.9/unittest/case.py", line 592 in run
  File "/usr/lib/python3.9/unittest/case.py", line 651 in __call__
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/unittest.py", line 333 in runtest
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 341 in from_call
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/main.py", line 350 in pytest_runtestloop
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/main.py", line 325 in _main
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/main.py", line 271 in wrap_session
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/main.py", line 318 in pytest_cmdline_main
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 169 in main
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 192 in console_main
  File "/tmpfs/venv/bin/pytest", line 8 in <module>
github/keras/keras/kokoro/github/ubuntu/gpu/build.sh: line 34:  4954 Aborted                 (core dumped)
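
For reference, here is a minimal snippet that should exercise the same sparse-sparse matmul path shown in the traceback (a sketch only, assuming the Keras TF backend dispatches keras.ops.matmul on two tf.SparseTensor inputs to sparse_sparse_matmul as in the call chain above; not verified on the failing GPU build):

import numpy as np
import tensorflow as tf
import keras

# Two sparse 2D operands. On the TF backend, matmul on two SparseTensors
# goes through sparse_sparse_matmul and the CSR sparse_matrix_sparse_mat_mul
# op that aborts in the log above.
a = tf.sparse.from_dense(np.array([[1.0, 0.0], [0.0, 2.0]], dtype="float32"))
b = tf.sparse.from_dense(np.array([[0.0, 3.0], [4.0, 0.0]], dtype="float32"))
out = keras.ops.matmul(a, b)
print(out)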
@sachinprasadhs sachinprasadhs added keras-team-review-pending Pending review by a Keras team member. type:Bug labels Oct 10, 2023
@sampathweb
Collaborator Author

The culprit is between 2.15.0.dev20230918 (good) and 2.15.0.dev20230919 (bad).

Tested via: pytest keras/layers/merging/merging_test.py::MergingLayersTest::test_basic_add
Culprit in one of the changes in this range: git log e4a6720f42a..dfcf1d40e46 --oneline

@qlzh727
Member

qlzh727 commented Oct 11, 2023

Thanks Ramesh for the repro; we will revisit this during the triage meeting.

On a side note, I didn't find any change on the sparse side between those two dates. We will need to dig deeper for the root cause.

@sampathweb
Collaborator Author

sampathweb commented Oct 11, 2023

Here's a small code snippet to reproduce the issue in Colab with Keras master and TF nightly:

!pip uninstall -y keras tensorflow
!pip install tf-nightly[and-cuda]==2.15.0.dev20231009 --extra-index-url https://pypi.nvidia.com
!pip uninstall -y keras-nightly

# Install Keras from Master or `keras-core`
!pip install keras_core

import keras_core as keras
import numpy as np

input = keras.layers.Input(shape=(2,))
x1 = keras.layers.Dense(4, activation='relu')(input)
x2 = keras.layers.Dense(4, activation='relu')(input)
added = keras.layers.Add()([x1, x2])
out = keras.layers.Dense(1)(added)
model = keras.models.Model(inputs=input, outputs=out)

x = np.random.randn(8, 2)
y = np.random.randn(8, 1)
model.compile(optimizer='sgd', loss='mse')
model.fit(x, y, epochs=1)

Error:

---> 17 model.fit(x, y, epochs=1)

[/usr/local/lib/python3.10/dist-packages/keras_core/src/backend/tensorflow/trainer.py](https://localhost:8080/#) in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq)
    320                 for step, iterator in epoch_iterator.enumerate_epoch():
    321                     callbacks.on_train_batch_begin(step)
--> 322                     logs = self.train_function(iterator)
    323                     callbacks.on_train_batch_end(
    324                         step, self._pythonify_logs(logs)
    
[/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/execute.py](https://localhost:8080/#) in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     51   try:
     52     ctx.ensure_initialized()
---> 53     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     54                                         inputs, attrs, num_outputs)
     55   except core._NotOkStatusException as e:

FailedPreconditionError: Graph execution error:

Detected at node StatefulPartitionedCall defined at (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/keras_core/src/backend/tensorflow/trainer.py", line 117, in one_step_on_iterator

DNN library initialization failed. Look at the errors above for more details.
	 [[{{node StatefulPartitionedCall}}]] [Op:__inference_one_step_on_iterator_403]
@sampathweb
Collaborator Author

It also breaks if I replace Add with Concatenate. This is a high-priority error since it breaks a very important layer on TF GPU. The same test fails on JAX GPU as well.

@sampathweb
Collaborator Author

@fchollet - If you have any thoughts or suggestions to try let me know.

@sampathweb sampathweb self-assigned this Oct 11, 2023
@qlzh727
Member

qlzh727 commented Oct 11, 2023

The example you provided doesn't even use sparse inputs, which is different from the error at the top. The DNN library initialization failed error suggests that it's a GPU setup issue.

@sampathweb
Collaborator Author

If you install the Nightly from 09/18, it works fine.

!pip install tf-nightly[and-cuda]==2.15.0.dev20230918 --extra-index-url https://pypi.nvidia.com

It started failing with the 09/19 nightly.

!pip install tf-nightly[and-cuda]==2.15.0.dev20230919 --extra-index-url https://pypi.nvidia.com
@sampathweb
Collaborator Author

The example you provided doesn't even use sparse inputs, which is different from the error at the top. The DNN library initialization failed error suggests that it's a GPU setup issue.

There are multiple failures in merging_test.py. I tried to run the basic test case with Add and that fails too. I initially reported the sparse test, which actually aborts with a core dump:
keras/layers/merging/merging_test.py::MergingLayersTest::test_sparse_dot_2d Fatal Python error: Aborted

@sampathweb
Collaborator Author

The TF nightly from 09/18 works for ALL the tests in merging_test.py, so I think it's a common issue caused by a change in TF on 09/19, somewhere in this commit range: git log e4a6720f42a..dfcf1d40e46 --oneline

@qlzh727 qlzh727 self-assigned this Oct 12, 2023
@qlzh727 qlzh727 removed the keras-team-review-pending Pending review by a Keras team member. label Oct 12, 2023
@qlzh727
Member

qlzh727 commented Oct 12, 2023

Somehow I wasn't able to reproduce it on Colab with a T4 GPU: https://colab.sandbox.google.com/drive/1_hMJieL_6DobTPUbZ6BRZIEVz0YRHhBo#scrollTo=GM2B7qEqNYqk

Maybe I didn't configure the GPU properly?
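
One quick way to check whether the runtime actually exposes a GPU to TensorFlow (a sketch, not from the original comment):

import tensorflow as tf

# List GPUs visible to TensorFlow and confirm this is a CUDA build.
print(tf.config.list_physical_devices("GPU"))
print(tf.test.is_built_with_cuda())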

@qlzh727
Member

qlzh727 commented Oct 12, 2023

@sampathweb do you have a testable environment that I can run with?

@fchollet
Member

Also seems to be failing with JAX-GPU now:

github/keras/keras/kokoro/github/ubuntu/gpu/build.sh: line 57:  4493 Aborted                 (core dumped) pytest keras --ignore keras/applications --ignore keras/layers/merging/merging_test.py --ignore keras/trainers/data_adapters/py_dataset_adapter_test.py --ignore keras/backend/jax/distribution_lib_test.py --cov=keras
@sampathweb
Collaborator Author

sampathweb commented Oct 17, 2023

I will work on this tomorrow. I used a Colab V100 as my test environment.

@sampathweb
Collaborator Author

This seems to be a cuDNN/TF compilation issue.

2023-10-17 20:23:09.628643: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2023-10-17 20:23:10.277194: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:447] Loaded runtime CuDNN library: 8.7.0 but source was compiled with: 8.9.4.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2023-10-17 20:23:10.278786: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:574 : FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.

Tested via pytest keras/layers/merging/merging_test.py::MergingLayersTest::test_basic_add
I don't have a resolution yet, but it might be related to this change within the range git log e4a6720f42a..dfcf1d40e46 --oneline:

commit 3de44168950a5972ba4cfa7e3c6cbf4cffa67fe6
Author: A. Unique TensorFlower <gardener@tensorflow.org>
Date:   Mon Sep 18 13:50:11 2023 -0700

    Upgrade to LLVM 17, CUDA 12.2, and CuDNN 8.9.4
    
    This is updating TF's default toolchain to LLVM 17, as well as
    CUDA and cuDNN to the latest releases.
    
    PiperOrigin-RevId: 566403707
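
As a follow-up check (a sketch), the CUDA/cuDNN versions a given TF build was compiled against can be read with tf.sysconfig.get_build_info() and compared with the runtime library version reported in the log above (8.7.0 loaded vs. 8.9.4 expected):

import tensorflow as tf

# Compile-time toolchain versions baked into this TF build.
info = tf.sysconfig.get_build_info()
print("is_cuda_build:", info.get("is_cuda_build"))
print("compiled against CUDA:", info.get("cuda_version"))
print("compiled against cuDNN:", info.get("cudnn_version"))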