TensorFlow GPU - Fix keras/layers/merging/merging_test.py #18567

Open
sampathweb opened this issue Oct 6, 2023 · 14 comments

@sampathweb
Collaborator

Fix the failing test keras/layers/merging/merging_test.py::MergingLayersTest::test_sparse_dot_2d (Fatal Python error: Aborted) and update the TODO in https://github.com/keras-team/keras/blob/master/keras/kokoro/github/ubuntu/gpu/build.sh#L39

https://source.cloud.google.com/results/invocations/9df9ee7e-5666-4644-abd2-01a10771faeb/targets/keras%2Fgithub%2Fubuntu%2Fgpu%2Ftensorflow%2Fpresubmit/log

keras/layers/merging/merging_test.py::MergingLayersTest::test_sparse_dot_2d Fatal Python error: Aborted

Current thread 0x00007f51610f0740 (most recent call first):
  File "/tmpfs/venv/lib/python3.9/site-packages/tensorflow/python/ops/linalg/sparse/gen_sparse_csr_matrix_ops.py", line 1114 in sparse_matrix_sparse_mat_mul
  File "/tmpfs/src/github/keras/keras/backend/tensorflow/numpy.py", line 119 in sparse_sparse_matmul
  File "/tmpfs/src/github/keras/keras/backend/tensorflow/numpy.py", line 156 in matmul
  File "/tmpfs/src/github/keras/keras/ops/numpy.py", line 3431 in matmul
  File "/tmpfs/src/github/keras/keras/layers/merging/dot.py", line 171 in batch_dot
  File "/tmpfs/src/github/keras/keras/layers/merging/dot.py", line 320 in _merge_function
  File "/tmpfs/src/github/keras/keras/layers/merging/base_merge.py", line 189 in call
  File "/tmpfs/src/github/keras/keras/ops/operation.py", line 47 in __call__
  File "/tmpfs/src/github/keras/keras/utils/traceback_utils.py", line 114 in error_handler
  File "/tmpfs/src/github/keras/keras/layers/layer.py", line 810 in __call__
  File "/tmpfs/src/github/keras/keras/utils/traceback_utils.py", line 114 in error_handler
  File "/tmpfs/src/github/keras/keras/testing/test_case.py", line 380 in run_layer_test
  File "/tmpfs/src/github/keras/keras/layers/merging/merging_test.py", line 240 in test_sparse
  File "/tmpfs/venv/lib/python3.9/site-packages/absl/testing/parameterized.py", line 319 in bound_param_test
  File "/usr/lib/python3.9/unittest/case.py", line 550 in _callTestMethod
  File "/usr/lib/python3.9/unittest/case.py", line 592 in run
  File "/usr/lib/python3.9/unittest/case.py", line 651 in __call__
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/unittest.py", line 333 in runtest
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 341 in from_call
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/main.py", line 350 in pytest_runtestloop
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/main.py", line 325 in _main
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/main.py", line 271 in wrap_session
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/main.py", line 318 in pytest_cmdline_main
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_callers.py", line 77 in _multicall
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_manager.py", line 115 in _hookexec
  File "/tmpfs/venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 493 in __call__
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 169 in main
  File "/tmpfs/venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 192 in console_main
  File "/tmpfs/venv/bin/pytest", line 8 in <module>
github/keras/keras/kokoro/github/ubuntu/gpu/build.sh: line 34:  4954 Aborted                 (core dumped)
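
For reference, here is a minimal snippet that should exercise the same sparse-sparse matmul path shown in the traceback (a sketch only, assuming the Keras TF backend dispatches keras.ops.matmul on two tf.SparseTensor inputs to sparse_sparse_matmul as in the call chain above; not verified on the failing GPU build):

import numpy as np
import tensorflow as tf
import keras

# Two sparse 2D operands. On the TF backend, matmul on two SparseTensors
# goes through sparse_sparse_matmul and the CSR sparse_matrix_sparse_mat_mul
# op that aborts in the log above.
a = tf.sparse.from_dense(np.array([[1.0, 0.0], [0.0, 2.0]], dtype="float32"))
b = tf.sparse.from_dense(np.array([[0.0, 3.0], [4.0, 0.0]], dtype="float32"))
out = keras.ops.matmul(a, b)
print(out)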
@sachinprasadhs sachinprasadhs added keras-team-review-pending Pending review by a Keras team member. type:Bug labels Oct 10, 2023
@sampathweb
Collaborator Author

The culprit is between 2.15.0.dev20230918 (good) and 2.15.0.dev20230919 (bad).

Tested via: pytest keras/layers/merging/merging_test.py::MergingLayersTest::test_basic_add
Culprit in one of the changes in this range: git log e4a6720f42a..dfcf1d40e46 --oneline

@qlzh727
Member

qlzh727 commented Oct 11, 2023

Thanks Ramesh for the repro; we will revisit this during the triage meeting.

On a side note, I didn't find any change on the sparse side between those two dates. We will need to dig deeper for the root cause.

@sampathweb
Collaborator Author

sampathweb commented Oct 11, 2023

Here's a small code snippet to reproduce the issue in Colab with Keras master and TF nightly:

!pip uninstall -y keras tensorflow
!pip install tf-nightly[and-cuda]==2.15.0.dev20231009 --extra-index-url https://pypi.nvidia.com
!pip uninstall -y keras-nightly

# Install Keras from Master or `keras-core`
!pip install keras_core

import keras_core as keras
import numpy as np

input = keras.layers.Input(shape=(2,))
x1 = keras.layers.Dense(4, activation='relu')(input)
x2 = keras.layers.Dense(4, activation='relu')(input)
added = keras.layers.Add()([x1, x2])
out = keras.layers.Dense(1)(added)
model = keras.models.Model(inputs=input, outputs=out)

x = np.random.randn(8, 2)
y = np.random.randn(8, 1)
model.compile(optimizer='sgd', loss='mse')
model.fit(x, y, epochs=1)

Error:

---> 17 model.fit(x, y, epochs=1)

[/usr/local/lib/python3.10/dist-packages/keras_core/src/backend/tensorflow/trainer.py](https://localhost:8080/#) in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq)
    320                 for step, iterator in epoch_iterator.enumerate_epoch():
    321                     callbacks.on_train_batch_begin(step)
--> 322                     logs = self.train_function(iterator)
    323                     callbacks.on_train_batch_end(
    324                         step, self._pythonify_logs(logs)
    
[/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/execute.py](https://localhost:8080/#) in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     51   try:
     52     ctx.ensure_initialized()
---> 53     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     54                                         inputs, attrs, num_outputs)
     55   except core._NotOkStatusException as e:

FailedPreconditionError: Graph execution error:

Detected at node StatefulPartitionedCall defined at (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/keras_core/src/backend/tensorflow/trainer.py", line 117, in one_step_on_iterator

DNN library initialization failed. Look at the errors above for more details.
	 [[{{node StatefulPartitionedCall}}]] [Op:__inference_one_step_on_iterator_403]
@sampathweb
Collaborator Author

It also breaks if I replace Add with Concatenate. This is a high-priority error since it breaks a very important layer on TF GPU. The same test fails on JAX GPU as well.

@sampathweb
Collaborator Author

@fchollet - If you have any thoughts or suggestions to try let me know.

@sampathweb sampathweb self-assigned this Oct 11, 2023
@qlzh727
Member

qlzh727 commented Oct 11, 2023

The example you provided doesn't even use sparse inputs, which is different from the error at the top. The DNN library initialization failed error suggests that it's a GPU setup issue.

@sampathweb
Collaborator Author

If you install the Nightly from 09/18, it works fine.

!pip install tf-nightly[and-cuda]==2.15.0.dev20230918 --extra-index-url https://pypi.nvidia.com

It started failing with the 09/19 nightly.

!pip install tf-nightly[and-cuda]==2.15.0.dev20230919 --extra-index-url https://pypi.nvidia.com
@sampathweb
Collaborator Author

The example you provided doesn't even use sparse inputs, which is different from the error at the top. The DNN library initialization failed error suggests that it's a GPU setup issue.

There are multiple failures in merging_test.py. I tried to run the basic test case with Add and that fails too. I initially reported the sparse test, which actually aborts with a core dump:
keras/layers/merging/merging_test.py::MergingLayersTest::test_sparse_dot_2d Fatal Python error: Aborted

@sampathweb
Collaborator Author

The TF nightly from 09/18 works for ALL the tests in merging_test.py, so I think it's a common issue caused by a change in TF on 09/19, somewhere in this commit range: git log e4a6720f42a..dfcf1d40e46 --oneline

@qlzh727 qlzh727 self-assigned this Oct 12, 2023
@qlzh727 qlzh727 removed the keras-team-review-pending Pending review by a Keras team member. label Oct 12, 2023
@qlzh727
Member

qlzh727 commented Oct 12, 2023

Somehow I wasn't able to reproduce it on Colab with a T4 GPU: https://colab.sandbox.google.com/drive/1_hMJieL_6DobTPUbZ6BRZIEVz0YRHhBo#scrollTo=GM2B7qEqNYqk

Maybe I didn't configure the GPU properly?
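
One quick way to check whether the runtime actually exposes a GPU to TensorFlow (a sketch, not from the original comment):

import tensorflow as tf

# List GPUs visible to TensorFlow and confirm this is a CUDA build.
print(tf.config.list_physical_devices("GPU"))
print(tf.test.is_built_with_cuda())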

@qlzh727
Member

qlzh727 commented Oct 12, 2023

@sampathweb do you have a testable environment that I can run with?

@fchollet
Member

Also seems to be failing with JAX-GPU now:

github/keras/keras/kokoro/github/ubuntu/gpu/build.sh: line 57:  4493 Aborted                 (core dumped) pytest keras --ignore keras/applications --ignore keras/layers/merging/merging_test.py --ignore keras/trainers/data_adapters/py_dataset_adapter_test.py --ignore keras/backend/jax/distribution_lib_test.py --cov=keras
@sampathweb
Collaborator Author

sampathweb commented Oct 17, 2023

I will work on this tomorrow. I used a Colab V100 as my test environment.

@sampathweb
Collaborator Author

This seems to be a cuDNN/TF compilation issue.

2023-10-17 20:23:09.628643: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2023-10-17 20:23:10.277194: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:447] Loaded runtime CuDNN library: 8.7.0 but source was compiled with: 8.9.4.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2023-10-17 20:23:10.278786: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:574 : FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.

Tested via pytest keras/layers/merging/merging_test.py::MergingLayersTest::test_basic_add
I don't have a resolution yet, but it might be related to this change within the range git log e4a6720f42a..dfcf1d40e46 --oneline:

commit 3de44168950a5972ba4cfa7e3c6cbf4cffa67fe6
Author: A. Unique TensorFlower <gardener@tensorflow.org>
Date:   Mon Sep 18 13:50:11 2023 -0700

    Upgrade to LLVM 17, CUDA 12.2, and CuDNN 8.9.4
    
    This is updating TF's default toolchain to LLVM 17, as well as
    CUDA and cuDNN to the latest releases.
    
    PiperOrigin-RevId: 566403707
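
As a follow-up check (a sketch), the CUDA/cuDNN versions a given TF build was compiled against can be read with tf.sysconfig.get_build_info() and compared with the runtime library version reported in the log above (8.7.0 loaded vs. 8.9.4 expected):

import tensorflow as tf

# Compile-time toolchain versions baked into this TF build.
info = tf.sysconfig.get_build_info()
print("is_cuda_build:", info.get("is_cuda_build"))
print("compiled against CUDA:", info.get("cuda_version"))
print("compiled against cuDNN:", info.get("cudnn_version"))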