Dataproc Serverless batch job failing with the error below

For the last 7 days, our Dataproc Serverless batch job has been failing with the error below:

Configuring index-url as 'https://us-python.pkg.dev/artifact-registry-python-cache/virtual-python/simple/'
SPARK_EXTRA_CLASSPATH=
:: loading settings :: file = /etc/spark/conf/ivysettings.xml
24/09/15 02:20:27 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-google-hadoop-file-system.properties,hadoop-metrics2.properties
24/09/15 02:20:27 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
24/09/15 02:20:27 INFO MetricsSystemImpl: google-hadoop-file-system metrics system started
Traceback (most recent call last):
File "/var/dataproc/tmp/srvls-batch-b6ab0841-0cae-4496-b1fa-b3d434b0c735/python_bq_orc.py", line 49, in <module>
df = spark.read.orc(src_orc_gcs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 828, in orc

The job reads an ORC file from a GCS bucket and was working until 7 days ago. Please help.


Here are a few things you can check to troubleshoot the issue:

First, verify that the service account used by your Dataproc job still has the "Storage Object Viewer" role for the GCS bucket. You can check this in the IAM section of the Google Cloud console. Additionally, review the bucket's access control policies to ensure there have been no changes restricting access.
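
As a quick sanity check, a small script using the google-cloud-storage client can confirm that the active credentials can actually read the object; the bucket and object names below are placeholders for your own:

from google.cloud import storage

# Uses whatever credentials / service account the environment provides
client = storage.Client()
blob = client.bucket("your-bucket").blob("path/to/your-file.orc")

# exists() returns False if the object is missing and raises a Forbidden
# error if the account lacks storage.objects.get on the bucket
print("Object readable:", blob.exists())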

Test the ORC file’s integrity by attempting to read it with gsutil cat gs://your-bucket/your-file.orc. If the file is corrupted or inaccessible, this command will fail. Also, check for any schema changes in the ORC files, as modifications such as adding or removing columns can cause compatibility issues with your Spark job.
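
To check for schema drift specifically, you can also have Spark print the schema it infers today and compare it with what the job expects; a minimal sketch with a placeholder GCS path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-schema-check").getOrCreate()

# Reads the ORC footers and prints the inferred schema; added, removed,
# or renamed columns will show up here
spark.read.orc("gs://your-bucket/your-file.orc").printSchema()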

Check the Dataproc Serverless release notes for any updates or known issues that may have been introduced recently. Changes in the platform or runtime environment can sometimes lead to unexpected failures. Additionally, ensure that the SPARK_EXTRA_CLASSPATH or other Spark configurations haven't been modified.

While the warning regarding the missing metrics configuration is likely unrelated, it's worth verifying that no other configuration changes have been made. Setting spark.logConf=true will also log the effective Spark configuration at startup, which may reveal the underlying issue.
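
If it helps, that property can be set directly on the SparkSession in the job (or passed as a batch property at submission time); a minimal sketch, with the app name chosen arbitrarily:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("debug-config")
    .config("spark.logConf", "true")   # dump the effective SparkConf at startup
    .getOrCreate()
)

# Optional: raise driver log verbosity for this run
spark.sparkContext.setLogLevel("DEBUG")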

To further isolate the problem, review the Spark driver and executor logs in the Dataproc UI or Cloud Logging. A simplified code block that just reads the ORC file can help determine whether the issue lies within the file-reading process or a broader pipeline issue. Additionally, test GCS accessibility by copying the file locally using gsutil cp.
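
For example, a minimal, self-contained script along these lines can be submitted as its own batch to confirm whether the read itself is failing; the path is a placeholder for your actual src_orc_gcs value:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-read-repro").getOrCreate()

# Nothing else from the pipeline: just read the ORC data and force a full scan
df = spark.read.orc("gs://your-bucket/path/to/orc/")
df.show(5, truncate=False)
print("Row count:", df.count())

spark.stop()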

 

Below is the log of the last successful job run:

Srinu6412_0-1726512695693.png

And here is the failed job log:

Srinu6412_1-1726512853503.png

 

 

The comparison between the successful and failed Dataproc Serverless job logs reveals discrepancies that likely contribute to the job failure. The most notable difference is the Java version path. The successful job uses JAVA_HOME=/usr/lib/jvm/temurin-17-jdk-amd64, while the failed job incorrectly references JAVA_HOME=/usr/11b/jvm/temurin-17-jdk-amd64. This incorrect path may lead to runtime errors, impacting the job's ability to execute correctly.

Additionally, the failed job includes extra steps involving PIP configuration (/home/spark/.pip/pip.conf and index-url configuration for Artifact Registry), which are absent in the successful job. These changes suggest modifications in the environment that could be interfering with dependency management or the job’s ability to read ORC files.

The failed job also shows a warning about missing metrics configuration files (hadoop-metrics2-google-hadoop-file-system.properties), which, while not directly causing the failure, indicates broader configuration issues that could affect the job’s stability.

To address these issues, verify that the correct Java path (/usr/lib/jvm/temurin-17-jdk-amd64) is being used. Review recent changes to the environment, particularly any unintended PIP configuration modifications, and revert them to match the setup of the last successful job. Resolving the metrics configuration warning can further ensure a stable environment. Testing a minimal job under the failed setup conditions may also help isolate whether the altered Java path or new configurations are the primary cause of the error. By aligning the environment closely with the last successful setup, you are likely to resolve the recurring job failure.
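
One way to compare the two environments directly is to have the batch log what it actually sees at startup and diff that against the last successful run; a rough sketch (the _jvm handle is PySpark-internal and used here only for diagnostics):

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("env-check").getOrCreate()

# Values to compare against the last successful run's log
print("JAVA_HOME env var:", os.environ.get("JAVA_HOME"))
print("java.home (JVM view):",
      spark.sparkContext._jvm.java.lang.System.getProperty("java.home"))
print("Spark version:", spark.version)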

Hi, I got the same error, except that my Java path is the same between the successful job and the failed job. I was able to run the exact same job two weeks ago, and no changes were applied to my environment between the successful run and the failed run. Were you able to run the job?

Thank you!