Incompatible data schema

We are dumping multiple CSV files into the source bucket for a data zone. Each file has the same number of columns and the same column names.

While processing the CSV files, the asset is throwing this error:

There is data detected in this resource with incompatible data schema. Incompatible schema refers to existing fields changing to a different data type, missing non-optional fields, or 2 schemas having no overlap in fields

On further inspection it was revealed that the earlier CSV files had one column with INTEGER values (100, 199, ...), while the files delivered some days later have the same column in DOUBLE format (101.0, 199.0), causing the issue.

Is there a way to explicitly convert the data type of a BigQuery table column from INTEGER to DOUBLE, or vice versa?

 

Regards 


BigQuery is designed to be schema-aware, expecting consistent data types in each column of a table. When an incoming CSV file changes a column's data type (e.g., from INTEGER to DOUBLE), it disrupts schema consistency. Below are some strategies to address this issue: adjusting the table schema and pre-processing the CSV files.

ALTER TABLE: Anticipating future changes in data types, you can design your BigQuery table with a more flexible schema from the outset. BigQuery's support for altering existing columns is limited to widening conversions (for example INT64 to FLOAT64, NUMERIC, or BIGNUMERIC); for other changes you typically need to create a new table with the desired schema and migrate the data. For example, to widen a column from INTEGER to FLOAT64:

ALTER TABLE your_dataset.your_table
ALTER COLUMN your_column SET DATA TYPE FLOAT64;

SAFE_CAST Function: When querying data, use the SAFE_CAST function to handle data type conversions safely. This function returns NULL instead of an error if the conversion fails, thereby maintaining query integrity:

SELECT SAFE_CAST(your_column AS FLOAT64) AS your_column
FROM your_dataset.your_table;

Cloud Functions/Dataflow: If you control the CSV generation process, employing Cloud Functions or Dataflow to pre-process files before loading them into BigQuery is a robust solution. This approach ensures data type standardization across your files.
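For illustration, here is a minimal Apache Beam sketch (runnable locally or on Dataflow) that rewrites the drifting column as a consistent floating-point value before the files are loaded; the bucket paths and the column index are placeholders to adjust to your layout:

import csv
import io

import apache_beam as beam


def normalize_row(line, column_index=2):
    # Assumption: the column that drifts between INTEGER and DOUBLE is at index 2.
    fields = next(csv.reader([line]))
    try:
        fields[column_index] = str(float(fields[column_index]))  # 100 -> 100.0, 199.0 -> 199.0
    except (ValueError, IndexError):
        pass  # header row or malformed line: pass it through unchanged
    out = io.StringIO()
    csv.writer(out).writerow(fields)
    return out.getvalue().rstrip('\r\n')


with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Read CSVs' >> beam.io.ReadFromText('gs://your_bucket/your_files/*.csv')
        | 'Normalize column' >> beam.Map(normalize_row)
        | 'Write normalized CSVs' >> beam.io.WriteToText(
            'gs://your_bucket/normalized/part', file_name_suffix='.csv')
    )

A Cloud Function triggered on each new object could apply the same per-file normalization with pandas, along the lines of the pandas snippet further below.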

BigQuery External Tables: Creating an external table that points to your CSV files allows you to query the data without loading it directly into BigQuery. This method facilitates on-the-fly data type casting, ensuring consistency before final table creation:

CREATE EXTERNAL TABLE your_dataset.external_table_csv
OPTIONS (
  format = 'CSV',
  uris = ['gs://your_bucket/your_files/*.csv']
);

CREATE TABLE your_dataset.your_table AS
SELECT
  SAFE_CAST(your_column AS FLOAT64) AS your_column,
  -- other columns...
FROM your_dataset.external_table_csv;

Schema Design: Anticipate potential data type changes by using flexible types like FLOAT64 or STRING.
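For instance, here is a minimal sketch (with placeholder project, dataset, and field names) that creates the table up front with a forgiving FLOAT64 column via the google-cloud-bigquery client:

from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("your_column", "FLOAT64"),  # accepts 100 as well as 101.0
]
table = bigquery.Table("your-project.your_dataset.your_table", schema=schema)
client.create_table(table, exists_ok=True)  # no-op if the table already exists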

Data Validation: Perform validation checks on CSV files before loading to ensure consistent data types.
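A small hedged example of such a check with pandas, assuming you know which columns must stay numeric (file and column names are placeholders):

import pandas as pd
from pandas.api.types import is_numeric_dtype


def validate_csv(path, numeric_columns=("your_column",)):
    """Return True if every expected numeric column parsed as a numeric dtype."""
    df = pd.read_csv(path)
    bad = [c for c in numeric_columns if c not in df.columns or not is_numeric_dtype(df[c])]
    if bad:
        print(f"{path}: missing or non-numeric columns: {bad}")
        return False
    return True


if validate_csv("your_file.csv"):
    print("Schema check passed; safe to load.")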

Documentation: Maintain thorough documentation of table schemas and expected data types.

Automation: Incorporate schema alteration or data preprocessing steps into your data pipeline automation.
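For example, a pipeline step (a Cloud Composer task or Cloud Function, say) could run the widening DDL through the BigQuery client before new files are loaded; dataset, table, and column names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
ddl = """
ALTER TABLE `your_dataset.your_table`
ALTER COLUMN your_column SET DATA TYPE FLOAT64
"""
client.query(ddl).result()  # blocks until the schema change completes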

Pre-process your CSV data using Python's Pandas library:

import pandas as pd

df = pd.read_csv('your_file.csv')
df['your_column'] = pd.to_numeric(df['your_column'], errors='coerce')  # Convert to float, handle errors gracefully
df.to_csv('your_file_modified.csv', index=False)

 

  • Ensure BigQuery correctly interprets headers in your CSV files.
  • Utilize BigQuery's schema auto-detection feature if you are unsure of the exact data types in your files (see the load-job sketch below).
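A minimal load-job sketch with auto-detection enabled, using the google-cloud-bigquery client (bucket, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,       # let BigQuery infer the column types
    skip_leading_rows=1,   # treat the first row as the header
)
load_job = client.load_table_from_uri(
    "gs://your_bucket/your_files/*.csv",
    "your_dataset.your_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish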

 

 

Is it possible to create a table partitioned by month and year?

My source bucket is partitioned as bucket/year=xxxx/month=yy/.

Yes, it's possible to create a partitioned table in BigQuery based on year and month. However, BigQuery itself does not support multi-level partitioning directly (e.g., partitioning by both year and month as separate fields). Instead, you would typically create a partitioned table based on a single date or timestamp column, and then use SQL functions to extract the year and month for querying.

Approach 1: Partitioning by Date or Timestamp Column

If your data includes a date or timestamp column, you can create a partitioned table using that column. You can then use SQL to extract the year and month for queries.

CREATE OR REPLACE TABLE `your_dataset.your_table`
PARTITION BY DATE_TRUNC(your_date_column, MONTH)
AS
SELECT
  -- other columns
  -- assumes year and month are zero-padded STRING values such as '2023' and '07'
  PARSE_DATE('%Y%m%d', CONCAT(year, month, '01')) AS your_date_column
FROM
  `your_dataset.your_source_table`;

Approach 2: Partitioning Using Integer Ranges

If you don't have a date or timestamp column, but have separate year and month fields, you can create a synthetic column combining these fields and then partition by it. For example:

  1. Creating the Partitioned Table:
CREATE OR REPLACE TABLE `your_dataset.your_partitioned_table`
PARTITION BY
  -- integer-range partitioning must reference a top-level INT64 column,
  -- so partition on the derived year_month column itself
  RANGE_BUCKET(year_month, GENERATE_ARRAY(201901, 203101, 1))  -- Jan 2019 through Dec 2030
AS
SELECT
  *,
  -- assumes year and month are INT64; adjust the formatting if they are strings
  CAST(FORMAT('%04d%02d', year, month) AS INT64) AS year_month
FROM
  `your_dataset.your_source_table`;
  2. Loading Data:

You can load data from your source bucket into this table by using SQL or a Dataflow job that extracts and transforms the data accordingly.
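As a sketch of the SQL route (driven from Python here only to match the other examples), assuming the new data is queryable via your_dataset.your_source_table and that year and month are INT64, as in the CREATE statement above:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
INSERT INTO `your_dataset.your_partitioned_table`
SELECT
  *,
  CAST(FORMAT('%04d%02d', year, month) AS INT64) AS year_month
FROM `your_dataset.your_source_table`
"""
client.query(sql).result()  # appends the new rows into their year_month partitions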

Approach 3: Using External Tables with Hive Partitioning

If your files are stored in a bucket with a structure like bucket/year=xxxx/month=yy/, you can use an external table and take advantage of Hive-style partitioning:

CREATE OR REPLACE EXTERNAL TABLE `your_dataset.your_external_table`
WITH PARTITION COLUMNS  -- year and month are inferred from the path as partition columns
OPTIONS(
  format = 'CSV',
  -- BigQuery allows only one wildcard per URI, so match everything under the prefix
  uris = ['gs://your_bucket/*'],
  hive_partition_uri_prefix = 'gs://your_bucket/'
);

Querying the Data

Once your table is partitioned, you can easily query it with filters on year and month:

SELECT * 
FROM `your_dataset.your_table`
WHERE EXTRACT(YEAR FROM your_date_column) = 2023
  AND EXTRACT(MONTH FROM your_date_column) = 7;

 

 

If I create a standard BigQuery table from an external table that points to a bucket, and new files are added to the bucket daily, will the changes also be reflected automatically in the standard BigQuery table?

Hi, why is my Dataplex discovery job complaining about an incompatible data schema?

(Attachment: DataType.PNG)

Even though there is no LONG data type in BigQuery, and every value is well within the INT64 range?

The issue with your Dataplex discovery job arises because it encounters a data type labeled as LONG, which BigQuery doesn't recognize. 

To resolve this, you need to ensure that the data you're sending to BigQuery uses a data type it understands. BigQuery uses INTEGER or INT64 for whole numbers, so you'll need to convert the LONG data type to one of these before sending it to BigQuery.

There are a few ways to address this:

  1. Change the Data Source: If possible, modify the data at its source to use the INTEGER or INT64 type.
  2. Use a Data Transformation Tool: Tools like Apache Beam or Cloud Dataflow can be used to convert the data type before it reaches BigQuery.
  3. Define a Custom Schema: In Dataplex, you can configure it to expect LONG data and map it to INTEGER or INT64 during the discovery process.

Hi, thanks for the reply.

How do I define a custom schema in the Dataplex portal? If I update the existing schema

{
  "name": "ADAS",
  "type": "INT64",
  "mode": "NULLABLE"
},

to FLOAT64 and try to save it, the changes are not saved.
 
And what is the difference between an external BigQuery table created via Dataplex and an external table created directly from the source bucket?

The key difference between an external BigQuery table created via Dataplex and one created directly from a source bucket is in management and governance:

  • Via Dataplex: The table is part of Dataplex’s managed environment, offering enhanced governance, metadata management, and integration with tools like data quality checks and lineage tracking.

  • Directly from Source Bucket: The table is a standalone resource in BigQuery, simpler to set up but without the additional governance and management features provided by Dataplex.

If you’re trying to update an existing schema in Dataplex and changes aren’t saving, follow these steps:

  1. Navigate to the Asset: In the Dataplex portal, go to the specific asset whose schema you want to edit.
  2. Edit Schema: Modify the data type (e.g., from INT64 to FLOAT64).
  3. Save Changes: Click "Save" or "Apply." If the changes aren’t saving:
    • Ensure you have the necessary permissions.
    • Verify that the schema adheres to BigQuery standards.