Storing a pandas DataFrame with list-of-lists columns in BigQuery from a CSV file

I have stored a Pandas DataFrame as a CSV file in which some of the columns are lists of lists, like [['Dog', 13], ['Cat', 14]].

What would be the best approach to store this in BigQuery?

I have tried 

ARRAY<STRUCT<STRING, INT64>> and the RECORD type. The ARRAY type keeps throwing an error that ARRAY is not a valid type, while the RECORD type fails with an error that CSV files cannot be loaded with a nested schema.
ACCEPTED SOLUTION

When dealing with nested data structures like lists of lists in Pandas DataFrames, storing them directly in a CSV format poses challenges because CSV is inherently a flat file format that doesn't support nested or complex data types. Google BigQuery also has limitations when loading nested or repeated fields from CSV files. Here's how you can effectively store your data in BigQuery:

Convert the DataFrame to JSON:

 
import pandas as pd

# Assuming `df` is your DataFrame
df.to_json('data.json', orient='records', lines=True) 

Use orient='records' and lines=True to create a newline-delimited JSON (NDJSON) file, which BigQuery can read efficiently.
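
Note that if the column literally contains lists of lists like [['Dog', 13], ['Cat', 14]], to_json will write them as JSON arrays of arrays, which will not match a RECORD schema. Here is a minimal sketch, assuming placeholder names your_column_name, field1, and field2 that match the schema defined in the next step (and assuming the lists are already Python objects; if the column was read back from a CSV as strings, parse it first, e.g. with ast.literal_eval):

import pandas as pd

# Hypothetical DataFrame matching the question's data.
df = pd.DataFrame({"id": [1], "your_column_name": [[["Dog", 13], ["Cat", 14]]]})

# Turn each [value, count] pair into a dict so the NDJSON rows line up with a
# RECORD/REPEATED column that has fields "field1" (STRING) and "field2" (INT64).
df["your_column_name"] = df["your_column_name"].apply(
    lambda pairs: [{"field1": p[0], "field2": p[1]} for p in pairs]
)

df.to_json("data.json", orient="records", lines=True)
# Each line of data.json now looks like:
# {"id":1,"your_column_name":[{"field1":"Dog","field2":13},{"field1":"Cat","field2":14}]}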

Define the BigQuery Schema:

Create a schema that reflects the structure of your data, including nested and repeated fields.

 
[
  {
    "name": "your_column_name",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "field1",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "field2",
        "type": "INT64",
        "mode": "NULLABLE"
      }
    ]
  }
]

Add entries for your other columns in the same way; the schema file must be valid JSON, so it cannot contain comments.

Load the JSON File into BigQuery:

Use the bq command-line tool or the BigQuery web UI to load the data.

Using the bq command-line tool:

 
bq load --source_format=NEWLINE_DELIMITED_JSON \
  your_dataset.your_table \
  data.json \
  schema.json

Replace your_dataset, your_table, data.json, and schema.json with your actual dataset name, table name, data file, and schema file.
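
If you would rather run the load from Python instead of the bq CLI, here is a minimal sketch using the google-cloud-bigquery client library; the project, dataset, table, and field names are placeholders matching the schema above:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project ID

# Schema mirroring schema.json: one repeated RECORD column (add your flat columns too).
schema = [
    bigquery.SchemaField(
        "your_column_name",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("field1", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("field2", "INT64", mode="NULLABLE"),
        ],
    ),
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=schema,
)

with open("data.json", "rb") as f:
    load_job = client.load_table_from_file(
        f, "your_dataset.your_table", job_config=job_config
    )
load_job.result()  # Wait for the load job to complete.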

Alternative Approach: Flatten the Data Before Saving to CSV

If you must use CSV due to constraints, you can flatten the nested lists into a format that can be stored in CSV.

Option 1: Expand Nested Lists into Multiple Rows

  • Pros: Maintains all data points.
  • Cons: Increases the number of rows significantly.
 
df_exploded = df.explode('your_nested_column')
df_exploded.to_csv('data_flattened.csv', index=False) 
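
Keep in mind that explode() only gives you one row per inner pair; each cell still holds a small list like ['Dog', 13], which CSV will simply stringify. A minimal sketch (with hypothetical column names name and value) that also splits the pairs into flat columns:

import pandas as pd

# Hypothetical DataFrame matching the question's data.
df = pd.DataFrame({"id": [1], "your_nested_column": [[["Dog", 13], ["Cat", 14]]]})

# One row per inner pair; the cells still contain two-element lists at this point.
df_exploded = df.explode("your_nested_column").reset_index(drop=True)

# Split each ['Dog', 13] pair into two flat columns that CSV (and BigQuery) can hold.
pairs = pd.DataFrame(
    df_exploded["your_nested_column"].tolist(), columns=["name", "value"]
)
df_flat = pd.concat([df_exploded.drop(columns=["your_nested_column"]), pairs], axis=1)

df_flat.to_csv("data_flattened.csv", index=False)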

Option 2: Serialize Nested Lists into Strings

Convert the nested lists into JSON strings before saving.

 
import json

df['your_nested_column'] = df['your_nested_column'].apply(json.dumps)
df.to_csv('data_with_serialized_columns.csv', index=False)
  • In BigQuery, store this column as a STRING type.
  • You can parse the JSON string in SQL queries using functions such as JSON_EXTRACT(), JSON_EXTRACT_ARRAY(), and JSON_VALUE(); see the sketch below.
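
For example, here is a sketch of parsing the serialized column back into rows, shown via the google-cloud-bigquery client with placeholder project, table, and column names:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project ID

# Unnest the serialized JSON string (e.g. '[["Dog", 13], ["Cat", 14]]') into rows.
query = """
SELECT
  JSON_VALUE(pair, '$[0]') AS name,
  CAST(JSON_VALUE(pair, '$[1]') AS INT64) AS value
FROM `your_dataset.your_table`,
UNNEST(JSON_EXTRACT_ARRAY(your_nested_column)) AS pair
"""

for row in client.query(query).result():
    print(row["name"], row["value"])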


@nidhalhafeez In addition, feel free to explore this sample code, which uses the pandas-gbq package to load a DataFrame into BigQuery, for your future reference.

 

import pandas
import pandas_gbq

# TODO: Set project_id to your Google Cloud Platform project ID and table_id
# to the full destination table ID (including the dataset ID).
project_id = "my-project"
table_id = "my_dataset.my_table"

df = pandas.DataFrame(
    {
        "my_string": ["a", "b", "c"],
        "my_int64": [1, 2, 3],
        "my_float64": [4.0, 5.0, 6.0],
        "my_bool1": [True, False, True],
        "my_bool2": [False, True, False],
        "my_dates": pandas.date_range("now", periods=3),
    }
)

pandas_gbq.to_gbq(df, table_id, project_id=project_id)
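
As a small usage note: if the destination table already exists, to_gbq raises an error by default; you can pass the if_exists parameter ("fail", "replace", or "append") to control that behavior, for example:

pandas_gbq.to_gbq(df, table_id, project_id=project_id, if_exists="append")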

 

I hope this helps.

 

Thank you