I have stored a Pandas DataFrame as a CSV file in which some of the columns are lists of lists, like [['Dog', 13], ['Cat', 14]].
What would be the best approach to store this in BigQuery?
I have tried
When dealing with nested data structures like lists of lists in Pandas DataFrames, storing them directly in CSV poses challenges because CSV is a flat file format that doesn't support nested or complex data types. BigQuery likewise cannot load nested or repeated fields from CSV files at all. Here's how you can effectively store your data in BigQuery:
1. Convert the DataFrame to JSON:
import pandas as pd
# Assuming `df` is your DataFrame
df.to_json('data.json', orient='records', lines=True)
Use orient='records' and lines=True to create a newline-delimited JSON (NDJSON) file, which BigQuery can read efficiently.
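One caveat before loading: to_json writes a list-of-lists cell as nested JSON arrays ([["Dog", 13], ["Cat", 14]]), which BigQuery cannot map onto a REPEATED RECORD schema. Here is a minimal sketch of reshaping the column into a list of objects first, assuming the column and field names match the schema defined in the next step:
import pandas as pd

# Hypothetical column/field names; adjust to your data.
# Turn each inner pair like ['Dog', 13] into an object like
# {'field1': 'Dog', 'field2': 13} so it matches a REPEATED RECORD.
df['your_column_name'] = df['your_column_name'].apply(
    lambda pairs: [{'field1': name, 'field2': count} for name, count in pairs]
)
df.to_json('data.json', orient='records', lines=True)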
2. Define the BigQuery Schema:
Create a schema that reflects the structure of your data, including nested and repeated fields.
[
  {
    "name": "your_column_name",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "field1",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "field2",
        "type": "INT64",
        "mode": "NULLABLE"
      }
    ]
  }
]
Define your other columns in the same array (JSON schema files cannot contain comments).
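For reference, with this schema each line of data.json would need to look something like this (using the example values from the question):
{"your_column_name": [{"field1": "Dog", "field2": 13}, {"field1": "Cat", "field2": 14}]}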
3. Load the JSON File into BigQuery:
Use the bq command-line tool or the BigQuery web UI to load the data.
Using the bq command-line tool:
bq load --source_format=NEWLINE_DELIMITED_JSON \
your_dataset.your_table \
data.json \
schema.json
Replace your_dataset, your_table, data.json, and schema.json with your actual dataset name, table name, data file, and schema file.
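If you prefer to stay in Python, here is a sketch of the same load using the google-cloud-bigquery client library (assuming it is installed and the dataset already exists):
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

# Mirror the schema.json definition in Python.
schema = [
    bigquery.SchemaField(
        "your_column_name", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("field1", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("field2", "INT64", mode="NULLABLE"),
        ],
    ),
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=schema,
)

# Load the NDJSON file and wait for the job to finish.
with open("data.json", "rb") as f:
    job = client.load_table_from_file(
        f, "your_dataset.your_table", job_config=job_config
    )
job.result()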
Alternative Approach: Flatten the Data Before Saving to CSV
If you must use CSV due to constraints, you can flatten the nested lists into a format that can be stored in CSV.
Option 1: Expand Nested Lists into Multiple Rows
df_exploded = df.explode('your_nested_column')
df_exploded.to_csv('data_flattened.csv', index=False)
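Bear in mind that explode only moves each inner list onto its own row, so every cell still contains a two-element list like ['Dog', 13]. Here is a sketch of splitting those pairs into flat scalar columns (the names name and value are made up for illustration):
import pandas as pd

df_exploded = df.explode('your_nested_column')
# Each cell now holds one pair like ['Dog', 13]; split it into two
# scalar columns so the CSV contains no nested values at all.
df_exploded[['name', 'value']] = pd.DataFrame(
    df_exploded['your_nested_column'].tolist(), index=df_exploded.index
)
df_exploded = df_exploded.drop(columns='your_nested_column')
df_exploded.to_csv('data_flattened.csv', index=False)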
Option 2: Serialize Nested Lists into Strings
Convert the nested lists into JSON strings before saving; once loaded into BigQuery as a STRING column, the values can be parsed with JSON functions such as JSON_EXTRACT_ARRAY.
import json
df['your_nested_column'] = df['your_nested_column'].apply(json.dumps)
df.to_csv('data_with_serialized_columns.csv', index=False)
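When you read such a CSV back into pandas, the column comes back as plain strings; a quick sketch of restoring the nested lists (assuming the same column name):
import json
import pandas as pd

df_back = pd.read_csv('data_with_serialized_columns.csv')
# Each cell is a JSON string like '[["Dog", 13], ["Cat", 14]]';
# parse it back into Python lists.
df_back['your_nested_column'] = df_back['your_nested_column'].apply(json.loads)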
@nidhalhafeez In addition, feel free to explore this sample code using the pandas-gbq package to load a DataFrame into BigQuery, for future reference.
import pandas
import pandas_gbq

# TODO: Set project_id to your Google Cloud Platform project ID.
project_id = "my-project"

# TODO: Set table_id to the full destination table ID
# (including the dataset ID).
table_id = "my_dataset.my_table"

df = pandas.DataFrame(
    {
        "my_string": ["a", "b", "c"],
        "my_int64": [1, 2, 3],
        "my_float64": [4.0, 5.0, 6.0],
        "my_bool1": [True, False, True],
        "my_bool2": [False, True, False],
        "my_dates": pandas.date_range("now", periods=3),
    }
)

pandas_gbq.to_gbq(df, table_id, project_id=project_id)
I hope this helps.
Thank you