I have stored a Pandas DataFrame as a CSV file in which some of the columns are lists of lists, like [['Dog', 13], ['Cat', 14]].
What would be the best approach to store this in BigQuery?
I have tried
When dealing with nested data structures like lists of lists in Pandas DataFrames, storing them directly in CSV poses challenges because CSV is a flat file format that doesn't support nested or complex data types. BigQuery likewise cannot load nested or repeated fields from CSV files at all. Here's how you can effectively store your data in BigQuery:
1. Convert the DataFrame to JSON:
import pandas as pd
# Assuming `df` is your DataFrame
df.to_json('data.json', orient='records', lines=True)
Use orient='records' and lines=True to create a newline-delimited JSON (NDJSON) file, which BigQuery can read efficiently.
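One caveat before loading: to_json writes a list-of-lists cell as nested JSON arrays ([["Dog", 13], ["Cat", 14]]), which BigQuery cannot map onto a REPEATED RECORD schema. Here is a minimal sketch of reshaping the column into a list of objects first, assuming the column and field names match the schema defined in the next step:
import pandas as pd

# Hypothetical column/field names; adjust to your data.
# Turn each inner pair like ['Dog', 13] into an object like
# {'field1': 'Dog', 'field2': 13} so it matches a REPEATED RECORD.
df['your_column_name'] = df['your_column_name'].apply(
    lambda pairs: [{'field1': name, 'field2': count} for name, count in pairs]
)
df.to_json('data.json', orient='records', lines=True)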
2. Define the BigQuery Schema:
Create a schema that reflects the structure of your data, including nested and repeated fields.
[
  {
    "name": "your_column_name",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "field1",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "field2",
        "type": "INT64",
        "mode": "NULLABLE"
      }
    ]
  }
]
Define your other columns in the same array (JSON schema files cannot contain comments).
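For reference, with this schema each line of data.json would need to look something like this (using the example values from the question):
{"your_column_name": [{"field1": "Dog", "field2": 13}, {"field1": "Cat", "field2": 14}]}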
3. Load the JSON File into BigQuery:
Use the bq command-line tool or the BigQuery web UI to load the data.
Using the bq command-line tool:
bq load --source_format=NEWLINE_DELIMITED_JSON \
your_dataset.your_table \
data.json \
schema.json
Replace your_dataset, your_table, data.json, and schema.json with your actual dataset name, table name, data file, and schema file.
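If you prefer to stay in Python, here is a sketch of the same load using the google-cloud-bigquery client library (assuming it is installed and the dataset already exists):
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

# Mirror the schema.json definition in Python.
schema = [
    bigquery.SchemaField(
        "your_column_name", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("field1", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("field2", "INT64", mode="NULLABLE"),
        ],
    ),
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=schema,
)

# Load the NDJSON file and wait for the job to finish.
with open("data.json", "rb") as f:
    job = client.load_table_from_file(
        f, "your_dataset.your_table", job_config=job_config
    )
job.result()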
Alternative Approach: Flatten the Data Before Saving to CSV
If you must use CSV due to constraints, you can flatten the nested lists into a format that can be stored in CSV.
Option 1: Expand Nested Lists into Multiple Rows
df_exploded = df.explode('your_nested_column')
df_exploded.to_csv('data_flattened.csv', index=False)
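Bear in mind that explode only moves each inner list onto its own row, so every cell still contains a two-element list like ['Dog', 13]. Here is a sketch of splitting those pairs into flat scalar columns (the names name and value are made up for illustration):
import pandas as pd

df_exploded = df.explode('your_nested_column')
# Each cell now holds one pair like ['Dog', 13]; split it into two
# scalar columns so the CSV contains no nested values at all.
df_exploded[['name', 'value']] = pd.DataFrame(
    df_exploded['your_nested_column'].tolist(), index=df_exploded.index
)
df_exploded = df_exploded.drop(columns='your_nested_column')
df_exploded.to_csv('data_flattened.csv', index=False)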
Option 2: Serialize Nested Lists into Strings
Convert the nested lists into JSON strings before saving; once loaded into BigQuery as a STRING column, the values can be parsed with JSON functions such as JSON_EXTRACT_ARRAY.
import json
df['your_nested_column'] = df['your_nested_column'].apply(json.dumps)
df.to_csv('data_with_serialized_columns.csv', index=False)
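When you read such a CSV back into pandas, the column comes back as plain strings; a quick sketch of restoring the nested lists (assuming the same column name):
import json
import pandas as pd

df_back = pd.read_csv('data_with_serialized_columns.csv')
# Each cell is a JSON string like '[["Dog", 13], ["Cat", 14]]';
# parse it back into Python lists.
df_back['your_nested_column'] = df_back['your_nested_column'].apply(json.loads)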
@nidhalhafeez In addition, feel free to explore this sample code using the pandas-gbq package to load a DataFrame into BigQuery, for future reference.
import pandas
import pandas_gbq

# TODO: Set project_id to your Google Cloud Platform project ID.
project_id = "my-project"

# TODO: Set table_id to the full destination table ID
# (including the dataset ID).
table_id = "my_dataset.my_table"

df = pandas.DataFrame(
    {
        "my_string": ["a", "b", "c"],
        "my_int64": [1, 2, 3],
        "my_float64": [4.0, 5.0, 6.0],
        "my_bool1": [True, False, True],
        "my_bool2": [False, True, False],
        "my_dates": pandas.date_range("now", periods=3),
    }
)

pandas_gbq.to_gbq(df, table_id, project_id=project_id)
I hope this helps.
Thank you