Hello,
I'm trying to create a table that stores all the events from two other tables, with some aggregations, for the past 120 days. I tried to solve the problem with a materialized view, but it seems I hit some limitations.
Because of the data volume - each full query scans 60-70 TB - I'm trying to figure out how to optimize the solution.
What I've done so far:
1. I created a view - 'events_view'
CREATE VIEW events_view AS
SELECT colA, colB, date_column, COUNT(*) AS event_count
FROM table_a
WHERE date_column >= DATE_SUB(CURRENT_DATE(), INTERVAL 120 DAY)
GROUP BY colA, colB, date_column
UNION ALL
SELECT colA, colB, date_column, COUNT(*) AS event_count
FROM table_b
WHERE date_column >= DATE_SUB(CURRENT_DATE(), INTERVAL 120 DAY)
GROUP BY colA, colB, date_column;
2. Then I'm using the view to create a table 'events_table'
CREATE TABLE events_table
PARTITION BY date_column AS
SELECT * FROM events_view;
If I write a procedure that does the following:
1. Deletes data older than 120 days.
2. Inserts the data for the current date.
DELETE FROM events_table WHERE date_column < DATE_SUB(CURRENT_DATE(), INTERVAL 120 DAY);
INSERT INTO events_table
SELECT * FROM events_view WHERE date_column = CURRENT_DATE();
Will this solve my issue, or is there a far more elegant and optimal way to do it? My other concern is that the view will scan all the data for the past 120 days on every run, even with a WHERE date_column = current_date clause.
Your concern is valid: using a view that filters data for the past 120 days and then querying it—even with an additional WHERE date_column = current_date clause—can still result in scanning all data for those 120 days. This is because the optimizer may not effectively prune partitions when the filtering is done inside the view definition, leading to inefficient queries that scan large volumes of data unnecessarily.
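For illustration, compare the two query shapes below (a sketch reusing the names from your post; whether pruning actually happens depends on how table_a and table_b are partitioned):

-- Filtering through the view: the aggregation inside the view may run over
-- the full 120-day window before the outer filter is applied.
SELECT * FROM events_view
WHERE date_column = CURRENT_DATE();

-- Filtering the base table directly: the predicate sits next to the scan,
-- so BigQuery can prune partitions before aggregating (assuming table_a is
-- partitioned on date_column).
SELECT colA, colB, date_column, COUNT(*) AS event_count
FROM `your_dataset.table_a`
WHERE date_column = CURRENT_DATE()
GROUP BY colA, colB, date_column;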
Here's how you can optimize your solution:
1. Optimized View:
Keep your view as is:
CREATE OR REPLACE VIEW events_view AS
SELECT
  colA,
  colB,
  date_column,
  COUNT(*) AS event_count
FROM `your_dataset.table_a`
WHERE date_column >= DATE_SUB(CURRENT_DATE(), INTERVAL 120 DAY)
GROUP BY colA, colB, date_column
UNION ALL
SELECT
  colA,
  colB,
  date_column,
  COUNT(*) AS event_count
FROM `your_dataset.table_b`
WHERE date_column >= DATE_SUB(CURRENT_DATE(), INTERVAL 120 DAY)
GROUP BY colA, colB, date_column;
2. Partitioned and Clustered Table:
Create the table, partitioned on date_column and clustered on colA and colB:
CREATE OR REPLACE TABLE `your_dataset.events_table`
PARTITION BY date_column
CLUSTER BY colA, colB AS
SELECT * FROM events_view;
3. Scheduled Query with Efficient Update:
Instead of deleting and re-inserting, schedule a query that inserts only the data for the current day:
INSERT INTO `your_dataset.events_table`
SELECT
  colA,
  colB,
  date_column,
  COUNT(*) AS event_count
FROM (
  SELECT colA, colB, date_column FROM `your_dataset.table_a` WHERE date_column = CURRENT_DATE()
  UNION ALL
  SELECT colA, colB, date_column FROM `your_dataset.table_b` WHERE date_column = CURRENT_DATE()
)
GROUP BY colA, colB, date_column;
4. Partition Expiration:
Set a partition expiration policy on events_table so partitions older than 120 days are removed automatically - this replaces the manual DELETE step:
ALTER TABLE `your_dataset.events_table`
SET OPTIONS (
partition_expiration_days = 120
);
Thank you very much @ms4446!
Just a question about the scheduled query - will it run every time new data is inserted into one of the source tables (table_a or table_b), or can an interval be set?
And do you think there could be duplicates? Since there are no unique keys in this table, if we have a couple of runs during the day, can this condition lead to multiple inserts?
WHERE date_column = CURRENT_DATE()
And one question about the table
your_dataset.events_table
If we go only with partitioning on date_column, without clustering on colA and colB, will this have a negative effect on the table?
Thank you!
In BigQuery, scheduled queries run at fixed intervals rather than being triggered by data insertion into tables like table_a or table_b. You can define a schedule, such as hourly or daily, to run the query at regular intervals. If you require the query to run every time new data is inserted, you would need an external trigger mechanism, such as Google Cloud Functions or Cloud Run, combined with Pub/Sub notifications for new data events.
If your scheduled query runs multiple times a day with the condition WHERE date_column = CURRENT_DATE(), there’s a risk of inserting duplicate records, especially since there are no unique keys in the table. To mitigate this, you can add logic to check for existing records before inserting new ones.
One way to do this is by adding a HAVING clause to ensure the query inserts data only if it does not already exist for the current date. Alternatively, using a MERGE statement is a more robust solution to avoid duplicates, as it allows you to update existing rows and insert new ones only when there is no match. This ensures that your data remains accurate and avoids multiple inserts for the same day.
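For example, here is a minimal MERGE sketch (reusing the table and column names from above); re-running it on the same day updates the existing counts instead of inserting duplicate rows:

MERGE `your_dataset.events_table` AS target
USING (
  SELECT colA, colB, date_column, COUNT(*) AS event_count
  FROM (
    SELECT colA, colB, date_column FROM `your_dataset.table_a` WHERE date_column = CURRENT_DATE()
    UNION ALL
    SELECT colA, colB, date_column FROM `your_dataset.table_b` WHERE date_column = CURRENT_DATE()
  )
  GROUP BY colA, colB, date_column
) AS source
ON  target.colA = source.colA
AND target.colB = source.colB
AND target.date_column = source.date_column
WHEN MATCHED THEN
  -- Same key already loaded today: refresh the count instead of duplicating it.
  UPDATE SET event_count = source.event_count
WHEN NOT MATCHED THEN
  INSERT (colA, colB, date_column, event_count)
  VALUES (source.colA, source.colB, source.date_column, source.event_count);

Adding an explicit target.date_column = CURRENT_DATE() predicate to the ON clause can also help BigQuery prune the target's partitions during the merge.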
Clustering on columns colA and colB can significantly improve query performance, especially when these columns are frequently used for filtering or aggregation in queries. Without clustering, queries that involve colA or colB may require scanning more data, resulting in higher query costs and slower performance.
However, if your queries rarely filter or group by colA and colB, skipping clustering may not have a significant negative impact. In this case, partitioning by date_column alone might be sufficient to optimize query performance. Clustering is most beneficial when you frequently need to run queries that group or filter on the clustered columns.
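As a rough illustration (the filter value is a placeholder), a query like the following benefits from CLUSTER BY colA, colB, because BigQuery can skip storage blocks whose colA values don't match instead of scanning every whole partition:

SELECT date_column, SUM(event_count) AS total_events
FROM `your_dataset.events_table`
WHERE date_column >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  AND colA = 'some_value'  -- hypothetical filter value
GROUP BY date_column;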
Scheduled queries in BigQuery operate at fixed intervals, and external triggers would be needed for more dynamic execution. To avoid duplicates when running queries multiple times a day, consider using a HAVING clause or a MERGE statement. Finally, clustering on colA and colB is advisable when frequent filtering or aggregation on these columns is necessary, as it can improve query performance and reduce costs. If these columns are not commonly queried, partitioning by date_column alone may suffice.