Calculate Dataproc Cost

I am trying to calculate the cost of Dataproc, display it on a Looker report, and automate the whole process. I am loading the data into BigQuery with the help of Airflow. How can we achieve this?


Hi @shubham21feb,

Welcome to the Google Cloud Community!

From what I gather, you’re looking to build an automated system that calculates your Google Cloud Dataproc costs and displays them in a clear and understandable format using Looker dashboards.

The best way to handle your Google Cloud billing data is to export it directly to BigQuery. Google Cloud can automatically send your detailed billing info to BigQuery throughout the day. From there, you can analyze the data in BigQuery or visualize it using tools like Looker Studio.

To accomplish this, first enable Cloud Billing export to BigQuery. You can export three types of billing data:

  1. Standard - It includes standard Cloud Billing account cost and usage details, such as account ID, invoice date, services, SKUs, projects, labels, locations, costs, usage, credits, adjustments, and currency.
  2. Detailed - It contains comprehensive Cloud Billing account cost and usage information. This includes all standard usage cost data, plus detailed resource-level information, such as costs associated with specific virtual machines or SSDs generating service usage.
  3. Pricing - Contains Cloud Billing account pricing details, including account ID, services, SKUs, products, geographic metadata, pricing units, currency, aggregation methods, and pricing tiers.

Note: Once you enable Cloud Billing export to BigQuery, billing data tables are automatically created within your BigQuery dataset. Be sure to consider the data load frequency and manage permissions carefully across Cloud Billing, BigQuery, and Composer.
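Once the export is populated, you can run a quick sanity check to see which Dataproc SKUs are accruing cost. Below is a minimal sketch, assuming the standard usage cost export table (its name typically follows the pattern gcp_billing_export_v1_<BILLING_ACCOUNT_ID>; substitute your own project, dataset, and table name):

SELECT
    sku.description AS sku,
    SUM(cost) AS total_cost
FROM
    `project_name.dataset_name.gcp_billing_export_v1_XXXXXX`  -- assumed export table name
WHERE
    service.description LIKE '%Dataproc%'
GROUP BY
    sku
ORDER BY
    total_cost DESC;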

Below are the options:

Option 1: Create a view in BigQuery specifically for Dataproc cost resources based on the exported data, then establish a connection between Looker and the BigQuery dataset:

CREATE OR REPLACE VIEW `project_name.dataset_name.view_name` AS (
SELECT
    usage_start_time,
    cost /* select all columns, or only the columns needed in your view, to optimize the query */
FROM
    `project_name.dataset_name.table_name`
WHERE
    service.description LIKE '%Dataproc%' /* condition to select only Dataproc resources */
);

Terms used:

  • project_name: The name of the project you’re working on.
  • dataset_name: The name of your dataset.
  • view_name: The name of the view you are creating.
  • table_name: The name of the billing export table you query from.
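Once the view exists, you can point Looker at it directly or pre-aggregate it for a time-series chart. Here is a minimal sketch, assuming the view above (project_name, dataset_name, and view_name are the same placeholders), that sums daily Dataproc cost:

SELECT
    DATE(usage_start_time) AS usage_date,
    SUM(cost) AS daily_dataproc_cost
FROM
    `project_name.dataset_name.view_name`
GROUP BY
    usage_date
ORDER BY
    usage_date;

If you also keep the labels column in the view, you can typically break the cost down per cluster using the goog-dataproc-cluster-name label that Dataproc applies to cluster resources.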

Option 2: You can calculate costs directly within an Airflow DAG without needing to create a view table. Below is an example code snippet. Once the table is created, you can establish a connection with Looker.

# import statements
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator
# successor of the legacy contrib BigQueryOperator; requires the apache-airflow-providers-google package
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

# set your arguments
default_args = {
    'depends_on_past': False,  # defaults to False
    'start_date': datetime(2024, 1, 1),  # set your start date
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # add arguments based on your needs
}

with DAG(
    dag_id='dataproc_cost_calculation',  # name of your DAG
    default_args=default_args,
    schedule_interval=timedelta(days=1),  # run daily
    catchup=False,
) as dag:

    # Dummy start task
    start = DummyOperator(
        task_id='start',
    )

    # BigQuery task: runs the cost query and writes the result to a destination table
    calculate_dataproc_cost = BigQueryExecuteQueryOperator(
        task_id='calculate_dataproc_cost',
        use_legacy_sql=False,  # use standard SQL
        sql='''
            -- Insert your SQL for Dataproc cost
        ''',
        destination_dataset_table='project_name.dataset_name.dataproc_cost_table',  # table Looker will read
        write_disposition='WRITE_TRUNCATE',
    )

    # Dummy end task
    end = DummyOperator(
        task_id='end',
    )

    # set task dependencies
    start >> calculate_dataproc_cost >> end
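For illustration, here is a minimal sketch of the kind of statement that could go into the sql parameter above. The table name is the same placeholder as in Option 1, the destination table is hypothetical, and {{ ds }} is Airflow's templated execution date (the sql field of the BigQuery operator is templated), so each run only recomputes that day's cost:

SELECT
    DATE(usage_start_time) AS usage_date,
    SUM(cost) AS dataproc_cost
FROM
    `project_name.dataset_name.table_name`
WHERE
    service.description LIKE '%Dataproc%'
    AND DATE(usage_start_time) = '{{ ds }}'
GROUP BY
    usage_date

With a single-day filter like this, write_disposition='WRITE_APPEND' is usually the better fit; keep WRITE_TRUNCATE only if the query rebuilds the full history on every run.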

To connect your dataset to Looker, refer to this documentation. If you’re using Looker Studio, please consult this documentation instead.

I hope the above information is helpful.

Hi @caryna,

Thanks for the answer.

How can I calculate the cost of serverless and ephemeral clusters from the cloudaudit_googleapis_com_activity log table?

What factors should be considered when calculating the cost of a serverless cluster? Do we need to include the compute cost along with the DCU cost?