Build a hybrid data processing footprint using Dataproc on Google Distributed Cloud
Antonio Scaramuzzino
Senior Product Manager
Chris Nauroth
Senior Staff Software Engineer
Google Cloud customers interested in building or modernizing their data lake infrastructure often need to maintain at least part of their workloads and data on-premises, because of regulatory or operational requirements.
Thanks to Dataproc on Google Distributed Cloud, introduced in preview at Google Cloud Next ’24, you can now fully modernize your data lake with cloud-based technology while building a hybrid data processing footprint that lets you store and process on-prem data that you can’t move to the cloud.
Dataproc on Google Distributed Cloud lets you run Apache Spark processing workloads on-prem, using Google-provided hardware located within your data center, while maintaining consistency between the technology you use in the cloud and locally.
For example, a large telecommunications company in Europe is modernizing their data lake on Google Cloud, while keeping Personally Identifiable Information (PII) data on-prem, on Google Distributed Cloud, to satisfy regulatory requirements.
In this blog, we will show how to use Dataproc on Google Distributed Cloud to read on-prem PII data, calculate aggregate metrics, and upload the resulting dataset to the cloud-based data lake on Google Cloud Storage.
Aggregate and anonymize sensitive data on-prem
In our demo scenario, the customer is a telecommunications company storing event logs that record users’ calls.
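The exact schema varies by operator, but for this walkthrough assume a log along these lines (the column names are illustrative and are reused in the snippets below):

```
caller_number     STRING     -- PII: subscriber phone number
callee_number     STRING     -- PII: dialed phone number
call_timestamp    TIMESTAMP  -- when the call was placed
location          STRING     -- cell/area identifier where the call originated
signal_strength   DOUBLE     -- measured signal strength for the call
call_duration_s   BIGINT     -- call duration in seconds
```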
This dataset contains PII. For regulatory compliance, PII must remain on-prem, in the customer’s own data center. To satisfy this requirement, the customer uses S3-compatible object storage on-premises to store this data. However, the customer would now like to use their broader data lake in Google Cloud to analyze signal_strength by location and identify the best areas for new infrastructure investments.
To integrate with Google Cloud’s data analytics services while still satisfying compliance requirements, Dataproc on Google Distributed Cloud supports fully local execution of Spark jobs, such as one that aggregates signal_strength by location.
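Consider this sample Spark code, sketched here as a minimal PySpark job. The bucket paths and column names are illustrative assumptions, the cluster is assumed to be configured with connectors for both the S3-compatible store (s3a://) and Cloud Storage (gs://), and the input and output locations are passed in as arguments so they can be supplied at submission time:

```python
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Input and output locations are supplied at submission time (see the
# SparkApplication resource later in this post); paths and column names
# used here are illustrative.
input_path, output_path = sys.argv[1], sys.argv[2]

spark = SparkSession.builder.appName("signal-quality-by-location").getOrCreate()

# Read raw call event logs from the on-prem, S3-compatible object store.
events = spark.read.parquet(input_path)

# Aggregate signal strength per location. Only non-identifying columns appear
# in the result, so PII fields such as caller/callee numbers never leave the
# data center.
signal_by_location = (
    events.groupBy("location")
    .agg(
        F.avg("signal_strength").alias("avg_signal_strength"),
        F.count("*").alias("call_count"),
    )
)

# Write the anonymized, aggregated result to the cloud data lake on Cloud Storage.
signal_by_location.write.mode("overwrite").parquet(output_path)
```

Because the job only writes aggregates over non-identifying columns, the dataset that lands in Cloud Storage contains no PII.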
Dataproc on GDC exposes custom resources in the Kubernetes Resource Model (KRM) API to support Spark application submission. First, users obtain credentials for the GDC cluster.
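The exact workflow depends on how your GDC cluster is exposed; one common pattern, assuming the cluster is registered to a fleet and reachable through the Connect gateway, looks like this (the membership and project names are placeholders):

```sh
# Fetch a kubeconfig entry for the GDC cluster via the Connect gateway
# (membership and project IDs below are placeholders).
gcloud container fleet memberships get-credentials GDC_CLUSTER_MEMBERSHIP \
    --project=PROJECT_ID

# Confirm that kubectl is pointed at the cluster.
kubectl cluster-info
```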
Then, users can run the job shown above by creating a SparkApplication custom resource that specifies the input location in local object storage and the output location in Cloud Storage.
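The manifest below sketches what such a resource might look like; the apiVersion and field names are assumptions for illustration, so consult the Dataproc on GDC documentation for the authoritative schema:

```yaml
# Illustrative SparkApplication manifest -- apiVersion and field names are
# assumptions, not the authoritative Dataproc on GDC schema.
apiVersion: dataprocgdc.cloud.google.com/v1alpha1
kind: SparkApplication
metadata:
  name: signal-quality-by-location
  namespace: dataproc-apps          # placeholder namespace
spec:
  pySparkApplicationConfig:
    mainPythonFileUri: "s3a://onprem-artifacts/jobs/signal_quality_by_location.py"
    args:
      - "s3a://onprem-call-events/events/"                   # on-prem input (S3-compatible object storage)
      - "gs://example-datalake/signal_quality_by_location/"  # output in Cloud Storage
```

Submission and monitoring then follow the usual Kubernetes workflow: apply the manifest with kubectl apply -f and track progress with kubectl get on the resource.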
The resulting output in Cloud Storage identifies several areas of low signal quality.
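The output can be inspected directly in the bucket, for example with the gcloud CLI (the path below matches the illustrative one used above):

```sh
# List the aggregated Parquet files written by the job (illustrative path).
gcloud storage ls gs://example-datalake/signal_quality_by_location/
```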
This dataset is now available in Cloud Storage, with PII removed, as part of the customer’s broader Google Cloud data lake strategy. This opens up additional analysis, such as trending over time, or bringing in other data analytics products such as BigQuery and Dataproc Serverless.
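For example, the aggregated Parquet output could be queried in place from BigQuery through an external table; the dataset and path names below are illustrative:

```sql
-- Illustrative: expose the aggregated output to BigQuery as an external table.
CREATE OR REPLACE EXTERNAL TABLE telco_analytics.signal_quality_by_location
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://example-datalake/signal_quality_by_location/*.parquet']
);
```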
Learn more
In this blog, we saw how you can use Dataproc on Google Distributed Cloud to create a hybrid data processing footprint: sensitive data that must remain in your data center is processed on-prem, while the rest of your data moves to the cloud. Dataproc on Google Distributed Cloud lets you modernize your data lake while respecting regulatory and operational data residency requirements. To learn more, visit the Dataproc and Google Distributed Cloud documentation.