Dataproc variants - Pros/Cons and Use Cases

There are three Dataproc options:

  1. Dataproc on GCE
  2. Dataproc with GKE
  3. Dataproc Serverless

What are the pros and cons of each? I understand containers, serverless, and IaaS well, but I am looking at this more from an operations and cost perspective.

ACCEPTED SOLUTION

Hello,

Thank you for contacting Google Cloud Community!

Dataproc on GCE (Google Compute Engine)

Pros:

  • Full control: Offers the highest level of control over your cluster configuration.
  • Flexibility: Can be tailored to specific workloads and performance requirements.
  • Cost-effective for long-running clusters: Ideal for workloads that require consistent compute resources.

Cons:

  • Requires management: You create, size, and scale clusters yourself, which adds operational overhead.
  • Higher baseline costs: You pay for the provisioned VMs for as long as the cluster runs, whether or not jobs are using them.

Dataproc with GKE (Google Kubernetes Engine)

Pros:

  • Managed Kubernetes: Leverages the managed Kubernetes platform for cluster management.
  • Container orchestration: Provides advanced container orchestration capabilities.
  • Hybrid workloads: Can run both batch and streaming workloads.

Cons:

  • Steeper learning curve: Requires familiarity with Kubernetes concepts and best practices.
  • Additional costs: May incur additional costs for GKE resources and services.

Dataproc Serverless

Pros:

  • Fully managed: No cluster management required.
  • Pay-as-you-go: Only pay for the resources used, making it cost-effective for intermittent workloads.
  • Scalability: Automatically scales to meet workload demands.

Cons:

  • Limited control: Offers less control over cluster configuration compared to Dataproc on GCE.
  • Potential for cold starts: Each workload provisions its own resources, so jobs can take longer to start than on an already-running cluster.

Operational Considerations and Cost

  • Operational overhead: Dataproc on GCE requires the most operational overhead, while Dataproc Serverless requires the least.
  • Cost: Dataproc Serverless is generally the most cost-effective option for intermittent workloads, while Dataproc on GCE can be more cost-effective for long-running clusters.
  • Workload requirements: Consider the specific requirements of your workloads, such as batch processing, streaming, or machine learning, to determine the most suitable option.
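
To make the cost point above concrete, here is a minimal Python sketch that compares an always-on GCE cluster, an ephemeral GCE cluster (created per job and deleted afterwards), and Serverless for a given monthly workload. Every number and name in it is a hypothetical placeholder chosen for illustration, not a Google Cloud price or API: the hourly rates, the per-job startup overhead, and the 730-hour month are all assumptions you would replace with figures from the pricing pages for your own region and machine types.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical hourly rates; NOT real Google Cloud prices."""
    cluster_per_hour: float          # assumed: all cluster VMs plus the Dataproc fee
    serverless_per_job_hour: float   # assumed: serverless charge per hour of job runtime


def always_on_cost(rates: Rates, hours_in_month: float = 730.0) -> float:
    # The cluster is billed for the whole month, busy or idle.
    return rates.cluster_per_hour * hours_in_month


def ephemeral_cost(rates: Rates, job_hours: float, jobs: int,
                   startup_hours_per_job: float = 0.05) -> float:
    # The cluster exists only while a job runs, plus an assumed
    # create/delete overhead that is paid once per job.
    return rates.cluster_per_hour * (job_hours + jobs * startup_hours_per_job)


def serverless_cost(rates: Rates, job_hours: float) -> float:
    # Billed only for job runtime in this simplified model.
    return rates.serverless_per_job_hour * job_hours


def cheapest(rates: Rates, job_hours: float, jobs: int) -> str:
    costs = {
        "always_on": always_on_cost(rates),
        "ephemeral": ephemeral_cost(rates, job_hours, jobs),
        "serverless": serverless_cost(rates, job_hours),
    }
    # Return the name of the option with the lowest modeled cost.
    return min(costs, key=costs.get)


if __name__ == "__main__":
    rates = Rates(cluster_per_hour=2.0, serverless_per_job_hour=3.0)
    for job_hours, jobs in [(100.0, 50), (700.0, 1400)]:
        print(job_hours, jobs, cheapest(rates, job_hours, jobs))
```

The model deliberately ignores storage, networking, autoscaling, and discounts, so treat its output as a way to reason about break-even points rather than as a cost estimate. The general pattern it shows is that per-job overhead matters more as jobs get shorter and more frequent, while an always-on cluster only pays off when it is busy for most of the month.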

Regards,

Jai Ade

3 REPLIES

Thanks so much, @jaia.

I accepted your solution, but I have one question on cost: with Dataproc on GCE, can I achieve the same benefits as Dataproc Serverless by using ephemeral clusters, that is, clusters that are deleted once their jobs complete? With that approach I don't see a huge difference between the two variants.