This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your Compute Engine, Google Kubernetes Engine (GKE), and serverless resources.
Compute Engine
This section provides guidance to help you optimize the performance of your Compute Engine resources.
Autoscale resources
Managed instance groups (MIGs) let you scale your stateless apps deployed on Compute Engine VMs efficiently. Autoscaling helps your apps continue to deliver predictable performance when the load increases. In a MIG, a group of Compute Engine VMs is launched based on a template that you define. In the instance group configuration, you configure an autoscaling policy, which specifies one or more signals that the autoscaler uses to scale the group. The autoscaling signals can be based on schedules, such as a start time and duration, or on target metrics such as average CPU utilization. For more information, see Autoscaling groups of instances.
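As a configuration sketch, the following gcloud commands create a MIG from a template and attach a CPU-based autoscaling policy. The resource names, zone, machine type, and thresholds are illustrative placeholders:

```shell
# Create an instance template that defines the VMs in the group.
gcloud compute instance-templates create web-template \
    --machine-type=e2-standard-4 \
    --image-family=debian-12 \
    --image-project=debian-cloud

# Create a managed instance group (MIG) based on the template.
gcloud compute instance-groups managed create web-mig \
    --template=web-template \
    --size=2 \
    --zone=us-central1-a

# Attach an autoscaling policy that targets 60% average CPU utilization.
gcloud compute instance-groups managed set-autoscaling web-mig \
    --zone=us-central1-a \
    --min-num-replicas=2 \
    --max-num-replicas=10 \
    --target-cpu-utilization=0.6 \
    --cool-down-period=90
```

The cool-down period gives new VMs time to initialize before their metrics are used in scaling decisions, which avoids premature scale-in.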
Disable SMT
Each virtual CPU (vCPU) that you allocate to a Compute Engine VM is implemented as a single hardware multithread. By default, two vCPUs share a physical CPU core. This architecture is called simultaneous multi-threading (SMT).
For workloads that are highly parallel or that perform floating point calculations (such as transcoding, Monte Carlo simulations, genetic sequence analysis, and financial risk modeling), you can improve performance by disabling SMT. For more information, see Set the number of threads per core.
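A minimal sketch of disabling SMT at VM creation time, assuming an N2 machine type (the VM name and zone are placeholders; SMT settings are supported only on machine types with more than one vCPU):

```shell
# Create a VM that runs one thread per physical core (SMT disabled).
gcloud compute instances create hpc-vm \
    --zone=us-central1-a \
    --machine-type=n2-standard-8 \
    --threads-per-core=1
```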
Use GPUs
For workloads such as machine learning and visualization, you can add graphics processing units (GPUs) to your VMs. Compute Engine provides NVIDIA GPUs in passthrough mode so that your VMs have direct control over the GPUs and the associated memory. For graphics-intensive workloads such as 3D visualization, you can use NVIDIA RTX virtual workstations. After you deploy the workloads, monitor the GPU usage and review the options for optimizing GPU performance.
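As a sketch, the following command attaches an NVIDIA T4 GPU to a VM at creation time. The VM name, zone, and GPU type are placeholders; GPU VMs must use a host maintenance policy of TERMINATE, and the NVIDIA driver must be installed on the VM after creation:

```shell
# Create a VM with one NVIDIA T4 GPU attached in passthrough mode.
gcloud compute instances create gpu-vm \
    --zone=us-central1-a \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --maintenance-policy=TERMINATE \
    --image-family=debian-12 \
    --image-project=debian-cloud
```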
Use compute-optimized machine types
Workloads like gaming, media transcoding, and high performance computing (HPC) require consistently high performance per CPU core. Google recommends that you use compute-optimized machine types for the VMs that run such workloads. Compute-optimized VMs are built on an architecture that uses features like non-uniform memory access (NUMA) for optimal and reliable performance.
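For example, a compute-optimized VM can be created by selecting a C2 machine type (the VM name, zone, and size below are placeholders):

```shell
# Create a compute-optimized (C2) VM for consistently high per-core performance.
gcloud compute instances create transcoding-vm \
    --zone=us-central1-a \
    --machine-type=c2-standard-16
```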
Tightly coupled HPC workloads have a unique set of requirements for achieving peak efficiency in performance. For more information, see Parallel file systems for HPC workloads.
Choose appropriate storage
Google Cloud offers a wide range of storage options for Compute Engine VMs: Persistent disks, local solid-state drive (SSD) disks, Filestore, and Cloud Storage. For design recommendations and best practices to optimize the performance of each of these storage options, see Optimize storage performance.
Google Kubernetes Engine
This section provides guidance to help you optimize the performance of your Google Kubernetes Engine (GKE) resources.
Autoscale resources
You can automatically resize the node pools in a GKE cluster to match the current load by using the cluster autoscaler feature. Autoscaling helps your apps continue to deliver predictable performance when the load increases. The cluster autoscaler resizes node pools automatically based on the resource requests (rather than actual resource utilization) of the Pods running on the nodes. When you use autoscaling, there can be a trade-off between performance and cost. Review the best practices for configuring cluster autoscaling efficiently.
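A configuration sketch that enables the cluster autoscaler on the default node pool at cluster creation time (the cluster name, zone, and node limits are placeholders):

```shell
# Create a GKE cluster whose default node pool autoscales
# between 1 and 6 nodes.
gcloud container clusters create perf-cluster \
    --zone=us-central1-a \
    --num-nodes=3 \
    --enable-autoscaling \
    --min-nodes=1 \
    --max-nodes=6
```

Because the autoscaler acts on Pod resource requests rather than actual utilization, make sure your Pod specs declare CPU and memory requests; otherwise scaling decisions won't reflect the workload's real needs.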
Use C2D VMs
You can improve the performance of compute-intensive containerized workloads by using C2D machine types. You can add C2D nodes to your GKE clusters by choosing a C2D machine type in your node pools.
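As a sketch, the following command adds a C2D node pool to an existing cluster (the cluster name, pool name, zone, and sizes are placeholders):

```shell
# Add a node pool of C2D nodes to an existing GKE cluster.
gcloud container node-pools create c2d-pool \
    --cluster=perf-cluster \
    --zone=us-central1-a \
    --machine-type=c2d-standard-8 \
    --num-nodes=2
```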
Disable SMT
Simultaneous multi-threading (SMT) can increase application throughput significantly for general computing tasks and for workloads that need high I/O. But for workloads in which both virtual cores that share a physical core are compute-bound, SMT can cause inconsistent performance. To get better and more predictable performance, you can disable SMT for your GKE nodes by setting the number of vCPUs per core to 1.
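A minimal sketch of creating a node pool with SMT disabled (the cluster name, pool name, zone, and machine type are placeholders):

```shell
# Create a node pool whose nodes run one thread per physical core (SMT off).
gcloud container node-pools create no-smt-pool \
    --cluster=perf-cluster \
    --zone=us-central1-a \
    --machine-type=c2-standard-8 \
    --threads-per-core=1
```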
Use GPUs
For compute-intensive workloads like image recognition and video transcoding, you can accelerate performance by creating node pools that use GPUs. For more information, see Running GPUs.
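As a sketch, the following command creates a node pool with an NVIDIA T4 GPU attached to each node. The names, zone, and GPU count are placeholders; depending on your cluster configuration, the NVIDIA driver may need to be installed separately on the nodes:

```shell
# Create a GKE node pool with one NVIDIA T4 GPU per node.
gcloud container node-pools create gpu-pool \
    --cluster=perf-cluster \
    --zone=us-central1-a \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --num-nodes=1
```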
Use container-native load balancing
Container-native load balancing enables load balancers to distribute traffic directly and evenly to Pods. This approach provides better network performance and improved visibility into network latency between the load balancer and the Pods. Because of these benefits, container-native load balancing is the recommended solution for load balancing through Ingress.
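Container-native load balancing is requested by annotating a Service so that the load balancer targets Pods directly through network endpoint groups (NEGs). A sketch, with a placeholder Service name, selector, and ports (on VPC-native clusters, NEG-backed load balancing is the default for Ingress):

```shell
# Annotate a Service so that Ingress uses container-native load balancing
# through network endpoint groups (NEGs).
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: web-service
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: ClusterIP
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
EOF
```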
Define a compact placement policy
Tightly coupled batch workloads need low network latency between the nodes in the GKE node pool. You can deploy such workloads to single-zone node pools, and ensure that the nodes are physically close to each other by defining a compact placement policy. For more information, see Define compact placement for GKE nodes.
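A configuration sketch of a single-zone node pool with compact placement. The names and zone are placeholders; compact placement is supported only on certain machine families:

```shell
# Create a single-zone node pool whose nodes are provisioned
# physically close together to minimize inter-node network latency.
gcloud container node-pools create compact-pool \
    --cluster=perf-cluster \
    --zone=us-central1-a \
    --machine-type=c2-standard-16 \
    --placement-type=COMPACT \
    --num-nodes=4
```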
Serverless compute services
This section provides guidance to help you optimize the performance of your serverless compute services in Google Cloud: Cloud Run and Cloud Run functions. These services provide autoscaling capabilities, where the underlying infrastructure handles scaling automatically. By using these serverless services, you can reduce the effort to scale your microservices and functions, and focus on optimizing performance at the application level.
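Even though scaling is automatic, a few deploy-time settings affect performance, such as keeping a warm instance to avoid cold starts and tuning per-instance CPU and concurrency. A sketch for Cloud Run, with a placeholder service name and region (the image below is Google's public Cloud Run sample container):

```shell
# Deploy a Cloud Run service with performance-related settings:
# a minimum of one warm instance, 2 vCPUs, and up to 80 concurrent
# requests per instance.
gcloud run deploy web-service \
    --image=us-docker.pkg.dev/cloudrun/container/hello \
    --region=us-central1 \
    --min-instances=1 \
    --cpu=2 \
    --memory=1Gi \
    --concurrency=80
```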
For more information, see the following documentation:
- Optimizing performance for Cloud Run services
- Optimizing Java applications for Cloud Run
- Optimizing performance in Cloud Run functions
What's next
Review the best practices for optimizing the performance of your storage, networking, database, and analytics resources:
- Optimize storage performance.
- Optimize networking performance.
- Optimize database performance.
- Optimize analytics performance.