
AI on GKE Benchmark TGI Server

Overview

This stage deploys a Hugging Face Text Generation Inference (TGI) server.

Instructions

Step 1: create and configure terraform.tfvars

Create a terraform.tfvars file; sample-terraform.tfvars is provided as an example and can be copied as a starting point. Note that you will have to change the existing credentials_config.

cp sample-terraform.tfvars terraform.tfvars

Fill out your terraform.tfvars with the desired model and server configuration, referring to the list of required and optional variables here. At minimum, the credentials_config and project_id variables are required.
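For illustration, a minimal terraform.tfvars might look like the sketch below (the project id is a placeholder, and every value should be adjusted for your cluster and model):

credentials_config = {
  fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
project_id = "sample-project"   # placeholder: use your own project id
model_id   = "tiiuae/falcon-7b" # the default model
gpu_count  = 1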

Optionally configure HPA (Horizontal Pod Autoscaling) by setting hpa_type. Note: GMP (Google Managed Prometheus) must be enabled on this cluster (which is the default) to scale based on custom metrics. See autoscaling.md for more details.
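As a sketch, enabling autoscaling in terraform.tfvars could look like the following (the hpa_type value and target are assumptions; consult autoscaling.md for the metrics your cluster actually supports):

hpa_type                = "cpu" # assumption: see autoscaling.md for valid values
hpa_min_replicas        = 1
hpa_max_replicas        = 5
hpa_averagevalue_target = 80   # must be set whenever hpa_type is not null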

Determine the number of GPUs

gpu_count should be configured according to the size of the model, with some overhead for the KV cache. Here's an example of figuring out how many GPUs you need to run a model:

TGI defaults to bfloat16 for running supported models on GPUs. For a model with a dtype of FP16 or bfloat16, each parameter requires 16 bits (2 bytes). A 7-billion-parameter model therefore requires a minimum of 7 billion * 2 bytes = 14 GB of GPU memory. A single L4 GPU has 24 GB of memory, so one L4 GPU is sufficient to run the tiiuae/falcon-7b model with plenty of overhead for the KV cache.

Note that the Hugging Face TGI server supports gpu_count values of 1, 2, 4, or 8.
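Applying the same arithmetic to a larger model (the 13B case is a hypothetical comparison, not a tested configuration):

# tiiuae/falcon-7b in bfloat16: 7e9 params * 2 bytes = 14 GB -> fits on one 24 GB L4
gpu_count = 1

# a hypothetical 13B model: 13e9 params * 2 bytes = 26 GB -> exceeds one L4,
# so use gpu_count = 2 to shard the weights across two GPUs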

[optional] set up credentials config with kubeconfig

If you created your cluster with steps from ../../infra/ or with fleet management enabled, the existing credentials_config must use the fleet host credentials like this:

credentials_config = {
  fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
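If you need the project number for the fleet host URL, one way to look it up is with gcloud (assuming $PROJECT_ID is set to your project id):

gcloud projects describe $PROJECT_ID --format='value(projectNumber)'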

If you created your own cluster without fleet management enabled, you can use your cluster's kubeconfig in the credentials_config. You must isolate your cluster's kubeconfig from the other clusters in your default kubeconfig file. To do this, run the following command:

KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION

Then update your terraform.tfvars credentials_config to the following:

credentials_config = {
  kubeconfig = {
    path = "~/.kube/${CLUSTER_NAME}-kube.config"
  }
}

[optional] set up secret token in Secret Manager

A model may require a security token to access it. For example, Llama 2 from Hugging Face is a gated model that requires a user access token. If the model you want to run does not require this, skip this step.

If you followed the steps from ../../infra/, Secret Manager and the user access token should already be set up. If not, it is strongly recommended that you use Workload Identity and Secret Manager to access the user access token, to avoid adding a plain-text token to the Terraform state. To do so, follow the instructions for setting up a secret in Secret Manager here.
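If you are creating the secret by hand, a minimal sketch (this assumes the Secret Manager API is enabled, $HF_TOKEN holds your Hugging Face user access token, and hugging_face_secret is an illustrative secret name):

echo -n "$HF_TOKEN" | gcloud secrets create hugging_face_secret --data-file=-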

Once complete, you should add these related secret values to your terraform.tfvars:

# ex. "projects/sample-project/secrets/hugging_face_secret"
hugging_face_secret = $SECRET_ID
 # ex. 1
hugging_face_secret_version =  $SECRET_VERSION

[Optional] Step 2: configure alternative storage

By default, the TGI YAML spec assumes that the cluster has local SSD-backed ephemeral storage available.

If you wish to use a different storage option with the TGI server, you can edit the ./manifest-templates/text-generation-inference.tftpl directly with your desired storage setup.

Step 3: log in to gcloud

Run the following gcloud command for authorization:

gcloud auth application-default login

Step 4: terraform initialize, plan and apply

Run the following terraform commands:

# initialize terraform
terraform init

# verify changes
terraform plan

# apply changes
terraform apply
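Once terraform apply completes, you can sanity-check that the TGI pods came up (a sketch; this assumes kubectl is configured for the cluster and that you kept the default namespace variable listed below):

kubectl get pods --namespace default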

Variables

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| credentials_config | Configure how Terraform authenticates to the cluster. | object({ fleet_host = optional(string), kubeconfig = optional(object({ context = optional(string), path = optional(string, "~/.kube/config") })) }) | n/a | yes |
| gpu_count | Parallelism based on number of GPUs. | number | 1 | no |
| hpa_averagevalue_target | AverageValue target for the hpa_type metric. Must be set if hpa_type is not null. | number | null | no |
| hpa_max_replicas | Maximum number of HPA replicas. | number | 5 | no |
| hpa_min_replicas | Minimum number of HPA replicas. | number | 1 | no |
| hpa_type | How the TGI workload should be scaled. | string | null | no |
| hugging_face_secret | Secret id in Secret Manager. | string | null | no |
| hugging_face_secret_version | Secret version in Secret Manager. | string | null | no |
| ksa | Kubernetes Service Account used for the workload. | string | "default" | no |
| max_concurrent_requests | Max concurrent requests TGI will handle at once; all requests beyond this limit are dropped. | number | 128 | no |
| model_id | Model used for inference. | string | "tiiuae/falcon-7b" | no |
| namespace | Namespace used for TGI resources. | string | "default" | no |
| project_id | Project id of existing or created project. | string | n/a | yes |
| quantization | Quantization used for the model. Can be one of the options listed at https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher#quantize. eetq and bitsandbytes can be applied to any model, whereas the others may require quantized checkpoints. | string | "" | no |
| secret_templates_path | Path where secret configuration manifest templates will be read from. Set to null to use the default manifests. | string | null | no |
| templates_path | Path where manifest templates will be read from. Set to null to use the default manifests. | string | null | no |
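For example, serving the default model with eetq quantization could look like this in terraform.tfvars (a sketch; per the description above, eetq can be applied to any model):

model_id     = "tiiuae/falcon-7b"
quantization = "eetq"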