
AI on GKE Benchmark TGI Server

Overview

This stage deploys a Hugging Face Text Generation Inference (TGI) server.

Instructions

Step 1: create and configure terraform.tfvars

Create a terraform.tfvars file; sample-terraform.tfvars is provided as an example and can be copied as a starting point. Note that you will have to change the existing credentials_config.

cp sample-terraform.tfvars terraform.tfvars

Fill out your terraform.tfvars with the desired model and server configuration, referring to the list of required and optional variables here. At minimum, the credentials_config and project_id variables are required.
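For illustration, a minimal terraform.tfvars might look like the sketch below (the project id is a placeholder, and every value should be adjusted for your cluster and model):

credentials_config = {
  fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
project_id = "sample-project"   # placeholder: use your own project id
model_id   = "tiiuae/falcon-7b" # the default model
gpu_count  = 1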

Optionally configure HPA (Horizontal Pod Autoscaling) by setting hpa_type. Note: GMP (Google Managed Prometheus) must be enabled on this cluster (which is the default) to scale based on custom metrics. See autoscaling.md for more details.
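As a sketch, enabling autoscaling in terraform.tfvars could look like the following (the hpa_type value and target are assumptions; consult autoscaling.md for the metrics your cluster actually supports):

hpa_type                = "cpu" # assumption: see autoscaling.md for valid values
hpa_min_replicas        = 1
hpa_max_replicas        = 5
hpa_averagevalue_target = 80   # must be set whenever hpa_type is not null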

Determine the number of GPUs

gpu_count should be configured according to the size of the model, with some overhead for the KV cache. Here's an example of figuring out how many GPUs you need to run a model:

TGI defaults to bfloat16 for running supported models on GPUs. For a model with a dtype of FP16 or bfloat16, each parameter requires 16 bits (2 bytes). A 7-billion-parameter model therefore requires a minimum of 7 billion * 2 bytes = 14 GB of GPU memory. A single L4 GPU has 24 GB of memory, so one L4 GPU is sufficient to run the tiiuae/falcon-7b model with plenty of overhead for the KV cache.

Note that the Hugging Face TGI server supports gpu_count values of 1, 2, 4, or 8.
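Applying the same arithmetic to a larger model (the 13B case is a hypothetical comparison, not a tested configuration):

# tiiuae/falcon-7b in bfloat16: 7e9 params * 2 bytes = 14 GB -> fits on one 24 GB L4
gpu_count = 1

# a hypothetical 13B model: 13e9 params * 2 bytes = 26 GB -> exceeds one L4,
# so use gpu_count = 2 to shard the weights across two GPUs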

[optional] set up credentials config with kubeconfig

If you created your cluster with steps from ../../infra/ or with fleet management enabled, the existing credentials_config must use the fleet host credentials like this:

credentials_config = {
  fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
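If you need the project number for the fleet host URL, one way to look it up is with gcloud (assuming $PROJECT_ID is set to your project id):

gcloud projects describe $PROJECT_ID --format='value(projectNumber)'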

If you created your own cluster without fleet management enabled, you can use your cluster's kubeconfig in the credentials_config. You must isolate your cluster's kubeconfig from the other clusters in your default kubeconfig file. To do this, run the following command:

KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION

Then update your terraform.tfvars credentials_config to the following:

credentials_config = {
  kubeconfig = {
    path = "~/.kube/${CLUSTER_NAME}-kube.config"
  }
}

[optional] set up secret token in Secret Manager

A model may require a security token to access it. For example, Llama 2 from Hugging Face is a gated model that requires a user access token. If the model you want to run does not require this, skip this step.

If you followed the steps from ../../infra/, Secret Manager and the user access token should already be set up. If not, it is strongly recommended that you use Workload Identity and Secret Manager to access the user access token, to avoid adding a plain-text token to the Terraform state. To do so, follow the instructions for setting up a secret in Secret Manager here.
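If you are creating the secret by hand, a minimal sketch (this assumes the Secret Manager API is enabled, $HF_TOKEN holds your Hugging Face user access token, and hugging_face_secret is an illustrative secret name):

echo -n "$HF_TOKEN" | gcloud secrets create hugging_face_secret --data-file=-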

Once complete, you should add these related secret values to your terraform.tfvars:

# ex. "projects/sample-project/secrets/hugging_face_secret"
hugging_face_secret = $SECRET_ID
 # ex. 1
hugging_face_secret_version =  $SECRET_VERSION

[Optional] Step 2: configure alternative storage

By default, the TGI YAML spec assumes that the cluster has local SSD-backed ephemeral storage available.

If you wish to use a different storage option with the TGI server, you can edit the ./manifest-templates/text-generation-inference.tftpl directly with your desired storage setup.

Step 3: log in to gcloud

Run the following gcloud command for authorization:

gcloud auth application-default login

Step 4: terraform initialize, plan and apply

Run the following terraform commands:

# initialize terraform
terraform init

# verify changes
terraform plan

# apply changes
terraform apply
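Once terraform apply completes, you can sanity-check that the TGI pods came up (a sketch; this assumes kubectl is configured for the cluster and that you kept the default namespace variable listed below):

kubectl get pods --namespace default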

Variables

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| credentials_config | Configure how Terraform authenticates to the cluster. | object({ fleet_host = optional(string), kubeconfig = optional(object({ context = optional(string), path = optional(string, "~/.kube/config") })) }) | n/a | yes |
| gpu_count | Parallelism based on number of GPUs. | number | 1 | no |
| hpa_averagevalue_target | AverageValue target for the hpa_type metric. Must be set if hpa_type is not null. | number | null | no |
| hpa_max_replicas | Maximum number of HPA replicas. | number | 5 | no |
| hpa_min_replicas | Minimum number of HPA replicas. | number | 1 | no |
| hpa_type | How the TGI workload should be scaled. | string | null | no |
| hugging_face_secret | Secret id in Secret Manager. | string | null | no |
| hugging_face_secret_version | Secret version in Secret Manager. | string | null | no |
| ksa | Kubernetes Service Account used for the workload. | string | "default" | no |
| max_concurrent_requests | Max concurrent requests TGI will handle at once; all requests beyond this limit are dropped. | number | 128 | no |
| model_id | Model used for inference. | string | "tiiuae/falcon-7b" | no |
| namespace | Namespace used for TGI resources. | string | "default" | no |
| project_id | Project id of existing or created project. | string | n/a | yes |
| quantization | Quantization used for the model. Can be one of the options listed at https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher#quantize. eetq and bitsandbytes can be applied to any model, whereas the others may require quantized checkpoints. | string | "" | no |
| secret_templates_path | Path where secret configuration manifest templates will be read from. Set to null to use the default manifests. | string | null | no |
| templates_path | Path where manifest templates will be read from. Set to null to use the default manifests. | string | null | no |
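For example, serving the default model with eetq quantization could look like this in terraform.tfvars (a sketch; per the description above, eetq can be applied to any model):

model_id     = "tiiuae/falcon-7b"
quantization = "eetq"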