This stage deploys a Text Generation Inference server.
Create a `terraform.tfvars` file. `sample-terraform.tfvars` is provided as an example file; you can copy it as a starting point. Note that you will have to change the existing `credentials_config`.
```bash
cp sample-terraform.tfvars terraform.tfvars
```
Fill out your `terraform.tfvars` with the desired model and server configuration, referring to the list of required and optional variables below. The `credentials_config` and `project_id` variables are required.
Optionally, configure HPA (Horizontal Pod Autoscaling) by setting `hpa_type`. Note: GMP (Google Managed Prometheus) must be enabled on the cluster (it is by default) to scale based on custom metrics. See `autoscaling.md` for more details.
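For illustration, an HPA block in `terraform.tfvars` might look like the following sketch. The metric name and target value here are assumptions chosen for the example; `autoscaling.md` documents the values `hpa_type` actually accepts.

```hcl
# Hypothetical autoscaling setup; metric name and target are examples only.
hpa_type                = "tgi_queue_size" # assumed custom metric, see autoscaling.md
hpa_min_replicas        = 1
hpa_max_replicas        = 5
hpa_averagevalue_target = 10 # must be set whenever hpa_type is not null
```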
`gpu_count` should be configured according to the size of the model, with some overhead for the kv cache. Here's an example of figuring out how many GPUs you need to run a model: TGI defaults to bfloat16 for running supported models on GPUs. With a dtype of float16 or bfloat16, each parameter requires 16 bits (2 bytes), so a 7-billion-parameter model requires a minimum of 7 billion × 2 bytes = 14 GB of GPU memory. A single L4 GPU has 24 GB of GPU memory, so one L4 GPU is sufficient to run the `tiiuae/falcon-7b` model with plenty of overhead for the kv cache.
Note that the Hugging Face TGI server supports a `gpu_count` of 1, 2, 4, or 8.
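Applying that math to the default model, the corresponding `terraform.tfvars` fragment might look like this sketch:

```hcl
# tiiuae/falcon-7b needs ~14 GB of weights in bfloat16, so one 24 GB L4
# leaves roughly 10 GB of headroom for the kv cache.
model_id  = "tiiuae/falcon-7b"
gpu_count = 1
```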
If you created your cluster with the steps from `../../infra/` or with fleet management enabled, the existing `credentials_config` must use the fleet host credentials, like this:
```hcl
credentials_config = {
  fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
```
If you created your own cluster without fleet management enabled, you can use your cluster's kubeconfig in the `credentials_config`. You must isolate your cluster's kubeconfig from other clusters in the default kube.config file. To do this, run the following command:
```bash
KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION
```
Then update the `credentials_config` in your `terraform.tfvars` to the following:
```hcl
credentials_config = {
  kubeconfig = {
    path = "~/.kube/${CLUSTER_NAME}-kube.config"
  }
}
```
A model may require a security token to access it. For example, Llama 2 from Hugging Face is a gated model that requires a user access token. If the model you want to run does not require this, skip this step.
If you followed the steps from `../../infra/`, Secret Manager and the user access token should already be set up. If not, it is strongly recommended that you use Workload Identity and Secret Manager to access the user access token, to avoid adding a plain-text token to the Terraform state. To do so, follow the instructions for setting up a secret in Secret Manager here. Once complete, add these related secret values to your `terraform.tfvars`:
# ex. "projects/sample-project/secrets/hugging_face_secret"
hugging_face_secret = $SECRET_ID
# ex. 1
hugging_face_secret_version = $SECRET_VERSION
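Putting the pieces together, a minimal `terraform.tfvars` might look like the following sketch, assuming fleet host credentials and a gated model; the project, cluster, and secret values are placeholders to substitute with your own:

```hcl
credentials_config = {
  fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
project_id = "sample-project"

model_id  = "tiiuae/falcon-7b"
gpu_count = 1

# Only needed for gated models:
hugging_face_secret         = "projects/sample-project/secrets/hugging_face_secret"
hugging_face_secret_version = "1"
```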
By default, the TGI YAML spec assumes that the cluster has local SSD-backed ephemeral storage available. If you wish to use a different storage option with the TGI server, you can edit `./manifest-templates/text-generation-inference.tftpl` directly with your desired storage setup.
Run the following gcloud command for authorization:
```bash
gcloud auth application-default login
```
Run the following terraform commands:
```bash
# initialize terraform
terraform init
# verify changes
terraform plan
# apply changes
terraform apply
```
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| credentials_config | Configure how Terraform authenticates to the cluster. | `object({…})` | n/a | yes |
| gpu_count | Parallelism based on number of gpus. | `number` | `1` | no |
| hpa_averagevalue_target | AverageValue target for the `hpa_type` metric. Must be set if `hpa_type` is not null. | `number` | `null` | no |
| hpa_max_replicas | Maximum number of HPA replicas. | `number` | `5` | no |
| hpa_min_replicas | Minimum number of HPA replicas. | `number` | `1` | no |
| hpa_type | How the TGI workload should be scaled. | `string` | `null` | no |
| hugging_face_secret | Secret id in Secret Manager. | `string` | `null` | no |
| hugging_face_secret_version | Secret version in Secret Manager. | `string` | `null` | no |
| ksa | Kubernetes Service Account used for the workload. | `string` | `"default"` | no |
| max_concurrent_requests | Max concurrent requests allowed for TGI to handle at once; TGI will drop all requests once it hits this max-concurrent-requests limit. | `number` | `128` | no |
| model_id | Model used for inference. | `string` | `"tiiuae/falcon-7b"` | no |
| namespace | Namespace used for TGI resources. | `string` | `"default"` | no |
| project_id | Project id of existing or created project. | `string` | n/a | yes |
| quantization | Quantization used for the model. Can be one of the quantization options mentioned in https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher#quantize. `eetq` and `bitsandbytes` can be applied to any model, whereas others might require the use of quantized checkpoints. | `string` | `""` | no |
| secret_templates_path | Path where secret configuration manifest templates will be read from. Set to `null` to use the default manifests. | `string` | `null` | no |
| templates_path | Path where manifest templates will be read from. Set to `null` to use the default manifests. | `string` | `null` | no |