LangSmith: GCP GKE

Quickstart

Get from zero to a running LangSmith instance on GKE in under an hour.

First time?
Run these commands in order. Each step builds on the previous. Return to the full guide below for configuration details, advanced options, and per-pass troubleshooting.
# 1 — Unzip the Terraform modules provided by your LangChain SA
unzip gcp.zip
cd gcp
# 2 — Authenticate to GCP
gcloud auth login
gcloud config set project <your-project-id>
gcloud auth application-default login
# 3 — Generate terraform.tfvars interactively
#     Re-running is safe — Enter accepts current values
make quickstart
# 4 — Set up secrets in Secret Manager
#     Auto-generates passwords and Fernet keys — must be sourced
source infra/scripts/setup-env.sh
# 5 — Deploy infrastructure (~25–35 min)
make init
make plan
make apply
# 6 — Configure kubectl
make kubeconfig

# Verify nodes are ready
kubectl get nodes
# 7 — Deploy LangSmith
make init-values
make deploy
# 8 — Get the Gateway IP for DNS
kubectl get gateway -n langsmith \
  -o jsonpath='{.items[0].status.addresses[0].value}'
What gets deployed
Pass 1 creates the VPC, GKE cluster, Cloud SQL PostgreSQL, Memorystore Redis, GCS bucket, IAM/Workload Identity, cert-manager, KEDA, and Envoy Gateway. Pass 2 installs the LangSmith Helm chart. Passes 3–5 are optional add-ons (Deployments, Agent Builder, Insights).
Professional Services — GCP GKE

LangSmith on GCP
Self-hosted deployment on GKE, managed with Terraform.

Changelog: Check the Self-Hosted Changelog before upgrading for breaking changes, new variables, and Helm chart notes.
| Pass | Component | Duration | What it adds | Status |
|---|---|---|---|---|
| 01 | Infrastructure | ~30 min | VPC, GKE cluster, Cloud SQL PostgreSQL, Memorystore Redis, GCS bucket, Workload Identity, cert-manager, KEDA, Envoy Gateway | Required |
| 02 | LangSmith | ~10 min | LangSmith Helm chart: traces, prompts, evaluations, org management | Required |
| 03 | LangSmith Deployments | ~5 min | Deploy and manage LangGraph graphs as API servers from the LangSmith UI. Requires KEDA. | Optional |
| 04 | Agent Builder | ~5 min | Build and deploy AI agents from the LangSmith UI. Adds tool-server, trigger-server, and a deep-agent LGP. Requires Pass 3. | Optional |
| 05 | Insights + Polly | ~5 min | ClickHouse-backed trace analytics (Clio) and Polly AI eval/monitoring agent. Requires Pass 4. | Optional |

Two deployment tiers

| Tier | Postgres | Redis | ClickHouse | Use case |
|---|---|---|---|---|
| Light | In-cluster pod | In-cluster pod | In-cluster pod | Demo / POC / short-lived dev |
| Production | Cloud SQL (private IP) | Memorystore (private IP) | LangChain Managed | Persistent, scalable deployments |

In-cluster ClickHouse is for dev/POC only
In-cluster ClickHouse runs as a single pod with no replication or backups. For production deployments, use LangChain Managed ClickHouse.
Blob Storage is always required
Regardless of tier, trace payloads must go to GCS — never to ClickHouse. Both tiers use external GCS blob storage.

GCP resources created (Pass 1)

| Resource | Type | Purpose |
|---|---|---|
| VPC Network | google_compute_network | Isolated network with regional routing |
| Subnet | google_compute_subnetwork | GKE nodes, pods, and services CIDRs |
| Cloud Router + NAT | google_compute_router | Outbound internet access for private nodes |
| GKE Cluster | google_container_cluster | Kubernetes, in Standard or Autopilot mode |
| Node Pool | google_container_node_pool | Autoscaling worker nodes with Workload Identity |
| Cloud SQL | google_sql_database_instance | PostgreSQL: org config, run metadata, graph checkpoints |
| Memorystore Redis | google_redis_instance | Trace ingestion queue, pub/sub |
| GCS Bucket | google_storage_bucket | Raw trace objects with TTL lifecycle rules |
| cert-manager | Helm | Automated TLS via Let's Encrypt |
| KEDA | Helm | Event-driven autoscaling (required for Deployments) |
| Envoy Gateway | Helm | Optional ingress for external traffic + TLS termination |

Prerequisites

Required tools

bash
# Google Cloud SDK (>= 450)
brew install --cask google-cloud-sdk
# Linux: https://cloud.google.com/sdk/docs/install

# Terraform (>= 1.5)
brew tap hashicorp/tap && brew install hashicorp/tap/terraform

# kubectl
brew install kubectl

# Helm (>= 3.12)
brew install helm

# Verify
gcloud version
terraform version
kubectl version --client
helm version

Required GCP IAM roles

| Role | Purpose |
|---|---|
| roles/container.admin | Create and manage GKE clusters |
| roles/compute.networkAdmin | Create VPC, subnets, firewall rules |
| roles/iam.serviceAccountAdmin | Create service accounts for Workload Identity |
| roles/cloudsql.admin | Create and manage Cloud SQL instances |
| roles/redis.admin | Create and manage Memorystore Redis instances |
| roles/storage.admin | Create GCS buckets and lifecycle policies |
| roles/resourcemanager.projectIamAdmin | Grant IAM bindings during provisioning |
| roles/servicenetworking.networksAdmin | Create private service connections for Cloud SQL and Memorystore |

Authenticate and configure project

bash
gcloud auth login
gcloud config set project <your-project-id>
gcloud auth application-default login

# Terraform enables these automatically — or enable manually:
gcloud services enable \
  container.googleapis.com \
  sqladmin.googleapis.com \
  redis.googleapis.com \
  storage.googleapis.com \
  iam.googleapis.com \
  servicenetworking.googleapis.com \
  cloudresourcemanager.googleapis.com \
  --project <your-project-id>
Service Networking
Cloud SQL and Memorystore use private service access (VPC peering to Google's network). The servicenetworking.googleapis.com API and roles/servicenetworking.networksAdmin are required before those resources can be created — the Terraform module handles this automatically.

Repository Layout

terraform/gcp/
├── Makefile                    ← All commands — start here (make help)
├── infra/
│   ├── main.tf                 ← Root config — enables APIs, wires all sub-modules
│   ├── variables.tf            ← All configurable inputs with defaults
│   ├── locals.tf               ← Naming convention: {prefix}-{env}-{resource}-{suffix}
│   ├── outputs.tf              ← Cluster, DB, Redis, Storage outputs
│   ├── modules/
│   │   ├── networking/         ← VPC, subnet, Cloud Router, Cloud NAT, private service connection
│   │   ├── k8s-cluster/        ← GKE Standard/Autopilot, node pool, Workload Identity
│   │   ├── postgres/           ← Cloud SQL PostgreSQL, HA, private IP, deletion protection
│   │   ├── redis/              ← Memorystore Redis, HA tier, private IP
│   │   ├── storage/            ← GCS bucket with TTL lifecycle rules (ttl_s/ ttl_l/)
│   │   ├── k8s-bootstrap/      ← Namespaces, K8s secrets, cert-manager, KEDA
│   │   ├── ingress/            ← Envoy Gateway (Gateway API), GatewayClass, HTTPRoute
│   │   ├── iam/                ← Workload Identity service accounts and IAM bindings
│   │   ├── dns/                ← Cloud DNS managed zone (optional)
│   │   └── secrets/            ← Secret Manager secrets (optional)
│   └── scripts/
│       ├── _common.sh          ← Shared helpers (tfvar parser, color output)
│       ├── preflight.sh        ← Pre-Terraform tooling / auth / API checks
│       ├── quickstart.sh       ← Interactive setup wizard — generates terraform.tfvars
│       ├── setup-env.sh        ← Exports TF_VAR_* secrets from Secret Manager (source it)
│       ├── status.sh           ← Deployment health check
│       ├── manage-secrets.sh   ← Secret Manager CRUD (list/get/set/validate/delete)
│       └── tf-run.sh           ← Terraform wrapper that auto-sources setup-env.sh
└── helm/
    ├── scripts/
    │   ├── deploy.sh               ← Helm deploy — values chain + Workload Identity annotation
    │   ├── get-kubeconfig.sh       ← gcloud get-credentials wrapper
    │   ├── init-values.sh          ← Generates values-overrides.yaml from Terraform outputs
    │   ├── preflight-check.sh      ← Pre-deploy validation (tools, cluster, helm repo)
    │   └── uninstall.sh            ← Helm uninstall + operator resource cleanup
    └── values/
        ├── values.yaml             ← GCP base Helm values (Gateway, GCS config) — tracked in git
        ├── values-overrides.yaml   ← Live file — gitignored, generated by init-values.sh
        └── examples/               ← Source templates — tracked in git, copied by init-values.sh
Naming convention
All resources follow the pattern: {prefix}-{environment}-{resource}-{suffix} (e.g. ls-prod-gke-a1b2c3d4). The random suffix is generated once from name_prefix, environment, and project_id — stored in state to stay stable across applies. Set unique_suffix = false to disable.
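
The pattern can be illustrated with a small shell sketch. The helper name `resource_name` is hypothetical and not part of the module. The real suffix is generated by Terraform and kept in state, so here it is simply passed in, and the example assumes that `unique_suffix = false` drops the suffix segment entirely.

```shell
# Hypothetical helper: compose {prefix}-{environment}-{resource}-{suffix}.
# The fourth argument is optional; omitting it models unique_suffix = false
# (assumption: the suffix segment is simply left off).
resource_name() {
  local prefix="$1" env="$2" resource="$3" suffix="${4:-}"
  local name="${prefix}-${env}-${resource}"
  if [ -n "$suffix" ]; then
    name="${name}-${suffix}"
  fi
  printf '%s\n' "$name"
}

resource_name ls prod gke a1b2c3d4   # prints ls-prod-gke-a1b2c3d4
resource_name ls prod gke            # prints ls-prod-gke
```

A helper like this is mainly useful for predicting names when searching the GCP console or writing IAM conditions; the authoritative names are the ones in the Terraform outputs.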

Configuration

Create a terraform.tfvars file in terraform/gcp/infra/:

hcl
# Required
project_id            = "<your-gcp-project-id>"
environment           = "prod"
name_prefix           = "ls"
langsmith_license_key = "<your-license-key>"
langsmith_domain      = "langsmith.<your-domain>"

# Region / zone
region = "us-west2"
zone   = "us-west2-a"

# GKE
gke_use_autopilot   = false
gke_machine_type    = "e2-standard-4"
gke_min_nodes       = 2
gke_max_nodes       = 10
gke_release_channel = "REGULAR"

# Cloud SQL
postgres_source   = "external"
postgres_version  = "POSTGRES_15"
postgres_tier     = "db-custom-2-8192"
postgres_password = "<strong-password>"   # or: export TF_VAR_postgres_password=...

# Memorystore Redis
redis_source      = "external"
redis_memory_size = 5
redis_version     = "REDIS_7_0"

# ClickHouse
clickhouse_source = "in-cluster"

# Ingress + TLS
install_ingress        = true
ingress_type           = "envoy"
tls_certificate_source = "letsencrypt"
letsencrypt_email      = "<ops@your-domain>"

# Sizing + addon flags
sizing_profile = "production"
# enable_deployments   = true
# enable_agent_builder = true
# enable_insights      = true
# enable_polly         = true

Terraform state backend (recommended for production)

Uncomment the backend block in terraform/gcp/infra/main.tf:

hcl
# Goes inside the terraform { ... } block in main.tf
backend "gcs" {
  bucket = "<your-terraform-state-bucket>"
  prefix = "langsmith/state"
}

Pass 1 — Infrastructure (Required)

What gets created: VPC, GKE cluster, Cloud SQL, Memorystore Redis, GCS bucket, K8s bootstrap (namespaces, secrets, cert-manager, KEDA)
Duration: ~25–35 minutes
bash
cd terraform/gcp

# First time? Run the interactive setup wizard:
make quickstart   # generates terraform.tfvars from guided prompts

# Set up secrets in Secret Manager (auto-generates passwords + Fernet keys)
# Must be sourced — not executed — to export TF_VAR_* into your shell
source infra/scripts/setup-env.sh

make preflight    # checks gcloud auth, required APIs, and IAM roles
make init
make plan
make apply
First apply duration
GKE cluster creation takes 10–15 minutes. Cloud SQL with HA takes an additional 10 minutes. The private service connection for Cloud SQL and Redis is created during this pass — do not interrupt.
source, not execute
Always run source infra/scripts/setup-env.sh — not ./infra/scripts/setup-env.sh. The script exports TF_VAR_postgres_password and other credentials into the calling shell. Running without source silently exports nothing and Terraform fails at plan time.
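
The difference between sourcing and executing can be reproduced without touching Secret Manager. The sketch below uses a throwaway stand-in script and a made-up variable name (`TF_VAR_demo_password`); it is not the real setup-env.sh, only a demonstration of how exports behave in a child process versus the current shell.

```shell
# Stand-in for setup-env.sh: a script that exports one placeholder variable.
demo_script="$(mktemp)"
printf 'export TF_VAR_demo_password="example"\n' > "$demo_script"

# Executed: the export happens in a child process and is lost when it exits.
unset TF_VAR_demo_password
sh "$demo_script"
after_execute="${TF_VAR_demo_password:-<unset>}"

# Sourced ("." is the portable spelling of "source"): the export runs in
# the current shell, so the variable is still set afterwards.
. "$demo_script"
after_source="${TF_VAR_demo_password:-<unset>}"

rm -f "$demo_script"
echo "executed: $after_execute"   # prints executed: <unset>
echo "sourced:  $after_source"    # prints sourced:  example
```

The same check works against the real script: after sourcing it, `env | grep '^TF_VAR_'` should list the exported names.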

After apply — get cluster credentials

bash
make kubeconfig

kubectl get nodes
kubectl get ns

Verify bootstrap components

bash
kubectl get pods -n cert-manager
kubectl get pods -n keda
kubectl get secrets -n langsmith

Pass 2 — LangSmith (Required)

What gets created: LangSmith Helm release (API server, backend workers, frontend, ClickHouse)
Duration: ~8–12 minutes

Generate Helm values from Terraform outputs and deploy:

bash
cd terraform/gcp

make init-values   # reads Terraform outputs → writes values-overrides.yaml
make deploy        # helm upgrade --install with the full values chain
What init-values does
init-values.sh reads Terraform outputs and writes helm/values/values-overrides.yaml with your hostname, GCS bucket name, Cloud SQL endpoint, Redis endpoint, and Workload Identity client-id. It also copies sizing and addon files from examples/ based on your sizing_profile and enable_* flags.
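
To see which sizing profile and addon flags a run would pick up, the relevant lines of terraform.tfvars can be read directly. The function below is a hypothetical sketch, not the parser shipped in _common.sh; it assumes simple single-line `key = value` assignments and keys made only of letters, digits, and underscores.

```shell
# Hypothetical reader: print the value of a single-line `key = value`
# assignment, skipping commented-out lines and trailing # comments.
# Surrounding double quotes are stripped. If a key appears more than
# once, the last assignment is printed.
tfvar_get() {
  local key="$1" file="$2"
  sed -n -E "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\"?([^\"#]*[^\"#[:space:]])\"?[[:space:]]*(#.*)?\$/\1/p" "$file" | tail -n 1
}

# Demo against a temporary file that mimics terraform.tfvars.
tfvars="$(mktemp)"
cat > "$tfvars" <<'EOF'
sizing_profile = "production"
enable_deployments = true   # Pass 3
# enable_insights = true
EOF

tfvar_get sizing_profile "$tfvars"       # prints production
tfvar_get enable_deployments "$tfvars"   # prints true
tfvar_get enable_insights "$tfvars"      # prints nothing (commented out)
rm -f "$tfvars"
```

This does not handle lists, maps, or heredoc values; for anything beyond flat scalars, a real HCL parser is the safer choice.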

Verify deployment and configure DNS

bash
kubectl get pods -n langsmith

# Get Gateway external IP — create an A record pointing your domain here
EXTERNAL_IP=$(kubectl get svc -n envoy-gateway-system \
  -l gateway.envoyproxy.io/owning-gateway-name=langsmith-gateway \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
echo "A record: <your-langsmith-domain> → $EXTERNAL_IP"

# Verify TLS certificate
kubectl get certificate -n langsmith
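
While the LoadBalancer is still provisioning, the jsonpath query returns an empty string, so it can help to check the value before creating the DNS record. The sketch below is an assumption-labeled illustration: `looks_like_ipv4` is a hypothetical helper, the address comes from a documentation-only range, and the check is a shape test only (it does not verify that each octet is 0–255).

```shell
# Hypothetical helper: succeed only if the argument has dotted-quad shape.
looks_like_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# 203.0.113.10 is a documentation address; "" models a pending LoadBalancer.
for candidate in "203.0.113.10" ""; do
  if looks_like_ipv4 "$candidate"; then
    echo "ready: create the A record for $candidate"
  else
    echo "not ready: the Gateway has no external IP yet"
  fi
done
```

In practice the argument would be `$EXTERNAL_IP` from the command above, re-queried until the check passes.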

Pass 2 — LangSmith Helm Deploy

Use the scripted flow (includes preflight + kubeconfig refresh):

cd gcp/helm/scripts
./deploy.sh

Or run manually — generate secrets first:

export API_KEY_SALT=$(openssl rand -base64 32)
export JWT_SECRET=$(openssl rand -base64 32)
export AGENT_BUILDER_ENCRYPTION_KEY=$(python3 -c \
  "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
export INSIGHTS_ENCRYPTION_KEY=$(python3 -c \
  "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
export ADMIN_EMAIL="admin@example.com"
export ADMIN_PASSWORD="<strong-password>"
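
A Fernet key is 32 random bytes in URL-safe base64, which always comes out as 43 characters followed by one `=`. A shape check along those lines can catch a key that was truncated or pasted from the wrong generator before it reaches the deploy step. The helper name is hypothetical, and the sample key is an all-zero placeholder that must never be used for real data.

```shell
# Hypothetical helper: succeed only if the argument has the shape of a
# Fernet key (43 URL-safe base64 characters plus one "=" of padding).
looks_like_fernet_key() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9_-]{43}=$'
}

# Placeholder with the right shape (decodes to 32 zero bytes). Demo only.
sample_key="$(head -c 43 /dev/zero | tr '\0' 'A')="

if looks_like_fernet_key "$sample_key"; then
  echo "shape ok"
fi
if ! looks_like_fernet_key "not-a-key"; then
  echo "rejected: not-a-key"
fi
```

The check says nothing about whether the key is the one already in use, so it complements rather than replaces keeping the original keys unchanged.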

Pass 3 — LangSmith Deployments (Optional)

Prerequisite: Pass 2 healthy, with all core pods Running/Completed
What gets created: host-backend, listener, operator, used to deploy and manage LangGraph graphs from the UI. Requires KEDA.
Duration: ~5 minutes

Enable the flag in infra/terraform.tfvars and run:

hcl
# infra/terraform.tfvars
enable_deployments = true
bash
cd terraform/gcp

make apply          # pushes enable_deployments flag (KEDA managed by Terraform)
make init-values    # picks up enable_deployments = true
make deploy         # rolls out host-backend + listener + operator

Verify

bash
kubectl get pods -n langsmith | grep -E "host-backend|listener|operator"
# Expected: all Running

kubectl get lgp -n langsmith      # list LangSmith Deployments
kubectl get crd | grep langchain  # operator CRDs registered
kubectl get pods -n keda          # KEDA running
config.deployment.url must include https://
Missing the protocol causes operator-spawned agents to get stuck in DEPLOYING indefinitely.
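
A pre-deploy guard for this is a one-line pattern match. The function below is a hypothetical sketch; how the URL is read out of your values file is left open, and the two sample values are illustrative.

```shell
# Hypothetical pre-deploy check: succeed only if the URL starts with
# https:// and has at least one character after the scheme.
has_https_scheme() {
  case "$1" in
    https://?*) return 0 ;;
    *)          return 1 ;;
  esac
}

for url in "https://langsmith.example.com" "langsmith.example.com"; do
  if has_https_scheme "$url"; then
    echo "ok:      $url"
  else
    echo "missing: $url (add https://)"
  fi
done
```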

Pass 4 — Agent Builder (Optional)

Prerequisite: Pass 3 healthy, with listener and operator pods Running
What gets created: agent-builder-tool-server, agent-builder-trigger-server + deep-agent LGP
Duration: ~5 minutes

Enable the flag in infra/terraform.tfvars and deploy:

hcl
# infra/terraform.tfvars
enable_agent_builder = true
bash
cd terraform/gcp

make init-values    # picks up enable_agent_builder = true
make deploy

Verify

bash
kubectl get pods -n langsmith | grep -E "tool-server|trigger-server|bootstrap"
# Expected: tool-server Running, trigger-server Running, agentBootstrap Completed

Roll the frontend after agentBootstrap completes to pick up the langsmith-polly-config ConfigMap:

bash
kubectl rollout restart deployment langsmith-frontend -n langsmith
Frontend restart required
If you skip the frontend restart after first Polly enable, Polly shows "Unable to connect to LangGraph server" because the frontend was started before the langsmith-polly-config ConfigMap existed.

Pass 5 — Insights + Polly (Optional)

Prerequisite: Pass 4 healthy, with Agent Builder pods Running and agentBootstrap Completed
What gets created: Clio analytics (ClickHouse-backed), Polly AI eval/monitoring agent
Duration: ~5 minutes

Enable both flags together in infra/terraform.tfvars and deploy:

hcl
# infra/terraform.tfvars
enable_insights = true
enable_polly    = true   # requires Polly license entitlement
bash
cd terraform/gcp

make init-values    # picks up both flags
make deploy

Verify

bash
kubectl get pods -n langsmith | grep -E "clio|polly"
# Expected: clio Running, smith-polly Running (operator-spawned)

kubectl get pods -n langsmith -w   # watch until all new pods stabilize
Encryption keys are write-once
insights_encryption_key and polly_encryption_key must never change after first enable — rotating either permanently corrupts existing encrypted data.

Architecture Overview

[Figure: LangSmith on GCP, full architecture]

Workload Identity

GKE pods authenticate to GCS using Workload Identity: the Kubernetes service account is annotated with a GCP service account email, and an IAM binding allows it to act as that account. For GCS via the S3-compatible API, HMAC credentials are passed through Helm values. No static GCP service account keys are stored in Kubernetes secrets.
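
When debugging a permission error, it helps to know the two strings that have to line up. The sketch below prints them; the function names are hypothetical, `my-project` and the service account email are placeholders, and `langsmith-ksa` is taken from the troubleshooting list rather than confirmed as the name your release uses. The member format and annotation key are the standard GKE Workload Identity ones.

```shell
# IAM member that identifies a Kubernetes service account to GCP.
# It is the principal granted roles/iam.workloadIdentityUser on the
# GCP service account.
wi_member() {
  local project_id="$1" namespace="$2" ksa="$3"
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$project_id" "$namespace" "$ksa"
}

# Annotation on the Kubernetes service account that names the GCP
# service account it should act as.
wi_annotation() {
  local gsa_email="$1"
  printf 'iam.gke.io/gcp-service-account=%s\n' "$gsa_email"
}

wi_member my-project langsmith langsmith-ksa
wi_annotation langsmith-sa@my-project.iam.gserviceaccount.com
```

The first line can be compared with the output of `gcloud iam service-accounts get-iam-policy <gsa-email>`, and the second with `kubectl get sa langsmith-ksa -n langsmith -o yaml`.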

Private connectivity

Cloud SQL and Memorystore are accessible only via private IP within the VPC. A private service connection (VPC peering to Google's managed network) is established during Pass 1. No public endpoints are created for database or cache resources.

Envoy Gateway

Ingress is handled by Envoy Gateway (Gateway API). TLS is terminated at the Gateway using certificates issued by cert-manager (Let's Encrypt HTTP01) or an existing certificate. The Gateway exposes a single external LoadBalancer IP — point your DNS A record here.

Variable Reference

| Variable | Default | Description |
|---|---|---|
| project_id | required | GCP project ID where resources will be created |
| region | us-west2 | GCP region for all resources |
| zone | us-west2-a | GCP zone for zonal resources |
| environment | prod | Environment label: dev, staging, prod, test, uat |
| name_prefix | ls | Short prefix for all resource names (1–11 chars) |
| unique_suffix | true | Append random suffix to resource names |
| subnet_cidr | 10.0.0.0/20 | CIDR for the GKE subnet |
| pods_cidr | 10.4.0.0/14 | CIDR for GKE pod IPs (secondary range) |
| services_cidr | 10.8.0.0/20 | CIDR for GKE service IPs (secondary range) |
| gke_use_autopilot | false | Use GKE Autopilot mode (managed node pools) |
| gke_node_count | 2 | Initial node count per zone (Standard mode) |
| gke_min_nodes | 2 | Minimum nodes per zone for autoscaling |
| gke_max_nodes | 10 | Maximum nodes per zone for autoscaling |
| gke_machine_type | e2-standard-4 | GKE node machine type (Standard mode only) |
| gke_disk_size | 100 | Node disk size in GB |
| gke_release_channel | REGULAR | GKE release channel: RAPID, REGULAR, or STABLE |
| gke_deletion_protection | true | Enable deletion protection for GKE cluster |
| gke_network_policy_provider | DATA_PLANE_V2 | Network policy: CALICO or DATA_PLANE_V2 |
| postgres_source | external | external (Cloud SQL, private IP) or in-cluster |
| postgres_version | POSTGRES_15 | PostgreSQL version for Cloud SQL |
| postgres_tier | db-custom-2-8192 | Cloud SQL machine tier (2 vCPU, 8 GB RAM) |
| postgres_disk_size | 50 | Cloud SQL disk size in GB |
| postgres_high_availability | true | Enable Cloud SQL HA (regional standby) |
| postgres_deletion_protection | true | Enable deletion protection on Cloud SQL |
| postgres_password | required when external | PostgreSQL password; use TF_VAR_postgres_password |
| redis_source | external | external (Memorystore, private IP) or in-cluster |
| redis_version | REDIS_7_0 | Redis version for Memorystore |
| redis_memory_size | 5 | Memorystore Redis memory size in GB |
| redis_high_availability | true | Enable Memorystore HA tier (Standard HA) |
| redis_prevent_destroy | false | Prevent accidental Terraform destroy of Redis |
| clickhouse_source | in-cluster | in-cluster (dev/POC only), langsmith-managed (recommended for production), or external |
| clickhouse_host | "" | ClickHouse host (required for external/managed) |
| clickhouse_port | 9440 | ClickHouse native protocol port |
| clickhouse_http_port | 8443 | ClickHouse HTTP port |
| clickhouse_user | default | ClickHouse username |
| clickhouse_tls | true | Enable TLS for ClickHouse connections |
| storage_ttl_short_days | 14 | GCS TTL for ttl_s/ prefix (short-lived trace objects) |
| storage_ttl_long_days | 400 | GCS TTL for ttl_l/ prefix (long-lived trace objects) |
| storage_force_destroy | false | Allow bucket deletion even with objects inside |
| langsmith_namespace | langsmith | Kubernetes namespace for LangSmith |
| langsmith_domain | langsmith.example.com | Fully qualified domain name for LangSmith |
| langsmith_license_key | "" | LangSmith license key; use TF_VAR_langsmith_license_key |
| langsmith_helm_chart_version | "" | Pin a specific Helm chart version (empty = latest) |
| install_ingress | true | Install Envoy Gateway ingress via Terraform |
| ingress_type | envoy | envoy (implemented), istio, or other |
| tls_certificate_source | none | none, letsencrypt, or existing |
| letsencrypt_email | "" | Email for Let's Encrypt (required when tls_certificate_source = letsencrypt) |
| tls_secret_name | langsmith-tls | Name for the TLS secret in Kubernetes |
| sizing_profile | default | Helm sizing: production, production-large, dev, minimum, default |
| enable_deployments | false | Enable LangSmith Deployments; installs KEDA, listener, operator, host-backend |
| enable_agent_builder | false | Enable Agent Builder UI (requires enable_deployments = true) |
| enable_insights | false | Enable ClickHouse-backed analytics |
| enable_polly | false | Enable Polly AI eval/monitoring (requires enable_deployments = true) |
| enable_usage_telemetry | false | Enable extended usage telemetry reporting |
| owner | platform-team | Owner label applied to all resources |
| cost_center | "" | Cost center label for billing attribution |
| labels | {} | Additional labels applied to all resources |

Optional GCP modules

| Variable | Default | Description |
|---|---|---|
| enable_gcp_iam_module | true | Wires modules/iam for Workload Identity + bucket IAM binding |
| enable_secret_manager_module | false | Wires modules/secrets for Secret Manager bootstrap secret |
| enable_dns_module | false | Wires modules/dns for Cloud DNS + managed cert |
| dns_create_zone | true | Create a DNS zone when DNS module is enabled |
| dns_existing_zone_name | "" | Existing zone to use when dns_create_zone = false |
| dns_create_certificate | true | Create a Google-managed cert when DNS module is enabled |

Quick Reference

First-time setup

bash
cd terraform/gcp

make quickstart              # generates terraform.tfvars interactively
source infra/scripts/setup-env.sh   # exports TF_VAR_* into shell (must be sourced)
make secrets                 # verify secrets stored correctly in Secret Manager
make init && make plan && make apply   # ~25–35 min
make kubeconfig
make init-values
make deploy

Day-2 operations

bash
make status            # full deployment health check
make status-quick      # skip Secret Manager + K8s queries
make deploy            # re-deploy after changing Helm values or upgrading chart
make init-values       # re-generate values after Terraform changes
make kubeconfig        # refresh cluster credentials
make secrets           # manage Secret Manager secrets interactively (list/get/set/validate)

Pass summary

| Pass | What | Command |
|---|---|---|
| 1 | VPC + GKE + Cloud SQL + Memorystore + GCS + IAM + cert-manager + KEDA + Envoy Gateway | make apply |
| 1.5 | Cluster credentials | make kubeconfig |
| 2 | LangSmith Helm | make init-values && make deploy |
| 3 | + LangSmith Deployments: host-backend, listener, operator | make apply && make init-values && make deploy |
| 4 | + Agent Builder: tool-server, trigger-server + deep-agent LGP | make init-values && make deploy |
| 5 | + Insights + Polly: Clio analytics, Polly eval agent | make init-values && make deploy |

Enable optional addons

hcl
# infra/terraform.tfvars: set flags, then run make init-values && make deploy
# (enable_deployments also needs make apply first; see Pass 3)
enable_deployments   = true   # required for Agent Builder and Polly
enable_agent_builder = true   # requires enable_deployments = true
enable_insights      = true
enable_polly         = true   # requires enable_deployments = true + Polly entitlement
enable_usage_telemetry = true # optional extended telemetry
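
The dependencies noted in the comments above can be checked before running make init-values. This is a hypothetical sketch that covers only the two rules stated here (Agent Builder and Polly each require Deployments); the function name and argument order are assumptions, and the values would come from your terraform.tfvars.

```shell
# Hypothetical check of the documented flag dependencies.
# Arguments: "true"/"false" for deployments, agent_builder, polly.
check_addon_flags() {
  local deployments="$1" agent_builder="$2" polly="$3"
  local status=0
  if [ "$agent_builder" = "true" ] && [ "$deployments" != "true" ]; then
    echo "enable_agent_builder requires enable_deployments = true" >&2
    status=1
  fi
  if [ "$polly" = "true" ] && [ "$deployments" != "true" ]; then
    echo "enable_polly requires enable_deployments = true" >&2
    status=1
  fi
  return "$status"
}

check_addon_flags true true true && echo "flags are consistent"
check_addon_flags false true false || echo "fix the flags listed above"
```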

Sizing profiles

| Profile | When to use |
|---|---|
| default | Chart defaults: quick tests, no sizing overlay applied |
| minimum | Absolute floor that fits e2-standard-4: demos, CI smoke tests |
| dev | Single replica, minimal resources: dev/CI environments |
| production | Multi-replica with HPA, recommended for real workloads |
| production-large | High-memory / high-CPU: 50+ users or 1000+ traces/sec |
hcl
# infra/terraform.tfvars — then: make init-values && make deploy
sizing_profile = "production"
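
A typo in the profile name is easy to miss, so a quick membership check against the profiles listed above can be run before make init-values. The helper below is a hypothetical sketch; what the real scripts do with an unknown profile is not documented here.

```shell
# Hypothetical check: succeed only for the sizing profiles named in this guide.
is_valid_sizing_profile() {
  case "$1" in
    default|minimum|dev|production|production-large) return 0 ;;
    *) return 1 ;;
  esac
}

for profile in production prod; do
  if is_valid_sizing_profile "$profile"; then
    echo "ok: $profile"
  else
    echo "unknown sizing_profile: $profile"
  fi
done
```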

Glossary

| Term | Meaning |
|---|---|
| values chain | deploy.sh loads Helm values files in order: base → overrides → sizing → addons. The last file wins on conflicts. |
| sizing profile | A pre-built Helm values file that sets resources, replicaCount, and HPA settings for all LangSmith components. Set via sizing_profile in terraform.tfvars. |
| enable_* flags | Boolean flags in terraform.tfvars that tell init-values.sh which addon values files to copy. No terraform apply is needed for pure Helm addons, only make init-values && make deploy. |
| Workload Identity | GKE's mechanism for pods to authenticate to GCP APIs (GCS, Secret Manager) without static credentials. The K8s service account is annotated with a GCP service account email, and an IAM binding lets it act as that account. |
| Fernet keys | Symmetric encryption keys used by Agent Builder, Insights, and Polly to encrypt stored state. Generated once by setup-env.sh. Never rotate after first deploy: changing them permanently corrupts existing data. |
| values-overrides.yaml | The live, gitignored file generated by init-values.sh. Contains your hostname, Cloud SQL endpoint, Redis endpoint, GCS bucket, and Workload Identity config. Do not edit directly; re-run make init-values. |
| KEDA | Kubernetes Event-Driven Autoscaling, required for LangSmith Deployments (Pass 3). Installed via Terraform (k8s-bootstrap module) when enable_deployments = true. |
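
The "last file wins" rule in the values chain follows from how Helm treats repeated -f flags: the right-most file takes precedence. The sketch below shows one way the ordered list could be turned into arguments. It is not the actual deploy.sh logic, and the sizing and addon file names are placeholders.

```shell
# Sketch: turn an ordered list of values files into helm -f arguments.
# Order encodes priority, because Helm lets later files override earlier
# ones. Paths containing spaces are not handled by this simple version.
values_args() {
  local args="" f
  for f in "$@"; do
    args="${args:+$args }-f $f"
  done
  printf '%s\n' "$args"
}

values_args values.yaml values-overrides.yaml sizing.yaml addons.yaml
# prints: -f values.yaml -f values-overrides.yaml -f sizing.yaml -f addons.yaml
```

Keeping the order in one place makes it easier to reason about which file a given setting finally comes from.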

Troubleshooting

Full reference
See the Troubleshooting guide for the complete issue list and diagnostic commands.
1. terraform apply fails: missing required GCP APIs (manual)
2. GKE API server not accessible after cluster creation (manual)
3. GKE nodes not joining cluster (NotReady) (manual)
4. Cloud SQL connection refused from GKE pods (manual)
5. Memorystore Redis connection timeout (manual)
6. cert-manager fails to issue Let's Encrypt certificate (manual)
7. GCS bucket access denied from LangSmith pods (manual)
8. terraform destroy fails: deletion protection enabled (manual)
9. Envoy Gateway admission webhook blocking resources (manual)
10. 405 Not Allowed on prompts, datasets, and other UI pages after upgrade to 0.13.26+ (manual)
11. Workload Identity not working: GCS permission denied (manual)
12. langsmith-ksa missing Workload Identity annotation (manual)
13. Helm release stuck in pending-upgrade (manual)
14. Secret Manager access denied (manual)
15. langsmith-postgres or langsmith-redis secret missing (manual)