LangSmith Azure AKS

Quickstart

Get from zero to a running LangSmith instance on AKS in under an hour.

First time?
Run these commands in order. Each step builds on the previous. Return to the full guide below for configuration details, advanced options, and per-pass troubleshooting.
# 1 — Unzip the Terraform modules provided by your LangChain SA
unzip azure.zip
cd azure
# 2 — Generate terraform.tfvars interactively
#     Re-running is safe — Enter accepts current values
make quickstart
# 3 — Bootstrap secrets from Key Vault
#     Prompts on first run, reads from Key Vault on repeat
make setup-env
# 4 — Check prerequisites (az CLI, resource providers, RBAC, quotas)
make preflight
# 5 — Deploy infrastructure (~15–20 min)
#     Skip make plan on a fresh deploy — kubernetes_manifest requires a live cluster
make init
make apply
# 6 — Configure kubectl and create K8s secrets
make kubeconfig
make k8s-secrets
# 7 — Deploy LangSmith (~10 min)
make init-values
make deploy
# 8 — Check status
make status

# Get the public IP from the ingress
kubectl get ingress -n langsmith
What gets deployed
Pass 1 creates the AKS cluster, Azure DB for PostgreSQL, Azure Cache for Redis, Blob Storage, Key Vault, cert-manager, and KEDA. Pass 2 installs the LangSmith Helm chart (~25 pods). Passes 3–5 are optional add-ons (Deployments, Agent Builder, Insights).
Professional Services — Azure AKS

LangSmith on Azure
Self-hosted deployment on AKS, managed with Terraform.

Changelog: Check the Self-Hosted Changelog before upgrading for breaking changes, new variables, and Helm chart notes.
| Pass | Component | Duration | What it covers | Status |
|------|-----------|----------|----------------|--------|
| 01 | Infrastructure | ~25 min | AKS cluster, VNet, PostgreSQL, Redis, Blob Storage, Key Vault, cert-manager, KEDA | Required |
| 02 | LangSmith | ~10 min | K8s secrets from Key Vault + LangSmith Helm chart: traces, prompts, evaluations, org management | Required |
| 03 | LangSmith Deployments | ~5 min | Deploy and manage LangGraph graphs as API servers from the LangSmith UI | Optional |
| 04 | Agent Builder | ~10 min | AI-assisted LangGraph agent creation from the UI; requires Pass 3 | Optional |
| 05 | Insights | ~5 min | AI-powered trace analytics; requires Pass 3 | Optional |

Two deployment tiers

| Tier | Postgres | Redis | ClickHouse | Use case |
|------|----------|-------|------------|----------|
| Light | In-cluster pod | In-cluster pod | In-cluster pod | Demo / POC / short-lived dev |
| Production | Azure DB for PostgreSQL | Azure Cache for Redis Premium | LangChain Managed | Persistent, scalable deployments |
In-cluster ClickHouse is for dev/POC only
In-cluster ClickHouse runs as a single pod with no replication or backups. For production deployments, use LangChain Managed ClickHouse.
Blob Storage is always required
Regardless of tier, trace payloads must go to Azure Blob Storage — never to ClickHouse. Both tiers use external Blob Storage and Azure Key Vault.

Azure resources created (Pass 1)

| Resource | Type | Purpose |
|----------|------|---------|
| Resource Group | azurerm_resource_group | Container for all resources |
| Virtual Network | azurerm_virtual_network | Isolated network (10.0.0.0/17) |
| AKS Cluster | azurerm_kubernetes_cluster | Kubernetes; all workloads run here |
| NGINX Ingress | Helm (ingress-nginx) | External load balancer + TLS termination |
| PostgreSQL Flexible Server | azurerm_postgresql_flexible_server | Org config, run metadata (production tier) |
| Redis Cache Premium | azurerm_redis_cache | Trace ingestion queue, pub/sub (production tier) |
| Blob Storage | azurerm_storage_account | Raw trace objects, TTL-tiered (always) |
| Managed Identity | azurerm_user_assigned_identity | Workload Identity for pod → Blob auth |
| Azure Key Vault | azurerm_key_vault | Stores all LangSmith secrets |
| cert-manager | Helm | Automated TLS certificate management |
| KEDA | Helm | Event-driven autoscaling for workers |

Prerequisites

Required tools

bash
# Azure CLI (>= 2.50)
brew install azure-cli          # macOS
# Linux: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli

# Terraform (>= 1.5)
brew tap hashicorp/tap && brew install hashicorp/tap/terraform

# kubectl
brew install kubectl

# Helm (>= 3.x)
brew install helm

# Verify versions
az version --output table
terraform version
kubectl version --client
helm version

Required accounts and access

| Requirement | Notes |
|-------------|-------|
| Azure subscription | Owner, or Contributor + User Access Administrator. Creating the role assignments for Workload Identity requires Owner or User Access Administrator; Contributor alone is not enough. |
| LangSmith license key | Contact your LangChain sales representative. Required for self-hosted deployments. |
| DNS / hostname | A domain where you can create an A record (e.g. langsmith.example.com). Alternatively, use sslip.io for quick testing; no DNS registration is needed. |

Azure quota check

The default configuration uses Standard_D8s_v3 (8 vCPU, 32 GiB) for the default pool and Standard_D16s_v3 (16 vCPU, 64 GiB) for the large pool. Confirm sufficient quota before applying.

bash
# Check Dsv3 family quota
az vm list-usage --location <region> \
  --query "[?contains(name.value,'standardDSv3')].{name:name.localizedValue,used:currentValue,limit:limit}" \
  -o table
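
As a rough sizing aid, the vCPU demand of the two node pools can be estimated and compared with the quota figures reported by the command above. This is a minimal sketch: the node counts are hypothetical inputs, not values read from the Terraform defaults, and the function names are illustrative. Only the per-node vCPU figures (8 and 16) come from the VM sizes named in this section.

```shell
# Estimate the Dsv3-family vCPUs consumed by both node pools.
# Standard_D8s_v3 = 8 vCPU per node, Standard_D16s_v3 = 16 vCPU per node.
# Node counts are hypothetical inputs; use the maximum node counts you plan to allow.
vcpus_needed() {
  default_nodes="$1"
  large_nodes="$2"
  echo $(( default_nodes * 8 + large_nodes * 16 ))
}

# Succeeds (exit 0) when the remaining quota covers the estimate.
quota_ok() {
  needed_vcpus="$1"
  used="$2"
  limit="$3"
  [ $(( limit - used )) -ge "$needed_vcpus" ]
}

needed="$(vcpus_needed 3 2)"   # 3 default-pool nodes + 2 large-pool nodes
echo "vCPUs needed: $needed"
```

Feed `quota_ok` the `used` and `limit` columns from the quota table; if it fails, request a quota increase before running make apply.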

Log in and set subscription

bash
az login
az account show --query "{name:name, id:id, user:user.name}" -o table

# Switch subscriptions if needed
az account set --subscription "YOUR_SUBSCRIPTION_ID"

Preflight check (new machines and subscriptions)

Run make preflight from terraform/azure/ before your first make apply. It validates az CLI login, required Azure resource provider registrations (Microsoft.ContainerService, Microsoft.DBforPostgreSQL, etc.), RBAC roles (Contributor + User Access Administrator), and that terraform.tfvars is populated.

bash
cd terraform/azure
make preflight

Repository Layout

terraform/azure/
├── Makefile                    # Task runner — start here (make help)
├── infra/                      # Terraform root module
│   ├── main.tf                 # Module wiring
│   ├── variables.tf            # All input variables with descriptions
│   ├── outputs.tf              # Outputs consumed by helm scripts
│   ├── terraform.tfvars.example
│   ├── secrets.auto.tfvars     # Generated by setup-env.sh — gitignored, never commit
│   └── scripts/
│       ├── _common.sh          # Shared helpers: _parse_tfvar, _tfvar_is_true, color output
│       ├── setup-env.sh        # Bootstrap secrets → writes secrets.auto.tfvars
│       ├── preflight.sh        # Validates az login, resource providers, RBAC, tfvars
│       ├── create-k8s-secrets.sh  # Key Vault → langsmith-config-secret
│       ├── status.sh           # 9-section health check (supports --quick)
│       └── clean.sh            # Remove all generated/sensitive local files after teardown
├── helm/
│   ├── scripts/
│   │   ├── deploy.sh           # Helm values chain deploy (base + overrides + sizing + addons)
│   │   ├── init-values.sh      # TF outputs → values-overrides.yaml; copies sizing + addon files
│   │   ├── get-kubeconfig.sh   # az aks get-credentials wrapper
│   │   ├── preflight-check.sh  # Tools check + cluster connectivity + Helm repo
│   │   └── uninstall.sh        # Clean Helm uninstall (Azure LB warning included)
│   └── values/
│       ├── values.yaml                              # Azure base config (NGINX, Blob WI) — tracked in git
│       ├── values-overrides.yaml                    # Live file — gitignored, generated by init-values.sh
│       └── examples/                               # Source templates — tracked in git
│           ├── langsmith-values.yaml                     # Annotated reference
│           ├── langsmith-values-sizing-minimum.yaml      # Absolute minimum resources
│           ├── langsmith-values-sizing-dev.yaml          # Dev / CI sizing
│           ├── langsmith-values-sizing-production.yaml   # Production (multi-replica + HPA)
│           ├── langsmith-values-sizing-production-large.yaml  # High-volume (~1000 traces/sec)
│           ├── langsmith-values-agent-deploys.yaml       # Pass 3 — LangSmith Deployments
│           ├── langsmith-values-agent-builder.yaml       # Pass 4 — Agent Builder
│           ├── langsmith-values-insights.yaml            # Pass 5 — Insights / Clio
│           └── langsmith-values-polly.yaml               # Pass 5 — Polly
Makefile-driven workflow
All deployment operations are wrapped by make targets. make init-values generates helm/values/values-overrides.yaml from Terraform outputs automatically — no manual placeholder substitution needed. make deploy runs helm upgrade --install with the full values chain.

Terraform Module Reference

| Module | Description |
|--------|-------------|
| modules/networking/ | VNet with dedicated subnets for AKS, PostgreSQL, and Redis (PostgreSQL/Redis subnets only created when source = "external"). Multi-AZ zone configuration optional. |
| modules/k8s-cluster/ | AKS cluster (Azure CNI, OIDC, Workload Identity) + NGINX ingress. Workload Identity federated credentials for all LangSmith service accounts are centralized here. |
| modules/postgres/ | PostgreSQL 14 Flexible Server: private subnet, max_connections tuned, vector extensions. Provisioned only when postgres_source = "external". Multi-AZ and HA optional. |
| modules/redis/ | Redis Cache Premium: private subnet, TLS port 6380. Provisioned only when redis_source = "external". |
| modules/storage/ | Blob Storage account + container. Workload Identity federated credentials have moved to modules/k8s-cluster/. |
| modules/keyvault/ | Azure Key Vault (RBAC mode). Stores all LangSmith secrets. Terraform is the sole writer; setup-env.sh only reads. |
| modules/k8s-bootstrap/ | K8s namespace, service account (annotated for WI), cert-manager, KEDA, and K8s secrets for Postgres + Redis connection URLs. |
| modules/waf/ | Azure WAF policy (OWASP 3.2 + bot protection). Enabled via create_waf = true. Independent of other modules, so it is safe to add post-deploy. |
| modules/diagnostics/ | Log Analytics workspace + diagnostic settings for AKS, Key Vault, and PostgreSQL. Enabled via create_diagnostics = true. Required for production observability and audit logging. |
| modules/bastion/ | Jump VM for private AKS cluster access via az ssh vm. No public IP required. Enabled via create_bastion = true. |
| modules/dns/ | Azure DNS zone + A record for custom domain. Enabled via create_dns_zone = true. The A record is only created once ingress_ip is set: the first apply creates the zone only; then set the IP and re-apply. |

Configuration

Generate terraform.tfvars

The interactive wizard generates infra/terraform.tfvars by prompting for subscription, region, ingress controller, TLS approach, and sizing profile:

bash
cd terraform/azure
make quickstart

Prefer to edit manually? Copy the example instead:

bash
cp infra/terraform.tfvars.example infra/terraform.tfvars
vi infra/terraform.tfvars

Initialize Terraform

bash
make init

Minimum required values:

hcl
# ── Identity ─────────────────────────────────────────────────────────────
subscription_id = "YOUR_AZURE_SUBSCRIPTION_ID"

# ── Location ─────────────────────────────────────────────────────────────
location = "eastus"

# ── Naming / tagging ─────────────────────────────────────────────────────
identifier  = ""           # suffix appended to all resource names, e.g. -prod
environment = "dev"        # dev | staging | prod

# ── Deployment tier ───────────────────────────────────────────────────────
# Production (recommended):
postgres_source   = "external"    # Azure DB for PostgreSQL
redis_source      = "external"    # Azure Cache for Redis Premium
clickhouse_source = "in-cluster"  # in-cluster (dev/POC) or external (prod)

# Light / demo (all in-cluster — skip managed Postgres/Redis):
# postgres_source = "in-cluster"
# redis_source    = "in-cluster"

# ── PostgreSQL ────────────────────────────────────────────────────────────
postgres_admin_username = "langsmith"
# postgres_admin_password — set via setup-env.sh (written to secrets.auto.tfvars)

# ── LangSmith ────────────────────────────────────────────────────────────
langsmith_namespace = "langsmith"
langsmith_domain    = "langsmith.example.com"   # your FQDN

# ── TLS ───────────────────────────────────────────────────────────────────
tls_certificate_source = "letsencrypt"
letsencrypt_email      = "you@example.com"

# ── Deletion protection (disable for dev/test) ────────────────────────────
aks_deletion_protection      = false
postgres_deletion_protection = false
keyvault_purge_protection    = false

Bootstrap secrets with setup-env.sh

setup-env.sh writes a secrets.auto.tfvars file (gitignored, chmod 600) that Terraform picks up automatically. It prompts on the first run and reads silently from Key Vault on all subsequent runs.

bash
# Run from terraform/azure/
make setup-env
What setup-env.sh does
First run (Key Vault does not exist): prompts for postgres password, license key, admin password. Generates api_key_salt, jwt_secret, and Fernet encryption keys locally. Writes everything to secrets.auto.tfvars.

Subsequent runs (Key Vault exists): reads all secrets from Key Vault silently — no prompts, no generation. Overwrites secrets.auto.tfvars with stable values from Key Vault. Terraform is the sole Key Vault writer.
Never commit secrets.auto.tfvars
This file is gitignored and should never be committed. Regenerate it on any machine by running make setup-env from terraform/azure/.
Pass 1 — Required

Pass 1 — Azure Infrastructure

Goal
Provision all Azure resources. No Kubernetes workloads deployed yet.
Duration
~20–25 minutes
What's created
AKS, Postgres, Redis, Blob, Key Vault, cert-manager, KEDA
bash
cd terraform/azure
make setup-env    # prompts for secrets on first run, reads Key Vault on repeat
make preflight    # validates az CLI, providers, RBAC, tfvars
make init
make apply        # ~15-20 min
Skip make plan on a fresh deploy
make plan fails on a fresh deploy because kubernetes_manifest resources require a live cluster API during plan, and that API does not exist yet. Skip plan and run make apply directly; make apply handles resource ordering in three internal stages: Azure resources → AKS → K8s bootstrap.

Verify Terraform outputs

bash
# View all outputs (run from terraform/azure/)
terraform -chdir=infra output

# Key outputs consumed by helm scripts
terraform -chdir=infra output -raw keyvault_name
terraform -chdir=infra output -raw storage_account_name
terraform -chdir=infra output -raw storage_container_name
terraform -chdir=infra output -raw storage_account_k8s_managed_identity_client_id
Light deploy note
With postgres_source = "in-cluster" and redis_source = "in-cluster", the postgres_connection_url and redis_connection_url outputs are empty — the Helm chart manages its own Postgres and Redis pods. For a full copy-paste walkthrough of the all-in-cluster deploy (sslip.io hostname, Let's Encrypt TLS, no external DBs), see terraform/azure/BUILDING_LIGHT_LANGSMITH.md.

What Pass 1 provisions

Pass 1 creates all Azure infrastructure. No Kubernetes workloads are deployed yet — that happens in Pass 2.

| Resource | Type | Purpose |
|----------|------|---------|
| Resource Group | azurerm_resource_group | Container for all resources |
| Virtual Network | azurerm_virtual_network | Isolated network (10.0.0.0/17) |
| AKS Cluster | azurerm_kubernetes_cluster | Kubernetes; all workloads run here |
| Ingress Controller | Helm | External load balancer + TLS termination (nginx by default) |
| PostgreSQL Flexible Server | azurerm_postgresql_flexible_server | Org config, run metadata (external tier) |
| Redis Cache Premium | azurerm_redis_cache | Trace ingestion queue, pub/sub (external tier) |
| Blob Storage | azurerm_storage_account | Raw trace objects; always required |
| Managed Identity | azurerm_user_assigned_identity | Workload Identity for pod → Blob auth |
| Azure Key Vault | azurerm_key_vault | Stores all LangSmith secrets |
| cert-manager | Helm | Automated TLS certificate management |
| KEDA | Helm | Event-driven autoscaling for workers |

Step 1 — Configure terraform.tfvars

Run the interactive wizard from terraform/azure/:

cd terraform/azure
make quickstart

The wizard generates infra/terraform.tfvars covering: subscription, region, naming, AKS sizing, ingress controller, DNS/TLS, backend services, Key Vault, and security add-ons. Each section includes explanatory context and cost estimates.

Prefer to edit manually? Copy the example instead:

cp infra/terraform.tfvars.example infra/terraform.tfvars
vi infra/terraform.tfvars

Minimum required values:

# ── Identity ─────────────────────────────────────────────────────────────
subscription_id = "YOUR_AZURE_SUBSCRIPTION_ID"

# ── Location ─────────────────────────────────────────────────────────────
location = "eastus"

# ── Naming / tagging ─────────────────────────────────────────────────────
identifier  = "-prod"      # suffix appended to all resource names
environment = "prod"       # dev | staging | prod

# ── Deployment tier ───────────────────────────────────────────────────────
# Production (recommended):
postgres_source   = "external"    # Azure DB for PostgreSQL
redis_source      = "external"    # Azure Cache for Redis Premium
clickhouse_source = "in-cluster"  # in-cluster (dev/POC) or external (prod)

# ── DNS + TLS ────────────────────────────────────────────────────────────
dns_label              = "langsmith-prod"   # → langsmith-prod.eastus.cloudapp.azure.com
tls_certificate_source = "letsencrypt"
letsencrypt_email      = "you@example.com"

# ── Sizing ────────────────────────────────────────────────────────────────
sizing_profile = "production"   # minimum | dev | production | production-large
In-cluster ClickHouse is for dev/POC only
In-cluster ClickHouse runs as a single pod with no replication or backups. For production, use LangChain Managed ClickHouse.
Blob Storage is always required
Regardless of tier, trace payloads must go to Azure Blob Storage — never to ClickHouse. Both tiers provision external Blob Storage.

Step 2 — Bootstrap secrets

make setup-env

setup-env.sh writes infra/secrets.auto.tfvars (gitignored, chmod 600) — Terraform picks this file up automatically, no shell exports needed.

  • First run: prompts for PostgreSQL password, LangSmith license key, admin password, and admin email. Generates api_key_salt, jwt_secret, and four Fernet encryption keys locally.
  • Subsequent runs: reads all values silently from Azure Key Vault — no prompts.
Never commit secrets.auto.tfvars
This file is gitignored. Regenerate it on any machine by running make setup-env.
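
The internals of setup-env.sh are not shown in this guide, so the following is only a sketch of how locally generated secrets of this kind are commonly produced. A Fernet key is 32 random bytes in URL-safe base64 (44 characters). The function names and the hex format used for the generic secret are assumptions, not the script's actual behavior.

```shell
# Generate a Fernet-compatible key: 32 random bytes, URL-safe base64 (44 chars).
gen_fernet_key() {
  head -c 32 /dev/urandom | base64 | tr '+/' '-_' | tr -d '\n'
}

# Generate a random secret of N bytes, hex-encoded.
# The hex format is an assumption; the real script may use a different encoding.
gen_hex_secret() {
  head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

fernet_key="$(gen_fernet_key)"
echo "Fernet key length: ${#fernet_key}"
```

Because these values are random, a fresh machine must read them back from Key Vault rather than regenerate them, which is why repeat runs never generate new secrets.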

Step 3 — Preflight check

make preflight

Validates before you spend 20 minutes on a failing apply:

  • az CLI version and active login
  • Active subscription — prints name so you can confirm it is correct
  • 11 required Azure resource providers registered (Microsoft.ContainerService, Microsoft.DBforPostgreSQL, Microsoft.Cache, Microsoft.KeyVault, Microsoft.Storage, and others)
  • RBAC: requires Contributor + User Access Administrator (or Owner) at subscription scope
  • terraform.tfvars exists with location and subscription_id set
  • secrets.auto.tfvars exists and has a non-empty langsmith_license_key
  • terraform, kubectl, and helm binaries on PATH
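
The resource provider check in the list above can be approximated by hand with the az CLI. This is a sketch, not the contents of preflight.sh; the function name is hypothetical, and only five of the eleven provider namespaces are named in this guide.

```shell
# Print every provider namespace that is not in the "Registered" state.
# Requires a logged-in az CLI.
unregistered_providers() {
  for ns in "$@"; do
    state="$(az provider show --namespace "$ns" --query registrationState -o tsv)"
    if [ "$state" != "Registered" ]; then
      echo "$ns"
    fi
  done
}

# Example (uncomment to run against your subscription):
# unregistered_providers Microsoft.ContainerService Microsoft.DBforPostgreSQL \
#   Microsoft.Cache Microsoft.KeyVault Microsoft.Storage
```

Any namespace it prints can be registered with `az provider register --namespace <namespace>`; registration can take a few minutes to complete.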

Step 4 — Initialize Terraform

make init

Downloads the AzureRM provider, initializes the backend, and updates module sources. Required once per fresh clone and after any provider version change.

Step 5 — Apply infrastructure

make apply   # ~15–20 min on first run
Skip make plan on a fresh deploy
make plan fails on a fresh deploy because kubernetes_manifest resources require a live cluster API during plan — which does not exist yet. Skip plan and run make apply directly. It handles resource ordering in three internal stages: Azure resources → AKS → K8s bootstrap.

make apply creates all Azure resources in the correct order. On first run this takes approximately 15–20 minutes. Subsequent applies (e.g. enabling Pass 3) are much faster.
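
To illustrate the idea behind a staged apply, here is a dry-run sketch built on Terraform's -target flag. It is not the Makefile's actual implementation. The module addresses are hypothetical (check infra/main.tf for the real ones), and the script only prints commands unless DRY_RUN=0 is set.

```shell
# Dry-run sketch of a three-stage apply using Terraform's -target flag.
# Module addresses below are hypothetical; check infra/main.tf for the real ones.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

staged_apply() {
  # Stage 1: Azure resources that do not need a cluster
  run terraform -chdir=infra apply \
    -target=module.networking -target=module.keyvault -target=module.storage
  # Stage 2: the AKS cluster itself
  run terraform -chdir=infra apply -target=module.k8s_cluster
  # Stage 3: everything else, including kubernetes_manifest resources
  run terraform -chdir=infra apply
}

staged_apply
```

Once the cluster from stage 2 exists, the final untargeted apply can plan kubernetes_manifest resources against a live API, which is the constraint that makes a plain make plan fail on a fresh deploy.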

Step 6 — Pass 1.5: Cluster access and K8s secrets

After make apply completes, get cluster credentials and push secrets into the cluster:

make kubeconfig    # fetches AKS credentials, merges into ~/.kube/config
make k8s-secrets   # Key Vault → langsmith-config-secret in the langsmith namespace

make k8s-secrets reads 8 secrets from Key Vault and creates or updates langsmith-config-secret. Safe to re-run — uses --dry-run=client | kubectl apply so it updates in place without recreating the secret.
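
The create-or-update idiom mentioned above can be sketched as follows. This is the general kubectl pattern, not a copy of create-k8s-secrets.sh; the function name and the example key are hypothetical.

```shell
# Create-or-update a generic secret without deleting it first.
# "kubectl create --dry-run=client -o yaml" only renders the manifest locally;
# "kubectl apply -f -" then creates the secret or updates it in place.
apply_secret() {
  name="$1"
  namespace="$2"
  shift 2
  kubectl create secret generic "$name" --namespace "$namespace" "$@" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# Example (hypothetical key and value):
# apply_secret langsmith-config-secret langsmith --from-literal=example_key=example_value
```

Because nothing is deleted, pods that reference the secret keep a valid reference throughout the update.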

Verify Pass 1

# All nodes should be Ready
kubectl get nodes

# Bootstrap components — all Running
kubectl get pods -n cert-manager     # 3 pods
kubectl get pods -n keda             # 3 pods
kubectl get pods -n ingress-nginx    # 1 pod (if using nginx)

# NGINX LoadBalancer — save the EXTERNAL-IP
kubectl get svc ingress-nginx-controller -n ingress-nginx

# Workload Identity service account — should have client-id annotation
kubectl get sa langsmith-ksa -n langsmith \
  -o jsonpath='{.metadata.annotations}'

# View all Terraform outputs
terraform -chdir=infra output

# Key outputs consumed by Helm scripts
terraform -chdir=infra output -raw keyvault_name
terraform -chdir=infra output -raw storage_account_name
terraform -chdir=infra output -raw storage_container_name
terraform -chdir=infra output -raw storage_account_k8s_managed_identity_client_id

Or run everything — cluster credentials, K8s secrets, Helm values generation, and deploy — in one shot after make apply:

make deploy-all   # kubeconfig → k8s-secrets → init-values → deploy

Teardown

Always uninstall Helm before destroying infrastructure:

make uninstall   # removes Helm releases, LGP CRD, langsmith namespace (removes Azure Load Balancer)
make destroy     # terraform destroy — safe now that LB is gone
make clean       # removes local secrets, generated values, local tfstate (LAST)
Uninstall Helm before terraform destroy
The Azure Load Balancer created by the ingress controller is not tracked by Terraform. Azure blocks VNet deletion while the LB holds a subnet reference. Always run make uninstall first.
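
As an optional guard before make destroy, you can confirm that no LoadBalancer Services remain in the cluster. This sketch checks the Kubernetes side only; the function names are hypothetical, and the Azure-side load balancer may take a short while to disappear after the Service is deleted.

```shell
# List Services of type LoadBalancer that still exist in the cluster.
lb_services() {
  kubectl get svc --all-namespaces \
    -o jsonpath='{range .items[?(@.spec.type=="LoadBalancer")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
}

# Return success only when no LoadBalancer Service remains.
safe_to_destroy() {
  remaining="$(lb_services)"
  if [ -n "$remaining" ]; then
    echo "LoadBalancer services still present:" >&2
    echo "$remaining" >&2
    return 1
  fi
}

# Example: safe_to_destroy && make destroy
```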
Pass 1.5

Pass 1.5 — Cluster Access + K8s Secrets

bash
# Run from terraform/azure/
make kubeconfig    # wraps az aks get-credentials, reads cluster/RG names from terraform output
make k8s-secrets   # Key Vault → langsmith-config-secret in the langsmith namespace

make k8s-secrets reads 8 secrets from Key Vault and creates or updates langsmith-config-secret. It is safe to re-run — uses --dry-run=client | kubectl apply so it updates in place. langsmith-postgres-secret and langsmith-redis-secret are already created by Terraform (Pass 1).

Verify cluster is healthy

bash
kubectl get nodes

# Workload Identity service account (should have client-id annotation)
kubectl get sa langsmith-ksa -n langsmith \
  -o jsonpath='{.metadata.annotations}'

# cert-manager (3 pods Running)
kubectl get pods -n cert-manager

# KEDA (3 pods Running)
kubectl get pods -n keda

# NGINX — save the EXTERNAL-IP for the hostname
kubectl get svc ingress-nginx-controller -n ingress-nginx
sslip.io — free hostname without DNS registration
bash
NGINX_IP=$(kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
HOSTNAME="${NGINX_IP//./-}.sslip.io"
echo "Hostname: $HOSTNAME"
sslip.io resolves <ip-with-dashes>.sslip.io to the IP automatically — no DNS setup required.
Pass 1.6

Pass 1.6 — TLS ClusterIssuer

How you create the cert-manager ClusterIssuer depends on tls_certificate_source:

| Value | How ClusterIssuer is created | Recommended for |
|-------|------------------------------|-----------------|
| letsencrypt (default) ⭐ | Automatic: make deploy creates it via kubectl apply | Quick deploy, demo/POC, any hostname without DNS zone |
| dns01 | Automatic: Terraform creates it in Pass 1 | Custom domain with Azure DNS zone (create_dns_zone = true) |
| none | Skipped: no TLS | Bring your own TLS |

HTTP-01 with Azure Public IP DNS label (letsencrypt — recommended quick path)

Set dns_label in terraform.tfvars to get a free Azure subdomain (<label>.<region>.cloudapp.azure.com) — no DNS registration required:

hcl
dns_label              = "langsmith-<identifier>"   # → langsmith-<identifier>.eastus.cloudapp.azure.com
tls_certificate_source = "letsencrypt"
letsencrypt_email      = "you@example.com"

make deploy automatically handles both steps for HTTP-01:

  1. Annotates the NGINX LoadBalancer service with service.beta.kubernetes.io/azure-dns-label-name — this is what makes the Azure DNS label resolve to your public IP.
  2. Creates the letsencrypt-prod ClusterIssuer via kubectl apply (idempotent — skipped if already present).
No manual kubectl apply needed for letsencrypt
The ClusterIssuer is created automatically by make deploy (deploy.sh). You do not need to run kubectl apply -f letsencrypt-issuers.yaml manually. kubernetes_manifest cannot be used in Terraform for this — it requires a live k8s API during terraform plan, which does not exist on a fresh deploy.
bash
# After make deploy — verify the ClusterIssuer and DNS label are in place
kubectl get clusterissuer letsencrypt-prod
# NAME               READY   AGE
# letsencrypt-prod   True    30s

kubectl get svc ingress-nginx-controller -n ingress-nginx \
  -o jsonpath='{.metadata.annotations.service\.beta\.kubernetes\.io/azure-dns-label-name}'
# Expected: langsmith-<identifier>
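
For reference, the hostname that results from a DNS label follows the `<label>.<region>.cloudapp.azure.com` pattern described above. The helper below is a small illustrative sketch (the function name is hypothetical) for building the name you would expect to see on the ingress and certificate.

```shell
# Build the hostname Azure assigns to a public IP that carries a DNS label.
azure_label_fqdn() {
  label="$1"
  region="$2"
  echo "${label}.${region}.cloudapp.azure.com"
}

azure_label_fqdn langsmith-prod eastus
```

With dns_label = "langsmith-prod" and location = "eastus", this yields langsmith-prod.eastus.cloudapp.azure.com, which should match the host shown by kubectl get ingress -n langsmith.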

DNS-01 (dns01 — automatic, custom domain)

When tls_certificate_source = "dns01", Terraform creates the letsencrypt-prod ClusterIssuer automatically during Pass 1. cert-manager uses Azure Workload Identity to manage DNS TXT records — no static service principal required.

Required variables in terraform.tfvars:

hcl
tls_certificate_source          = "dns01"
letsencrypt_email               = "you@example.com"
create_dns_zone                 = true
dns_zone_name                   = "langsmith.mycompany.com"
dns_resource_group_name         = "langsmith-rg<identifier>"
# cert_manager_identity_client_id is wired automatically from k8s-cluster output
bash
# After terraform apply — verify ClusterIssuer was created
kubectl get clusterissuer letsencrypt-prod
Pass 2 — Required

Pass 2 — LangSmith Base Platform

Goal
Generate Helm values from Terraform outputs + deploy the LangSmith Helm chart.
Duration
~10 minutes
Prerequisite
Pass 1.5 complete (kubeconfig + k8s-secrets).
LangSmith Azure — Pass 2 Architecture (External Postgres + Redis)

2a — Generate Helm values

Run from terraform/azure/:

bash
make init-values

Reads terraform output and terraform.tfvars and generates helm/values/values-overrides.yaml with all placeholders filled: hostname, storage account name, Workload Identity client ID, DB connection references, ingress/TLS block, and service account annotations. Also copies the sizing overlay and any enabled addon overlays from examples/.

Admin email
The admin email is read from langsmith_admin_email in terraform.tfvars (set during make setup-env) and written into values-overrides.yaml automatically. No manual editing of the generated file is needed.

2b — Deploy LangSmith

bash
make deploy

Runs the full values chain: values.yaml → values-overrides.yaml → sizing overlay → any enabled addon overlays. Annotates the NGINX LoadBalancer with the Azure DNS label, creates the letsencrypt-prod ClusterIssuer if needed, and runs helm upgrade --install with --timeout 20m.
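
The order of the values chain matters because Helm gives precedence to the right-most -f file. The sketch below shows one way such a chain could be assembled; it is not deploy.sh itself, the function name is hypothetical, and the overlay file names are supplied by the caller.

```shell
# Build the "-f" arguments for helm in precedence order: later files override earlier ones.
# Overlay file names are passed in by the caller; files that do not exist are skipped.
# Note: paths containing spaces are not supported by this simple sketch.
values_args() {
  dir="$1"
  shift
  args=""
  for f in values.yaml values-overrides.yaml "$@"; do
    if [ -f "$dir/$f" ]; then
      args="$args -f $dir/$f"
    fi
  done
  echo "${args# }"
}

# Example (overlay name taken from helm/values/examples/):
# helm upgrade --install langsmith langchain/langsmith --timeout 20m \
#   $(values_args helm/values langsmith-values-sizing-production.yaml)
```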

Or run Pass 1.5 + Pass 2 in one shot after make apply:

bash
make deploy-all   # kubeconfig → k8s-secrets → init-values → deploy
Why --timeout 20m
The langsmith-backend-auth-bootstrap Job runs DB migrations and org initialization as a post-install hook. This takes up to 5 minutes on first install. Without a long timeout, helm may report failure even though the install eventually succeeds. See issue #4.
Watch pods in a second terminal
bash
# macOS — install watch first
brew install watch
watch kubectl get pods -n langsmith

# Without watch
while true; do clear; kubectl get pods -n langsmith; sleep 3; done

2c — Verify

bash
kubectl get pods -n langsmith      # all Running or Completed
kubectl get ingress -n langsmith   # host + TLS assigned
kubectl get certificate -n langsmith  # READY: True

Expected pod state (all Running after ~5 minutes):

langsmith-ace-backend-xxxxx              1/1   Running     0   5m
langsmith-backend-xxxxx                  1/1   Running     0   5m
langsmith-backend-xxxxx                  1/1   Running     0   5m
langsmith-backend-xxxxx                  1/1   Running     0   5m
langsmith-backend-auth-bootstrap-xxxxx   0/1   Completed   0   5m
langsmith-backend-ch-migrations-xxxxx    0/1   Completed   0   5m
langsmith-backend-migrations-xxxxx       0/1   Completed   0   5m
langsmith-clickhouse-0                   1/1   Running     0   5m
langsmith-frontend-xxxxx                 1/1   Running     0   5m
langsmith-ingest-queue-xxxxx             1/1   Running     0   5m  (×3)
langsmith-platform-backend-xxxxx         1/1   Running     0   5m
langsmith-playground-xxxxx               1/1   Running     0   5m
langsmith-queue-xxxxx                    1/1   Running     0   5m  (×3)

Open https://<HOSTNAME> and log in with initialOrgAdminEmail + admin password from Key Vault.
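
If some pods are not in the expected state, a small filter over kubectl output narrows down which ones to inspect. This is an illustrative sketch with a hypothetical function name; it looks at the STATUS column only, so a pod that is Running but not yet Ready will not be flagged.

```shell
# Print pods whose STATUS column is neither Running nor Completed.
pods_not_ready() {
  kubectl get pods -n "${1:-langsmith}" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}

# Example: pods_not_ready langsmith
```

Follow up on anything it prints with kubectl describe pod and kubectl logs in the same namespace.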

Pass 2 pod resource reference

| Pod | CPU req/limit | Mem req/limit | HPA min/max | WI |
|-----|---------------|---------------|-------------|----|
| langsmith-backend | 1000m / 2000m | 2Gi / 4Gi | 3 / 10 | |
| langsmith-platform-backend | 500m / 1000m | 1Gi / 2Gi | 1 / 10 | |
| langsmith-frontend | 500m / 1000m | 1Gi / 2Gi | 1 / 10 | |
| langsmith-playground | 500m / 1000m | 1Gi / 2Gi | 1 / 10 | |
| langsmith-queue | 1000m / 2000m | 2Gi / 4Gi | 3 / 10 | |
| langsmith-ingest-queue | 1000m / 2000m | 2Gi / 4Gi | 3 / 10 | |
| langsmith-ace-backend | 500m / 1000m | 1Gi / 2Gi | 1 / 5 | |
| langsmith-clickhouse | 3500m / 8000m | 15Gi / 32Gi | StatefulSet | |

HPA scales on CPU ≥ 50% or Memory ≥ 80%. KEDA additionally scales queue and ingest-queue on Redis queue depth.
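
To make the scaling thresholds concrete, the standard Kubernetes HPA calculation is desired = ceil(current replicas × current utilization / target utilization), clamped to the min/max range. The sketch below works through that arithmetic for a single metric. It ignores the HPA tolerance band and stabilization window, and the function name is hypothetical.

```shell
# desired = ceil(current_replicas * current_utilization / target_utilization),
# clamped to [min, max]. Integer arithmetic; utilizations are percentages.
desired_replicas() {
  current="$1"
  util="$2"
  target="$3"
  min="$4"
  max="$5"
  desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

# langsmith-backend: 3 replicas at 75% CPU against the 50% target, HPA range 3 to 10
desired_replicas 3 75 50 3 10
```

The example evaluates to 5 replicas (3 × 75 / 50 = 4.5, rounded up). When both CPU and memory metrics are configured, the HPA computes a value per metric and uses the larger one.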

What Pass 2 deploys

Pass 2 installs the LangSmith Helm chart. It reads all secrets from langsmith-config-secret (created in Pass 1.5) and all infrastructure configuration from Terraform outputs.

Prerequisites: Pass 1 complete (make apply) and Pass 1.5 complete (make kubeconfig && make k8s-secrets).

Two deployment paths

| Path | Command | When to use |
|------|---------|-------------|
| Helm path (default) | make init-values && make deploy | Interactive output, kubeconfig refresh, pre-flight checks. Best for first-time deploys and day-2 re-deploys. |
| Terraform path | make init-app && make apply-app | Helm release + K8s secrets + Workload Identity SA managed in Terraform state. Best for GitOps and CI/CD pipelines. |

Helm path (recommended)

Step 1 — Generate Helm values

cd terraform/azure
make init-values

make init-values reads terraform output and terraform.tfvars and generates helm/values/values-overrides.yaml with all fields populated:

  • config.hostname — your FQDN (from dns_label or langsmith_domain)
  • config.initialOrgAdminEmail — the first org admin account
  • config.existingSecretName: langsmith-config-secret — secrets reference
  • config.blobStorage — storage account name + container + Workload Identity client ID
  • Workload Identity annotations for 5 service accounts (backend, platform-backend, queue, ingest-queue, host-backend)
  • Ingress + TLS block (cert-manager annotation, TLS secret name)
  • Postgres and Redis external secret references (if postgres_source = "external" / redis_source = "external")

Also copies the sizing overlay and any enabled addon overlays from helm/values/examples/ into helm/values/.

Admin email is set automatically
The admin email is read from langsmith_admin_email in terraform.tfvars (set during make setup-env) and written into values-overrides.yaml automatically. No manual editing needed.

Step 2 — Deploy LangSmith

make deploy   # ~10 min

make deploy handles:

  1. Validates values-overrides.yaml exists (fails fast with make init-values hint if missing)
  2. Refreshes kubeconfig via az aks get-credentials
  3. Annotates the LoadBalancer service with service.beta.kubernetes.io/azure-dns-label-name — required for Azure to assign the DNS label to the public IP
  4. Creates the letsencrypt-prod cert-manager ClusterIssuer if tls_certificate_source = "letsencrypt" (idempotent)
  5. Runs preflight checks: kubectl, helm, az, terraform on PATH; cluster connectivity; Helm repo updated
  6. Verifies langsmith-config-secret exists — auto-creates from Key Vault if missing
  7. Builds and logs the values chain: values.yaml → values-overrides.yaml → sizing overlay → addon overlays
  8. Guards against stuck Helm releases: auto-rolls back pending-upgrade state before proceeding
  9. Runs helm upgrade --install langsmith langchain/langsmith --timeout 20m
  10. Waits for core deployments to roll out
  11. Annotates the langsmith-ksa service account with the Workload Identity client ID
  12. Prints the access URL and login credentials location
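
Step 8 above can be pictured as a small decision on the release status that Helm reports. The sketch below is an illustration, not the logic of the actual deploy script: `release_action` is a hypothetical helper, and treating `pending-install` as a manual case is an assumption (a first install has no earlier revision to roll back to). The status names are the ones Helm reports under `.info.status` in `helm status -o json`.

```shell
# Map a Helm release status to the action a deploy wrapper might take.
# Prints one of: rollback, manual, proceed.
release_action() {
  case "$1" in
    pending-upgrade) echo rollback ;;  # interrupted upgrade: return to the previous revision
    pending-install) echo manual ;;    # assumption: no earlier revision exists, so inspect by hand
    *)               echo proceed ;;   # deployed, failed, etc.: helm upgrade --install can run
  esac
}

# Usage (requires helm and jq on PATH):
#   status=$(helm status langsmith -n langsmith -o json | jq -r '.info.status')
#   if [ "$(release_action "$status")" = rollback ]; then
#     helm rollback langsmith -n langsmith
#   fi
```

Keeping the decision in a pure function makes it easy to check separately from the cluster calls.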

Or run Pass 1.5 + Pass 2 in one shot after make apply:

make deploy-all   # kubeconfig → k8s-secrets → init-values → deploy
Why --timeout 20m
The langsmith-backend-auth-bootstrap Job runs DB migrations and org initialization as a post-install hook. This takes up to 5 minutes on first install. Without a long timeout, helm may report failure even though the install eventually succeeds.
Watch pods in a second terminal
# macOS
brew install watch
watch kubectl get pods -n langsmith

# Without watch
while true; do clear; kubectl get pods -n langsmith; sleep 3; done

Terraform path (alternative)

Use the Terraform path when you want the Helm release, K8s secrets, and Workload Identity service account managed in Terraform state.

# Copy and configure app vars
cp app/terraform.tfvars.example app/terraform.tfvars
vi app/terraform.tfvars   # set admin_email at minimum

# Pull infra outputs into app/infra.auto.tfvars.json + terraform init
make init-app

# Deploy Helm release + K8s secrets + WI service account via Terraform
make apply-app

Feature flags in app/terraform.tfvars:

sizing              = "production"   # minimum | dev | production | production-large
enable_agent_deploys  = true         # Pass 3 — LangSmith Deployments
enable_agent_builder  = true         # Pass 4 — Agent Builder (requires agent_deploys)
enable_insights       = true         # Pass 5 — Insights / ClickHouse
enable_polly          = true         # Pass 5 — Polly (requires agent_deploys)

End-to-end via Terraform (Pass 1 + Pass 2 in one shot):

make deploy-all-tf   # apply → init-values → init-app → apply-app

Verify Pass 2

# All pods Running or Completed (~17 pods)
kubectl get pods -n langsmith

# Ingress host + TLS assigned
kubectl get ingress -n langsmith

# TLS certificate issued
kubectl get certificate -n langsmith
# Expected: READY: True

# Helm release status
helm list -n langsmith

Expected pod state after ~5 minutes (long-running pods Running, migration and bootstrap Jobs Completed):

langsmith-ace-backend-xxxxx              1/1   Running     0   5m
langsmith-backend-xxxxx                  1/1   Running     0   5m
langsmith-backend-auth-bootstrap-xxxxx   0/1   Completed   0   5m
langsmith-backend-ch-migrations-xxxxx    0/1   Completed   0   5m
langsmith-backend-migrations-xxxxx       0/1   Completed   0   5m
langsmith-clickhouse-0                   1/1   Running     0   5m
langsmith-frontend-xxxxx                 1/1   Running     0   5m
langsmith-ingest-queue-xxxxx             1/1   Running     0   5m
langsmith-platform-backend-xxxxx         1/1   Running     0   5m
langsmith-playground-xxxxx               1/1   Running     0   5m
langsmith-queue-xxxxx                    1/1   Running     0   5m
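
To check a listing like the one above without reading it by eye, a small filter over `kubectl get pods --no-headers` output can flag anything that is neither Running nor Completed. This is a minimal sketch: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Read `kubectl get pods --no-headers` output on stdin.
# Prints the name of every pod whose STATUS is neither Running nor Completed.
# Exit status: 0 when every pod is healthy, 1 otherwise.
unhealthy_pods() {
  awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { print $1; bad = 1 }
       END { exit bad + 0 }'
}

# Usage:
#   kubectl get pods -n langsmith --no-headers | unhealthy_pods \
#     || echo "some pods are not ready yet"
```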

Open https://<HOSTNAME> and log in with initialOrgAdminEmail and the admin password from Key Vault:

# Retrieve admin password
az keyvault secret show \
  --vault-name $(terraform -chdir=infra output -raw keyvault_name) \
  --name langsmith-admin-password \
  --query value -o tsv

Values chain

make deploy applies Helm values files in this order (last file wins on conflicts):

1. helm/values/values.yaml                             — Azure base (NGINX, Blob WI, no Istio)
2. helm/values/values-overrides.yaml                   — hostname, WI client-id, auth, postgres/redis
3. helm/values/langsmith-values-sizing-<profile>.yaml  — resource requests + HPA settings
4. (addon files — only when enable_* flags are set)

All files in helm/values/ are gitignored (generated or contain live secrets). Source templates live in helm/values/examples/ and are copied by make init-values.
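
The ordering above maps directly onto repeated `-f` flags, since Helm gives precedence to the last file passed. The sketch below shows one way to assemble those flags; `values_chain_args` is a hypothetical helper, and skipping files that do not exist is an assumption made for the illustration (the real script may fail instead).

```shell
# Print helm "-f <file>" arguments for the values chain, one token per line,
# in precedence order (later files win).
# $1 = values directory, $2 = sizing profile, remaining args = addon overlay file names.
values_chain_args() (
  dir=$1; profile=$2; shift 2
  for f in values.yaml values-overrides.yaml "langsmith-values-sizing-$profile.yaml" "$@"; do
    # Assumption: overlays that are not present are silently skipped.
    [ -f "$dir/$f" ] && printf '%s\n' -f "$dir/$f"
  done
  exit 0
)

# Usage:
#   values_chain_args helm/values production langsmith-values-agent-deploys.yaml
```

Printing one token per line keeps paths with spaces intact if the output is later read line by line.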

Day-2 operations

make status         # 10-section health check
make status-quick   # skip Key Vault + K8s secret queries (faster)
make deploy         # re-deploy after any Helm value changes
make init-values    # re-generate values after Terraform changes
make kubeconfig     # refresh cluster credentials
make k8s-secrets    # re-create langsmith-config-secret from Key Vault
Pass 3 — Optional

Pass 3 — LangSmith Deployments

Goal
Enable LangGraph agent deployments from the UI. Adds host-backend, listener, and operator.
Duration
~5 minutes (rolling update)
Prerequisite
Pass 2 running.
LangSmith Azure — Pass 3 Architecture (LangSmith Deployments)

What gets added

| Pod | Role | WI |
|---|---|---|
| langsmith-host-backend | LangGraph control plane API: manages deployment lifecycle, stores state in shared PostgreSQL | |
| langsmith-listener | Watches host-backend, creates/updates LangGraphPlatform CRDs in Kubernetes | |
| langsmith-operator | Reconciles CRDs: creates per-deployment K8s Deployments, StatefulSets, Services | |

3a — Scale nodes, then enable in terraform.tfvars

Before enabling, bump default_node_pool_min_count to at least 5 — the operator spawns agent deployment pods on demand and needs node headroom.

hcl
# infra/terraform.tfvars
default_node_pool_min_count = 5      # operator pods need headroom
enable_deployments          = true

Then re-apply infra, regenerate values, and deploy:

bash
# Run from terraform/azure/
make apply          # scale up node pool
make init-values    # picks up enable_deployments = true
make deploy         # rolls out host-backend + listener + operator

init-values appends the deployments addon overlay (langsmith-values-agent-deploys.yaml) to the values chain, which sets:

yaml
config:
  deployment:
    enabled: true                        # REQUIRED — without this, listener and operator are skipped silently
    url: "https://<your-hostname>"       # must match config.hostname
WATCHOUT — config.deployment.url must include https://
Missing the protocol causes operator-deployed agents to stay stuck in DEPLOYING state. See issue #5.

3b — Deploy

bash
make deploy

3c — Verify

bash
kubectl get pods -n langsmith | grep -E "host-backend|listener|operator"
# Expected: all Running
kubectl get lgp -n langsmith          # list LangSmith Deployments
kubectl get crd | grep langchain      # operator CRDs registered

All three should be Running. Total pod count: ~20 Running + 3 Completed jobs.

WATCHOUT — config.deployment.enabled: true is required
Setting only config.deployment.url without enabled: true causes the chart to silently skip creating listener and operator — no error, they just never appear. See issue #5.

What Pass 3 adds

Pass 3 enables LangSmith Deployments — deploy and manage LangGraph graphs as API servers directly from the LangSmith UI. Three new pods are added to the cluster:

| Pod | Role | WI |
|---|---|---|
| langsmith-host-backend | LangSmith Deployments control plane API: manages deployment lifecycle, stores state in shared PostgreSQL | |
| langsmith-listener | Watches host-backend, creates/updates LangGraphPlatform CRDs in Kubernetes | |
| langsmith-operator | Reconciles CRDs: creates per-deployment K8s Deployments, StatefulSets, Services | |

Prerequisite: Pass 2 running.

Step 1 — Scale node pool

Before enabling, bump default_node_pool_min_count to at least 5. The operator spawns agent deployment pods on demand and needs node headroom:

# infra/terraform.tfvars
default_node_pool_min_count = 5      # operator pods need headroom
enable_deployments          = true
Scale nodes before enabling Deployments
Without sufficient node capacity, operator-spawned agent pods stay in Pending state indefinitely. Scale the node pool first, then enable.
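
Before enabling Deployments, it can help to confirm that the scaled node pool has actually come up. The sketch below counts Ready nodes from `kubectl get nodes --no-headers` output; `count_ready_nodes` is a hypothetical helper, and the threshold of 5 simply mirrors the `default_node_pool_min_count` value above.

```shell
# Count nodes whose STATUS column is exactly "Ready" in
# `kubectl get nodes --no-headers` output read from stdin.
# Cordoned nodes ("Ready,SchedulingDisabled") are not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage:
#   ready=$(kubectl get nodes --no-headers | count_ready_nodes)
#   [ "$ready" -ge 5 ] || echo "only $ready Ready nodes; wait for the pool to scale"
```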

Step 2 — Apply, regenerate values, deploy

cd terraform/azure
make apply          # scale up node pool (~5 min)
make init-values    # picks up enable_deployments = true → generates addon overlay
make deploy         # rolls out host-backend + listener + operator

make init-values appends the LangSmith Deployments addon overlay (langsmith-values-agent-deploys.yaml) to the values chain. It automatically injects:

config:
  deployment:
    enabled: true                          # REQUIRED — without this, listener and operator are skipped silently
    url: "https://<your-hostname>"         # must match config.hostname (with protocol)
    tlsEnabled: true                       # set based on tls_certificate_source
config.deployment.url must include https://
Missing the protocol causes operator-deployed agents to stay stuck in DEPLOYING state indefinitely. The URL is injected automatically by make init-values — do not set it manually in the overlay file, as it will be overwritten.
config.deployment.enabled: true is required
Setting only config.deployment.url without enabled: true causes the chart to silently skip creating listener and operator — no error, they just never appear.
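
Because a missing protocol fails silently, a quick check on the URL before deploying can save a debugging round. This is a minimal sketch, not part of the Makefile: `valid_deployment_url` is a hypothetical helper that only tests for the `https://` prefix and a non-empty remainder.

```shell
# Succeeds only when the value starts with https:// and has something after it.
valid_deployment_url() {
  case "$1" in
    https://?*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Usage:
#   valid_deployment_url "https://langsmith.example.com" || echo "config.deployment.url needs https://"
```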

Step 3 — Verify

# All three pods Running
kubectl get pods -n langsmith | grep -E "host-backend|listener|operator"

# LangSmith Deployments CRDs registered
kubectl get crd | grep langchain

# List LangSmith Deployments (empty on first deploy — populated when you create a deployment)
kubectl get lgp -n langsmith

Expected: langsmith-host-backend, langsmith-listener, and langsmith-operator all Running. Total pod count: ~20 Running + 3 Completed jobs.

KEDA scaling for Deployments workers

KEDA is already installed in Pass 1. With enable_deployments = true, the operator creates KEDA ScaledObject resources for each agent deployment's worker queue. Worker pods scale down to zero when idle and scale up based on Redis queue depth.

No KEDA configuration is needed in terraform.tfvars — the operator manages it automatically when creating agent deployments.

Terraform path

If you are using the Terraform Helm path (Pass 2 via make apply-app), enable LangSmith Deployments in app/terraform.tfvars:

enable_agent_deploys = true

Then:

make init-app     # refresh infra outputs
make apply-app    # update Helm release
Pass 4 — Optional

Pass 4 — Agent Builder

Goal
AI-assisted creation and management of LangGraph agents from the LangSmith UI.
Duration
~10 minutes
Prerequisite
Pass 3 must be enabled.
LangSmith Azure — Pass 4 + 5 Architecture (Agent Builder and Insights)

What gets added

| Pod | Type | Role |
|---|---|---|
| langsmith-agent-builder-tool-server | Static | MCP tool execution server: code/file editing tools for the AI |
| langsmith-agent-builder-trigger-server | Static | Webhook receiver and scheduled trigger engine |
| langsmith-agent-bootstrap | Job (Completed) | Registers the bundled Agent Builder agent via the operator; runs once |
| agent-builder-<hash> + queue + redis + lg-<hash>-0 | Dynamic (operator-managed) | Agent Builder agent deployment, created by the operator when the bootstrap Job runs |

4a — Enable in terraform.tfvars

Requires enable_deployments = true (Pass 3 must already be enabled).

hcl
# infra/terraform.tfvars
enable_deployments   = true
enable_agent_builder = true

Then regenerate values and deploy:

bash
# Run from terraform/azure/
make init-values
make deploy

init-values appends the agent builder addon overlay (langsmith-values-agent-builder.yaml) to the values chain.

Encryption key is read from langsmith-config-secret
Do not set config.agentBuilder.encryptionKey inline in values-overrides.yaml. The chart reads it from langsmith-config-secret via existingSecretName. Setting it inline would override the secret reference and create a mismatch. See issue #7.

4b — Deploy

bash
make deploy

4c — Verify

bash
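kubectl get pods -n langsmith | grep agent-builder
# Expected: tool-server Running, trigger-server Running, plus the operator-managed agent-builder-<hash> pods

# Pod names are lowercase, so match "agent-bootstrap" rather than "Bootstrap"
kubectl get pods -n langsmith | grep -E "tool-server|trigger-server|agent-bootstrap"
# Expected: tool-server Running, trigger-server Running, agent-bootstrap Completed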
kubectl get lgp -n langsmith   # operator-managed Agent Builder deployment

Expected: 3 static pods (tool-server, trigger-server, bootstrap Job) + 4 dynamic pods (api-server, queue, redis, postgres StatefulSet). Total: ~26 pods.

After make deploy, an Agent Builder section appears in the LangSmith UI.

WATCHOUT — Roll frontend after agentBootstrap completes
The agentBootstrap Job creates the langsmith-polly-config ConfigMap that the frontend reads for the Polly UI. If the frontend was already running when bootstrap completed, Polly shows "Unable to connect to LangGraph server". Fix: kubectl rollout restart deployment langsmith-frontend -n langsmith

What Pass 4 adds

Pass 4 enables the Agent Builder — visual AI-assisted creation and management of LangGraph agents from the LangSmith UI. No terraform apply is needed for this pass — it only requires make init-values && make deploy.

| Pod | Type | Role |
|---|---|---|
| langsmith-agent-builder-tool-server | Static | MCP tool execution server: code/file editing tools for the AI |
| langsmith-agent-builder-trigger-server | Static | Webhook receiver and scheduled trigger engine |
| langsmith-agent-bootstrap | Job (Completed) | Registers the bundled Agent Builder agent via the operator; runs once |
| agent-builder-<hash> + queue + redis + lg-<hash>-0 | Dynamic (operator-managed) | Agent Builder deployment, created by the operator when the bootstrap Job runs |

Prerequisite: Pass 3 must be enabled (enable_deployments = true). Enabling Agent Builder without Deployments causes a preflight error.
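
The dependency between the flags can be checked locally before running anything. The sketch below is a guess at what such a check might look like, not the actual preflight: `tfvar_bool` and `check_addon_prereqs` are hypothetical helpers, the parser only understands simple `name = true|false` lines, and applying the same rule to enable_insights and enable_polly is an assumption based on their stated Pass 3 requirement.

```shell
# Print "true" or "false" for a boolean flag in a tfvars file.
# Absent flags count as false; text after "#" is ignored.
tfvar_bool() {
  awk -F= -v key="$2" '
    { sub(/#.*/, ""); k = $1; v = $2; gsub(/[ \t]/, "", k); gsub(/[ \t]/, "", v) }
    k == key { val = v }
    END { if (val == "true") print "true"; else print "false" }
  ' "$1"
}

# Fail when an addon flag is on while enable_deployments is off.
check_addon_prereqs() (
  file=$1
  deployments=$(tfvar_bool "$file" enable_deployments)
  status=0
  for addon in enable_agent_builder enable_insights enable_polly; do
    if [ "$(tfvar_bool "$file" "$addon")" = true ] && [ "$deployments" != true ]; then
      echo "$addon = true requires enable_deployments = true" >&2
      status=1
    fi
  done
  exit "$status"
)

# Usage:
#   check_addon_prereqs infra/terraform.tfvars && make init-values
```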

Step 1 — Enable in terraform.tfvars

# infra/terraform.tfvars
enable_deployments   = true    # Pass 3 — required prerequisite
enable_agent_builder = true    # Pass 4

Step 2 — Regenerate values and deploy

cd terraform/azure
make init-values    # appends langsmith-values-agent-builder.yaml to values chain
make deploy         # rolling update — ~10 min for bootstrap Job to complete

make init-values appends the Agent Builder addon overlay (langsmith-values-agent-builder.yaml) to the values chain. This overlay:

  • Enables the Agent Builder UI and its two supporting services
  • Sets backend.agentBootstrap: true — a post-install job that registers Agent Builder as a LangSmith Deployment and creates the required ConfigMap
  • Sets conservative agent worker pod resources (1 CPU / 1 Gi) instead of the chart's default 4 CPU / 8 Gi

Step 3 — Verify

# Static pods Running, bootstrap Job Completed
# Pod names are lowercase, so match "agent-bootstrap" rather than "Bootstrap"
kubectl get pods -n langsmith | grep -E "tool-server|trigger-server|agent-bootstrap"

# Operator-managed dynamic pods (4 pods — api-server, queue, redis, postgres StatefulSet)
kubectl get pods -n langsmith | grep agent-builder

# Operator-managed LangSmith Deployment for Agent Builder
kubectl get lgp -n langsmith

Expected: 3 static pods (tool-server, trigger-server, bootstrap Job) + 4 dynamic pods. Total: ~26 pods. After make deploy, an Agent Builder section appears in the LangSmith UI navigation.

Roll frontend after agentBootstrap completes
The agentBootstrap Job creates the langsmith-polly-config ConfigMap that the frontend reads for the Polly UI. If the frontend was already running when bootstrap completed, Polly shows "Unable to connect to LangGraph server". Fix:
kubectl rollout restart deployment langsmith-frontend -n langsmith
Encryption key is read from langsmith-config-secret
Do not set config.agentBuilder.encryptionKey inline in values-overrides.yaml. The chart reads it from langsmith-config-secret via existingSecretName. Setting it inline overrides the secret reference and creates a mismatch.

Workload Identity for Agent Builder

Both langsmith-agent-builder-tool-server and langsmith-agent-builder-trigger-server need Workload Identity to access Azure Blob Storage. Their federated credentials are pre-registered in modules/k8s-cluster/main.tf — no additional setup is needed.

If you add a new pod that needs Blob access, update service_accounts_for_workload_identity in modules/k8s-cluster/variables.tf and run terraform apply -target=module.aks.

Terraform path

If using the Terraform Helm path, enable in app/terraform.tfvars:

enable_agent_deploys  = true   # required prerequisite
enable_agent_builder  = true

Then:

make init-app
make apply-app
Pass 5 — Optional

Pass 5 — Insights

Goal
AI-powered trace analytics (Clio). Surfaces patterns and anomalies in LangSmith traces.
Duration
~5 minutes
Prerequisite
Pass 3 must be enabled. Pass 4 and Pass 5 are independent — both require Pass 3 but not each other.

Pass 5 adds a single flag to the Helm values — no new static pods. Clio deploys as a dynamic LangGraph deployment via the operator when first invoked from the UI.

5a — Enable in terraform.tfvars

Requires enable_deployments = true (Pass 3 must already be enabled).

hcl
# infra/terraform.tfvars
enable_deployments = true
enable_insights    = true
enable_polly       = true

Then regenerate values and deploy:

bash
# Run from terraform/azure/
make init-values
make deploy

init-values appends the insights and polly addon overlays to the values chain.

5b — Deploy

bash
make deploy

5c — Verify

bash
kubectl get pods -n langsmith | grep -E "clickhouse|polly|clio"
# ClickHouse already running from Pass 2; Insights operator deploys clio pods
kubectl get pods -n langsmith -w     # watch for new clio/analytics pods to come up

helm get values langsmith -n langsmith | grep -A3 insights
# Expected: enabled: true

Pod count after Pass 5 is identical to Pass 4 (~22 running). Clio appears as a dynamic pod when invoked from the UI.

Encryption keys must never change after first enable
insights_encryption_key and polly_encryption_key must never change after first enable — changing either breaks all existing encrypted data permanently. There is no recovery path.
WATCHOUT — Roll frontend after first Polly enable
If Polly UI shows "Unable to connect to LangGraph server" after enabling, the frontend started before the bootstrap ConfigMap was ready. Fix: kubectl rollout restart deployment langsmith-frontend -n langsmith

What Pass 5 adds

Pass 5 enables two features — Insights and Polly — both of which require Pass 3 (LangSmith Deployments). They are independent of each other: you can enable either one without the other.

Insights: AI-powered trace analytics (Clio). Surfaces patterns and anomalies in LangSmith traces. Clio deploys as a dynamic LangGraph deployment via the operator when first invoked from the UI. No new static pods are added.

Polly: AI-powered evaluation and monitoring agent. Runs as a dynamic LangGraph deployment. The Polly overlay sets the worker resources (2 CPU / 4 Gi request, 4 CPU / 8 Gi limit) and scales between 1 and 5 replicas.

No terraform apply is needed for Pass 5 — only make init-values && make deploy.

Prerequisite: Pass 3 must be enabled (enable_deployments = true). Pass 4 and Pass 5 are independent — both require Pass 3 but not each other.

Step 1 — Enable in terraform.tfvars

# infra/terraform.tfvars
enable_deployments = true    # Pass 3 — required prerequisite
enable_insights    = true    # Pass 5 — Insights / Clio analytics
enable_polly       = true    # Pass 5 — Polly AI evaluation agent

You can enable just one:

enable_insights = true    # Insights only (Polly not needed)
# or
enable_polly    = true    # Polly only (Insights not needed)

Step 2 — Regenerate values and deploy

cd terraform/azure
make init-values    # appends insights and polly addon overlays to values chain
make deploy         # rolling update — ~5 min

make init-values generates the Insights overlay according to clickhouse_source in terraform.tfvars:

  • clickhouse_source = "in-cluster" → generates a minimal overlay (config.insights.enabled: true only). The Helm chart manages ClickHouse internally.
  • clickhouse_source = "external" → generates a full overlay with clickhouse.external.enabled: true and a langsmith-clickhouse secret reference. You must create this secret with the ClickHouse host and credentials before deploying.
Do not manually copy the Insights example file for in-cluster ClickHouse
The helm/values/examples/langsmith-values-insights.yaml example has clickhouse.external.enabled: true and existingSecretName: langsmith-clickhouse. Manually copying it when using in-cluster ClickHouse causes CreateContainerConfigError because the secret doesn't exist. Always use make init-values to generate the correct file.
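
The difference between the two generated overlays can be sketched as follows. This is an illustration of the behavior described above, not the real generator: `insights_overlay` is a hypothetical helper, and the exact nesting of `existingSecretName` under `clickhouse.external` is an assumption.

```shell
# Print an Insights overlay for the given clickhouse_source value.
insights_overlay() {
  case "$1" in
    in-cluster)
      # Minimal overlay: the chart manages ClickHouse itself.
      cat <<'EOF'
config:
  insights:
    enabled: true
EOF
      ;;
    external)
      # Full overlay: the langsmith-clickhouse secret must exist before deploying.
      cat <<'EOF'
config:
  insights:
    enabled: true
clickhouse:
  external:
    enabled: true
    existingSecretName: langsmith-clickhouse
EOF
      ;;
    *)
      echo "unknown clickhouse_source: $1" >&2
      return 1
      ;;
  esac
}

# Usage:
#   insights_overlay in-cluster
```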

Step 3 — Verify

# ClickHouse already running from Pass 2
# Insights and Polly deploy as dynamic pods when first invoked from the UI
kubectl get pods -n langsmith | grep -E "clickhouse|polly|clio"

# Watch for dynamic pods when you first use Insights in the UI
kubectl get pods -n langsmith -w

# Confirm Insights is enabled in Helm values
helm get values langsmith -n langsmith | grep -A3 insights
# Expected: enabled: true

Pod count after Pass 5 is identical to after Pass 4 at rest (~22 running). Clio and Polly appear as dynamic pods when invoked from the UI.

Encryption keys must never change after first enable
insights_encryption_key and polly_encryption_key must never change after first enable. Changing either permanently corrupts all existing encrypted data. There is no recovery path. These keys are stored in Key Vault and never rotated automatically.
Roll frontend after first Polly enable
If the Polly UI shows "Unable to connect to LangGraph server" after enabling, the frontend started before the bootstrap ConfigMap was ready. Fix:
kubectl rollout restart deployment langsmith-frontend -n langsmith

Terraform path

If using the Terraform Helm path, enable in app/terraform.tfvars:

enable_agent_deploys = true   # required prerequisite
enable_insights      = true
enable_polly         = true

Then:

make init-app
make apply-app

All 5 passes summary

After completing all passes, your deployment runs:

| Pass | New pods | Total ~running |
|---|---|---|
| Pass 2 | Core LangSmith (backend, frontend, queue, ingest-queue, clickhouse, etc.) | ~17 |
| Pass 3 | host-backend, listener, operator | ~20 |
| Pass 4 | tool-server, trigger-server, bootstrap Job + 4 dynamic Agent Builder pods | ~26 |
| Pass 5 | No new static pods (Clio + Polly appear dynamically on first use) | ~22 at rest |

Light Deploy (All In-Cluster)

For demos, POCs, or short-lived dev environments, skip the managed Postgres and Redis. The Helm chart manages all in-cluster pods.

LangSmith Azure — Light Deploy Architecture (All In-Cluster)

terraform.tfvars settings

hcl
postgres_source   = "in-cluster"
redis_source      = "in-cluster"
clickhouse_source = "in-cluster"

With these settings, no PostgreSQL or Redis subnets are created — the VNet contains only the AKS subnet. postgres_connection_url and redis_connection_url outputs are empty.

Helm values for light deploy

With postgres_source = "in-cluster" and redis_source = "in-cluster" set in terraform.tfvars, make init-values generates values-overrides.yaml without postgres/redis connection URL fields — the chart uses in-cluster pods instead.

bash
# Run from terraform/azure/
make k8s-secrets
make init-values
make deploy

For a full copy-paste walkthrough of the all-in-cluster deploy (sslip.io hostname, Let's Encrypt TLS, no external DBs), see terraform/azure/BUILDING_LIGHT_LANGSMITH.md.

Not for production
In-cluster Postgres and Redis have no persistence guarantees beyond the node lifecycle. Use the production tier (external managed services) for any deployment that holds persistent data.

Bring Your Own VNet

If you have an existing VNet (e.g. connected via ExpressRoute or with custom firewall rules), skip VNet creation:

hcl
# terraform.tfvars
create_vnet        = false
vnet_id            = "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>"
aks_subnet_id      = "/subscriptions/<sub-id>/.../subnets/<aks-subnet>"
postgres_subnet_id = "/subscriptions/<sub-id>/.../subnets/<postgres-subnet>"
redis_subnet_id    = "/subscriptions/<sub-id>/.../subnets/<redis-subnet>"

Subnet requirements

| Subnet | Requirement |
|---|---|
| AKS | /19 or larger. No delegation. Azure CNI assigns pod IPs from this range; each node consumes up to 30 pod IPs. |
| PostgreSQL | Any size. Must be delegated to Microsoft.DBforPostgreSQL/flexibleServers. No other resources. |
| Redis | /28 or larger. Must be exclusive to Redis (no other resources in the subnet). |
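
The AKS subnet size follows from simple arithmetic: with Azure CNI in node-subnet mode, each node takes one address for itself plus one per pod slot, and Azure reserves five addresses in every subnet. The sketch below estimates how many nodes a subnet can hold; `max_nodes_for_subnet` is a hypothetical helper, and the result is an upper bound that ignores surge nodes during upgrades and any internal load balancer addresses.

```shell
# Upper bound on AKS nodes that fit in a subnet.
# $1 = subnet prefix length (e.g. 19), $2 = max pods per node (e.g. 30).
max_nodes_for_subnet() {
  total=$(( 1 << (32 - $1) ))      # addresses in the subnet
  usable=$(( total - 5 ))          # Azure reserves 5 addresses per subnet
  echo $(( usable / ($2 + 1) ))    # 1 address for the node + 1 per pod slot
}

# Usage:
#   max_nodes_for_subnet 19 30
```

For a /19 with 30 pods per node this works out to 264 nodes, which is why a smaller subnet such as a /24 (8 nodes) runs out quickly once the autoscaler starts adding capacity.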

Terraform State Backend

For team use and production, store state in Azure Blob Storage.

bash
az group create --name my-tfstate-rg --location eastus
az storage account create \
  --name mytfstateaccount \
  --resource-group my-tfstate-rg \
  --sku Standard_LRS
az storage container create \
  --name tfstate \
  --account-name mytfstateaccount

Uncomment and configure the backend block in terraform/azure/infra/backend.tf:

hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "my-tfstate-rg"
    storage_account_name = "mytfstateaccount"
    container_name       = "tfstate"
    key                  = "langsmith.tfstate"
  }
}
bash
terraform init -reconfigure

Upgrading LangSmith

DB migrations are one-way
LangSmith uses Alembic forward-only migrations. After upgrading, you cannot downgrade — the old chart version will not recognize the newer schema. Test in a separate environment first. See issue #3.
bash
# Check available versions
helm repo update
helm search repo langchain/langsmith --versions | head -10

# Upgrade via Makefile — re-generates values from current terraform outputs, then deploys
# Run from terraform/azure/
make deploy
Encryption keys must never change
deployments_encryption_key, agent_builder_encryption_key, insights_encryption_key, and polly_encryption_key must stay stable across upgrades. They are stored in langsmith-config-secret from Key Vault — do not rotate them.
bash
# Check current deployed version
helm list -n langsmith
helm get metadata langsmith -n langsmith

Teardown

Uninstall Helm before terraform destroy
The Azure Load Balancer created by NGINX is not tracked by Terraform. Azure blocks VNet deletion while the LB holds a subnet reference. If you run make destroy first, it will stall. Always run make uninstall first. See issue #9.

Always run in this order — never skip steps:

bash
# Run from terraform/azure/
make uninstall   # removes Helm releases, LGP CRD, langsmith namespace (removes Azure Load Balancer)
make destroy     # terraform destroy — safe now that LB is gone
make clean       # removes local secrets, generated values, local tfstate (LAST)
Irreversible
make destroy permanently deletes the AKS cluster, PostgreSQL database (all data), Redis cache, and Blob Storage. Back up important data first.
Key Vault soft-delete
If keyvault_purge_protection = false (the dev/test default), purge the soft-deleted vault after destroy to allow immediate name reuse:
bash
az keyvault purge --name langsmith-kv<identifier> --location <region>
If keyvault_purge_protection = true, the vault name is reserved for 90 days — you cannot reuse the same identifier until the hold expires.
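
The vault name passed to the purge command follows the langsmith-kv<identifier> pattern. The sketch below derives that name and checks two constraints up front: the identifier rule used by this module (empty, or starting with a hyphen) and Azure's 24-character limit on Key Vault names. `default_keyvault_name` is a hypothetical helper, and it does not account for a custom keyvault_name override.

```shell
# Print the default Key Vault name for an identifier suffix.
# Fails when the identifier is malformed or the name would exceed 24 characters.
default_keyvault_name() {
  case "$1" in
    ""|-*) ;;  # empty or hyphen-prefixed identifiers are accepted
    *) echo "identifier must be empty or start with a hyphen" >&2; return 1 ;;
  esac
  name="langsmith-kv$1"
  if [ "${#name}" -gt 24 ]; then
    echo "vault name '$name' is longer than 24 characters" >&2
    return 1
  fi
  echo "$name"
}

# Usage:
#   az keyvault purge --name "$(default_keyvault_name -prod)" --location eastus
```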

Architecture Overview

LangSmith on Azure uses AKS with Azure CNI (pods get VNet IPs), OIDC Workload Identity for keyless blob access, NGINX ingress with cert-manager TLS, and private-endpoint-only PostgreSQL and Redis.

Production Deploy (External Postgres + Redis)

LangSmith Azure — Production Architecture (AKS, PostgreSQL, Redis, Blob Storage, Key Vault)

Networking topology

| Subnet | CIDR | Contains |
|---|---|---|
| AKS nodes + pods | subnet-0 (10.0.0.0/19) | All Kubernetes workloads (Azure CNI) |
| PostgreSQL | subnet-postgres (10.0.32.0/20) | Azure DB for PostgreSQL Flexible Server (external tier only) |
| Redis | subnet-redis (10.0.48.0/20) | Azure Cache for Redis Premium (external tier only) |

All subnets are private. PostgreSQL and Redis are accessible only from within the VNet via private DNS resolution. No public endpoints.

Workload Identity (Blob Storage)

LangSmith pods access Azure Blob Storage without static keys. Azure AD token exchange via the AKS OIDC issuer:

| Step | What happens |
|---|---|
| 1 | Pod has label azure.workload.identity/use: "true" and service account annotation azure.workload.identity/client-id: <id> |
| 2 | AKS Workload Identity webhook injects AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_FEDERATED_TOKEN_FILE |
| 3 | Pod presents K8s service account token to Azure AD OIDC endpoint |
| 4 | Azure AD issues short-lived access token for the Managed Identity |
| 5 | Pod reads/writes blobs; no static key in any secret or env var |
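
The token exchange in steps 3 and 4 only succeeds when the federated credential's subject matches the service account exactly. That subject has the fixed form system:serviceaccount:<namespace>:<service-account>. The sketch below prints the expected subjects for a set of components; both helpers are hypothetical, and the <release>-<component> naming of service accounts is an assumption rather than something confirmed here.

```shell
# Subject claim a federated identity credential must match for one service account.
wi_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Print subjects for several components.
# $1 = namespace, $2 = Helm release name, remaining args = component names.
# Assumption: service accounts are named <release>-<component>.
wi_subjects() (
  ns=$1; release=$2; shift 2
  for component in "$@"; do
    wi_subject "$ns" "$release-$component"
  done
)

# Usage:
#   wi_subjects langsmith langsmith backend platform-backend queue ingest-queue host-backend
```

Comparing this output with the subjects registered on the managed identity is a quick way to spot a namespace or release-name mismatch.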

Which pods need Workload Identity

| Pod | Pass | Needs WI |
|---|---|---|
| langsmith-backend | 2 | |
| langsmith-platform-backend | 2 | |
| langsmith-queue | 2 | |
| langsmith-ingest-queue | 2 | |
| langsmith-host-backend | 3 | |
| langsmith-listener | 3 | |
| langsmith-agent-builder-tool-server | 4 | |
| langsmith-agent-builder-trigger-server | 4 | |
| langsmith-frontend, langsmith-playground, langsmith-ace-backend, langsmith-clickhouse, langsmith-operator | 2–3 | |

All federated credentials are pre-registered in modules/k8s-cluster/main.tf. Workload Identity is centralized in the cluster module — federated credentials, the managed identity, and the OIDC issuer configuration all live there. If you add a new pod that accesses blob storage, add its service account name to the service_accounts_for_workload_identity list and re-apply.

Key Vault Secret Management

Azure Key Vault (RBAC mode) stores all LangSmith secrets. Terraform is the sole writer. setup-env.sh only reads from Key Vault after Pass 1.

| Secret name in Key Vault | K8s secret key | Used by |
|---|---|---|
| langsmith-api-key-salt | api_key_salt | API key hashing |
| langsmith-jwt-secret | jwt_secret | Basic Auth sessions |
| langsmith-license-key | langsmith_license_key | Enterprise license |
| langsmith-admin-password | initial_org_admin_password | Initial org admin |
| langsmith-deployments-encryption-key | deployments_encryption_key | Pass 3 Fernet encryption |
| langsmith-agent-builder-encryption-key | agent_builder_encryption_key | Pass 4 Fernet encryption |
| langsmith-insights-encryption-key | insights_encryption_key | Pass 5 Fernet encryption |
| langsmith-polly-encryption-key | polly_encryption_key | Polly agent Fernet encryption |
bash
# View Key Vault name (run from terraform/azure/)
terraform -chdir=infra output keyvault_name

# Read a secret directly
az keyvault secret show \
  --vault-name $(terraform -chdir=infra output -raw keyvault_name) \
  --name langsmith-api-key-salt \
  --query value -o tsv
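
The four encryption keys are Fernet keys, which have a recognizable shape: 32 bytes encoded as URL-safe base64, giving 43 characters from [A-Za-z0-9_-] followed by a single "=". The sketch below checks only that shape, which can catch a truncated or wrongly encoded value when reading a secret back. `looks_like_fernet_key` is a hypothetical helper, and passing it does not prove the key is the one the data was encrypted with.

```shell
# Succeeds when the argument has the shape of a Fernet key.
looks_like_fernet_key() {
  [ "${#1}" -eq 44 ] || return 1
  case "$1" in
    *[!A-Za-z0-9_=-]*) return 1 ;;   # character outside the URL-safe alphabet
    *=) ;;                           # must end with the padding character
    *) return 1 ;;
  esac
  case "${1%=}" in
    *=*) return 1 ;;                 # "=" may appear only once, at the end
  esac
  return 0
}

# Usage:
#   key=$(az keyvault secret show \
#     --vault-name "$(terraform -chdir=infra output -raw keyvault_name)" \
#     --name langsmith-insights-encryption-key --query value -o tsv)
#   looks_like_fernet_key "$key" || echo "unexpected key format"
```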

Resource Sizing

AKS node pools

| Pool | VM Size | vCPU | RAM | Min | Max | Purpose |
|---|---|---|---|---|---|---|
| default | Standard_D8s_v3 | 8 | 32 GB | 1 | 10 | Core LangSmith services, system pods |
| large | Standard_D16s_v3 | 16 | 64 GB | 0 | 2 | ClickHouse (15 GB RAM request), LGP agent pods |

Recommended max_count by pass

| Pass | What's added | Recommended max_count |
|---|---|---|
| Pass 2 | Core LangSmith (external Postgres + Redis) | 4 |
| Pass 3 | host-backend, listener, operator | 4 |
| Pass 4 | Agent Builder tool + trigger server | 5–6 |
| Pass 5 | Clio (Insights) analytics pods | 6+ |

To increase capacity — update terraform.tfvars and re-apply:

hcl
default_node_pool_max_count = 6   # increase as needed
bash
# Run from terraform/azure/
make apply   # AKS autoscaler picks up new max immediately — no node restart

IP Address Plan

| Range | CIDR | Used by |
|---|---|---|
| VNet | 10.0.0.0/17 | All resources |
| AKS nodes + pods | 10.0.0.0/19 | Azure CNI pod IPs |
| PostgreSQL | 10.0.32.0/20 | Delegated subnet (external tier only) |
| Redis | 10.0.48.0/20 | Exclusive subnet (external tier only) |
| K8s ClusterIP | 10.0.64.0/20 | K8s service IPs (not in VNet) |
| K8s DNS | 10.0.64.10 | CoreDNS service IP |
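
One constraint in this plan is easy to verify mechanically: the CoreDNS service IP must fall inside the ClusterIP range. The sketch below does the membership test with shell integer arithmetic; `ip_to_int` and `ip_in_cidr` are hypothetical helpers that handle plain dotted-quad IPv4 only (no leading zeros).

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
)

# Succeeds when address $1 lies inside CIDR block $2 (e.g. 10.0.64.0/20).
ip_in_cidr() (
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
)

# Usage:
#   ip_in_cidr 10.0.64.10 10.0.64.0/20 && echo "DNS IP is inside the service CIDR"
```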

Variable Reference

| Variable | Default | Description |
| --- | --- | --- |
| subscription_id | | Azure subscription ID (required) |
| location | eastus | Azure region |
| identifier | "" | Suffix appended to all resource names (e.g. -prod, -dev-dz). Must start with a hyphen or be empty. Internal hyphens allowed. |
| environment | dev | Environment tag on all resources |
| owner | "" | Owner tag applied to all resources |
| cost_center | "" | Cost center tag for billing attribution |
| postgres_source | external | external = Azure DB for PostgreSQL (private VNet). in-cluster = Helm chart manages its own Postgres pod (dev/demo only). |
| redis_source | external | external = Azure Cache for Redis (private VNet). in-cluster = Helm chart manages its own Redis pod (dev/demo only). |
| clickhouse_source | in-cluster | in-cluster = ClickHouse deployed as Helm pod (dev/POC only). external = LangChain Managed ClickHouse (recommended for production). |
| postgres_admin_username | langsmith | PostgreSQL admin username |
| postgres_admin_password | "" | PostgreSQL admin password (sensitive). Set via setup-env.sh. |
| postgres_subnet_address_prefix | ["10.0.32.0/20"] | CIDR for the PostgreSQL subnet |
| redis_subnet_address_prefix | ["10.0.48.0/20"] | CIDR for the Redis subnet |
| redis_capacity | 2 | Redis Cache tier (P2 = 13 GB) |
| default_node_pool_vm_size | Standard_D8s_v3 | AKS node VM size (8 vCPU, 32 GB). Use Standard_D4s_v3 for light/demo only. |
| default_node_pool_min_count | 1 | Min nodes for the default pool. Set to 3 for production (Pass 2 needs ~14.4 vCPU; 3× D8s_v3 provides 76% headroom). |
| default_node_pool_max_count | 10 | Max nodes for autoscaler. Increase as needed per pass. |
| sizing_profile | production | Helm sizing overlay: minimum, dev, production, or production-large. Read by init-values.sh and deploy.sh; Terraform ignores this value. |
| dns_label | "" | Azure Public IP DNS label for the ingress LoadBalancer. Works with nginx, istio, istio-addon, envoy-gateway. Results in <label>.<region>.cloudapp.azure.com. Leave empty to skip. |
| additional_node_pools | large: D16s_v3 0–2 | Extra node pools. Default includes a large pool (Standard_D16s_v3, 16 vCPU, 64 GB) scaled to zero when idle. Required for ClickHouse (15 GB RAM request). |
| aks_service_cidr | 10.0.64.0/20 | K8s ClusterIP range; must not overlap the VNet |
| aks_dns_service_ip | 10.0.64.10 | CoreDNS service IP; must be within aks_service_cidr |
| aks_deletion_protection | true | Prevent accidental AKS cluster deletion. Set false for dev/test. |
| ingress_controller | nginx | Ingress controller type. nginx deploys NGINX via Helm in the ingress-nginx namespace. |
| langsmith_namespace | langsmith | Kubernetes namespace for LangSmith workloads |
| langsmith_release_name | langsmith | Helm release name (used for Workload Identity federated credential subjects) |
| langsmith_domain | "" | Hostname for LangSmith (e.g. langsmith.example.com) |
| langsmith_helm_chart_version | "" | Pin a specific Helm chart version. Empty = use latest. |
| create_vnet | true | Create a new VNet. Set false to bring your own. |
| vnet_id | "" | Existing VNet resource ID. Required when create_vnet = false. |
| blob_ttl_enabled | true | Enable lifecycle TTL rules on blob container |
| blob_ttl_short_days | 14 | TTL for short-lived trace blobs |
| blob_ttl_long_days | 400 | TTL for long-lived trace blobs |
| keyvault_name | "" | Override Key Vault name (default: langsmith-kv<identifier>) |
| keyvault_purge_protection | true | Enable Key Vault purge protection. Disable before destroy to allow immediate name reuse. |
| postgres_deletion_protection | true | Prevent accidental PostgreSQL server deletion. Set false for dev/test. |
| tls_certificate_source | letsencrypt | letsencrypt = HTTP-01 via cert-manager (ClusterIssuer applied manually). dns01 = DNS-01 via Azure DNS + Workload Identity (ClusterIssuer created by Terraform). none = no TLS. |
| letsencrypt_email | "" | Email for Let's Encrypt notifications. Required when tls_certificate_source is letsencrypt or dns01. |
| cert_manager_identity_client_id | "" | Client ID of the cert-manager Managed Identity. Wired automatically from k8s-cluster output. Required when tls_certificate_source = "dns01". |
| dns_zone_name | "" | Azure DNS zone name (e.g. langsmith.mycompany.com). Required when tls_certificate_source = "dns01". |
| dns_resource_group_name | "" | Resource group containing the Azure DNS zone. Required when tls_certificate_source = "dns01". |
| langsmith_license_key | "" | LangSmith enterprise license key (sensitive). Stored in Key Vault. |
| langsmith_admin_password | "" | Initial admin password (sensitive). Stored in Key Vault as langsmith-admin-password. |
| langsmith_api_key_salt | "" | Salt for hashing API keys (sensitive). Generated by setup-env.sh. Must stay stable. |
| langsmith_jwt_secret | "" | JWT secret for Basic Auth sessions (sensitive). Generated by setup-env.sh. |
| langsmith_deployments_encryption_key | "" | Fernet key for LangSmith Deployments (Pass 3). Generated by setup-env.sh. Must stay stable. |
| langsmith_agent_builder_encryption_key | "" | Fernet key for Agent Builder (Pass 4). Generated by setup-env.sh. Must stay stable. |
| langsmith_insights_encryption_key | "" | Fernet key for Insights (Pass 5). Generated by setup-env.sh. Must stay stable; changing it permanently corrupts existing insights data. |
| langsmith_polly_encryption_key | "" | Fernet key for Polly agent. Stored in Key Vault as langsmith-polly-encryption-key. Must never change after first deploy; changing it breaks existing Polly data. |
| create_waf | false | Enable Azure WAF policy (OWASP 3.2 + bot protection). Independent of other optional modules; safe to add post-deploy. |
| create_diagnostics | false | Enable Log Analytics workspace + diagnostic settings for AKS, Key Vault, and PostgreSQL. Recommended for production observability and audit logging. |
| enable_aks_diag | true | Create the AKS diagnostic setting inside the diagnostics module. Uses a boolean flag (not a resource ID check) because count must be known at plan time. |
| enable_keyvault_diag | true | Create the Key Vault diagnostic setting inside the diagnostics module. |
| enable_postgres_diag | false | Create the PostgreSQL diagnostic setting inside the diagnostics module. Set to true when postgres_source = "external". |
| create_bastion | false | Enable a jump VM for private AKS cluster access via az ssh vm. No public IP required. |
| create_dns_zone | false | Enable Azure DNS zone + A record. Use when you own a custom domain and want Azure to manage DNS resolution. Required for DNS-01 cert issuance. |
| availability_zones | ["1"] | Availability zones for AKS node pools and PostgreSQL (e.g. ["1", "2", "3"]). Set to [] to disable zone pinning. |
| postgres_standby_availability_zone | "" | Zone for the PostgreSQL standby replica (e.g. "2"). Set when enabling zone-redundant HA mode. |
| enable_deployments | false | Pass 3: enable LangSmith Deployments (host-backend, listener, operator). Read by deploy.sh; Terraform ignores this value. |
| enable_agent_builder | false | Pass 4: enable Agent Builder UI. Read by deploy.sh; Terraform ignores this value. Requires enable_deployments = true. |
| enable_insights | false | Pass 5: enable Insights / Clio. Read by deploy.sh; Terraform ignores this value. Requires enable_deployments = true. |
| enable_polly | false | Pass 5: enable Polly AI eval agent. Read by deploy.sh; Terraform ignores this value. Requires enable_deployments = true. |

Postgres Module Variables

| Variable | Default | Description |
| --- | --- | --- |
| database_name | langsmith | Name of the PostgreSQL database to create and use in the connection URL. The connection_url output uses this variable instead of a hardcoded database name. |

Core Variables

| Variable | Default | Description |
| --- | --- | --- |
| subscription_id | | Azure subscription ID (required) |
| location | eastus | Azure region |
| identifier | "" | Suffix appended to all resource names (e.g. -prod, -dev-dz). Must start with a hyphen or be empty. |
| environment | dev | Environment tag on all resources |
| owner | "" | Owner tag applied to all resources |
| cost_center | "" | Cost center tag for billing attribution |
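
The identifier rule (empty, or a leading hyphen with internal hyphens allowed) can be checked before it reaches Terraform. This is a minimal sketch under an assumption: the source does not state which characters are permitted, so the pattern below allows only lowercase letters and digits between hyphens. The function name is hypothetical.

```shell
# Hypothetical pre-check for the identifier variable.
# Assumption: segments contain only lowercase letters and digits.
is_valid_identifier() {
  local re='^(-[a-z0-9]+(-[a-z0-9]+)*)?$'
  [[ "$1" =~ $re ]]
}

for candidate in "" "-prod" "-dev-dz" "prod" "-dev-"; do
  if is_valid_identifier "$candidate"; then
    echo "accepted: '$candidate'"
  else
    echo "rejected: '$candidate'"
  fi
done
```

Under that assumption the first three candidates are accepted and the last two are rejected, because one lacks the leading hyphen and the other ends with a hyphen.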

Deployment Tier

| Variable | Default | Description |
| --- | --- | --- |
| postgres_source | external | external = Azure DB for PostgreSQL (private VNet). in-cluster = Helm chart manages its own Postgres pod (dev/demo only). |
| redis_source | external | external = Azure Cache for Redis (private VNet). in-cluster = Helm chart manages its own Redis pod (dev/demo only). |
| clickhouse_source | in-cluster | in-cluster = ClickHouse deployed as Helm pod (dev/POC only). external = LangChain Managed ClickHouse (recommended for production). |

PostgreSQL

| Variable | Default | Description |
| --- | --- | --- |
| postgres_admin_username | langsmith | PostgreSQL admin username |
| postgres_admin_password | "" | PostgreSQL admin password (sensitive). Set via setup-env.sh. |
| postgres_subnet_address_prefix | ["10.0.32.0/20"] | CIDR for the PostgreSQL subnet |
| postgres_deletion_protection | true | Prevent accidental PostgreSQL server deletion. Set false for dev/test. |
| database_name | langsmith | Name of the PostgreSQL database to create. Used in the connection_url output. |

Redis

| Variable | Default | Description |
| --- | --- | --- |
| redis_subnet_address_prefix | ["10.0.48.0/20"] | CIDR for the Redis subnet |
| redis_capacity | 2 | Redis Cache tier (P2 = 13 GB) |

AKS Node Pools

| Variable | Default | Description |
| --- | --- | --- |
| default_node_pool_vm_size | Standard_D8s_v3 | AKS node VM size (8 vCPU, 32 GB). Use Standard_D4s_v3 for light/demo only. |
| default_node_pool_min_count | 1 | Min nodes for the default pool. Set to 3 for production. Set to 5 before enabling Pass 3. |
| default_node_pool_max_count | 10 | Max nodes for autoscaler. |
| additional_node_pools | large: D16s_v3 0–2 | Extra node pools. Default includes a large pool (Standard_D16s_v3, 16 vCPU, 64 GB) scaled to zero when idle. Required for ClickHouse (15 GB RAM request). |
| aks_service_cidr | 10.0.64.0/20 | K8s ClusterIP range; must not overlap the VNet. |
| aks_dns_service_ip | 10.0.64.10 | CoreDNS service IP; must be within aks_service_cidr. |
| aks_deletion_protection | true | Prevent accidental AKS cluster deletion. Set false for dev/test. |
| availability_zones | ["1"] | Availability zones for AKS node pools (e.g. ["1", "2", "3"]). Set to [] to disable zone pinning. |

Ingress Controller

| Variable | Default | Description |
| --- | --- | --- |
| ingress_controller | nginx | Ingress controller: nginx, istio-addon, istio, agic, or envoy-gateway. See INGRESS_CONTROLLERS.md for the full TLS compatibility matrix. |

DNS and TLS

| Variable | Default | Description |
| --- | --- | --- |
| dns_label | "" | Azure Public IP DNS label for the ingress LoadBalancer. Results in <label>.<region>.cloudapp.azure.com. Works with nginx, istio, istio-addon, envoy-gateway. |
| langsmith_domain | "" | Custom hostname for LangSmith (e.g. langsmith.example.com). Takes priority over dns_label. |
| tls_certificate_source | letsencrypt | letsencrypt = HTTP-01 via cert-manager. dns01 = DNS-01 via Azure DNS + Workload Identity. none = no TLS. |
| letsencrypt_email | "" | Email for Let's Encrypt notifications. Required when tls_certificate_source is letsencrypt or dns01. |
| cert_manager_identity_client_id | "" | Client ID of the cert-manager Managed Identity. Wired automatically from k8s-cluster output. Required when tls_certificate_source = "dns01". |
| create_dns_zone | false | Enable Azure DNS zone + A record. Required for DNS-01 cert issuance. |
| dns_zone_name | "" | Azure DNS zone name (e.g. langsmith.mycompany.com). Required when tls_certificate_source = "dns01". |
| dns_resource_group_name | "" | Resource group containing the Azure DNS zone. Required when tls_certificate_source = "dns01". |
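
The precedence between langsmith_domain and dns_label can be sketched as a small function. This is an illustration only: resolve_hostname is a hypothetical name, and what the deploy scripts do when both values are empty is not stated here, so the sketch simply returns a non-zero status in that case.

```shell
# Hypothetical helper: pick the hostname the way the table describes.
# Arguments: langsmith_domain, dns_label, location (Azure region).
resolve_hostname() {
  local domain=$1 label=$2 region=$3
  if [ -n "$domain" ]; then
    # A custom domain takes priority over the DNS label.
    echo "$domain"
  elif [ -n "$label" ]; then
    echo "${label}.${region}.cloudapp.azure.com"
  else
    # Assumption: neither value set is treated as "no hostname".
    return 1
  fi
}

resolve_hostname "" "mylabel" "eastus"
resolve_hostname "langsmith.example.com" "mylabel" "eastus"
```

The first call yields the cloudapp.azure.com name built from the label and region; the second yields the custom domain because it takes priority.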

LangSmith Application

| Variable | Default | Description |
| --- | --- | --- |
| langsmith_namespace | langsmith | Kubernetes namespace for LangSmith workloads |
| langsmith_release_name | langsmith | Helm release name (used for Workload Identity federated credential subjects) |
| langsmith_helm_chart_version | "" | Pin a specific Helm chart version. Empty = use latest. |
| sizing_profile | production | Helm sizing overlay: minimum, dev, production, or production-large. Read by init-values.sh; Terraform ignores this value. |

Blob Storage

| Variable | Default | Description |
| --- | --- | --- |
| blob_ttl_enabled | true | Enable lifecycle TTL rules on the blob container |
| blob_ttl_short_days | 14 | TTL for short-lived trace blobs |
| blob_ttl_long_days | 400 | TTL for long-lived trace blobs |

Key Vault

| Variable | Default | Description |
| --- | --- | --- |
| keyvault_name | "" | Override Key Vault name (default: langsmith-kv<identifier>) |
| keyvault_purge_protection | true | Enable Key Vault purge protection. Set false for dev/test to allow immediate name reuse after destroy. |

Network (BYO VNet)

| Variable | Default | Description |
| --- | --- | --- |
| create_vnet | true | Create a new VNet. Set false to bring your own. |
| vnet_id | "" | Existing VNet resource ID. Required when create_vnet = false. |

High Availability

| Variable | Default | Description |
| --- | --- | --- |
| postgres_high_availability_mode | "" | PostgreSQL HA mode (e.g. ZoneRedundant). Requires GeneralPurpose or MemoryOptimized SKU. |
| postgres_standby_availability_zone | "" | Zone for the PostgreSQL standby replica. Set when enabling zone-redundant HA. |

Optional Modules

| Variable | Default | Description |
| --- | --- | --- |
| create_waf | false | Enable Azure WAF policy (OWASP 3.2 + bot protection). Safe to add post-deploy. |
| create_diagnostics | false | Enable Log Analytics workspace + diagnostic settings for AKS, Key Vault, and PostgreSQL. Recommended for production. |
| enable_aks_diag | true | Create the AKS diagnostic setting inside the diagnostics module. |
| enable_keyvault_diag | true | Create the Key Vault diagnostic setting inside the diagnostics module. |
| enable_postgres_diag | false | Create the PostgreSQL diagnostic setting. Set true when postgres_source = "external". |
| create_bastion | false | Enable a jump VM for private AKS cluster access via az ssh vm. No public IP required. |

Addon Pass Flags

These flags are read by init-values.sh and deploy.sh. Terraform ignores them — they only affect which Helm addon overlay files are generated.

| Variable | Default | Description |
| --- | --- | --- |
| enable_deployments | false | Pass 3: enable LangSmith Deployments (host-backend, listener, operator). Scale default_node_pool_min_count to 5 first. |
| enable_agent_builder | false | Pass 4: enable Agent Builder UI. Requires enable_deployments = true. |
| enable_insights | false | Pass 5: enable Insights / Clio analytics. Requires enable_deployments = true. |
| enable_polly | false | Pass 5: enable Polly AI eval agent. Requires enable_deployments = true. |
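
Because Passes 4 and 5 depend on enable_deployments = true, a quick check of terraform.tfvars before make init-values can catch a missing prerequisite. The sketch below is not part of init-values.sh or deploy.sh; tfvar_bool and check_addon_flags are hypothetical helpers, and the parsing assumes each flag is written on its own line as name = true or name = false.

```shell
# Hypothetical helper: read a boolean flag from a tfvars file.
# Prints "false" when the flag is absent (the documented default).
tfvar_bool() {
  local file=$1 name=$2 value
  value=$(sed -nE "s/^[[:space:]]*${name}[[:space:]]*=[[:space:]]*(true|false).*/\1/p" "$file" | tail -n 1)
  echo "${value:-false}"
}

# Fail when an addon flag is on but enable_deployments is not.
check_addon_flags() {
  local file=$1 flag
  for flag in enable_agent_builder enable_insights enable_polly; do
    if [ "$(tfvar_bool "$file" "$flag")" = "true" ] &&
       [ "$(tfvar_bool "$file" enable_deployments)" != "true" ]; then
      echo "$flag = true requires enable_deployments = true" >&2
      return 1
    fi
  done
}

# Usage (from the directory that holds terraform.tfvars):
#   check_addon_flags terraform.tfvars && make init-values && make deploy
```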

Sensitive Variables (set via setup-env.sh)

These are written to secrets.auto.tfvars by make setup-env and stored in Azure Key Vault by Terraform. Never set these inline in terraform.tfvars.

| Variable | Description |
| --- | --- |
| langsmith_license_key | LangSmith enterprise license key |
| langsmith_admin_password | Initial org admin password |
| langsmith_api_key_salt | Salt for hashing API keys; must stay stable after first deploy |
| langsmith_jwt_secret | JWT secret for Basic Auth sessions |
| langsmith_deployments_encryption_key | Fernet key for LangSmith Deployments (Pass 3); must never change |
| langsmith_agent_builder_encryption_key | Fernet key for Agent Builder (Pass 4); must never change |
| langsmith_insights_encryption_key | Fernet key for Insights (Pass 5); must never change |
| langsmith_polly_encryption_key | Fernet key for Polly; must never change |
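
For reference, a Fernet key is 32 random bytes encoded as URL-safe base64, which gives a 44-character string. How setup-env.sh produces its keys is not shown here, so the following is only a sketch of the format; generate_fernet_key is a hypothetical name. In a real deployment the keys come from make setup-env and must not be regenerated after first use.

```shell
# Hypothetical helper: emit a string in Fernet key format.
# 32 random bytes -> standard base64 -> URL-safe alphabet.
generate_fernet_key() {
  head -c 32 /dev/urandom | base64 | tr '+/' '-_' | tr -d '\n'
}

key=$(generate_fernet_key)
echo "key length: ${#key}"
```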

Quick Reference

All commands run from terraform/azure/. Run make help to see the full target list. For copy-paste commands and expected outputs for each pass, see the Quick Reference page.

5-Pass deployment summary

| Pass | What | Make target |
| --- | --- | --- |
| 1 | AKS + Postgres + Redis + Blob + Key Vault + cert-manager + KEDA | make apply |
| 1.5 | Cluster credentials + K8s secrets from Key Vault | make kubeconfig && make k8s-secrets |
| 2 | LangSmith Helm (~25 pods production) | make init-values && make deploy |
| 3 | + LangSmith Deployments (enable_deployments = true); scale nodes to min 5 first | make apply && make init-values && make deploy |
| 4 | + Agent Builder (enable_agent_builder = true) | make init-values && make deploy |
| 5 | + Insights + Polly (enable_insights = true, enable_polly = true) | make init-values && make deploy |

Day-2 operations

bash
make status         # 9-section health check
make status-quick   # skip Key Vault + K8s queries
make deploy         # re-deploy after Helm value changes
make init-values    # re-generate values after Terraform changes
make kubeconfig     # refresh cluster credentials
make k8s-secrets    # re-create langsmith-config-secret
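
When debugging a re-deploy, it can help to picture how an ordered list of values files turns into Helm arguments: Helm gives priority to the right-most -f file. The sketch below is not taken from deploy.sh. build_values_args is a hypothetical helper, and the sizing and addon file names in the usage comment are placeholders, since the generated file names are not listed in this section.

```shell
# Hypothetical helper: build "-f <file>" arguments in load order,
# skipping files that do not exist. Later files override earlier ones.
build_values_args() {
  local dir=$1
  shift
  local args=() f
  for f in values.yaml values-overrides.yaml "$@"; do
    if [ -f "$dir/$f" ]; then
      args+=(-f "$dir/$f")
    fi
  done
  printf '%s\n' "${args[*]}"
}

# Usage with placeholder file names (assumptions, not the real ones):
#   build_values_args ./helm sizing-example.yaml addon-example.yaml
```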

Glossary

values chain
The ordered set of Helm -f files loaded by deploy.sh: values.yaml → values-overrides.yaml → sizing file → addon files. Last file wins on conflicts.
sizing profile
Controls resource requests/limits and HPA settings. Set via sizing_profile in terraform.tfvars. Options: minimum, dev, production, production-large. Change by setting the flag and running make init-values && make deploy — no terraform apply needed.
enable_* flags
Boolean flags in terraform.tfvars that control which addon Helm values files init-values.sh generates (enable_deployments, enable_agent_builder, enable_insights, enable_polly). No terraform apply needed — they only affect Helm values.
langsmith-config-secret
Kubernetes Secret in the langsmith namespace holding 8 application keys pulled from Key Vault. Created by make k8s-secrets. The chart reads it via config.existingSecretName: langsmith-config-secret. Keys: api_key_salt, jwt_secret, langsmith_license_key, initial_org_admin_password, deployments_encryption_key, agent_builder_encryption_key, insights_encryption_key, polly_encryption_key.
Workload Identity (WI)
Combines the AKS OIDC issuer, an Azure Managed Identity, and federated credentials so that pods can access Azure Blob Storage without static credentials. No secrets in pods or env vars. All federated credentials are registered in modules/k8s-cluster/main.tf.
Fernet keys
Symmetric encryption keys used for Passes 3–5 data (deployments_encryption_key, agent_builder_encryption_key, insights_encryption_key, polly_encryption_key). Generated once by setup-env.sh and stored in Key Vault. Must never change after first use — changing any of them permanently corrupts the data they protect.
sslip.io
Free wildcard DNS service — <ip-with-dashes>.sslip.io resolves to the IP. Used for quick testing without a custom domain. No registration required. Example: NGINX IP 20.1.2.3 → hostname 20-1-2-3.sslip.io.
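
The sslip.io mapping is a plain string substitution, which the following sketch reproduces for the example address above. sslip_hostname is a hypothetical helper name.

```shell
# Hypothetical helper: turn an IPv4 address into its sslip.io hostname
# by replacing each dot with a dash.
sslip_hostname() {
  echo "${1//./-}.sslip.io"
}

sslip_hostname 20.1.2.3   # prints 20-1-2-3.sslip.io
```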

Known Issues


| # | Issue | Resolution |
| --- | --- | --- |
| 1 | Key Vault secrets already exist but are not in Terraform state | Import required |
| 2 | langsmith-backend-auth-bootstrap stuck in CreateContainerConfigError | Fix key name |
| 3 | Cannot roll back to an older chart version after DB migration | Roll forward only |
| 4 | Helm install times out: timed out waiting for the condition | Increase timeout |
| 5 | listener and operator pods never appear after Pass 3 | Add enabled: true |
| 6 | Duplicate top-level config: key silently drops values | Merge into one block |
| 7 | Encryption keys must not change after first deploy | Do not rotate |
| 8 | Pod panics: blob-storage health-check failed / AADSTS700213 | Missing federated credential |
| 9 | terraform destroy stalls on VNet/subnet deletion | Uninstall Helm first |
| 10 | 405 Not Allowed on prompts, datasets, and UI pages after upgrade to 0.13.26+ | Edit frontend ConfigMap |
| 11 | backend-ch-migrations job fails: secret "langsmith-postgres-secret" not found (in-cluster mode) | Create alias secrets |
| 12 | vCPU quota exceeded: autoscaler backoff or node pool rotation fails | Request quota increase |
| 13 | Istio addon revision not supported | Update istio_addon_revision |
| 14 | Key Vault purge protection cannot be disabled after enabling | Purge and recreate |
| 15 | Front Door returns 404, UI not loading (Istio + Front Door) | Fix originHostHeader |
| 16 | database "langsmith" does not exist: backend pods crashlooping | terraform apply |
| 17 | Polly shows 'Unable to connect to LangGraph server' / connects to localhost:8123 | Restart frontend or fix extraEnv |
| 18 | agent-builder-tool-server or polly in CrashLoopBackOff: child processes die silently | Debug pod |
| 19 | langsmith-agent-bootstrap hook times out on first Pass 3–5 deploy | Wait then re-deploy |
| 20 | listener pods OOMKilled: CrashLoopBackOff with dev sizing | Verify values chain |
| 21 | Stale HPA scales listener or host-backend to max replicas unexpectedly | Delete stale HPA |
| 29 | AGIC pod CrashLoopBackOff: persistent 403 errors on Application Gateway | Fixed in Terraform |
| 28 | DSv3 quota fully exhausted: switch to DSv2 family as fallback | Switch VM family |

Diagnostic Commands

bash
# Pod status
kubectl get pods -n langsmith
kubectl describe pod <pod-name> -n langsmith

# Logs
kubectl logs -n langsmith -l app=langsmith-backend --tail=100 -f
kubectl logs -n langsmith -l app=langsmith-platform-backend --tail=50

# Ingress + TLS
kubectl get ingress -n langsmith
kubectl get certificate -n langsmith
kubectl describe certificate -n langsmith

# Helm release status
helm list -n langsmith
helm get values langsmith -n langsmith
helm status langsmith -n langsmith
helm history langsmith -n langsmith

# Workload Identity check
kubectl get sa langsmith-ksa -n langsmith -o yaml | grep annotations -A3
kubectl exec -n langsmith deploy/langsmith-backend -- env | grep AZURE

# NGINX health probe
NGINX_IP=$(kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -s http://$NGINX_IP/nginx-health

# Key Vault — list all secrets (run from terraform/azure/)
az keyvault secret list --vault-name "$(terraform -chdir=infra output -raw keyvault_name)" -o table

# K8s secrets
kubectl get secrets -n langsmith | grep langsmith
kubectl get secret langsmith-config-secret -n langsmith -o jsonpath='{.data}' | python3 -m json.tool
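
To confirm that langsmith-config-secret carries all eight keys listed in the Glossary without printing their values, the key names can be compared against the expected list. This is a sketch, not a script from the repository: missing_secret_keys is a hypothetical helper that reads key names on stdin, and the kubectl invocation in the usage comment is one way to produce that list.

```shell
# Key names copied from the langsmith-config-secret glossary entry.
EXPECTED_KEYS="api_key_salt jwt_secret langsmith_license_key initial_org_admin_password deployments_encryption_key agent_builder_encryption_key insights_encryption_key polly_encryption_key"

# Hypothetical helper: read key names (one per line) on stdin and
# print every expected key that is absent.
missing_secret_keys() {
  local present key
  present=$(cat)
  for key in $EXPECTED_KEYS; do
    grep -qxF "$key" <<< "$present" || echo "$key"
  done
}

# Usage against a live cluster:
#   kubectl get secret langsmith-config-secret -n langsmith \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | missing_secret_keys
```

An empty result means every expected key is present; any printed name points to a key that make k8s-secrets did not create.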