LangSmith on AWS EKS

Quickstart

Get from zero to a running LangSmith instance on EKS in under an hour.

First time?
Run these commands in order. Each step builds on the previous. Return to the full guide below for configuration details, advanced options, and per-pass troubleshooting.
# 1 — Unzip the Terraform modules provided by your LangChain SA
unzip aws.zip
cd aws
# 2 — Generate terraform.tfvars interactively
#     Re-running is safe — Enter accepts current values
make quickstart
# 3 — Store secrets in SSM Parameter Store
source infra/scripts/setup-env.sh
# 4 — Deploy infrastructure (~20–25 min)
make init
make plan
make apply
# 5 — Configure kubectl
make kubeconfig

# Verify nodes are ready
kubectl get nodes
# 6 — Deploy LangSmith
cd ../helm
source scripts/init-values.sh
bash scripts/deploy.sh
# 7 — Get the endpoint
kubectl get svc -n langsmith
What gets deployed
Pass 1 creates the VPC, EKS cluster, RDS PostgreSQL, ElastiCache Redis, and S3 bucket. Pass 2 installs the LangSmith Helm chart. Passes 3–5 are optional add-ons (Deployments, Agent Builder, Insights).
Professional Services — AWS EKS

LangSmith on AWS
Self-hosted deployment on EKS, managed with Terraform.

| Pass | Component | Time | Scope | Status |
|---|---|---|---|---|
| 01 | Infrastructure | ~25 min | VPC, EKS cluster, RDS PostgreSQL, ElastiCache Redis, S3 bucket, IRSA, EKS addons | Required |
| 02 | LangSmith | ~10 min | LangSmith Helm chart: traces, prompts, evaluations, org management | Required |
| 03 | LangSmith Deployments | ~5 min | Deploy and manage LangGraph graphs from the LangSmith UI | Optional |
| 04 | Agent Builder | ~10 min | Build, test, and serve LangGraph-based agents (requires license entitlement) | Optional |
| 05 | Insights | ~5 min | AI-powered trace analysis and anomaly detection (requires external ClickHouse) | Optional |

Architecture

AWS resources created

| Resource | Type | Purpose |
|---|---|---|
| VPC | aws_vpc | Isolated network: 5 private, 3 public subnets across AZs |
| NAT Gateway | aws_nat_gateway | Outbound internet access for private subnets |
| EKS Cluster | aws_eks_cluster | Kubernetes: managed node groups with autoscaling |
| EBS CSI Driver | EKS addon | Persistent volume support |
| ALB Controller | EKS addon (Blueprints) | AWS Application Load Balancer ingress |
| Cluster Autoscaler | EKS addon (Blueprints) | Node autoscaling |
| RDS PostgreSQL | aws_db_instance | PostgreSQL 14: org config, run metadata, graph checkpoints |
| ElastiCache Redis | aws_elasticache_cluster | Redis 7.0: trace ingestion queue, pub/sub |
| S3 Bucket | aws_s3_bucket | Raw trace objects; VPC endpoint only access |
| S3 VPC Endpoint | aws_vpc_endpoint | Private S3 access without internet routing |
| IRSA Role | aws_iam_role | IAM Roles for Service Accounts: pod-level S3 access |
| GP3 Storage Class | Kubernetes | Default storage class with volume expansion |
| Network Firewall | aws_networkfirewall_firewall | FQDN-based egress filtering, opt-in (create_firewall = true). Inspects TLS SNI and HTTP Host headers; drops all traffic not in the domain allowlist. |

Prerequisites

Required tools

# AWS CLI v2
# macOS
brew install awscli
# Linux: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

# Terraform (>= 1.5)
brew tap hashicorp/tap && brew install hashicorp/tap/terraform

# kubectl
brew install kubectl

# Helm (>= 3.12)
brew install helm

# eksctl (useful for kubeconfig and debugging)
brew install eksctl

# Verify
aws --version
terraform version
kubectl version --client
helm version

Required AWS IAM permissions

The IAM user or role running Terraform needs the following managed policies (or equivalent inline policies):

| Policy | Purpose |
|---|---|
| AmazonEKSClusterPolicy | Create and manage EKS clusters |
| AmazonVPCFullAccess | Create VPC, subnets, route tables, NAT |
| AmazonRDSFullAccess | Create and manage RDS instances |
| AmazonElastiCacheFullAccess | Create ElastiCache clusters |
| AmazonS3FullAccess | Create S3 buckets and VPC endpoints |
| IAMFullAccess | Create IRSA roles and policies |

Authenticate and configure

# Configure AWS credentials
aws configure
# or use environment variables:
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-west-2

# Verify access
aws sts get-caller-identity
aws ec2 describe-availability-zones --query 'AvailabilityZones[].ZoneName' --output table

Repository Layout

terraform/aws/
├── infra/              ← Terraform root — run terraform from here
│   ├── main.tf         ← Wires all sub-modules, IRSA + ESO role setup
│   ├── variables.tf    ← All configurable inputs with defaults
│   ├── scripts/        ← setup-env.sh, set-kubeconfig.sh, preflight.sh, quickstart.sh, quickdeploy.sh, secrets-status.sh
│   └── modules/
│       ├── vpc/        ← VPC, subnets (5 private / 3 public), NAT, route tables
│       ├── eks/        ← EKS cluster, node groups, addons, IRSA role, GP3 storage class
│       ├── postgres/   ← RDS PostgreSQL, subnet group, security group, IAM auth
│       ├── redis/      ← ElastiCache Redis, subnet group, security group
│       ├── storage/    ← S3 bucket, VPC endpoint, bucket policy (VPC-only access)
│       ├── alb/        ← Pre-provisioned ALB (opt-in; ALB access logs opt-in)
│       ├── cloudtrail/ ← CloudTrail trail + S3 bucket (opt-in)
│       ├── waf/        ← WAFv2 Web ACL attached to ALB (opt-in)
│       ├── firewall/   ← AWS Network Firewall — FQDN egress filter (opt-in)
│       └── k8s-bootstrap/ ← Namespace, KEDA, cert-manager, ESO, Envoy Gateway (opt-in)
└── helm/
    ├── scripts/        ← init-values.sh, deploy.sh, apply-eso.sh, tls.sh, uninstall.sh
    └── values/
        ├── examples/   ← Reference templates (sizing, addons, Envoy Gateway, dataplane)
        │   ├── langsmith-values-ingress-envoy-gateway.yaml  ← Gateway API overlay
        │   ├── langsmith-values-dataplane.yaml              ← langgraph-dataplane chart values
        │   └── dataplane-rbac.yaml                          ← RBAC for dataplane namespace
        └── ...         ← Active base + overrides + sizing/addon files
Bring Your Own VPC
Set create_vpc = false and provide vpc_id, private_subnet_ids, and public_subnet_ids to use an existing VPC. The EKS and RDS modules will deploy into the provided subnets.

Configuration

Create a terraform.tfvars file in terraform/aws/infra/:

hcl
# Resource naming — all resources are named {name_prefix}-{environment}-*
name_prefix = "acme"          # short identifier, lowercase, no spaces
environment = "production"    # or "dev", "staging", etc.

# AWS region
region = "us-west-2"

# VPC — set create_vpc = false to use an existing VPC
create_vpc = true
# vpc_id             = "vpc-xxxxxxxx"        # if create_vpc = false
# private_subnet_ids = ["subnet-xx", ...]    # if create_vpc = false
# public_subnet_ids  = ["subnet-xx", ...]    # if create_vpc = false

# EKS
eks_cluster_version = "1.31"
eks_managed_node_groups = {
  default = {
    name           = "node-group-default"
    instance_types = ["m5.4xlarge"]
    min_size       = 3
    max_size       = 10
  }
}

# RDS PostgreSQL
postgres_instance_type = "db.t3.large"
postgres_storage_gb    = 10

# ElastiCache Redis
redis_instance_type = "cache.m6g.xlarge"

# Gateway mode — pick ONE (all false = ALB native, simplest)
# enable_nginx_ingress = true   # ALB → NGINX controller → pods
# enable_envoy_gateway = true   # ALB → Envoy proxy:10080 → HTTPRoutes (split dataplane)
# enable_istio_gateway = true   # ALB → Istio:80 → VirtualService (mTLS mesh)

# TLS: "none" (HTTP), "acm" (ACM cert on ALB), "letsencrypt" (cert-manager DNS-01)
tls_certificate_source = "none"
# langsmith_domain = "langsmith.example.com"  # auto-provisions Route 53 zone + ACM cert

# Sizing + addon flags
sizing_profile     = "production"
# enable_deployments   = true
# enable_agent_builder = true
# enable_insights      = true
# enable_polly         = true

Terraform state backend (recommended)

hcl
# backend.tf
terraform {
  backend "s3" {
    bucket = "your-terraform-state-bucket"
    key    = "langsmith/aws/terraform.tfstate"
    region = "us-west-2"
  }
}

Pass 1 — Required Infrastructure

What gets created
VPC, EKS cluster + addons, RDS PostgreSQL, ElastiCache Redis, S3 bucket + VPC endpoint, IRSA role
Duration
~20–25 minutes
Fast path
After running make quickstart and source infra/scripts/setup-env.sh once, use make quickdeploy to chain Pass 1 + Pass 2 in one command. It gates on secrets being loaded and terraform.tfvars existing, then runs: terraform apply → kubeconfig → init-values → helm deploy.
bash
cd terraform/aws

# Generate terraform.tfvars interactively (re-run safe — Enter accepts current values)
make quickstart

# Prompts for license key and admin password; auto-generates salt and JWT secret.
# Stores all secrets in SSM Parameter Store — sourced (not executed) to export TF_VAR_*
source infra/scripts/setup-env.sh

make init
make plan
make apply
Verify secrets status
After sourcing setup-env.sh, run make secrets to confirm all SSM parameters are set and TF_VAR_* variables are exported — before running make plan.
EKS first apply
EKS cluster creation takes 12–15 minutes. Node group creation and EKS addon installation (ALB controller, cluster-autoscaler) add another 3–5 minutes. Do not interrupt the apply.
source, not execute
Always run source infra/scripts/setup-env.sh, not ./infra/scripts/setup-env.sh. The script exports TF_VAR_postgres_password and TF_VAR_redis_auth_token into the calling shell. Executed directly, it runs in a child process, so the exports are lost when it exits and Terraform fails at plan time. Run make setup-env if you forget the exact command.
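A small guard can catch the unsourced case before make plan. The following is a minimal sketch, not part of the shipped scripts: require_exported is a hypothetical helper that checks each named variable is exported and non-empty in the current shell, which is the condition Terraform depends on.

```shell
# Hypothetical helper (not part of the repo): succeed only if every named
# variable is exported and non-empty in the current shell.
require_exported() {
  missing=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what Terraform reads.
    if [ -z "$(printenv "$name" 2>/dev/null)" ]; then
      echo "not exported: $name (was setup-env.sh sourced?)" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: stop early instead of failing inside terraform plan.
if require_exported TF_VAR_postgres_password TF_VAR_redis_auth_token; then
  echo "secrets loaded"
else
  echo "secrets not loaded" >&2
fi
```

The two variable names are the ones named above; extend the argument list if your setup exports more.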
Envoy Gateway (opt-in)
Set enable_envoy_gateway = true in terraform.tfvars before running make apply to install Envoy Gateway as an alternative to the ALB ingress controller. This is required for multi-namespace dataplane (LangSmith Deployments) setups where agent pods run in a separate namespace. When enabled, Pass 2 must use the langsmith-values-ingress-envoy-gateway.yaml overlay instead of the standard ALB ingress values.

After apply — configure kubectl

bash
# Sets kubeconfig to the EKS cluster
make kubeconfig

# Verify cluster access
kubectl get nodes
kubectl get pods -n kube-system
Post-infra check
Run make preflight-post after make apply to verify the cluster is reachable, all SSM parameters are present, and Helm values files exist before starting Pass 2.

Pass 1 summary
Provisions: VPC, EKS cluster, RDS PostgreSQL, ElastiCache Redis, S3 bucket + VPC endpoint, ALB, IRSA role, ESO IRSA role, SSM secrets.

Pass 2 — Required LangSmith

What gets created
LangSmith Helm release — API server, backend workers, frontend, ClickHouse, ESO secret sync
Duration
~8–12 minutes
In-cluster ClickHouse is for dev/POC only
The default deployment runs ClickHouse as a single in-cluster pod with no replication or backups. For production deployments, use LangChain Managed ClickHouse.
bash
cd terraform/aws

# Reads Terraform outputs and generates Helm values with
# RDS endpoint, Redis endpoint, S3 bucket, IRSA role ARN, and regional S3 API URL
make init-values

# Applies ESO ClusterSecretStore + ExternalSecret (syncs SSM → K8s secret),
# then runs helm upgrade --install
make deploy
Two-pass ALB hostname
On first deploy, config.hostname is blank. After the Helm release completes, deploy.sh reads the ALB hostname from the ingress and automatically writes it into langsmith-values-{env}.yaml, then runs a second Helm upgrade to lock in the hostname. This is expected — not an error.
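To watch the hostname appear between the two Helm runs, a retry loop around kubectl is enough. This is a hedged sketch: wait_for_output is a hypothetical helper, and the commented kubectl call assumes the first ingress in the langsmith namespace is the one that carries the ALB hostname.

```shell
# Hypothetical helper: retry a command until it prints something, then echo it.
# Args: <attempts> <delay-seconds> <command...>
wait_for_output() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Treat a failing command the same as empty output and try again.
    out=$("$@" 2>/dev/null) || out=""
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (needs a live cluster, so it is left commented out):
# wait_for_output 30 10 kubectl get ingress -n langsmith \
#   -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}'
```

deploy.sh already performs this lookup itself; the loop is only useful for checking progress by hand.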

Resource sizing

Set sizing_profile in infra/terraform.tfvars, then re-run make init-values to copy the matching values file:

hcl
# infra/terraform.tfvars
# Set exactly one value; the alternatives are shown commented out.
sizing_profile = "production"           # multi-replica with HPA (recommended)
# sizing_profile = "production-large"   # high-volume (~50 users, ~1000 traces/sec)
# sizing_profile = "dev"                # single-replica, minimal resources (dev/CI/demos)
# sizing_profile = "default"            # chart defaults (no sizing overlay applied)
bash
# Re-generate values after changing sizing_profile:
make init-values && make deploy

Verify

bash
kubectl get pods -n langsmith
kubectl get ingress -n langsmith

Fast path: single-command deploy
If make quickstart and source infra/scripts/setup-env.sh have already been run, make quickdeploy chains all of Pass 1 and Pass 2 in one command.

Pass 3 — Optional LangSmith Deployments

What gets created
Deployments UI, operator, host-backend, listener — deploy and manage LangGraph graphs from the LangSmith UI
Duration
~5 minutes
Prerequisites
config.hostname must be set in langsmith-values-{env}.yaml before enabling Deployments. The operator uses the hostname to construct agent endpoint URLs. If it is blank, deployed graphs will never reach RUNNING.
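The hostname prerequisite can be checked before enabling the flag. The sketch below is an assumption-laden convenience, not a shipped script: has_hostname is a hypothetical function, it relies on plain grep rather than a YAML parser, and it assumes the hostname key sits on a single indented line.

```shell
# Hypothetical pre-check: succeed if the values file has a non-empty,
# indented "hostname:" entry. Plain grep, so any indented hostname key
# in the file will satisfy it, not only the one under "config:".
has_hostname() {
  grep -Eq "^[[:space:]]+hostname:[[:space:]]*[\"']?[A-Za-z0-9]" "$1"
}

# Usage (path is a placeholder for your langsmith-values-{env}.yaml):
# has_hostname path/to/langsmith-values-dev.yaml \
#   || echo "config.hostname is blank; finish Pass 2 first" >&2
```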

Set the flag in infra/terraform.tfvars and re-run init-values:

hcl
# infra/terraform.tfvars
enable_deployments = true
TLS setting is automatic
init-values.sh reads your tls_certificate_source from terraform.tfvars and sets config.deployment.tlsEnabled accordingly. You do not need to edit the values file manually.
bash
cd terraform/aws
make init-values   # copies langsmith-values-agent-deploys.yaml with correct TLS setting
make deploy

KEDA is installed during Pass 1 by the k8s-bootstrap Terraform module — no manual KEDA install is needed.

Pass 4 — Optional Agent Builder

What gets created
Agent Builder runtime, tool server, trigger server, bootstrap job — build and serve LangGraph-based agents from the UI
Duration
~10 minutes
Prerequisites
Pass 3 (Deployments) must be active — langsmith-values-agent-deploys.yaml must be present. Agent Builder requires config.deployment.enabled: true. Agent Builder also requires a license entitlement — contact LangChain if this feature is not visible in your UI.

1. Generate and store the encryption key

The Agent Builder encryption key is generated once and stored in SSM Parameter Store. It is pulled into the cluster by ESO automatically when the values file is present.

bash
# Generate a Fernet key
KEY=$(python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")

# Store in SSM — replace <name_prefix> and <environment> with your terraform.tfvars values
aws ssm put-parameter \
  --region <region> \
  --name "/langsmith/<name_prefix>-<environment>/agent-builder-encryption-key" \
  --value "$KEY" \
  --type SecureString
Do not rotate this key
The encryption key protects stored Agent Builder state. Changing it after first deploy permanently corrupts existing agent data. Generate once and never rotate.
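Because the key cannot be replaced later, it may be worth confirming its shape before storing it. The following is a sketch under stated assumptions: is_fernet_key is a hypothetical helper that only checks the format (32 bytes in URL-safe base64), and the openssl line is an optional alternative to the Python one-liner that assumes openssl is installed.

```shell
# Hypothetical check: a Fernet key is 32 random bytes in URL-safe base64,
# i.e. 43 characters from [A-Za-z0-9_-] followed by a single '='.
is_fernet_key() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9_-]{43}=$'
}

# Optional alternative generator (assumes openssl is available):
# 32 random bytes, base64-encoded, with '+/' mapped to the URL-safe '-_'.
# KEY=$(openssl rand -base64 32 | tr '+/' '-_')

# Usage before calling aws ssm put-parameter:
# is_fernet_key "$KEY" || { echo "KEY is not a valid Fernet key" >&2; exit 1; }
```

As written in step 1, aws ssm put-parameter is called without --overwrite, so it refuses to replace a parameter that already exists. Keeping it that way gives some protection against accidental rotation.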

2. Enable Agent Builder

Set the flag in infra/terraform.tfvars and re-run init-values:

hcl
# infra/terraform.tfvars
enable_agent_builder = true   # requires enable_deployments = true
bash
cd terraform/aws
make init-values   # copies langsmith-values-agent-builder.yaml
make deploy

Verify

bash
# Agent Builder pods (appear after bootstrap job completes — ~5 min)
kubectl get pods -n langsmith | grep -E "agent-builder|lg-"

# Bootstrap job status
kubectl get jobs -n langsmith | grep bootstrap

Pass 5 — Optional Insights

What gets created
AI-powered trace analysis, anomaly detection, LLM-as-judge evaluation at scale
Duration
~5 minutes
External ClickHouse required
Insights requires an external ClickHouse instance — in-cluster ClickHouse is not supported. Use AWS Marketplace (ClickHouse managed service), a LangChain-managed ClickHouse instance, or a self-hosted ClickHouse cluster reachable from the EKS VPC.

1. Generate and store the encryption key

bash
KEY=$(python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")

aws ssm put-parameter \
  --region <region> \
  --name "/langsmith/<name_prefix>-<environment>/insights-encryption-key" \
  --value "$KEY" \
  --type SecureString
Do not rotate this key
Changing the Insights encryption key after first deploy permanently corrupts existing insights data.

2. Enable Insights

Set the flag and ClickHouse connection details in infra/terraform.tfvars:

hcl
# infra/terraform.tfvars
enable_insights   = true
clickhouse_source = "external"

# Fill in after obtaining your ClickHouse credentials:
# clickhouse_host = "<clickhouse-hostname>"
# clickhouse_port = 9440   # native protocol (TLS)
# clickhouse_tls  = true
bash
cd terraform/aws
make init-values   # copies langsmith-values-insights.yaml with your ClickHouse config
make deploy

Verify

bash
# Insights deploys a Clio pod on first invocation from the UI
kubectl get pods -n langsmith | grep clio

Architecture Overview

Ingress / Gateway modes

Four mutually exclusive gateway modes are supported. Set at most one flag in infra/terraform.tfvars; with no flag set, ALB native is used:

| Mode | Variable | Traffic path | TLS |
|---|---|---|---|
| ALB native (default) | none | ALB → frontend NodePort | ACM or Let's Encrypt HTTP-01 |
| NGINX | enable_nginx_ingress = true | ALB → TGB → NGINX:80 → ClusterIP | ACM (terminates at ALB) |
| Envoy Gateway | enable_envoy_gateway = true | ALB → TGB → Envoy proxy:10080 → HTTPRoute | ACM (terminates at ALB) |
| Istio | enable_istio_gateway = true | ALB → TGB → Istio:80 → VirtualService | cert-manager DNS-01 (in-cluster) |
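Mutual exclusivity is easy to lint before make apply. This is an illustrative sketch, not something the repository is known to ship: count_gateway_flags and check_gateway_flags are hypothetical helpers that grep a tfvars file for uncommented gateway flags set to true.

```shell
# Hypothetical lint: count gateway flags set to true in a tfvars file.
# Commented-out lines are ignored because the pattern is anchored at line start.
count_gateway_flags() {
  grep -Ec '^[[:space:]]*enable_(nginx_ingress|envoy_gateway|istio_gateway)[[:space:]]*=[[:space:]]*true' "$1" || true
}

# Fail when more than one gateway mode is enabled.
check_gateway_flags() {
  n=$(count_gateway_flags "$1")
  if [ "$n" -gt 1 ]; then
    echo "error: $n gateway flags enabled; set at most one" >&2
    return 1
  fi
}

# Usage: check_gateway_flags infra/terraform.tfvars
```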

IRSA — IAM Roles for Service Accounts

EKS pods access S3 using IRSA — the Kubernetes service account is annotated with an IAM role ARN. AWS injects temporary credentials via the Pod Identity Webhook. No static AWS credentials are stored in Kubernetes secrets.

S3 VPC endpoint

A Gateway VPC Endpoint for S3 is created and associated with all private route tables. Trace data written to S3 never traverses the public internet — traffic routes directly from the EKS nodes to S3 within the AWS network.

Network Firewall (opt-in)

Setting create_firewall = true deploys an AWS Network Firewall between the private subnets and the NAT gateway. All outbound internet traffic is inspected using a domain allowlist — only FQDNs in firewall_allowed_fqdns are permitted; everything else is dropped. The firewall inspects TLS SNI for HTTPS and the HTTP Host header for plaintext traffic. Internal VPC traffic (pod-to-pod, pod-to-RDS, pod-to-ElastiCache) routes via the local VPC route and bypasses the firewall entirely.

Requires create_vpc = true. Cost: ~$0.395/hr per endpoint + $0.065/GB data processed.
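The quoted rates translate into a monthly figure with simple arithmetic. The sketch below uses only the two rates stated above; firewall_monthly_cost is a hypothetical helper, and the 720-hour month and the endpoint counts are assumptions for illustration.

```shell
# Hypothetical estimator using the rates quoted above.
# Args: <endpoints> <gb-processed-per-month> [hours-per-month, default 720]
firewall_monthly_cost() {
  awk -v n="$1" -v gb="$2" -v h="${3:-720}" \
    'BEGIN { printf "%.2f\n", n * h * 0.395 + gb * 0.065 }'
}

firewall_monthly_cost 1 0      # one endpoint, no traffic: prints 284.40
firewall_monthly_cost 3 500    # three endpoints, 500 GB: prints 885.70
```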

Variable Reference

| Variable | Default | Description |
|---|---|---|
| name_prefix | required | Short identifier (max 15 chars, lowercase). All resources are named {name_prefix}-{environment}-* |
| environment | required | Environment label (production, dev, etc.); part of all resource names |
| region | us-west-2 | AWS region for all resources |
| create_vpc | true | Create a new VPC. Set false to use existing. |
| vpc_id | "" | Existing VPC ID (if create_vpc = false) |
| private_subnet_ids | [] | Existing private subnet IDs (if create_vpc = false) |
| public_subnet_ids | [] | Existing public subnet IDs (if create_vpc = false) |
| eks_cluster_version | 1.31 | EKS Kubernetes version |
| eks_managed_node_groups | {default: m5.4xlarge} | Managed node group definitions (instance type, min/max size) |
| enable_public_eks_cluster | true | Enable public EKS API endpoint. Set false for private (requires bastion). |
| create_langsmith_irsa_role | true | Create IRSA role for LangSmith pods (S3 access) |
| postgres_instance_type | db.t3.large | RDS instance class |
| postgres_storage_gb | 10 | Initial RDS storage in GB (autoscales to max 100 GB) |
| postgres_iam_database_authentication_enabled | true | Enable IAM database authentication for RDS |
| redis_instance_type | cache.m6g.xlarge | ElastiCache node type |
| sizing_profile | default | Helm sizing: production, production-large, dev, minimum, default |
| enable_deployments | false | Enable LangSmith Deployments: listener, operator, host-backend (Pass 3) |
| enable_agent_builder | false | Enable Agent Builder UI (requires enable_deployments) |
| enable_insights | false | Enable ClickHouse-backed analytics (requires external ClickHouse) |
| enable_polly | false | Enable Polly AI eval/monitoring (requires enable_deployments + Polly entitlement) |
| alb_access_logs_enabled | false | Enable ALB access logging to a dedicated S3 bucket. Useful for traffic analysis and compliance. |
| create_cloudtrail | false | Create a CloudTrail trail logging all AWS API calls to S3. Skip if an account-level or org-level trail already exists. |
| cloudtrail_multi_region | true | Record API calls across all regions. Recommended: single-region trails miss global service events. |
| cloudtrail_log_retention_days | 365 | Days to retain CloudTrail logs in S3. Set 0 to keep indefinitely. |
| create_waf | false | Attach a WAFv2 Web ACL to the ALB. Includes AWS managed rules for OWASP Top 10, IP reputation, and known bad inputs. Cost: ~$8–10/mo base. |
| create_firewall | false | Deploy AWS Network Firewall for FQDN-based egress filtering. Intercepts all outbound internet traffic from private subnets and drops everything not in firewall_allowed_fqdns. Requires create_vpc = true. Cost: ~$0.395/hr/endpoint + $0.065/GB processed (~$285/mo base). |
| firewall_allowed_fqdns | ["beacon.langchain.com"] | Domains allowed for outbound internet traffic when create_firewall = true. Matched against TLS SNI (HTTPS) and HTTP Host header. All other destinations are dropped. Add entries for LangChain Managed ClickHouse, model providers, or package registries as needed. |
| firewall_subnet_cidr | 10.0.64.0/21 | CIDR for the firewall subnet. Must be within the VPC CIDR and must not overlap with the default private (10.0.0.0/21–10.0.32.0/21) or public (10.0.40.0/21–10.0.56.0/21) subnets. |
| enable_nginx_ingress | false | Install NGINX ingress-nginx controller. ALB TGB wires the pre-provisioned ALB target group to the NGINX controller pods. Mutually exclusive with Envoy and Istio. |
| enable_envoy_gateway | false | Install Envoy Gateway (Gateway API) as an alternative to ALB ingress. Required for multi-namespace dataplane deployments. Installs gateway-helm v1.3.0 and creates GatewayClass eg + Gateway langsmith-gateway. ALB TGB targets Envoy proxy pods on port 10080. Supports cross-namespace HTTPRoute routing (split dataplane). Mutually exclusive with NGINX and Istio. |
| enable_istio_gateway | false | Configure Istio gateway resources (requires Istio installed separately). ALB TGB targets istio-ingressgateway on port 80 (Istio 1.23+ uses NET_BIND_SERVICE). Mutually exclusive with NGINX and Envoy. |
| istio_nlb_scheme | "internet-facing" | Scheme for the Istio ingress gateway NLB. "internet-facing" for public access, "internal" for VPC-only. Only used when enable_istio_gateway = true. |
| tls_certificate_source | none | TLS mode: none (HTTP only), acm (ACM certificate on ALB; works with ALB and NGINX), letsencrypt (cert-manager DNS-01 via Route 53; for Istio/Envoy in-cluster TLS). |
| langsmith_domain | "" | Custom domain (e.g. langsmith.example.com). When set with tls_certificate_source = "acm" and no acm_certificate_arn, Terraform creates a Route 53 hosted zone, ACM certificate, and DNS alias automatically. |
| postgres_source | external | PostgreSQL backend: external (RDS, recommended) or in-cluster (Helm-managed single pod, dev/POC only). |
| redis_source | external | Redis backend: external (ElastiCache, recommended) or in-cluster (Helm-managed, no auth, dev/POC only). |
| clickhouse_source | in-cluster | ClickHouse backend: in-cluster (single StatefulSet pod; dev/POC only, no replication or backups) or external (required for Insights/Pass 5). |
| langsmith_deployments_encryption_key | auto-generated | Fernet key for LangSmith Deployments (Pass 3). Generated by setup-env.sh on first run. Must stay stable: rotating it after deploy invalidates all deployment state. |
| langsmith_agent_builder_encryption_key | manual | Fernet key for Agent Builder (Pass 4). Generate with python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())" and store in SSM at /langsmith/{name_prefix}-{environment}/agent-builder-encryption-key. |
| langsmith_insights_encryption_key | manual | Fernet key for Insights (Pass 5). Same generation method as Agent Builder. Store in SSM at /langsmith/{name_prefix}-{environment}/insights-encryption-key. Changing this key permanently corrupts existing insights data. |
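The non-overlap rule for firewall_subnet_cidr can be checked with integer arithmetic on the address ranges. This is a minimal sketch for IPv4 only; ip_to_int, cidr_bounds, and cidrs_overlap are hypothetical helpers, and the subnet list in the usage loop is a sample drawn from the default ranges quoted in the table.

```shell
# Hypothetical helpers for checking firewall_subnet_cidr against other subnets.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print "first last" integer addresses covered by a CIDR block.
cidr_bounds() {
  size=$(( 1 << (32 - ${1#*/}) ))
  first=$(( $(ip_to_int "${1%/*}") / size * size ))
  echo "$first $(( first + size - 1 ))"
}

# Succeed (exit 0) if the two CIDR blocks share any address.
cidrs_overlap() {
  set -- $(cidr_bounds "$1") $(cidr_bounds "$2")
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# Usage: compare the firewall subnet with a sample of the default subnets.
for other in 10.0.0.0/21 10.0.32.0/21 10.0.40.0/21 10.0.56.0/21; do
  if cidrs_overlap 10.0.64.0/21 "$other"; then
    echo "overlap with $other"
  fi
done
```

With the default value the loop prints nothing, since 10.0.64.0/21 starts immediately after the last public block.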

Variable Reference

| Variable | Default | Required | Description |
|---|---|---|---|
| name_prefix | | yes | Prefix for all resource names (1–11 chars, lowercase) |
| environment | dev | no | Environment: dev, staging, prod, test, uat |
| region | us-west-2 | no | AWS region for all resources |
| create_vpc | true | no | Create a new VPC (set false to use existing) |
| vpc_id | null | when !create_vpc | Existing VPC ID |
| private_subnets | [] | when !create_vpc | Existing private subnet IDs |
| public_subnets | [] | when !create_vpc | Existing public subnet IDs |
| vpc_cidr_block | null | when !create_vpc | Existing VPC CIDR block |
| enable_public_eks_cluster | true | no | Enable public EKS API endpoint |
| eks_public_access_cidrs | ["0.0.0.0/0"] | no | CIDRs allowed to reach the public EKS API endpoint |
| eks_cluster_version | 1.31 | no | EKS Kubernetes version |
| eks_managed_node_group_defaults | {ami_type: AL2023} | no | Default config for managed node groups |
| eks_managed_node_groups | {default: m5.4xlarge} | no | Managed node group definitions |
| create_gp3_storage_class | true | no | Create and set gp3 as default StorageClass |
| eks_cluster_enabled_log_types | ["api", "audit", ...] | no | EKS control plane log types (CloudWatch) |
| eks_addons | {} | no | EKS managed add-on configurations |
| create_langsmith_irsa_role | true | no | Create IRSA role for LangSmith pods (S3 access) |
| postgres_source | external | no | external (RDS) or in-cluster (Helm) |
| postgres_instance_type | db.t3.large | no | RDS instance class |
| postgres_storage_gb | 10 | no | Initial RDS storage in GB |
| postgres_max_storage_gb | 100 | no | Maximum RDS storage in GB (autoscaling) |
| postgres_username | langsmith | no | RDS database username |
| postgres_engine_version | 16 | no | PostgreSQL engine version for RDS |
| postgres_password | "" | when external | RDS password; use TF_VAR_postgres_password |
| postgres_iam_database_authentication_enabled | true | no | Enable IAM database authentication on RDS |
| postgres_deletion_protection | true | no | Enable deletion protection on RDS |
| postgres_backup_retention_period | 7 | no | Days to retain automated RDS backups (0 = disabled) |
| redis_source | external | no | external (ElastiCache) or in-cluster (Helm) |
| redis_instance_type | cache.m6g.xlarge | no | ElastiCache node type |
| redis_auth_token | "" | when external | ElastiCache auth token (min 16 chars); use TF_VAR_redis_auth_token |
| s3_ttl_enabled | true | no | Enable S3 lifecycle rules for trace TTL |
| s3_ttl_short_days | 14 | no | TTL for ttl_s/ prefix in days |
| s3_ttl_long_days | 400 | no | TTL for ttl_l/ prefix in days |
| s3_kms_key_arn | "" | no | KMS CMK ARN for S3 encryption (empty = SSE-S3) |
| s3_versioning_enabled | false | no | Enable S3 bucket versioning |
| tls_certificate_source | acm | no | acm, letsencrypt, or none |
| acm_certificate_arn | "" | when acm | ACM certificate ARN |
| letsencrypt_email | "" | when letsencrypt | Email for Let's Encrypt |
| langsmith_domain | "" | no | Custom hostname (empty = use ALB DNS name) |
| langsmith_namespace | langsmith | no | Kubernetes namespace for LangSmith |
| clickhouse_source | in-cluster | no | in-cluster or external |
| alb_scheme | internet-facing | no | ALB scheme: internet-facing or internal |
| alb_access_logs_enabled | false | no | Enable ALB access logging to S3 |
| create_bastion | false | no | Create EC2 bastion host for private cluster access (SSM or SSH) |
| bastion_instance_type | t3.micro | no | EC2 instance type for bastion |
| bastion_key_name | null | no | EC2 key pair for SSH (empty = SSM only) |
| bastion_enable_ssh | false | no | Open port 22 on bastion security group |
| bastion_ssh_allowed_cidrs | [] | no | CIDRs allowed to SSH to bastion |
| bastion_root_volume_size_gb | 20 | no | Root EBS volume size for bastion |
| create_cloudtrail | false | no | Create CloudTrail trail for AWS API audit |
| cloudtrail_multi_region | true | no | Record API calls across all regions |
| cloudtrail_log_retention_days | 365 | no | Days to retain CloudTrail logs |
| create_waf | false | no | Attach WAFv2 Web ACL to ALB |
| create_firewall | false | no | Deploy AWS Network Firewall for FQDN-based egress filtering. Requires create_vpc = true. Cost: ~$0.395/hr/endpoint + $0.065/GB. |
| firewall_allowed_fqdns | ["beacon.langchain.com"] | no | Domains allowed for outbound internet traffic when create_firewall = true. Matched against TLS SNI (HTTPS) and HTTP Host header. All other destinations are dropped. |
| firewall_subnet_cidr | "10.0.64.0/21" | no | CIDR for the firewall subnet. Must not overlap with private (10.0.0.0/21–10.0.32.0/21) or public (10.0.40.0/21–10.0.56.0/21) subnets. |
| sizing_profile | default | no | Helm sizing: production, production-large, dev, minimum, default |
| enable_deployments | false | no | Enable LangGraph Platform (listener, operator, host-backend) |
| enable_agent_builder | false | no | Enable Agent Builder (requires enable_deployments) |
| enable_insights | false | no | Enable ClickHouse-backed analytics |
| enable_polly | false | no | Enable Polly AI eval/monitoring (requires enable_deployments) |
| enable_usage_telemetry | false | no | Enable extended usage telemetry reporting |
| langsmith_deployments_encryption_key | "" | no | Fernet key for LangSmith Deployments |
| langsmith_agent_builder_encryption_key | "" | no | Fernet key for Agent Builder |
| langsmith_insights_encryption_key | "" | no | Fernet key for Insights |
| owner | "" | no | Owner tag applied to all resources |
| cost_center | "" | no | Cost center tag for billing |
| tags | {} | no | Additional tags applied to all resources |

Quick Reference

First-time setup

bash
cd terraform/aws

make quickstart              # generates terraform.tfvars interactively
source infra/scripts/setup-env.sh   # creates SSM secrets + exports TF_VAR_* (must be sourced)
make init && make plan && make apply   # ~20–25 min
make kubeconfig
make init-values
make deploy

Day-2 operations

bash
make status            # full deployment health check
make deploy            # re-deploy after changing Helm values or upgrading chart
make init-values       # re-generate values after Terraform changes
make apply-eso         # re-sync ESO secrets without redeploying
make ssm               # manage SSM Parameter Store secrets interactively
make kubeconfig        # refresh cluster credentials

5-pass summary

| Pass | What | Command |
|---|---|---|
| 1 | VPC + EKS + RDS + ElastiCache + S3 + IRSA + ESO + cert-manager + KEDA | make apply |
| 1.5 | Cluster credentials + SSM secrets | make kubeconfig |
| 2 | LangSmith Helm | make init-values && make deploy |
| 3 | + LangSmith Deployments (enable_deployments = true) | make init-values && make deploy |
| 4 | + Agent Builder (enable_agent_builder = true) | make init-values && make deploy |
| 5 | + Insights (enable_insights = true) | make init-values && make deploy |

Enable optional addons

hcl
# infra/terraform.tfvars — set flags, then: make init-values && make deploy
enable_deployments   = true   # required for Agent Builder and Polly
enable_agent_builder = true   # requires enable_deployments = true
enable_insights      = true
enable_polly         = true   # requires enable_deployments = true + Polly entitlement
enable_usage_telemetry = false  # extended usage telemetry (optional)

Gateway mode

Set at most one gateway flag; they are mutually exclusive. With none set, ALB native is used (default, simplest).

hcl
# infra/terraform.tfvars: uncomment at most ONE
# enable_nginx_ingress = true    # ALB → NGINX → pods (ACM TLS at ALB)
# enable_envoy_gateway = true    # ALB → Envoy proxy:10080 → HTTPRoutes (split dataplane)
# enable_istio_gateway = true    # ALB → Istio:80 → VirtualService (mTLS, DNS-01 TLS)
bash
make quickstart      # wizard adds gateway selection (Section 6)
make apply           # re-installs gateway controller via k8s-bootstrap
make init-values     # regenerates values overlay for new gateway mode
make deploy

Security add-ons

hcl
# infra/terraform.tfvars — all opt-in, all default false:
alb_access_logs_enabled = true    # ALB traffic logs → S3
create_cloudtrail       = true    # AWS API audit trail
create_waf              = true    # WAFv2 on ALB (~$10/mo)
create_firewall         = true    # AWS Network Firewall FQDN egress (~$0.40/hr)
create_bastion          = true    # SSM/SSH bastion for private EKS access

Glossary

| Term | Meaning |
|---|---|
| values chain | deploy.sh loads Helm values files in order: base → overrides → sizing → addons. The last file wins on conflicts. |
| sizing profile | A pre-built Helm values file that sets resources, replicaCount, and HPA settings for all LangSmith components. Set via sizing_profile in terraform.tfvars, applied by make init-values. |
| enable_* flags | Boolean flags in infra/terraform.tfvars that tell init-values.sh which addon values files to copy from examples/. No terraform apply needed, just make init-values && make deploy. |
| IRSA | IAM Roles for Service Accounts. EKS pods access S3 using temporary credentials injected by the Pod Identity Webhook. No static AWS credentials in K8s secrets. |
| ESO | External Secrets Operator. Syncs SSM Parameter Store secrets into a langsmith-config Kubernetes secret. deploy.sh applies the ClusterSecretStore and ExternalSecret before running Helm. |
| Fernet keys | Symmetric encryption keys for LangSmith Deployments, Agent Builder, Insights, and Polly, stored in SSM. The Deployments key is generated by setup-env.sh; the Agent Builder and Insights keys are generated manually in Passes 4 and 5. Never rotate after first deploy: changing them permanently corrupts existing encrypted data. |
| values-overrides.yaml | The live, gitignored file generated by init-values.sh. Contains your ALB hostname, RDS endpoint, Redis endpoint, S3 bucket, and IRSA role ARN. Do not edit directly; re-run make init-values. |
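The values chain maps onto Helm's own rule that the right-most -f file takes priority. The sketch below only illustrates that ordering and is not how deploy.sh is known to be written: values_args is a hypothetical helper, and the file names, release, and chart in the usage comment are placeholders.

```shell
# Hypothetical helper: turn an ordered list of values files into helm -f
# arguments, skipping files that do not exist (e.g. an addon not enabled).
# Paths containing spaces are not supported by this simple version.
values_args() {
  args=""
  for f in "$@"; do
    if [ -f "$f" ]; then
      args="$args -f $f"
    fi
  done
  printf '%s\n' "${args# }"
}

# Usage (placeholders throughout): later files override earlier ones.
# helm upgrade --install <release> <chart> -n langsmith \
#   $(values_args base.yaml overrides.yaml sizing.yaml addons.yaml)
```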

Troubleshooting

Full reference
See the Troubleshooting guide for the complete issue list and diagnostic commands.
| # | Issue | Status |
|---|---|---|
| 1 | terraform apply fails: EKS cluster not ready for node group | manual |
| 2 | kubectl commands fail: error: You must be logged in to the server | manual |
| 3 | ALB not created after Helm install | manual |
| 4 | RDS connection refused from EKS pods | manual |
| 5 | S3 access denied from pods | manual |
| 6 | EKS nodes not autoscaling | manual |
| 7 | ElastiCache Redis connection timeout | manual |
| 8 | Pods in CreateContainerConfigError after deploy | manual |
| 9 | Helm release fails: context deadline exceeded | manual |
| 10 | Helm upgrade fails: conflict with "manager" using networking.k8s.io/v1 | fixed |
| 11 | Ingress deleted: new ALB provisioned with a different hostname | manual |
| 12 | Agent Builder deployment stuck in DEPLOYING (never reaches RUNNING) | manual |
| 13 | In-cluster Redis AUTH error: AUTH called without any password configured | fixed |
| 14 | Namespace stuck in Terminating after teardown (KEDA stale API group) | manual |
| 15 | ACM certificate stuck in PENDING_VALIDATION with CAA_ERROR | fixed |
| 16 | Envoy Gateway: Could not find Envoy proxy service for Gateway (expected race) | manual |
| 17 | Istio ALB health check failing: wrong target port (8080 vs 80) | fixed |
| 18 | ESO helm release timeout on terraform destroy (pre-deleted namespace) | manual |