LangSmith on AWS — Architecture
Platform Layers
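LangSmith on AWS is deployed in three passes. The diagram below stacks them with Pass 1 (AWS infrastructure) at the bottom and Pass 3 (LangSmith Deployments) at the top.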
┌──────────────────────────────────────────────────────────────────────────────┐
│ Pass 3 — LangSmith Deployments (enable_langsmith_deployments = true) │
│ │
│ Purpose: Deploy and manage LangGraph applications from the LangSmith UI. │
│ │
│ Adds to cluster: │
│ • host-backend — deployment lifecycle API │
│ • listener — syncs desired state into Kubernetes │
│ • operator — controls LGP CRD and manages rollouts │
│ │
│ Per deployed graph: │
│ • api-server, queue, redis, postgres (operator-managed) │
│ │
│ Requires: KEDA (installed in Pass 1 via k8s-bootstrap module) │
├──────────────────────────────────────────────────────────────────────────────┤
│ Pass 2 — LangSmith Base Platform (deploy_langsmith = true) │
│ │
│ Purpose: Observability, tracing, evaluations, experiments, API keys. │
│ │
│ Components (Helm chart, namespace: langsmith): │
│ • backend — core API server │
│ • frontend — React UI │
│ • playground — LLM prompt playground │
│ • queue — background job worker │
│ • clickhouse — trace analytics store │
│ • redis — task queue (in-cluster or ElastiCache) │
│ • postgres — metadata store (in-cluster or RDS) │
│ │
│ Storage: RDS PostgreSQL → metadata / S3 → trace blobs (VPC endpoint) │
│ Ingress: AWS ALB | NGINX | Envoy Gateway | Istio (see Ingress Options) │
├──────────────────────────────────────────────────────────────────────────────┤
│ Pass 1 — AWS Infrastructure │
│ │
│ Networking: VPC + private/public subnets + single NAT gateway │
│ Compute: EKS cluster + managed node group + cluster autoscaler │
│ Database: RDS PostgreSQL (db.t3.large, private subnets) │
│ Cache: ElastiCache Redis (cache.m6g.xlarge, private subnets) │
│ Storage: S3 bucket (VPC Gateway Endpoint — no public internet) │
│ Add-ons: ALB controller + EBS CSI driver + metrics server │
│ Bootstrap: k8s-bootstrap (KEDA, ESO, Envoy Gateway [opt-in]), │
│ cert-manager (standalone, active when tls=letsencrypt) │
│ Ingress: ALB (default) or Envoy Gateway (opt-in, enable_envoy_gateway) │
│ Opt-in: Network Firewall (FQDN-based egress filtering) │
└──────────────────────────────────────────────────────────────────────────────┘
Component → Storage Mapping
| Component | Storage backend | Access method |
|---|---|---|
| backend | RDS PostgreSQL | Private subnet, security group |
| backend | S3 bucket | IRSA + VPC Gateway Endpoint |
| clickhouse | EBS volume (GP3, EKS PVC) | Local |
| redis | ElastiCache or in-cluster | Private subnet, security group |
| LGP operator | RDS PostgreSQL (shared) | Private subnet, security group |
Network Topology
Default (ALB ingress):
Internet
│
▼
AWS Application Load Balancer (ALB — port 80 or 443)
│ TLS via ACM / Let's Encrypt (optional)
▼
EKS Cluster (private subnets)
├── kube-system namespace
│ ├── aws-load-balancer-controller
│ ├── cluster-autoscaler
│ ├── ebs-csi-driver
│ └── keda
└── langsmith namespace
├── backend, frontend, playground, queue, clickhouse
└── redis (in-cluster) OR ElastiCache ──► private subnet
└── RDS PostgreSQL ──────────────► private subnet
└── S3 bucket ──────────────► VPC Gateway Endpoint (no public route)
Envoy Gateway (opt-in, enable_envoy_gateway = true):
Internet
│
▼
AWS Network Load Balancer (NLB — ACM TLS termination at port 443)
│
▼
envoy-gateway-system namespace
└── Envoy proxy (GatewayClass: eg, Gateway: langsmith-gateway)
│ HTTPRoute → langsmith-frontend:80
▼
langsmith namespace
└── backend, frontend, playground, queue, clickhouse, ...
langsmith-agents namespace (optional — dataplane)
└── langgraph-dataplane listener + operator + agent pods
└── HTTPRoute attaches to shared langsmith-gateway (cross-namespace via allowedRoutes: All)
Egress path with Network Firewall (optional — create_firewall = true)
When Network Firewall is enabled, all outbound internet traffic from private subnets is inspected before reaching the NAT gateway:
EKS pods / RDS / ElastiCache (private subnets)
│ 0.0.0.0/0 → firewall endpoint (private route table)
▼
AWS Network Firewall (firewall subnet, same AZ as NAT gateway)
│ domain allowlist: TLS SNI + HTTP Host inspection
│ ALLOWLIST: firewall_allowed_fqdns (default: beacon.langchain.com)
│ DROP: all other established connections
▼
NAT Gateway (public subnet)
│
▼
Internet
Internal traffic (pod-to-pod, pod-to-RDS, pod-to-ElastiCache) routes via the local VPC route and never touches the firewall.
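The allowlist decision described above can be modelled in a few lines of Python. This is a simplified sketch, not the firewall's implementation. It assumes the hostname has already been extracted from the TLS SNI or HTTP Host header, and it assumes the AWS Network Firewall domain-list convention in which an entry with a leading dot (for example `.example.com`) matches the domain and its subdomains. The function name `egress_verdict` is illustrative only.

```python
def egress_verdict(hostname: str, allowed_fqdns: list[str]) -> str:
    """Return "ALLOW" or "DROP" for an outbound connection to `hostname`."""
    host = hostname.lower().rstrip(".")
    for entry in allowed_fqdns:
        entry = entry.lower()
        if entry.startswith("."):
            # Assumed wildcard form: matches the bare domain and any subdomain.
            if host == entry[1:] or host.endswith(entry):
                return "ALLOW"
        elif host == entry:
            # Plain entry: exact match only.
            return "ALLOW"
    return "DROP"


# Default value of firewall_allowed_fqdns according to this document.
allowed = ["beacon.langchain.com"]
print(egress_verdict("beacon.langchain.com", allowed))
print(egress_verdict("example.org", allowed))
```

A model like this can help when reviewing a proposed firewall_allowed_fqdns list before applying it, since a missing entry only shows up at runtime as a dropped connection.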
Ingress Options
Four mutually exclusive ingress options are supported. The choice determines whether split dataplane (agent pods in a separate namespace) is possible:
| Option | Variable | Split dataplane | Traffic path | When to use |
|---|---|---|---|---|
| ALB (AWS LBC) | (default) | No | ALB → frontend NodePort | Default. Single-namespace deployments, POC, simplest TLS via ACM. |
| NGINX Ingress | enable_nginx_ingress = true | No | ALB → TGB → NGINX controller → frontend ClusterIP | When NGINX is already the standard in your org. ALB TGB wires the pre-provisioned ALB target group to the NGINX pods. |
| Envoy Gateway | enable_envoy_gateway = true | Yes | ALB → TGB → Envoy proxy pod:10080 → HTTPRoute → services | Cross-namespace HTTPRoute routing. Recommended for split dataplane on new AWS deployments. |
| Istio | enable_istio_gateway = true | Yes | ALB → TGB → istio-ingressgateway:80 → VirtualService → services | For clusters with Istio already installed or when mTLS mesh is required. Istio 1.23+ binds port 80 directly via NET_BIND_SERVICE. |
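As a rough pre-flight check, the table can be encoded in Python to catch a terraform.tfvars that enables more than one ingress option. This is a sketch under assumptions: the short option keys (`alb`, `nginx`, `envoy`, `istio`), the dictionary layout, and the function names are illustrative, and the Terraform module may or may not perform an equivalent validation itself. The target ports are taken from the Validated Behaviors table later in this document.

```python
# Data copied from the Ingress Options table; ports come from the
# Validated Behaviors table. The layout of this dictionary is illustrative.
INGRESS_OPTIONS = {
    "alb": {"variable": None, "split_dataplane": False, "tgb_port": None},
    "nginx": {"variable": "enable_nginx_ingress", "split_dataplane": False, "tgb_port": 80},
    "envoy": {"variable": "enable_envoy_gateway", "split_dataplane": True, "tgb_port": 10080},
    "istio": {"variable": "enable_istio_gateway", "split_dataplane": True, "tgb_port": 80},
}


def resolve_ingress(tfvars: dict) -> str:
    """Return the selected ingress option, or raise if more than one is enabled."""
    enabled = [
        name
        for name, option in INGRESS_OPTIONS.items()
        if option["variable"] and tfvars.get(option["variable"])
    ]
    if len(enabled) > 1:
        raise ValueError(f"ingress options are mutually exclusive, got: {enabled}")
    # With no enable_* flag set, the ALB default applies.
    return enabled[0] if enabled else "alb"


def supports_split_dataplane(tfvars: dict) -> bool:
    """True when the selected ingress can route to a separate agent namespace."""
    return INGRESS_OPTIONS[resolve_ingress(tfvars)]["split_dataplane"]


print(resolve_ingress({"enable_envoy_gateway": True}))
print(supports_split_dataplane({}))
```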
Split Dataplane — Why ALB Cannot Support It
Standard Kubernetes Ingress is namespace-scoped. The ALB controller can only route to
services within the same namespace as the Ingress resource. Agent pods in langsmith-agents
are invisible to an Ingress in langsmith.
Envoy Gateway and Istio both support cross-namespace routing: Envoy Gateway through Kubernetes Gateway API HTTPRoutes, and Istio through VirtualServices.
ALB + Envoy Gateway (chained pattern)
When a customer has an existing ALB with SSO (Okta/Cognito), WAF, and TLS configured, Envoy Gateway is added behind it rather than replacing it:
Internet
│
▼
ALB ← unchanged: WAF, SSO (Okta/Cognito OIDC), TLS, DNS
│
▼ (ALB target group retargeted to Envoy NLB — only change)
Envoy Gateway NLB (internal-scheme, auto-provisioned by k8s-bootstrap)
│
├── HTTPRoute → langsmith ns (control plane)
└── HTTPRoute → langsmith-agents ns (agent pods — split dataplane)
See helm/values/examples/langsmith-values-ingress-envoy-gateway.yaml for the Helm values
and doc_use_cases/enable_split_dataplane_aws/ for the full split dataplane guide.
IRSA (IAM Roles for Service Accounts)
IRSA is used instead of static credentials for S3 access:
- An IAM Role is created with a trust policy scoped to the EKS cluster's OIDC issuer.
- The role is granted s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket on the LangSmith bucket.
- The Kubernetes Service Account in the langsmith namespace is annotated with the role ARN.
- Pods receive temporary credentials via the EKS token webhook, so no static AWS keys are required.
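The two policy documents behind this setup can be sketched as plain JSON built in Python. This is an illustration, not the Terraform module's actual output: the account ID, OIDC issuer URL, bucket name, and the service account name `langsmith-backend` are placeholders, and the `sub` condition that pins the role to a single service account is an assumption based on common IRSA practice (the source only states that the trust policy is scoped to the cluster's OIDC issuer).

```python
import json


def irsa_trust_policy(account_id: str, oidc_issuer: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy letting one Kubernetes service account assume the role."""
    issuer = oidc_issuer.removeprefix("https://")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{issuer}:aud": "sts.amazonaws.com",
                # Assumption: restrict to a single namespace/service account.
                f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
            }},
        }],
    }


def s3_access_policy(bucket: str) -> dict:
    """Permissions policy with the four S3 actions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Object-level actions apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {   # ListBucket applies to the bucket itself.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


trust = irsa_trust_policy(
    "123456789012",                                        # placeholder account
    "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",  # placeholder issuer
    "langsmith",
    "langsmith-backend",                                   # hypothetical name
)
print(json.dumps(trust, indent=2))
print(json.dumps(s3_access_policy("example-langsmith-bucket"), indent=2))
```

The service account side of the link is the `eks.amazonaws.com/role-arn` annotation carrying the role ARN.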
Module Dependency Graph
vpc ──► firewall (AWS Network Firewall, optional — create_firewall = true)
│
├──► eks ──► k8s-bootstrap (KEDA, ESO, Envoy Gateway [opt-in])
│ └──► cert-manager (standalone module — Let's Encrypt DNS-01 via Route 53 IRSA)
│
├──► postgres (RDS, private subnets from VPC)
├──► redis (ElastiCache, private subnets from VPC)
├──► storage (S3 bucket + VPC Gateway Endpoint)
├──► alb (pre-provisioned ALB, public subnets)
│ └──► alb_access_logs (S3 bucket for ALB access logs — opt-in)
├──► dns (Route 53 zone + ACM cert, optional)
├──► bastion (jump host for private EKS access, optional)
├──► cloudtrail (audit logging, optional)
├──► waf (WAF ACL on ALB, optional)
└──► firewall (Network Firewall egress filter, optional)
all ──► langsmith (root module)
Opt-In Security Modules
Four modules are disabled by default and can be enabled in terraform.tfvars:
| Module | Variable | Default | Purpose |
|---|---|---|---|
| Network Firewall | create_firewall | false | AWS Network Firewall: FQDN-based egress filtering for private subnets. Intercepts outbound traffic between the private subnet route tables and the NAT gateway and allows only domains in firewall_allowed_fqdns (TLS SNI + HTTP Host); all other outbound traffic is dropped. Requires create_vpc = true. Cost: ~$0.395/hr per endpoint + $0.065/GB processed. |
| ALB access logs | alb_access_logs_enabled | false | Traffic analysis and compliance |
| CloudTrail | create_cloudtrail | false | API call logging (skip if org trail exists) |
| WAF | create_waf | false | WAFv2 Web ACL: OWASP Top 10, IP reputation, known bad inputs |
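To put the Network Firewall pricing in monthly terms, the arithmetic can be sketched as follows. The rates are the approximate figures quoted in the table ($0.395 per endpoint-hour and $0.065 per GB processed); the 730 hours-per-month convention and the function name are assumptions, and actual AWS pricing varies by region and over time.

```python
HOURLY_PER_ENDPOINT_USD = 0.395  # approximate rate quoted in the table
PER_GB_USD = 0.065               # approximate rate quoted in the table
HOURS_PER_MONTH = 730            # assumed average month length in hours


def firewall_monthly_cost(endpoints: int, gb_processed: float) -> float:
    """Estimate the monthly Network Firewall charge in USD."""
    endpoint_cost = endpoints * HOURLY_PER_ENDPOINT_USD * HOURS_PER_MONTH
    traffic_cost = gb_processed * PER_GB_USD
    return round(endpoint_cost + traffic_cost, 2)


# One endpoint processing 100 GB of egress in a month.
print(firewall_monthly_cost(1, 100))
```

Under these assumptions a single endpoint costs roughly $288 per month before any traffic is processed, so the fixed endpoint charge dominates at low egress volumes.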
cert-manager Module
cert-manager is a standalone module (was previously embedded inside k8s-bootstrap). It deploys cert-manager into the cert-manager namespace and provisions a ClusterIssuer for Let's Encrypt DNS-01 challenge resolution via Route 53 IRSA. It is enabled automatically when tls_certificate_source = "letsencrypt".
| Module | When active | Purpose |
|---|---|---|
| cert-manager | tls_certificate_source = "letsencrypt" | Deploys cert-manager + ClusterIssuer for Let's Encrypt DNS-01 via Route 53 IRSA |
Validated Behaviors & Known Constraints
These constraints were discovered during the gateway permutation test run (April 2026, all four modes validated).
| # | Area | Constraint / Fix |
|---|---|---|
| 1 | ACM wildcard SANs | langchain.com root zone has 0 issue "amazon.com" CAA but not 0 issuewild "amazon.com". Wildcard SANs (*.subdomain.langchain.com) always fail with CAA_ERROR. The dns module requests only the apex domain cert — no wildcard SAN. |
| 2 | In-cluster Redis | The LangSmith Helm chart deploys Redis without requirepass. The Terraform k8s_bootstrap module writes redis://langsmith-redis:6379 (no password). Do not add an auth token unless you also configure the Helm chart Redis values. |
| 3 | name_prefix length | Maximum 15 characters (not 11). Names like dz-nginx-tst (12) are valid. |
| 4 | Istio port | Istio 1.23+ ingressgateway listens on port 80 directly via NET_BIND_SERVICE capability — not port 8080. ALB TGB health check and SG rules must target port 80. |
| 5 | NGINX TGB port | NGINX ingress-nginx controller pods listen on port 80. The TargetGroupBinding target type is ip. |
| 6 | Envoy proxy port | Envoy proxy pods listen on port 10080 (not 80) when running as non-root. The TGB servicePort must be 10080. |
| 7 | Destroy order | Always run terraform destroy first and let Terraform handle namespace + Helm release lifecycle. Pre-deleting namespaces causes the helm_release Terraform resource to timeout (~5m) because Helm cannot uninstall cleanly into a terminating namespace. |
| 8 | Stuck Terminating namespaces | KEDA's stale external.metrics.k8s.io/v1beta1 API group causes NamespaceDeletionDiscoveryFailure. Clear with: kubectl get namespace $ns -o json \| python3 -c "import sys,json; d=json.load(sys.stdin); d['spec']['finalizers']=[]; print(json.dumps(d))" \| kubectl replace --raw "/api/v1/namespaces/$ns/finalize" -f - |
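The inline Python in row 8 is easier to read when expanded into a small function. The sketch below performs the same transformation (emptying spec.finalizers on a Namespace object read as JSON); the function name and the sample document are illustrative only.

```python
import json


def strip_finalizers(namespace_json: str) -> str:
    """Return the Namespace JSON with spec.finalizers set to an empty list."""
    doc = json.loads(namespace_json)
    # setdefault keeps the function safe if the object has no spec block.
    doc.setdefault("spec", {})["finalizers"] = []
    return json.dumps(doc)


# Illustrative input, shaped like `kubectl get namespace <name> -o json`.
sample = json.dumps({
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {"name": "langsmith"},
    "spec": {"finalizers": ["kubernetes"]},
})
print(strip_finalizers(sample))
```

As in row 8, the resulting JSON is what gets sent to the namespace's /finalize endpoint with kubectl replace --raw.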
Default Resource Sizes
| Resource | Default size | vCPU | Memory |
|---|---|---|---|
| EKS node | m5.4xlarge | 16 | 64 GB |
| RDS PostgreSQL | db.t3.large | 2 | 8 GB |
| ElastiCache Redis | cache.m6g.xlarge | 4 | 13.07 GB |
| RDS storage | 10 GB | — | — |
DNS & TLS (Custom Domain)
Three paths for TLS, configured via tls_certificate_source:
| Mode | Behavior | Gateway |
|---|---|---|
| none | HTTP:80 only. No certificate. | Any |
| acm | HTTPS:443 with HTTP→HTTPS redirect. ACM certificate (auto-provisioned or BYO). | ALB, NGINX |
| letsencrypt | HTTPS via cert-manager + Let's Encrypt DNS-01 (Route 53 IRSA). | Istio, Envoy |
CAA constraint: langchain.com has 0 issue "amazon.com" but not 0 issuewild "amazon.com". ACM cannot issue wildcard certificates (*.subdomain.langchain.com) for subdomains of langchain.com. The dns module requests the apex domain only. Customers using their own domain are not affected.
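The pairing of TLS mode and gateway in the table above can be expressed as a lookup, which is handy for a quick sanity check of a planned configuration. This sketch only encodes the table as written; the lowercase gateway labels and the function name are illustrative, and the Terraform module's own validation (if any) may differ.

```python
# Gateway column of the TLS mode table; "Any" is expanded to all four gateways.
TLS_MODE_GATEWAYS = {
    "none": {"alb", "nginx", "envoy", "istio"},
    "acm": {"alb", "nginx"},
    "letsencrypt": {"envoy", "istio"},
}


def tls_mode_matches_gateway(tls_certificate_source: str, gateway: str) -> bool:
    """True if the table lists `gateway` for the given tls_certificate_source."""
    if tls_certificate_source not in TLS_MODE_GATEWAYS:
        raise ValueError(f"unknown tls_certificate_source: {tls_certificate_source!r}")
    return gateway in TLS_MODE_GATEWAYS[tls_certificate_source]


print(tls_mode_matches_gateway("acm", "alb"))
print(tls_mode_matches_gateway("letsencrypt", "alb"))
```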
Why ACM vs cert-manager?
ACM certificates are non-exportable — AWS attaches them directly to the ALB. This makes ACM the right choice when TLS terminates at the ALB, but it cannot be used when TLS terminates inside the cluster (Istio Gateway, Envoy Gateway) because those gateways require the actual certificate material as a Kubernetes secret.
cert-manager (tls_certificate_source = "letsencrypt") handles in-cluster TLS for Istio and Envoy. The letsencrypt value is the reference implementation — it deploys cert-manager with a Let's Encrypt ACME ClusterIssuer. In production, swap the ClusterIssuer for any cert-manager-compatible issuer:
| Issuer | When to use |
|---|---|
| Let's Encrypt (default) | Public domain, internet access, free |
| ACM Private CA (aws-privateca-issuer) | AWS-native, air-gap friendly, private domains; ~$400/mo |
| Venafi (cert-manager-venafi) | Enterprise PKI, regulated environments |
| HashiCorp Vault (cert-manager-vault) | Self-hosted PKI |
| DigiCert / Sectigo / others | ACME or custom issuer plugins |
The Terraform module provisions the cert-manager IRSA role and Route 53 permissions. The ClusterIssuer manifest is the only thing that changes between issuers.
Auto-provisioned DNS (recommended for new deployments)
When langsmith_domain is set (and acm_certificate_arn is empty), Terraform activates the dns module which creates:
- A Route 53 hosted zone for the domain
- An ACM certificate with DNS validation records
- A Route 53 alias record pointing the domain to the ALB
Staged deploy pattern: You can set langsmith_domain with tls_certificate_source = "none" first. Terraform creates the zone and cert but does not block on validation. Delegate NS records at your registrar, then flip to tls_certificate_source = "acm" in a later apply — Terraform blocks until the cert validates, then wires it into the ALB HTTPS listener.
Bring-your-own certificate
Set acm_certificate_arn directly to skip the dns module entirely. For in-cluster gateways, create a Kubernetes TLS secret manually and reference it in the Gateway/VirtualService — no cert-manager required.
Envoy Gateway (Gateway API Ingress)
Envoy Gateway is an opt-in alternative to the ALB ingress controller, enabled by setting enable_envoy_gateway = true in terraform.tfvars.
What the k8s-bootstrap module creates
| Resource | Name | Namespace |
|---|---|---|
| Helm release | envoy-gateway (chart: gateway-helm v1.3.0) | envoy-gateway-system |
| GatewayClass | eg | cluster-scoped |
| Gateway | langsmith-gateway | langsmith |
The GatewayClass is created explicitly (not via the certgen job) to ensure it persists across re-applies. The Gateway exposes listeners on port 80 (HTTP) and port 443 (HTTP, for ACM TLS termination at NLB) with allowedRoutes.namespaces.from: All — enabling HTTPRoutes from any namespace to attach.
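For orientation, a Gateway with this shape could look like the object built below. It is a sketch of a Gateway API resource matching the description above, not the manifest that k8s-bootstrap actually applies: the listener names `http` and `https` are hypothetical, and the real resource may carry extra annotations or fields.

```python
import json


def _listener(name: str, port: int) -> dict:
    """HTTP listener that accepts routes from every namespace."""
    return {
        "name": name,
        "port": port,
        # Protocol stays HTTP on both ports because TLS ends at the NLB.
        "protocol": "HTTP",
        "allowedRoutes": {"namespaces": {"from": "All"}},
    }


def langsmith_gateway() -> dict:
    """Build a Gateway resembling the one described in this section."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {"name": "langsmith-gateway", "namespace": "langsmith"},
        "spec": {
            "gatewayClassName": "eg",
            "listeners": [_listener("http", 80), _listener("https", 443)],
        },
    }


print(json.dumps(langsmith_gateway(), indent=2))
```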
How traffic flows
Client
│
▼ NLB (AWS NLB created by Envoy Gateway for the Gateway resource)
│ ACM TLS termination at port 443 (annotated by deploy.sh)
▼
Envoy proxy pod (envoy-gateway-system namespace)
│
▼ HTTPRoute (langsmith namespace) — created by LangSmith Helm chart when gateway.enabled: true
│
▼ langsmith-frontend:80
Multi-namespace dataplane support
Gateway API's allowedRoutes: All setting makes Envoy Gateway the recommended ingress for multi-namespace dataplane deployments. Each namespace running langgraph-dataplane can attach an HTTPRoute to the shared langsmith-gateway without modifying the Gateway resource.
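The attachment works through the HTTPRoute's parentRefs, which name the shared Gateway and its namespace. The sketch below builds such a route as a Python dictionary. It is illustrative rather than the chart's real output: the route name, path prefix, backend service name, and port are hypothetical placeholders.

```python
import json


def dataplane_http_route(name: str, namespace: str, path_prefix: str,
                         service: str, port: int) -> dict:
    """HTTPRoute in a dataplane namespace attached to the shared Gateway."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Cross-namespace reference to the Gateway in the langsmith namespace.
            "parentRefs": [{"name": "langsmith-gateway", "namespace": "langsmith"}],
            "rules": [{
                "matches": [{"path": {"type": "PathPrefix", "value": path_prefix}}],
                "backendRefs": [{"name": service, "port": port}],
            }],
        },
    }


route = dataplane_http_route(
    name="agent-route",            # hypothetical
    namespace="langsmith-agents",
    path_prefix="/agents",         # hypothetical
    service="example-agent",       # hypothetical Service name
    port=8000,                     # hypothetical port
)
print(json.dumps(route, indent=2))
```

Because kubectl accepts JSON as well as YAML, output like this can be reviewed and applied directly when experimenting in a test cluster.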
Apply the dataplane RBAC manifest once per dataplane namespace to allow langsmith-host-backend to stream pod logs:
kubectl apply -f helm/values/dataplane-rbac.yaml
# Edit namespace: field if using a namespace other than langsmith-agents
Verification Commands
# EKS cluster status
aws eks describe-cluster --name <cluster-name> --query "cluster.status"
# Node health
kubectl get nodes -o wide
# ALB status
kubectl get ingress -n langsmith
# RDS status
aws rds describe-db-instances \
--query "DBInstances[?DBInstanceIdentifier=='<db-id>'].DBInstanceStatus"
# ElastiCache status
aws elasticache describe-replication-groups \
--query "ReplicationGroups[?ReplicationGroupId=='<group-id>'].Status"
# S3 bucket from pod (via VPC endpoint)
# The amazon/aws-cli image already uses `aws` as its entrypoint, so pass only the subcommand
kubectl run s3-test --rm -it --image=amazon/aws-cli -n langsmith -- \
s3 ls s3://<bucket-name>