LangSmithAWS EKS

LangSmith on AWS — Architecture


Platform Layers

LangSmith on AWS is deployed in three passes.

┌──────────────────────────────────────────────────────────────────────────────┐
│  Pass 3 — LangSmith Deployments  (enable_langsmith_deployments = true)       │
│                                                                              │
│  Purpose: Deploy and manage LangGraph applications from the LangSmith UI.   │
│                                                                              │
│  Adds to cluster:                                                            │
│    • host-backend   — deployment lifecycle API                               │
│    • listener       — syncs desired state into Kubernetes                    │
│    • operator       — controls LGP CRD and manages rollouts                 │
│                                                                              │
│  Per deployed graph:                                                         │
│    • api-server, queue, redis, postgres  (operator-managed)                  │
│                                                                              │
│  Requires: KEDA (installed in Pass 1 via k8s-bootstrap module)              │
├──────────────────────────────────────────────────────────────────────────────┤
│  Pass 2 — LangSmith Base Platform  (deploy_langsmith = true)                 │
│                                                                              │
│  Purpose: Observability, tracing, evaluations, experiments, API keys.        │
│                                                                              │
│  Components (Helm chart, namespace: langsmith):                              │
│    • backend        — core API server                                        │
│    • frontend       — React UI                                               │
│    • playground     — LLM prompt playground                                  │
│    • queue          — background job worker                                  │
│    • clickhouse     — trace analytics store                                  │
│    • redis          — task queue (in-cluster or ElastiCache)                 │
│    • postgres       — metadata store (in-cluster or RDS)                     │
│                                                                              │
│  Storage:  RDS PostgreSQL → metadata / S3 → trace blobs (VPC endpoint)      │
│  Ingress:  AWS ALB | NGINX | Envoy Gateway | Istio  (see Ingress Options)    │
├──────────────────────────────────────────────────────────────────────────────┤
│  Pass 1 — AWS Infrastructure                                                 │
│                                                                              │
│  Networking: VPC + private/public subnets + single NAT gateway               │
│  Compute:    EKS cluster + managed node group + cluster autoscaler          │
│  Database:   RDS PostgreSQL (db.t3.large, private subnets)                  │
│  Cache:      ElastiCache Redis (cache.m6g.xlarge, private subnets)          │
│  Storage:    S3 bucket (VPC Gateway Endpoint — no public internet)          │
│  Add-ons:    ALB controller + EBS CSI driver + metrics server               │
│  Bootstrap:  k8s-bootstrap (KEDA, ESO, Envoy Gateway [opt-in]),            │
│              cert-manager (standalone, active when tls=letsencrypt)         │
│  Ingress:    ALB (default) or Envoy Gateway (opt-in, enable_envoy_gateway)  │
│  Opt-in:     Network Firewall (FQDN-based egress filtering)                 │
└──────────────────────────────────────────────────────────────────────────────┘

Component → Storage Mapping

ComponentStorage backendAccess method
backendRDS PostgreSQLPrivate subnet, security group
backendS3 bucketIRSA + VPC Gateway Endpoint
clickhouseEBS volume (GP3, EKS PVC)Local
redisElastiCache or in-clusterPrivate subnet, security group
LGP operatorRDS PostgreSQL (shared)Private subnet, security group

Network Topology

Default (ALB ingress):

Internet
    │
    ▼
AWS Application Load Balancer (ALB — port 80 or 443)
    │  TLS via ACM / Let's Encrypt (optional)
    ▼
EKS Cluster (private subnets)
  ├── kube-system namespace
  │     ├── aws-load-balancer-controller
  │     ├── cluster-autoscaler
  │     ├── ebs-csi-driver
  │     └── keda
  └── langsmith namespace
        ├── backend, frontend, playground, queue, clickhouse
        └── redis (in-cluster) OR ElastiCache ──► private subnet
              └── RDS PostgreSQL ──────────────► private subnet
                    └── S3 bucket ──────────────► VPC Gateway Endpoint (no public route)

Envoy Gateway (opt-in, enable_envoy_gateway = true):

Internet
    │
    ▼
AWS Network Load Balancer (NLB — ACM TLS termination at port 443)
    │
    ▼
envoy-gateway-system namespace
  └── Envoy proxy (GatewayClass: eg, Gateway: langsmith-gateway)
        │  HTTPRoute → langsmith-frontend:80
        ▼
langsmith namespace
  └── backend, frontend, playground, queue, clickhouse, ...

langsmith-agents namespace (optional — dataplane)
  └── langgraph-dataplane listener + operator + agent pods
        └── HTTPRoute attaches to shared langsmith-gateway (cross-namespace via allowedRoutes: All)

Egress path with Network Firewall (optional — create_firewall = true)

When Network Firewall is enabled, all outbound internet traffic from private subnets is inspected before reaching the NAT gateway:

EKS pods / RDS / ElastiCache (private subnets)
    │  0.0.0.0/0 → firewall endpoint (private route table)
    ▼
AWS Network Firewall (firewall subnet, same AZ as NAT gateway)
    │  domain allowlist: TLS SNI + HTTP Host inspection
    │  ALLOWLIST: firewall_allowed_fqdns (default: beacon.langchain.com)
    │  DROP: all other established connections
    ▼
NAT Gateway (public subnet)
    │
    ▼
Internet

Internal traffic (pod-to-pod, pod-to-RDS, pod-to-ElastiCache) routes via the local VPC route and never touches the firewall.


Ingress Options

Four mutually exclusive ingress options are supported. The choice determines whether split dataplane (agent pods in a separate namespace) is possible:

OptionVariableSplit dataplaneTraffic pathWhen to use
ALB (AWS LBC)(default)NoALB → frontend NodePortDefault. Single-namespace deployments, POC, simplest TLS via ACM.
NGINX Ingressenable_nginx_ingress = trueNoALB → TGB → NGINX controller → frontend ClusterIPWhen NGINX is already the standard in your org. ALB TGB wires the pre-provisioned ALB target group to the NGINX pods.
Envoy Gatewayenable_envoy_gateway = trueYesALB → TGB → Envoy proxy pod:10080 → HTTPRoute → servicesCross-namespace HTTPRoute routing. Recommended for split dataplane on new AWS deployments.
Istioenable_istio_gateway = trueYesALB → TGB → istio-ingressgateway:80 → VirtualService → servicesFor clusters with Istio already installed or when mTLS mesh is required. Istio 1.23+ binds port 80 directly via NET_BIND_SERVICE.

Split Dataplane — Why ALB Cannot Support It

Standard Kubernetes Ingress is namespace-scoped. The ALB controller can only route to services within the same namespace as the Ingress resource. Agent pods in langsmith-agents are invisible to an Ingress in langsmith.

Envoy Gateway and Istio both support cross-namespace routing via the Kubernetes Gateway API (HTTPRoutes) and VirtualServices respectively.

ALB + Envoy Gateway (chained pattern)

When a customer has an existing ALB with SSO (Okta/Cognito), WAF, and TLS configured, Envoy Gateway is added behind it rather than replacing it:

Internet
    │
    ▼
ALB ← unchanged: WAF, SSO (Okta/Cognito OIDC), TLS, DNS
    │
    ▼  (ALB target group retargeted to Envoy NLB — only change)
Envoy Gateway NLB (internal-scheme, auto-provisioned by k8s-bootstrap)
    │
    ├── HTTPRoute → langsmith ns        (control plane)
    └── HTTPRoute → langsmith-agents ns (agent pods — split dataplane)

See helm/values/examples/langsmith-values-ingress-envoy-gateway.yaml for the Helm values and doc_use_cases/enable_split_dataplane_aws/ for the full split dataplane guide.


IRSA (IAM Roles for Service Accounts)

IRSA is used instead of static credentials for S3 access:

  1. An IAM Role is created with a trust policy scoped to the EKS cluster's OIDC issuer.
  2. The role is granted s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket on the LangSmith bucket.
  3. The Kubernetes Service Account in langsmith namespace is annotated with the role ARN.
  4. Pods receive temporary credentials via the EKS token webhook — no static AWS keys required.

Module Dependency Graph

vpc  ──►  firewall  (AWS Network Firewall, optional — create_firewall = true)
│
├──►  eks  ──►  k8s-bootstrap (KEDA, ESO, Envoy Gateway [opt-in])
│                └──►  cert-manager  (standalone module — Let's Encrypt DNS-01 via Route 53 IRSA)
│
├──►  postgres    (RDS, private subnets from VPC)
├──►  redis       (ElastiCache, private subnets from VPC)
├──►  storage     (S3 bucket + VPC Gateway Endpoint)
├──►  alb         (pre-provisioned ALB, public subnets)
│     └──►  alb_access_logs  (S3 bucket for ALB access logs — opt-in)
├──►  dns         (Route 53 zone + ACM cert, optional)
├──►  bastion     (jump host for private EKS access, optional)
├──►  cloudtrail  (audit logging, optional)
├──►  waf         (WAF ACL on ALB, optional)
└──►  firewall    (Network Firewall egress filter, optional)
          all ──►  langsmith (root module)

Opt-In Security Modules

Four modules are disabled by default and can be enabled in terraform.tfvars:

ModuleVariableDefaultPurpose
Network Firewallcreate_firewallfalseFQDN-based egress filtering — drops all outbound traffic not in firewall_allowed_fqdns. Requires create_vpc = true. Cost: ~$0.395/hr/endpoint + $0.065/GB processed.
ALB access logsalb_access_logs_enabledfalseTraffic analysis and compliance
CloudTrailcreate_cloudtrailfalseAPI call logging (skip if org trail exists)
WAFcreate_waffalseWAFv2 Web ACL — OWASP Top 10, IP reputation, known bad inputs
Network Firewallcreate_firewallfalseAWS Network Firewall — FQDN-based egress filtering for private subnets. Intercepts outbound traffic between the private subnet route tables and the NAT gateway; allows only domains in firewall_allowed_fqdns (TLS SNI + HTTP Host). Requires create_vpc = true. Cost: ~$0.40/hr per endpoint + $0.065/GB processed.

cert-manager Module

cert-manager is a standalone module (was previously embedded inside k8s-bootstrap). It deploys cert-manager into the cert-manager namespace and provisions a ClusterIssuer for Let's Encrypt DNS-01 challenge resolution via Route 53 IRSA. It is enabled automatically when tls_certificate_source = "letsencrypt".

ModuleWhen activePurpose
cert-managertls_certificate_source = "letsencrypt"Deploys cert-manager + ClusterIssuer for Let's Encrypt DNS-01 via Route 53 IRSA

Validated Behaviors & Known Constraints

Discovered during the gateway permutation test run (April 2026, all four modes validated).

#AreaConstraint / Fix
1ACM wildcard SANslangchain.com root zone has 0 issue "amazon.com" CAA but not 0 issuewild "amazon.com". Wildcard SANs (*.subdomain.langchain.com) always fail with CAA_ERROR. The dns module requests only the apex domain cert — no wildcard SAN.
2In-cluster RedisThe LangSmith Helm chart deploys Redis without requirepass. The Terraform k8s_bootstrap module writes redis://langsmith-redis:6379 (no password). Do not add an auth token unless you also configure the Helm chart Redis values.
3name_prefix lengthMaximum 15 characters (not 11). Names like dz-nginx-tst (12) are valid.
4Istio portIstio 1.23+ ingressgateway listens on port 80 directly via NET_BIND_SERVICE capability — not port 8080. ALB TGB health check and SG rules must target port 80.
5NGINX TGB portNGINX ingress-nginx controller pods listen on port 80. The TargetGroupBinding target type is ip.
6Envoy proxy portEnvoy proxy pods listen on port 10080 (not 80) when running as non-root. The TGB servicePort must be 10080.
7Destroy orderAlways run terraform destroy first and let Terraform handle namespace + Helm release lifecycle. Pre-deleting namespaces causes the helm_release Terraform resource to timeout (~5m) because Helm cannot uninstall cleanly into a terminating namespace.
8Stuck Terminating namespacesKEDA's stale external.metrics.k8s.io/v1beta1 API group causes NamespaceDeletionDiscoveryFailure. Clear with: kubectl get namespace $ns -o json | python3 -c "import sys,json; d=json.load(sys.stdin); d['spec']['finalizers']=[]; print(json.dumps(d))" | kubectl replace --raw "/api/v1/namespaces/$ns/finalize" -f -

Default Resource Sizes

ResourceDefault sizevCPUMemory
EKS nodem5.4xlarge1664 GB
RDS PostgreSQLdb.t3.large28 GB
ElastiCache Rediscache.m6g.xlarge413.07 GB
RDS storage10 GB

DNS & TLS (Custom Domain)

Three paths for TLS, configured via tls_certificate_source:

ModeBehaviorGateway
noneHTTP:80 only. No certificate.Any
acmHTTPS:443 with HTTP→HTTPS redirect. ACM certificate (auto-provisioned or BYO).ALB, NGINX
letsencryptHTTPS via cert-manager + Let's Encrypt DNS-01 (Route 53 IRSA).Istio, Envoy

CAA constraint: langchain.com has 0 issue "amazon.com" but not 0 issuewild "amazon.com". ACM cannot issue wildcard certificates (*.subdomain.langchain.com) for subdomains of langchain.com. The dns module requests the apex domain only. Customers using their own domain are not affected.

Why ACM vs cert-manager?

ACM certificates are non-exportable — AWS attaches them directly to the ALB. This makes ACM the right choice when TLS terminates at the ALB, but it cannot be used when TLS terminates inside the cluster (Istio Gateway, Envoy Gateway) because those gateways require the actual certificate material as a Kubernetes secret.

cert-manager (tls_certificate_source = "letsencrypt") handles in-cluster TLS for Istio and Envoy. The letsencrypt value is the reference implementation — it deploys cert-manager with a Let's Encrypt ACME ClusterIssuer. In production, swap the ClusterIssuer for any cert-manager-compatible issuer:

IssuerWhen to use
Let's Encrypt (default)Public domain, internet access, free
ACM Private CA (aws-privateca-issuer)AWS-native, air-gap friendly, private domains — ~$400/mo
Venafi (cert-manager-venafi)Enterprise PKI, regulated environments
HashiCorp Vault (cert-manager-vault)Self-hosted PKI
DigiCert / Sectigo / othersACME or custom issuer plugins

The Terraform module provisions the cert-manager IRSA role and Route 53 permissions. The ClusterIssuer manifest is the only thing that changes between issuers.

Auto-provisioned DNS (recommended for new deployments)

When langsmith_domain is set (and acm_certificate_arn is empty), Terraform activates the dns module which creates:

  • A Route 53 hosted zone for the domain
  • An ACM certificate with DNS validation records
  • A Route 53 alias record pointing the domain to the ALB

Staged deploy pattern: You can set langsmith_domain with tls_certificate_source = "none" first. Terraform creates the zone and cert but does not block on validation. Delegate NS records at your registrar, then flip to tls_certificate_source = "acm" in a later apply — Terraform blocks until the cert validates, then wires it into the ALB HTTPS listener.

Bring-your-own certificate

Set acm_certificate_arn directly to skip the dns module entirely. For in-cluster gateways, create a Kubernetes TLS secret manually and reference it in the Gateway/VirtualService — no cert-manager required.


Envoy Gateway (Gateway API Ingress)

Envoy Gateway is an opt-in alternative to the ALB ingress controller, enabled by setting enable_envoy_gateway = true in terraform.tfvars.

What the k8s-bootstrap module creates

ResourceNameNamespace
Helm releaseenvoy-gateway (chart: gateway-helm v1.3.0)envoy-gateway-system
GatewayClassegcluster-scoped
Gatewaylangsmith-gatewaylangsmith

The GatewayClass is created explicitly (not via the certgen job) to ensure it persists across re-applies. The Gateway exposes listeners on port 80 (HTTP) and port 443 (HTTP, for ACM TLS termination at NLB) with allowedRoutes.namespaces.from: All — enabling HTTPRoutes from any namespace to attach.

How traffic flows

Client
  │
  ▼ NLB (AWS NLB created by Envoy Gateway for the Gateway resource)
  │  ACM TLS termination at port 443 (annotated by deploy.sh)
  ▼
Envoy proxy pod (envoy-gateway-system namespace)
  │
  ▼ HTTPRoute (langsmith namespace) — created by LangSmith Helm chart when gateway.enabled: true
  │
  ▼ langsmith-frontend:80

Multi-namespace dataplane support

Gateway API's allowedRoutes: All setting makes Envoy Gateway the recommended ingress for multi-namespace dataplane deployments. Each namespace running langgraph-dataplane can attach an HTTPRoute to the shared langsmith-gateway without modifying the Gateway resource.

Apply the dataplane RBAC manifest once per dataplane namespace to allow langsmith-host-backend to stream pod logs:

kubectl apply -f helm/values/dataplane-rbac.yaml
# Edit namespace: field if using a namespace other than langsmith-agents

Verification Commands

# EKS cluster status
aws eks describe-cluster --name <cluster-name> --query "cluster.status"

# Node health
kubectl get nodes -o wide

# ALB status
kubectl get ingress -n langsmith

# RDS status
aws rds describe-db-instances \
  --query "DBInstances[?DBInstanceIdentifier=='<db-id>'].DBInstanceStatus"

# ElastiCache status
aws elasticache describe-replication-groups \
  --query "ReplicationGroups[?ReplicationGroupId=='<group-id>'].Status"

# S3 bucket from pod (via VPC endpoint)
kubectl run s3-test --rm -it --image=amazon/aws-cli -n langsmith -- \
  aws s3 ls s3://<bucket-name>