VPC Architecture

I built a three-tier VPC across three Availability Zones in us-east-1 (CIDR 10.20.0.0/16). Each tier has a distinct trust level enforced by both route tables and security groups.

[Diagram: ChessChat infrastructure — three-tier VPC, ECS Fargate, ALB, Redis, DynamoDB, CloudFront and Cognito]
| Tier | Subnets | Hosts | Internet |
|---|---|---|---|
| Public | 10.20.0.0/20, 10.20.16.0/20, 10.20.32.0/20 | ALB, NAT Gateway | Inbound + outbound via IGW |
| Private App | 10.20.48.0/20, 10.20.64.0/20, 10.20.80.0/20 | ECS Fargate tasks | Outbound only via NAT Gateway |
| Private Data | 10.20.96.0/20, 10.20.112.0/20, 10.20.128.0/20 | ElastiCache Redis | None — no internet route |

Each /20 subnet provides 4,091 usable IP addresses (4,096 minus the 5 that AWS reserves per subnet). That is far more than this project will ever need, but there is no real downside in an isolated VPC. A tighter design would size each CIDR block to the expected host count; in a shared enterprise VPC where address space is a contended resource, I would size the subnets accordingly.

Why Three Tiers?

Separating the tiers isolates resources by trust level, which limits the blast radius if any single layer is compromised. Security groups ensure only expected traffic is allowed at each layer. Route tables in the private tiers route 0.0.0.0/0 to a NAT Gateway rather than an Internet Gateway — NAT is outbound-only, so there is no inbound internet path to ECS tasks or Redis even if a security group rule were misconfigured.

NAT Gateway Strategy

A single NAT Gateway handles outbound internet traffic from the private app subnets. This introduces a single-AZ egress dependency: if that AZ fails, tasks in the other AZs lose outbound internet access. For a more resilient design I would deploy a NAT Gateway in each AZ. The Terraform VPC module accepts nat_gateway_mode = "single" | "per_az", so switching to per-AZ NAT is one small change (sketched below).
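
A minimal sketch of how that switch could work inside the VPC module — the variable name matches the module interface above, but the surrounding resource wiring (the subnet references in particular) is illustrative rather than the actual module code:

# Hypothetical sketch — nat_gateway_mode drives how many NAT Gateways are created
variable "nat_gateway_mode" {
  description = "single = one shared NAT Gateway, per_az = one NAT Gateway per Availability Zone"
  type        = string
  default     = "single"

  validation {
    condition     = contains(["single", "per_az"], var.nat_gateway_mode)
    error_message = "nat_gateway_mode must be \"single\" or \"per_az\"."
  }
}

locals {
  # One NAT Gateway per public subnet, or just the first — aws_subnet.public is a hypothetical name
  nat_subnet_ids = var.nat_gateway_mode == "per_az" ? aws_subnet.public[*].id : [aws_subnet.public[0].id]
}

resource "aws_eip" "nat" {
  count  = length(local.nat_subnet_ids)
  domain = "vpc"
}

resource "aws_nat_gateway" "this" {
  count         = length(local.nat_subnet_ids)
  subnet_id     = local.nat_subnet_ids[count.index]
  allocation_id = aws_eip.nat[count.index].id
}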


Security Groups

Security groups enforce a strict one-directional trust chain. Each layer only accepts traffic from the layer directly above it.

Internet (0.0.0.0/0)
  │  443 / 80
  ▼
ALB Security Group
  │  8080 (app container port)
  ▼
Fargate Task Security Group
  │  6379 (Redis)
  ▼
Redis Security Group

The Redis rule references the Fargate task security group ID rather than a CIDR block. Since Fargate tasks are ephemeral and get random IPs, any new task automatically gets access just by being a member of that security group — no IP management needed as tasks scale.
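
In Terraform terms, that ingress rule could look like the sketch below — resource names are illustrative, but the pattern (source_security_group_id instead of cidr_blocks) is the one described above:

# Illustrative sketch — Redis only accepts 6379 from members of the Fargate task SG
resource "aws_security_group_rule" "redis_from_fargate" {
  type                     = "ingress"
  from_port                = 6379
  to_port                  = 6379
  protocol                 = "tcp"
  security_group_id        = aws_security_group.redis.id        # hypothetical name
  source_security_group_id = aws_security_group.fargate_task.id # hypothetical name
}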


VPC Endpoints

AWS API calls from Fargate tasks route through VPC Endpoints rather than the NAT Gateway. This eliminates NAT data processing charges on ECR image pulls, CloudWatch log writes, and Secrets Manager reads — and keeps traffic inside the AWS network.

| Endpoint | Type | Cost | Purpose |
|---|---|---|---|
| S3 | Gateway | Free | ECR image layer storage, static assets |
| DynamoDB | Gateway | Free | Database access from app tier |
| ECR API | Interface | ~$7/mo per AZ | Container image metadata |
| ECR Docker | Interface | ~$7/mo per AZ | Container image pull |
| CloudWatch Logs | Interface | ~$7/mo per AZ | Application log delivery |
| Secrets Manager | Interface | ~$7/mo per AZ | Redis auth token at task startup |

The cost picture is more nuanced than "endpoints are always cheaper", though. Gateway endpoints for S3 and DynamoDB are free, but interface endpoints carry their own hourly and per-GB charges — so whether they save money over routing through NAT depends on AZ count, endpoint count, and actual traffic volume. More on this in my cost breakdown.
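
As a sketch of the two endpoint types (resource names here are illustrative, not the actual module code) — gateway endpoints attach to route tables and carry no hourly charge, while interface endpoints place one billable ENI in each subnet's AZ:

# Gateway endpoint — free, attached to the private route tables
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id                 # hypothetical name
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id   # hypothetical name
}

# Interface endpoint — hourly + per-GB charges, one ENI per subnet/AZ
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private_app[*].id        # hypothetical name
  security_group_ids  = [aws_security_group.endpoints.id]   # hypothetical name
  private_dns_enabled = true
}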


Terraform Structure

All infrastructure is defined in Terraform with a modular structure — one module per AWS service boundary. The root main.tf wires modules together, passing outputs as inputs (e.g. VPC subnet IDs feed into ECS and ElastiCache).

terraform/
├── modules/
│   ├── vpc/                  # VPC, subnets, IGW, NAT, route tables, endpoints
│   ├── ecs/                  # IAM task execution role and task role only
│   ├── ecs_compute/          # ECS cluster, task definition, service, ECR repo, SG
│   ├── alb/                  # ALB, target groups, listeners, ACM cert
│   ├── dynamodb/             # All DynamoDB tables and GSIs
│   ├── elasticache/          # Redis replication group, subnet group, param group
│   ├── cognito/              # User pool, app client, Google IdP
│   ├── route53/              # Hosted zone, alias records, ACM validation
│   ├── monitoring/           # CloudWatch alarms, dashboards, SNS, budgets
│   ├── static_edge/          # S3 bucket + CloudFront for chess-chat.com apex
│   └── github_actions_oidc/  # IAM OIDC provider + least-privilege deploy role
├── environments/dev/
│   └── terraform.tfvars
├── main.tf           # Module wiring
├── backend.tf        # S3 remote state config
└── versions.tf       # Provider version pins

The ECS domain is split into two modules. The ECS identity module (ecs/) creates IAM roles and policies only — the task execution role and the task role. The ECS compute module (ecs_compute/) creates the cluster, task definition, ECS service, Fargate security group, ECR repository, and CloudWatch log group. This split breaks a circular dependency: the Redis security group needs to reference the Fargate task security group as its source, but the task definition needs the Redis endpoint. Separating the resources removes the cycle.

Module Outputs as Inputs

Modules expose typed outputs that other modules consume directly. For example:

# In main.tf
module "elasticache" {
  source             = "./modules/elasticache"
  private_data_subnet_ids = module.vpc.private_data_subnet_ids
  fargate_sg_id      = module.ecs.fargate_task_sg_id
}

module "ecs_compute" {
  source         = "./modules/ecs_compute"
  redis_endpoint = module.elasticache.primary_endpoint
  redis_secret_arn = module.elasticache.auth_token_secret_arn
}

This wiring means Terraform's dependency graph is explicit — it knows to create ElastiCache before the ECS service, and to destroy the ECS service before ElastiCache.


Remote State

Terraform state is stored in S3 with versioning enabled and S3-native locking (use_lockfile = true), which writes a lock file alongside the state object. State versioning means any previous infrastructure state can be restored from the bucket's version history if needed.

# backend.tf
terraform {
  backend "s3" {
    bucket       = "chesschat-tfstate-xxxxxxxxxxxx-us-east-1"
    key          = "dev/terraform.tfstate"
    region       = "us-east-1"
    use_lockfile = true
  }
}

The state bucket name includes the AWS account ID to guarantee global uniqueness without a random suffix that would need to be tracked separately.
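
The backend block itself cannot interpolate values, so the bucket is created outside it — for example in a small bootstrap configuration. A hypothetical sketch of that bootstrap, deriving the name from the caller's account ID:

# Hypothetical bootstrap sketch — creates the state bucket the backend points at
data "aws_caller_identity" "current" {}

resource "aws_s3_bucket" "tfstate" {
  bucket = "chesschat-tfstate-${data.aws_caller_identity.current.account_id}-us-east-1"
}

resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}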


CI/CD Pipeline

GitHub Actions handles continuous deployment. Authentication to AWS uses OIDC (OpenID Connect). GitHub's OIDC provider is trusted by an IAM role, and the workflow assumes that role for the duration of the pipeline run.

Deploy pipeline steps on push to main (single job, sequential):

  1. Checkout source
  2. Configure AWS credentials via OIDC
  3. Login to Amazon ECR
  4. Build Docker image and push to ECR (commit SHA tag + main-latest tag)
  5. Generate config.js from GitHub Actions variables (Cognito IDs, app host, cookie config) and sync static auth assets to S3; config.js is uploaded with Cache-Control: no-store so Cognito config changes are immediately live — all other assets use a 5-minute CDN cache
  6. Trigger full CloudFront invalidation (/*)
  7. Register new ECS task definition revision
  8. Deploy ECS service and wait for stable state
  9. Upload deployment metadata artifact

Tests are not part of the deploy pipeline. Backend quality, frontend quality, and Terraform quality checks run only in PR workflows and are enforced as required status checks on main — branch protection is configured with enforce_admins = true and required_conversation_resolution = true, with force-push and deletion disabled. All of this branch protection configuration is managed as Infrastructure as Code using Terraform's GitHub provider, so the repository settings are version-controlled and applied consistently alongside the rest of the infrastructure.
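
A sketch of how that branch protection could be expressed with Terraform's GitHub provider — the repository reference and status-check names are placeholders, and exact attribute names vary slightly between provider versions:

# Hypothetical sketch — not the actual repository configuration
resource "github_branch_protection" "main" {
  repository_id = github_repository.chesschat.node_id   # hypothetical reference
  pattern       = "main"

  enforce_admins                  = true
  require_conversation_resolution = true    # the conversation-resolution setting referenced above
  allows_force_pushes             = false
  allows_deletions                = false

  required_status_checks {
    strict   = true
    contexts = ["backend-quality", "frontend-quality", "terraform-quality"]   # placeholder check names
  }
}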


Redis vs DynamoDB — The Split

The platform uses two data stores with deliberately separate responsibilities. The guiding principle: Redis holds what is needed right now; DynamoDB stores what needs to be remembered permanently. Active room metadata, in-progress game state, and the WebSocket connection map all live in Redis and expire naturally when the session ends. Completed game records, user profiles and stats, and friend connections live in DynamoDB.

Importantly, DynamoDB receives exactly one write per completed game, regardless of how many moves were played. All mid-game state lives in Redis. This keeps DynamoDB costs proportional to games completed, not game activity.


DynamoDB Schema

The schema uses a multi-table design — one table per domain — rather than a single-table pattern. IAM permissions, CloudWatch metrics, and PITR settings are all cleaner per-table. The social domain was also added after the initial design; retrofitting single-table key namespacing would have been painful, so the multi-table choice gave me a degree of freedom I wouldn't have had with a single-table design. In the interest of transparency: Claude helped design the schema based on my application requirements.

Users

users
  PK:  user_id (S) — Cognito sub, immutable UUID
  GSI: username-index — PK: username (S)
  Attributes: username · display_name · email · chess_com_username · chess_com_verified ·
              total_games · wins · losses · draws · created_at · last_login_at

Why user_id as PK, not username? Usernames can change. I originally planned a chess.com OAuth integration with a displacement mechanic — a verified chess.com account holder could claim a username held by an unverified user — though this didn't eventuate. Using the immutable Cognito sub means a username change is a single UpdateItem, not a data migration.

Games

games
  PK:  game_id (S) — UUID
  SK:  ended_at (S) — ISO 8601 timestamp string
  GSI 1: white-player-index — PK: white_player_id · SK: ended_at
  GSI 2: black-player-index — PK: black_player_id · SK: ended_at
  Attributes: white_player_id → users · black_player_id → users ·
              winner · result · total_moves · duration_seconds · room_code ·
              time_white_remaining · time_black_remaining · moves[] · fen_history[] · pgn

Why two GSIs? A player alternates colours across games, so querying their full history needs games where they were white AND games where they were black. A DynamoDB GSI partition key maps to a single attribute, so a single GSI cannot serve both lookup directions — two GSIs queried in parallel (and merged in the application) is the solution.
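
A sketch of the Games table in Terraform, under the assumptions that both GSIs project all attributes and the table runs on-demand (billing mode and projections are illustrative; keys and index names follow the schema above):

resource "aws_dynamodb_table" "games" {
  name         = "games"
  billing_mode = "PAY_PER_REQUEST"   # assumption — on-demand capacity
  hash_key     = "game_id"
  range_key    = "ended_at"

  attribute {
    name = "game_id"
    type = "S"
  }

  attribute {
    name = "ended_at"
    type = "S"
  }

  attribute {
    name = "white_player_id"
    type = "S"
  }

  attribute {
    name = "black_player_id"
    type = "S"
  }

  global_secondary_index {
    name            = "white-player-index"
    hash_key        = "white_player_id"
    range_key       = "ended_at"
    projection_type = "ALL"
  }

  global_secondary_index {
    name            = "black-player-index"
    hash_key        = "black_player_id"
    range_key       = "ended_at"
    projection_type = "ALL"
  }
}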

Social Tables

Five additional tables support the social layer:

friendships
  PK: user_id (S) · SK: friend_user_id (S)
  GSI: None — mirrored writes

friend_requests
  PK: recipient_user_id (S) · SK: request_id (S)
  GSI: sender-requests-index

challenges
  PK: challenge_id (S)
  GSIs: challenger-index · challenged-index

notifications
  PK: user_id (S) · SK: notification_id (S)
  GSI: None

pair_rooms
  PK: pair_id (S) — deterministic composite of both user IDs, sorted alphabetically
  GSI: None

Friendships are written in both directions simultaneously (A→B and B→A), so listing a user's friends is a single Query PK=user_id with no GSI required. The Challenges table mirrors the two-GSI pattern from Games — a challenge involves two parties and must be visible to both. The pair_rooms table stores a persistent room code per friend pair: when a challenge is accepted, the backend derives a deterministic pair_id (both user IDs sorted alphabetically and concatenated) and looks up or creates the room code for that pair, so returning friends always get the same room rather than a new random one each time.


Redis Schema

Redis stores ephemeral session state. All keys have a TTL and are treated as disposable — Redis state loss is survivable; DynamoDB loss is not.

room:{room_code}          String (JSON-encoded), TTL 60min
  {
    status,               waiting | active | teardown | terminal
    chime_meeting_id,
    chime_meeting_data,
    participants,         per-user connection state
    active_game,          null if no game running
    games_played          array of completed game summaries
  }

ws:{connection_id}        String (JSON-encoded), TTL 60min
  { user_id, room_code, connected_at }

room_expiries             Sorted Set (ZSET)
                          member: room_code, score: expiry Unix timestamp
                          Background worker scans for expired rooms

room_reconnect_deadlines  Sorted Set (ZSET)
                          member: room_code, score: reconnect grace deadline timestamp
                          Background worker scans for elapsed grace periods

game_finalization_queue   List — jobs awaiting DynamoDB write
game_finalization_deadletter  List — jobs that failed after 5 attempts

Room and connection state is stored as JSON strings using SET/GET, not as Redis Hashes. WebSocket connection state lives in Redis (not in Fargate task memory) so that any task can look up any connection — the foundation for distributed WebSocket routing when the pub/sub relay is added.

The two sorted sets enable a background expiry worker to reliably scan for and act on expired rooms and elapsed reconnect grace periods, even when the originating WebSocket connection is gone. This is more robust than relying solely on Redis key TTL, which fires no callback on expiry.

The finalization queue is a Redis List used as a lightweight job queue: completed game data waits there until it can be written to DynamoDB. If DynamoDB is momentarily unavailable, the job sits in the queue rather than being lost. Jobs that fail five times are parked in the dead letter list for inspection rather than retried indefinitely.


Data Flow

Room Creation

  1. User A sends a challenge to User B → WebSocket message to Fargate
  2. Backend generates 8-char alphanumeric room code
  3. Room written to Redis with 60-minute TTL
  4. Room code returned to User A's browser
  5. User B enters code → backend validates room exists
  6. Backend creates Chime SDK meeting, generates attendee tokens for both
  7. Meeting credentials pushed to both clients via WebSocket
  8. Chime SDK initialises on both frontends — video active

Chess Move

  1. Player drags piece — chess.js validates move locally (UX speed)
  2. Move sent to backend via WebSocket: { move: "e2e4" }
  3. Backend validates server-side with chess.js (authoritative source of truth)
  4. Board FEN, move list, and clocks updated in Redis
  5. Move broadcast to both clients: move_made event including moveType and isCheck metadata fields — the server drives the sound trigger type
  6. Both frontends update board, clocks, and play the mapped sound cue

Game Completion

  1. End condition detected (checkmate / resignation / timeout)
  2. Game metadata calculated (duration, time remaining, move count)
  3. Finalization job pushed to Redis List (game_finalization_queue) — write is decoupled from the in-flight WebSocket session to survive Fargate task crashes
  4. Background worker pops the job and writes to DynamoDB as a conditional PutItem (idempotent — safe to retry)
  5. Failed writes retry with exponential backoff (base 250ms, up to 5 attempts); jobs failing all 5 attempts are pushed to game_finalization_deadletter
  6. User stats updated — UpdateItem on Users table
  7. Active game cleared from Redis room state
  8. game_ended event broadcast to both clients

Session Lifecycle

Room created (User A)
  │  status: waiting
  ▼
User B joins
  │  status: active — Chime meeting created
  ▼
Games played (unlimited sequential games)
  │  Each game: Redis active_game → DynamoDB via finalization queue
  ▼
Player disconnects
  │  12-second reconnect grace (tracked in room_reconnect_deadlines ZSET)
  │  Reconnect events carry a reconnectVersion — frontend ignores stale
  │  updates with version ≤ last seen, preventing UI regression
  │  If grace expires → teardown begins
  ▼
Both players leave (or grace expires)
  │  Room transitioned to terminal: true via Redis WATCH/MULTI/EXEC (atomic)
  │  New join attempts rejected once terminal
  │  Chime meeting deleted
  │  Active game (if any) queued for DynamoDB write
  │  DEL room:{code} + ZREM from both sorted sets
  ▼
Room consumed — code is single-use

The terminal state transition is atomic (Redis optimistic locking via WATCH/MULTI/EXEC). The subsequent cleanup — DEL and removal from the two sorted sets — runs as separate commands after the transition is confirmed. Single-use enforcement relies on the terminal state check: new join attempts are rejected once the room is terminal, before the key is deleted.


WebSocket Routing

WebSocket connections require session affinity — a client must reach the same Fargate task to maintain its connection. The ALB target group has sticky sessions enabled (lb_cookie, 24-hour duration).
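
In Terraform, the stickiness setting could look like the sketch below (target group name and VPC reference are illustrative):

resource "aws_lb_target_group" "app" {
  name        = "chesschat-app"    # hypothetical name
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id    # hypothetical reference
  target_type = "ip"               # required for Fargate (awsvpc networking)

  stickiness {
    type            = "lb_cookie"
    enabled         = true
    cookie_duration = 86400        # 24 hours, in seconds
  }
}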

Current scaling constraint. WebSocket routing is single-task until a Redis pub/sub message relay is implemented. At that point, any task can relay messages to any other task and sticky sessions become a preference, not a requirement. This constraint is documented and the upgrade path is clear.

Connection state is stored in Redis (not task memory), so the foundation for distributed routing is already in place. Implementing the pub/sub relay is the remaining step.


Video Infrastructure

Current: Amazon Chime SDK

Chime SDK provides a managed WebRTC implementation. The server creates a meeting and issues attendee tokens; the client SDK handles negotiation, STUN/TURN, and media streams. Cost: $0.0017 per attendee-minute.

A 1-hour session with two players = 120 attendee-minutes = $0.20 per session.

Planned: Native WebRTC + Coturn

The migration replaces Chime SDK with browser-native WebRTC and a self-hosted Coturn TURN server. Cost becomes a flat EC2 instance fee rather than per-minute billing.

| | Chime SDK | WebRTC + Coturn |
|---|---|---|
| Cost model | $0.0017/attendee-min | Flat EC2 (~$15–30/mo) |
| Break-even | ~150 hours/month of video | |
| Control | Managed (limited) | Full (codec, bitrate, signalling) |
| TURN required | Managed by Chime | Self-hosted (~15% of sessions) |
| SFU needed | No (1-on-1) | No (1-on-1) |

Migration sequence: keep Chime now → swap to WebRTC + single Coturn instance pre-launch → add redundancy post-launch → add regional instances based on user geography.


Security Model

| Control | Implementation |
|---|---|
| Network isolation | Three-tier VPC, private subnets for app and data planes |
| Least-privilege IAM | Separate task execution role (ECR/logs/secrets) and task role (DynamoDB/Redis/Chime) |
| Secrets management | Secrets Manager for Redis auth token and third-party API keys — never in env vars |
| TLS everywhere | ACM certificates on ALB and CloudFront, HTTP → HTTPS redirect enforced |
| Container security | ECR image scanning on push, lifecycle policy (retain last 10 images) |
| Encryption at rest | DynamoDB SSE (AWS-managed), Redis encryption at rest enabled |
| Auth | Cognito with email verification, password complexity, optional MFA |
| Audit | CloudTrail (all API calls), VPC Flow Logs, CloudWatch application logs |
| Static site isolation | S3 bucket private, CloudFront OAC — bucket accessible only from this distribution |
| Session cookie (v1) | JS-managed cross-subdomain cookie — accepted risk, HttpOnly migration planned (ADR-0003) |

IAM Role Separation

ECS uses two distinct IAM roles per task:

  • Task Execution Role — used by the ECS agent to pull the container image from ECR, write logs to CloudWatch, and fetch secrets from Secrets Manager at startup. The application code never uses this role.
  • Task Role — used by the application at runtime. Grants access to specific DynamoDB table ARNs, Chime SDK meeting operations, and ElastiCache describe operations. No wildcard resources.
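
A condensed Terraform sketch of the two-role split — the policy statements are trimmed and the names and ARN references are placeholders rather than the real configuration:

# Both roles are assumable only by ECS tasks
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Task execution role — used by the ECS agent for image pull, logs, and secrets at startup
resource "aws_iam_role" "task_execution" {
  name               = "chesschat-task-execution"   # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "execution_managed" {
  role       = aws_iam_role.task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Task role — used by the application at runtime, scoped to specific table ARNs
resource "aws_iam_role" "task" {
  name               = "chesschat-task"   # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy" "task_dynamodb" {
  name = "dynamodb-access"
  role = aws_iam_role.task.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:UpdateItem", "dynamodb:Query"]
      Resource = [aws_dynamodb_table.games.arn, "${aws_dynamodb_table.games.arn}/index/*"]
    }]
  })
}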

Observability

Infrastructure Alarms

All alarms are CloudWatch Alarms, configured in Terraform's monitoring/ module and routed to an SNS topic for email notification. They cover the four infrastructure components that could silently degrade the platform — compute, load balancing, the database, and cost.

| Metric | Threshold | Window |
|---|---|---|
| ECS CPU utilisation | ≥ 75% | 1 × 5-min period |
| ECS memory utilisation | ≥ 80% | 1 × 5-min period |
| ALB 5xx errors | ≥ 5 in period | 1 × 5-min period |
| ALB unhealthy targets | > 0 | 2 × 5-min periods (10 min effective); treat_missing_data = breaching — catches scale-to-zero |
| ALB healthy hosts low | HealthyHostCount < 1 | treat_missing_data = breaching |
| DynamoDB throttled requests | Any | Warning |
| Redis CPU | ≥ 75% | 2 × 5-min periods |
| Monthly cost | 80% / 90% / 100% of $100 budget | Percentage alerts |
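
As an illustration of the pattern (dimensions, names, and the SNS reference are placeholders), the unhealthy-target alarm could be expressed like this — the key detail is treat_missing_data = "breaching", so an empty metric stream counts as a failure rather than silence:

resource "aws_cloudwatch_metric_alarm" "alb_unhealthy_targets" {
  alarm_name          = "chesschat-alb-unhealthy-targets"   # hypothetical name
  namespace           = "AWS/ApplicationELB"
  metric_name         = "UnHealthyHostCount"
  statistic           = "Maximum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  period              = 300   # 5-minute periods
  evaluation_periods  = 2
  treat_missing_data  = "breaching"   # no data (e.g. scaled to zero) trips the alarm

  dimensions = {
    TargetGroup  = aws_lb_target_group.app.arn_suffix   # hypothetical reference
    LoadBalancer = aws_lb.main.arn_suffix               # hypothetical reference
  }

  alarm_actions = [aws_sns_topic.alerts.arn]   # hypothetical reference
}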

Application Metrics

Eight custom metrics are emitted to CloudWatch in the Chesschat/Dev namespace by the backend service (src/services/metrics.js):

  • WsConnectionsOpened / WsConnectionsClosed
  • GamesStarted / GamesEnded
  • GamePersistSucceeded / GamePersistRetried / GamePersistFailed
  • AppErrors

These appear alongside infrastructure metrics on the CloudWatch operations dashboard. The three GamePersist* metrics track the finalization worker's reliability in real time — retries and dead-letters are immediately visible without log parsing.

Logging

Application logs are written to CloudWatch Logs (/ecs/chesschat/application) with 30-day retention. VPC Flow Logs capture all network traffic in a separate log group with the same retention.
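
A minimal sketch of the two log groups with matching retention — the application path matches the one above; the flow-log group name is an assumption:

resource "aws_cloudwatch_log_group" "application" {
  name              = "/ecs/chesschat/application"
  retention_in_days = 30
}

resource "aws_cloudwatch_log_group" "vpc_flow_logs" {
  name              = "/vpc/chesschat/flow-logs"   # hypothetical name
  retention_in_days = 30
}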

Every HTTP request is assigned an x-correlation-id — either the client-supplied value or a generated UUID — which is logged alongside method, path, status, and latency. WebSocket sessions are traced the same way using a sessionId and connectionId. When an alarm fires, the correlation ID is the starting point: find the matching log entries and trace the request end to end.


Cost Model

The March bill came to $212.25 (pre-tax) at near-zero actual usage. Almost the entire cost is fixed always-on infrastructure — the variable components (compute, data transfer, Chime) were negligible, which makes sense given I was the only person using it.

| Service | March actual | Notes |
|---|---|---|
| VPC Interface Endpoints | $89.28 | 4 endpoints × 3 AZs = 12 endpoint-hours/hr — per-AZ pricing |
| NAT Gateway | $33.48 | Hourly charge only — 0.025 GB data processed |
| ElastiCache Redis | $23.81 | 1 primary + 1 replica, t4g.micro |
| ALB | $16.71 | Fixed hourly — negligible LCU usage |
| Domain registrar | $15.00 | Annual fee |
| Public IPv4 addresses | $14.85 | $0.005/hr per in-use address — AWS 2024 pricing change |
| ECS Fargate | $9.40 | 0.25 vCPU / 512 MiB baseline task |
| CloudWatch | $8.40 | 28 custom metrics at $0.30/metric |
| Other | $1.32 | Secrets Manager, Route 53, ECR, Chime ($0.03) |
| Total | $212.25 | Pre-tax |

The Always-On Problem

At this scale, the bill exposes a structural issue: roughly $163 of the $212 is fixed cost that runs regardless of whether anyone uses the platform. The NAT Gateway processed 0.025 GB of data but cost $33. The VPC endpoints processed 1.67 GB but cost $89. The ALB served negligible traffic but cost $17. This is the real cost of production-grade infrastructure running at development load.

The biggest surprise was the VPC Interface Endpoints. Interface endpoints are billed per-AZ — each endpoint deployed across three AZs counts as three hourly charges. Four endpoints across three AZs produces 12 endpoint-hours per clock hour, not 4. At low data volumes, this makes interface endpoints significantly more expensive than simply routing through NAT. The free Gateway endpoints (S3 and DynamoDB) are unaffected by this as they have no hourly charge.
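
As a rough check — assuming the standard us-east-1 interface endpoint rate of about $0.01 per AZ-hour — the arithmetic lines up with the bill: 4 endpoints × 3 AZs × 744 hours in March × $0.01 ≈ $89.28, which matches the line item above almost to the cent and confirms the cost is essentially all hourly charges rather than data processing.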

At Scale, the Math Changes

The fixed costs stay flat while variable costs grow with usage. NAT data processing charges ($0.045/GB) and Fargate compute scale with real load — at meaningful traffic, the always-on baseline becomes a smaller fraction of total spend. The VPC endpoints also start paying for themselves: routing ECR pulls and CloudWatch log writes through endpoints rather than NAT avoids data processing charges that compound quickly at scale. The architecture is correctly sized for production; it is just expensive to run at near-zero load.

The WebRTC + Coturn migration eliminates Chime per-minute billing, replacing it with a flat EC2 instance cost — the right trade once session volume is predictable.