Flash Sale System
High-concurrency e-commerce flash sale platform
Overview
A production-grade flash sale system designed to handle 10,000 concurrent purchase requests without overselling. Built monolith-first, then migrated to microservices using the Strangler Fig pattern.
Quick Info
The Problem
Flash sales produce massive traffic spikes that overwhelm traditional RDBMS locking mechanisms, causing overselling, latency blowups, and DB connection pool exhaustion under load.
My Solution
Moved inventory management to Redis, using an atomic Lua script to check and decrement stock in a single round trip. BullMQ queues order processing; Kafka fans events out to downstream services.
Key Features
- Atomic inventory deduction via Redis Lua script — zero overselling at 10,000 concurrent requests
- BullMQ order queue prevents DB connection pool exhaustion during sale spikes
- Kafka event fanout triggers email, analytics, and inventory sync independently
- JWT access + refresh token rotation, plus an idempotency key on the purchase endpoint
- Deployed on AWS ECS Fargate with a GitHub Actions CI/CD pipeline
- Real-time inventory updates pushed to all connected clients via Socket.IO + Redis Adapter
🎬Demo Video
🏗️System Design (HLD)
Key Design Decisions
Chose Redis over PostgreSQL row locking for inventory — DB locking caused 3s p99 latency under 5k concurrent users; the Redis Lua script reduced this to under 40ms.
Chose BullMQ over direct DB writes — a connection pool of 20 was exhausted in under 2 seconds during load tests; BullMQ gives the DB a controlled write rate.
Chose Kafka alongside BullMQ — BullMQ handles reliable job processing, while Kafka handles decoupled event fanout to the email, analytics, and inventory sync services.
🗄️DB Schema & Optimizations
Performance Optimizations
Added an index on (user_id, created_at DESC) — the hot path for order history queries, which run on every dashboard load.
Partial index filtered on status = 'active' — active-sale lookup is the only hot path, and indexing just those rows keeps the index small.
Capped the PostgreSQL pool at 20 connections per ECS Fargate task — prevented connection exhaustion during the 5k concurrent user load test.
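A sketch of the pool cap and both indexes, using pg. The orders/sales table names and index names are illustrative assumptions; the real schema isn't shown here:

```typescript
import { Pool } from "pg";

// Hard cap of 20 connections per task -- the limit described above.
export const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,
});

export async function createIndexes(): Promise<void> {
  // Hot path: order history, run on every dashboard load.
  await pool.query(`
    CREATE INDEX IF NOT EXISTS idx_orders_user_recent
      ON orders (user_id, created_at DESC)
  `);
  // Partial index: only active sales are ever looked up, so indexing
  // just those rows keeps the index small.
  await pool.query(`
    CREATE INDEX IF NOT EXISTS idx_sales_active
      ON sales (id) WHERE status = 'active'
  `);
}
```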
⚙️Technical Deep Dive
Problem: Race condition in inventory deduction — two users could purchase the last item simultaneously.
First attempt: PostgreSQL row-level locking with SELECT FOR UPDATE — caused lock contention and 3s+ latency under load.
Solution: An atomic Lua script in Redis checks inventory and decrements it in a single operation, making it impossible for two requests to read the same count.
Result: Zero overselling across 50 load test runs at 10,000 concurrent users.
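A minimal sketch of that script wired up with ioredis. The `inventory:<saleId>` key scheme and the -1 sold-out sentinel are illustrative assumptions, not the project's actual code:

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL!);

// Runs entirely server-side in Redis. Scripts execute serially, so the
// GET and DECR can never interleave with another request's script.
const RESERVE_SCRIPT = `
local stock = tonumber(redis.call('GET', KEYS[1]))
if stock and stock > 0 then
  redis.call('DECR', KEYS[1])
  return stock - 1
end
return -1
`;

// Returns the stock remaining after this purchase, or -1 if sold out.
export async function tryReserve(saleId: string): Promise<number> {
  // Key name "inventory:<saleId>" is an assumed convention.
  return (await redis.eval(RESERVE_SCRIPT, 1, `inventory:${saleId}`)) as number;
}
```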
Problem: The DB connection pool was exhausted within 2 seconds of sale start under 5k concurrent users.
First attempt: Increasing the pool size — not viable within the ECS Fargate task's memory limits; larger pools caused memory pressure.
Solution: A BullMQ queue absorbs the spike. Workers process jobs at a controlled rate the DB can handle, and failed jobs retry with exponential backoff.
Result: DB connection usage stayed under 60% even at 10k concurrent purchase attempts.
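A sketch of the queue-and-worker split, assuming BullMQ over an ioredis connection. The 'orders' queue name matches the flow diagram later in the document; the job payload and the processOrder stub are hypothetical:

```typescript
import { Queue, Worker } from "bullmq";
import Redis from "ioredis";

// BullMQ requires maxRetriesPerRequest: null on its blocking connection.
const connection = new Redis(process.env.REDIS_URL!, {
  maxRetriesPerRequest: null,
});

export const ordersQueue = new Queue("orders", { connection });

// API side: enqueue instead of writing to PostgreSQL directly.
export async function enqueueOrder(userId: string, saleId: string) {
  await ordersQueue.add(
    "process-order",
    { userId, saleId },
    { attempts: 3, backoff: { type: "exponential", delay: 1000 } }
  );
}

// Hypothetical stand-in for the real validate-then-insert step.
async function processOrder(userId: string, saleId: string): Promise<void> {
  /* INSERT INTO orders ... */
}

// Worker side: concurrency caps the DB write rate, so the pool of 20
// survives the spike no matter how fast jobs arrive.
new Worker("orders", async (job) => {
  await processOrder(job.data.userId, job.data.saleId);
}, { connection, concurrency: 10 });
```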
Problem: The email and analytics services were tightly coupled to the order processing flow, causing cascading failures.
First attempt: Direct HTTP calls from the order service — one downstream failure would fail the entire purchase.
Solution: Publish order.confirmed events to Kafka. Each downstream service is an independent consumer group and can fail without affecting the purchase flow.
Result: An email service outage no longer affects the purchase success rate.
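A sketch of the fanout with kafkajs. The order.confirmed topic matches the event flow shown later; the broker address, group id, and message shape are assumptions:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({
  clientId: "flash-sale",
  brokers: [process.env.KAFKA_BROKER!],
});

// Producer side: publish once, after the order is committed to PostgreSQL.
const producer = kafka.producer();
export async function publishOrderConfirmed(orderId: string): Promise<void> {
  await producer.connect();
  await producer.send({
    topic: "order.confirmed",
    messages: [{ key: orderId, value: JSON.stringify({ orderId }) }],
  });
}

// Consumer side: each service picks its own groupId, so it gets its own
// offset cursor and scales or fails independently of the purchase flow.
export async function runEmailConsumer(): Promise<void> {
  const consumer = kafka.consumer({ groupId: "email-service" });
  await consumer.connect();
  await consumer.subscribe({ topic: "order.confirmed", fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const { orderId } = JSON.parse(message.value!.toString());
      // send the confirmation email for orderId (omitted)
    },
  });
}
```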
📈Performance & Scale Numbers
🔒Security Implementation
| Layer | Implementation |
|---|---|
| Auth | JWT access token (15min) + refresh token rotation (7 days) stored in HTTP-only cookies |
| Headers | Helmet.js — XSS protection, clickjacking prevention, MIME sniffing disabled |
| Rate Limiting | express-rate-limit — 100 req/min per IP on purchase endpoint, 10 req/min on auth |
| Idempotency | Idempotency key required on POST /order — prevents duplicate orders on client retry |
| CORS | Whitelist of allowed origins only — no wildcard in production |
| Input Validation | Zod schema validation on all request bodies — rejects malformed payloads before hitting business logic |
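A sketch of the idempotency check as Express middleware backed by ioredis. The header name, idem: key prefix, and 24-hour window are assumptions:

```typescript
import type { Request, Response, NextFunction } from "express";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL!);

// SET ... NX only succeeds for the first request carrying a given key,
// so a client retry with the same key is rejected as a duplicate.
export async function idempotency(
  req: Request,
  res: Response,
  next: NextFunction
) {
  const key = req.header("Idempotency-Key");
  if (!key) {
    return res.status(400).json({ error: "Idempotency-Key header required" });
  }
  const firstSeen = await redis.set(`idem:${key}`, "1", "EX", 86400, "NX");
  if (firstSeen === null) {
    return res.status(409).json({ error: "Duplicate request" });
  }
  next();
}
```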
⚡Caching Strategy (Redis)
| Data | Strategy |
|---|---|
| Sale details | Cache-aside: cached on first request, invalidated immediately when an admin updates the sale |
| Inventory | Redis is the source of truth, not a cache — updated atomically via Lua script on every purchase |
| Refresh tokens | Stored in Redis with a TTL equal to the JWT expiry — enables instant token revocation |
| Order data | Always fetched fresh from PostgreSQL — consistency is critical; stale order data is unacceptable |
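A cache-aside sketch for the sale-details entry, assuming ioredis; getSaleFromDb, the sale: key prefix, and the 5-minute TTL are hypothetical:

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL!);

// Hypothetical stand-in for the real PostgreSQL query.
async function getSaleFromDb(saleId: string): Promise<unknown> {
  /* SELECT ... FROM sales WHERE id = $1 */
  return { id: saleId };
}

// Cache-aside read: try Redis first, fall back to PostgreSQL, populate.
export async function getSale(saleId: string): Promise<unknown> {
  const cached = await redis.get(`sale:${saleId}`);
  if (cached) return JSON.parse(cached);

  const sale = await getSaleFromDb(saleId);
  await redis.set(`sale:${saleId}`, JSON.stringify(sale), "EX", 300);
  return sale;
}

// Write path: invalidate immediately when an admin updates the sale.
export async function invalidateSale(saleId: string): Promise<void> {
  await redis.del(`sale:${saleId}`);
}
```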
📬Async & Queue Architecture (BullMQ / Kafka)
Purchase Request
→ Redis Lua (atomic inventory check)
↓ success
→ BullMQ: add job to 'orders' queue
↓
→ Worker: validate → write to PostgreSQL
↓
→ Kafka: publish 'order.confirmed'
↓
├── Email Service (consumer group 1)
├── Analytics Service (consumer group 2)
└── Inventory Sync (consumer group 3)
Failed jobs → retry with exponential backoff (3 attempts)
Dead letter → manual review queue
🔴Real-time Layer (Socket.IO)
- sale:started → countdown timer begins for all connected clients simultaneously
- inventory:updated → live stock count pushed after every successful purchase
- order:confirmed → buyer's UI updates instantly without polling
- sale:ended → all clients notified, purchase button disabled automatically
- Scaling: Redis Adapter syncs Socket.IO events across multiple ECS Fargate nodes
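A sketch of the multi-node setup with @socket.io/redis-adapter. The event names match the list above; the port and payload shape are assumptions:

```typescript
import { Server } from "socket.io";
import { createAdapter } from "@socket.io/redis-adapter";
import Redis from "ioredis";

const pubClient = new Redis(process.env.REDIS_URL!);
const subClient = pubClient.duplicate();

// The Redis adapter relays emits between Fargate tasks, so every
// connected client sees the same event regardless of which node it hit.
const io = new Server(3001, {
  adapter: createAdapter(pubClient, subClient),
});

// Called after each successful Lua decrement; payload shape is assumed.
export function broadcastInventory(saleId: string, remaining: number): void {
  io.emit("inventory:updated", { saleId, remaining });
}
```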
🚀DevOps & CI/CD
Local Development:
Upstash Redis (TLS) + Supabase PostgreSQL
No Docker locally — cloud-hosted services only
CI/CD Pipeline (GitHub Actions):
Push to main
→ ESLint + TypeScript type checks
→ Jest unit tests (must pass)
→ Docker image build
→ Push to AWS ECR
→ Rolling deploy to ECS Fargate
Infrastructure:
ECS Fargate — no server management
Upstash Redis — serverless Redis via TLS
Supabase — managed PostgreSQL
Secrets Manager — env vars injected at runtime
CloudWatch — logs and container metrics
📄API Reference (Swagger)
🧪Unit Test Coverage
🔁What I'd Do Differently
- Would implement circuit breakers (opossum) earlier — learned this after a Kafka consumer outage cascaded
- Would add k6 load testing to the CI pipeline from day one instead of running it manually
- Would put a connection pooler (PgBouncer) in front of PostgreSQL instead of relying solely on pg-pool
- Would design the DB schema with multi-tenancy in mind from the start
💡Key Learnings
- Lua scripting in Redis for atomic operations is the correct tool for high-concurrency inventory — not DB locks
- BullMQ's stalled-job checker can burn through Upstash's command limit — fixed with stalledInterval: 60000 (see the sketch after this list)
- Kafka consumer groups give you independent scaling and failure isolation that direct HTTP calls cannot match
- ECS Fargate rolling deploys need health check grace periods to avoid premature task termination
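The stalled-checker fix from the second bullet, expressed as BullMQ worker options; everything here except stalledInterval: 60000 is an assumption:

```typescript
import { Worker } from "bullmq";
import Redis from "ioredis";

const connection = new Redis(process.env.REDIS_URL!, {
  maxRetriesPerRequest: null,
});

// BullMQ polls for stalled jobs every stalledInterval ms (default 30s).
// Doubling it to 60s halves that background traffic, which was eating
// into Upstash's command quota.
new Worker("orders", async (job) => { /* process job */ }, {
  connection,
  stalledInterval: 60000,
});
```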