Portfolio Project

Flash Sale System

High-concurrency e-commerce flash sale platform

Node.js
TypeScript
Redis
Kafka
BullMQ
PostgreSQL
Docker
AWS ECS

Overview

A production-grade flash sale system designed to handle 10,000 concurrent purchase requests without overselling. Built with a monolith-first approach, then migrated to microservices using the Strangler Fig pattern.

Quick Info

Role: Built solo, end-to-end
Timeline: 3 weeks
Type: Backend

The Problem

Flash sales cause massive traffic spikes that overwhelm traditional RDBMS locking mechanisms, causing overselling, poor latency, and DB connection pool exhaustion under load.

My Solution

Moved inventory management to Redis with atomic Lua scripting for check-and-decrement in a single round trip. Used BullMQ to queue order processing and Kafka for event fanout to downstream services.
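The check-and-decrement can be sketched as a Lua script evaluated inside Redis; the script body and key name here are illustrative, not the exact production code. A pure TypeScript function with the same semantics is included so the logic can be unit-tested without a Redis instance.

```typescript
// Illustrative Lua script: it runs atomically inside Redis, so no two requests
// can observe the same stock value (the inventory key name is an assumption).
export const DECREMENT_STOCK_LUA = `
local stock = tonumber(redis.call('GET', KEYS[1]) or '0')
if stock <= 0 then
  return -1
end
return redis.call('DECR', KEYS[1])
`;

// Pure reference of the same check-and-decrement semantics, for unit tests:
export function checkAndDecrement(stock: number): { ok: boolean; remaining: number } {
  if (stock <= 0) return { ok: false, remaining: stock };
  return { ok: true, remaining: stock - 1 };
}

// With an ioredis-style client (assumed), the script would be invoked as:
//   const remaining = await redis.eval(DECREMENT_STOCK_LUA, 1, `inventory:${saleId}`);
//   if (remaining === -1) { /* sold out — reject before any DB work */ }
```

Because the whole check-then-decrement happens in one script execution, there is no window where two requests can both read "1 left".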

Key Features

  • Atomic inventory deduction via Redis Lua script — zero overselling at 10,000 concurrent requests
  • BullMQ order queue prevents DB connection pool exhaustion during sale spikes
  • Kafka event fanout triggers email, analytics, and inventory sync independently
  • JWT access + refresh token rotation, with an idempotency key required on the purchase endpoint
  • Deployed on AWS ECS Fargate with GitHub Actions CI/CD pipeline
  • Real-time inventory updates pushed to all connected clients via Socket.IO + Redis Adapter
🎬Demo Video
🏗️System Design (HLD)
HLD Architecture Diagram

Key Design Decisions

01

Chose Redis over PostgreSQL row locking for inventory — DB locking caused 3s p99 latency under 5k concurrent users. Redis Lua script reduced this to under 40ms.

02

Chose BullMQ over direct DB writes — the connection pool of 20 was exhausted in under 2 seconds during load tests. BullMQ gives the DB a controlled write rate.

03

Chose Kafka alongside BullMQ — BullMQ handles reliable job processing, Kafka handles decoupled event fanout to email, analytics, and inventory sync services independently.

🗄️DB Schema & Optimizations
ER Diagram

Performance Optimizations

Composite Index on Orders Table

Added an index on (user_id, created_at DESC) — the hot path for order-history queries, which run on every dashboard load.

Before: 340ms → After: 18ms
Partial Index on FlashSales

Indexed only rows WHERE status = 'active' — active-sale lookup is the only hot path, which shrank the index significantly.

Before: Full table scan → After: 80% smaller index
pg-pool Connection Pooling

Set a max of 20 connections per ECS Fargate task. Prevented connection exhaustion during the 5k concurrent-user load test.

Before: Pool exhausted in 2s → After: Stable under 10k users
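The three optimizations above can be sketched together. The table and index names are assumptions; the SQL mirrors the descriptions, and the pool options follow node-postgres's `Pool` config (the timeouts are illustrative, only `max: 20` comes from the text).

```typescript
// Index DDL mirroring the two optimizations above (table/index names assumed):
export const indexMigrations = [
  // Composite index for the order-history hot path (per user, newest first)
  `CREATE INDEX idx_orders_user_created ON orders (user_id, created_at DESC);`,
  // Partial index: only active sales are looked up on the hot path
  `CREATE INDEX idx_flash_sales_active ON flash_sales (status) WHERE status = 'active';`,
];

// pg Pool settings (option names from node-postgres; timeout values assumed):
export const poolConfig = {
  max: 20,                        // hard cap described above
  idleTimeoutMillis: 30_000,      // recycle idle connections
  connectionTimeoutMillis: 2_000, // fail fast under a spike instead of queueing forever
};
```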
⚙️Technical Deep Dive
Problem

Race condition in inventory deduction — two users could purchase the last item simultaneously

Considered

PostgreSQL row-level locking with SELECT FOR UPDATE — caused lock contention and 3s+ latency under load

Solution

Atomic Lua script in Redis: check inventory and decrement in a single operation, making it impossible for two requests to read the same count

Result ✅

Zero overselling across 50 load test runs at 10,000 concurrent users

Problem

DB connection pool exhausted within 2 seconds of sale start under 5k concurrent users

Considered

Increasing the pool size — not viable on the small ECS Fargate task (t3.micro-sized resources); larger pools caused memory pressure

Solution

BullMQ queue absorbs the spike. Workers process jobs at a controlled rate the DB can handle. Failed jobs retry with exponential backoff.
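A minimal sketch of the retry policy described above. The option names follow BullMQ's standard job options; the queue name comes from the flow diagram later in the page, and the delay values are illustrative.

```typescript
// Job options for the 'orders' queue: up to 3 attempts with exponential backoff.
export const orderJobOptions = {
  attempts: 3,                                   // total tries before dead-lettering
  backoff: { type: "exponential", delay: 1000 }, // 1s, then 2s, then 4s
  removeOnComplete: true,                        // keep Redis memory bounded
};

// Delay BullMQ's exponential strategy computes before retry attempt n (1-indexed):
export function backoffDelay(baseMs: number, attempt: number): number {
  return Math.pow(2, attempt - 1) * baseMs;
}

// Enqueueing (assumed setup; requires a live Redis connection):
//   const orders = new Queue("orders", { connection });
//   await orders.add("process-order", payload, orderJobOptions);
```

Workers then pull from the queue at whatever concurrency the DB can absorb, which is what turns the spike into a controlled write rate.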

Result ✅

DB connection usage stayed under 60% even at 10k concurrent purchase attempts

Problem

Email and analytics services were tightly coupled to the order processing flow, causing cascading failures

Considered

Direct HTTP calls from order service — one downstream failure would fail the entire purchase

Solution

Published order.confirmed events to Kafka. Each downstream service is an independent consumer group and fails without affecting the purchase flow.
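The fanout can be sketched with kafkajs-style wiring (library usage shown in comments; the event shape and group IDs are assumptions). Each downstream service subscribes with its own groupId, so a slow or failing consumer never blocks the purchase flow or the other consumers.

```typescript
// Assumed shape of the order.confirmed event:
export interface OrderConfirmed {
  orderId: string;
  userId: string;
  saleId: string;
  confirmedAt: string;
}

export function toKafkaMessage(event: OrderConfirmed): { key: string; value: string } {
  // Keying by orderId keeps all events for one order on the same partition,
  // so per-order ordering is preserved.
  return { key: event.orderId, value: JSON.stringify(event) };
}

// Producer side (assumed setup):
//   await producer.send({ topic: "order.confirmed", messages: [toKafkaMessage(event)] });
//
// Consumer side — each service uses a distinct consumer group:
//   const consumer = kafka.consumer({ groupId: "email-service" });
//   await consumer.subscribe({ topic: "order.confirmed", fromBeginning: false });
```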

Result ✅

Email service outage no longer affects purchase success rate

📈Performance & Scale Numbers
10,000 Concurrent Users
<40ms p99 Inventory Check
0 Oversells in Testing
99.9% Uptime on ECS
74% Test Coverage
~2s Cold Start Time
🔒Security Implementation
Auth: JWT access token (15 min) + refresh token rotation (7 days), stored in HTTP-only cookies
Headers: Helmet.js — XSS protection, clickjacking prevention, MIME sniffing disabled
Rate Limiting: express-rate-limit — 100 req/min per IP on the purchase endpoint, 10 req/min on auth
Idempotency: Idempotency key required on POST /order — prevents duplicate orders on client retry
CORS: Whitelist of allowed origins only — no wildcard in production
Input Validation: Zod schema validation on all request bodies — rejects malformed payloads before they hit business logic
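The idempotency guard can be sketched as a Redis claim: the client sends an Idempotency-Key header, the first request claims it with SET NX, and retries with the same key get the previously stored result instead of creating a second order. Key layout and TTL here are assumptions.

```typescript
// Redis key derivation for the idempotency claim (layout assumed):
export function idempotencyRedisKey(userId: string, idemKey: string): string {
  return `idem:${userId}:${idemKey}`;
}

export const IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60; // keep claims for a day (assumed)

// With an ioredis-style client (assumed):
//   const claimed = await redis.set(
//     idempotencyRedisKey(userId, key), "pending", "EX", IDEMPOTENCY_TTL_SECONDS, "NX",
//   );
//   if (claimed === null) { /* duplicate — return the stored result, do not re-order */ }
```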
Caching Strategy (Redis)
Active Flash Sale Config — TTL 60s

Sale details are cached on first request (cache-aside) and invalidated immediately when an admin updates the sale.
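A minimal cache-aside sketch for the sale config, written against a small cache interface so it runs without Redis; the key name is an assumption, the 60s TTL comes from the text.

```typescript
// Minimal cache abstraction (an ioredis client can satisfy this with a thin wrapper):
type Cache = {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  del(key: string): Promise<void>;
};

// Cache-aside read: hit returns the cached copy, miss loads from PostgreSQL
// and populates the cache with a 60s TTL.
export async function getSaleConfig(
  cache: Cache,
  loadFromDb: (id: string) => Promise<object>,
  saleId: string,
): Promise<object> {
  const key = `sale:config:${saleId}`; // key name assumed
  const hit = await cache.get(key);
  if (hit !== null) return JSON.parse(hit);
  const fresh = await loadFromDb(saleId);
  await cache.set(key, JSON.stringify(fresh), 60); // TTL 60s per the text
  return fresh;
}

// Admin update path: delete the key so the next read repopulates it.
export async function invalidateSaleConfig(cache: Cache, saleId: string): Promise<void> {
  await cache.del(`sale:config:${saleId}`);
}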

Inventory Count — No TTL

Redis is the source of truth for inventory, not a cache. Updated atomically via Lua script on every purchase.

User Session Tokens — TTL matches JWT

Refresh tokens stored in Redis with TTL equal to JWT expiry. Enables instant token revocation.
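The revocation property follows from storing the token server-side: a sketch with an assumed key layout, where revoking is a single DEL and expiry is enforced by Redis TTL.

```typescript
// TTL matching the 7-day refresh-token lifetime stated above:
export const REFRESH_TTL_SECONDS = 7 * 24 * 60 * 60;

// Redis key per user and token id (layout assumed; jti = JWT's token id claim):
export function refreshTokenKey(userId: string, tokenId: string): string {
  return `refresh:${userId}:${tokenId}`;
}

// With an ioredis-style client (assumed):
//   store:  await redis.set(refreshTokenKey(userId, jti), "1", "EX", REFRESH_TTL_SECONDS);
//   verify: (await redis.exists(refreshTokenKey(userId, jti))) === 1
//   revoke: await redis.del(refreshTokenKey(userId, jti));  // instant revocation
```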

NOT cached: Order history

Always fetched fresh from PostgreSQL. Consistency is critical — stale order data is unacceptable.

📬Async & Queue Architecture (BullMQ / Kafka)
Purchase Request
  → Redis Lua (atomic inventory check)
      ↓ success
  → BullMQ: add job to 'orders' queue
      ↓
  → Worker: validate → write to PostgreSQL
      ↓
  → Kafka: publish 'order.confirmed'
      ↓
  ├── Email Service (consumer group 1)
  ├── Analytics Service (consumer group 2)
  └── Inventory Sync (consumer group 3)

Failed jobs → retry with exponential backoff (3 attempts)
Dead letter → manual review queue
🔴Real-time Layer (Socket.IO)
  • sale:started → countdown timer begins for all connected clients simultaneously
  • inventory:updated → live stock count pushed after every successful purchase
  • order:confirmed → buyer's UI updates instantly without polling
  • sale:ended → all clients notified, purchase button disabled automatically
  • Scaling: Redis Adapter syncs Socket.IO events across multiple ECS Fargate nodes
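The event names above can be pinned down in one place; the payload shapes and adapter wiring shown here are assumptions, only the event names come from the list.

```typescript
// Event names from the list above, centralized as constants:
export const SaleEvents = {
  started: "sale:started",
  inventoryUpdated: "inventory:updated",
  orderConfirmed: "order:confirmed",
  ended: "sale:ended",
} as const;

// Assumed payload for the live stock push after each successful purchase:
export function inventoryUpdatePayload(saleId: string, remaining: number) {
  return { saleId, remaining, at: Date.now() };
}

// Server wiring with @socket.io/redis-adapter (assumed setup):
//   const io = new Server(httpServer);
//   io.adapter(createAdapter(pubClient, subClient)); // syncs emits across ECS tasks
//   io.emit(SaleEvents.inventoryUpdated, inventoryUpdatePayload(saleId, remaining));
```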
🚀DevOps & CI/CD
Local Development:
  Upstash Redis (TLS) + Supabase PostgreSQL
  No Docker locally — cloud-hosted services only

CI/CD Pipeline (GitHub Actions):
  Push to main
    → ESLint + TypeScript type checks
    → Jest unit tests (must pass)
    → Docker image build
    → Push to AWS ECR
    → Rolling deploy to ECS Fargate

Infrastructure:
  ECS Fargate     — no server management
  Upstash Redis   — serverless Redis via TLS
  Supabase        — managed PostgreSQL
  Secrets Manager — env vars injected at runtime
  CloudWatch      — logs and container metrics
📄API Reference (Swagger)
Swagger API Reference
🧪Unit Test Coverage
74% Overall Coverage
Test Coverage Report
🔁What I'd Do Differently
  • Would implement circuit breakers (opossum) earlier — learned this after a Kafka consumer outage cascaded
  • Would add k6 load testing to the CI pipeline from day one instead of running manually
  • Would use connection pooling (pgBouncer) in front of PostgreSQL instead of relying solely on pg-pool
  • Would design the DB schema with multi-tenancy in mind from the start
💡Key Learnings
  • Lua scripting in Redis for atomic operations is the correct solution for high-concurrency inventory — not DB locks
  • BullMQ's stalled job checker causes Upstash command limit issues — fix with stalledInterval: 60000
  • Kafka consumer groups give you independent scaling and failure isolation that direct HTTP calls cannot
  • ECS Fargate rolling deploys require health check grace periods to avoid premature task termination
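The Upstash fix from the second learning above, expressed as BullMQ worker options (the stalledInterval value is from the text; concurrency is an assumed value):

```typescript
// stalledInterval: 60000 makes the stalled-job check run once a minute instead of
// the 30s default, cutting the background Redis commands that hit Upstash limits.
export const orderWorkerOptions = {
  stalledInterval: 60_000, // value from the learning above
  concurrency: 5,          // assumed worker concurrency
};

// Applied when constructing the worker (assumed setup):
//   new Worker("orders", processOrder, { connection, ...orderWorkerOptions });
```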