
Designing Robust, Scalable & Stable E-Commerce Systems

Architecture · E-Commerce · Redis · Varnish · CDN · Galera · Grafana · RabbitMQ · OpenTelemetry

E-commerce systems face unique challenges — unpredictable traffic spikes, complex inventory logic, payment reliability, and the absolute requirement of zero data loss. This case study covers the architectural patterns that make high-availability e-commerce work, from cache invalidation strategy to queue-based order processing, Redis topology design, and why load testing must be a first-class engineering practice, not an afterthought.

Cache Invalidation Strategy

The old joke about cache invalidation being one of the two hard problems in computer science is particularly relevant in e-commerce. Full-page caching (via Varnish or Redis FPC) can make a storefront respond in under 50ms, but it's only as good as its purge logic. A cache that serves stale data is worse than no cache at all — it's a cache that lies.

Tag-based cache invalidation is the architectural pattern that makes caching viable for dynamic e-commerce content. Every cached page is tagged with the entities it depends on: product IDs, category IDs, CMS block identifiers, and price rule IDs. When a product price changes, only pages tagged with that specific product ID get purged — not the entire cache. When a category assignment changes, pages tagged with that category get purged.
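The mechanics can be sketched with a minimal in-memory model. Production systems keep the tag-to-page index in Redis sets; every name below is illustrative, not a specific cache library's API.

```python
class TagCache:
    """Minimal in-memory sketch of tag-based page-cache invalidation."""

    def __init__(self):
        self.pages = {}      # cache_key -> rendered page
        self.tag_index = {}  # entity tag -> set of cache_keys

    def store(self, key, html, tags):
        self.pages[key] = html
        for tag in tags:
            self.tag_index.setdefault(tag, set()).add(key)

    def get(self, key):
        return self.pages.get(key)

    def purge_tag(self, tag):
        # Purge only the pages that depend on this entity, not the whole cache.
        for key in self.tag_index.pop(tag, set()):
            self.pages.pop(key, None)


cache = TagCache()
cache.store("category/shoes", "<listing html>", tags=["cat_12", "prod_501", "prod_502"])
cache.store("product/501", "<product html>", tags=["prod_501"])
cache.store("product/502", "<product html>", tags=["prod_502"])

cache.purge_tag("prod_501")           # price change on product 501
print(cache.get("product/501"))       # None: purged
print(cache.get("category/shoes"))    # None: the listing depended on prod_501 too
print(cache.get("product/502"))       # still cached: untouched by the purge
```

Note that the category listing is purged along with the product page, because both carry the `prod_501` tag — which is exactly the property that makes caching safe for dynamic content.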

This sounds simple in theory, but the implementation details matter enormously. A flash sale that changes prices on 500 products needs to purge thousands of pages (category listings, search results, product pages, cross-sell blocks) within seconds. If the purge process takes 30 seconds, you have a 30-second window where customers see incorrect prices — which can mean selling at a loss or violating consumer protection laws in jurisdictions with strict pricing accuracy requirements.

Our implementation uses a two-level caching architecture. Varnish handles the hot cache layer with sub-millisecond response times. Behind Varnish, a Redis FPC layer acts as a warm cache that survives Varnish restarts and provides cache entries to new Varnish nodes in a horizontally scaled setup. Both layers share the same tag-based invalidation system, so a single purge command clears the entry from both layers simultaneously.
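The layering itself reduces to a small sketch: two dicts stand in for the hot (Varnish) and warm (Redis FPC) layers, and a warm hit re-seeds the hot layer — which is how a restarted or newly added Varnish node warms up. In reality the purge goes over HTTP PURGE and Redis DEL; the class below is a stand-in.

```python
class TwoLevelCache:
    """Sketch of a Varnish-over-Redis two-level page cache."""

    def __init__(self):
        self.hot = {}    # Varnish stand-in: fastest, but volatile across restarts
        self.warm = {}   # Redis FPC stand-in: survives hot-layer restarts

    def store(self, key, page):
        self.hot[key] = page
        self.warm[key] = page

    def get(self, key):
        if key in self.hot:
            return self.hot[key]
        if key in self.warm:              # warm hit: re-seed the hot layer
            self.hot[key] = self.warm[key]
            return self.hot[key]
        return None

    def purge(self, key):
        # A single purge command clears both layers, keeping them consistent.
        self.hot.pop(key, None)
        self.warm.pop(key, None)


cache = TwoLevelCache()
cache.store("/home", "<home html>")
cache.hot.clear()                 # simulate a Varnish restart / fresh node
print(cache.get("/home"))         # <home html> — served and re-seeded from warm
print("/home" in cache.hot)       # True
```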

CDN caching adds a third layer for static assets (images, CSS, JavaScript) with aggressive Cache-Control headers. But we deliberately keep the CDN out of the dynamic page caching — CDN cache invalidation is slow (often 5–15 seconds globally) and imprecise. For price-sensitive content, 15 seconds of stale data is unacceptable.

Queue-Based Order Processing

Checkout is the single most business-critical user flow in any e-commerce system. Every millisecond of latency and every percentage point of failure rate directly translates to lost revenue. Yet the naive implementation of checkout — synchronously calling inventory, payment gateway, ERP, email service, and analytics in a single request — creates a system where any downstream failure takes down the entire checkout.

The architectural fix is queue-based decoupling. The checkout endpoint's responsibility is reduced to three atomic operations: validate the cart, process payment (the one operation that must be synchronous because the customer is waiting), and write the order record to the database. Everything else — inventory decrement, ERP notification, confirmation email, analytics events, loyalty points — goes into a message queue (RabbitMQ or SQS) for asynchronous processing.
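A hedged sketch of this split, where `queue.Queue` stands in for RabbitMQ/SQS and `process_payment` / `save_order` are stubs (none of these names come from the production codebase):

```python
import queue

order_events = queue.Queue()  # stand-in for the RabbitMQ/SQS message broker

def process_payment(cart):
    return {"status": "captured", "txn_id": "txn-001"}

def save_order(cart, payment):
    return {"order_id": "ORD-1001", "items": cart["items"], "payment": payment}

def checkout(cart):
    """Synchronous path: validate, pay, persist. Everything else is a message."""
    if not cart["items"]:
        raise ValueError("empty cart")
    payment = process_payment(cart)   # the one sync call: the customer is waiting
    order = save_order(cart, payment)
    # Asynchronous side effects fan out as messages, not inline calls.
    for task in ("decrement_inventory", "notify_erp", "send_confirmation_email",
                 "emit_analytics", "award_loyalty_points"):
        order_events.put({"task": task, "order_id": order["order_id"]})
    return order

order = checkout({"items": [{"sku": "SKU-1", "qty": 1}]})
print(order["order_id"], order_events.qsize())   # ORD-1001 5
```

The customer's request returns as soon as the order record is written; the five queued tasks are drained by independent consumers that can fail, retry, or lag without touching the checkout path.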

This decoupling has profound reliability implications. If the ERP system is down for maintenance, checkout still works — ERP messages queue up and process when the system recovers. If the email service is overloaded, customers still get their order confirmation, just 30 seconds later instead of immediately. The only synchronous dependency is the payment gateway, and even that has circuit breaker protection with automatic failover to a backup payment processor.

The order processing queue consumers are designed with idempotency built in. Every message carries an idempotency key (typically the order ID), and every consumer checks whether it has already processed that key before executing. This means if a consumer crashes mid-processing and the message gets redelivered, the operation doesn't execute twice — no duplicate inventory decrements, no duplicate ERP entries, no duplicate emails.
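An illustrative consumer wrapper (in production the processed-key check lives in a Redis set or a unique-key database constraint, not a Python set):

```python
processed = set()   # production: Redis set or a unique DB constraint

def handle_once(message, handler):
    """Skip messages whose idempotency key was already processed.
    The handler should still be safe to re-run: a crash between handler()
    and processed.add() causes exactly one redelivery."""
    key = (message["task"], message["order_id"])
    if key in processed:
        return "skipped"
    handler(message)
    processed.add(key)
    return "processed"

emails_sent = []

def send_confirmation(message):
    emails_sent.append(message["order_id"])

message = {"task": "send_confirmation_email", "order_id": "ORD-1001"}
print(handle_once(message, send_confirmation))   # processed
print(handle_once(message, send_confirmation))   # skipped — redelivery is a no-op
print(emails_sent)                               # ['ORD-1001']
```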

For inventory specifically, we use optimistic locking with compensating transactions. The cart validation checks inventory availability, but the actual decrement happens asynchronously. If two concurrent orders claim the last unit, the second consumer detects the conflict (optimistic lock failure) and triggers a compensating transaction — refunding the payment and notifying the customer that the item is no longer available. This is rare in practice (we see it on <0.01% of orders) and far preferable to the alternative of holding database locks during the entire checkout flow.
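The optimistic check is a compare-and-set on a version column. A sketch under an assumed row-versioned schema (the real table layout may differ):

```python
inventory = {"SKU-9": {"qty": 1, "version": 7}}

def try_decrement(sku, expected_version):
    """Compare-and-set decrement. In SQL this is roughly:
       UPDATE inventory SET qty = qty - 1, version = version + 1
       WHERE sku = ? AND version = ? AND qty > 0
    (assumes a version column — an illustrative schema)."""
    row = inventory[sku]
    if row["version"] != expected_version or row["qty"] < 1:
        return False                       # optimistic lock failure
    row["qty"] -= 1
    row["version"] += 1
    return True

# Two consumers read the same snapshot (version 7), then race for the last unit:
snapshot_version = inventory["SKU-9"]["version"]
print(try_decrement("SKU-9", snapshot_version))  # True: first order wins
print(try_decrement("SKU-9", snapshot_version))  # False: second order goes down
                                                 # the refund + notification path
```

The losing consumer never holds a lock — it simply observes that the world moved on and compensates.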

Redis Topology & Data Integrity

Redis is the Swiss Army knife of e-commerce architecture — session storage, full-page cache, object cache, rate limiting, queue backend. But using a single Redis instance for all of these is an architectural anti-pattern that creates both resource contention and a single point of failure.

The fundamental problem is eviction policy conflicts. Session data requires the noeviction policy — if Redis runs out of memory, it should reject new writes rather than evict an active user's session, which would log them out mid-checkout. Cache data requires allkeys-lru (Least Recently Used) eviction — when memory is full, evict the least-used cache entries to make room for new ones. These two policies are mutually exclusive on a single Redis instance.

Our architecture uses three separate Redis clusters (or ElastiCache replication groups on AWS): one for sessions with noeviction and sufficient memory to hold all active sessions, one for full-page cache with allkeys-lru and aggressive TTLs, and one for object cache with volatile-lru (only evict keys with an expiry set). Each cluster is sized independently based on its specific workload pattern.
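The three policies map directly onto per-cluster redis.conf settings (the memory sizes below are illustrative, not the production values):

```ini
# sessions cluster: reject writes rather than evict a live session
maxmemory 8gb
maxmemory-policy noeviction

# full-page cache cluster: under pressure, evict least-recently-used pages
maxmemory 16gb
maxmemory-policy allkeys-lru

# object cache cluster: only evict keys that carry a TTL
maxmemory 8gb
maxmemory-policy volatile-lru
```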

This separation also provides failure isolation. If the FPC Redis cluster experiences a memory spike during a traffic surge, it evicts cache entries — which means more requests hit PHP, temporarily slowing down the site. But it doesn't touch session data, so no customers get logged out. If the session Redis cluster goes down entirely, customers lose their sessions (bad) but the site still serves cached pages (the FPC cluster is independent), so anonymous browsing continues.

For payment processing, we implement the saga pattern with idempotency keys. Each payment operation has a unique key, and the payment gateway is called with this key. If the request times out or fails ambiguously, we can safely retry with the same key — the gateway guarantees that the payment is only captured once. This eliminates the dreaded 'did the payment go through?' ambiguity that plagues naive payment implementations.
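The retry-safety this buys can be shown with a gateway stub that deduplicates on the key (Stripe-style idempotency semantics; the class and method names are illustrative):

```python
import uuid

class GatewayStub:
    """Stand-in for a payment gateway that deduplicates on an idempotency key."""

    def __init__(self):
        self.captures = {}

    def capture(self, idempotency_key, amount_cents):
        if idempotency_key in self.captures:
            # Replay the original result instead of charging again.
            return self.captures[idempotency_key]
        result = {"charge_id": f"ch_{len(self.captures) + 1}",
                  "amount": amount_cents}
        self.captures[idempotency_key] = result
        return result

gateway = GatewayStub()
key = str(uuid.uuid4())  # generated once per payment attempt, stored with the order

first = gateway.capture(key, 4999)
retry = gateway.capture(key, 4999)   # ambiguous timeout? retrying is safe
print(first == retry, len(gateway.captures))   # True 1
```

The key must be generated once and persisted before the first gateway call — regenerating it on retry would defeat the deduplication.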

Event sourcing for order state provides a complete audit trail. Instead of updating an order status field in place (order.status = 'shipped'), we append an event (OrderShipped { order_id, tracking_number, timestamp }). The current order state is derived by replaying all events. This means we can answer questions like 'what was the order state at 3:47 PM on Tuesday?' — which is invaluable for debugging customer complaints and financial reconciliation.
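A compact sketch of the replay (event and field names are illustrative):

```python
events = [  # append-only log; never updated in place
    {"type": "OrderPlaced",  "order_id": "ORD-7", "ts": "2024-05-07T09:12:00"},
    {"type": "OrderPaid",    "order_id": "ORD-7", "ts": "2024-05-07T09:12:03"},
    {"type": "OrderShipped", "order_id": "ORD-7", "ts": "2024-05-07T15:47:10",
     "tracking_number": "TRK123"},
]

STATUS = {"OrderPlaced": "placed", "OrderPaid": "paid", "OrderShipped": "shipped"}

def state_at(log, as_of):
    """Derive order state by replaying events up to a point in time."""
    state = {}
    for event in sorted(log, key=lambda e: e["ts"]):
        if event["ts"] > as_of:
            break
        state["status"] = STATUS[event["type"]]
        if "tracking_number" in event:
            state["tracking_number"] = event["tracking_number"]
    return state

print(state_at(events, "2024-05-07T12:00:00"))            # {'status': 'paid'}
print(state_at(events, "2024-05-07T23:59:59")["status"])  # shipped
```

The `as_of` parameter is what answers the "what was the state at 3:47 PM on Tuesday?" question: the log is the source of truth, and any historical state is one replay away.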

Observability & Load Testing

Observability in e-commerce isn't just about infrastructure metrics (CPU, memory, disk) — it's about business metrics that directly correlate to revenue. Our dashboards combine both layers, and alerting thresholds are calibrated to business impact.

Technical metrics include request latency percentiles (P50, P95, P99), error rates by endpoint, cache hit rates, database query duration, and queue depth. Business metrics include conversion rate (orders/sessions), cart abandonment rate, payment success rate, average order value, and revenue per minute. When the payment success rate drops below 99%, an alert fires regardless of whether any technical metric looks unhealthy — because a payment gateway degradation can cost thousands of euros per hour even while all infrastructure appears green.
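The business-level rule is deliberately simple — a sketch of the success-rate check, not the production alerting configuration:

```python
def payment_success_alert(captured, attempted, threshold=0.99):
    """Fire on the payment-success SLO alone, regardless of whether any
    infrastructure metric looks unhealthy. The 99% threshold mirrors the
    text; the function itself is illustrative."""
    rate = captured / attempted if attempted else 1.0
    return ("ALERT" if rate < threshold else "OK", rate)

print(payment_success_alert(995, 1000))   # ('OK', 0.995)
print(payment_success_alert(985, 1000))   # ('ALERT', 0.985)
```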

Distributed tracing via OpenTelemetry spans the entire request lifecycle: from the load balancer through Varnish, Nginx, PHP-FPM, MySQL, Redis, and external API calls. When a customer reports a slow checkout, we can pull up the exact trace for their order and see that the ERP API call took 4.2 seconds — pinpointing the bottleneck without guesswork.

Synthetic monitoring runs a simulated checkout flow every 60 seconds: browse category, view product, add to cart, proceed to checkout, fill in shipping, verify totals. This catches issues that passive monitoring misses — like a JavaScript error that breaks the add-to-cart button on mobile, which doesn't show up in server-side metrics at all.
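The probe's skeleton is a step runner that records pass/fail and duration per step. The steps below are stubs — a real synthetic check drives a headless browser against the storefront:

```python
import time

def run_flow(steps):
    """Run each checkout step, recording pass/fail and duration."""
    results = []
    for name, step in steps:
        start = time.perf_counter()
        try:
            step()
            ok = True
        except Exception:
            ok = False
        results.append((name, ok, time.perf_counter() - start))
    return results

def add_to_cart():
    # The kind of client-side failure that server metrics never see.
    raise RuntimeError("broken add-to-cart button")

results = run_flow([
    ("browse_category", lambda: None),
    ("view_product",    lambda: None),
    ("add_to_cart",     add_to_cart),
])
for name, ok, duration in results:
    print(name, "OK" if ok else "FAIL")
```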

Load testing is treated as a first-class engineering practice, not a pre-launch checkbox. We run load tests before every major sale event (Black Friday, seasonal promotions) with traffic patterns modeled on historical data. The test validates not just peak throughput but also sustained load (can the system handle 4 hours of elevated traffic without degradation?), spike response (how quickly does auto-scaling react to a sudden 5x traffic increase?), and degradation behavior (what happens when we exceed designed capacity — does the system degrade gracefully or fall off a cliff?).

Database clustering with Galera ensures MySQL availability during these stress tests. The Galera cluster provides synchronous multi-master replication, meaning any node can handle reads and writes. If a node fails, the cluster continues operating without manual intervention. Load test results validate that Galera's synchronous replication overhead doesn't become a bottleneck at peak write volumes during flash sales.

Results

  • Zero downtime during 10x traffic spikes (Black Friday)
  • Sub-second page loads at P95 under peak load
  • 99.97% payment success rate with automatic failover
  • Cache hit rate maintained above 95% during flash sales
  • Complete audit trail via event sourcing for all order state changes
  • <0.01% inventory conflict rate with optimistic locking

Want to discuss a similar challenge? Get in touch →