
Magento 2 on Kubernetes

Magento 2 · Kubernetes · AWS EKS · Docker · Helm · Karpenter · Terraform · PHP-FPM · Argo Rollouts · Varnish

Running Magento 2 — a notoriously resource-heavy PHP application — inside Kubernetes requires careful orchestration. This case study walks through the complete journey from traditional LAMP deployments to a fully containerized, auto-scaling production environment on AWS EKS, covering the hard-won lessons around stateless design, PHP-FPM tuning, and deployment safety.

Stateless Container Design

Magento's default assumption of a shared filesystem — for media uploads, generated code, and the var/ directory — breaks fundamentally in multi-pod environments. When two pods try to write to the same var/cache directory or serve media from a local disk that only one pod has access to, you get inconsistent behavior, race conditions, and mysterious 404s on images.

The solution requires rethinking how Magento interacts with storage at every level. We externalized pub/media to S3-compatible object storage with a CloudFront CDN in front, moved var/cache and var/page_cache to Redis backends, and configured var/session to use Redis as well. The generated/ directory — containing compiled dependency injection, interceptors, and proxy classes — gets baked into the Docker image at build time rather than generated at runtime.
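Concretely, this storage split is expressed in Magento's `app/etc/env.php`. The fragment below is an illustrative sketch — the Redis host names, database numbers, bucket, and region are assumptions, not the production values:

```php
<?php
// app/etc/env.php (fragment) — cache, page cache, and sessions in Redis;
// pub/media in S3 via Magento's remote storage driver (2.4.2+).
return [
    'cache' => [
        'frontend' => [
            'default' => [
                'backend' => 'Cm_Cache_Backend_Redis',
                'backend_options' => [
                    'server'   => 'redis-cache.magento.svc.cluster.local',
                    'port'     => '6379',
                    'database' => '0',
                ],
            ],
            'page_cache' => [
                'backend' => 'Cm_Cache_Backend_Redis',
                'backend_options' => [
                    'server'   => 'redis-cache.magento.svc.cluster.local',
                    'port'     => '6379',
                    'database' => '1',
                ],
            ],
        ],
    ],
    'session' => [
        'save'  => 'redis',
        'redis' => [
            'host'     => 'redis-session.magento.svc.cluster.local',
            'port'     => '6379',
            'database' => '2',
        ],
    ],
    'remote_storage' => [
        'driver' => 'aws-s3',
        'config' => [
            'bucket' => 'example-magento-media',
            'region' => 'eu-central-1',
        ],
    ],
];
```

Note that cache and session point at separate Redis endpoints, which matches the durability split described later: session data is worth persisting, cache data is not.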

The build pipeline itself became a critical piece of the architecture. Multi-stage Docker images separate the build environment (where Composer installs dependencies, DI compiles, and static content deploys) from the lean runtime image (PHP-FPM + Nginx + compiled artifacts). This means every container that starts up is identical and ready to serve traffic immediately — no runtime compilation, no filesystem dependencies, no shared state between pods.
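A minimal sketch of that multi-stage build is shown below. It assumes a base image with Composer and the required PHP extensions already installed, and omits the Nginx image, which is built the same way from the compiled `pub/static` artifacts:

```dockerfile
# --- Build stage: dependencies, DI compilation, static content ---
FROM php:8.2-fpm AS build
WORKDIR /var/www/html

# Install dependencies first so this layer caches across code changes
COPY composer.json composer.lock ./
RUN composer install --no-dev --optimize-autoloader --no-scripts

# Copy the application, then bake generated/ and pub/static into the image
COPY . .
RUN bin/magento setup:di:compile \
 && bin/magento setup:static-content:deploy -f

# --- Runtime stage: PHP-FPM plus compiled artifacts only ---
FROM php:8.2-fpm AS runtime
WORKDIR /var/www/html
COPY --from=build /var/www/html ./
```

Because `generated/` and `pub/static` ship inside the image, a new pod never compiles anything at startup — exactly the property the readiness section below depends on.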

PHP-FPM Pool Tuning Per Workload

One of the most impactful discoveries was that running a single PHP-FPM pool for both frontend customer traffic and backend admin/import operations causes resource starvation under load. A single large product import can consume all available FPM children, leaving zero capacity for customer-facing requests. This is invisible during normal operations but catastrophic during peak traffic combined with routine admin tasks.

We split the workload into separate Kubernetes Deployments, each with distinct resource limits and FPM configurations. Frontend pods run with pm.max_children tuned for short-lived, memory-bounded requests (typically 10–15 children per pod with 128MB per child). Admin and cron pods run with fewer children but significantly higher memory limits and longer request_terminate_timeout values to accommodate bulk operations.

The Horizontal Pod Autoscaler scales frontend pods based on custom Prometheus metrics — specifically P95 request latency and the ratio of active FPM children to maximum children. This means scaling decisions are driven by actual application pressure, not just CPU utilization, which often lags behind the real bottleneck in PHP applications.

This separation also improved deployment safety. We can deploy new code to admin pods first as a canary, verify index and import operations work correctly, then roll out to frontend pods — all without risking customer-facing availability during the process.

Kubernetes Architecture on AWS EKS

The production cluster runs on AWS EKS with Karpenter handling node provisioning. Karpenter replaced the traditional Cluster Autoscaler because of its ability to provision right-sized nodes within seconds rather than minutes, and its support for mixed instance types and spot instances. For Magento workloads, we defined Karpenter provisioners that prefer compute-optimized instances (c6i family) for frontend pods and memory-optimized instances (r6i family) for admin/cron pods.

Varnish runs as a sidecar container alongside Nginx in the frontend pods, providing full-page cache with tag-based invalidation. This architecture means each pod is self-contained — Nginx receives the request, checks Varnish, and only hits PHP-FPM on cache misses. Cache hit rates typically exceed 90% for anonymous browsing, which means 9 out of 10 requests never touch PHP at all.
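The sidecar arrangement can be sketched as a pod template fragment. Image names and port numbers are illustrative; the request path is Nginx (public port) → Varnish → Nginx (backend port) → PHP-FPM on cache misses:

```yaml
# Frontend Deployment pod template (abridged)
spec:
  containers:
    - name: nginx
      image: registry.example.com/magento-nginx:latest   # illustrative
      ports:
        - containerPort: 8080   # public entry; proxies to Varnish
        - containerPort: 8081   # cache-miss backend for Varnish
    - name: varnish
      image: varnish:7.4
      args: ["-a", ":6081", "-b", "localhost:8081"]
    - name: php-fpm
      image: registry.example.com/magento-php:latest     # illustrative
      ports:
        - containerPort: 9000
```

Keeping Varnish in the pod (rather than as a shared tier) means the cache scales with the frontend replicas and a pod's cache dies with the pod — acceptable because tag-based invalidation repopulates it quickly.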

Redis is deployed as a StatefulSet with persistent volumes for session storage (where data loss means logged-in customers get logged out) and as a regular Deployment for cache storage (where data loss just means a temporary performance dip). IAM Roles for Service Accounts (IRSA) provide fine-grained AWS permissions — the application pods can access S3 for media storage without storing any AWS credentials in the cluster.
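The IRSA wiring reduces to a single annotation on the application's ServiceAccount, binding it to an IAM role whose policy grants only the S3 media bucket. The role ARN below is a placeholder:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: magento-app
  namespace: magento
  annotations:
    # Role ARN is illustrative; the role trusts the cluster's OIDC provider
    # and carries an S3-only policy scoped to the media bucket.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/magento-media-s3
```

Pods running under this ServiceAccount receive short-lived credentials via the projected web identity token, so no access keys ever appear in Secrets or environment variables.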

Networking uses AWS Load Balancer Controller with Network Load Balancers for TCP-level load balancing and AWS WAF integration for security. Internal service communication uses Kubernetes ClusterIP services with proper network policies restricting which pods can talk to which services.

Zero-Downtime Deployments & Readiness

Zero-downtime deployments for Magento require more than just setting strategy: RollingUpdate in your Deployment spec. The critical challenge is Magento's slow boot time — a freshly started PHP-FPM pod takes 10–20 seconds before it can reliably serve traffic. During this window, OPcache is cold (every class file triggers a disk read and compilation), Redis connection pools haven't been established, and the DI container hasn't been fully warmed.

A naive HTTP readiness probe that checks the / endpoint will return 200 as soon as PHP-FPM starts accepting connections, even though the first several requests will take 3–5 seconds each as OPcache warms up. If Kubernetes routes traffic to these cold pods during a rolling deployment, users experience cascading slow responses that look like the site is down.

We implemented a custom health endpoint that verifies three conditions before reporting ready: OPcache has loaded at least 80% of the expected number of cached scripts (tracked via opcache_get_status()), Redis connections to all configured backends are established, and a lightweight database query succeeds. The readiness probe has an initial delay of 15 seconds with a 5-second check interval.
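Wired into the Deployment, the probe looks like the sketch below. The endpoint path is an assumption standing in for the custom health route; the timing values are the ones stated above:

```yaml
# Frontend container spec (fragment)
readinessProbe:
  httpGet:
    path: /health/ready   # custom endpoint path is an assumption
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 5
  failureThreshold: 3
```

Only the readiness probe checks warm-up conditions; a liveness probe this strict would restart pods that are merely cold, so liveness stays a cheap process-level check.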

For additional deployment safety, we use Argo Rollouts with a canary strategy. New versions receive 10% of traffic for 5 minutes while Prometheus metrics are monitored for error rate increases or latency regressions. If metrics stay healthy, traffic shifts to 50% for another 5 minutes, then 100%. If any metric breaches its threshold, the rollout automatically reverts to the previous version within seconds — no human intervention required.
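The canary sequence described maps directly onto an Argo Rollouts spec. This is a sketch: the AnalysisTemplate name is an assumption, and it is the template's Prometheus queries that encode the error-rate and latency thresholds:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: magento-frontend
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-and-latency   # assumed name
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100
```

If the analysis run fails at any step, the Rollout aborts and shifts all traffic back to the stable ReplicaSet, which is the automatic revert behavior described above.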

Results

  • 99.99% uptime through auto-scaling and failure domain isolation
  • Deploy time reduced from 30 minutes to 90 seconds
  • 60% infrastructure cost reduction via right-sizing and Karpenter spot instances
  • Zero-downtime deployments with automatic rollback on metric regression
  • Eliminated cascading 500s during rolling updates via custom readiness probes
  • 90%+ Varnish cache hit rate for anonymous traffic

Want to discuss a similar challenge? Get in touch →