Online Education Platform
An EdTech startup serving over 5,000 active students was experiencing weekly outages during scheduled live classes — the worst possible time for a platform failure. Alongside the instability, new course content was taking two to three weeks to reach students due to a broken release process. In 14 weeks, we eliminated the outages, rebuilt the delivery pipeline, and gave the team a platform they could rely on.
Client name and identifying details withheld at their request. References available during consultation.
!The Challenge
This EdTech platform had found strong product-market fit — student enrolment had grown 3× over 18 months and live class attendance was consistently high. But the infrastructure had not scaled with the product. The platform was running on a single-server monolithic deployment with no redundancy, no environment separation, and no automated recovery mechanisms.
Live classes ran at scheduled times — 8am, 12pm, and 7pm — when hundreds of concurrent students would log in simultaneously. These traffic spikes were entirely predictable, yet the platform would frequently become unresponsive at exactly these moments. The root causes were multiple: a memory leak in the video streaming component, database connection pool exhaustion under concurrent load, and no health check or auto-restart mechanism to recover from failures without manual intervention.
The deployment process made things worse. Engineers deployed directly from their laptops to the production server — a process with no testing gate, no staging environment, and no rollback capability. Changes that introduced regressions went live immediately and impacted all students. The team had started a practice of freezing deployments during live class hours, which meant that urgent bug fixes had to wait hours to be released — and features that students needed were routinely delayed by 2–3 weeks just to avoid the risk of a bad deployment.
The business impact was severe. Refund requests were rising. Student reviews were mentioning unreliability. Churn was increasing at renewal time. With a new enrolment cycle approaching and investor pressure to demonstrate platform maturity, the team needed a fundamental fix — not another patch.
⇄Before vs After
⚙Tech Stack
→What We Did
Root Cause Analysis & Stabilisation
Before touching any infrastructure, we spent the first two weeks on thorough root cause analysis. We instrumented the application to capture memory usage, database connection counts, and request latency over time. This revealed three distinct failure modes: a memory leak in the WebSocket connection handler for live classes, database connection pool exhaustion under concurrent login load at class start times, and the absence of a process supervisor, which meant every crash required a manual SSH session to restart the service. We fixed all three before making any infrastructure changes, stabilising the existing system immediately while the larger migration got underway.
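As a sketch of the kind of instrumentation this involved (a minimal illustration, not the client's actual code — the sampler fields and the leak heuristic's window are assumptions), a lightweight sampler plus a simple leak detector might look like:

```python
import resource
import time


def snapshot(open_db_connections: int, inflight_requests: int) -> dict:
    """Capture one instrumentation sample: memory, connections, load.

    The connection and request counts are assumed to come from the
    application's own bookkeeping (e.g. its pool object); they are
    passed in so the sampler stays framework-agnostic.
    """
    return {
        "ts": time.time(),
        # ru_maxrss is peak resident set size: KiB on Linux, bytes on macOS.
        "rss_peak": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
        "db_connections": open_db_connections,
        "inflight_requests": inflight_requests,
    }


def detect_leak(samples: list[dict], window: int = 5) -> bool:
    """Flag a suspected leak: peak RSS strictly rising across the last
    `window` samples while request load is flat or falling."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    rss_rising = all(a["rss_peak"] < b["rss_peak"]
                     for a, b in zip(recent, recent[1:]))
    load_flat = recent[-1]["inflight_requests"] <= recent[0]["inflight_requests"]
    return rss_rising and load_flat
```

Memory growth under flat load is the classic leak signature; correlating it with connection counts is what separates a leak from ordinary load growth.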
Environment Separation & CI/CD Pipeline
We created fully isolated dev, staging, and production environments using Terraform. The staging environment is a scaled-down mirror of production — same architecture, same configuration, just smaller. A GitHub Actions pipeline was built to run all tests automatically on every pull request, deploy to staging on merge to main, and require a manual approval gate before promoting to production. Deployments include automated health checks with rollback triggered if health checks fail within 3 minutes of deploy.
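The promote-or-rollback decision after a deploy can be reduced to a small gate: poll the health endpoint for the rollback window and bail on the first failure. The sketch below (illustrative only — the real gate lives in the pipeline, and the probe, window, and interval are assumptions) injects the clock and sleep so the logic is testable:

```python
import time
from typing import Callable


def gate_deployment(
    probe: Callable[[], bool],
    window_seconds: float = 180.0,   # the 3-minute rollback window
    interval_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a health probe for `window_seconds` after a deploy.

    Returns "promote" if every probe in the window succeeds, or
    "rollback" on the first failure. `probe` is expected to hit the
    service's health endpoint and return True when healthy.
    """
    deadline = clock() + window_seconds
    while clock() < deadline:
        if not probe():
            return "rollback"
        sleep(interval_seconds)
    return "promote"
```

Keeping the decision in one pure-ish function means the same logic can run in CI against staging and in the production promotion step.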
Content Publishing Pipeline
Course content uploads — video recordings, PDFs, quizzes — previously required a developer to manually move files and update database records. We automated the entire pipeline: content uploaded by educators triggers a Lambda function that processes and transcodes it (via AWS MediaConvert for video), uploads the results to S3, invalidates the CloudFront cache, and updates the database. What previously took 2–3 weeks now completes in under 2 hours, with no engineering involvement.
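The shape of such a Lambda handler is sketched below. Everything here is illustrative — the bucket name, key layout, and helper are hypothetical, and the MediaConvert and CloudFront calls are indicated as comments because their job settings are deployment-specific:

```python
import os
import urllib.parse

# Hypothetical output bucket; the real one would come from the environment.
OUTPUT_BUCKET = os.environ.get("OUTPUT_BUCKET", "course-content-processed")


def output_prefix_for(source_key: str) -> str:
    """Map an educator upload key to its processed-output prefix,
    e.g. 'uploads/algebra/week3.mp4' -> 'courses/algebra/week3/'."""
    path = source_key.split("/", 1)[1] if "/" in source_key else source_key
    stem, _, _ = path.rpartition(".")
    return f"courses/{stem or path}/"


def handler(event, context):
    """S3-triggered entry point (sketch): one record per uploaded file."""
    for record in event["Records"]:
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        prefix = output_prefix_for(key)
        # 1. For video: boto3.client("mediaconvert").create_job(...) to
        #    transcode into s3://{OUTPUT_BUCKET}/{prefix}  (settings elided)
        # 2. Update the course-content rows in the database
        # 3. boto3.client("cloudfront").create_invalidation(...) for /{prefix}*
    return {"processed": len(event["Records"])}
```

Driving the whole flow from the S3 upload event is what removes the developer from the loop: educators only ever touch the upload bucket.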
Database Resilience
The single RDS instance was upgraded to a Multi-AZ deployment for automatic failover. PgBouncer was introduced as a connection pooler sitting between the application and RDS, eliminating connection exhaustion under the concurrent load of students joining classes simultaneously. Database query performance was profiled and three slow queries causing significant latency during class start were optimised with appropriate indexing.
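For a sense of why PgBouncer fixes exhaustion so cheaply, here is a minimal `pgbouncer.ini` sketch (hostnames and pool sizes are placeholders, not the client's values):

```ini
[databases]
; Route the app's logical database through the pooler.
edtech = host=prod-rds.internal port=5432 dbname=edtech

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; Transaction pooling: a server connection is held only for the
; duration of a transaction, so hundreds of client connections can
; share a few dozen real Postgres connections.
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
```

With transaction pooling, a spike of hundreds of simultaneous logins queues briefly at the pooler instead of exhausting Postgres's connection limit.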
Observability & Incident Response
We deployed Prometheus and Grafana with dashboards specifically designed around the EdTech use case — concurrent student connections, live class health, content delivery latency, and error rates. PagerDuty alerts are configured with class-schedule awareness, meaning on-call severity escalates automatically 30 minutes before scheduled class times. Runbooks are attached to every alert so on-call engineers know exactly what to do rather than improvising under pressure.
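The class-schedule awareness is a small piece of logic. A sketch of the escalation rule in Python (the one-hour class duration is an assumption; the class times are from the schedule above):

```python
from datetime import datetime, time, timedelta

# Scheduled class start times: 8am, 12pm, and 7pm local time.
CLASS_TIMES = [time(8, 0), time(12, 0), time(19, 0)]
PRE_CLASS_WINDOW = timedelta(minutes=30)
CLASS_DURATION = timedelta(hours=1)  # assumed; adjust per course


def alert_severity(now: datetime, base: str = "warning") -> str:
    """Escalate severity to 'critical' from 30 minutes before a
    scheduled class until the class ends; otherwise keep the base level."""
    for start in CLASS_TIMES:
        class_start = now.replace(hour=start.hour, minute=start.minute,
                                  second=0, microsecond=0)
        if class_start - PRE_CLASS_WINDOW <= now <= class_start + CLASS_DURATION:
            return "critical"
    return base
```

The same rule can be expressed as time-windowed alert routing in PagerDuty or as severity labels on the Prometheus alerting rules themselves.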
✦Key Engineering Decisions
Decision: Fix root causes before migrating infrastructure
Many teams make the mistake of moving broken software to better infrastructure and expecting it to fix itself. We stabilised the existing system in the first two weeks by fixing the actual application bugs — memory leak, connection pool, crash recovery — before touching the architecture. This gave students a better experience immediately while the larger migration was in progress.
Decision: AWS MediaConvert for video transcoding over a custom solution
Video transcoding is a solved problem. Building a custom transcoding pipeline would have taken weeks and introduced significant ongoing maintenance. MediaConvert handles transcoding at scale, integrates natively with S3 and CloudFront, and costs a fraction of the developer time it would take to build and maintain a custom solution.
Decision: PgBouncer for connection pooling rather than scaling RDS vertically
Connection exhaustion under concurrent load is typically solved by simply upgrading to a larger database instance. But the actual problem was inefficient connection management — the application was opening a new connection per request. PgBouncer solved the root cause at near-zero cost, whereas vertical scaling would have increased the RDS bill significantly without fixing the underlying issue.
⏱Engagement Timeline
✓Results Delivered
"Our students stopped complaining about outages almost immediately — within the first two weeks, before the full migration was even done. ESSEMVEE fixed the actual problems, not just the symptoms. The content pipeline alone saved us dozens of engineering hours per month."
Founder & CEO
Online Education Platform · South Asia · Name withheld on request
Facing Similar Challenges?
Book a free 30-minute call — no obligation, no sales pitch.
Schedule Free Consultation