Online Education Platform
An EdTech startup serving over 5,000 active students was experiencing weekly outages during scheduled live classes — the worst possible time for a platform failure. Alongside the instability, new course content was taking two to three weeks to reach students due to a broken release process. In 14 weeks, we eliminated the outages, rebuilt the delivery pipeline, and gave the team a platform they could rely on.
Client name and identifying details withheld at their request. References available during consultation.
!The Challenge
This EdTech platform had found strong product-market fit — student enrolment had grown 3× over 18 months and live class attendance was consistently high. But the infrastructure had not scaled with the product. The platform was running on a single-server monolithic deployment with no redundancy, no environment separation, and no automated recovery mechanisms.
Live classes ran at scheduled times — 8am, 12pm, and 7pm — when hundreds of concurrent students would log in simultaneously. These traffic spikes were entirely predictable, yet the platform would frequently become unresponsive at exactly these moments. The root causes were multiple: a memory leak in the video streaming component, database connection pool exhaustion under concurrent load, and no health check or auto-restart mechanism to recover from failures without manual intervention.
The deployment process made things worse. Engineers deployed directly from their laptops to the production server — a process with no testing gate, no staging environment, and no rollback capability. Changes that introduced regressions went live immediately and impacted all students. The team had started a practice of freezing deployments during live class hours, which meant that urgent bug fixes had to wait hours to be released — and features that students needed were routinely delayed by 2–3 weeks just to avoid the risk of a bad deployment.
The business impact was severe. Refund requests were rising. Student reviews were mentioning unreliability. Churn was increasing at renewal time. With a new enrolment cycle approaching and investor pressure to demonstrate platform maturity, the team needed a fundamental fix — not another patch.
⇄Before vs After
⚙Tech Stack
→What We Did
Root Cause Analysis & Stabilisation
Before touching any infrastructure, we spent the first two weeks on thorough root cause analysis. We instrumented the application to capture memory usage, database connection counts, and request latency over time. This revealed three distinct failure modes: a memory leak in the WebSocket connection handler for live classes, database connection pool exhaustion under concurrent login load at class start times, and the absence of a process supervisor, which meant every crash required a manual SSH session to restart the service. We fixed all three before making any infrastructure changes, stabilising the existing system immediately while the larger migration got underway.
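As a sketch of the kind of instrumentation this involved (a minimal illustration, not the client's actual code — the sampler fields and the leak heuristic's window are assumptions), a lightweight sampler plus a simple leak detector might look like:

```python
import resource
import time


def snapshot(open_db_connections: int, inflight_requests: int) -> dict:
    """Capture one instrumentation sample: memory, connections, load.

    The connection and request counts are assumed to come from the
    application's own bookkeeping (e.g. its pool object); they are
    passed in so the sampler stays framework-agnostic.
    """
    return {
        "ts": time.time(),
        # ru_maxrss is peak resident set size: KiB on Linux, bytes on macOS.
        "rss_peak": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
        "db_connections": open_db_connections,
        "inflight_requests": inflight_requests,
    }


def detect_leak(samples: list[dict], window: int = 5) -> bool:
    """Flag a suspected leak: peak RSS strictly rising across the last
    `window` samples while request load is flat or falling."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    rss_rising = all(a["rss_peak"] < b["rss_peak"]
                     for a, b in zip(recent, recent[1:]))
    load_flat = recent[-1]["inflight_requests"] <= recent[0]["inflight_requests"]
    return rss_rising and load_flat
```

Memory growth under flat load is the classic leak signature; correlating it with connection counts is what separates a leak from ordinary load growth.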
Environment Separation & CI/CD Pipeline
We created fully isolated dev, staging, and production environments using Terraform. The staging environment is a scaled-down mirror of production — same architecture, same configuration, just smaller. A GitHub Actions pipeline was built to run all tests automatically on every pull request, deploy to staging on merge to main, and require a manual approval gate before promoting to production. Deployments include automated health checks with rollback triggered if health checks fail within 3 minutes of deploy.
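The promote-or-rollback decision after a deploy can be reduced to a small gate: poll the health endpoint for the rollback window and bail on the first failure. The sketch below (illustrative only — the real gate lives in the pipeline, and the probe, window, and interval are assumptions) injects the clock and sleep so the logic is testable:

```python
import time
from typing import Callable


def gate_deployment(
    probe: Callable[[], bool],
    window_seconds: float = 180.0,   # the 3-minute rollback window
    interval_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a health probe for `window_seconds` after a deploy.

    Returns "promote" if every probe in the window succeeds, or
    "rollback" on the first failure. `probe` is expected to hit the
    service's health endpoint and return True when healthy.
    """
    deadline = clock() + window_seconds
    while clock() < deadline:
        if not probe():
            return "rollback"
        sleep(interval_seconds)
    return "promote"
```

Keeping the decision in one pure-ish function means the same logic can run in CI against staging and in the production promotion step.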
Content Publishing Pipeline
Course content uploads — video recordings, PDFs, quizzes — previously required a developer to manually move files and update database records. We automated the entire pipeline: content uploaded by educators triggers a Lambda function that processes and transcodes it (via AWS MediaConvert for video), uploads the results to S3, invalidates the CloudFront cache, and updates the database. What previously took 2–3 weeks now completes in under 2 hours, with no engineering involvement.
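The shape of such a Lambda handler is sketched below. Everything here is illustrative — the bucket name, key layout, and helper are hypothetical, and the MediaConvert and CloudFront calls are indicated as comments because their job settings are deployment-specific:

```python
import os
import urllib.parse

# Hypothetical output bucket; the real one would come from the environment.
OUTPUT_BUCKET = os.environ.get("OUTPUT_BUCKET", "course-content-processed")


def output_prefix_for(source_key: str) -> str:
    """Map an educator upload key to its processed-output prefix,
    e.g. 'uploads/algebra/week3.mp4' -> 'courses/algebra/week3/'."""
    path = source_key.split("/", 1)[1] if "/" in source_key else source_key
    stem, _, _ = path.rpartition(".")
    return f"courses/{stem or path}/"


def handler(event, context):
    """S3-triggered entry point (sketch): one record per uploaded file."""
    for record in event["Records"]:
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        prefix = output_prefix_for(key)
        # 1. For video: boto3.client("mediaconvert").create_job(...) to
        #    transcode into s3://{OUTPUT_BUCKET}/{prefix}  (settings elided)
        # 2. Update the course-content rows in the database
        # 3. boto3.client("cloudfront").create_invalidation(...) for /{prefix}*
    return {"processed": len(event["Records"])}
```

Driving the whole flow from the S3 upload event is what removes the developer from the loop: educators only ever touch the upload bucket.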
Database Resilience
The single RDS instance was upgraded to a Multi-AZ deployment for automatic failover. PgBouncer was introduced as a connection pooler sitting between the application and RDS, eliminating connection exhaustion under the concurrent load of students joining classes simultaneously. Database query performance was profiled and three slow queries causing significant latency during class start were optimised with appropriate indexing.
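For a sense of why PgBouncer fixes exhaustion so cheaply, here is a minimal `pgbouncer.ini` sketch (hostnames and pool sizes are placeholders, not the client's values):

```ini
[databases]
; Route the app's logical database through the pooler.
edtech = host=prod-rds.internal port=5432 dbname=edtech

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; Transaction pooling: a server connection is held only for the
; duration of a transaction, so hundreds of client connections can
; share a few dozen real Postgres connections.
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
```

With transaction pooling, a spike of hundreds of simultaneous logins queues briefly at the pooler instead of exhausting Postgres's connection limit.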
Observability & Incident Response
We deployed Prometheus and Grafana with dashboards specifically designed around the EdTech use case — concurrent student connections, live class health, content delivery latency, and error rates. PagerDuty alerts are configured with class-schedule awareness, meaning on-call severity escalates automatically 30 minutes before scheduled class times. Runbooks are attached to every alert so on-call engineers know exactly what to do rather than improvising under pressure.
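The class-schedule awareness is a small piece of logic. A sketch of the escalation rule in Python (the one-hour class duration is an assumption; the class times are from the schedule above):

```python
from datetime import datetime, time, timedelta

# Scheduled class start times: 8am, 12pm, and 7pm local time.
CLASS_TIMES = [time(8, 0), time(12, 0), time(19, 0)]
PRE_CLASS_WINDOW = timedelta(minutes=30)
CLASS_DURATION = timedelta(hours=1)  # assumed; adjust per course


def alert_severity(now: datetime, base: str = "warning") -> str:
    """Escalate severity to 'critical' from 30 minutes before a
    scheduled class until the class ends; otherwise keep the base level."""
    for start in CLASS_TIMES:
        class_start = now.replace(hour=start.hour, minute=start.minute,
                                  second=0, microsecond=0)
        if class_start - PRE_CLASS_WINDOW <= now <= class_start + CLASS_DURATION:
            return "critical"
    return base
```

The same rule can be expressed as time-windowed alert routing in PagerDuty or as severity labels on the Prometheus alerting rules themselves.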
✦Key Engineering Decisions
Decision: Fix root causes before migrating infrastructure
Many teams make the mistake of moving broken software to better infrastructure and expecting it to fix itself. We stabilised the existing system in the first two weeks by fixing the actual application bugs — memory leak, connection pool, crash recovery — before touching the architecture. This gave students a better experience immediately while the larger migration was in progress.
Decision: AWS MediaConvert for video transcoding over a custom solution
Video transcoding is a solved problem. Building a custom transcoding pipeline would have taken weeks and introduced significant ongoing maintenance. MediaConvert handles transcoding at scale, integrates natively with S3 and CloudFront, and costs a fraction of the developer time it would take to build and maintain a custom solution.
Decision: PgBouncer for connection pooling rather than scaling RDS vertically
Connection exhaustion under concurrent load is typically solved by simply upgrading to a larger database instance. But the actual problem was inefficient connection management — the application was opening a new connection per request. PgBouncer solved the root cause at near-zero cost, whereas vertical scaling would have increased the RDS bill significantly without fixing the underlying issue.
⏱Engagement Timeline
✓Results Delivered
"Our students stopped complaining about outages almost immediately — within the first two weeks, before the full migration was even done. ESSEMVEE fixed the actual problems, not just the symptoms. The content pipeline alone saved us dozens of engineering hours per month."
Founder & CEO
Online Education Platform · South Asia · Name withheld on request
Facing Similar Challenges?
Book a free 30-minute call — no obligation, no sales pitch.
Schedule Free Consultation