Overview
On 16 March, 23 March, 9 April and 10 April (outlined in red in the image below), vCore and other services experienced performance degradation, including elevated response times.
Despite normal CPU utilisation on the database (~50%) and healthy API indicators, the system exhibited:
- High request latency
- Increased database commit latency
- Application-level slowness
A controlled Aurora failover (reader → writer promotion) restored system performance, resulting in the best observed performance baseline in recent periods.

Last 90 Days of Cluster Connections

The graph above shows the last 90 days of concurrent database connections. Spikes in connection count grew gradually until reaching a critical point in the middle of last week, triggering a feedback loop of sustained connections and the resulting performance degradation. Note the complete flatline of connections at the far right of the graph (outlined in green): connections are no longer piling up, indicating that the failover has cleared stale resources and contention.
Previous performance issues (notably Monday 16 March and Monday 23 March) are also outlined in red.
Database connections from early Thursday

Latency and Load during degradation

Latency and Load reduction following failover
(Note: AWS metrics were reset on failover. The latency and load peaks shown above represent the average load for the previous two days.)
Database connection stacking from before, during, and after the incident

Impact
- User Impact:
- Slow page loads across the web application
- Intermittent failures/timeouts on user actions
- Business Impact:
- Degraded user experience
- Increased operational load during incident response
- Risk to customer trust due to instability
Detection
Elevated response times reported via application monitoring
Aurora metrics indicated:
- Increased commit latency
- No corresponding spike in CPU or memory utilisation
Apache metrics showed:
- Increased request duration
- Worker saturation symptoms
Timeline
- ~09:00 Apr 9 Performance degradation began
- ~10:00 Elevated latency observed in the application
- ~10:00-10:30 Database metrics reviewed (CPU normal, latency elevated)
- 11:00 Apache / application restarts attempted (no improvement)
- 12:00 Query analysis identified a high-cost UPDATE performing a full table scan
- 12:30 Indexes added to mitigate the query inefficiency
- 13:00 Database connection count began to fall
- 14:00 Performance degraded again
- 09:00 Apr 10 Performance remained degraded
- 14:00 Aurora failover initiated (reader promoted to writer)
- 14:05 Immediate restoration of performance
- ~23:00 Database instance size increased
What Happened (Technical Summary)
The system entered a degraded state characterised by high database commit latency and request blocking, despite moderate resource utilisation. Investigation revealed:
- A high-frequency UPDATE query performing a full table scan (~800k rows) due to missing indexing
- Resulting in lock contention and transaction queuing
- Accumulation of long-lived or blocked transactions
- Increasing contention within InnoDB internal structures
Although indexing improvements were applied, the database remained in a degraded state due to residual transactional and locking contention. A failover reset:
- Active connections
- Open transactions
- Lock queues
- InnoDB internal state
- Various other caches
This immediately restored normal performance.
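The effect of the missing index on the hot UPDATE path can be illustrated with a small, self-contained sketch. It uses SQLite (Python stdlib) rather than Aurora MySQL, and a hypothetical `jobs` table, purely to show how adding an index changes the query plan from a full scan to a targeted lookup:

```python
import sqlite3

# In-memory database with a hypothetical "jobs" table standing in for the
# real ~800k-row table; the actual schema and query are not part of this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN returns rows whose last column describes the access path.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

update = "UPDATE jobs SET updated_at = 'now' WHERE status = 'pending'"

before = plan(update)  # full table scan: every row is examined (and, in InnoDB, locked)
conn.execute("CREATE INDEX idx_jobs_status ON jobs (status)")
after = plan(update)   # index lookup: only matching rows are touched

print(before)  # e.g. "SCAN jobs"
print(after)   # e.g. "SEARCH jobs USING INDEX idx_jobs_status (status=?)"
```

On MySQL/Aurora the same check is `EXPLAIN` on the UPDATE statement; the principle (scan versus index search) is identical.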
Root Cause
Primary Root Cause
A high-frequency database update query executed without an appropriate index, resulting in:
- Full table scans
- Excessive row-level locking
- Transaction contention under load
Contributing Factors
1. Transaction and Lock Accumulation
- Blocked and queued transactions accumulated over time
- Lock contention propagated across unrelated queries due to shared resources
2. Connection Management Characteristics (PHP + Apache Prefork)
- High number of concurrent database connections
- Long-lived connections increasing contention footprint
3. InnoDB State Degradation Under Contention
- Internal structures (lock queues, undo logs, buffer pool efficiency) degraded under sustained load
- System did not self-recover after contention was introduced
4. Lack of Early Detection Signals
No alerting on:
- Commit latency
- Lock wait time
- Long-running transactions
The issue was detected only after user-visible degradation
5. Delayed Recovery Without Reset
- Restarting application layers (Apache/PHP) did not clear database-level contention
- Only a database failover (hard reset of state) resolved the issue
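The missing long-running-transaction signal (factor 4) could start as a small filter over rows polled from `information_schema.innodb_trx`. The row shape and the 30-second threshold below are assumptions for illustration; in practice the rows would come from a periodic `SELECT trx_id, trx_started FROM information_schema.innodb_trx`:

```python
from datetime import datetime, timedelta

# Hypothetical row shape: (trx_id, trx_started) pairs from a periodic
# poll of information_schema.innodb_trx.
def long_running(trx_rows, now, threshold=timedelta(seconds=30)):
    """Return ids of transactions open longer than the threshold."""
    return [trx_id for trx_id, started in trx_rows if now - started > threshold]

# Fabricated example data; 30 s is an assumed alert threshold.
now = datetime(2024, 1, 1, 12, 0, 0)
rows = [
    ("trx-1", now - timedelta(minutes=5)),   # stuck: should alert
    ("trx-2", now - timedelta(seconds=2)),   # healthy
]
print(long_running(rows, now))  # ['trx-1']
```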
Why It Affected the Entire System
Although the triggering query targeted a specific table, the impact was systemic due to:
- Shared InnoDB resources (buffer pool, lock manager)
- Transaction queue contention affecting unrelated queries
- Connection pool saturation at the application layer
- Increased commit latency impacting all write operations
Resolution
Immediate mitigation achieved via:
- Aurora failover (reader promoted to writer)
Performance returned to baseline immediately after failover
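For reference, an Aurora failover of this kind can be scripted via the RDS API. The sketch below only builds the request parameters (a dry run); passing them to boto3's `rds.failover_db_cluster` would perform the actual promotion. The cluster and instance identifiers are placeholders, not the real ones:

```python
def failover_params(cluster_id, target_reader_id):
    """Build the arguments for rds.failover_db_cluster, which promotes the
    named reader instance to writer (the operation used in this incident)."""
    return {
        "DBClusterIdentifier": cluster_id,
        "TargetDBInstanceIdentifier": target_reader_id,
    }

params = failover_params("app-cluster", "app-cluster-reader-1")  # placeholder ids
# Real call (requires AWS credentials):
#   import boto3
#   boto3.client("rds").failover_db_cluster(**params)
print(params)
```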
Lessons Learned
- Moderate CPU utilisation does not, on its own, indicate database health
- Commit latency is a critical early warning signal
- Database engines can enter degraded states that do not self-recover
- Failover acts as a reset, not a root cause fix
Follow-Up Actions
Short Term
Confirm all high-frequency queries are properly indexed
Enable and review slow query logging (lower threshold temporarily)
Monitor and alert on:
- Commit latency
- Lock wait time
- Active transactions (innodb_trx)
Add visibility into connection counts and states
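The connection-visibility item above could begin as a simple aggregation over `SHOW PROCESSLIST` output, so that connection stacking (many connections parked in the same waiting state) is visible at a glance. The row shape here is an assumption for illustration:

```python
from collections import Counter

# Hypothetical rows: (connection_id, state) pairs from a periodic
# SHOW PROCESSLIST poll.
def connections_by_state(processlist_rows):
    return Counter(state for _conn_id, state in processlist_rows)

rows = [
    (1, "Sleep"), (2, "updating"), (3, "updating"),
    (4, "Waiting for table metadata lock"), (5, "updating"),
]
counts = connections_by_state(rows)
print(counts["updating"])  # 3
```

A sudden rise in a single waiting state is exactly the stacking pattern seen in the connection graphs above, and is cheap to alert on.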
Medium Term
Review connection management strategy (reduce long-lived connections where possible)
Add dashboards for:
- Transaction age
- Lock contention
- Threads running vs connected
Long Term
- Evaluate architectural changes to reduce high-frequency write contention
- Introduce backpressure or rate limiting on heavy write paths
- Consider read/write isolation improvements or workload partitioning
- Formalise database failover as a controlled operational response (not primary mitigation)
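The backpressure item above could take the form of a token bucket in front of heavy write paths: admit a bounded write rate and reject (or queue) the excess rather than letting it pile up as lock contention. This is a generic sketch, not tied to the actual application; the rate and capacity are placeholder values:

```python
class TokenBucket:
    """Admit at most `rate` writes per second, with bursts up to `capacity`.
    Writes that find the bucket empty are rejected (or could be queued)."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, then spend one token if available.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=2)  # placeholder: 2 writes/s, burst of 2
results = [bucket.allow(now=0.0) for _ in range(3)]
print(results)  # [True, True, False] -- third write exceeds the burst
```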
Blameless Summary
This incident was caused by a combination of:
- An inefficient query pattern under load
- Insufficient observability into database contention signals
- Expected but unmanaged behaviour of the database under sustained transactional pressure
No single action or individual directly caused the incident.
The system behaved in line with its current design and constraints.
Incident History (Last 90 Days)

* An Outage indicates that the system was completely inaccessible. Performance Degradation indicates that while the system was slow, it was still accessible and most work could be done, albeit at a less efficient rate.
The total duration of actual outages within the last 90 days corresponds to an uptime of 99.86%.
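As a cross-check, the 99.86% figure implies the following outage budget over the 90-day window (assuming the percentage was computed against full calendar time):

```python
window_minutes = 90 * 24 * 60            # 90-day window = 129,600 minutes
uptime = 0.9986
downtime_minutes = window_minutes * (1 - uptime)
print(round(downtime_minutes))           # 181 minutes, i.e. roughly 3 hours of outage
```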