Overview
On 16 March, 23 March, 9 April and 10 April (outlined in red in the image below), vCore and other services experienced performance degradation, including elevated response times.
Despite normal CPU utilisation on the database (~50%) and healthy API indicators, the system exhibited:
- High request latency
- Increased database commit latency
- Application-level slowness
A controlled Aurora failover (reader → writer promotion) restored system performance, resulting in the best observed performance baseline in recent periods.

Last 90 Days of Cluster Connections

The graph above shows the last 90 days of concurrent database connections. Spikes in connection count grew gradually until reaching a critical point in the middle of last week, triggering a feedback loop of sustained connections and the resulting performance degradation. Note the complete flatline of connections at the far right of the graph (outlined in green): connections are no longer piling up, indicating that the failover has cleared stale resources and contention.
Previous performance issues (notably Monday 16 March and Monday 23 March) are also outlined in red.
Database connections from early Thursday

Latency and Load during degradation

Latency and Load reduction following failover
(Note: AWS metrics were reset on failover. The latency and load peaks shown above represent the average load for the previous two days.)
Database connection stacking from before, during, and after the incident

Impact
- User Impact:
- Slow page loads across the web application
- Intermittent failures/timeouts on user actions
- Business Impact:
- Degraded user experience
- Increased operational load during incident response
- Risk to customer trust due to instability
Detection
Elevated response times reported via application monitoring
Aurora metrics indicated:
- Increased commit latency
- No corresponding spike in CPU or memory utilisation
Apache metrics showed:
- Increased request duration
- Worker saturation symptoms
Timeline
- ~09:00 Apr 9 Performance degradation began
- ~10:00 Elevated latency observed in the application
- ~10:00-10:30 Database metrics reviewed (CPU normal, latency elevated)
- 11:00 Apache / application restarts attempted (no improvement)
- 12:00 Query analysis identified a high-cost UPDATE performing a full table scan
- 12:30 Indexes added to mitigate the query inefficiency
- 13:00 Database connection count began to fall
- 14:00 Performance degraded again
- 09:00 Apr 10 Performance remained degraded
- 14:00 Aurora failover initiated (reader promoted to writer)
- 14:05 Immediate restoration of performance
- ~23:00 Database instance size increased
What Happened (Technical Summary)
The system entered a degraded state characterised by high database commit latency and request blocking, despite moderate resource utilisation. Investigation revealed:
- A high-frequency UPDATE query performing a full table scan (~800k rows) due to missing indexing
- Resulting in lock contention and transaction queuing
- Accumulation of long-lived or blocked transactions
- Increasing contention within InnoDB internal structures
Although indexing improvements were applied, the database remained in a degraded state due to residual transactional and locking contention. A failover reset:
- Active connections
- Open transactions
- Lock queues
- InnoDB internal state
- Various other caches
This immediately restored normal performance.
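The effect of the missing index on the hot UPDATE path can be illustrated with a small, self-contained sketch. It uses SQLite (Python stdlib) rather than Aurora MySQL, and a hypothetical `jobs` table, purely to show how adding an index changes the query plan from a full scan to a targeted lookup:

```python
import sqlite3

# In-memory database with a hypothetical "jobs" table standing in for the
# real ~800k-row table; the actual schema and query are not part of this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN returns rows whose last column describes the access path.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

update = "UPDATE jobs SET updated_at = 'now' WHERE status = 'pending'"

before = plan(update)  # full table scan: every row is examined (and, in InnoDB, locked)
conn.execute("CREATE INDEX idx_jobs_status ON jobs (status)")
after = plan(update)   # index lookup: only matching rows are touched

print(before)  # e.g. "SCAN jobs"
print(after)   # e.g. "SEARCH jobs USING INDEX idx_jobs_status (status=?)"
```

On MySQL/Aurora the same check is `EXPLAIN` on the UPDATE statement; the principle (scan versus index search) is identical.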
Root Cause
Primary Root Cause
A high-frequency database update query executed without an appropriate index, resulting in:
- Full table scans
- Excessive row-level locking
- Transaction contention under load
Contributing Factors
1. Transaction and Lock Accumulation
- Blocked and queued transactions accumulated over time
- Lock contention propagated across unrelated queries due to shared resources
2. Connection Management Characteristics (PHP + Apache Prefork)
- High number of concurrent database connections
- Long-lived connections increasing contention footprint
3. InnoDB State Degradation Under Contention
- Internal structures (lock queues, undo logs, buffer pool efficiency) degraded under sustained load
- System did not self-recover after contention was introduced
4. Lack of Early Detection Signals
No alerting on:
- Commit latency
- Lock wait time
- Long-running transactions
The issue was detected only after user-visible degradation
5. Delayed Recovery Without Reset
- Restarting application layers (Apache/PHP) did not clear database-level contention
- Only a database failover (hard reset of state) resolved the issue
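The missing long-running-transaction signal (factor 4) could start as a small filter over rows polled from `information_schema.innodb_trx`. The row shape and the 30-second threshold below are assumptions for illustration; in practice the rows would come from a periodic `SELECT trx_id, trx_started FROM information_schema.innodb_trx`:

```python
from datetime import datetime, timedelta

# Hypothetical row shape: (trx_id, trx_started) pairs from a periodic
# poll of information_schema.innodb_trx.
def long_running(trx_rows, now, threshold=timedelta(seconds=30)):
    """Return ids of transactions open longer than the threshold."""
    return [trx_id for trx_id, started in trx_rows if now - started > threshold]

# Fabricated example data; 30 s is an assumed alert threshold.
now = datetime(2024, 1, 1, 12, 0, 0)
rows = [
    ("trx-1", now - timedelta(minutes=5)),   # stuck: should alert
    ("trx-2", now - timedelta(seconds=2)),   # healthy
]
print(long_running(rows, now))  # ['trx-1']
```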
Why It Affected the Entire System
Although the triggering query targeted a specific table, the impact was systemic due to:
- Shared InnoDB resources (buffer pool, lock manager)
- Transaction queue contention affecting unrelated queries
- Connection pool saturation at the application layer
- Increased commit latency impacting all write operations
Resolution
Immediate mitigation achieved via:
- Aurora failover (reader promoted to writer)
Performance returned to baseline immediately after failover
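For reference, an Aurora failover of this kind can be scripted via the RDS API. The sketch below only builds the request parameters (a dry run); passing them to boto3's `rds.failover_db_cluster` would perform the actual promotion. The cluster and instance identifiers are placeholders, not the real ones:

```python
def failover_params(cluster_id, target_reader_id):
    """Build the arguments for rds.failover_db_cluster, which promotes the
    named reader instance to writer (the operation used in this incident)."""
    return {
        "DBClusterIdentifier": cluster_id,
        "TargetDBInstanceIdentifier": target_reader_id,
    }

params = failover_params("app-cluster", "app-cluster-reader-1")  # placeholder ids
# Real call (requires AWS credentials):
#   import boto3
#   boto3.client("rds").failover_db_cluster(**params)
print(params)
```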
Lessons Learned
- Moderate CPU utilisation does not, on its own, indicate database health
- Commit latency is a critical early warning signal
- Database engines can enter degraded states that do not self-recover
- Failover acts as a reset, not a root cause fix
Follow-Up Actions
Short Term
Confirm all high-frequency queries are properly indexed
Enable and review slow query logging (lower threshold temporarily)
Monitor and alert on:
- Commit latency
- Lock wait time
- Active transactions (innodb_trx)
Add visibility into connection counts and states
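The connection-visibility item above could begin as a simple aggregation over `SHOW PROCESSLIST` output, so that connection stacking (many connections parked in the same waiting state) is visible at a glance. The row shape here is an assumption for illustration:

```python
from collections import Counter

# Hypothetical rows: (connection_id, state) pairs from a periodic
# SHOW PROCESSLIST poll.
def connections_by_state(processlist_rows):
    return Counter(state for _conn_id, state in processlist_rows)

rows = [
    (1, "Sleep"), (2, "updating"), (3, "updating"),
    (4, "Waiting for table metadata lock"), (5, "updating"),
]
counts = connections_by_state(rows)
print(counts["updating"])  # 3
```

A sudden rise in a single waiting state is exactly the stacking pattern seen in the connection graphs above, and is cheap to alert on.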
Medium Term
Review connection management strategy (reduce long-lived connections where possible)
Add dashboards for:
- Transaction age
- Lock contention
- Threads running vs connected
Long Term
- Evaluate architectural changes to reduce high-frequency write contention
- Introduce backpressure or rate limiting on heavy write paths
- Consider read/write isolation improvements or workload partitioning
- Formalise database failover as a controlled operational response (not primary mitigation)
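The backpressure item above could take the form of a token bucket in front of heavy write paths: admit a bounded write rate and reject (or queue) the excess rather than letting it pile up as lock contention. This is a generic sketch, not tied to the actual application; the rate and capacity are placeholder values:

```python
class TokenBucket:
    """Admit at most `rate` writes per second, with bursts up to `capacity`.
    Writes that find the bucket empty are rejected (or could be queued)."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, then spend one token if available.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=2)  # placeholder: 2 writes/s, burst of 2
results = [bucket.allow(now=0.0) for _ in range(3)]
print(results)  # [True, True, False] -- third write exceeds the burst
```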
Blameless Summary
This incident was caused by a combination of:
- An inefficient query pattern under load
- Insufficient observability into database contention signals
- Expected but unmanaged behaviour of the database under sustained transactional pressure
No single action or individual directly caused the incident.
The system behaved in line with its current design and constraints.
Incident History (Last 90 Days)

* An Outage indicates that the system was completely inaccessible. Performance Degradation indicates that while the system was slow, it was still accessible and most work could be done, albeit at a less efficient rate.
The total duration of actual outages within the last 90 days corresponds to an uptime of 99.86%.
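As a cross-check, the 99.86% figure implies the following outage budget over the 90-day window (assuming the percentage was computed against full calendar time):

```python
window_minutes = 90 * 24 * 60            # 90-day window = 129,600 minutes
uptime = 0.9986
downtime_minutes = window_minutes * (1 - uptime)
print(round(downtime_minutes))           # 181 minutes, i.e. roughly 3 hours of outage
```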