Date:
29 April 2026
Incident Window:
10:20 - 15:35 AEST
Systems Affected:
Database cluster, downstream application services
Summary
At approximately 10:20 AEST, the system experienced a rapid spike in database connections, leading to resource exhaustion and degraded application performance. A failover at 10:40 temporarily alleviated the issue.
A second, more severe spike occurred at 14:00, again resulting in database connection exhaustion. Investigation identified a recently released query related to Commonwealth Unspent Funds as the root cause. The query was executing at high volume due to backlog processing and contained inefficient subqueries.
A hotfix was deployed at 15:35, resolving the issue.
[Chart: database connection spikes during the incident window]
Impact
  • Intermittent application degradation and timeouts
  • Elevated database connection usage leading to exhaustion
  • Reduced system responsiveness during spike windows
  • Potential delays in provider statement generation
Timeline
  • 10:20: Initial spike in database connections observed
  • 10:40: Database failover performed; connection levels stabilised
  • 10:40–14:00: Investigation into root cause underway
  • 14:00: Second spike; database connections exhausted again
  • ~14:10: Problem query identified
  • ~14:30–15:30: Query analysis and remediation work
  • 15:35: Hotfix deployed; connection usage returns to normal
Root Cause
A query introduced last week for Support At Home contained two unbounded subqueries. This resulted in:
  • High memory and CPU overhead per execution
  • Limited execution plan selection under load
  • Amplified cost when executed concurrently
The issue was triggered by a backlog of statement-generation queries, which drove up database connection usage and led to exhaustion.
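
To make the failure mode concrete, the sketch below contrasts an unbounded correlated subquery with a bounded rewrite. It is illustrative only: the actual Support At Home query is not reproduced in this report, and the table and column names (funds, statements, provider_id, period, unspent) are invented for the example.

# Hypothetical illustration only; schema and query are invented for this sketch.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE funds (provider_id INTEGER, period TEXT, unspent REAL)")
cur.execute("CREATE TABLE statements (provider_id INTEGER, period TEXT)")
cur.executemany("INSERT INTO funds VALUES (?, ?, ?)",
                [(1, "2026-03", 100.0), (1, "2026-04", 50.0), (2, "2026-04", 75.0)])
cur.executemany("INSERT INTO statements VALUES (?, ?)",
                [(1, "2026-04"), (2, "2026-04")])

# Unbounded pattern: the correlated subquery aggregates every matching row in
# funds for every statement, with no period restriction, so each execution
# scans far more data than it needs.
unbounded = """
SELECT s.provider_id,
       (SELECT SUM(f.unspent) FROM funds f
        WHERE f.provider_id = s.provider_id)
FROM statements s
"""

# Bounded rewrite: the subquery is restricted to the statement's own period,
# shrinking the rows touched per execution and the cost under concurrency.
bounded = """
SELECT s.provider_id,
       (SELECT SUM(f.unspent) FROM funds f
        WHERE f.provider_id = s.provider_id
          AND f.period = s.period)
FROM statements s
"""

for label, sql in (("unbounded", unbounded), ("bounded", bounded)):
    print(label, cur.execute(sql).fetchall())
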
Resolution
  • Identified and analysed the problematic query
  • Implemented a hotfix to optimise/remove unbounded subqueries
  • Reduced per-query load and execution cost
  • Deployment at 15:35 resolved connection exhaustion
Blameless Root Cause Statement
The incident was caused by an inefficient query design executed at scale during backlog processing, which resulted in excessive database load and connection exhaustion. Improvements are required in query design standards, load validation, and bulk processing controls.
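
As one example of a bulk processing control, the minimal sketch below caps how many backlog statement queries can run against the database at once. The worker function, module layout, and concurrency ceiling are hypothetical; the report does not describe the real processing pipeline.

# Minimal sketch of a bulk-processing control; names and limits are assumptions.
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_QUERIES = 8  # assumed ceiling, sized against the DB connection pool

def process_statement(statement_id: int) -> None:
    """Placeholder for the real statement-generation query."""
    ...

def drain_backlog(statement_ids: list[int]) -> None:
    # A bounded worker pool keeps a large backlog from opening an unbounded
    # number of database connections at the same time (assuming one
    # connection per worker).
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_QUERIES) as pool:
        list(pool.map(process_statement, statement_ids))

if __name__ == "__main__":
    drain_backlog(list(range(100)))
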
What we are doing about it
Last year Visualcare began an infrastructure upgrade that includes containerisation and database sharding. In recent months we have been piloting these changes with a key partner to assess performance and reliability. The pilot has been successful and goes live in early May; once complete, we will roll the improvements out more broadly to strengthen reliability and performance.
Total incident duration over the last 30 days corresponds to an uptime of 99.2%.
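
As a rough cross-check (assuming this incident accounts for the bulk of the 30-day downtime): the 10:20-15:35 window is 315 minutes, and 30 days is 43,200 minutes, so 315 / 43,200 ≈ 0.73% downtime, or roughly 99.3% uptime before any shorter incidents are counted, broadly consistent with the 99.2% figure.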