2026 - Root Cause Analysis (RCA)

Date: May 19, 2026
Incident Window: 09:10 - 16:30 AEST
Systems Affected: Visualcare web application (app.visualcare.com.au)
Summary
At approximately 09:10 AEST, customers began reporting that the Visualcare web application had become slow and unresponsive. Investigation identified that the primary web application server had entered a degraded state after running under sustained high CPU load for approximately two weeks.
During the incident, a series of unhandled application errors caused PHP-FPM worker processes to hang, resulting in elevated CPU utilisation on the primary server. The degradation became severe enough that a remote reboot of the instance through AWS did not succeed.
Engineering worked to restore service by reintroducing a previous web server and later deploying a new replacement server behind the load balancer. While these actions partially restored access for many customers, additional routing inconsistencies were identified where some customer domains bypassed the load balancer entirely and pointed directly to the failed server. This caused continued intermittent failures for affected customers until those paths were identified and corrected.
The incident was fully stabilised after the temporary replacement environment was removed and routing behaviour was normalised.
Figure: Elevated utilisation, with the healthier utilisation following remediation actions outlined in red.
Impact
Administrators across multiple providers experienced significant slowness, intermittent failures, and periods where the Visualcare web application was unavailable during the incident window.
Worker and Participant access through the Visualcare mobile applications was unaffected.
No data loss or compromise occurred during the incident.
Timeline
09:10: Customers begin reporting unresponsive behaviour in the Visualcare web application. Monitoring shows elevated CPU utilisation on the primary web application server.
09:10 - 09:45: Engineering investigation and triage begins.
09:45: Previous web application server temporarily reintroduced into the load balancer to restore capacity.
09:45 - 10:30: Some customers experience SSL/SNI-related access issues associated with the reintroduced server.
10:30: Previous web application server removed from the load balancer.
12:40: New replacement web application server added into the load balancer.
12:40 - 14:00: Many customers regain access; however, intermittent errors continue for some providers.
14:00: Traffic routing adjusted to direct all traffic toward the new server.
14:15: Engineering identifies that some customer domains/TLDs were configured to point directly to the original web server rather than routing through the load balancer, causing failures for affected customers.
14:15 - 16:30: Routing corrections and continued investigation into login instability on the replacement server.
16:30: Replacement server removed from service due to ongoing login-related instability while remediation work continued.
Root Cause
On May 19, a series of unhandled application errors triggered a condition where PHP-FPM worker processes became stuck and did not recover. As the number of hung workers increased, CPU utilisation escalated until the server became unresponsive to both application traffic and infrastructure management operations.
Because some customer domains were configured to bypass the load balancer and point directly to the affected server, those customers continued experiencing failures even after alternate infrastructure was introduced.
Attempts to restore service using both a previous server and a newly provisioned server improved availability for many customers but introduced secondary issues, including SSL/SNI mismatches and login instability, which prolonged the incident duration.
Resolution
Engineering restored partial service by temporarily reintroducing a previous web application server into the load balancer, followed by deployment of a newly provisioned replacement server.
During investigation, it was discovered that some customer domain configurations bypassed the load balancer entirely. These configurations were corrected so traffic could route through the intended infrastructure path.
The affected server was ultimately isolated from production traffic while further remediation and stability work continued.
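The direct-to-server routing described above can be checked mechanically. The following is a minimal sketch, not the tooling used during the incident: it resolves each customer domain and flags any that resolve to the original server's address instead of the load balancer. The IP addresses, the status labels, and the function names (resolve_ipv4, classify, audit) are hypothetical placeholders; the real addresses are not stated in this report. The sketch also assumes the load balancer has a known, stable set of addresses. If it does not, the allowed set would need to be resolved from the load balancer's own DNS name at audit time.

```python
import socket
from typing import Callable, Dict, Iterable, Set

# Hypothetical documentation-range addresses; substitute the real values.
LOAD_BALANCER_IPS: Set[str] = {"203.0.113.10", "203.0.113.11"}
ORIGIN_SERVER_IPS: Set[str] = {"198.51.100.25"}


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses a hostname currently resolves to."""
    try:
        infos = socket.getaddrinfo(
            hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # Resolution failed; report this as an empty result.
        return set()
    return {info[4][0] for info in infos}


def classify(addresses: Set[str], lb_ips: Set[str], origin_ips: Set[str]) -> str:
    """Label a domain according to where its DNS records point."""
    if not addresses:
        return "unresolved"
    if addresses & origin_ips:
        # At least one record points straight at the web server.
        return "bypasses-load-balancer"
    if addresses <= lb_ips:
        return "via-load-balancer"
    return "unknown-target"


def audit(
    domains: Iterable[str],
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
    lb_ips: Set[str] = LOAD_BALANCER_IPS,
    origin_ips: Set[str] = ORIGIN_SERVER_IPS,
) -> Dict[str, str]:
    """Map each domain to its routing label."""
    return {d: classify(resolver(d), lb_ips, origin_ips) for d in domains}
```

Calling audit with the list of customer domains, once the real address sets are filled in, returns a mapping from each domain to one of the four labels. Any label other than via-load-balancer warrants a manual look.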
Follow-Up Actions
Short Term
Identify and remediate the unhandled application errors causing PHP-FPM worker hangs.
Audit all customer domain and TLD routing to ensure traffic consistently traverses the load balancer layer.
Review PHP-FPM worker timeout, recycling, and monitoring configuration to improve recovery behaviour under fault conditions.
Accelerate the planned containerisation and migration of the Visualcare web application and API into the new production environment.
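As one possible input to the PHP-FPM monitoring review listed above, the sketch below polls the PHP-FPM status page and reports when the worker pool looks saturated. It assumes the status page is enabled through the pool's pm.status_path directive and requested with the json query parameter. The URL, the 0.9 threshold, and the function names (fetch_status, assess_pool) are hypothetical and not taken from the Visualcare environment. The timeout and recycling side of the same review would normally involve pool directives such as request_terminate_timeout and pm.max_requests; this report does not state how they are currently set.

```python
import json
import urllib.request
from typing import List, Optional


def fetch_status(url: str, timeout: float = 3.0) -> Optional[dict]:
    """Fetch and parse the PHP-FPM status page; return None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (OSError, ValueError):
        # Covers connection errors, timeouts, and invalid JSON.
        return None


def assess_pool(status: Optional[dict], busy_ratio: float = 0.9) -> List[str]:
    """Return warnings for a parsed status payload (empty list means healthy)."""
    if status is None:
        # If every worker is hung, the status page itself may stop answering.
        return ["status page unreachable or invalid"]

    warnings: List[str] = []
    total = status.get("total processes", 0)
    active = status.get("active processes", 0)
    if total > 0 and active / total >= busy_ratio:
        warnings.append(f"{active}/{total} workers busy")

    queued = status.get("listen queue", 0)
    if queued > 0:
        warnings.append(f"{queued} requests waiting in listen queue")
    return warnings
```

A scheduler could call assess_pool(fetch_status("http://127.0.0.1/fpm-status?json")) once a minute, where the URL is a placeholder, and raise an alert whenever the returned list is non-empty. An unreachable status page is treated as a warning by design, because a pool whose workers are all hung cannot serve its own status request.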