Today's outage occurred at nearly the same time of day as Monday's incident, to within minutes, which suggests the two may be related. Our investigation so far has found a significant surge in traffic during these incidents, which caused one of the five database servers to fail and triggered a cascading failure. To ensure data integrity, we promptly initiated the restoration process, and no data loss has occurred. Further analysis is underway to identify the root cause of these traffic spikes and prevent future disruptions.
At 12:10 PM the primary database cluster for Allotrac experienced an issue that caused server number 3 to crash. The cluster automatically detected that a server had crashed in a manner that could lead to data sync issues and safely disabled access to the database. This caused affected customers to lose access to Allotrac while the issue was addressed.
By 12:13 PM the Allotrac DevOps team had verified the data integrity of the remaining database cluster servers and selected server 0 to initiate an automated restoration. Because only one server remained in sync, the recovery process was more involved than Monday's recovery.
At 2:25 PM the automated recovery of the cluster for server 1 was completed and the Allotrac DevOps team began restoring access to affected customers.
At 3:45 PM access was safely restored to all customers.
Work is still underway to restore the prior number of database servers so that performance is not degraded.
Posted Oct 31, 2024 - 16:14 AEDT
Update
Expect periods of slowness whilst additional database servers come online
Posted Oct 31, 2024 - 15:33 AEDT
Monitoring
Service restored, monitoring performance
Posted Oct 31, 2024 - 15:16 AEDT
Update
Restoring Service
Posted Oct 31, 2024 - 14:26 AEDT
Update
Shortly commencing service restoration
Posted Oct 31, 2024 - 14:05 AEDT
Identified
The issue has been identified. We have initiated recovery.