Printix Sev-1 Incident 11th May 2023
The Printix service was down for 2 hours 9 minutes, from 17:11 UTC to 19:20 UTC on 11th May, due to a failure at the externally hosted Cassandra database service provider.
17:11 UTC 11th May:
Cloud Services team notified by monitoring and by the Technical Support team of authentication failures and customer reports of service failure.
17:30 UTC 11th May:
Checks on the Kubernetes service found multiple pods restarting after failures; further checks identified the Cassandra database service as the source of connection errors.
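Checks of this kind can be sketched with kubectl and cqlsh. The namespace, pod, and host names below are placeholders for illustration, not values from the incident:

```shell
# List pods with restart counts to spot crash-looping workloads
# (namespace "printix" is an assumed placeholder)
kubectl get pods -n printix \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount

# Show recent events for a suspect pod to see why it restarted
kubectl describe pod <pod-name> -n printix | tail -n 20

# Probe the Cassandra endpoint directly; a connection error here
# points at the database service rather than at our pods
cqlsh <cassandra-host> 9042 -e "SELECT release_version FROM system.local;"
```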
17:45 UTC 11th May:
A case was created with the service provider, and the failure was confirmed by the metrics in the provider's portal. A hierarchical escalation was made to the service provider's account manager by the Senior Director for Cloud Services. Information was provided by us on request to assist troubleshooting.
18:55 UTC 11th May:
Following action by the service provider, we were asked to retry connections. It was possible to connect to the database remotely and to restart several external-auth Kubernetes pods successfully. At this point the service began to recover, and we were also able to log into the partner portal again.
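The recovery steps above can be sketched as follows; the deployment name external-auth is taken from the incident notes, while the namespace and host are assumed placeholders:

```shell
# Once the provider confirms the database is reachable again, verify
# connectivity before touching the workloads
cqlsh <cassandra-host> 9042 -e "SELECT now() FROM system.local;"

# Restart the affected pods so they re-establish database connections,
# then wait for the rollout to complete
kubectl rollout restart deployment/external-auth -n printix
kubectl rollout status deployment/external-auth -n printix
```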
19:20 UTC 11th May:
All PagerDuty alerts cleared and Grafana metrics reached a normal state again.
The root cause was a failed change by our Cassandra database service provider to update Kubernetes pods connected to our own Kubernetes-based service. The updated pods were added back into our service too early, creating a race condition that caused the Printix service to fail. Once these pods were removed from our service, Printix became usable once more.
Total service loss: 2 hours 9 minutes.
- Service provider to deliver a permanent code fix for future updates.
- Service provider to provide a final RCA following the call held with the Kofax technical team.
- Check with the service provider what options exist to better protect us from a similar failure mode in the future.
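One option on our side, sketched here purely as a hypothetical mitigation (the script, host variable, and probe wiring are assumptions, not an agreed fix): gate pod readiness on database connectivity, so pods behind a half-completed database update never report Ready and receive traffic prematurely.

```shell
#!/bin/sh
# Hypothetical readiness-check script, to be wired into an exec
# readinessProbe: report success only if the Cassandra endpoint
# answers a trivial query. Host is a placeholder; the port is the
# standard CQL native transport port.
cqlsh "${CASSANDRA_HOST:-cassandra.example.com}" 9042 \
  -e "SELECT release_version FROM system.local;" > /dev/null 2>&1
exit $?
```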