Incident report
Printix Sev-1 Incident, 30 & 31 March 2023
Overview
The Printix service was disrupted for 1 hour 45 minutes, from 10:40pm UTC on 30th March to 12:25am UTC on 31st March, due to issues with the Kafka cluster. Service was restored by reindexing and restarting the Kafka cluster; additional configuration changes were then made to network timeout settings within Kafka to ensure stability.
What Happened
10:40pm UTC 30th March:
The Kafka cluster was restarted to resolve a connection issue in which the external events service could not connect to Kafka. During this time we received alerts indicating that Kafka messages were not being processed.
11:00pm UTC 30th March:
The external events service issue was resolved; however, the Kafka cluster remained unstable and could not process messages for more than a few minutes before going offline again.
11:20pm UTC 30th March:
The cluster was restarted twice but went offline again after a few minutes each time. The cluster was then stopped while the team worked on reindexing Kafka.
12:25am UTC 31st March:
The cluster was brought back online and remained stable. Close monitoring of the platform continued.
2:30pm UTC 31st March:
Ongoing platform monitoring identified performance issues within Kafka.
5:00pm UTC 31st March:
A review of past Kafka issues, combined with the close monitoring, identified a network timeout value within Kafka that needed to be adjusted. A decision was made to apply this change urgently. There was no impact to users during this change. After the change the platform quickly returned to normal operation and has remained stable since.
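The report does not name the specific setting that was changed. As a hedged illustration only, a broker-side adjustment of this kind might resemble the following lines in server.properties (the settings and values shown are placeholders, not the ones actually deployed):

    # server.properties - illustrative only; the actual setting and values
    # changed during this incident were not published in this report.
    # Session timeout for the broker's ZooKeeper connection; too low a value
    # can cause brokers to be declared dead during transient network slowness.
    zookeeper.session.timeout.ms=18000
    # Timeout for establishing the initial ZooKeeper connection.
    zookeeper.connection.timeout.ms=18000
    # Socket timeout for replica fetch requests between brokers.
    replica.socket.timeout.ms=30000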
Resolution
A Kafka subcomponent was unable to view the state of the nodes in the cluster due to an issue with its data directory; this was causing the Kafka cluster to become unstable and go offline again after a few minutes. The issue was resolved by rebuilding the data directory and reindexing the Kafka cluster. A network timeout setting was adjusted to ensure the future stability of the platform.
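The report does not identify the affected subcomponent or the exact recovery commands. A minimal sketch of this kind of recovery, assuming a single broker installed under /opt/kafka with its data directory at /var/kafka-logs and replicated topics to recover from, might look like:

    # Illustrative recovery sketch only - not the exact commands used.
    # 1. Stop the affected broker.
    /opt/kafka/bin/kafka-server-stop.sh

    # 2. Move the suspect data directory aside and recreate it; with
    #    replicated topics, the broker re-replicates data from its peers.
    mv /var/kafka-logs /var/kafka-logs.corrupt
    mkdir /var/kafka-logs

    # 3. Restart the broker; on startup Kafka rebuilds its log index
    #    files, which corresponds to the "reindexing" described above.
    /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties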
Root Causes
The initial cause was the cluster restart performed to resolve the Printix service issue, during which a Kafka subcomponent's data directory became corrupted. An additional underlying network timeout setting was identified and changed.
Impact
Degradation of service on 30th March: customers were unable to execute print jobs during the incident, which lasted approximately 1 hour 45 minutes. Delays of varying duration were possible for an additional 45 minutes while the Printix platform processed the backlog of jobs.
Degradation of service on 31st March: some customers may have experienced various error messages and delays on the platform. There was no outage associated with the network timeout change.
Action Items
- Add extra nodes to the Kafka cluster for high availability and rolling restarts (a configuration sketch follows below).
- Document the steps taken, for quick resolution of similar issues in future.
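As a hedged sketch of the first item: a highly available Kafka cluster typically pairs additional brokers with replication settings along these lines in server.properties (illustrative values, not Printix's actual configuration):

    # Illustrative HA settings - not Printix's actual configuration.
    # With at least three brokers, topics can be replicated so that one
    # broker at a time can be restarted without taking the cluster offline.
    default.replication.factor=3
    # Require two in-sync replicas before acknowledging writes, so a single
    # broker restart does not interrupt producers.
    min.insync.replicas=2
    # Avoid electing out-of-sync replicas as leaders after a failure.
    unclean.leader.election.enable=false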