A node has rebooted and is under emergency maintenance.

Incident Report for Webscale STRATUS

Postmortem

As part of our implementation of High Availability for stores hosted on Stratus, additional Kubernetes pods were launched to handle MySQL replication across production nodes. These additional pods contained a liveliness check, which is a simple ping of the MySQL port that ensures that MySQL is running for the replication pod. These pings are executed through Docker, however, and ended up triggering a bug in systemd, where dbus connections would build up over time until they eventually exhausted. This bug caused all pods on the effected node to start terminating, and prevented their restart while docker was locked up due to the dependency on systemd. What made this more difficult to diagnose was that the replication pods had been running for over a month without any issues, and it took this length of time for it to manifest. Along with that, there was a lack of information pointing to systemd as the root cause within the logs. These liveliness checks have been removed so that this bug will no longer be triggered, and we are investigating the possibility of upgrading systemd so that this issue will be fixed completely.

Posted Jan 08, 2020 - 19:23 EST

Resolved

This incident has been resolved.

Posted Dec 30, 2019 - 08:12 EST

Identified

The issue has been identified and a fix is being implemented.

Posted Dec 30, 2019 - 07:29 EST

This incident affected: Webscale STRATUS - Northern Virginia.