A node has rebooted and is under emergency maintenance.

Incident Report for Webscale STRATUS

Postmortem

At approximately 1:08PM EDT we received an alert that a stratus node was offline. We began investigating and found that AWS was reporting all status checks on the instance were failing. At this point we issued a reboot to the downed instance. While waiting for the node to come back up a second node reported down. Both nodes then became responsive minutes later and we then began recovery of containerized services of those nodes.

A support case was opened with AWS and they confirmed that there was an issue with the underlying server which they had automatically rebooted. AWS also confirmed that both of these instances resided on that same physical hardware. The report from AWS and the timeline of how the nodes were rebooted at different times indicates this was a fault at the hypervisor level and not hardware. Systems have been stable since the nodes have been recovered.

Posted Aug 08, 2019 - 15:45 EDT

Resolved

All stores have been verified to be operational.

Posted Aug 08, 2019 - 15:28 EDT

Monitoring

All stores are back up. We are continuing to monitor for any connection issues.

Posted Aug 08, 2019 - 13:57 EDT

Investigating

We are currently investigating this issue.

Posted Aug 08, 2019 - 13:16 EDT

This incident affected: Webscale STRATUS - Northern Virginia.