Page fault or other system failure - single node affected
Incident Report for Webscale STRATUS
Postmortem

A single node rebooted twice due to a kernel panic. We are working to reproduce the issue on our development build to confirm that our fixes work. This particular panic is fixed in a newer kernel version. The issue is infrequent, and the root cause is a specific task load.

We are also working with AWS on fixes for debugging software that does not work correctly on their instances (an AWS bug). ENA devices lack the driver support from AWS needed to capture console and other logging output leading up to a panic, so that diagnostic output is lost. AWS has confirmed the issue, and their support team is working closely with our development team. Delays on the AWS side are why an issue that would otherwise be straightforward to investigate is taking a long time to correct.

Posted Feb 18, 2019 - 14:00 EST

Resolved
Services are now restored.
Posted Feb 14, 2019 - 22:36 EST
Identified
A single hardnode is rebooting and services are already recovering. A full RCA will be provided by EOD Friday, once our internal team has fully analyzed the logs.

Again, we are recovering services now. No data loss has occurred, only a service interruption.
Posted Feb 14, 2019 - 20:56 EST