Node in USEast is down
Incident Report for Webscale STRATUS
Postmortem

Yesterday afternoon one of our technicians was investigating higher-than-usual load on Kubernetes' internal network, which we have seen since moving us-east fully onto 2.11. The eBPF program he executed conflicted with a previously loaded diagnostics kernel module and segfaulted, leaving the instance hung until it was rebooted. Once the reboot completed, the sites came back online.

The investigation was done on production because we have been unable to reproduce the condition in dev, or in Frankfurt, which has been on 2.11 for a week. The technician has been given an informal verbal warning.

One goal of 2.11 is faster recovery. Where we previously saw recovery times of around 40 minutes, this incident took about 15. We will keep pushing that number down.
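A conflict between a freshly attached probe and an already-loaded diagnostics kernel module is easy to miss until the node hangs. As a minimal sketch of the kind of pre-flight check that could catch it (the module name `diag_probe` is a placeholder, not our actual tooling), one can consult `/proc/modules` before attaching anything:

```shell
# Hypothetical pre-flight check before attaching an eBPF-based probe on a
# production node: refuse to proceed if a diagnostics kernel module that is
# known to conflict is already loaded. "diag_probe" is an illustrative name.
if grep -q '^diag_probe ' /proc/modules 2>/dev/null; then
    echo "conflicting diagnostics module loaded; do not attach probe"
else
    echo "no conflicting module found; safe to attach probe"
fi
```

A fuller check would also list eBPF programs already attached on the node (for example with `bpftool prog show`) before loading new ones.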

Posted Aug 20, 2019 - 12:57 EDT

Resolved
This incident has been resolved.
Posted Aug 19, 2019 - 14:57 EDT
Identified
The issue has been identified and stores are coming back online now.
Posted Aug 19, 2019 - 14:48 EDT
Investigating
We are currently investigating this issue.
Posted Aug 19, 2019 - 14:25 EDT
This incident affected: Webscale STRATUS - Northern Virginia.