This afternoon one of our techs was investigating higher-than-usual load on Kubernetes' internal network, which we've been seeing since going 100% on 2.11 in us-east. The eBPF program he ran conflicted with a previously loaded diagnostics kernel module and segfaulted; the instance hung and had to be rebooted, after which the sites came back. He was working on prod because we've been unable to reproduce the condition in dev (nor in Frankfurt, which has been on 2.11 for a week). He's been given an informal verbal warning.

One part of 2.11 is faster recovery. Previously we saw recovery times around 40 minutes; this time it was around 15. We'll keep pushing that down.
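
Since we can't reproduce the conflict outside prod, one cheap mitigation is a pre-flight guard that refuses to attach an ad-hoc eBPF probe while a known-conflicting diagnostics module is loaded. A minimal sketch; the module name "diag_mod" and the denylist are hypothetical placeholders, not our actual module names:

```python
# Hypothetical pre-flight check to run before attaching an eBPF probe on a
# prod node. "diag_mod" and DENYLIST are illustrative, not our real names.

DENYLIST = {"diag_mod"}  # diagnostics modules known (or suspected) to conflict

def loaded_modules(path="/proc/modules"):
    """Names of currently loaded kernel modules (first column of /proc/modules)."""
    with open(path) as f:
        return [line.split()[0] for line in f]

def conflicting_modules(loaded, denylist=DENYLIST):
    """Modules that are both currently loaded and on the conflict denylist."""
    return sorted(set(loaded) & set(denylist))

if __name__ == "__main__":
    clash = conflicting_modules(loaded_modules())
    if clash:
        raise SystemExit(f"refusing to attach probe; conflicting modules: {clash}")
    print("no known conflicts; safe to attach probe")
```

Wiring a check like this into whatever wrapper the techs use to launch probes would make the safe path the default, rather than relying on someone remembering what's loaded on a given node.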