AWS Node Failure
Incident Report for Webscale STRATUS
Postmortem

Tech 1 receives an alert that a node (EC2 instance) is performing poorly, with a load average around 300 and climbing. CPU is low for both user and system time, memory looks fine, and iowait is around 3%.
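
For context: a load average of ~300 alongside low user/system CPU and low iowait typically means a large number of threads stuck in uninterruptible sleep (D state), since Linux counts those toward the load average even though they use no CPU. A minimal sketch of the kind of check that confirms this (not a tool from the incident, just an illustration):

```python
#!/usr/bin/env python3
"""List processes in uninterruptible sleep (D state). On Linux these count
toward the load average without using CPU. Illustrative sketch only."""
import os

def d_state_processes():
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                raw = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain spaces,
        # so locate the state field relative to the closing parenthesis.
        comm = raw[raw.index("(") + 1 : raw.rindex(")")]
        state = raw[raw.rindex(")") + 2]
        if state == "D":
            stuck.append((pid, comm))
    return stuck

if __name__ == "__main__":
    procs = d_state_processes()
    print(f"{len(procs)} processes in uninterruptible sleep")
    for pid, comm in procs:
        print(pid, comm)
```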

Tech 1 then sees HTTP site-down alerts firing. Kubernetes starts evicting pods, and customers begin seeing their sites go down.

Tech 1 does not see any obvious cause and reboots the instance.

The instance comes back up with the same high load average.

Tech 1 suspects the underlying AWS hardware is bad, so Tech 1 stops the instance and then starts it again, which forces it onto different hardware.
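
A reboot keeps an EC2 instance on the same physical host; a full stop followed by a start is what allows it to land on different hardware. Roughly the equivalent with boto3 (the instance ID and region here are placeholders):

```python
import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder for the affected node
ec2 = boto3.client("ec2", region_name="us-east-1")

# A reboot stays on the same host; a full stop releases the underlying
# hardware, so the subsequent start can place the instance elsewhere.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```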

The same behavior appears on the new hardware: the load average stays high.

At the same time, Tech 2 is fixing file permissions for a customer. Instead of working inside the SSH container, he is working directly on the node's filesystem. He gets a chown command wrong and recursively chowns everything.

Tech 2 realizes the mistake, reports it to Tech 1, and starts to fix it. The load average remains high, and the techs determine the permissions problem is not what is causing it.

Tech 1 sees ZFS reporting high fragmentation and a nearly fully allocated pool. Tech 3 notices the ZFS L2ARC is unavailable; Tech 1 restores the L2ARC, but the high load average continues.
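
The signals described here are visible from standard ZFS tooling; roughly the checks below, wrapped in Python only to keep the examples in one language (the pool name is a placeholder):

```python
import subprocess

POOL = "tank"  # placeholder pool name

# Fragmentation (FRAG) and allocated capacity (CAP): very high fragmentation
# on a nearly full pool is the combination seen during this incident.
subprocess.run(
    ["zpool", "list", "-o", "name,size,alloc,free,frag,cap", POOL],
    check=True,
)

# Pool status, including the state of any cache (L2ARC) devices.
subprocess.run(["zpool", "status", "-v", POOL], check=True)
```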

Tech 1 adds another volume to the ZFS pool. With enough free space to allocate into, ZFS stops struggling against the fragmentation and the load average drops immediately.
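
Growing the pool amounts to attaching a fresh EBS volume to the instance and adding it as a new top-level vdev; something like the following (the device path and pool name are placeholders, and adding a vdev is hard to undo, so the device should be double-checked first):

```python
import subprocess

POOL = "tank"             # placeholder pool name
NEW_DEVICE = "/dev/xvdf"  # placeholder device path for the new EBS volume

# Adding a top-level vdev grows the pool immediately, giving ZFS free space
# to write into instead of searching a nearly full, fragmented pool.
subprocess.run(["zpool", "add", POOL, NEW_DEVICE], check=True)

# Confirm the new capacity and watch allocation pressure fall off.
subprocess.run(
    ["zpool", "list", "-o", "name,size,alloc,free,frag,cap", POOL],
    check=True,
)
```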

Tech 2 remembers seeing this high load average before: it happens when Kubernetes sees a node running low on disk. Tech 2 also finds that the alerting system's process is dead, which is why no low-disk alert ever came in.
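
The eviction behavior corresponds to the node reporting disk pressure, and that condition is visible directly from the Kubernetes API, for example (node name is a placeholder):

```python
import json
import subprocess

NODE = "stratus-node-1"  # placeholder node name

# DiskPressure=True on a node is what drives the kubelet to evict pods the
# way it did during this incident.
out = subprocess.run(
    ["kubectl", "get", "node", NODE, "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout
for cond in json.loads(out)["status"]["conditions"]:
    print(cond["type"], cond["status"], cond.get("message", ""))
```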

The node is healthy again and pods begin running, but customers continue to see errors because of the incorrect permissions.

Tech 2 repairs the permissions, and all customer sites come back online.

Going forward, we are implementing a separate heartbeat check on the existing monitoring service so a dead alerting process is itself detected. All techs are being briefed on the symptoms of this problem so it is not misdiagnosed again, and the tech who ran the incorrect command has been addressed administratively.
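
A minimal sketch of the kind of heartbeat check being added, assuming the alerting process can be identified by name and that pings go to an external dead-man's-switch endpoint (both the process name and the URL below are placeholders):

```python
#!/usr/bin/env python3
"""Cron-able heartbeat: confirm the alerting process is alive and ping an
external endpoint, so a dead monitor is itself alarmed on. Sketch only."""
import subprocess
import sys
import urllib.request

ALERT_PROCESS = "alertd"  # placeholder name of the alerting process
HEARTBEAT_URL = "https://heartbeat.example.com/ping/stratus-monitoring"  # placeholder

def process_running(name: str) -> bool:
    # pgrep exits 0 when at least one process matches the exact name.
    return subprocess.run(["pgrep", "-x", name],
                          stdout=subprocess.DEVNULL).returncode == 0

if __name__ == "__main__":
    if not process_running(ALERT_PROCESS):
        print(f"monitoring process {ALERT_PROCESS!r} is not running", file=sys.stderr)
        sys.exit(1)
    # Only ping the dead-man's switch when the local check passes; if the
    # pings stop, the external service raises the alarm.
    urllib.request.urlopen(HEARTBEAT_URL, timeout=10)
```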

Posted Jan 23, 2019 - 10:58 EST

Resolved
This incident has been resolved.
Posted Jan 22, 2019 - 22:23 EST
Identified
A STRATUS node had an underlying hardware failure that is affecting some customers at the moment. The node is being moved to different hardware.
Posted Jan 22, 2019 - 20:44 EST
This incident affected: Webscale STRATUS - Northern Virginia.