Node under emergency maintenance
Incident Report for Webscale STRATUS
Postmortem

Here's what happened:

On the worker instances for Starter and Pro, we run a set of gp2 EBS volumes for files and databases, plus an st1 EBS volume that receives hourly and daily ZFS-level snapshots.
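
For illustration, the snapshot flow looks roughly like this (pool, dataset, and snapshot names are placeholders, not our actual layout, and the previous hourly snapshot is assumed to already exist on both pools):

    # Take an hourly ZFS snapshot of the data pool that lives on the gp2 volumes.
    NOW=$(date +%Y%m%d%H)
    PREV=$(date -d '1 hour ago' +%Y%m%d%H)
    zfs snapshot tank/data@hourly-$NOW

    # Send it incrementally to the backup pool on the st1 volume.
    zfs send -i tank/data@hourly-$PREV tank/data@hourly-$NOW | zfs recv backup/data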

EBS volumes have a setting that protects them from being deleted when their instance is terminated. On the affected instance, 5 of 7 volumes did not have that termination protection. This was a regression in our 2.9 code, limited to Northern Virginia and a small percentage of volumes on Starter / Pro workers.
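
What we call termination protection here corresponds, on the AWS side, to the DeleteOnTermination flag on each EBS attachment. As a sketch (instance ID and device name are placeholders), the setting can be inspected and corrected like this:

    # List each attached volume and whether it would be deleted on terminate.
    aws ec2 describe-volumes \
      --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
      --query 'Volumes[].Attachments[].[VolumeId,Device,DeleteOnTermination]' \
      --output table

    # Keep a volume (e.g. the st1 backup device) through instance termination.
    aws ec2 modify-instance-attribute \
      --instance-id i-0123456789abcdef0 \
      --block-device-mappings '[{"DeviceName":"/dev/xvdf","Ebs":{"DeleteOnTermination":false}}]'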

Our normal on-call tech was out sick last night, and the backup on-call tech mistakenly issued a terminate command on one of the Starter / Pro instances. The EBS volumes that did not have termination protection were permanently removed on the AWS side, including the st1 volume where the backup snapshots are held.

Here’s why we chose this configuration:

We chose this configuration for the speed of restoration. We first looked at sending ZFS snapshots to S3. However, the amount of data that would need to be restored is 3 TB to 10 TB, and in our tests restoring that from S3 would take up to 2 days. Using a local st1 volume we can achieve restoration speeds of up to 500 MB/s, which brings restoration time down significantly: hours instead of days.
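
As a rough sanity check on those figures: 10 TB at a steady 500 MB/s is about 10,000,000 MB / 500 MB/s = 20,000 seconds, roughly 5.5 hours, and 3 TB is about 6,000 seconds, under 2 hours.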

Why didn't we use EBS-level snapshots of the gp2 volumes? The gp2 volumes are members of a ZFS RAID array. EBS snapshots are taken per volume, and their timing cannot be synchronized across all gp2 volumes, so the resulting set of snapshots would not be restorable in ZFS.

Why didn't we have EBS-level snapshots of the st1 volume? Good question. It is obvious now, and it was in our initial designs, but it was backlogged. That was the weakness in our design: we were missing snapshots of the snapshots.

Here’s what we’re doing to fix it:

We fixed the code regression. We added an audit script that runs daily and checks termination protection on all EBS volumes, and we verified that every volume is now configured correctly.
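
As a sketch of what that daily audit checks (not our production script; one region shown and output handling simplified):

    #!/bin/bash
    # Flag any attached EBS volume that would be deleted if its instance
    # were terminated (DeleteOnTermination still set to true).
    set -euo pipefail
    REGION=us-east-1   # repeated for every region in practice

    aws ec2 describe-volumes \
      --region "$REGION" \
      --query 'Volumes[].Attachments[?DeleteOnTermination==`true`].[InstanceId,VolumeId,Device]' \
      --output text |
    while read -r instance volume device; do
      echo "WARNING: $volume ($device on $instance) would be deleted on terminate"
    done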

Here’s what we’re going to do better:

We're rolling out EBS snapshots of the st1 volume that holds the gp2 ZFS snapshots. These EBS snapshots are stored in S3. If an st1 volume fails or is lost, we will be able to restore it from a snapshot and then restore the gp2 volumes from it. This will be fully deployed in all regions within the next 24-48 hours.
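
For illustration, the backup and restore path looks roughly like this (volume, snapshot, and instance IDs, the availability zone, and the pool names are placeholders):

    # Back up: snapshot the st1 volume that holds the ZFS backup pool.
    aws ec2 create-snapshot --volume-id vol-0aaaabbbbccccdddd \
      --description "stratus st1 backup pool"

    # Restore after a loss: build a new st1 volume from the snapshot, attach it,
    # import the backup pool, then roll the gp2 pool back from the ZFS snapshots.
    aws ec2 create-volume --snapshot-id snap-0111122223333444 \
      --availability-zone us-east-1a --volume-type st1
    aws ec2 attach-volume --volume-id vol-0eeeeffff00001111 \
      --instance-id i-0123456789abcdef0 --device /dev/xvdf
    zpool import backup
    zfs send backup/data@hourly-2019020601 | zfs recv -F tank/data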

We’re adding the AWS CLI to Stratus SSH and helping customers automate backups to their own S3 buckets.
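
As an example of what that could look like from a Stratus SSH session once the AWS CLI is in place (bucket name, database name, and paths are placeholders you would replace with your own):

    # Dump the database and copy it, plus code and media, to your own S3 bucket.
    mysqldump --single-transaction exampledb | gzip > /tmp/exampledb-$(date +%F).sql.gz
    aws s3 cp /tmp/exampledb-$(date +%F).sql.gz s3://your-backup-bucket/db/
    aws s3 sync /srv/public_html s3://your-backup-bucket/files/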

We're adding to our UI roadmap the ability for customers to enter their own S3 bucket and set the frequency of their own backups (in addition to ours). This will make automating your own backups much easier.

We’re sorry.

We know this was a major mistake on our part. We want Stratus to be the most reliable, stable, performant, and easy-to-use Magento hosting and deployment platform, and with this event we failed to deliver on that promise. We want customers migrating from MHM to Stratus to be 100% confident in the new platform before moving, so we're extending the MHM migration deadline by one additional month, from March 1 to April 1, to give customers more time to feel comfortable on the new product before moving.

Posted Feb 06, 2019 - 14:17 EST

Resolved
This incident has been resolved.
Posted Feb 06, 2019 - 14:16 EST
Update
Restores are continuing, and our staff is updating individual tickets and customers as fast as we can. Please note that a lack of reply does not mean we are not working on your issue.
Posted Feb 06, 2019 - 10:41 EST
Update
Some customers have been restored. We're continuing to restore the remaining affected customers.
Posted Feb 06, 2019 - 08:00 EST
Update
Some customers have been restored. We're continuing to restore the remaining affected customers.
Posted Feb 06, 2019 - 05:01 EST
Update
Some customers have been restored. We're continuing to restore the remaining affected customers.
Posted Feb 06, 2019 - 01:52 EST
Update
We are launching the stores now and will begin recovery shortly.
Posted Feb 05, 2019 - 23:12 EST
Update
An EC2 instance became unresponsive. A tech came to investigate and was unable to log in to the instance. He attempted to restart the instance, and the instance failed to restart. The tech then performed a manual terminate instead of a stop/start, and the instance terminated. This is not operating procedure, and a terminate should not have been performed. A terminate would normally be OK. However, 5 of 7 EBS volumes, including its backup volumes, did not have termination protection enabled. We identified a regression in the launch configuration that caused these volumes to be launched without termination protection. The regression has been fixed. We also identified several other volumes without termination protection and have enabled termination protection for those volumes.
Posted Feb 05, 2019 - 23:01 EST
Update
We are continuing to investigate the issue.
Posted Feb 05, 2019 - 21:39 EST
Investigating
We are rebooting a node for unexpected emergency maintenance.
Posted Feb 05, 2019 - 21:02 EST
This incident affected: Webscale STRATUS - Northern Virginia.