Here's what happened:
On the worker instances for Starter and Pro, we run a set of gp2 EBS volumes for files and databases, plus an st1 volume that receives hourly and daily ZFS-level snapshots.
EBS volumes have a setting (the DeleteOnTermination flag) that controls whether they are deleted when their instance is terminated. On one instance, 5 of 7 volumes did not have termination protection enabled. This was a regression in our 2.9 code, and it affected only N. Virginia and a small percentage of drives on Starter / Pro workers.
Our normal on-call tech was out sick last night, and the backup on-call tech mistakenly issued a terminate command on one of the Starter / Pro instances. The EBS volumes that did not have termination protection were permanently deleted on the AWS side. This included the st1 volume where the backup snapshots are held.
Here’s why we chose this configuration:
We chose this configuration for the speed of restoration. We first looked at sending ZFS snapshots to S3. However, the amount of data that would need to be restored is 3TB to 10TB, and in our tests restoring that from S3 could take up to 2 days. Using a local st1 volume we are able to achieve restoration speeds of up to 500MB/s, resulting in a significantly lower restoration time: hours instead of days.
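The arithmetic behind that estimate is straightforward; this sketch just applies the data sizes and throughput quoted above (sustained throughput is assumed, so real restores will vary):

```python
# Back-of-the-envelope restore times for the 3TB-10TB range at st1 speeds.

def restore_hours(data_tb: float, speed_mb_s: float) -> float:
    """Hours to restore data_tb terabytes at speed_mb_s megabytes/second."""
    megabytes = data_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
    return megabytes / speed_mb_s / 3600

print(round(restore_hours(3, 500), 1))   # smallest dataset: ~1.7 hours
print(round(restore_hours(10, 500), 1))  # largest dataset: ~5.6 hours
```

At 500MB/s even the 10TB worst case restores in well under a day, versus roughly 2 days from S3.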
Why didn't we use EBS-level snapshots of the gp2 volumes? The gp2 volumes form a ZFS RAID array, and EBS snapshots of each volume cannot be taken at exactly the same instant, so the resulting set of snapshots would not be consistent across all gp2 volumes in a way ZFS could restore.
Why didn't we have EBS-level snapshots of the st1? Good question. It is obvious now, and was in our initial design, but was backlogged. This was the weakness in our design: missing snapshots of the snapshots.
Here’s what we’re doing to fix it:
We fixed the code regression. We added an audit script that runs daily to check termination protection on all EBS volumes, and we verified that every volume is now set correctly.
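The audit boils down to scanning every instance's block device mappings and flagging volumes that would be deleted on termination. A minimal sketch of that check, assuming a boto3-style `describe_instances` response (the function name and shape here are illustrative, not our actual script):

```python
def find_unprotected_volumes(reservations):
    """Return (instance_id, volume_id) pairs for EBS volumes that would
    be deleted when their instance is terminated."""
    unprotected = []
    for reservation in reservations:
        for instance in reservation["Instances"]:
            for mapping in instance.get("BlockDeviceMappings", []):
                ebs = mapping.get("Ebs", {})
                # DeleteOnTermination=True means the volume is NOT protected.
                if ebs.get("DeleteOnTermination", False):
                    unprotected.append((instance["InstanceId"], ebs["VolumeId"]))
    return unprotected
```

With boto3 you would feed this `ec2.describe_instances()["Reservations"]` for each region and alert whenever the returned list is non-empty.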
Here’s what we’re going to do better:
We're rolling out EBS snapshots of the st1 volume that holds the gp2 ZFS snapshots. These EBS snapshots are stored in S3. If an st1 volume fails or is lost, we will be able to restore it from a snapshot and then restore the gp2 volumes. This will be fully deployed in all regions in the next 24-48 hours.
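Part of that rollout is verifying the new snapshots actually exist, so the "snapshots of the snapshots" gap can't silently reopen. A sketch of the freshness check, assuming boto3-style `describe_snapshots` records with `StartTime` datetimes (names are illustrative):

```python
from datetime import datetime, timedelta, timezone

def has_recent_snapshot(snapshots, max_age_hours=24, now=None):
    """True if any snapshot in the list started within max_age_hours."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    return any(s["StartTime"] >= cutoff for s in snapshots)
```

Run against each st1 volume's snapshot list, this turns a missing daily snapshot into an alert rather than a surprise during a restore.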
We're adding the AWS CLI to Stratus SSH and helping customers automate backups to their own S3 buckets.
We're adding to our UI roadmap the ability for customers to enter their own S3 bucket and set the frequency of their own backups (in addition to ours). This will make automating your own backups much easier.
We know this was a major mistake on our part. We want Stratus to be the most reliable, stable, performant, and easy to use Magento hosting and deployment platform, and with this event we failed to deliver on that promise. We want customers migrating from MHM to Stratus to be 100% confident in the new platform before moving, so we're extending the MHM migration deadline one additional month, from March 1 to April 1, to give customers more time to feel comfortable on the new product before moving.