Amazon has shared its root cause analysis of the major outage of February 28, which took down S3 and affected EC2 and other AWS services that rely on S3.
In the incident report published by the AWS team, Amazon reveals that a simple typo made by an AWS system administrator accidentally shut down a large number of servers supporting S3, causing major disruptions to Amazon services for about 4 hours:
At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.
To make sure the problem doesn’t happen again, Amazon has rewritten its software tools with safeguards so that its engineers can’t repeat the same mistake.
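As a rough illustration of what such a safeguard could look like, here is a minimal Python sketch. It is not Amazon's actual tooling: the function name `plan_removal`, the exception `CapacityGuardError`, and the parameters `min_required` and `max_batch` are all hypothetical. The idea is simply that a removal command is validated before it runs, so a mistyped input that selects too many servers is rejected instead of executed.

```python
class CapacityGuardError(Exception):
    """Raised when a removal request would take too many servers offline."""


def plan_removal(active, requested, min_required, max_batch):
    """Return the servers that are safe to remove, or raise CapacityGuardError.

    All names and limits here are hypothetical, for illustration only.

    active       -- set of server names currently in service
    requested    -- server names the operator asked to remove
    min_required -- fewest servers the subsystem needs to keep running
    max_batch    -- most servers a single command may remove
    """
    requested = set(requested)

    # Reject names that are not in service: a likely sign of a typo.
    unknown = requested - set(active)
    if unknown:
        raise CapacityGuardError(f"unknown servers: {sorted(unknown)}")

    # Cap how much capacity one command can remove at once.
    if len(requested) > max_batch:
        raise CapacityGuardError(
            f"refusing to remove {len(requested)} servers (limit {max_batch})"
        )

    # Never let the subsystem drop below its minimum capacity.
    remaining = len(active) - len(requested)
    if remaining < min_required:
        raise CapacityGuardError(
            f"only {remaining} servers would remain (minimum {min_required})"
        )

    return sorted(requested)
```

With five active servers, a minimum of three, and a batch limit of two, a request to remove two servers passes, while a request to remove three is refused before anything is touched.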