A lightning strike in Dublin, Ireland, has caused downtime for many sites using Amazon’s EC2 cloud computing platform, as well as for users of Microsoft’s BPOS.
Amazon said that lightning struck a transformer near its data center, causing an explosion and fire that knocked out utility service and left it unable to start its generators, resulting in a total power outage.
Some quotes from the Amazon dashboard (Amazon Elastic Compute Cloud (Ireland)):
11:13 AM PDT We are investigating connectivity issues in the EU-WEST-1 region.
3:01 PM PDT A quick update on what we know so far about the event. What we have is preliminary, but we want to share it with you. We understand at this point that a lighting strike hit a transformer from a utility provider to one of our Availability Zones in Dublin, sparking an explosion and fire. Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators. The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them. Power sources must be phase-synchronized before they can be brought online to load. Bringing these generators online required manual synchronization. We’ve now restored power to the Availability Zone and are bringing EC2 instances up. We’ll be carefully reviewing the isolation that exists between the control system and other components. The event began at 10:41 AM PDT with instances beginning to recover at 1:47 PM PDT.
Note the gap of roughly 30 minutes (32, to be exact) between the first message on the dashboard (11:13 AM PDT) and the stated start of the event (10:41 AM PDT).
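The timeline arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the three timestamps are taken from the dashboard quotes above, all on the same day in PDT, while the helper name `to_minutes` and the variable names are made up for illustration.

```python
def to_minutes(clock: str) -> int:
    """Convert a 12-hour clock string such as '1:47 PM' to minutes since midnight."""
    hhmm, meridiem = clock.split()
    hours, minutes = (int(part) for part in hhmm.split(":"))
    # 12 AM maps to hour 0 and 12 PM to hour 12.
    hours = hours % 12 + (12 if meridiem.upper() == "PM" else 0)
    return hours * 60 + minutes

# Timestamps as quoted from the Amazon dashboard (same day, PDT).
event_start = to_minutes("10:41 AM")     # stated start of the event
first_notice = to_minutes("11:13 AM")    # first dashboard message
recovery_start = to_minutes("1:47 PM")   # instances begin to recover

notice_delay = first_notice - event_start
time_to_recovery = recovery_start - event_start

print(f"Delay before first dashboard message: {notice_delay} minutes")
print(f"Time until first instances recovered: {time_to_recovery} minutes")
```

So the first public notice came 32 minutes after the event began, and the first instances started to recover a little over three hours (186 minutes) after it.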
11:04 PM PDT We know many of you are anxiously waiting for your instances and volumes to become available and we want to give you more detail on why the recovery of the remaining instances and volumes is taking so long. Due to the scale of the power disruption, a large number of EBS servers lost power and require manual operations before volumes can be restored. Restoring these volumes requires that we make an extra copy of all data, which has consumed most spare capacity and slowed our recovery process. We’ve been able to restore EC2 instances without attached EBS volumes, as well as some EC2 instances with attached EBS volumes. We are in the process of installing additional capacity in order to support this process both by adding available capacity currently onsite and by moving capacity from other availability zones to the affected zone. While many volumes will be restored over the next several hours, we anticipate that it will take 24-48 hours (emphasis made by blogger) until the process is completed. In some cases EC2 instances or EBS servers lost power before writes to their volumes were completely consistent. Because of this, in some cases we will provide customers with a recovery snapshot instead of restoring their volume so they can validate the health of their volumes before returning them to service. We will contact those customers with information about their recovery snapshot.
Microsoft doesn’t use a public dashboard, but its Twitter feed stated “on Sunday 7 august 23:30 CET Europe data center power issue affects access to #bpos”. Four hours later came the tweet “#BPOS services are back online for EMEA customers”. It is a pity that there is no explanation of how Microsoft’s data center went down as well. Was the cause the same as the one that brought the Amazon data center down?
The idea behind cloud computing is basically that the offered services are location independent. Customers don’t have to worry about, or even know, where the services are produced. They don’t even have to know how the services are provided (their inner workings).
The incident in Dublin shows that, at the moment, these assumptions are wrong. As a customer of cloud computing services, you still need a good understanding of where and how the services are provided in order to understand the risks at stake in terms of resiliency and business continuity. Only then can you make the proper choices about how cloud computing services can help your organization or business without business continuity surprises. Proper risk management when using cloud computing services deserves more attention.