Does the Fastly outage justify “Single Point of Failure” headlines?
On Tuesday 8th June, at roughly noon in mainland Europe, a major outage swept across a number of the world’s most significant websites. The issues resulted from a failure of Fastly’s content delivery network (CDN) service. The impact was extremely widespread, as shown in this screengrab from the DownDetector website, taken at 12:30pm European time:
Fastly’s status page for the issue showed that a fix was identified and deployed 38 minutes after the start of the outage. At this point, services visibly began to be restored. Notably, Amazon’s website appeared to be fully operational some time before many other services, leading to some light-hearted speculation about the robustness of Amazon’s SLAs.
In the immediate aftermath, media analysis focused on the fact that a single service provider’s outage had created such a wide impact. The Guardian newspaper, whose entire website had been taken offline by the incident, described the events as a “wakeup call” that internet infrastructure has become “dangerously over-centralised”. Many commentators identified Fastly as a significant single point of failure.
Clearly, Fastly was a single point of failure during this event.
Some background: companies use Fastly’s edge cloud service (and the CDN services of peers such as Cloudflare and Amazon CloudFront) to provide faster, more distributed content delivery, with additional benefits such as protection against Distributed Denial of Service (DDoS) attacks.
The failure clearly highlighted a critical dependency on this single service, and a widespread lack of resilience and redundancy when it is not working properly.
Single points of failure, whether technical or human, are seen as a Generally Bad Thing in conventional service thinking, and Fastly’s outage had media pundits and even the UK government (paywall) scrambling to understand how critical services had been so badly impacted by one.
However, the service was restored within three quarters of an hour. If Fastly suffers no other failures this year, that would be equivalent to better than “four nines” availability (99.99%), a level that allows roughly 53 minutes of downtime per year. Delivering a single service at that availability level can be very expensive; duplicating it to provide redundancy is even more costly. The UK government is no stranger to big IT project failures; I have seen the inside of a few too many of those. Is it realistic to expect that they could replicate Fastly’s services, in-house, at equivalent cost, at a four-nines reliability level? (The answer, by the way, is a firm no.)
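The back-of-the-envelope arithmetic is worth making explicit. A quick sketch in Python, assuming a single 45-minute outage over a calendar year (the figures here are this post’s rough estimates, not Fastly’s published numbers):

```python
# Check whether one 45-minute outage in a year still clears the
# "four nines" (99.99%) availability bar.

MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes in a non-leap year
OUTAGE_MINUTES = 45                # approximate duration of the 8th June incident

# Fraction of the year the service was up
availability = (MINUTES_PER_YEAR - OUTAGE_MINUTES) / MINUTES_PER_YEAR

# Downtime budget that 99.99% availability permits over a year
four_nines_budget = MINUTES_PER_YEAR * (1 - 0.9999)

print(f"Availability: {availability:.4%}")                     # ~99.9914%
print(f"Four-nines budget: {four_nines_budget:.1f} min/year")  # ~52.6 minutes
```

Since 45 minutes is inside the roughly 52.6-minute annual budget, a single outage of this length does indeed leave the service better than four nines for the year.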
The breadth of the failure is irrelevant to each individual customer
Hence it was interesting to see the contrarian view of some commentators. Cloud economist Corey Quinn was quick to praise Fastly, both for their fast resolution and for the way the issue highlighted how widely their service is used, even by the parent company of one of their major competitors:
At the time of writing, some ten hours after the event (and at the risk of breaking every blogging law about creating evergreen content), Fastly’s stock price is up 10% on the day:
Assuming it hasn’t plummeted to the floor by the time you read this, it seems Corey Quinn may have won this debate. Perhaps the occasional 45-minute outage can be lived with, and the events of 8th June 2021 show that a Single Point of Failure isn’t necessarily the wrong option, if it happens to be a really good option.