Over the last few days we bore witness to a minor economic disaster, thanks to the Ever Given having firmly planted itself into both walls of the Suez Canal. The Financial Times gives a very good overview of the factors that led to this mishap. In that article, Brendan Greeley describes how the Ever Given got “walled” rather than merely grounded, because it implanted itself into the canal walls.
Those of us whose work is about providing resilient services have to ask: where was the failure? Mr. Greeley goes into some depth about how the sheer height, weight, and width (beam) of the ship, the shape of the canal, the water forces, and the wind all contributed to this mishap. He also pointed out that the economics favor larger vessels. This is an externality: there is no chance that the owners will ever pay for the amount of damage the blocked canal has caused, which is estimated to have been up to $10 billion. Syria was reportedly rationing fuel because of the blockage, and fuel prices across the globe ticked up. Several ships rerouted around the Cape of Good Hope, at Africa’s southern tip, risking hijackings.
The other, far bigger failure here is that there is but one canal upon which large portions of the world economy depend. One big anything doesn’t make for good resilience. That canal could fail again. Knowing this, Iran has offered to create an alternate shipping lane, adding at least a bit of redundancy into the system. Ultimately, manufacturers throughout the supply chain can re-evaluate how to manage this sort of delivery delay. Should new lanes be formed? Should more production be closer to the end consumer? A new canal would surely cost tens of billions of dollars, and may offer only limited resilience. After all, why wouldn’t the same failure happen in both canals? In all likelihood it won’t be this precise “walling”, the hope being that canal operators and pilots will update their procedures to limit the risk.
We Internet geeks understand this class of problem in great detail, in many dimensions. A major benefit of cloud computing is to spread load across multiple CPUs in multiple locations, so that no single failure would cause disruption.
Taken individually, from the point of view of a single customer, it’s a sure bet that cloud services are far more reliable than people rolling their own, just as it is safer to use a container vessel than to try to carry one’s products across in a dinghy. However, the flip side of that coin is the impact those services have when they fail. Some examples:
| Year | Incident | Impact |
|------|----------|--------|
| 2016 | Mirai botnet / Dyn attack | Twitter, other services out for a day |
| 2020 | Gmail, YouTube, Google Docs | Services disrupted for an hour |
| 2020 | Amazon Web Services East Coast data center | Large numbers of application services failed |
| 2020 | Cloudflare DNS outage | Client resolvers failed for 27 minutes |
| 2021 | Microsoft Teams and Office 365 | Services to their customers unavailable for four hours |
Can an Internet-wide failure happen? Where’s that “Internet canal” bottleneck? I wrote about that for Cisco not long ago. It could very well be cloud-based DNS resolvers, such as Cloudflare’s 1.1.1.1. What we know is that these services can fail because they have done so in the past. Last year, MIT sage Dan Geer looked at market concentration effects on cybersecurity risk, which opens up a bigger question. This time, the Ever Given failed without any malice. Geer’s major point is that large targets, like popular cloud services, are exposed to asymmetric attack. The same perhaps can be said about the Suez Canal.
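To make the resolver concern concrete: a client that depends on one resolver inherits every outage of that resolver, while a client that can fall back to an independently operated one does not. The sketch below is illustrative only. `resolve_with_fallback` and the "resolver" callables are hypothetical names, not part of any real DNS library, and a real stub resolver would also have to deal with timeouts and caching.

```python
from typing import Callable, Optional, Sequence

# Assumption: a "resolver" is any callable that maps a hostname to an
# address string and raises OSError when the lookup fails.
Resolver = Callable[[str], str]


def resolve_with_fallback(name: str, resolvers: Sequence[Resolver]) -> str:
    """Try each resolver in order and return the first answer."""
    last_error: Optional[Exception] = None
    for resolver in resolvers:
        try:
            return resolver(name)
        except OSError as exc:
            # This resolver is down; remember why and try the next one.
            last_error = exc
    # Every resolver failed (or none were configured).
    raise OSError(
        f"all {len(resolvers)} resolvers failed for {name}"
    ) from last_error
```

The fallback only helps if the resolvers do not share a fate. Two entries that both point at the same cloud service are still one canal.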
Note that large cloud services are not the only aggregate risk we face. Geer’s earlier work looked at software monocultures. When a large number of systems all use the same software, a single attack can affect all, or at least a great many, of them. This is just another example of a Suez Canal.
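The arithmetic behind the monoculture point can be sketched with a deliberately simple model. Assume each of `n` redundant systems fails on its own with probability `p`, and that a shared flaw (the same software everywhere, or the same canal design) takes all of them down at once with probability `c`. The function names and the numbers are illustrative assumptions, not figures from Geer’s work.

```python
def p_all_fail_independent(p: float, n: int) -> float:
    """Probability that all n copies fail when failures are independent."""
    return p ** n


def p_all_fail_with_common_cause(p: float, n: int, c: float) -> float:
    """Probability that all n copies fail when a shared flaw exists.

    Either the shared flaw strikes (probability c), or it does not and
    every copy happens to fail on its own.
    """
    return c + (1.0 - c) * p ** n


if __name__ == "__main__":
    # Illustrative numbers only: 1% individual failure, 0.5% shared flaw.
    for n in (1, 2, 3):
        print(
            n,
            p_all_fail_independent(0.01, n),
            p_all_fail_with_common_cause(0.01, n, 0.005),
        )
```

In this model, adding copies drives the independent term toward zero, but the total never drops below `c`. Redundancy built from identical parts buys less than it appears to.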
The economic drivers are always toward economies of scale, whether that’s a large cloud service or a single supplier, but at the often hidden price of aggregate resiliency. The cost generally amounts to an externality because the size and scope of the service, as well as the impact of an outage on others, are not understood until an event happens. Having not considered it a week ago, some producers are considering this question today.
Courtesy of Copernicus Sentinel data 2021, https://commons.wikimedia.org/w/index.php?curid=102251045