Facebook Will Happen Again: DNS Outages and Interconnected Systems
Last week’s massive outage of the Facebook-Instagram-WhatsApp ecosystem left many of us puzzled and concerned: How did our entire social communication (and, for many, our news source) become so dependent on a single, unregulated conglomerate? How can such a conglomerate fail for a seemingly trivial reason such as DNS? And what are the dangers of our over-reliance on such interconnected entities as our connection to the world?
What caused the Facebook outage?
“The Facebook case was actually more than just a DNS failure: the root cause seems to be failures in BGP (Border Gateway Protocol), the routing layer beneath DNS, which then caused DNS resolution to fail,” says Francesco Altomare, GlobalDots’ chief EU-based expert for web performance solutions and business continuity strategies.
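The chain Francesco describes – BGP routes vanish, so otherwise-healthy DNS servers become unreachable – can be illustrated with a toy sketch (the data structures are purely hypothetical, and all addresses come from IANA documentation ranges, not Facebook’s real network):

```python
# Toy illustration (not Facebook's actual systems): when the BGP routes to the
# authoritative nameservers are withdrawn, DNS resolution fails even though
# the DNS servers themselves are healthy.

# Hypothetical route table: prefix -> currently reachable?
routes = {"203.0.113.0/24": True}  # prefix announcing the nameservers

def resolve(domain, nameserver_prefix):
    """Simulate a DNS lookup that first needs a BGP route to the nameserver."""
    if not routes.get(nameserver_prefix, False):
        raise TimeoutError(f"no route to nameserver for {domain}")
    return "198.51.100.7"  # illustrative answer record

print(resolve("facebook.com", "203.0.113.0/24"))  # works while routed

routes["203.0.113.0/24"] = False  # BGP withdrawal: the route disappears
try:
    resolve("facebook.com", "203.0.113.0/24")
except TimeoutError as err:
    print("DNS failure:", err)
```

The point of the sketch: nothing in the `resolve` function itself is broken; the failure is injected one layer below it, which is exactly why the outage looked like “just DNS” from the outside.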
“But in essence, the DNS failed because something wasn’t maintained as it should have been, to the point that it required manual intervention and resulted in an hours-long denial of service. A global corporation which controls most of the world’s means of social communication has a responsibility to minimize that risk.
“DNS has been the cause of most recent major global outages, including the latest Facebook and Slack ones. This happens because DNS is the most overlooked protocol on the web. And it can happen to any online business – not just the biggest global ones – and cause monetary and reputational damage beyond repair.
“I keep seeing commentators say ‘it’s always DNS’ as if nothing can be done about it, and this simply isn’t true. A modest investment in a resilient, performant, 100%-uptime, SLA-backed DNS technology can prevent all of this, and we’ve been doing this for decades.”
Asked about the probability of such future events in global interconnected services, Francesco explains:
“Reliance on interconnected systems carries an inherent risk of system or even service failure. To counter this risk, companies employ practices such as SRE (Site Reliability Engineering), as well as DR (Disaster Recovery) and BCP (Business Continuity Planning), all of which deal with varying levels of redundancy built into each and every layer of your systems infrastructure. So-called ‘Compound SLAs’ (Service-Level Agreements) cover systems made of more than one component, each of which carries a distinct availability SLA (see informal subject reference), and are used to calculate that overall risk. The same goes for the notion of ‘Error Budgets’ (Google explanation here), where you – as an organization – live within a budget for your systems’ downtime and maintenance windows. If an organization can afford some downtime, it can spend that budget to assess and introduce technology that minimizes the risk and, with repetition, potentially eliminates it altogether.
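The compound-SLA and error-budget arithmetic Francesco mentions can be sketched in a few lines (the component names and SLA figures below are illustrative, not drawn from any real contract):

```python
# Sketch: compound availability of a serial dependency chain, and the
# error budget implied by an availability target. Figures are illustrative.

def compound_availability(slas):
    """For components that all must be up (serial chain), the overall
    availability is the product of the individual availabilities."""
    result = 1.0
    for availability in slas:
        result *= availability
    return result

def error_budget_minutes(availability, days=30):
    """Allowed downtime per period implied by an availability target."""
    return (1 - availability) * days * 24 * 60

# Hypothetical chain: DNS, load balancer, origin servers
slas = [0.9999, 0.999, 0.9995]
overall = compound_availability(slas)
print(f"Compound SLA: {overall:.4%}")
print(f"Error budget: {error_budget_minutes(overall):.1f} min / 30 days")
```

Note that the compound figure is always lower than the weakest individual SLA: chaining components only ever erodes availability, which is why calculating the risk across many interconnected systems gets complex quickly.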
“Yet despite these defensive, preventative tools and the mounting literature on the subject, there is no magic formula to determine a user’s SLAs without active consultancy with its key stakeholders. Moreover, even a 100%-availability, SLA-backed system is subject to failures, and when more than one component contributes to the availability percentage, calculating the risk of failure becomes an even more complex and grueling task. Simply stated, it is not a question of how likely systems are to fail, or whether over-reliance on them leads to more problems. Rather, the questions are: how long will these systems run without constant maintenance and integration of updates before failing, and what can be done to delay the inevitable failure while maximizing utilization and output? Beyond that, the question turns to the human aspect of updating system code and configuration versus the machine-learning-driven coding of the future, and whether the latter will lower the risk and extend the timeframe of efficient system operation.
“DNS is probably the most overlooked web protocol, which is why even the world’s giants aren’t immune unless they implement a multi-DNS strategy. This could happen to any website, and multi-DNS solutions are highly affordable, so no one should really go without them nowadays.”
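The multi-DNS strategy Francesco describes boils down to failover across independent providers serving the same zone. A minimal sketch, where the provider functions are hypothetical stubs standing in for real resolver queries and the returned address is from a documentation range:

```python
# Sketch of a multi-DNS strategy: two independent providers serve the same
# zone; resolution falls back when one provider is down.

def provider_a(domain):
    """Stub for provider A, simulated as suffering an outage."""
    raise TimeoutError("provider A unreachable")

def provider_b(domain):
    """Stub for provider B, healthy and answering queries."""
    return "198.51.100.7"

def resolve_with_failover(domain, providers):
    """Try each provider in turn; only fail if all of them do."""
    last_err = None
    for query in providers:
        try:
            return query(domain)
        except TimeoutError as err:
            last_err = err  # remember the failure, try the next provider
    raise last_err

print(resolve_with_failover("example.com", [provider_a, provider_b]))
```

In practice this failover happens inside the DNS system itself: publishing NS records from two providers means resolvers automatically retry the surviving one, so an outage at a single provider never takes the domain offline.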
Steven Puddephatt, GlobalDots’ chief UK solution architect, adds:
“The probability of these systems failing is 100%. We know this because no service provider will offer more than ‘7 nines’ of uptime in their SLA. Undoubtedly Facebook have redundancy built into their core platform, but in this case it was a configuration change that caused the outage. As long as humans are involved in updating code and configurations, there will always be outages. I don’t believe an over-reliance on these systems will increase outages; there were far more system outages overall when systems were less consolidated – you just didn’t hear about them because they were less public-facing.”
Watch Steven’s whiteboard explainer below
GlobalDots is happy to be leading the multi-DNS front, having kept business customers outage-free for nearly 20 years.