Facebook Will Happen Again: DNS Outages and Interconnected Systems

Francesco Altomare
4 Min read
Managed DNS

Last week’s massive outage across the Facebook-Instagram-WhatsApp ecosystem left many of us puzzled and concerned: How did our entire social communication (and, for many, our news source) become so dependent on a single, non-regulated conglomerate? How can this conglomerate fail for a seemingly trivial reason such as DNS? And what are the dangers of our over-reliance on such interconnected entities as our connection to the world?

What caused the Facebook outage?

“The Facebook case was actually more than just a DNS Failure: The root cause seems to be BGP (Border Gateway Protocol) failures underlying the DNS Protocol, which then caused the DNS to start failing,” says Francesco Altomare, GlobalDots’ chief EU-based expert for web performance solutions and business continuity strategies.

“But in essence, the DNS failed because something wasn’t maintained as it should have been, to the point that it required manual intervention and resulted in an hours-long denial of service. A global corporation that controls most of the world’s means of social communication has a responsibility to minimize that risk.

“DNS was the cause of most major global outages recently, including the latest Facebook and Slack ones. It happens because DNS is the most overlooked protocol on the web. And it can happen to any online business, not just the biggest global ones, and cause monetary and reputational damage beyond repair.

“I keep seeing commentaries saying ‘it’s always DNS’ as if nothing can be done about it, and this simply isn’t true. A modest investment in a resilient, performant, SLA-backed DNS technology with 100% uptime can prevent all this, and we’ve been doing this for decades.”

Asked about the probability of such future events in global interconnected services, Francesco explains:

“Reliance on interconnected systems carries an inherent risk of system or even service failure. To counter this daunting risk, companies use tools such as SRE (Site Reliability Engineering), as well as DR (Disaster Recovery) and BCP (Business Continuity Planning), which all deal with varying levels of redundancy built into each layer of your systems infrastructure. In fact, so-called ‘compound SLAs’ (service-level agreements) are used to calculate that risk for systems made up of more than one component, each of which carries its own availability SLA. The same goes for the notion of ‘error budgets’ (as explained by Google), where you, as an organization, live and cope with a budget for your systems’ downtime and maintenance windows. If an organization can afford enough system downtime, a limited solution can always be found to assess and implement technology that minimizes the risk and, if the exercise is repeated, potentially takes the risk off the agenda altogether.

“Yet despite these defensive, preventative, and protective tools, as well as the mounting literature on the subject, there is no magic formula for determining a user’s SLAs without actively consulting its key stakeholders. Moreover, even a system backed by a 100% availability SLA is subject to failures, and when more than one component contributes to the availability percentage, calculating the risk of failure becomes an even more complex and grueling task. Simply stated, the question is not how likely systems are to fail, or whether over-reliance on them leads to more problems. Rather, the question is how long these systems will take to fail without constant maintenance and updates, and what can be done to delay the inevitable failure while getting the most efficient use and output from the system in the meantime. Beyond that, the question turns to the human aspect: people updating system code and configuration today versus AI-driven coding in the future, and whether the latter will lower the risk and extend the period of efficient system operation.

“DNS is probably the most overlooked web protocol, which is why even the world’s giants aren’t immune unless they implement a multi-DNS strategy. This could happen to any website, and multi-DNS solutions are highly affordable, so no one should really go without them nowadays.”
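The compound-SLA and error-budget arithmetic Francesco refers to can be illustrated with a minimal sketch. The availability figures below are made up for illustration and are not taken from any vendor's SLA, the function names are hypothetical, and the redundancy formula assumes that providers fail independently of each other.

```python
from math import prod


def serial_availability(components):
    """Compound availability of a chain in which every component must be up."""
    return prod(components)


def parallel_availability(replicas):
    """Availability of redundant replicas where any single one is enough.

    Assumes independent failures, which is an idealization.
    """
    return 1 - prod(1 - a for a in replicas)


def error_budget_minutes(slo, period_days=30):
    """Downtime allowed by an availability target over a period, in minutes."""
    return (1 - slo) * period_days * 24 * 60


# Illustrative figures only (assumptions, not real SLAs).
single_dns = 0.999
web_tier = 0.9995
database = 0.9995

# One DNS provider in front of the stack: the weakest link drags everything down.
stack_single = serial_availability([single_dns, web_tier, database])

# Two independent DNS providers (a multi-DNS setup) in front of the same stack.
dual_dns = parallel_availability([single_dns, single_dns])
stack_dual = serial_availability([dual_dns, web_tier, database])

print(f"single DNS stack: {stack_single:.6f}")
print(f"dual DNS stack:   {stack_dual:.6f}")
print(f"30-day budget at 99.9%: {error_budget_minutes(0.999):.1f} minutes")
```

The independence assumption is the weak point of this kind of calculation: if both DNS providers depend on the same underlying routing, a single failure there can take them down together, and the real figure is lower than the formula suggests.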

Steven Puddephatt, GlobalDots’ chief UK solution architect, adds:

“The probability of these systems failing is 100%. We know this because no service provider will offer more than ‘7 nines’ uptime in their SLA. Undoubtedly Facebook have redundancy built into their core platform, but in this case it was a configuration change that caused the outage. As long as humans are involved in updating code and configurations, there will always be outages. I don’t believe an over-reliance on these systems will increase outages; there were far more system outages overall when systems were less consolidated. You just didn’t hear about them because they were less public-facing.”
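To put the "nines" Steven mentions into perspective, the following minimal sketch converts a number of nines into the downtime it still permits per year. It assumes a 365-day year and treats "N nines" as an availability of 1 - 10^-N; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_seconds(nines, period_seconds=SECONDS_PER_YEAR):
    """Downtime permitted by an 'N nines' availability target over a period."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return unavailability * period_seconds


for n in range(2, 8):
    print(f"{n} nines: {allowed_downtime_seconds(n):,.2f} seconds per year")
```

Even at seven nines the permitted downtime is small but not zero, which is the point of the quote above: an SLA bounds how often failure is expected, it does not rule failure out.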

Watch Steven’s whiteboard explainer below

GlobalDots is happy to be leading the Multi-DNS front, keeping business customers out of outages for nearly 20 years.



