What happened to Facebook on 4th October?

Facebook went down! What went wrong, and why did it take so long to fix?

What a shock to many: their worlds came crashing down as the need for social interaction could not be met by the world’s most commonly used social networks, all owned by Facebook. Today (5 Oct 2021) many here in New Zealand have woken to a worldwide outage. Visiting the site produces a DNS / domain error and a white screen that no doubt has some rather highly paid network engineers at Facebook having kittens.

So why is it down? Well, that’s the question everyone is speculating on, and much of it comes down to the core structure of the internet, and how Facebook harnesses its tools to give us all the best experience possible. The most likely reason is what I’ll be walking us through today.

How does the internet work?

Well, it all starts off in your internet browser: Google Chrome, Mozilla Firefox, Internet Explorer / Edge / whatever Microsoft are calling it now, Apple Safari. Lots of options, and they all work the same way. You type a web address (URL) into the address bar and hit enter, and within seconds the page you want renders in the browser, and we carry on our merry way. But there is a bunch of communication that goes on within those few seconds to make this all work.

The first part of this is the address translation. There is a global system called DNS (the Domain Name System) which translates what you have typed in (e.g. www.webmad.co.nz) into a series of numbers called an IP address. The servers that store the website data each have an IP address that they respond on, and deliver the web pages back to you. It’s a bit like a phonebook: I want to call someone by this name, so please give me their phone number.
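To make the phonebook idea concrete, here is a minimal Python sketch of the lookup a browser asks the operating system to do. The function name `resolve` and the default port are my own choices for illustration; `socket.getaddrinfo` is the standard library call that does the actual work.

```python
import socket

def resolve(hostname, port=443):
    """Return the sorted, de-duplicated IP addresses DNS gives for a hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the first item of sockaddr is the IP address as text.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve("www.webmad.co.nz")` on a machine with a working internet connection should return one or more IP addresses. The exact values depend on when and where you run it.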

Once the address translation has happened, you can talk directly to the servers and get the data you need to render the web page. The faster this translation happens, the faster your website will load for the end users. And this is where the problem is believed to have happened for Facebook today.

Where has it all gone wrong?

The way that normal IP addressing works is that one server typically has one IP address. It is unique, and you can get a bunch of details from it (check out https://ip-api.com for some of this info). The downside is that a single IP address typically translates to one server, which may actually be on the other side of the world from you. And because light can only travel so fast through the fibre optic cables that form the internet backbones linking us all together, there is a delay talking from little old NZ through to big datacenters in the USA or Europe.
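A rough back-of-the-envelope calculation shows how much of that delay is pure physics. This sketch assumes light in glass fibre travels at about two thirds of its speed in a vacuum, and uses roughly 10,500 km as an approximate Auckland to Los Angeles distance. Both figures are illustrative assumptions, and real cable routes are longer than a straight line.

```python
SPEED_OF_LIGHT_KM_S = 299_792  # speed of light in a vacuum
FIBRE_FACTOR = 0.67            # assumption: fibre carries light at about 2/3 of that

def min_round_trip_ms(distance_km):
    """Best-case round trip time in milliseconds over fibre of the given length."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_S * FIBRE_FACTOR)
    return 2 * one_way_seconds * 1000

# Assumed distances, for illustration only.
print(round(min_round_trip_ms(10_500)))   # NZ to the US west coast: roughly 105 ms
print(round(min_round_trip_ms(50), 2))    # a server in the same city: under 1 ms
```

That figure is before any routers, servers or software add their own delays, and a page load needs many round trips, which is why a closer server makes such a difference.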

What some clever clogs has worked out, though, is that you can use Content Delivery Networks (CDNs) to reduce the physical distance between your web servers and your customers around the world, making websites load so much quicker. Yay! But that is only part of the equation. This works for website content, but it doesn’t work for the DNS lookup / translation aspect. And this is where we get to BGP routing, which is where we believe today’s outage was caused.

You’re getting technical…

BGP, or Border Gateway Protocol, is how networks advertise to each other which IP addresses they can reach. Used in a setup known as anycast, it is a fancy way of allowing one single advertised IP address to be shared by multiple servers globally, which can then serve website clients from the closest possible location on the network. As there are lots of servers that can serve the data of the one IP address, it can be very fault tolerant, and it speeds up the translation of website addresses to IP addresses, so that the traffic can be routed to the right places and the websites work.
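The idea of several servers sharing one advertised address (often called anycast) can be sketched as a toy model. This is not how a real router is implemented: real BGP weighs several route attributes, while this sketch only compares a made-up `path_length`. The site names are hypothetical, and the address block is one reserved for documentation.

```python
from dataclasses import dataclass

PREFIX = "192.0.2.0/24"  # documentation-only address block, used as a stand-in

@dataclass(frozen=True)
class Announcement:
    prefix: str       # the shared address block being advertised
    site: str         # hypothetical data centre advertising it
    path_length: int  # networks between the client and that site

def best_route(announcements, prefix):
    """Pick the announcement with the shortest path, or None if there is none."""
    candidates = [a for a in announcements if a.prefix == prefix]
    if not candidates:
        return None  # nobody advertises the prefix, so the address is unreachable
    return min(candidates, key=lambda a: a.path_length)

announcements = [
    Announcement(PREFIX, "sydney", 2),
    Announcement(PREFIX, "los-angeles", 4),
]
print(best_route(announcements, PREFIX).site)      # sydney
print(best_route(announcements[1:], PREFIX).site)  # los-angeles
print(best_route([], PREFIX))                      # None
```

The second call shows the fault tolerance: if the nearest site stops advertising, traffic falls back to the next one. The third shows what happens when every advertisement disappears at once, which is the situation the rest of the internet finds itself in during an outage like this one.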

In today’s outage, the hardware that does this BGP routing globally for Facebook, allowing them high website speeds, appears to have been misconfigured or to have lost its configuration. What this has meant is that anyone trying to do lookups / translations of any of the Facebook operated web addresses is getting a blank screen, with their browser telling them that it can’t find the domain name.
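From a program’s point of view, the browser error above is simply a failed name lookup. Here is a small sketch of how that failure surfaces in Python. The function name `describe_lookup` and the swappable `resolver` parameter are my own additions for illustration, so the failure case can be demonstrated without waiting for a real outage.

```python
import socket

def describe_lookup(hostname, resolver=socket.getaddrinfo):
    """Say in plain words whether a hostname could be translated to an IP address."""
    try:
        infos = resolver(hostname, 443)
    except socket.gaierror as err:
        # This is the case a browser reports as "can't find the domain name".
        return f"{hostname}: name could not be resolved ({err})"
    addresses = sorted({info[4][0] for info in infos})
    return f"{hostname}: resolved to {', '.join(addresses)}"
```

During the outage, a call such as `describe_lookup("facebook.com")` would have taken the error branch, because the DNS servers that answer for that name could not be reached.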

As I write this, it looks like things are slowly starting to resume normal operations after four and a bit hours. There is a Facebook branded error page now, so we are at least seeing Facebook servers again. I suspect the next issue they will face as they slowly bring the site back online is the large influx of people accessing the sites after their drought and trying to catch up, effectively swamping their servers.

What can we learn from this?

  • Firstly – in the internet world, you are never too big to fail.
  • Secondly – the world is still ok without social networks.
  • All the geekery in the world (CDNs, BGP routing etc) won’t necessarily save you from good old-fashioned human error, although it does help reduce its occurrence.

Here at Webmad we are well versed in using these various tools to get you the best outcomes and speed for your website, using trusted providers, and offering proven results. We’ve run sites using BGP failover routing to offer high-availability, geolocation-aware systems within NZ, we use CDNs all the time, and we can quickly pinpoint where issues might be, and how to fix them. Could we fix Facebook’s troubles? That’s a bit above our pay grade, but we can definitely put our knowledge to great use as part of your web team. Drop us a line to get the best results for your online assets.

By Stephen

Co-founder at Webmad, Stephen is part of the website development team, and is keen on solving problems for businesses using web tools. When he’s not maintaining and developing systems, he is a keen audio engineer involved with live sound and studio recording, or hanging out with his family at skate parks and local markets.

Keen to discuss your project? We'd love to chat!

69 Corsair Drive, Christchurch
(yes, the control tower!)