Why Google Went Offline Today and a Bit about How the Internet Works

trout · on Nov 6, 2012

There are some other ways to fix the problem.

Last time with the Youtube problem, they advertised more specific routes. If Pakistan was advertising a /24 network (255 IP addresses) Youtube started advertising two /25 networks (2x 128 addresses). Since they are more specific, they are preferred over the more broad routes. This prevents lack of cooperation, but not malicious behavior. As well, it ends somewhere because many networks will not pass routes smaller than say /24 or /28.

Most service providers also do 'inbound route filtering' to filter out any routes that they do not own. This isn't a simple process, which is why PCCW does not do it. Maybe a few more of these incidents and they will.

There's also AS Path filtering. This allows networks to be more granular in which paths they trust, by inspecting which AS's a route has gone through. If certain AS or AS path combinations become problematic, the internet at large could blackhole them or do manual route filtering. This would be laborious, but possible.

That said if someone can maliciously peer with an active BGP router, the damage to be done is significant. I haven't seen any outage reports from this type of attack, but I'm surprised by that.

pilom · on Nov 6, 2012

Much more common that malicious outages is malicious creation of ghost networks. Basically a person could say over BGP "W.X.Y.Z is at my office" where that address isn't used by anyone anywhere else on the internet. Then they do their bad deeds from that made up address. Lastly they remove their route via BGP and it is as if their addresses never existed.

sounds · on Nov 6, 2012

That might work for some unused /24's for a large organization's /8 block, but unused IPv4 addresses are so last year!

I suppose the attack will still work for IPv6 for a long time.

wmf · on Nov 6, 2012

There are a lot of IPv4 addresses that are assigned but not routed on the Internet, so you can easily "borrow" them. This kind of trick does leave a trace, though.

neya · on Nov 6, 2012

Best explanation ever. Wow, seriously, this person can use the right words to help even the non-technical people understand such a complex situation. Thanks for posting this.

jwr · on Nov 6, 2012

I used to manage networks and wondered, while reading the article, why it gets so many points on HN, when it only states obvious things and doesn't really go into detail. And then I realized most people these days have no idea about how packets get from here to there, or even that there are packets at all. Now I understand the appeal, but I guess this means that good introductory material is badly needed.

duggan · on Nov 6, 2012

“BGP is literally the glue of the Internet” - I think you’ll find BGP is figuratively the glue of the Internet ;)

virmundi · on Nov 6, 2012

It depends on how one defines virtual glue. But, hey, the whole thing might be moot.

http://xkcd.com/1108/

clebio · on Nov 6, 2012

I'm in this camp. To me, 'literally' has only one meaning. If it doesn't, the word loses all utility. He could say 'is essentially the glue', I suppose.

glomph · on Nov 6, 2012

You are outdated to the extent that you would have been behind the times in the 1680s where the word was already being used to mean 'what follows must be taken in the strongest admissible sense'.

SilasX · on Nov 6, 2012

That's fine, as long as you have an alternative ready that takes the meaning that the old "literally" had when I do want the statement to be taken, er, literally.

If you don't, then it makes a lot of sense to defend literal from non-literal usage.

What's the alternative that I can use and be understood?

nwienert · on Nov 6, 2012

Context.

SilasX · on Nov 6, 2012

"Hello, 911? Yeah, I've got an unconscious person here. His face is literally purple."

Did I mean literally literally, or figuratively? And how can the previous question have meaning?

ars · on Nov 7, 2012

You remove the emphasizer and just say: "His face is purple."

SilasX · on Nov 7, 2012

And then they say, "literally purple?"

jerf · on Nov 7, 2012

The problem with this argument is that you are hypothesizing a case that, if it were going to be happening, would be happening now, not in the future. Yet it does not. There is no great epidemic of confused 911 operators because they can't make out whether or not someone on the other end used literally "correctly".

While one can speculate on why you might be wrong about this being a problem, an examination of the world around us rather strongly suggests that there's no question that there is something fatally wrong with your argument.

SilasX · on Nov 7, 2012

There's no epidemic of any single problem being caused by any imprecision in grammar. But there are lots of little, similar problems -- perhaps in non-emergency situations -- that cause predictable, avoidable confusion because people insist on breaking the use of important words.

If your point is that "we can make it impossible to communicate the concept 'literally' until there's an epidemic of deaths over it", then your threshold is in a very, very wrong place.

jerf · on Nov 9, 2012

It was your implicit threshold you were setting with your argument, not mine. While I am gratified that you so thoroughly demolished your own argument for me, you might want to consider your arguments a bit more tactically in the future.

The real problem being caused here is well below the noise threshold and certainly not worth trying to play "Holier than thou" at people on the internet.

SilasX · on Nov 9, 2012

>It was your implicit threshold you were setting with your argument, not mine

That wasn't my threshold; that was an example of a confusion that couldn't be disambiguated without clear terms for literal vs figurative; it's just that it had unusually large implications for a scenario that require fast, unambiguous communication. (I guess we don't have to care about these scenarios?)

Your own implicit threshold of "if someone doesn't die because if it, I can fuck up the communicative ability of a language however I feel like" is so thoroughly stupid, I doubt you even believe it yourself, yet feel the need to argue for it anyway.

In any case, I'm less concerned with who makes the best tactical moves than on discerning the best idea presented. As it stands, I don't yet see any justification for "let's get rid of this useful disambiguating feature for literal vs figurative" -- but feel free to keep offering them; maybe your knowledge of "tactics" could come in handy here, thought I doubt it. Tactical arguments don't make a language useful. Rather, substance does.

And any time you ever get around to telling me how to indicate the old meaning of "literally" you just let me know. I get that it's not a real high priority for you right now (based on how you think), and I'm not holding by breath or anything, but it would be really cool if you could pull it off. Thanks.

glomph · on Nov 15, 2012

The thing is it isn't a choice we are making now. It was made over 400 years ago and it works.

glomph · on Nov 15, 2012

There are actually strong reasons to think ambiguity in certain cases is an asset in a language not a hindrance. This talk goes into it in part: http://www.ted.com/talks/steven_pinker_on_language_and_thoug...

flyinRyan · on Nov 7, 2012

Drop the word, or use one of: actually, truly, really.

namank · on Nov 6, 2012

Well, they make a valid point in that a word carries with it implicit attributes that should also be taken into account.

It's what really separates bad writing from very bad writing.

Then again, it's also a perspective thing.

clebio · on Nov 6, 2012

Curious why such a specific time period. Is there a source for 1680s? The 17th century would probably be sufficient otherwise.

joshAg · on Nov 6, 2012

OED has the source if you have access to it, which I don't or I would provide it here.

saraid216 · on Nov 7, 2012

http://www.etymonline.com/index.php?term=literally&allow...

Etymonline does not reproduce the OED, but it does source from it and sometimes you get lucky.

clebio · on Nov 7, 2012

Thanks for the link! It's fairly interesting, too. That citation specifically reads: 'Erroneously used in reference to metaphors, hyperbole, etc., even by writers like Dryden and Pope, to indicate "what follows must be taken in the strongest admissible sense" (1680s), which is opposite to the word's real meaning.'

So for one, it states clearly 'erroneously used' (and indeed has this specific wording, 'strongest admissible sense').

Further, gives the 1680 number, but doesn't actually source that any further (general writing periods of Dryden and Pope? Perhaps, though not clear).

Anyway, that's fun. I miss the OED, but it's the sort of massive tome that's impractical to always have on hand.

prof_hobart · on Nov 6, 2012

You do realise there's more than one meaning for "glue" these days, presumably (including "something that binds together")?

jganetsk · on Nov 6, 2012

There's also more than meaning for "literally" these days.

sageikosa · on Nov 6, 2012

With two inflection points in the phraseology, you can literally reverse your glue.

mpyne · on Nov 6, 2012

That only makes it more likely that what was said has a semantically correct meaning though (or is that what you meant?).

bennyg · on Nov 6, 2012

He meant that there is literally only one definition for literally. Any other use of the word cuts away at its meaning (like what happened in that context).

prof_hobart · on Nov 6, 2012

Not really. It is literally something that binds the internet together (i.e. one of the common definitions of the word glue).

BCM43 · on Nov 6, 2012

Now we can get into what "bind" means.

mpyne · on Nov 6, 2012

Or we can stop pretending like we're even a little bit confused as to what was meant at any point along the way here... natural language exists to facilitate communication and understanding, not pointless arguments over the form of idiomatic expressions.

CodeMage · on Nov 6, 2012

Merriam-Webster "Ask the Editor": http://www.youtube.com/watch?v=Ai_VHZq_7eU

Also worth watching, "The History of English in 10 Minutes": http://www.youtube.com/watch?v=rexKqvgPVuA

duggan · on Nov 6, 2012

I think it would be a stretch to suggest that the use of "literally" here was for the sake of hyperbole. It seems pretty clear that it's intended as an analogy.

Interesting videos though, didn't realize I'd stepped on another unexploded grammar mine from The War. I really should know better at this stage.

lifeformed · on Nov 6, 2012

It's literally the figurative glue.

punjabisingh · on Nov 6, 2012

Here's a good read: http://www.slate.com/articles/life/the_good_word/2005/11/the...

nikcub · on Nov 7, 2012

'literally' has literally become an overused cliche that does't mean anything and isn't used properly most of the time.

you can almost always take it out of the sentence it is being used in and the sentence is easier to read and makes more sense.

a new pet-hate word for me (nothing on you OP!)

huskyr · on Nov 6, 2012

Obligatory The Oatmeal reference: http://theoatmeal.com/comics/literally

elpee · on Nov 7, 2012

I think BGP is more of a mucus than glue

vr000m · on Nov 6, 2012

There is an IETF WG called SIDR, which is working on solving this problem of invalid BGP announcements. A good summary is available here http://isoc.org/wp/ietfjournal/?p=2438 and technical details are in the related proposals.

danyork · on Nov 6, 2012

Yes, a very good group for people to get involved with if they are interested in this problem.

apaprocki · on Nov 6, 2012

If you're interested in peering (couldn't resist the pun) behind the curtain, read the NANOG[1] mailing list. These are the real guys keeping the Internet up and running :)

[1]: http://mailman.nanog.org/pipermail/nanog/

dsl · on Nov 6, 2012

It is worth noting that the average HN reader should probably subscribe read-only. Unless you have your own AS and enable on routers, you should probably call your ISP with any issues. (though an unfortunate number of people disregard this advice, which results in smaller private splinter mailing lists sigh)

rlpb · on Nov 6, 2012

> When I figured out the problem, I contacted a colleague at Moratel to let him know what was going on. He was able to fix the problem...

I wonder how he contacted his colleague. In this case, I presume that routing to other networks were unaffected. But in the general case, with a future of everything over IP, what will network engineers use to communicate about faults?

jwr · on Nov 6, 2012

If you run a network with BGP, you always have good contact information for your peers. "Good" meaning direct telephone contact with tech people running the show on the other side of the link.

Peaker · on Nov 6, 2012

Telephony might be routed over IP too.

hhw · on Nov 6, 2012

Many networks engage in a practice know as peering. If the author was the peering co-ordinator for his AS, he may have likely been previously in touch with the peering co-ordinator on Moratel's side when establishing a peering relationship, and would thus have had direct contact information to a pretty direct channel (peering co-ordinators are generally also network engineers). The direct contact information being e-mail, phone, IM, or even IRC (yes, some network engineers still use it). Although, the phone (non-VOIP) would be the only option not tied in to IP.

jvehent · on Nov 6, 2012

cellphones ?

ojii · on Nov 6, 2012

There's a trend to use VoIP on cellphones too (see LTE). So in the future this will not help at all.

noselasd · on Nov 6, 2012

Even with LTE/VoLTE, the voice packets are usually contained within the telco private network, that just happen to be IP based, with private interconnections to the other telcos.

At some point we might see most of the VoIP being transported across the internet as well, but that'll be the far future.

JohnLBevan · on Nov 6, 2012

Is there some way to say "ignore DNS results from this provider", such that were you to spot an issue you could block that provider's information (and anyone replicating their version of the truth) and thus find a valid path. If that were possible you wouldn't be reliant on a third party to resolve the issue to get your system working, and once your system worked, you could contact them to resolve the issues for all.

ef4 · on Nov 6, 2012

If someone is giving out bad DNS records, you can just choose to use a different DNS server.

But in this case the problem was bad routes. You can certainly force your own routers to use fixed routes instead, but that doesn't help you unless everybody else along the path also does it. So it's not easy. There are tricks one can play -- like advertising your network as a set of smaller, more-specific networks (since routers will usually favor more-specific routes over more general ones).

Ecio78 · on Nov 6, 2012

@JohnLBevan this was not a DNS problem, but a BGP problem

NB Cannot reply under his post

stanleydrew · on Nov 6, 2012

Even when there's no "reply" link next to the poster's name, I think you can usually reply if you click the "link" link in the same area.

ybaumes · on Nov 6, 2012

The author (Tom Paseka) wrote near the conclusion that himself addressed the Google's issue, by contacting a Moratel's engineer. Do you have the same feeling when reading the article? It sounds weird that Google did not triggered a recovery procedure on its own.

Maybe I see bad things everywhere and you may call paranoïd, but could it be some sort of ("false") advertising on the side of cloudfare?

archangel_one · on Nov 6, 2012

I'm not a network engineer, but it seems like the kind of thing that might be very hard to detect when you're already inside or near to the google.com domain. Or maybe CloudFlare just got there first.

I don't think it's necessary to call BS on Cloudflare without any kind of evidence at all.

amalcon · on Nov 6, 2012

This is basically correct. BGP is weird. The addresses for one of Google's many datacenters were routed incorrectly for packets coming from some subset of IP space. Unless Google is running active ping tests to that subset of IP space, the way they would normally detect it is for someone to call and complain.

In this case, the author decided to take a shortcut and call the owner of the "problem peer" directly.

veidr · on Nov 6, 2012

Although only a vanishingly small percentage of Google users can call and complain. Blog or tweet or post to HN and hope Matt Cutts sees it and notifies the right team, maybe.

Matt_Cutts · on Nov 6, 2012

A team of Googlers could have been working on this in parallel to Tom. I'm guessing that a sudden drop of queries like that would cause people at Google to start digging into what happened. I don't know either way, because network ops and BGP is pretty far from my area (search quality).

mh- · on Nov 6, 2012

>Blog or tweet or post to HN and hope Matt Cutts sees it and notifies the right team, maybe.

It seems that's more or less the quasi-official support channel even for paid services from Google.

jauer · on Nov 6, 2012

A common way to notice things like this is to subscribe to a service like Renesys or Cyclops (http://cyclops.cs.ucla.edu/) that will alert you if it sees your subnets being announced by a different AS.

emilw · on Nov 6, 2012

Stop excusing yourself for not being something, either read more and then comment or trust your gut feeling. (not meant to sound harsh)

JohnLBevan · on Nov 6, 2012

I think it's good to qualify your opinion with your level of expertise. There's no rule that says HN should only be for discussion by experts (hopefully there never will be), and if you don't know something for sure it's best to say so that others don't take your word as gospel. That said, I'm no expert ;).

precisioncoder · on Nov 6, 2012

The key quote seems to be "Looking at peering maps, I'd estimate the outage impacted around 3–5% of the Internet's population." so if it didn't affect google directly it would have to go through customer server -> network technicians which would probably take more than the 26 min that Google was down for those customers for. I'm sure they would have been right on it if it hadn't have been fixed so fast.

michaelt · on Nov 6, 2012

  It sounds weird that Google did not triggered a recovery procedure on its own.

It's possible they didn't have the personal contact details for the engineer capable of fixing the problem.

We all know how hard it can be to contact a competent person at a big corporation when you have a problem [1]. Would Google find it easier than every other human being?

[1] http://xkcd.com/806/

notatoad · on Nov 6, 2012

It sounds like cloudflare simply got there first. Unless cloudflare is outright lying (highly unlikely) they saw a problem they could fix, and fixed it. What's the false advertising there?

dsl · on Nov 6, 2012

They didn't fix anything. Multiple people noticed, all of them contacted the network in question, then they took credit publicly when another network fixed its mistake.

jhull · on Nov 6, 2012

Couldn't a rogue government easily take down the internet this way? Seems like if one guy in Indonesia can take out Google by accident, a government entity could do the same.

sudhirj · on Nov 6, 2012

The moment people realize that the rogue network was being malicious, they'd stop trusting it - ignoring all announcements it might make. It might take a few hours for order to be restored, though.

JohnLBevan · on Nov 6, 2012

Would it be possible to claim to own Google's IP, then on receiving the packets intended for Google forward them on to the real IP (without accidentally forwarding them back to yourself)? That way someone could hijack & interrogate these packets without being spotted (at least without causing service outage / only adding slight delay). Alternatively could they route these requests to a clone as an advanced phishing scam?

SoftwareMaven · on Nov 6, 2012

Wouldn't the packets that <evil network> forwarded on to the real Google just get routed right aback to them, because the rest of the world thinks they are Google?

It might work for somebody like China, where they have two network interfaces, so can make all Chinese networks think they are Google on one interface, then forward things on to the real Google on the other. There might be a good reason for them to do it, too, because they are also likely a trusted CA, so could forge SSL certs, too.

count · on Nov 6, 2012

http://www.defcon.org/images/defcon-16/dc16-presentations/de...

It works :)

stonemetal · on Nov 6, 2012

That is more or less the definition of a man in the middle attack. Hopefully if the website does something important(online banks, shopping, etc.), they have done something to mitigate that possibility.

ErikD · on Nov 6, 2012

That's what https is for. It should prevent them from doing anything useful with the packets.

vegardx · on Nov 6, 2012

Not unless you manage to forge a certificate at the same time. It has been done before, as SSL is based on more or less the same level of trust as BGP.

sp332 · on Nov 6, 2012

Maybe you can't read the data in the packets, but HTTPS doesn't do a thing about SIGINT (signals intelligence) which, on such a large scale, could give you a lot of valuable information.

antidoh · on Nov 6, 2012

Enough time for a purge or to gain an advantage in an invasion. This might even be on checklists for such actions.

parfe · on Nov 6, 2012

This exactly thing happened a few years ago. China rerouted about 10% of internet traffic, presumably by accident. http://www.theregister.co.uk/2010/04/09/china_bgp_interweb_s...

lini · on Nov 6, 2012

And nothing will change. At least not until someone does this with malicious intent - script kiddie A knocks out big site, or a censoring state decides that it should block a free speech site from the entire Internet.

forgotusername · on Nov 6, 2012

Evil routing has been employed a whole bunch of times going back decades, most visibly a couple of years ago when IIRC Iran (?) started advertising bad routes for a bunch of big sites, including Google

apawloski · on Nov 6, 2012

Pakistan null routed YouTube and accidentally took a big chunk down around the world in 2007.

noselasd · on Nov 6, 2012

I can't find much information about Iran advertising bad routed, but China did: http://bgpmon.net/?p=282

pilom · on Nov 6, 2012

A much more useful thing to do than take out a big site is use BGP to create your own "section of the internet" to do your malicious deeds from, then afterwards remove the BGP routes and your addresses will no longer exist on the internet. So it will be as if you sent packets from a phantom network.

dsl · on Nov 6, 2012

Thats why many people have been archiving every advertised route going back to the mid-90's.

http://archive.routeviews.org/oix-route-views/ http://www.pch.net/resources/data.php?dir=/routing-tables (link appears to be temporarily broken)

stephen_g · on Nov 7, 2012

Script kiddies generally do not have access to edge routers though...

xcode · on Nov 6, 2012

To be accurate, google didn't go down today -- your pathway from your computer to google got 'poisoned'. It wasn't Google's fault.

stingraycharles · on Nov 6, 2012

For what it's worth, this is quite a vulnerability in the internet's routing system. It's also the reason Youtube went offline after Pakistan was deliberately announcing the wrong routes a few years ago because it didn't agree with some videos being broadcasted by Youtube.

http://www.ripe.net/internet-coordination/news/industry-deve...

http://news.cnet.com/8301-10784_3-9878655-7.html

highace · on Nov 6, 2012

This worries me. Am I right in saying a malicious party could actually take down the internet with this?

sp332 · on Nov 6, 2012

Yes, but you'd have to con a lot of big players into trusting your BGP routes first. And the effect would only last as long as it took to change some configurations and write you back out of the internet.

sanxiyn · on Nov 6, 2012

Malicious party? It already happened by accident. http://www.renesys.com/blog/2009/02/longer-is-not-better.sht...

davej · on Nov 6, 2012

They would have to be a 'trusted' malicious party. But in theory, yes. I'm sure it would be reverted very quickly though.

killermonkeys · on Nov 6, 2012

Why wouldn't PCCW preventing its customers from publishing routes outside its whitelist work? It has been a long time since I worked on BGP but that was common practice from back haul carriers to ISPs even at that point (2003). Given the same back haul provider has allowed this twice, it seems like a reasonable ask.

jauer · on Nov 6, 2012

Some carriers are lazy. There may also be politics involved in making national carriers "ask" for permission to advertise routes.

jemfinch · on Nov 6, 2012

Isn't the title a touch sensationalist? Google did not go "offline": it was briefly unavailable for a relatively small number of networks.

ColinWright · on Nov 6, 2012

You can't win. If you quote the title given, people complain. If you change it to something more accurate, the mods change it back, and then people complain anyway.

jemfinch · on Nov 6, 2012

Yes; to be clear, I was complaining about the author's original title, not the submission title.

noselasd · on Nov 7, 2012

For the people affected, google was for all purposes "offline". If the estimate of 3–5% of the internet population is correct, that's a lot of people.

jemfinch · on Nov 7, 2012

I think that if 3-5% of the Internet population suddenly stopped querying Google, Google would have noticed before CloudFlare.

noselasd · on Nov 9, 2012

There weren't any claims that Google were unaware of this. And when things happen at this level, resolving it in 27 minutes can only be done when there's direct contact between people that are able to do anything about it.

hayksaakian · on Nov 6, 2012

It really makes one wonder about the fragility of the internet.

precisioncoder · on Nov 6, 2012

I would say the resilience is what impresses me here. The fact that it's decentralized means that anyone can fix the internet. The fact that this one specific problem was fixed within 26 min by individuals realizing the problem and acting to fix it gives me a warm feeling.

kami8845 · on Nov 6, 2012

I think what you mean is that anyone can break the internet (in this case a random ISP from Indonesia) and that in that case only very specific people could fix it (probably at least a senior network engineer at said ISP).

sp332 · on Nov 6, 2012

Only specific routers that you trust (or are trusted by routers you trust) can break your internet. You can fix your internet by un-trusting those routers.

ChuckMcM · on Nov 6, 2012

At some point anonymous is going to figure out the bgp 'hack' is actually exploitable, unlike taking the root name servers offline and we see a network routing outage for several days. I wish it wasn't so but sometimes that is the only way these things get fixed

dsl · on Nov 6, 2012

First of all to pull off this "hack" you need a router, an AS number, a transit contract with your upstream provider, BGP configured with said upstream, and most importantly your upstream needs to be negligent enough to not apply route filters to your session (which basically means I will only accept routes for IPs owned by company X over company X's session).

Secondly, it is pretty easy to track down who is doing it. Assuming a rouge employee used their employers setup (see first point) to announce once of Google's routes and it managed to propagate, smart people at NOCs around the world start emailing and calling each other pretty quickly. Despite CloudFlare trying to take credit here, I'd put money on the fact the network in question received at least a dozen phone calls and emails. There are services like Renesys and BGPmon that "important" companies sign up for that will scream bloody murder and start paging people if someone unauthorized originates your prefixes.

Third, as this is a known problem, a solution is already in the works and on its way to being implemented. Basically when you are assigned a block of IP addresses, you also get to publish a cryptographically signed statement of how and where that block should show up in the global routing table. See http://www.nanog.org/meetings/nanog49/presentations/Tuesday/...

guiambros · on Nov 7, 2012

Well said, dsl. Almost two decades ago I used to run an ISP in another country, and remember that BGP was already reasonably safe at the time (when v4 started to be implemented), with peers normally rejecting route updates from blocks outside your control.

Yes, there's always the risk of a trusted peer mistakenly leaking routes publicly (and a permissive upstream provider not rejecting it outright), but that's a low risk attack vector.

I do remember this happening a few times, but were quickly spotted and corrected (true, the internet at the time was a lot smaller; you could probably fit all sysadmins of a country in a room..)

I see this article as the CloudFlare guy trying to get credit for an act of civility that many other sysadmins likely have done, silently, in parallel. Of course I'm glad he did, but wouldn't expect anything less. That's just how the internet works.

ps: thanks for the link. NANOG is something that I had long ago erased from my brain. Had a chuckle looking at the archives :)

ChuckMcM · on Nov 7, 2012

Correct of course, you would need to compromise the infrastructure of an ISP. Not that a few hundred dollars or a USB stick drive at the right place couldn't do it. Especially for a less well travelled part of the Internet.

ninetax · on Nov 6, 2012

While this does make sense if I abstract out what a BGP is, I wish I had a deeper knowledge of how the Internet works.

Does anyone know of a book that goes from the basics of networking up to how it's all assembled on a large scale?

A "big book of internet" if you will.

clebio · on Nov 6, 2012

Since I use DuckDuckGo for searches, I probably wouldn't notice this. Not receiving Gmail for a while wouldn't be noteworthy (at least for the first half hour or so).

I'm confused about the times the author gives, though. The article is dated today (11/6) and he says this happened 'today' at 6:24pm PST / 02:24 UTC. But unless I'm mistaken, that is a time currently in the future (http://time.gov/timezone.cgi?Pacific/d/-8/java). I guess he meant yesterday?

robk · on Nov 6, 2012

You're counting across the dateline, so for you it was 11/5.

clebio · on Nov 6, 2012

Am I? Not snark: if I'm misunderstanding this, I truly want to know. I'm in central US, CST, and the article gives PST. That conversion has always just been +2 hours.

Cushman · on Nov 6, 2012

As I read it that was 18:24 yesterday in PST, or 02:24 today in UTC. The use of "today" may just be sloppy dating-- or it may reflect that it was today for most of those affected.

clebio · on Nov 6, 2012

My guess was that he wrote the article yesterday, but didn't publish it until today. It's not a big deal, was just curious.

eastdakota · on Nov 6, 2012

That's correct. Tom wrote the article yesterday (11/5) but I didn't review it and hit publish until today (11/6). Sorry for the confusion.

rdl · on Nov 6, 2012

I am more curious what caused the 4 minute mid-day outage a few days ago. It wasn't BGP, since google.com was still up, but all personalization was down, and YouTube was down.

flannell · on Nov 6, 2012

Not the first time, see here;

http://www.theregister.co.uk/2010/04/09/china_bgp_interweb_s...

runn1ng · on Nov 6, 2012

Can I ask a pretty newbie question - how is BGP connected to IP, TCP and DNS protocols? Is it sitting "below" them, "on top" of them, or is it somewhere else?

jemfinch · on Nov 6, 2012

First, TCP and DNS don't come into it: they both piggyback on IP (TCP directly; DNS via UDP in typical use), so IP is all that's really relevant.

BGP is how routers communicate with each other. Every major edge router for a network is typically connected to many other edge routers for other networks. Each router announces what amounts to their complete routing table: i.e., for every IPv4/IPv6 address that they know how to route, they announce what networks it traverse on the way to the destination.

When a router is deciding which router an IP packet should hop to next, it looks at the packet's destination IP address and consults an in-memory data structure that it has constructed based on the BGP announcements of the routers to which it's connected. Modulo refining nuances (MED/PREF), it looks for two things:

1. It routes the packet according to the most specific network it saw announced. If it sees a packet destined for 1.2.3.4, and one connected router A is announcing a route for 1.2.3.0/24, and another connected router B is announcing a route for 1.2.0.0/16, it will pass along the packet to router A, all other things being equal.

2. As a tiebreaker for announcements with the same network specificity, it looks at the "AS path": the set of networks that the packet will traverse. It picks the router with the shortest path: the least number of traversed networks.

So the answer to your direct question is that BGP is "somewhere else": it's what routers use to communicate to each other "How will you route this IP packet?" and then make reasonable decisions about how they should send packets around the network.

count · on Nov 6, 2012

To be clear - BGP runs on TCP port 179, so it sits in TCP segments, and those are inside IP packets.

btilly · on Nov 6, 2012

Rather than considering them as a meaningful stack, I think it helps to know what each does.

IP is a protocol for taking a chunk of data, slapping some addressing information on it, and then having it be sent, like an electronic letter, from one computer to another by whatever route the network thinks is best. More precisely every computer sends it to a computer it is directly connected to that it thinks is closer. Eventually, hopefully, it gets to the right place.

If you just want to send chunks of data over IP and hope that they get there, you have UDP.

TCP is a more advanced protocol where one computer contacts another, and then a stream of data starts to flow between them through a connection. Under the hood the stream is broken into chunks that are put in IP packets. And there are extra packets for things like, "Hello, trying to connect here" "I got these packets" "I'm done" and so on. Obviously TCP sits on top of IP.

DNS is a protocol for turning a human readable name like news.ycombinator.com into an IP address like 174.132.225.106. Under the hood DNS uses both UDP and TCP.

BGP is a protocol that is used between routers to advertise how to route packets. BGP uses TCP to work, so it is above TCP. But that routing information is used at the IP level, so bad routes can stop IP from working. Which is what happened here. Someone advertised that they were how to get to a lot of Google addresses, so routers began sending Google traffic there. When the packets arrived, they had no idea what to do with them and dropped them. The result is that the IP layer to Google stopped working for a lot of people.

alok-g · on Nov 6, 2012

Newbie question: If BGP uses TCP to work, TCP is above IP, and routers use BGP information to route IP packets, how may bootstrapping happen if needed?

jauer · on Nov 6, 2012

Typically no bootstrapping because IP (and thus TCP) work between hosts in the same subnet without routing.

You would typically have a /30 or /31 subnet containing a pair of routers and have the routers communicate (BGP etc) using those addresses.

ef4 · on Nov 6, 2012

BGP runs on top of TCP, which runs on top of IP.

DNS runs on top of UDP (or sometimes TCP), which runs on top of IP.

Edited to elaborate: most computers on the internet don't need to know anything about BGP. It's not directly involved when you establish connections. Think of it as an automatic configuration system running on the various routers.

pilom · on Nov 6, 2012

Saying BGP runs on top of IP is true but doesn't tell you what BGP does or how Global scale IP doesn't work without BGP.

BGP is the protocol Internet Routers (i.e. not your home router) use to figure out how to route IP addresses to particular routers.

So your home router connects to an internet router at comcast (or your ISP). The Comcast router announces to the rest of the world "Dear world, if you want to connect to any of the IP addresses at X.X.X.X sent those packets to me and I'll deal with them."

ColinWright · on Nov 6, 2012

http://www.tcpipguide.com/free/t_TCPIPExteriorGatewayRouting...

zrail · on Nov 6, 2012

BGP is comprised of long lived TCP connections between routers. The IP of the other router is well known and hard coded in the config afaik but I'm not a network engineer.

brianpan · on Nov 6, 2012

BGP and DNS run on TCP which runs on IP.

http://en.wikipedia.org/wiki/Internet_protocol_suite http://en.wikipedia.org/wiki/OSI_model

derleth · on Nov 7, 2012

In terms of understanding the Internet, the OSI model is less useful than the Internet model outlined in RFC 1122 and RFC 1123, because the OSI model persists in making distinctions without a difference, like the separation of the Application and Presentation layers. (Frankly, most of the time the Link layer is entirely determined by the Physical layer, so that distinction is also of marginal use in the real world.)

On a more historical note, the Internet protocol suite beat the OSI protocol suite. Practically nobody uses the OSI protocols anymore, so why bother trying to fit the Internet protocols into the OSI model?

http://en.wikipedia.org/wiki/Internet_protocol_suite

http://tools.ietf.org/html/rfc1122

http://tools.ietf.org/html/rfc1123

brianpan · on Nov 7, 2012

OSI is what I learned in school an age-and-a-half ago, that's why I included that link also. :)

tomjen3 · on Nov 6, 2012

Does this mean that in the future we should ignore all routes comming from PCCW (since they rebroadcast all rules without filtering)?

wmf · on Nov 6, 2012

That's a good way to effectively disconnect yourself from the Internet. A lot of ISPs are not properly filtering.

ambiguator · on Nov 6, 2012

I don't know much about the processes behind the Internet, but I found this to be a fascinating introduction.

halayli · on Nov 6, 2012

Outage in an ASN != Google went Offline. The title puts the blame on Google which isn't true.

oldcreek · on Nov 6, 2012

The failure has nothing to do with BGP either.

rurounijones · on Nov 7, 2012

Sidenote: In the comments I saw a reply about nanog.com being a great plce to meet other networking peeps.

http://www.nanog.com/ is currently showing a "Welcome to nginx" message

bonobo · on Nov 11, 2012

That's because the right address is http://www.nanog.org

rurounijones · on Nov 12, 2012

Ta muchly.

dangoldin · on Nov 6, 2012

This is a great write up - thanks for posting. I'm slowly beginning to understand how the internet works day by day due to posts like this.

sneak · on Nov 6, 2012

"I'm a network engineer at CloudFlare and I played a small part in helping ensure Google came back online."

Uhh, no. Without the "ensure", then maybe.

oldcreek · on Nov 6, 2012

Even without the "ensure" ... what did this network engineer at CloudFlare do anyway? it was a hardware failure.

saraid216 · on Nov 7, 2012

He filed a bug report.

louischatriot · on Nov 6, 2012

Very interesting explanation, thank you.

rahasia · on Nov 6, 2012

27 minutes, 3-5% traffic, it could means thousand of dollars lost for Google, right? (Does it sueable?)

jQueryIsAwesome · on Nov 7, 2012

Under which law are they going to sue? USA laws don't apply to Indians ISPs.

bonchibuji · on Nov 7, 2012

It was an Indonesian ISP, not Indian.

henrymazza · on Nov 6, 2012

> We use Google Apps for things like email so when we can't reach their servers

Very professional way to do so!

jamesinsf · on Nov 6, 2012

Great explanation. Good show and great job! Very smart engineers at Cloudflare!

yskchu · on Nov 6, 2012

Haha, I was in HK today, and one of those hit, using PCCW services also

tcohen · on Nov 6, 2012

I wish I understood more of this but still really cool!

sunyc · on Nov 6, 2012

almost all bgp transit provider have prefix filtering,

lhnn · on Nov 6, 2012

DAE think that the whole "BGP is broken!" argument is a bit overblown?

If you're going to have a bunch of autonomous systems/networks operating together, with no central authority, it necessarily comes down to trust and relationships.

Shit will occasionally happen. It's important to look at outages, figure out the cause, and work to prevent it. Perhaps, though, this is a best practices issue, and not some fundamental flaw in BGP.