If the datacenters are really too complex for the people running them to underst...

coldcode · on Jan 23, 2017

Having worked at SABRE, it's not as easy as you think, as there are a crapton of systems and people that depend on all that old stuff to keep running as is. SABRE has spent 30 years trying to modernize their mainframe-based system piece by piece. But the core reservation system is actually still highly available; it's the modern bits that wind up less stable. The problem is that travel is a highly interconnected system of many different companies in which complexity is almost impossible to avoid. There are also many systems involved that are not visible to the public, such as weight balancing, crew scheduling, etc., where even one of those failing can screw up airline travel worldwide. It's not always reservations or checkin that's broken; even something as simple as an airport system failing can have a domino effect all over the country.

caconym_ · on Jan 23, 2017

It reminds me a bit of Vernor Vinge's "zones of thought" books (sci fi) in which many of our current technological dreams have failed to materialize and cascading failures of brittle automated systems cause logistic collapses that wipe out advanced civilizations.

tyingq · on Jan 23, 2017

> I would expect failures like this to drive the airlines towards AWS, Azure, or some similar service

I don't think that would help much. It's not really the core hardware or operating systems that tend to cause these types of outages.

More typically, it's the dependency chain between locations, applications, and services. And, there's more than one system that can cause a ground halt. The check-in service, the no-fly list functionality (which the govt runs), weight/balance, crew scheduling, dispatch functions, and so on.

Check-in is a good example. You can lose that either through a failure in the complex WAN, failures in the check-in backend service, failures with the no-fly service (run by the govt) or connectivity to it, failures in the CRS/GDS, failures in various services around check-in kiosks, failures in the online checkin, and so forth.

Once they go down, you also face an unusually high spike in request volume when you're trying to get them back up. It creates a wave that can overwhelm different parts of the system.
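To make the recovery wave concrete: one common client-side mitigation (not something I'm claiming any airline actually uses) is exponential backoff with jitter, so that retries spread out instead of all landing at the same instant. A minimal sketch, where `backoff_delays`, `base`, and `cap` are hypothetical names made up for the example:

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized wait time (in seconds) per retry attempt.

    Hypothetical helper using "full jitter": the delay for attempt i is
    drawn uniformly from [0, min(cap, base * 2**i)].
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles on each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Each client would sleep for the next delay before retrying. Because the waits are randomized, a fleet of kiosks or agents coming back at once does not hit the backend in lockstep.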

For the more recent failures (across different airlines) listed above, I know one was a routing storm on the IP network, one was the checkin service, and one was the central reservations system...I think a botched version upgrade. Similar effects, different root causes.

Not to say it's okay, or shouldn't be addressed, but just noting that there's not really one smoking gun.

nodesocket · on Jan 23, 2017

I'm not advocating that it is ok, but I am sure these airline systems are super old and wouldn't be surprised if they use IBM DB2 or similar ancient database technology. Moving to the cloud is not a trivial task for these sorts of mission critical antiquated systems.

eplanit · on Jan 23, 2017

It's a legacy story like none other, in fact. The predecessor/origin of SABRE is IBM's Airline Control Program (ACP). When I worked for IBM years ago I heard many stories of how difficult it was to try to modernize to a newer system because of absurd complexity, but just as much because the whole airline industry became so inextricably bound to the legacy:

https://en.wikipedia.org/wiki/IBM_Airline_Control_Program

madeofpalk · on Jan 23, 2017

I'm currently doing consulting work for That Major Australian Airline and, while I'm rather high up the stack, you do get a sense of the amount of legacy that's built into everything and the monumental effort it would be to migrate all this old stuff to newer infra. I mean, AFAICT there's no database/service to _query_ flights - you have to register a web hook to receive flight data and store it yourself.

New projects are cloud-first, and more and more stuff gets migrated or replaced with equivalents which run in whatever cloud provider. But I can't even imagine how the replacements for all these old legacy services would go down.

pbarnes_1 · on Jan 23, 2017

They used to run on Tandem NonStops with a dual DC setup in Sydney and Newcastle. Those were the days.

hourislate · on Jan 23, 2017

You would be surprised how much has already been migrated away from the old IBM systems (TPF). The big players these days are Lufthansa, Jeppesen, Navitaire, Apollo, Sabre, etc.

tyingq · on Jan 23, 2017

Apollo is now Galileo. Galileo, Sabre, HP Shares, and Amadeus (the actual "Big 4" in this space) all still use the TPF operating system on IBM mainframes. They all are offloading functionality piecemeal, to more modern systems. But TPF is still at the core.

You mentioned Navitaire. They are not using TPF...they use COBOL on Windows (not kidding). They do have a large list of airline customers, but none with a big fleet. Reportedly, it doesn't scale up well enough to serve a large airline.

TPF also lives on in the financial world as well, like at Visa, for example.

_ugfj · on Jan 23, 2017

IBM DB2 is downright futuristic compared to what some of these systems are. SABRE, for example, is probably the granddaddy and horror textbook example of what's commonly referred as "legacy codebase" (although the IRS Master Files written in S/370 assembly could give it a run).

gravypod · on Jan 23, 2017

There's no way the IRS still uses an S/370 system. Please give me a link so I can have a laugh.

wpietri · on Jan 23, 2017

Both the individual and business master files are still written in IBM mainframe assembly language, and are circa 56 years old. See the table on page 4 of this PDF for a list of the oldest systems in operation:

https://oversight.house.gov/wp-content/uploads/2016/05/2016-...

Number three on the list is the DoD's Strategic Automated Command and Control System, which runs on "an IBM Series/1 Computer—a 1970s computing system—and uses 8-inch floppy disks". No biggie; it just "coordinates the operational functions of the United States’ nuclear forces, such as intercontinental ballistic missiles, nuclear bombers, and tanker support aircrafts."

_ugfj · on Jan 23, 2017

I am wrong then! 56 years? That's the sixties. Yes, https://www.cnet.com/news/irs-trudges-on-with-aging-computer... it's from 1962 according to this 2008 article and the S/370 was introduced in 1970. It's actually an IBM 7074. God above. http://webcache.googleusercontent.com/search?q=cache:J3hXqKq...

This is a PDF from 2016 https://www.irs.gov/pub/irs-utl/scap-pia.pdf

> Standard CFOL Access Protocol (SCAP) is written in COBOL/Customer Information Control System (CICS). SCAP downloads Corporate Files On-Line (CFOL) data from the IBM mainframe at the Enterprise Computing Center, Martinsburg. The CFOL data resides in a variety of formats (packed decimal, 7074, DB2, etc.)

7074 format. weeps

raverbashing · on Jan 23, 2017

Yeah, but it's better for the ICBM control systems to run on 8" floppies; it's much safer (also because of the added friction).

100ideas · on Jan 23, 2017

Distorts the attack surface these legacy systems present.

hindsightbias · on Jan 23, 2017

Code that works, and has for 40 years.

You'll never write something that lasts that long.

GFischer · on Jan 23, 2017

I feel really old when I realize some of my code is almost halfway there, and the odds are it WILL reach 40 years :(

You seriously underestimate inertia at financial institutions.

ams6110 · on Jan 23, 2017

New IBM z-series mainframes will still run most System 370 assembly language programs. So it's possible.

Shivetya · on Jan 23, 2017

never ever underestimate the government and military's ability to keep old systems operational well past what others consider reasonable.

anecdotal, back in the late 80s I was in the USAF. Our secure communication center was running the first model Burroughs machine to not use tubes. It was that old. It could boot from cards, paper tape, or switches. The machine was older than many people who would be assigned to it. This was closely repeated in the main data center (personnel records, inventory, and such), which had a decade-old system that migrated off physical cards by '89 but still took them as images off 5.25" floppies uploaded by PC.

nodesocket · on Jan 23, 2017

You're right, DB2 is way too hipster and modern for airlines.

nradov · on Jan 23, 2017

DB2 isn't "ancient" in any meaningful sense. The first release shipped long ago but IBM has kept it fully up to date since then and it's still competitive with any other relational database for high-volume OLTP.

mbesto · on Jan 23, 2017

> but I am sure these airline systems are super old and wouldn't be surprised if they use IBM DB2 or similar ancient database technology.

Not even. They're still on mainframes.

sithadmin · on Jan 23, 2017

I don't get what you're trying to get at here. DB2 started its life on IBM mainframes (MVS), and is still often run on z/OS or i.

tyingq · on Jan 23, 2017

They are still on mainframes running TPF, where DB2 is not an option. The database is TPFDF, a KV datastore.

compuguy · on Jan 23, 2017

The financial industry still has software running on mainframes. And some are still writing LOB software in VB6.

GFischer · on Jan 23, 2017

Can confirm both - I've been writing LOB software for the financial industry in VB6 up until very recently, and stopped because I switched jobs, not because they've stopped writing them :P

To be honest, VB6 is much better than some of the other stuff they have around.

I now switched to a travel agency and am interacting with the Sabre blue screen systems that are similarly old.

jimktrains2 · on Jan 23, 2017

> Moving to the cloud is not a trivial task for these sorts of mission critical antiquated systems

The question is whether moving to _anything_ more modern is less trivial/costly than keeping the current systems, which appear to have many single points of failure.

shakna · on Jan 23, 2017

Often I've seen businesses reason that the failures are cheaper than the upgrades.

That metric is often miscalculated as the initial cost is more of a deal than future revenue.

So they stay with antiquated mainframes and DOS interfaces.

Providing jobs for people like myself who can code in PL1, though I'd much rather take a brick to the face than do it again.

jimktrains2 · on Jan 23, 2017

> Often I've seen businesses reason that the failures are cheaper than the upgrades.

I would love to know what the cost of today's outage is in terms of overtime, gate fees, fuel, additional crew, &c.

Delta's cost ~$150MM [1]. That's something on the order of a thousand mid- to senior- level programmers for a year in my area. Even if you allocate a quarter of that cost to computer costs (which I'm betting is a fairly large overestimate), that still leaves a sizable team.
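Spelling out that back-of-envelope math as a rough sketch: the $150k per programmer-year is only what's implied by "$150MM buys a thousand programmers for a year", not a sourced salary figure, and `team_size` is a hypothetical helper name.

```python
OUTAGE_COST_USD = 150_000_000  # the ~$150MM Delta figure cited above
PROGRAMMER_YEARS = 1_000       # "a thousand programmers for a year"

# Implied fully loaded cost of one programmer for one year (an assumption).
COST_PER_PROGRAMMER_YEAR = OUTAGE_COST_USD / PROGRAMMER_YEARS


def team_size(total_cost, share, cost_per_programmer_year):
    """Programmers fundable for one year from `share` of `total_cost`."""
    return (total_cost * share) / cost_per_programmer_year


# A quarter of the outage cost, allocated to computer costs.
quarter_team = team_size(OUTAGE_COST_USD, 0.25, COST_PER_PROGRAMMER_YEAR)
```

Under those assumptions `COST_PER_PROGRAMMER_YEAR` works out to 150,000 and `quarter_team` to 250 programmers for a year.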

> DOS interfaces.

TUI interfaces can be really, really efficient in terms of navigation and getting around. I rarely see GUI apps that function anywhere near as smoothly or keyboard only.

[1] http://www.datacenterknowledge.com/archives/2016/09/08/delta...

et-al · on Jan 23, 2017

> TUI interfaces can be really, really efficient in terms of navigation and getting around. I rarely see GUI apps that function anywhere near as smoothly or keyboard only.

TUIs have a steeper learning curve, but I agree that once someone masters the hotkeys for that particular interface, they're much quicker. In addition, depending on how old the computer running the TUI is, the staff may not be able to use the computer to browse anything else. ;)

shakna · on Jan 23, 2017

I absolutely agree that TUIs can be great, especially at user congestion sites, like POS and similar points.

I meant DOS when I said DOS. Where you run into program/system memory split problems when you keep piling on features.

vacri · on Jan 23, 2017

> That's something on the order of a thousand mid- to senior- level programmers for a year in my area

I shudder to think of a scheduling program as complex as an airline's ticketing system being run on a bit of software written by a thousand programmers in only one year.

jimktrains2 · on Jan 23, 2017

I didn't mean to imply it'd take a single year, but just trying to think about the cost of manpower.

blhack · on Jan 23, 2017

>Delta's cost ~$150MM [1]

Okay, I admit that I have absolutely no idea what software like that costs to build, but surely they could have rebuilt their entire software stack for that, couldn't they?

slededit · on Jan 23, 2017

You are excluding the costs of the bugs such a rewrite would inevitably produce. Normally that is drastically more expensive than dev time.

GFischer · on Jan 23, 2017

No. Software at that scale is humungously expensive.

I worked on quoting an insurance policy core (just the core mind you, not the extras), and for a medium-sized insurance company it would reach that amount of money.

I suspect a complete rewrite would go into the billions.

jimktrains2 · on Jan 23, 2017

Just curious, where did all that money go? How many people for how long?

GFischer · on Jan 23, 2017

The company I worked for ended up not doing it (and regretting it).

It ends up not being that many people, but extremely highly paid consultants (200 to 400 dollars an hour), and extremely high licensing costs. Some projects can go on for years when they should take 1 to 2 years.

It's extremely profitable and very well paid; one such company, Guidewire, is one of the top 10 best paying employers in the U.S.

http://www.cio.com/article/3064769/careers-staffing/10-best-...

gaius · on Jan 23, 2017

What? You can trivially run DB2 "in teh cloudz" http://www.ibm.com/analytics/us/en/technology/cloud-data-ser...

roscoebeezie · on Jan 23, 2017

Yeah a lot of this stuff is on old school mainframes written in assembly. I work in the airline/travel industry and I've seen 28 year old code still running in production.

DrJokepu · on Jan 23, 2017

DB2 is actually a fairly decent database. I would be really surprised if DB2 was the weakest link in their technology stack.

ams6110 · on Jan 23, 2017

I wonder if they are still interconnected with 56Kb leased lines?

jwilliams · on Jan 23, 2017

DB2 isn't that old, relatively. It's in the same league as Oracle or many other mature relational databases. You might be thinking of IMS.

gaius · on Jan 23, 2017

The funny thing is if you take about 5% of IMS's features and reliability, you get super-trendy MongoDB.

hourislate · on Jan 23, 2017

A lot of the systems the airlines use are typically hosted by large telcos. Some use SaaS, so those are hosted for them. Very few airlines I know of actually have their own data centers anymore, and typically it's a misconfiguration or an upgrade that causes these outages. Although it can also be a comm problem.