If the datacenters are really too complex for the people running them to understand, I would expect failures like this to drive the airlines towards AWS, Azure, or some similar service.
Having worked at SABRE, it's not as easy as you think, as there are a crapton of systems and people that depend on all that old stuff to keep running as is. SABRE has spent 30 years trying to modernize their mainframe-based system piece by piece. But the core reservation system is actually still highly available; it's the modern bits that wind up less stable. The problem is that travel is a highly interconnected system of many different companies in which complexity is almost impossible to avoid. There are also many systems involved that are not visible to the public, such as weight balancing, crew scheduling, etc., where even one of those failing can screw up airline travel worldwide. It's not always reservations or check-in that's broken; even something as simple as an airport system failing can have a domino effect all over the country.
It reminds me a bit of Vernor Vinge's "zones of thought" books (sci fi) in which many of our current technological dreams have failed to materialize and cascading failures of brittle automated systems cause logistic collapses that wipe out advanced civilizations.
> I would expect failures like this to drive the airlines towards AWS, Azure, or some similar service
I don't think that would help much. It's not really the core hardware or operating systems that tend to cause these types of outages.
More typically, it's the dependency chain between locations, applications, and services. And, there's more than one system that can cause a ground halt. The check-in service, the no-fly list functionality (which the govt runs), weight/balance, crew scheduling, dispatch functions, and so on.
Check-in is a good example. You can lose that either through a failure in the complex WAN, failures in the check-in backend service, failures with the no-fly service (run by the govt) or connectivity to it, failures in the CRS/GDS, failures in various services around check-in kiosks, failures in the online checkin, and so forth.
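To put rough numbers on why a long dependency chain hurts, here's a minimal sketch. The availability figures are made up purely for illustration (I don't have real ones), and it assumes failures are independent, which real outages often aren't.

```python
from math import prod

# Hypothetical availability figures, invented for illustration only.
DEPENDENCIES = {
    "wan": 0.999,
    "checkin_backend": 0.999,
    "no_fly_service": 0.999,
    "crs_gds": 0.999,
    "kiosk_services": 0.999,
}

HOURS_PER_YEAR = 24 * 365


def serial_availability(availabilities):
    """Availability of a chain that needs every dependency up at once,
    assuming the dependencies fail independently of each other."""
    return prod(availabilities)


def expected_downtime_hours(availability):
    """Expected hours per year that something with this availability is down."""
    return (1.0 - availability) * HOURS_PER_YEAR


chain = serial_availability(DEPENDENCIES.values())
print(f"single dependency downtime: {expected_downtime_hours(0.999):.1f} h/year")
print(f"check-in chain availability: {chain:.4f}")
print(f"check-in chain downtime: {expected_downtime_hours(chain):.1f} h/year")
```

The point is only that the chain is always worse than its weakest link, and every extra dependency multiplies the exposure.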
Once they go down, you also face an unusually high spike in request volume when you're trying to get them back up. It creates a wave that can overwhelm different parts of the system.
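One common way clients soften that kind of reconnect wave is exponential backoff with jitter. A minimal sketch follows; I'm not claiming any airline system does this, and the function name and the `base`/`cap` defaults are my own assumptions.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    "Full jitter": each wait is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that all failed at the
    same moment spread their retries out instead of retrying in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Hypothetical usage: waits for 8 retries from one client.
for attempt, delay in enumerate(backoff_delays(8)):
    print(f"retry {attempt}: wait {delay:.2f}s")
```

The randomness is the important part: plain exponential backoff without jitter still has every client coming back at the same instants.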
For the more recent failures (across different airlines) listed above, I know one was a routing storm on the IP network, one was the checkin service, and one was the central reservations system...I think a botched version upgrade. Similar effects, different root causes.
Not to say it's okay, or shouldn't be addressed, but just noting that there's not really one smoking gun.
I'm not advocating that it is ok, but I am sure these airline systems are super old and wouldn't be surprised if they use IBM DB2 or similar ancient database technology. Moving to the cloud is not a trivial task for these sorts of mission critical antiquated systems.
It's a legacy story like none other, in fact. The predecessor/origins of SABRE is IBM's Airline Control Program (ACP). When I worked for IBM years ago I heard many stories of how difficult it was to try to modernize to a newer system because of absurd complexity, but just as much because the whole airline industry became so inextricably bound to the legacy:
I'm currently doing consulting work for That Major Australian Airline and, while I'm rather high up the stack, you do get a sense of the amount of legacy that's built into everything and the monumental effort it would be to migrate all this old stuff to newer infra. I mean, AFAICT there's no database/service to _query_ flights - you have to register a web hook to receive flight data and store it yourself.
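For anyone who hasn't dealt with that pattern, "receive it and store it yourself" looks roughly like the sketch below. Everything here is hypothetical: the payload fields (`flight_id`, `status`, `updated_at`), the table layout, and the function names are mine, not the airline's actual feed.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the local store that the pushed flight data lands in."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS flights ("
        "flight_id TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
    )
    return conn


def handle_flight_event(conn, raw_body):
    """Store one pushed update, keeping the most recently received row per flight."""
    event = json.loads(raw_body)
    conn.execute(
        "INSERT OR REPLACE INTO flights (flight_id, status, updated_at) "
        "VALUES (?, ?, ?)",
        (event["flight_id"], event["status"], event["updated_at"]),
    )
    conn.commit()


def query_flight(conn, flight_id):
    """The query you wish upstream offered: returns (status, updated_at) or None."""
    return conn.execute(
        "SELECT status, updated_at FROM flights WHERE flight_id = ?",
        (flight_id,),
    ).fetchone()


# Hypothetical usage: two pushes for the same flight, then a local lookup.
store = open_store()
handle_flight_event(store, '{"flight_id": "XX123", "status": "scheduled", "updated_at": "2017-01-01T00:00:00Z"}')
handle_flight_event(store, '{"flight_id": "XX123", "status": "delayed", "updated_at": "2017-01-01T01:00:00Z"}')
print(query_flight(store, "XX123"))
```

The catch is that your copy is only as good as the events you managed to receive, so anything missed while your endpoint was down is simply gone unless the sender replays it.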
New projects are cloud-first, and more and more stuff gets migrated or replaced with equivalents which run in whatever cloud provider. But I can't even imagine how the replacements for all these old legacy services would go down.
You would be surprised how much has already been migrated away from the old IBM systems (TPF). The big players these days are Lufthansa, Jeppesen, Navitaire, Apollo, Sabre, etc.
Apollo is now Galileo. Galileo, Sabre, HP Shares, and Amadeus (the actual "Big 4" in this space) all still use the TPF operating system on IBM mainframes. They all are offloading functionality piecemeal, to more modern systems. But TPF is still at the core.
You mentioned Navitaire. They are not using TPF...they use COBOL on Windows (not kidding). They do have a large list of airline customers, but none with a big fleet. Reportedly, it doesn't scale up well enough to serve a large airline.
TPF also lives on in the financial world as well, like at Visa, for example.
IBM DB2 is downright futuristic compared to what some of these systems are. SABRE, for example, is probably the granddaddy and horror textbook example of what's commonly referred to as a "legacy codebase" (although the IRS Master Files written in S/370 assembly could give it a run).
Both the individual and business master files are still written in IBM mainframe assembly language, and are circa 56 years old. See the table on page 4 of this PDF for a list of the oldest systems in operation:
Number three on the list is The DoD's Strategic Automated Command and Control System, which runs on "an IBM Series/1 Computer—a 1970s computing system—and uses 8-inch floppy disks". No biggie; it just "coordinates the operational functions of the United States’ nuclear forces, such as intercontinental ballistic missiles, nuclear bombers, and tanker support aircrafts."
> Standard CFOL Access Protocol (SCAP) is written in COBOL/Customer Information Control System (CICS). SCAP downloads Corporate Files On-Line (CFOL) data from the IBM mainframe at the Enterprise Computing Center, Martinsburg. The CFOL data resides in a variety of formats (packed decimal, 7074, DB2, etc.)
Never ever underestimate the government and military's ability to keep old systems operational well past what others consider reasonable.
Anecdotal: back in the late 80s I was in the USAF. Our secure communication center was running the first model Burroughs machine to not use tubes. It was that old. It could boot from cards, paper tape, or switches. The machine was older than many people who would be assigned to it. This was closely repeated in the main data center (personnel records, inventory, and such), which had a decade-old system that migrated off physical cards by '89 but still took them as images off 5.25-inch floppies uploaded by PC.
DB2 isn't "ancient" in any meaningful sense. The first release shipped long ago but IBM has kept it fully up to date since then and it's still competitive with any other relational database for high-volume OLTP.
Can confirm both - I've been writing LOB software for the financial industry in VB6 up until very recently, and stopped because I switched jobs, not because they've stopped writing them :P
To be honest, VB6 is much better than some of the other stuff they have around.
I've now switched to a travel agency and am interacting with the Sabre blue screen systems that are similarly old.
> Moving to the cloud is not a trivial task for these sorts of mission critical antiquated systems
The question is whether moving to _anything_ more modern is less costly than keeping the current systems, which appear to have many single points of failure.
> Often I've seen businesses reason that the failures are cheaper than the upgrades.
I would love to know what the cost of today's outage is in terms of overtime, gate fees, fuel, additional crew, &c.
Delta's cost ~$150MM [1]. That's something on the order of a thousand mid- to senior-level programmers for a year in my area. Even if you allocate a quarter of that cost to computer costs (which I'm betting is a fairly large overestimate), that still leaves a sizable team.
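Spelling out the back-of-envelope math, as a rough sketch: the per-programmer figure is just what the two numbers above imply, not a quoted salary, and the variable names are mine.

```python
OUTAGE_COST = 150_000_000  # ~$150MM, the outage figure quoted above
PROGRAMMERS = 1_000        # "a thousand ... programmers for a year"
COMPUTE_SHARE = 0.25       # the "quarter of that cost" set aside for computers

# Implied fully loaded cost of one programmer for one year.
cost_per_programmer_year = OUTAGE_COST / PROGRAMMERS

# What is left for people after the compute allocation, and the team it buys.
people_budget = OUTAGE_COST * (1 - COMPUTE_SHARE)
team_size = people_budget / cost_per_programmer_year

print(f"implied cost per programmer-year: ${cost_per_programmer_year:,.0f}")
print(f"budget left for people: ${people_budget:,.0f}")
print(f"team that funds for one year: {team_size:.0f} programmers")
```

So under those assumptions the "sizable team" is about 750 people for a year, at an implied $150k each.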
> DOS interfaces.
TUI interfaces can be really, really efficient in terms of navigation and getting around. I rarely see GUI apps that function anywhere near as smoothly or keyboard only.
> TUI interfaces can be really, really efficient in terms of navigation and getting around. I rarely see GUI apps that function anywhere near as smoothly or keyboard only.
TUIs have a steeper learning curve, but I agree that once someone masters the hotkeys for that particular interface, they're much quicker. In addition, depending on how old the computer running the TUI is, the staff may not be able to use the computer to browse anything else. ;)
> That's something on the order of a thousand mid- to senior- level programmers for a year in my area
I shudder to think of a scheduling program as complex as an airline's ticketing system being run on a bit of software written by a thousand programmers in only one year.
Okay, I admit that I have absolutely no idea what software like that costs to build, but surely they could have rebuilt their entire software stack for that, couldn't they?
No. Software at that scale is humungously expensive.
I worked on quoting an insurance policy core (just the core mind you, not the extras), and for a medium-sized insurance company it would reach that amount of money.
I suspect a complete rewrite would go into the billions.
The company I worked for ended up not doing it (and regretting it).
It ends up not being that many people, but extremely highly paid consultants (200 to 400 dollars an hour), and extremely high licensing costs. Some projects can go on for years when they should take 1 to 2 years.
It's extremely profitable and very well paid. One such company, Guidewire, is one of the top 10 best paying employers in the U.S.
Yeah, a lot of this stuff is on old school mainframes written in assembly. I work in the airline/travel industry and I've seen 28-year-old code still running in production.
A lot of the systems the airlines use are typically hosted by large telcos. Some use SaaS, so those are hosted for them. Very few airlines I know of actually have their own data centers anymore, and typically it's a misconfiguration or an upgrade that causes these outages. Although it can also be a comm problem.