Epic Games certificate expiration incident report (epicgames.com)
107 points by gwtabn on April 16, 2021 | 79 comments



In my experience, certificate issues are a huge tell about an organization and how it treats its IT folks. Every place I've worked that has had issues with last-minute certificate changes, or certificates expiring without renewal, has had a systemic problem with an underpaid and understaffed IT department.

This is not a new problem: organizations will always choose guaranteed profits over possible loss of business unless the loss is catastrophic. I just wish that in this case, instead of trying to make it seem like a big deal by writing a multipage excuse, a company would for once be honest and say, 'The risk percentage did not fall in our favor this time, but we're not going to do anything about it because it didn't really impact our profits.'


Video game developers are underpaid because they have an undying love for video games and are willing to work for less than they could make elsewhere.

I suspect this becomes a problem in the context of hiring devops people, because while you can argue that writing game engines and working on game logic is more fun and justifies working for less, it's hard to argue that a devops job at Epic running game servers and websites is any more exciting than running servers and websites anywhere else.

This puts Epic in the situation of having to pay market rate to attract devops people but below market rate to attract developers, which fucks up their pay scaling completely. What ends up happening is they just don't adjust their pay scale at all, which means they're hiring cheap devops people.


This. I interviewed for a role with a game studio and the pay was 40% lower than what I'm making.

I was just curious, since they approached me, and I had fun with the experience, even telling them I didn't play their games and had no idea about them.

The recruiter had no idea of local wages.


I interviewed with Epic Games and answered all the questions, though I used generic terms to describe each AWS product and drilled down into the specifics/fundamentals of the questions, protocols, configuration gotchas, etc. Got rejected with "no experience with AWS".

Now seeing this I'm sure I dodged a bullet.


> Video game developers are underpaid because they have an undying love for video games and are willing to work for less than they could make elsewhere.

I'm not sure that love lasts forever though. I'm childhood friends with a lot of people that went into games and left by their 30s because they couldn't justify the pay difference. That said, maybe the games industry doesn't need these experienced people.


I work as a devops in the games industry. It’s true that it’s underpaid, and by quite a bit. But it’s not as bad as for the programming teams; IME devops pays more.


Epic pays very well compared to the rest of the industry, and way above your average dev company. Your comment is out of touch with reality, because it's definitely not gameplay devs who manage certificates; they have a central team, like Google does.

When people say video games don't pay well, that does not apply to the likes of EA, Activision, Epic, Unity, etc.


> ... Your comment is out of touch with reality, because it's definitely not gameplay devs who manage certificates; they have a central team, like Google does.

The main point of my comment is that there's a natural downward pressure on the salary of their software engineers due to being in the games industry, which makes it difficult to find devops people (which, for the purpose of this conversation, are distinct from software engineers) at a similar rate, and makes it difficult to justify paying devops people at the market rate.


Epic doesn't pay below market rate. They can't offer stock because they aren't public, but they offer cash bonuses of 2x-4x salary.

I do agree with the OP that (some) game developers undervalue IT. Oculus had a similar issue, and its pay rate was equal to FAANG (because it is FAANG!), so it came from culture, not pay.


I’ve never worked in a place that gave more than a few thousand dollars in bonuses :/

Yeah I know FAANGs and investment banks can be impressive on the bonus front too

But the prevalence of this just seems disconnected from what is considered normal or bragworthy in the rest of the private sector and world


Can you elaborate more on the bonuses? Is bonus-based income common for these sorts of companies?


Probably also explains why so many games with online components (or online-only gameplay) completely shit themselves the first weekend after launch. The servers can't handle the load, the game is bugged to fuck, and the poor souls in crunch time are stuck in firefighting mode for a while longer. At best they make it look like false exclusivity.

And it continues because as much as people complain, they are holding these game publishers and developers to the lowest possible standard.


I think the model also has to contribute. If you charge a fixed price up front for a game then the infrastructure for online play is almost pure cost.


Are sysadmins "devops people" now? I'm a bit confused with the terminology, I thought devops was a strategy making developers and sysadmins work together.


Devops has 50 definitions. One of them is "What we call Sysadmins these days but with more scripts and APIs and less manually logging into machines and running commands"


> Every place I've worked that has had issues with last-minute certificate changes, or certificates expiring without renewal, has had a systemic problem with an underpaid and understaffed IT department.

That's an interesting anecdote, but it's quite easy to find examples of companies with well-respected, well-paid engineering teams that still have the occasional certificate expire. Microsoft[0], Spotify[1], Facebook[2], and Apple[3] have all had embarrassing outages due to expired certificates.

[0]: https://www.theverge.com/2020/2/3/21120248/microsoft-teams-d...

[1]: https://www.theverge.com/2020/8/19/21375032/spotify-down-son...

[2]: https://www.theverge.com/2018/3/7/17092084/oculus-rift-heads...

[3]: https://www.theverge.com/2015/11/12/9721108/apple-mac-app-st...


Right, handling certificates is one of those chores that, particularly in a startup, is easy to overlook. The average employee turnover is about 2 years, and a certificate can be bought for a bit more than 2 years. Normally someone puts the renewal in their calendar and leaves the company before it fires, so the new employee is welcomed by an expired certificate and a not-so-clear list of places to install it.

That's why things like AWS Certificate Manager + ELB are useful, so that certificates are mostly auto-renewed.

It is a chore that has bitten most of the places where I have worked.
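
To make that concrete, here's a rough sketch (untested; assumes boto3 credentials/region are configured and the certs are ACM-managed) of sweeping ACM for anything AWS won't auto-renew or that is getting close to expiry:

    import boto3
    from datetime import datetime, timezone

    # Flag ACM certificates that won't be auto-renewed or are near expiry.
    # Field names are from the ACM DescribeCertificate API.
    acm = boto3.client("acm")
    for summary in acm.list_certificates()["CertificateSummaryList"]:
        cert = acm.describe_certificate(
            CertificateArn=summary["CertificateArn"])["Certificate"]
        not_after = cert.get("NotAfter")
        if not_after is None:
            continue  # e.g. still pending validation
        days_left = (not_after - datetime.now(timezone.utc)).days
        if cert.get("RenewalEligibility") != "ELIGIBLE" or days_left < 30:
            print(f"check {cert['DomainName']}: {days_left} days left, "
                  f"renewal={cert.get('RenewalEligibility')}")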


The Apple issue was not a case of forgetting to renew a certificate, certain 3rd party apps just weren't handling the upgrade correctly. So maybe not quite as easy to find examples after all.


> Every place I've worked that has had issues with last-minute certificate changes, or certificates expiring without renewal, has had a systemic problem with an underpaid and understaffed IT department.

I've seen the opposite: organizations that spent so much on the department that everyone was getting promoted to manager and hiring someone underneath themselves to manage things. Responsibilities were shuffled around as the department was constantly reorganized, until no one really understood who was responsible for what any more, but there were enough low-level employees to blame when things went wrong.

I've seen enough variations of organizational dysfunction that I no longer pretend to be able to guess what's going on behind the scenes.


Yeah, that's why Epic Games have been transparent enough to post this incident report: not to provide some explanation to their customers, or some information that the rest of us might be able to learn something from, but so that people on HN can make entirely unfounded accusations about the state of their organisation based on (at best) weakly correlated behaviours and symptoms.

Be reasonable: you know nothing about how Epic Games treats their IT staff or whether or not the team is adequately resourced. I wouldn't say certificate expiry is something that happens particularly often, but I have seen it happen, and it's been simply an oversight rather than an indication of some serious systemic issue.


The fact that a company can't deal with a scheduled-far-in-advance, highly-public-if-failed event does tell you some things about their priorities / how well they do things they need to do.


Reminder: Mozilla failed this way too.

https://news.ycombinator.com/item?id=19823701


And in recent years they've been crippling extensions more and more, and even completely dropped support for them from their primary mobile browser for over a year now.

So yes, I think this is one of many signs that they're not paying enough attention to extensions, not a totally isolated "accidents happen" event. Were I an extension author, I'd see that event as reason to be more concerned.


> and even completely dropped support for them from their primary mobile browser for over a year now.

You're misinformed. Many extensions work, they are progressively being re-enabled over time, and on the nightly version they are all available, although whether they actually work depends on the state of the underlying APIs. The reason for the whitelist model is that when they swapped to the new mobile browser engine, the underpinnings of many of the extension APIs had to be reimplemented, and they are not all online or bug-free yet.


The nightly browser, and the ridiculous[1] steps you need to take there to use extensions, are not their primary mobile browser. Installs from Google Play alone are two orders of magnitude apart.

And no, I don't consider a small custom list to be "support". It's a high-value list and a solid sign that they're not wholly abandoned, and I do expect it to come eventually, but it's very much not the same as general availability. General availability did exist before.

[1]: https://www.ghacks.net/2020/10/01/you-can-now-install-any-ad...

---

Edit: I broadly agree with their breaking of NPAPI stuff; WebExtensions (as a concept, not necessarily the specifics we have now) has a LOT of very real benefits, and does not inherently prevent equal or better capabilities. But as it stands today, it is still a loss of control.


How long is it going to take to get customizable key commands to where they were in 2016 (five years ago)?


AWS has failed at this numerous times: a company filled with incredibly smart, incredibly capable engineers who routinely failed to notice certs were about to expire, and one of the organisations most obsessed with uptime and with measuring and monitoring everything. I sat through so many internal incident reports there that boiled down to expiring certificates.

Since I left, I understand that they've fully automated, and mandated, all certificate generation and rotation, but there have still been cert expiration events, albeit rare.

Cert expiration events happen. They're zero indication of the intelligence, capability, or maturity of a company's engineering. A certificate is a thing that just works until it doesn't, with zero warning.


They don't have to be renewed on the very last day - non-zero warning is easily achieved. Renew it a month early, and look at it at some point before that month is up.
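
Monitoring gets you the warning even if nobody looks at a calendar. A quick stdlib probe like this (untested sketch), run daily against your endpoint inventory, flags anything inside the 30-day window:

    import socket, ssl, time

    def days_until_expiry(host, port=443):
        # Read the peer certificate's notAfter over a normal TLS handshake.
        # (An already-expired cert fails the handshake, which is also a signal.)
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
        return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

    for host in ["example.com"]:  # substitute your own endpoint list
        days = days_until_expiry(host)
        if days < 30:
            print(f"renew now: {host} expires in {days:.0f} days")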


OK, fine, I'll bite: what specifically are those things it tells you that you can verifiably claim are true about Epic Games, again, specifically?


That they apparently sometimes fail to do these things.

You can't verify anything internal unless you're internal or it has already failed publicly, so you of course have to draw on patterns seen elsewhere. Critical-process failures in one area correlate heavily with failures in others.

Plus, Epic has not exactly shown themselves to be producing consistent quality in anything related to their store, or many internet-connected properties. If they were, this might be more attributable to "accidents happen, it's impossible to prevent them all". It could still be an abnormality, but they're edging further towards "... maybe not though" territory.

---

Edit: let's add a concrete "kinda example, kinda counter-example". Google is a tech company that is pretty good at consistently renewing its many certificates. They recently failed to do so for Google Voice: https://www.bleepingcomputer.com/news/google/recent-google-v...

I think there's a reasonable argument to be made that this reinforces claims that Google Voice is low priority / at higher risk of future issues due to lack of care, i.e. systemic issues, compared to other Google properties. I have no proof, but that doesn't mean it's automatically unreasonable.


Sure, but you can't actually use an example from Google to deduce what's going on at Epic Games.

Don't get me wrong: I'm not saying there aren't problems at Epic Games (most companies have them). What I'm saying is, we're just speculating: how is that helpful? Either to them or to this discussion?

We're either casting vague and hand-wavy aspersions or citing more specific examples where we actually have no idea whether they have any relevance to Epic Games.

It's just noise because, as you've pointed out, we're not internal.


> Sure, but you can't actually use an example from Google to deduce what's going on at Epic Games.

It was an illustration of a thought process that seems to make sense to me.

> It's just noise because, as you've pointed out, we're not internal.

Yes, it is noisier than direct info from the inside, but you may still learn something.


Are you arguing that the internal workings of a company can't be visible at all to outsiders? Or that there's no correlation between the rate of public, easily preventable failures and technical incompetence? Or just that it's not "helpful" somehow to point these things out?


Epic Games published this as a PR move. Nothing more, nothing less. Customers got mad because Epic fucked up, so they had to say something to make it seem complex and totally reasonable.

“We made a bad bet on certs not being that important, it backfired” doesn’t sound good but it’s the truth.

The same thing happened when Delta got wiped out by a power outage. “We made a bad bet on geo redundancy not being important, it backfired” wasn’t good enough for them either, so they pontificated just like Epic did here.

It’s obvious that Epic doesn’t take certificates very seriously here. This is cert management 101. No need to read into it much further.


This seems a bit presumptuous. Epic's Glassdoor reviews[0] don't seem to list pay or staffing as systemic issues.

[0] https://www.glassdoor.co.nz/Reviews/Epic-Games-Reviews-E2669...


I really don’t know one way or the other, though as mentioned in another thread: I’m a devops in games and it pays less but not as poorly as it does for programmers.

That said, Glassdoor is a terrible metric and has been widely criticised as a source of information, because bad reviews can be removed for payment. “Officially” they don’t accept payment to delete reviews, but cleaning up a company’s image is part of one of their packages.

It has also been gamed by employers, but that is obviously a problem for all review sites of this kind.

https://www.reddit.com/r/sysadmin/comments/8tfhxv/glassdoor_...


About 5 years back they were going through a spell of getting lots of engineers from AWS to join them, offering way more than Amazon was. Some of the smartest and most capable systems engineers I know headed in that direction.


In my experience there are a lot of IT departments full of people who know how to click around and hack shit together but aren’t what you’d call classically trained experts.

Kinda like “I’ll get my nephew to make my website”


This is so overfit. Unbelievable anyone goes along with it. I was paid north of $400k total comp the last time I made this error. Easy mistake to make.


I don't think this has anything to do with the treatment of IT folks. It has more to do with the validity period of certificates. If your certificate expires every month, you will have a system or process in place to deal with that (preferably an automated one). However, if your certificate expires every two years or so, someone will set a calendar reminder, leave the company at some point, and there is your problem.


We had this issue a few times at the place I previously worked.

At first, it wasn't clear whose responsibility it was, since back in the operations days, renewal emails would go to someone's specific address, or to a mailing group where most of the employees on it had left while new employees weren't added because they didn't know about it.

After it happened once or twice, metrics were set up to track expiring certificates (they were mostly all migrated to AWS Certificate Manager, I believe), while a few key ones couldn't be.

As a bit of background, we also follow the Google-esque model of not having a phone number for customer support and requiring customers to submit a ticket. We do have outgoing calls but no incoming phone number.

I say that because those key certificates would generate an email that said something like "Press this button and we'll call you to confirm you want to renew" so as you can imagine, my first thought was "Well, how the fuck is shit gonna work?"

I think in the end we just ended up calling the certificate provider to say we don't have a phone number and then we managed to get them migrated to DNS-based validation after some time.

This too wasn't a case of being underpaid but rather a lack of knowledge. It's the sort of task that some particular person did for a long time, but then they left, so none of us newer folks even knew where these things were provisioned from. Additionally, you don't feel like you have the authority to, e.g., call up some multinational provider and be like "Hi, we own this thing, but umm, I have no idea how to go about renewing it". It feels like being a teenager calling up about a first job, haha.

It's just one of the casualties of "high growth" businesses, mixed with humans being bad at seeing cause and effect when the gap between the two is super wide. The cause being people leaving, and the effect being "I forgot to ask how to do X or Y".

I guess I would clarify that we were following a devops model but had transitioned from a classic dev/ops split, so it's quite literally a generational thing: you conceptually don't know how to go about, e.g., renewing a certificate over the phone, because you entered the industry in the era of DNS validation via Let's Encrypt (and because there literally are no phones anymore in the business).


Typical certificate management practices for internal PKI are just absolutely set up to cause outages like this. The certificates get issued for a year, or two years, or whatever. This is infrequent enough that automating the process doesn't feel worthwhile, and then it becomes a run-book that only ever comes out once a year; it's way too easy to add additional services without remembering to monitor which certificates you deployed to them, etc.

Start from the idea that you're going to issue certificates valid for 24 hours, and think how different your environment would need to look.


The 3 month limit from Let's Encrypt was a blessing to me, as it forced me to monitor and automate all the renewals.

I renew once a month, and if things should break, I have a two month window to fix the issues.

Before that, I would receive a Comodo SSL certificate once a year via email, and by then I had always forgotten what I had to do with it. What an unnecessary pain.


I suppose it's the "green lock" that drives people to still use certificates issued in a non-automated way.


Extended validation is dead; none of Chrome, Firefox or Safari show any distinguishing UI at this point.


What browsers still show a green lock, or anything to differentiate EV certs? Only IE?


It's worse than one not feeling it's worth it. Automating rare events is basically futile, because the next time your automation runs, everything will have changed and it will break.


If your automation has good logging and you have good alerting on logs, isn't it much better to see the automated process fail as a notification it needs to be done manually rather than relying on it being remembered?

(Ideally, you'd remember and never set the alert off, but it's still great to have that extra layer.)
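
Even something this crude wrapped around the renewal job beats a reminder (a sketch: certbot is just an example, and notify() is a placeholder for whatever paging/Slack/email hook your org actually reads):

    import logging, subprocess

    logging.basicConfig(level=logging.INFO)

    def notify(message):
        # placeholder: wire this to PagerDuty/Slack/email, whatever gets read
        logging.error("ALERT: %s", message)

    try:
        subprocess.run(["certbot", "renew", "--quiet"], check=True)
        logging.info("renewal run completed")
    except subprocess.CalledProcessError as exc:
        notify(f"cert renewal failed (exit {exc.returncode}); "
               "fix it manually before the expiry window closes")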


It's not much different from a notification telling you the activity is due. The difference is mostly a matter of what kind of notifications your organization ignores, and well, I've seen both cases.

Anyway, the best option is to shorten certificate validity. The approach Let's Encrypt recommends is perfect: run it often, and require several failures before anything breaks.


If that is the benefit, then why not just send an auto-reminder notification and skip the automation part?


When your automation fails, you get to start by fixing 1% of the task, not 100%.


Not if you treat automation as a first class citizen.


I know that automation is key here, with all the benefits that doing things very often brings. BUT in the specific case of certificates, if you have short-lived certs you now have to ensure your CA system works perfectly, as its uptime is now the uptime of your whole platform. Yeah, you can outsource it to AWS ACM and the like, or use HashiCorp Vault, but still, something that before the change was totally static is now an extra moving part.

I'm not advocating against it, just exposing the whole story.


It depends on your organization, but in many ways the enterprise PKI CA is one of the easiest services to run at high availability. There are hardly any shared-data dependencies, so it's easy to scale; it's almost completely CPU bound with highly predictable demand, etc.

Pretending it's "totally static" is exactly the problem. There are only two kinds of things in the software world - things that can stay the same until your next release, and things that need automation. "Almost completely static" is how your post mortem ends up on the front page of HN.

A consideration of the full story also needs to include the risks associated with long-lived certificates. If you lose control of the private key associated with one, what do you do? Are you actually operating a CRL? Are any of your HTTPS clients actually checking the CRL? What would you do if a severe compromise were discovered that affected the signature algorithm you're using?


With a short expiry, let's say 90 days, you should be renewing 30 days ahead of time, so at the 60 day mark you attempt to renew.

This grants you 30 days to fix any problems and get the system back up.
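
That schedule is trivial to encode, which is part of its appeal (a sketch, assuming you track issuance time somewhere):

    from datetime import datetime, timedelta, timezone

    LIFETIME = timedelta(days=90)
    RENEW_AFTER = LIFETIME * 2 / 3   # attempt renewal from the 60-day mark

    def should_renew(issued_at: datetime) -> bool:
        # run daily; every failed attempt still leaves up to 30 days of buffer
        return datetime.now(timezone.utc) - issued_at >= RENEW_AFTER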


Then you have to automate handling the 60-day renewal failure warnings which adds another moving part.


If you can’t reliably automate sending an unignorable message to some set of humans when something fails to happen, you’re going to have a tough time keeping anything actively developed online.


Ugh, I feel this in my bones.


That is some impressively fast mitigation for an unexpected problem. 6 minutes to start the incident process, another 6 minutes to identify the issue, and another 25 minutes to start rolling out the solution.


That’s the kind of performance you get with horizontally scaling to 25 people. /s lol.


Cert expiration is a problem that needs a better solution for when a company does not renew in time. These were internal certificates: still important, but not user-facing.

One possible solution might be having the client introduce an artificial delay of 10 seconds (or some other duration) when it encounters an expired cert, or add an additional second of delay for every day it is expired. This degrades the connection but does not immediately break anything.
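
The penalty curve I'm imagining would be something like this (sketch of the proposal only; actually applying it would mean hooking into the TLS layer):

    from datetime import datetime, timezone

    BASE_DELAY = 10.0  # seconds, per the proposal above

    def expired_cert_delay(not_after: datetime) -> float:
        days_expired = (datetime.now(timezone.utc) - not_after).days
        if days_expired < 0:
            return 0.0                      # cert still valid: no penalty
        return BASE_DELAY + days_expired    # +1 second per expired day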


Oh please no; give me a hard fail I can localize and fix rather than some kind of awful brownout where various parts of the system just go slow and break things just as badly anyway.

Plus you'd need to be way in the guts of the TLS implementation to achieve this; if you're already there, start generating noise a week ahead of the expiration instead.

Or better, none of the above and automate.


Concur. From working at Basho, one key takeaway with distributed systems is that a hard failure is much easier to remediate than a slow machine.

We wanted a database server to fail hard. Running slowly just caused cascading failures.

Of course, in this case you're effectively talking about the entire cluster crashing hard, but that's still easier to cope with than every system responding at a snail's pace.


I agree that automation is ideal. But let's face it: most companies haven't automated.

The goal of a business is not to have perfect engineering practices. It is to fulfill customer requests. When there is an outage in the middle of the night, I'd argue that a degraded system buys time to address the issue.

Regardless of the mechanism, having a sudden, complete breakage is not ideal for a business.


No thank you. An artificial delay of 10 seconds is already broken, and adding an additional second per day doesn't improve anything.

If you plan to implement something like this, then do it right and have the service catch the exception and notify an administrator.


We integrated with a government service. It uses a government-supplied authentication service[1] for machine-to-machine communication, based on OpenID IIRC (OAuth2++).

For this, our customers need an EV certificate. Most of our customers are small and don't have their own IT. It's a mess: most don't understand what it is, don't understand the difference between the two or three certificate files they get, and a lot can't even figure out how to extract the files (inside a password-protected PDF, of all things). Password? What password? ...

And then of course the certificates expire. Just like that. Poof. And the person who ordered them last time has moved on to a new job, so we're back to square one.

We spend so... much... time... on hand holding this for our customers. Didn't take us long to figure out we need to remind them about certificate expiry, but the rest is just such a PITA.

Technically it's a pretty nice solution, but boy it is not made for normal people.

[1]: https://www.digdir.no/digitale-felleslosninger/maskinporten/...


> machine-to-machine communication

> EV certificate

> or three certificate files

So an EV certificate for machine-to-machine communication, where self-managed PKI would be better due to having a single CA that could “know the customer”, and possibly the private key sent in a password-protected PDF?

Did I misread that? Technically it sounds terrible.


I was tired, I think I might have said the wrong thing. They call it a "business certificate" or "enterprise certificate", and for a moment I thought that was EV.

https://www.commfides.com/en/commfides-virksomhetssertifikat...

There's one other CA they (the gov't auth provider) accept for this, though they claim others will follow.


It makes me feel better when everybody else is fucking up the simple things all the time too.


For internal services, why not use self-signed certs with expiration dates in the 22nd century (if the technology allows that)? You don't need public trust, and arguably your own authentication of the cert is more trustworthy than a third party's.

I can imagine exceptions, such as when code requires a publicly-signed cert, but I suspect I'm missing something obvious here.
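
Generating one is trivial, for what it's worth. With the widely used Python 'cryptography' package it's roughly this (a sketch; the hostname is made up, and note some TLS stacks balk at absurdly long validity periods):

    import datetime
    from cryptography import x509
    from cryptography.x509.oid import NameOID
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa

    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME,
                                         "internal.example.local")])
    cert = (
        x509.CertificateBuilder()
        .subject_name(name)
        .issuer_name(name)                  # self-signed: issuer == subject
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(datetime.datetime(2021, 1, 1))
        .not_valid_after(datetime.datetime(2121, 1, 1))  # 22nd century
        .sign(key, hashes.SHA256())
    )
    open("cert.pem", "wb").write(cert.public_bytes(serialization.Encoding.PEM))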


There's a lot of software that sucks at dealing with internal CAs. It's a pain in the ass. Even where there is a single, central store like on Windows, applications will decide they'd rather use their own certificate bundle.


> ...we felt it was important to share our story here in hopes that others can also take our learnings and improve their systems.

I always bristle at this use of ‘learnings,’ especially in cases where ‘lessons’ would suffice. However, it turns out this usage goes back to Middle English and is also in Shakespeare’s Cymbeline:

Puts to him all the learnings that his time

Could make him the receiver of, which he took

As we do air, fast as ’twas ministered,


Not sure internal services necessarily require valid certificates. Most of them don't even require encryption. Encryption, decryption, signing, and validating signatures all cost CPU cycles and increase total power consumption.

Looking at what Epic is doing, I would encrypt customer data and everything that involves money. IMO only communication between data centers, with external payment providers, and with users must be encrypted and require valid certificates.


Modern CPUs come with instructions that make symmetric crypto very cheap. And if you err in the other direction you end up with "SSL added and removed here! :^)"


I'm running Let's Encrypt certificates on services that are only accessible in my home network. And I live alone.

I mean, why not?

(Granted my certs actually failed earlier this week since my automation had broken)


One problem: Let's Encrypt certs only work for public domains.


You can verify ownership of a domain via DNS[1], so you don't need the IPs in the A/AAAA records to be publicly accessible, or to be public IPs at all. Indeed, you don't even need those A/AAAA records to be visible to the internet from your DNS server.

You do need a domain though.

[1]: https://letsencrypt.org/docs/challenge-types/#dns-01-challen...
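
For the curious, the whole challenge boils down to publishing one hash derived from the ACME token and your account key. Roughly (a sketch per RFC 8555; the domain is illustrative):

    import base64, hashlib

    def b64url(data: bytes) -> str:
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

    def dns01_txt_value(token: str, account_key_thumbprint: str) -> str:
        # key authorization = challenge token + "." + JWK thumbprint (RFC 7638)
        key_auth = f"{token}.{account_key_thumbprint}"
        return b64url(hashlib.sha256(key_auth.encode()).digest())

    # publish as: _acme-challenge.internal.example.com  TXT  <value>
    # the A/AAAA records for the name itself can stay private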


Right; technically the ACME protocol itself could be implemented on a private network, but honestly it has a whole mess of complexity because it's designed around the assumption that the requester and the issuer are arms-length counterparts.


Is that really a problem though? Just create a dummy public-facing page.



