GitHub having issues today (githubstatus.com)
137 points by mangoman on Jan 9, 2024 | 87 comments



The fact that Github has been so unstable for so long is absolutely insane to me. I know ops is hard, but this level of consistent outage points to an endemic problem. Is it the legacy rails/mysql stack that is the largest culprit or is there systemic rot in the engineering org?


More likely, it's efforts to migrate away from the previously solid Rails stack to MS's preferred stack.

They've had a long history of this kind of stability issue when migrating or trying to migrate acquisitions from their previous stack to an MS one. This happened with Hotmail (Unix server -> Windows server), LinkedIn (custom cloud -> MS cloud) and others since.


Is Github moving to .Net and/or SQLServer, or is it "just" moving everything to Azure?


Most of GitHub runs in its own data centers. Services like actions runners and Codespaces use Azure.


The latter.


The LinkedIn to Azure migration was indefinitely postponed.


Never heard of linkedin problems before.


No need to speculate, Github posts fairly detailed information on availability and outage causes https://github.blog/tag/github-availability-report/


> so unstable for so long

Has it?

I’ve had hardly any problems. Occasional issues, but rarely have I been impacted to the extent I notice for more than say an hour…. maybe I notice it a couple times a year.

My internet access at home is more likely the issue when I hit GitHub issues.


I've experienced a ton of issues, but it's likely because almost every aspect of our operation depends in some way on GitHub -- the repo itself and basic push/pull, or PRs, or webhooks, or actions, or some aspect of status updates, or random API tasks, etc. We use a lot of GitHub.


Yeah I can see how the more tied you are the more exposure you would have, especially for very frequent / long running tasks etc.

My exposure is increasing, but it's still intermittent, not stuff running all the time, and if it doesn't work ... oh well, it will run later when it does. So the impact is lower than for some.


Why do people think it's related to Rails when there are tons of companies out there using Rails that aren't affected by this degradation?


>Why do people think it's related to Rails

Probably because there have been more high-profile stories of companies migrating off of Ruby on Rails to something else (e.g. Java, Go, etc) rather than vice-versa of migrating into it.

E.g. the high-profile story of Twitter's previous "whale fail" scaling problems supposedly being partially solved by switching from Ruby to Java/Scala/JVM : https://www.google.com/search?q=twitter+whale+fail+ruby+rail...

Ruby may be unfairly blamed but nevertheless, the narrative is already out there even though other big sites like Shopify, etc still use it.


One difference is that Rails and MySQL at GitHub scale is rare, even taking into account that GitHub scale itself is rare.


you mean that most of the other popular huge Rails companies (GitLab, Shopify) use PostgreSQL? Basecamp uses MySQL tho


Gitlab is, AFAIK, nowhere near the scale of Github. Shopify IDK, but I'm fairly certain their type of usage is very different. Basecamp is also of another scale, and certainly a very different usage/performance type.


Shopify primarily uses MySQL, unless something major has changed recently. They've done a number of conference talks and engineering blog posts about their usage of MySQL, see e.g. https://www.google.com/search?q=mysql+site%3Ashopify.enginee...


Sometimes it's not Ops. Sometimes it's crap code. ;)

Source: <-- OPs


Clients I worked with: Our service crashed, why?

Because you designed and implemented it poorly, that's why. Alternatively: How should I know, you wrote it.

If you're ever bored as a developer, switch to operations, you get to be the person developers turn to when they can't code, debug, do logging or security.


I've never really been interested in straight development, though I do enjoy the occasional coding session. I'd say at this point, I'm ops through and through, and I feel that the value I add is understanding the systems and what they're doing at a fairly low level. As such, I do sometimes have to help developers out with things I consider surprisingly basic (not usually code exactly), but that's the nature of teamwork I suppose; I'm happy to have a place in things and don't shy away from epithets like "yaml wrangler" or "helpdesk for devs".


Sadly devops will be one of the first to go as AI progresses.


While on the surface I'd agree with you, in reality I think operations people are going to be around longer than developers at this rate.

It's fairly "easy" and relatively safe to let an AI loose on your Java code base and use it to add new features or find bugs. Very few people would let a similar AI roam around production servers and databases.

If you collect enough logs, exceptions/crash dumps, network traffic and so on, you could feed that to the AI and have it tell you why a service crashed. The majority of my job as an operations person is to figure out why something crashed with only a subset of that information and being able to read the code and reason about why current circumstances resulted in the crash or data corruption. Sometimes the job is even to implement the stuff the developers didn't, while not actually touching the code and relying on what the operating system, database, web server or network tells you.

If developers were better, or had more time and more resources, then yes, an AI could do the job faster and better. In the current environment, operations is pretty safe.


I sure hope so. But realistically AI will thin the devops herd much more rapidly than "proper" development teams. I already use an LLM to crank out my configs, shell scripts, analyse my logs etc. LLMs are a lot better at these things than full fledged development. I think I am now able to do in days what it would have taken me weeks. My employer does not need to hire as many devops people as it would have pre LLMs.

As you can see you don't need to give AI write access to your production environment.


AI can't swap a drive. AI cannot clear a printer jam. AI cannot replace a spinning rust drive in a laptop with an SSD.

Trust me, there will always be OPs people.

Source: 30+ years in Ops/SysAdmin


DevOps isn't really ops though, right? As in, not product ops. It's ops but for devs, so it's rare for them to have to handle production servers. At least, I hope so. Those would be SRE or even sysadmins, right? I'm not up to date on the usage of the term though haha.


AI is just more software to manage and operate.


Every time there's a GitHub outage of any severity one of the top comments on HN is inevitably suggesting that it's probably due to Rails. It's getting pretty tiresome.

Calling it a "legacy rails" stack is incredibly disingenuous as well. It's not like they're running a 5 year old unsupported version of Rails/MySQL. GitHub runs from the Rails main branch - the latest stable version they possibly can - and they update several times per month.[^1] They're one of the largest known Rails code bases and contributors to the framework. Outside of maybe 37 Signals and Shopify they employ more experts in the framework and Ruby itself than any other company.

It's far more likely the issue is elsewhere in their stack. Despite running a rails monolith, GitHub is still a complex distributed system with many moving parts.

I feel like it's usually configuration changes and infra/platform issues, not code changes, that cause most outages these days. We're all A/B testing, canary deployments, and using feature flags to test actual code changes...

[^1]: https://github.blog/2023-04-06-building-github-with-ruby-and...


It's easier to blame <piece of technology> than to admit running a service at Github's scale is highly complex and takes deep expertise.


It's also that many Rails shops have performance problems, which isn't the same as saying "Rails is slow"! "Getting performance problems at some point" is almost a rite of passage in Rails; I'm certain every Rails developer has pored over N+1 queries, caching, async jobs, race conditions, gems and whatnot to keep the system running.

The only Rails projects that I worked on that never had performance problems are the ones that never reached any scale. All Rails projects that gained traction that I worked on, needed serious refactorings, partial rewrites, tuning and tweaking to keep 'em running. If only to tame the server bills, but most of the time just to keep the servers up. Good news is that it's very doable to tune, tweak and optimize a Rails stack. But the bad news is that every "premature optimization is the root of all evil" project made a lot of choices back in the day that make this necessary optimization hard or even impossible today.

What I'm trying to say is: Performance issues with Rails will sound very familiar to anyone who worked seriously with Rails. So it's not so strange that people reach for this conclusion if almost everyone in the community has some first-hand experience with this conclusion.


> The only Rails projects that I worked on that never had performance problems are the ones that never reached any scale. All Rails projects that gained traction that I worked on, needed serious refactorings, partial rewrites, tuning and tweaking to keep 'em running.

You'll be hard pressed to find any stack that doesn't require this.


Obviously.

A big problem with Rails, though, is how easy it makes it to "do the bad thing" (and in rare cases, how hard it makes it to do the "good" thing). A has_many/belongs_to that crosses bounded domains (adds tight coupling) is a mere one-liner: only discipline and experience prevent that. A quick call to the database from within a view is something that not even linters catch; it takes vigilance from reviewers to catch that. Reliance on some "external" (i.e. set by a module, concern, lib, hook or other method) instance var in a controller can be caught by a tightly set linter, but that, too, is tough.

Code that introduces poor joins, filters or sorts on unindexed columns, N+1 queries and more is often a simple, clean-looking setup.

`Organization.top(10).map(&:spending).sum` looks lean and neat, but hides all sorts of gnarly details in ~three~ four different layers of abstraction: Ruby-language because "spending" might be an attribute or a method, you won't know, Rails, because it overloads stuff like "sort", "sum" and whatnot to sometimes operate on data (and then first actually load ALL that data) and sometimes on the query/in-database. It might even be a database-column, but you won't know without looking at the database-model. And finally the app for how a scope like top(10) is really implemented. For all we know, it might even make 10 HTTP calls.

Rails (and ruby) lack quite some common tools and safety nets that other frameworks do have. And yes, that's a trade-off, because many of these safety nets (like strong and static typing) come at a cost to certain use-cases, people or situations.

Edit: I realize there are four layers of abstraction if the all-important and dictating database is taken into account, which in Rails it always is.
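
To make the "looks lean, hides the work" point concrete, here is a minimal sketch in Python with the standard-library sqlite3 module rather than Rails. The schema (`organizations`, `purchases`, a `ranking` column) and the `CountingConnection` wrapper are invented for illustration and are not from any real app. Both functions return the same total; one sends a query per organization, the other lets the database do the sum.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database. The schema is hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT, ranking INTEGER);
        CREATE TABLE purchases (
            id INTEGER PRIMARY KEY,
            organization_id INTEGER,
            amount INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO organizations (id, name, ranking) VALUES (?, ?, ?)",
        [(i, f"org{i}", i) for i in range(1, 21)],
    )
    # Three purchases per organization, each worth 10 * organization id.
    conn.executemany(
        "INSERT INTO purchases (organization_id, amount) VALUES (?, ?)",
        [(org, 10 * org) for org in range(1, 21) for _ in range(3)],
    )
    return conn


class CountingConnection:
    """Wraps a connection and counts the statements sent through it."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def execute(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params)


def spending_n_plus_one(db, limit):
    """One query for the top organizations, then one more per organization."""
    top = db.execute(
        "SELECT id FROM organizations ORDER BY ranking LIMIT ?", (limit,)
    ).fetchall()
    total = 0
    for (org_id,) in top:
        rows = db.execute(
            "SELECT amount FROM purchases WHERE organization_id = ?", (org_id,)
        ).fetchall()
        total += sum(amount for (amount,) in rows)  # summed in application memory
    return total


def spending_in_database(db, limit):
    """Same result, but the join and the sum happen inside the database."""
    row = db.execute(
        """
        SELECT COALESCE(SUM(p.amount), 0)
        FROM purchases AS p
        JOIN (SELECT id FROM organizations ORDER BY ranking LIMIT ?) AS top_orgs
          ON top_orgs.id = p.organization_id
        """,
        (limit,),
    ).fetchone()
    return row[0]
```

Wrapping the connection in `CountingConnection` and calling each function with the same limit shows the difference in query count, which is the part a one-line chained call hides.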


I wasn't really blaming Rails per se; if anything, their main database mysql1 seems to pop up in their post mortems more than anything else


> Is it the legacy rails/mysql stack that is the largest culprit or is there systemic rot in the engineering org?

The culprit is change. Infra changes, config changes, new features, system state (os updates, building new images, rebooting, etc...), even fixing existing bugs all are larger changes to the system than most think. It's really remarkable at this point that Github is as stable as it is. It is a testament to the Github team they have been as stable as they are. It's not "rot" it's just a huge system.


I don't think you understand ops :), there's no 100% availability anywhere, so issues and degradation will always happen no matter what. https://sre.google/sre-book/service-level-objectives/

It's not rails nor MySQL, both proven good for years.
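
For a sense of what "no 100% availability" means in numbers, a rough sketch of the usual downtime-budget arithmetic. The function name and the 30-day window are my own choices for illustration, not taken from the linked SRE chapter.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime a target allows over a window of `days` days."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)
```

With these assumptions a 99.9% target leaves roughly 43 minutes per 30 days, and 99.99% leaves a bit over 4 minutes.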


please permit me to indulge the most extreme example of what you just said.

"What do you mean the database is down after I loaded 500 TiB and indexed all columns? It's MySQL, Facebook has used MySQL at high scale for years without incident!"


Getting intermittent 500s browsing repositories right now.

Hugs to the GitHub ops team.


Also actions won't be able to checkout.


We have a slack channel that monitors GitHub availability. There’s content there nearly 3-4 times a week. It’s amazing how awful this has become.


In the old days we had mirrors for many online repositories.


We've lost the technology for decentralised Internet services in the early 2000s.

We just hope SMTP keeps ticking along somehow or we're fcuked.


Teams, Slack, Discord, WhatsApp, iMessage, etc. are trying hard to replace that too, though.


Yep. Now I'm glad I vendor-mirror all my dependencies so I can keep doing tests and all that.


I wonder if projects should proactively mirror to another git site (github as main and gitlab as mirror, for example). Collaboration on the project may stop when the main is down, but consumers could proceed using the mirror. I'm not sure how well various tooling supports fallback origins. This would reduce (but not eliminate) the need for users to vendor their dependencies.
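
A minimal sketch of what "fallback origins" could look like on the consumer side, assuming the project publishes an ordered list of mirror URLs. The helper names are hypothetical; the only real tool used is `git ls-remote`, which asks a remote for its refs without cloning.

```python
import subprocess
from typing import Callable, Iterable, Optional


def git_remote_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if `git ls-remote` can list the remote's branches."""
    try:
        result = subprocess.run(
            ["git", "ls-remote", "--heads", url],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def first_reachable(
    urls: Iterable[str],
    probe: Callable[[str], bool] = git_remote_reachable,
) -> Optional[str]:
    """Return the first URL that the probe accepts, or None if all fail."""
    for url in urls:
        if probe(url):
            return url
    return None
```

A consumer would pass its mirror list in preference order and clone or fetch from whatever comes back. The probe is injectable so the selection logic can be exercised without network access.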


Everyone with a git clone has a mirror....


But how do I find someone with a clone of the repo I'm looking for?


How would you have done it previously?


With github, which is now down, which loops us back to the original question. Your point?


And just as we're about to migrate 4 kubernetes clusters with a total of ~4k pods. Terraform in github actions on selfhosted runners and argoCD is failing.


Oh that sucks, there's always going to be those who will say that it's the price you pay for using Github, but locally hosted VCS and CI/CD systems have issues as well.

External dependencies are always a problem, but do you have the capacity and resources required to manage those dependencies internally? Most don't and will still get a better product/service by using an external service.


Rate of outages on github last few years has been orders of magnitude higher than anything I've encountered on a locally hosted VCS.

Local also means you can orchestrate maintenance windows to avoid outages at critical phases.


> Most don't

Define "most". There is a surprisingly high number of small/mid-sized companies which have dedicated people for this kind of thing.


Perhaps "many" would have been better. If you only count companies that view themselves as IT companies then the number of companies with a self-hosted/managed solution grows, but if you include everything then I'd guess that more than 50% don't run these services internally. If you count every small company with one or two developers who double as the IT staff then the numbers add up pretty quickly.


That's where I feel like it's actually pretty nice to not have CI tied to your source code. It's probably more expensive to use Travis/Circle but at least you don't have a single point of failure for deploys.


Doesn't this give you 2x SPOF? Or can I use my local copy of the source to kick off Travis/Circle?


Ideally you don't ever rely on CI specific automation tooling to actually accomplish anything and instead just use it as a dumb "not my dev machine" to execute workflows.

You should always engineer things so you can fall back to something akin to:

./scripts/deploy_the_things

Ideally backed by a real build system and task engine ala Bazel, Gradle, whatever else floats your boat.

It also means you are free to move between different runners/CI providers and just have to work out how to re-plumb secrets/whatever.

GH actions/friends really provide minimal value, the fact they have convinced everyone that encoding task graphs into franken-YAML-bash is somehow good is one of the more egregious lies sold to the development community at large.
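
As a toy illustration of keeping the task graph in ordinary code instead of CI YAML, here is a sketch using Python's standard-library `graphlib` (3.9+). The task names and the `run_tasks` helper are made up for the example; a real setup would more likely lean on Bazel, Gradle or similar, as said above.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_tasks(
    tasks: Dict[str, Callable[[], None]],
    deps: Dict[str, Set[str]],
) -> List[str]:
    """Run each task after its prerequisites and return the order used.

    `deps` maps a task name to the names that must run before it.
    """
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        tasks[name]()
    return order
```

The CI config then shrinks to "check out the repo and run the entry point", and the same entry point works from a laptop when the hosted runner is down.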


Well, git ops might not be impacted on Github, usual Github outages tend to be through Actions/the site, not the actual git operations. Doesn't seem like you can use a local copy but you can use Gitbucket/Gitlab.


Happy I'm not in a hurry with any specific work items today. Hope it's not too much of a mess to figure out for the github peeps. Much love to them.


Time to sword fight outside the offi... oh we all work remotely now.


Time to do the laundry and dishes then, I guess... Sometimes WFH has its boring sides.


well, at least that's one good reason to buy into zuck's metaverse


GitHub having issues?

Sounds like all in good order then ...


Wouldn't it be wonderful if the most popular version control system was decentralized? This is achievable, and is the correct solution.

This way your git repo could be located on:
- GitHub
- Your Closet (...)
- UCLA's supercomputer
- JBOD in Max Planck Institute (...)
- GitLab

Doing this with a simple file with "[ipfs, github, gitlab]" on it would be revolutionary, especially for data version control, like nn weights or databases that are too large for git and cost too much on other services, as they would be free on ipfs/torrent.

Then no one is fazed by the inevitable failure of various companies.


Git is already decentralized....


I think what GP was getting at is that the topology is still point-to-point, which tends to lead to a hub-and-spoke system. Hell, it’s even in the name: GitHUB. And fact is, this works out for most people, but it leads to some undesirable failure modes. Maybe we need, say, a “git-discover-remote” command…


So is cryptocurrency, but people still lose their money when exchanges get hacked. :D It's a cry for help. "Make decentralization as convenient as centralization!"


Not sure I understand the 'cry for help' comment. This has absolutely nothing to do with currency here. Associating decentralization with currency is idiotic and completely misses the point.

Well, actually on second thought it does: If ipfs or torrent can be used as a data version control backend, people no longer have to pay $600 to get the very popular and basic dataset of arxiv.org from AWS - under torrent or IPFS it's free. So basically decentralization here means running away from money hungry scams like AWS.


But not GitHub... :-)


Can't tell if this comment is sarcastic, but that's exactly what git is: Every clone of the repo is independent, and acts as a full backup. Likewise, a local repo can be pushed to various remotes, there is no inherent strong server-client coupling (even though it's often used in such a way).


Yes but not having to specify every single user who pulls the git repo from GitHub by IP address as an additional remote is a huge win no?

If "ipfs" can be added as a remote, and it automatically pulls from thousands of different devices without having to specify them, that's a pretty big win for redundancy right?


Not sure what's the big benefit here, backups? Otherwise, the more devs a repo has, the more redundancy exists anyway, if that's your main concern.


It becomes far more useful for making very large files generated from code (e.g. NN weights) easily downloadable without having to pay huge sums to storage providers ($600 to AWS for arxiv.org for example) - with IPFS it's completely FREE, because everyone with the file is using it and hosting it.

Additionally, capabilities can be added in a Makefile form to use IPFS as the cache to ensure that a script that takes 3 months on an HPC/supercomputer to make a 3MB file only has to grab the file in a few seconds, even if not yet computed locally (e.g. you just computed it at work and do `git pull && ipfsmake`).


Ah yes, because version control is only about file storage.


GitHub is still having regular incidents in 2024, with something always going down, even on Microsoft's infrastructure and 5 years after the acquisition.

Just expect GitHub to go down at least once every month as it is that unreliable.

This certainly has aged well: [0]

[0] https://news.ycombinator.com/item?id=22868406


The price teams will pay to offload their ops.

People really like avoiding ops


It really feels like the frequency of this happening has increased lately. Are they facing high employee turnover?

_Maybe it’s time for rewriting it in Rust._

Edit: RIIR was said in jest. I forgot HN doesn’t support markdown.


Full stack re-writes are not always the best way. Sometimes you end up with worse. Sometimes you end up with better. If you do go the 'full stack rewrite' route, you'd better have a decent plan in place. Because you are about to get to support 2 code bases for a while.

edit: fair enough


I'm pretty sure (hope?) the comment was said _in jest_ (though I'm not familiar with _this internet standard_)


Yeah. I thought markdown would work here.


Italics work using asterisks.


spongecase works wonderfully for sarcasm: mAYBe IT’S TimE fOr rewrItING it iN rUsT



Whoops guess that was one too many PRs.

Sorry everyone!


Phew, I thought it was because I am in Puerto Rico. Has GitHub or Microsoft done any big layoffs?


Somebody is having a bad Tuesday.


copilot is purring like a cat, best wishes to their infra team!


It is back


Run for your liveeeees!



