I don’t think the publication date (May 8, as I type this) on the GitHub blog article is the same date this change became effective.
From a long-term, clean network I have been consistently seeing these “whoa there!” secondary rate limit errors for over a month when browsing more than 2-3 files in a repo.
My experience has been that once they’ve throttled your IP under this policy, you cannot even reach a login page to authenticate. The docs direct you to file a ticket (if you’re a paying customer, which I am) if you consistently get that error.
I was never able to file a ticket when this happened because their rate limiter also applies to one of the required backend services that the ticketing system calls from the browser. Clearly they don’t test that experience end to end.
Why would the changelog update not include this? It's the most salient piece of information.
I thought I was just misreading it and failing to see where they stated what the new rate limits were, since that's what anyone would care about when reading it.
I've hit this over the past week browsing the web UI. For some reason, GitHub sessions are set really short and you don't realise you're not logged in until you get the error message.
GH now uses the publisher business model, and as such, they lose money when you're logged out. Same reason Google, Facebook, etc. won't ask you for a password for decades.
60/hr is not the same as 1/min, unless you're trying to continually make as many requests as possible, like a crawler. And if that is your use case, then your traffic is probably exactly what they're trying to block.
Use case: crawling possibly related files based on string search hints in a repo you know nothing about...
Something on the order of 6 seconds a page doesn't sound TOO out of human viewing range depending on how quickly things load and how fast rejects are identified.
I could see ~10 pages / min which is 600 pages / hour. I could also see the argument that a human would get tired at that rate and something closer to 200-300 / hr is reasonable.
Does it seem to anyone like eventually the entire internet will be login only?
At this point knowledge seems to be gathered and replicated to great effect and sites that either want to monetize their content OR prevent bot traffic wasting resources seem to have one easy option.
Several people in the comments seem to be blaming GitHub, as if it took this step for no apparent reason.
Those of us who self-host git repos know that this is not true. Over at ardour.org, we've passed 1M unique IPs banned due to AI trawlers sucking our repository one commit at a time. It was killing our server before we put fail2ban to work.
I'm not arguing that the specific steps Github have taken are the right ones. They might be, they might not, but they do help to address the problem. Our choice for now has been based on noticing that the trawlers are always fetching commits, so we tweaked things such that the overall http-facing git repo works, but you cannot access commit-based URLs. If you want that, you need to use our github mirror :)
Only they haven't just started doing this now. For many years, GitHub has been crippling unauthenticated browsing, gradually, to gauge the response. When unauthenticated, code search doesn't work at all and issue search stops working after, like, 5 clicks at best.
This is egregious behavior because Microsoft hasn't been upfront about it while doing it. Many open source projects are probably unaware that their issue tracker has been walled off, creating headaches unbeknownst to them.
Have you noticed significant slowdown and CPU usage from fail2ban with that many banned IPs? I saw it becoming a huge resource hog with far fewer IPs than that.
Apparently, the vibe coding session didn't account for it. /s
I would more readily assume a large social networking company filled with bright minds would have worked out some kind of agreement on, say, a large corpus of copyrighted training data before using it.
It's the wild wild west right now. Data is king for AI training.
The big companies tend to respect robots.txt. The problem is other, unscrupulous actors use fake user agents and residential IPs and don't respect robots.txt or act reasonably.
> These changes will apply to operations like cloning repositories over HTTPS, anonymously interacting with our REST APIs, and downloading files from raw.githubusercontent.com.
Or randomly when clicking through a repository file tree. The first time I hit a rate limit was when I was skimming through a repository on my phone; around the 5th file I clicked, I was denied and locked out. Not for a few seconds either: it lasted long enough that I gave up on waiting and refreshing every ~10 seconds.
The truth is this won't actually stop AI crawlers and they'll just move to a large residential proxy pool to work around it. Not sure what the solution is honestly.
I don't recall ever seeing a CEO go to jail for practically anything. I'm sure there are examples, but at this point in my life I've derived a rule of thumb of "if you want to commit a crime, just disguise it as a legitimate business" from seeing so many CEOs get off scot-free.
What's being strip mined is the openness of the Internet, and AI isn't the one closing up shop. Github was created to collaborate on and share source code. The company in the best position to maximize access to free and open software is now just a dragon guarding other people's coins.
The future is a .txt file of John Carmack pointing out how efficient software used to be, locked behind a repeating WAF captcha, forever.
AI isn't the one closing up shop, it’s the one looting all the stores and taking everything that isn’t bolted down. The AI companies are bad actors that are exploiting the openness of the internet in a fashion that was obviously going to lead to this result - the purpose of these scrapers is to grab everything they can and repackage it into a commercial product which doesn’t return anything to the original source. Of course this was going to break the internet, and people have been warning about that from the first moment these jackasses started - what the hell else was the outcome of all this going to be?
I'm using Firefox and Brave on Linux from a residential internet provider in Europe, and the 429 error triggers consistently in both browsers. Not sure I would consider my setup questionable, considering their target audience.
Same way Reddit sells all its content to Google, then stops everyone else from getting it. Same way Stack Overflow sells all its content to Google, then stops everyone else from getting it.
(Joke's on Reddit, though, because Reddit content became pretty worthless since they did this, and everything before they did this was already publicly archived)
Wow, I'm realizing this applies to even browsing files in the web UI without being logged in, and the limits are quite low?
This rather significantly changes the place of github hosted code in the ecosystem.
I understand it is probably a response to the ill-behaved decentralized bot-nets doing mass scraping with cloaked user-agents (that everyone assumes is AI-related, but I think it's all just speculation and it's quite mysterious) -- which is affecting most of us.
The mystery bot net(s) are kind of destroying the open web, by the counter-measures being chosen.
It's even more hilarious because this time it's Microsoft/Github getting hit by it. (It's funny because MS themselves are such a bad actor when it comes to AIAIAI).
Also https://github.com/orgs/community/discussions/157887 "Persistent HTTP 429 Rate Limiting on *.githubusercontent.com Triggered by Accept-Language: zh-CN Header" but the comments show examples with no language headers.
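An easy way to check whether the header alone is what trips it (the raw file URL below is just an arbitrary public example, not one from the linked discussion):

```sh
# Compare status lines for the same raw file with and without the header.
curl -sI https://raw.githubusercontent.com/torvalds/linux/master/README | head -n 1
curl -sI -H 'Accept-Language: zh-CN' \
  https://raw.githubusercontent.com/torvalds/linux/master/README | head -n 1
```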
I encountered this too once, but thought it was a glitch. Worrying if they can't sort it.
Even with authenticated requests, viewing a pull request and adding `.diff` to the end of the URL is currently ratelimited at 1 request per minute. Incredibly low, IMO.
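If you just need the diff programmatically, the documented REST endpoint will return it when you ask for the diff media type, and authenticated REST calls fall under the separate (much higher) API limits. A sketch, with OWNER/REPO and the PR number as placeholders:

```sh
# Fetch a pull request as a raw diff via the REST API instead of the .diff URL.
curl -sS \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github.diff" \
  https://api.github.com/repos/OWNER/REPO/pulls/123
```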
It sucks that we've collectively surrendered the urls to our content to centralized services that can change their terms at any time without any control. Content can always be moved, but moving the entire audience associated with a url is much harder.
Gitea [1] is honestly awesome and lightweight. I've been running my own for years, and since they added Actions a while ago (with GitHub compatibility), it does everything I need it to. It doesn't have all the AI stuff in it (but for some that's a positive :P)
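For anyone wondering how light it actually is, a minimal single-container setup looks roughly like this (assuming Docker and the official gitea/gitea image; the ports and data path are just examples):

```sh
# Run Gitea with its data (repos, config, SQLite DB) kept on a host volume.
docker run -d --name gitea \
  -p 3000:3000 \
  -p 2222:22 \
  -v /srv/gitea:/data \
  gitea/gitea:latest
# Then finish the setup wizard at http://localhost:3000
```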
I'm stuck on the latest Gitea (1.22) that still supports migration to Forgejo and unsure where to go next. So I've been following both projects (somewhat lazily), and it seems to me that Gitea has the edge on feature development.
Forgejo promised — but is yet to deliver any — interesting features like federation; meanwhile the real features they've been shipping are cosmetic changes like being able to set pronouns in your profile (and then another 10 commits to improve that...)
If you judge by very superficial metrics like commit counts, Forgejo's count is heavily inflated by merges (which the Gitea development process doesn't use, preferring rebase) and frequent dependency upgrades. When you remove those, the remaining commits represent maybe half of Gitea's development activity.
So I expect to observe both for another year before deciding where to upgrade. They're too similar at the moment.
FWIW, one of Gitea's larger users — Blender — continues to use and sponsor Gitea and has no plans to switch, AFAIK.
That's an interesting perspective, and I can't strongly disprove it, but it doesn't match my impression. I cloned both repos (Gitea's from GitHub; Forgejo's from Codeberg, which runs on Forgejo) and ran a quick contributor tally over each to get an overview of things. That showed 153 people (including a small handful of bots) contributing to Gitea, and 232 people (and a couple of bots) contributing to Forgejo. There are some dupes in each list, showing separate accounts for "John Doe" and "johndoe", that kind of thing, but the numbers look small and similar to me, so I think they can be safely ignored.
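Concretely, a count along these lines gives that kind of number (a sketch; the exact flags I used may have differed):

```sh
# Count unique commit authors on the default branch (run inside each clone).
git shortlog -sn HEAD | wc -l
```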
And it looks to me like Forgejo uses a similar process of combining lots of smaller PR commits into a single commit when merging. The vast majority of its commits since last June or so seem to be 1-commit-per-PR. Adding `--since="2024-07-1"` to that command reduces the number of unique contributors to 136 for Gitea and 217 for Forgejo. It also shows 1228 commits for Gitea and 3039 for Forgejo, and I do think that's a legitimately apples-to-apples comparison.
If I instead count only commits whose subject line mentions a PR (like "Simplify review UI (#31062)" or "Remove `title` from email heads (#3810)"), then I'm seeing 1256 PR-like Gitea commits and 2181 Forgejo commits.
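Again as a sketch of the kind of filter I mean (not necessarily the exact one):

```sh
# Count commits since last July whose subject references a PR number like "(#31062)".
git log --oneline --since="2024-07-1" | grep -Ec '\(#[0-9]+\)'
```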
I'm not an expert in methodology here, but from my initial poking around, it would seem to me that Forgejo has a lot more activity and variety of contributors than Gitea does.
The development energy has not really moved; Gitea is moving much faster. Forgejo is stuck two versions behind, and with their license change they're struggling to keep up.
I’ve almost completed the move of my business from GitHub’s corporate offering to self-hosted Forgejo.
Almost went with Gitea, but the ownership structure is murky, feature development seems to have plateaued, and they haven’t even figured out how to host their own code. It’s still all on GitHub.
I’ve been impressed by Forgejo. It’s so much faster than GitHub at everyday operations, I can actually back up my entire corpus of data in a format that’s restorable/usable, and there aren’t useless (AI) upsells cluttering my UX.
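On the backup point, Forgejo keeps Gitea's `dump` subcommand (as far as I can tell), which bundles repositories, the database, and app.ini into one restorable archive. A sketch, assuming a typical install running as the `git` user with the config in /etc/forgejo:

```sh
# Produce a single archive containing repos, database, and configuration.
sudo -u git forgejo dump -c /etc/forgejo/app.ini
```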
For listeners at home wondering why you'd want that at all:
I want a centralized Git repo where I can sync config files from my various machines. I have a VPS so I just create a .git directory and start using SSH to push/pull against it. Everything works!
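(If you want to try the same thing, the whole setup is just a bare repo plus an SSH remote; the host and path here are made up:)

```sh
# On the VPS: create a bare repository to push to.
ssh me@my-vps.example 'git init --bare ~/dotfiles.git'

# On each machine: add it as a remote and sync over SSH.
git remote add origin me@my-vps.example:dotfiles.git
git push -u origin main
```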
But then, my buddy wants to see some of my config files. Hmm. I can create an SSH user for him and then set the permissions on that .git to give him read-only access. Fine. That works.
Until he improves some of them. Hey, can I give him a read-write repo he can push a branch to? Um, sure, give me a bit to think this through...
And one of his coworkers thinks this is fascinating and wants to look, too. Do I create an SSH account for this person I don't know well at all?
At this point, I've done more work than just installing something like Forgejo and letting my friend and his FOAF create accounts on it. There's a nice UI for configuring their permissions. They don't have SSH access directly into my server. It's all the convenience of something like GitHub, except entirely under my control and I don't have to pay for private repos.
Just tried it in Chrome incognito on iOS and did hit the 429 rate limit :S
That sucks. It was already bad enough when GitHub started enforcing login just to do a simple search.