IMO there was something of a de facto contract, pre-LLMs, that the set of things...

wredcoll · 2025-11-16T06:52:09 1763275929

For all its sins, Google had a vested interest in the sites it linked to staying alive. LLMs don't.

eric-burel · 2025-11-16T08:22:00 1763281320

That's a shortcut. LLM providers are very short-sighted, but not to that extreme: live websites are needed to produce new data for future training runs. Edit: damn, I've seen this movie before.

stephenitis · 2025-11-16T05:42:13 1763271733

Text, images, video, all of it. I can’t think of any form of data they don’t want to scoop up, other than noise and poisoned data.

cwbriscoe · 2025-11-16T06:40:20 1763275220

I am not well versed in this problem, but can't web servers rate limit by the known IP addresses of these crawlers/scrapers?
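For concreteness, here is a minimal Python sketch of what rate limiting by known ranges could look like, assuming you already have a list of crawler networks. The ranges below are reserved documentation addresses, not real crawler IPs, and `KNOWN_CRAWLER_NETS`, `TokenBucket` and `allow_request` are made-up names for illustration.

```python
import ipaddress
import time

# Hypothetical crawler ranges (reserved documentation addresses, not real crawlers).
KNOWN_CRAWLER_NETS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


class TokenBucket:
    """Allow `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


buckets = {}  # one shared bucket per known network


def allow_request(client_ip, now=None):
    """Return True if the request should be served."""
    addr = ipaddress.ip_address(client_ip)
    for net in KNOWN_CRAWLER_NETS:
        if addr in net:
            if net not in buckets:
                buckets[net] = TokenBucket(rate=1.0, capacity=5, now=now)
            return buckets[net].allow(now)
    return True  # not on the list: no throttling
```

This only works as long as the list of networks stays accurate, which is the hard part of the approach.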

Yoric · 2025-11-16T08:20:15 1763281215

Not the exact same problem, but a few months ago, I tried to block YouTube traffic from my home (I was writing a parental app for my child) by IP. After a few hours of trying to collect IPs, I gave up, realizing that YouTube was dynamically load-balanced across millions of IPs, some of which also served traffic from other Google services I didn't want to block.

I wouldn't be surprised if it was the same with LLMs. Millions of workers allocated dynamically on AWS, with varying IPs.

In my specific case, as I was dealing with browser-initiated traffic, I wrote a Firefox add-on instead. No such shortcut for web servers, though.

m3047 · 2025-11-16T18:31:07 1763317867

Yoric, dropping some knowledge vis a vis the downstream regarding DNS:

* https://www.dnsrpz.info/

* https://github.com/m3047/rear_view_rpz

Yoric · 2025-11-17T16:41:40 1763397700

Thanks!

bonsai_spool · 2025-11-16T08:28:39 1763281719

Why not run local DNS at your router and do the block there? It can even be per-client with AdGuard Home.

Yoric · 2025-11-16T08:49:59 1763282999

I did that, but my router doesn't offer a documented API (or even SSH access) that I can use to reprogram DNS blocks dynamically. I wanted to stop YouTube only during homework hours, so enabling/disabling it a few times per day quickly became tiresome.

extra88 · 2025-11-16T13:02:32 1763298152

Your router almost certainly lets you assign a DNS server instead of using whatever your ISP sends down, so you can set it to an internal device running your own DNS.

Your DNS server mostly passes lookup requests through, but during homework time, when there's a request for the IP of "www.youtube.com", it returns the IP of your choice instead of the actual one. The domain's TTL is 5 minutes.
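A rough Python sketch of just the decision step, not a working DNS server. The homework window, the sinkhole address and all names here are assumptions for illustration; the 300-second TTL on the override simply mirrors the 5 minutes mentioned above.

```python
from datetime import datetime, time

# All values below are illustrative assumptions.
BLOCKED_NAMES = {"www.youtube.com", "youtube.com"}
HOMEWORK_START = time(16, 0)
HOMEWORK_END = time(18, 0)
SINKHOLE_IP = "192.168.1.2"  # hypothetical internal address to answer with
OVERRIDE_TTL = 300           # seconds


def resolve_override(qname, now):
    """Return (ip, ttl) to answer with, or None to forward the query upstream."""
    name = qname.rstrip(".").lower()  # normalise a trailing dot and case
    in_homework = HOMEWORK_START <= now.time() < HOMEWORK_END
    if in_homework and name in BLOCKED_NAMES:
        return (SINKHOLE_IP, OVERRIDE_TTL)
    return None
```

A real resolver would call something like this for each query and fall through to normal forwarding whenever it returns `None`.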

Or don't, technical solutions to social problems are of limited value.

Yoric · 2025-11-16T14:22:07 1763302927

Any solution based on this sounds monstrously more complicated than my browser add-on.

And technical bandaids to hyperactivity, however imperfect, are damn useful.

extra88 · 2025-11-16T14:41:01 1763304061

A browser add-on wouldn't do the job. The use case was a parent controlling a child's behavior, not someone controlling their own.

Yoric · 2025-11-16T14:48:55 1763304535

Yes, my kid has ADHD. The browser add-on does the job at slowing down the impulse of going to YouTube (and a few online gaming sites) during homework hours.

I've deployed the same one for me, but set up for Reddit during work hours.

Both of us know how to get around the add-on. It's not particularly hard. But since Firefox is the primary browser for both of us, it does the trick.

FrinkleFrankle · 2025-11-16T17:20:29 1763313629

For those who don't want to build their own add-on, Cold Turkey Blocker works quite well. It supports multiple browsers and can block apps too.

I'm not affiliated with them, but it has helped me when I really need to focus.

https://getcoldturkey.com/

renewiltord · 2025-11-16T17:13:56 1763313236

I think dnsmasq plus a cron on a server of your choice will do this pretty easily. With an LLM you could set this up in less than 15 minutes if you already have a server somewhere (even one in the home).
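Roughly what I have in mind, as a Python sketch that cron could call at the start and end of homework hours. The domain list, the sinkhole address and the function names are assumptions; the only dnsmasq feature relied on is the `address=/domain/ip` directive, which also covers subdomains.

```python
from pathlib import Path

# Illustrative assumptions: the domain list and the sinkhole address.
BLOCKED_DOMAINS = ["youtube.com", "youtu.be"]
SINKHOLE_IP = "0.0.0.0"


def render_block_conf(domains, sinkhole=SINKHOLE_IP):
    """Build dnsmasq `address=/domain/ip` lines, one per blocked domain."""
    return "".join(f"address=/{d}/{sinkhole}\n" for d in domains)


def set_blocking(conf_path, enabled):
    """Write the block rules when enabled, or an empty file when disabled."""
    content = render_block_conf(BLOCKED_DOMAINS) if enabled else ""
    Path(conf_path).write_text(content)
```

The file would live somewhere dnsmasq picks up config from, and as far as I know dnsmasq needs a restart after the file changes, so the cron entry would have to do that too.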

Yoric · 2025-11-16T21:42:12 1763329332

Thanks for the tip.

In this case, I don't have a server I can conveniently use as DNS. Plus I wanted to also control the launching of some binaries, so that would considerably complicate the architecture.

Maybe next time :)

renewiltord · 2025-11-17T01:37:53 1763343473

Makes sense! Keeping your home tech simple is definitely a recipe for a happier life when you have kids haha

adobrawy · 2025-11-16T08:29:52 1763281792

They rely on residential proxies powered by botnets — often built by compromising IoT devices (see: https://krebsonsecurity.com/2025/10/aisuru-botnet-shifts-fro... ). In other words, many AI startups — along with the corporations and VC funds backing them — are indirectly financing criminal botnets.

strogonoff · 2025-11-16T07:09:56 1763276996

You cannot block LLM crawlers by IP address, because some of them use residential proxies. Source: 1) a friend admins a slightly popular site and has decent bot detection heuristics, 2) just Google “residential proxy LLM”, they are not exactly hiding. Strip-mining original intellectual property for commercial usage is big business.

skrebbel · 2025-11-16T07:31:27 1763278287

How does this work? Why would people let randos use their home internet connections? I googled it but the companies selling these services are not exactly forthcoming on how they obtained their "millions of residential IP addresses".

Are these botnets? Are AI companies mass-funding criminal malware companies?

fakwandi_priv · 2025-11-16T08:16:13 1763280973

It used to be Hola VPN, which would let you use someone else’s connection and, in the same way, let someone else use yours; this was communicated transparently. That same Hola client would also route business users. I’m sure many other free VPN clients do the same thing nowadays.

joha4270 · 2025-11-16T07:52:58 1763279578

I have seen it claimed that's a way of monetizing free phone apps. Just bundle a proxy and get paid for that.

cuu508 · 2025-11-16T08:11:38 1763280698

A recent HN thread about this: https://news.ycombinator.com/item?id=45746156

stackghost · 2025-11-16T07:43:37 1763279017

>Are these botnets? Are AI companies mass-funding criminal malware companies?

Without a doubt some of them are botnets. AI companies got their initial foothold by violating copyright en masse with pirated textbook dumps for training data, and whatnot. Why should they suddenly develop scruples now?

globalnode · 2025-11-16T07:50:37 1763279437

So the user either has a malware proxy running requests without noticing, or voluntarily signed up as a proxy to make extra $ off their home connection. Either way I don't care if their IP is blocked. The only problem is that if users behind CGNAT get their IP blocked, legitimate users sharing that address may later be blocked too.

Edit: ah yes, another person above mentioned VPNs; that's a good possibility. Another vector is that mobile users can sell their unused data to third parties. There are probably many more ways to acquire endpoints.

strogonoff · 2025-11-17T07:09:16 1763363356

“Known IP addresses” to me implies an infrequently changing list of large datacenter ranges. Maintaining a dynamic list of individual IPs (along with any metadata required for throttling purposes) is a different undertaking with a higher level of effort.
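To illustrate the extra bookkeeping, here is a minimal Python sketch of a per-IP table with a fixed counting window and eviction of stale entries. The class name, the limit and the window are assumptions, and a real deployment would also need persistence, IPv6 prefix handling and some answer for shared addresses.

```python
import time


class DynamicIPTable:
    """Per-IP request counters in a fixed window, with stale entries evicted."""

    def __init__(self, limit, window):
        self.limit = limit    # max requests per window (assumed policy)
        self.window = window  # window length in seconds
        self.entries = {}     # ip -> [window_start, count]

    def hit(self, ip, now=None):
        """Record a request; return True if it is still within the limit."""
        now = time.monotonic() if now is None else now
        entry = self.entries.get(ip)
        if entry is None or now - entry[0] >= self.window:
            entry = [now, 0]  # start a fresh window for this IP
            self.entries[ip] = entry
        entry[1] += 1
        return entry[1] <= self.limit

    def evict_stale(self, now=None):
        """Drop IPs whose window has expired so the table does not grow forever."""
        now = time.monotonic() if now is None else now
        stale = [ip for ip, (start, _) in self.entries.items()
                 if now - start >= self.window]
        for ip in stale:
            del self.entries[ip]
        return len(stale)
```

Compared with a static list of ranges, every entry here has a lifetime, and something has to call `evict_stale` regularly.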

Of course, if you don’t care about affecting genuine users then it is much simpler. One could say it’s collateral damage and show a message suggesting to boycott companies and/or business practices that prompted these measures.

ninja3925 · 2025-11-16T07:00:53 1763276453

Large cloud providers could offer that solution, but then crawlers can also cycle IPs.