
Can you say more about how it's a problem if people can view things without logging in? Naively I would have seen that as a plus.



If you allow new users to create user profiles with links, and those user profiles are visible to Google, spammers will create a bunch of new user accounts and fill them with spam links.

The easiest way to prevent this is to block Google from seeing user profiles by requiring login to see the profiles.


> "and those user profiles are visible to Google"

Googlebot adheres to robots.txt, right?

In which case, couldn't self-hosted GitLab admins add a robots.txt entry for the profile page URLs?
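
e.g. something along these lines, with /users/ purely as a placeholder path (where a given instance actually serves profiles may differ):

  User-agent: *
  Disallow: /users/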


That requires the spammers to notice that it's blocked in robots.txt and stop bothering, which seems optimistic.


There is also the option of adding rel="nofollow ugc" to user-submitted links, which removes the benefit of linking for spammers.

https://support.google.com/webmasters/answer/96569
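
For reference, a tagged link in the rendered HTML would look something like this (the href is just an example):

  <a href="https://example.com/promo" rel="nofollow ugc">user-submitted link</a>

The ugc value marks the link as user-generated content and nofollow asks search engines not to pass ranking credit through it.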


Most bots aren't going to bother checking whether you've done that, so while they won't get the expected benefit, you'll still get the spam.


Because spammers fill the publicly viewable things-- like snippets and user profiles-- with spam. If it can be viewed without logging in, then they get it indexed with Google and it dilutes the search results.

Nobody views snippets and user profiles as a common part of daily development, so investigating and pruning them takes time away from development. And if you don't prune the spam fast enough, it gets into the search results, at which point it's even more of a pain in the ass to remove (even using Google's webmaster tools).


> Because spammers fill the publicly viewable things-- like snippets and user profiles-- with spam. If it can be viewed without logging in, then they get it indexed with Google and it dilutes the search results.

Isn't the problem then not that it's viewable, but that it's not excluded from indexing by robots.txt?


Anything visible without login will be visible to people who want to follow a bare URL to your tracker and visible to a search engine crawler, making it visible to people without a URL who just search for the issue. That is, indeed, a plus.

But even if you require login to post stuff to the issue tracker, creating a login and posting a comment has been trivially automated.

You're no longer running a useful issue tracker; you're running a free ad network: you're hosting a dozen useful issues and a thousand advertisements and blackhat SEO comments for spammers.

If it's not visible to search engines, and your repo doesn't get much traffic, it's not nearly as valuable to spammers. It's kind of like cutting off your nose to spite your face, but those basic economics of cost to spammers, cost to users, value to spammers, and value to users are the only rules that you can really apply when hosting content on the Internet.


> creating a login and posting a comment has been trivially automated

Isn't this what various CAPTCHA tools handle?

You can also require email address validation.
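
For the CAPTCHA part, the server-side check for something like reCAPTCHA is a single HTTP call at signup time. A rough Python sketch (the field names follow Google's siteverify API; the secret and the surrounding signup flow are placeholders):

  import requests

  RECAPTCHA_SECRET = "your-secret-key"  # placeholder: issued in the reCAPTCHA admin console

  def captcha_passed(client_token: str) -> bool:
      # The signup form posts back the token the CAPTCHA widget generated;
      # forward it to Google and only create the account if it checks out.
      resp = requests.post(
          "https://www.google.com/recaptcha/api/siteverify",
          data={"secret": RECAPTCHA_SECRET, "response": client_token},
          timeout=5,
      )
      return resp.json().get("success", False)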


1. Create user account

2. Create spam content (snippets, profile, etc.)

3. Get the spam content indexed by search engines

4. Profit


A free user sign-up is not going to prevent scraping...


I think the idea is that if you can't view issues without logging in, then Google won't index your issues (because it can't view them), so you won't get people spamming in order to get into Google.


Over the years I've frequently seen Google search results showing things that require login to the indexed site. Has that changed?


I believe that requires the site to give Google's bot a logged-in view, rather than being something Google does itself.


It also means 95% of people will stop casually viewing issues.


The problem isn't scraping, it's spam.



