
Can you say more about how it's a problem if people can view things without logging in? Naively I would have seen that as a plus.



If you allow new users to create user profiles with links, and those user profiles are visible to Google, spammers will create a bunch of new user accounts and fill them with spam links.

The easiest way to prevent this is to block Google from seeing user profiles by requiring login to see the profiles.


> "and those user profiles are visible to Google"

Googlebot adheres to robots.txt, right?

In which case, couldn't self-hosted GitLab admins add a robots.txt entry for the profile page URLs?
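
e.g. something along these lines, with /users/ purely as a placeholder path (where a given instance actually serves profiles may differ):

  User-agent: *
  Disallow: /users/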


That requires the spammers to notice that it's blocked in robots.txt and stop bothering, which seems optimistic.


There is also the option of adding rel="nofollow ugc" to user-submitted links, which removes the benefit of linking for spammers.

https://support.google.com/webmasters/answer/96569
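
For reference, a tagged link in the rendered HTML would look something like this (the href is just an example):

  <a href="https://example.com/promo" rel="nofollow ugc">user-submitted link</a>

The ugc value marks the link as user-generated content and nofollow asks search engines not to pass ranking credit through it.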


Most bots aren't going to bother checking whether you've done that, so while they won't get the expected benefit, you'll still get the spam.


Because spammers fill the publicly viewable things-- like snippets and user profiles-- with spam. If it can be viewed without logging in, then they get it indexed with Google and it dilutes the search results.

Nobody views snippets and user profiles as a common part of daily development, so investigating and pruning them takes time away from development. And if you don't prune the spam fast enough, it gets into the search results, at which point it's even more of a pain in the ass to remove (even using Google's webmaster tools).


> Because spammers fill the publicly viewable things-- like snippets and user profiles-- with spam. If it can be viewed without logging in, then they get it indexed with Google and it dilutes the search results.

Isn't the problem then not that it's viewable, but that it's not excluded from indexing by robots.txt?


Anything visible without login will be visible to people who want to follow a bare URL to your tracker and visible to a search engine crawler, making it visible to people without a URL who just search for the issue. That is, indeed, a plus.

But even if you require login to post stuff to the issue tracker, creating a login and posting a comment has been trivially automated.

You're no longer running a useful issue tracker; you're running a free ad network: you're hosting a dozen useful issues and a thousand advertisements and blackhat SEO comments for spammers.

If it's not visible to search engines, and your repo doesn't get much traffic, it's not nearly as valuable to spammers. It's kind of like cutting off your nose to spite your face, but those basic economics of cost to spammers, cost to users, value to spammers, and value to users are the only rules that you can really apply when hosting content on the Internet.


> creating a login and posting a comment has been trivially automated

Isn't this what various CAPTCHA tools handle?

You can also require email address validation.
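
For the CAPTCHA part, the server-side check for something like reCAPTCHA is a single HTTP call at signup time. A rough Python sketch (the field names follow Google's siteverify API; the secret and the surrounding signup flow are placeholders):

  import requests

  RECAPTCHA_SECRET = "your-secret-key"  # placeholder: issued in the reCAPTCHA admin console

  def captcha_passed(client_token: str) -> bool:
      # The signup form posts back the token the CAPTCHA widget generated;
      # forward it to Google and only create the account if it checks out.
      resp = requests.post(
          "https://www.google.com/recaptcha/api/siteverify",
          data={"secret": RECAPTCHA_SECRET, "response": client_token},
          timeout=5,
      )
      return resp.json().get("success", False)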


1. Create user account

2. Create spam content (snippets, profile, etc.)

3. Get the spam content indexed by search engines

4. Profit


A free user sign-up is not going to prevent scraping...


I think the idea is that if you can't view issues without logging in, then Google won't index your issues (because it can't view them), so you won't get people spamming in order to get into Google.


Over the years I've frequently seen Google search results showing things that require login to the indexed site. Has that changed?


I believe that requires the site to give Google's bot a logged-in view, rather than being something Google does itself.


It also means 95% of people will stop casually viewing issues.


The problem isn't scraping, it's spam.



