Wikimedia is moving to Gitlab (mediawiki.org)
889 points by input_sh on Oct 28, 2020 | 390 comments



I think I have some relevant experience here.

We host all of our projects on github:

https://github.com/sqlalchemy/

yet we also use gerrit!

https://gerrit.sqlalchemy.org/

users send us pull requests, and they never have to deal with Gerrit at all. We use a custom integration, the source code to which is here: https://github.com/sqlalchemyorg/publishthing/tree/master/pu... and then we run a mostly bidirectional synchronization between Gerrit and Github pull requests (code changes can move freely from a Github PR -> gerrit, comments and code review comments are posted bidirectionally, and Gerrit status changes are synchronized into the PR). Example: https://github.com/sqlalchemy/sqlalchemy/pull/5662
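To give a feel for the plumbing, here is a simplified sketch, not the actual publishthing code; the "gerrit" remote name is an assumption, while pull/N/head and refs/for/<branch> are the standard GitHub and Gerrit refs:

    import subprocess

    GERRIT_REMOTE = "gerrit"   # assumed name of a remote pointing at the Gerrit server

    def sync_pull_request(payload, repo_dir):
        """Mirror an opened/updated GitHub PR into a Gerrit change.

        `payload` is the JSON body of a GitHub `pull_request` webhook event.
        Pushing to refs/for/<branch> is Gerrit's standard "create or update a
        change for review" ref; a real bridge also has to make sure a
        Change-Id footer is present so repeated pushes update the same change.
        """
        pr = payload["pull_request"]
        base_branch = pr["base"]["ref"]       # e.g. "master"
        pr_number = pr["number"]

        def git(*args):
            subprocess.run(["git", "-C", repo_dir, *args], check=True)

        # fetch the PR head from GitHub's synthetic ref, then push it to Gerrit for review
        git("fetch", "origin", f"pull/{pr_number}/head:pr-{pr_number}")
        git("push", GERRIT_REMOTE, f"pr-{pr_number}:refs/for/{base_branch}")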

I continue to find Gerrit's code review to be vastly better than Github's. Gitlab would take tremendous server resources to run internally and I like Github much better for the front-facing experience.

I wrote about an earlier form of our integration here: https://techspot.zzzeek.org/2016/04/21/gerrit-is-awesome/

to sum up:

1. your project benefits massively by being on Github

2. Gerrit is awesome (for me)

3. gitlab is not very appealing to me UX-wise and self-hosting wise

4. you can still use pull requests from outside users and use gerrit for code reviews.


I have yet to meet anyone who reviews code and uses Gerrit who can name a better solution.

I belonged to a team that used Gerrit for review and hosting; we switched to hosted Gitlab because people missed the "GitHub-like UI" they were used to. It was unanimous that code review on Gerrit was way better:

1. You start by reviewing the commit message, which is the first touch point everyone has with a change.

2. Navigation is done from file to file.

3. On Gerrit there aren't two people commenting the same thing, because:

3.a. Messages from different people on the same line are displayed together, not as different threads.

3.b. The review of a previous version is displayed alongside the next version, so you can continue the same discussion.

I understand that GitHub/GitLab interface is more friendly, but their code-review really stands in the way of producing good software by not favoring good commit messages and long discussions.


> I have yet to meet anyone who reviews code and uses Gerrit who can name a better solution.

What about reviews via patches sent to a mailing list?

I haven't looked into Gerrit for a while, so one question I have is how it handles related commits. The mailing list approach can group them in a single thread tied together by a cover letter message, where each commit, along with its diff against the parent, is a message sent in reply to the cover letter.
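For concreteness, that cover-letter-plus-replies structure is what git's own tooling emits; here is a rough sketch of driving it from Python, assuming a local branch holding the related commits:

    import subprocess

    def make_patch_series(repo_dir, base="origin/master"):
        """Produce a threaded patch series with a cover letter.

        --cover-letter creates the 0/N summary message; --thread makes each
        patch a reply to it, which is what yields the single mail thread
        grouping the related commits described above.
        """
        out = subprocess.run(
            ["git", "-C", repo_dir, "format-patch",
             "--cover-letter", "--thread", f"{base}..HEAD"],
            check=True, capture_output=True, text=True,
        )
        return out.stdout.splitlines()  # one .patch filename per line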


To be polite, I think the target audience of these tools might not include you. While that workflow works for you and, apparently, scales for some very large projects like the Linux kernel, it isn't a good solution for an enormous number of people which is why tools like Github, Gerrit, GitLab, and others exist.


> To be polite, I think the target audience of these tools might not include you.

Insinuating that I'm a special case doesn't add to the discussion.


You aren't a special case, but if you don't see any flaws with your method of code review then I don't think you're the target market for code review tools.


I have no dog in this fight, but you stated:

> it isn't a good solution for an enormous number of people

.. without providing any insight or justification for that belief. So an interesting and productive tangent might be to elucidate what you believe those flaws are?

Unpacking this question might even lead to some good UX ideas that can be applied to today's review systems?


I can offer a perspective from someone who is not a dev or SE.

I honestly never fully understand how to read the maillist patch. The plain text format makes it very hard to understand what's going on. I'm sure I will understand it better or even prefer it if I use it long enough, but then again I can instantly understand code reviews on GitHub/GitLab/Gerrit.


I've never had to use a mailing list patch myself, but from the ones I've glanced at, I have the same problem: formatting and syntax highlighting are absent. It seems like an easy problem to fix with a better viewer, and that might be enough to make it comprehensible.


> I honestly never fully understand how to read the maillist patch. The plain text format makes it very hard to understand what's going on.

That may be due to the settings in your mail client. If it's displaying plain text in a variable width font and/or not applying syntax highlighting in terms of showing added and removed lines in different colors, that would make the diff more difficult to read.

But some mail clients can do that and it makes reading the diff much easier.

> I'm sure I will understand it better or even prefer it if I use it long enough, but then again I can instantly understand code reviews on GitHub/GitLab/Gerrit.

Essentially, code reviews in a mailing list are much like a discussion thread in Hacker News or Reddit where the thread structure is very similar. The only difference is that most mail clients only allow you to display one message at a time.

In my mail client, Thunderbird, you can see the overall thread structure of a patchset discussion [1]. This is the root message for the thread which serves as the cover letter (which is the equivalent of the PR description) [2]. The first commit in the patch is displayed here [3] (note that I have a plugin that enables diff syntax highlighting). The email subject is the commit message title (with the [PATCH v2 1/3] tag prepended to it). The commit message itself is the beginning of the email, and the diff follows.

Unlike Github (and maybe Gitlab), the commit message and diffstat are treated at the same level as the diff itself. That means you can comment on them just like you would on the diff.

Here, you can see Junio C Hamano's comments on the second commit in the patch set [4]. He's commenting on the diffstat line which shows 391 lines added to the builtin/submodule--helper.c file. Further down in the same message [5], he's commenting on the code inline, much like someone would quote a message here on HN and reply inline to multiple sections of it. It's not really that different compared to comments on a diff in Github or Gitlab other than the fact that it's a reply to an email message rather than a web page.

[1] https://i.imgur.com/QmqUWR8.png

[2] https://i.imgur.com/mILREtf.png

[3] https://i.imgur.com/gdoy5zs.png

[4] https://i.imgur.com/BcTdRRe.png

[5] https://i.imgur.com/cCpqsOL.png


I will be honest, I don't even use an email client :/


True, I suppose most people use Gmail or one of the other major email providers through a webmail interface. I haven't been able to get Gmail or Hotmail to display threaded messages the way they're displayed in Thunderbird and they tend to display messages using a variable width font.

In that context, reviewing code would be difficult, if not impossible, to do via email.


That is one advantage of the mail patch approach, though. A lack of vendor lock-in means you can view the patch using whatever application you want. And there's room for a better mail patch viewer, if anyone could be bothered to make one.

I'm guessing one disadvantage is the method of diff is hard-coded into the patch. It would be good to switch to word-diff, or ignore whitespace, but I'd imagine these could be applied as transformations on the generic format.
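As a rough illustration of the kind of client-side transformation I mean, here's a sketch that re-diffs the reconstructed before/after text with whitespace collapsed; a real viewer would map the output back to the original lines, and difflib is just standing in for whatever diff engine the viewer uses:

    import difflib

    def whitespace_insensitive_diff(old_lines, new_lines):
        """Recompute a unified diff that ignores whitespace-only changes.

        The emailed patch itself stays untouched; the viewer re-diffs the
        reconstructed before/after text after collapsing whitespace. The
        output shows the normalized lines, which is good enough to spot
        which hunks are whitespace-only.
        """
        normalize = lambda lines: ["".join(line.split()) + "\n" for line in lines]
        diff = difflib.unified_diff(
            normalize(old_lines), normalize(new_lines),
            fromfile="a/file", tofile="b/file", lineterm="\n",
        )
        return list(diff)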


> I'm guessing one disadvantage is the method of diff is hard-coded into the patch. It would be good to switch to word-diff, or ignore whitespace, but I'd imagine these could be applied as transformations on the generic format.

The plugin I use in Thunderbird can switch between unified, context, and side-by-side diff views based on the same email. Adding the transformations you mentioned could be done.

But one limitation email has compared to other review tools is the lack of ability to expand the view of the context within the client. The only way I could think of is to have git format-patch generate the diff with the entire context included and then have the client limit the display of that context. But that would not have a reasonable fallback for those using clients that aren't capable of, or configured for, doing that.


Which plugin are you using? It seems life-changing to me.


This is the one I'm using: https://github.com/Qeole/colorediffs. I installed it several years ago, so I'm not entirely sure whether you can install it on a current version of Thunderbird, but it is still working with my installation.


I might try it out as well! Thanks for it.


That should be an easy problem to fix:

Just send out email in HTML format with the code portions set to fixed width font and syntax highlighting already applied. Should display just fine in GMail.


> Just send out email in HTML format with the code portions set to fixed width font and syntax highlighting already applied

That won't work because people actually download those email messages and apply them to their local repository with the git am command, and they also expect to be able to reply to the email inline when commenting on a patch. Also, for mail clients that don't support HTML rendering, it would be much more difficult to read or respond to an HTML email.


I would assume you'd send both a plain-text part and an HTML part?
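Presumably something like the following: a sketch of a multipart/alternative message where the plain-text part remains the raw patch and the HTML part carries the highlighting (assumes pygments is installed; whether git am and plain-text clients cope gracefully with the multipart structure is exactly the objection above):

    from email.message import EmailMessage
    from pygments import highlight
    from pygments.lexers import DiffLexer
    from pygments.formatters import HtmlFormatter

    def patch_to_email(patch_text, subject, sender, recipient):
        """Build a multipart/alternative mail: raw patch plus highlighted HTML.

        The plain-text part stays an ordinary git patch; the HTML part is
        just a nicer rendering for clients that prefer it.
        """
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = sender
        msg["To"] = recipient
        msg.set_content(patch_text)  # text/plain part, the patch itself

        formatter = HtmlFormatter(full=True, noclasses=True)
        html = highlight(patch_text, DiffLexer(), formatter)
        msg.add_alternative(html, subtype="html")
        return msg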


Now you're claiming that I'm making an argument that I never made.

To be clear, the original statement was whether there was a better review tool compared to Gerrit for those who review code and use that tool for that purpose. I responded by suggesting the patch review via mailing list method and asked a follow up question about how Gerrit handled related commits and explained how that case was handled by the mailing list method.

Then you, not the person I originally responded to, decided to interject and, while claiming not to be rude, claimed I was saying something entirely different from what I actually stated.

Personally, I found that very off-putting and extremely rude on your part.


I'm sorry I was rude to you, I didn't mean that and I apologize.

I didn't mean anything more than what I literally wrote, which is I don't think you're the target market for code review tools and your suggestion may not be a good fit for people looking for review tools.


Gerrit handles related commits in a similar way I guess.

A Gerrit changeset is like a GitHub PR: it has many commits, and commits are usually rebased against master; this is crucial to track the review properly over time. (It's very easy to switch to earlier revisions of the same changeset, and you never see external changes.)

Honestly, I don't think Gerrit is anything special, I think it simply has the right approach to development (rebased/ff-only) that enables easy review.


Based on the documentation I read, it looks like Gerrit handles this by grouping changes by topics [1]. I'm not sure whether that can be done automatically when pushing up changes that span multiple commits in a local branch.

If I create a branch with several commits where one commit adds a new method with associated unit tests, and a subsequent commit adds several calls to that new method in the code base (while updating any affected tests), then how would Gerrit handle the ordering of those commits? Even if they're in the same topic, I don't know if there's a way to ensure that the first commit is reachable from the subsequent commit.

[1] https://gerrit-review.googlesource.com/Documentation/intro-u...


I'm a Product Designer at GitLab and I appreciate your feedback.

We are well aware of the advantages that Gerrit has over GitLab and how these are emphasized when teams migrate to GitLab. We are working on improvements that will help mitigate this. Specifically, targeting your points:

1. We're discussing the ability to comment on commit messages: https://gitlab.com/gitlab-org/gitlab/-/issues/19691

2. From version 13.2 you can opt to show one file at a time, in your user preferences: https://gitlab.com/gitlab-org/gitlab/-/issues/222790. We have also listed a number of improvements to this feature in https://gitlab.com/groups/gitlab-org/-/epics/516.

Could you expand on point 3.b? If the commented line hasn't changed from version A to B, you should be seeing the comment in version B. Or maybe you're referring to something else?


Gerrit is the one where each "pull request" has to be a single commit, right?

I'm not particularly happy about GitHub, but I think it's less about GH and more about workflow and source code evolution, and I'm not sure if Gerrit solves anything here.

First, PRs being heavyweight encourages large branches instead of smaller incremental changes. Secondly, a large change (such as a big new feature) ends up living in a branch until it's ready to be released, not until it's reviewed. This means that any code that cannot be immediately merged to master needs to live on that branch for a while, and gets rapidly out of date, requiring constant rebasing to keep it from rotting.

I'd prefer to merge as soon as something is accepted and use master as the main development branch, but that causes challenges. If you've merged something you don't want to release yet, you're faced with having to build release branches through cherry-picking, which can be really difficult or even impossible. You can hide features behind flags, but sometimes a branch is a big risky refactoring or some structural change that isn't providing features that can be isolated. Plus, once you merge something, any following changes, even if unrelated, often end up depending on the things you want to exclude.

I think something like Pijul (with some discipline, like always doing small incremental commits) could make this easier by being able to treat individual commits as moving pieces that can be rearranged for a release, but it wouldn't solve everything.

Any thoughts on this and how Gerrit would fit in?


> Gerrit is the one where each "pull request" has to be a single commit, right?

Yes, but you can have changesets depend on each other, so it's not that big a deal. (But you can end up in rebase hell if you do that.) You also get a version history of all the different versions of your commit.

Anecdotally, during my use of Gerrit, I never really wished to have multiple commits in a single changeset.

> I'd prefer to merge as soon as something is accepted and use master as the main development branch

That's what Wikimedia did, mostly. (There were weekly deployment branches, but it was unusual to have something in master that was reverted out of the deploy branch.) It seemed to mostly work fine AFAIK (of course I wasn't on the team doing deploys; for all I know they might have horror stories).


> Anecdotally, during my use of Gerrit, I never really wished to have multiple commits in a single changeset.

People's tools shape their workflows; projects that use Gerrit tend to do more squashing of commits, because there's much more per-commit overhead. When I encountered Gerrit, I found it really frustrating to work with for this exact reason. Other aspects of it were great, but if you're used to "one logical change per commit" and end up with a dozen commits in a PR, that can be painful with Gerrit.


The whole point of Gerrit is to keep doing "one logical change per commit", but reviewing them individually. Anecdotally, this results in much higher-quality reviews.

You can still group them by "feature" and merge them atomically by using topics.


> # Why

> For the past two years, our developer satisfaction survey has shown that there is some level of dissatisfaction with Gerrit, our code review system. This dissatisfaction is particularly evident for our volunteer communities. The evident dissatisfaction with code review, coupled with an internal review of our CI tooling and practice makes this an opportune moment to revisit our code review choices.

and then further down

> # FAQ

> * Why is GitHub not considered?

> - GitHub would be the first tool required to participate in the Wikimedia technical community that would be non Free Software and non self-hosted.

> - GitHub also does not meet all of our needs; for example, GitHub grants little control of metadata, no influence over privacy policy/data retention, sanctions and bans, little control over backups and data integrity checks, and no long-term guaranteed access to underlying repository settings and configuration.


Wikimedia was already using mirroring to github (however we didn't accept pull requests).

I'm pretty sure most of the anti-gerrit sentiment at wikimedia was about gerrit as a code review tool.

My personal experience with it (as a mediawiki developer) is gerrit has a lot of UI bugs (although it has gotten better). I also suspect it encourages a code review culture that is overly nitpicky and risk averse (but perhaps that is just cultural forces at wikimedia)


I'm not familiar with any UI bugs, but we don't use the "live editing" feature; maybe that's where you had problems. The big issue with Gerrit is on the "getting plugins to work" side of things, as they are kind of ad hoc and almost totally undocumented, and the access model is too complicated, but once that's all working there is no need to deal with it.

As for nitpicky culture, maybe we'd have that problem with Openstack, where there are thousands of developers, but for our projects in SQLAlchemy we're a team of about five people and I'm in more or less a BDFL type of role. To the degree that we are nitpicky about things, it only prevents much bigger problems from happening later. If a review has little things that are bugging me, I'll just fix them myself and push a new change up rather than bothering the contributor with them, which is also something you can't usually do with pull requests.


Wikimedia was using a super old version of Gerrit for a long time. They upgraded recently (although the upgrade happened at roughly the same time as I left my job and took a step back from Wikimedia, so I don't have much experience with the new version).

I think Wikimedia struggles a lot with code review culture in general. Different people have conflicting ideas about what good code looks like. It used to be very nitpicky (I've had code rejected in the past for using (PHP's) intval() instead of casting to int. I've also had code rejected for casting to int instead of using intval().) But that's improved quite a bit with better precommit lint tools. The length of the feedback cycle is very long and sometimes it feels like it's mostly about who you know (e.g. the last patch I submitted was on Sept 9, for a decently serious bug. The first actionable feedback, which was relatively minor things of the form "use a constant named ONE_MINUTE instead of 60", came on Oct 15. That's kind of a long time to wait for code review, imo). Anyways, it's just not fun to contribute when code review is so unpredictable and slow.

Hmm. Guess I got off on a bit of a tangent there. I do think Gerrit has some usability issues, but I think that's hardly the main problem.


Those sound like managerial / organizational / social issues. Technology isn't going to solve those without good guidance and controls for the overall system. Building that up for a very large organization is extremely difficult; I'd not want to have to do that :).


> My personal experience with it (as a mediawiki developer) is gerrit has a lot of UI bugs (although it has gotten better). I also suspect it encourages a code review culture that is overly nitpicky and risk averse (but perhaps that is just cultural forces at wikimedia)

Can strongly recommend removing "-1 code review" and requiring all comments to be resolved instead. Accomplishes the same goal while being more positive.


I wonder how gerrit compares to reviewboard (https://www.reviewboard.org/)


Anyone thinking of moving to their own Gitlab instance with Gitlab CE: either stay on Github or prepare to waste your time dealing with user spam bots that pollute your site's search results.

In other words-- if you want the common use case for a FOSS project:

1. publicly viewable main repository with publicly viewable issue tracker

2. a requirement to log in to view snippets, user profiles, and perhaps even other repos, as enforced by administrator settings (otherwise SEO bots will leverage these features to eat your search results)

3. anyone with an email can sign up to post issues to the main repo's issue tracker

There is no combination of settings in Gitlab CE to achieve this. Any sane approach has to leave out step #2. That means that your Gitlab instance gets hammered with user spam from bots which then get indexed in Google search results for your site.

Worse, Gitlab has no tools to make it easy to remove the user spam (and obviously no tools to prevent it from happening).

Just run a public-facing Gitlab CE instance for a few days. Search for one of the spam snippets you collect, and you'll find results for all the FOSS projects out there running their own Gitlab instances.

I've never seen any solutions offered by Gitlab for this, nor frankly any interest in the myriad bug reports about them addressing this at all.

Edit: typo


Hi! I'm the PM at GitLab who works on Snippets, so thanks for providing this feedback. We do have Recaptcha support which can be configured - are you seeing these kinds of issues with that enabled/configured?

One item that is on the roadmap that is coming and may be of interest is `Optional Admin Approval for local user sign up` - https://gitlab.com/groups/gitlab-org/-/epics/4491.

I'm not in the group working on that, but it does appear to be coming soon and would prevent newly created accounts from doing anything until they're approved.


Hi phikai,

I built a privacy-friendly alternative to ReCaptcha called FriendlyCaptcha [1]. Is there a possibility of seeing this integrated as a more user-friendly alternative?

Happy to chat (e-mail in profile)

[1] https://friendlycaptcha.com/


Man this needs more attention, cool project. I see you tried to submit to HN a couple of times and didn't get traction, that's too bad. Don't give up!


Is the demo somehow tweaked to be less hard?

On my machine it doesn't take any time to solve it and I see no signs of CPU usage. Even trying a couple of times in incognito mode and watching CPU immediately after loading the page for the first time.

On many sites creating a profile takes a few seconds. Loading one of my CPU cores for another 5 seconds doesn't really bother me if I wanted to create massive amounts of profiles/posts. I'll still do over 100 per minute on a standard desktop PC.


The default difficulty is set to a level that makes sense for websites with a varied audience (which includes some ancient browsers on old devices).

The solver runs in WebAssembly and is really really fast (~4M hashes per second) - but not every browser supports WASM yet (around 0.3% don't, empirically). The JS fallback is around 10 times slower (more in 5+ year old browsers) - for those users you want at least a decent solve time too.

For Gitlab's audience the difficulty can probably be increased a lot - it all depends on the website and usecase. I'm sure the JS fallback's performance can be improved (it involves a lot of operations on 64bit ints that need to be represented as two numbers in JS), happy to accept PRs [1] :)

[1]: https://github.com/FriendlyCaptcha/friendly-pow/blob/master/...


What are your thoughts on performing a quick initial test on each client to measure their performance, and then tailoring the puzzle to be difficult enough for each?


Once the spammer figures out what you're doing, he'll just throttle the CPU for the duration of the quick test.

Depending on how smart the test is, just having Date.now() return values with -12000, -11000, -10000 ms offsets for the first few calls might even do it.


That looks cool! Can someone create an issue to add support for this to GitLab? And maybe we can consider switching GitLab.com to this as well.


I'm personally interested in this too so I've created one :D https://gitlab.com/gitlab-org/gitlab/-/issues/273480


Thanks for creating this! I think adding support for this in GitLab is a no-brainer. After that we can consider enabling it for GitLab.com


Hopefully you are successful, but how can this scale? If it takes 5 seconds on a desktop, then a server can solve roughly 500,000 captchas per month. At $5 per month for that server, a spammer can still send 1,000 messages for a cent.


It's not enabled yet in production - but the main mechanism is by increasing the difficulty as more requests are made from an IP in a certain timeframe (it's basically rate limiting at that point). Think: every 3rd request in a minute doubles the difficulty with some cooldown period.
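A minimal sketch of that escalation logic; the window size, base difficulty, and doubling rule below are placeholders for illustration, not FriendlyCaptcha's actual parameters:

    import time
    from collections import defaultdict, deque

    BASE_DIFFICULTY = 1          # placeholder units
    WINDOW_SECONDS = 60
    REQUESTS_PER_DOUBLING = 3    # "every 3rd request in a minute doubles the difficulty"

    _recent = defaultdict(deque)  # ip -> timestamps of recent puzzle requests

    def difficulty_for(ip):
        """Return a puzzle difficulty that doubles as one IP requests more puzzles per minute."""
        now = time.time()
        window = _recent[ip]
        window.append(now)
        # drop requests that fell out of the window (this is the cooldown)
        while window and window[0] < now - WINDOW_SECONDS:
            window.popleft()
        doublings = len(window) // REQUESTS_PER_DOUBLING
        return BASE_DIFFICULTY * (2 ** doublings)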

With that, the cost (and complexity) of an attack can hopefully be in the same ballpark as (or higher than) ReCaptcha - without your end user having to label cars or send data to Google.

But in the end a determined spammer will get through any captcha cheaply (for reference: ReCaptcha solves are sold by the thousands for $1) - we just hope we can do better than ReCAPTCHA, especially UX-wise.


I love this concept of proof-of-work captchas, but there's a growing number of tools and ways to bypass IP blocks via IP rotation [1], especially after the explosion of IaaS providers. How do you intend to tackle this?

[1] Some examples: https://rhinosecuritylabs.com/aws/bypassing-ip-based-blockin... https://oxylabs.io/products/real-time-crawler https://github.com/alex-miller-0/Tor_Crawler https://www.scrapinghub.com/crawlera/


There are free and paid lists of all IP addresses from datacenters, like https://udger.com/resources/datacenter-list. They probably exist specifically for preventing this, so maybe that's an option here.


The obvious follow-up question is how IPv6 impacts this, because I think it's supposed to be easy for someone to get their hands on a decent chunk of IPv6 addresses.

Maybe the difficulty could scale as a property of how similar the IP address is to previously seen addresses... so the addresses in the same /64 block would be very closely related, for example. (I think that's how IPv6 works... but definitely something I haven't researched lately, so I could just sound very confused)


I don't have all the answers yet, but indeed rate limiting a larger block (at least /64), or even at multiple prefix sizes with different weighting makes sense.
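Keying the limiter on prefixes rather than whole addresses is only a few lines with Python's ipaddress module; a sketch, with /64 and /48 as the assumed weighting levels:

    import ipaddress

    def rate_limit_keys(addr):
        """Return the buckets an address should count against.

        IPv4 counts per address; IPv6 counts against both its /64 and its /48,
        so rotating addresses inside one allocation still hits the same buckets.
        """
        ip = ipaddress.ip_address(addr)
        if ip.version == 4:
            return [str(ip)]
        net64 = ipaddress.ip_network(f"{ip}/64", strict=False)
        net48 = ipaddress.ip_network(f"{ip}/48", strict=False)
        return [str(net64), str(net48)]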


So the way this is supposed to work is that providers hand out /48s and each site should be allocated a /64. In practice if you for example rent a VPS, you'll be handed a /64 for it by your service provider from their /48.

I would personally treat any /64 as the same. Depending on your local network setup the second half of the address could be anything and could change frequently. You might also get multiple addresses. Whereas getting a new /64, or /48, requires slightly more effort.

Of course there's a risk you'll block a /64 and that takes out some whole company or whatever, but I've seen that happen to corporate proxies that got flagged as a source of spam as well so this is not an easy problem even without the 2^128 address space.


Your website mentions that friendlycaptcha is open source, but looking at the license in the repository, it is a custom license that can't be defined as open source. Can you change the description to "source available"?


Love to see this. ReCaptcha is nothing short of a menace. I'll take a shot at this for my next project


There doesn't appear to be any discussion on your website or on GitHub about why, to be blunt, this is even a good idea in the first place.

A classic 2004 paper, "Proof-of-Work" Proves Not to Work [0], explained that the fundamental problem with proof-of-work bot filters is that attackers will always be able to solve the cryptographic puzzle faster than legitimate users. A touch of security-through-obscurity can help at the margins, but you chose Blake2b, which is used by cryptocurrencies like Zcash, Siacoin, and Nano [1], and as a result there are optimized GPU algorithms (first Google result [2]) and FPGA designs (one of the top Google results [3]). Have you run the numbers on any of those?

The closest to any discussion of these numbers that I saw was a mention on your website that it may take up to 20s on mobile; for comparison, the much-hated image CAPTCHA takes about 6-12s on average for native English speakers, and 7-14s for non-native speakers [4].

In another comment you bring up the idea of starting with a lower difficulty, and increasing it with repeated requests from the same IP address (IPv4, I assume). Unfortunately, access to unique IPv4 addresses is highly correlated with access to more compute power: laptops and desktops in developed countries are most likely to be in a household with a unique IPv4 address, whereas mobile devices on 4G internet and households in developing countries are more likely to be behind Carrier-Grade NAT [5], where thousands or millions [6] of hosts share a pool of a handful or dozens of IPv4 addresses. (The exact same concern applies to IPv6 /64 prefixes.)

This means that mobile devices will face a "double-jeopardy": your service will present them with higher proof-of-work difficulties because the same IPv4 address is shared by more people, and at the same time, the mobile device solves the proof-of-work slower for the same difficulty than a desktop.

Do you have documented anywhere on your website or GitHub how you address these concerns?

[0]: https://www.cl.cam.ac.uk/~rnc1/proofwork.pdf

[1]: https://en.bitcoinwiki.org/wiki/Blake2b

[2]: https://github.com/zhq1/sgminer-blake2b

[3]: https://xilinx.github.io/Vitis_Libraries/security/2020.1/gui...

[4]: http://theory.stanford.edu/people/jcm/papers/captcha-study-o...

[5]: https://en.wikipedia.org/wiki/Carrier-grade_NAT

[6]: Yes, millions. RFC 6598 reserved a /10 for them, which is 4 million unique IPv4 addresses: https://tools.ietf.org/html/rfc6598


I'm not associated with the project in any way, but your well researched comment did miss at least one important factoid.

This comment:

> The closest to any discussion of these numbers that I saw was a mention that it may take up to 20s on mobile; for comparison, the much-hated image CAPTCHA takes about 6-12s on average for native English speakers, and 7-14s for non-native speakers.

Missed this quote from the website:

> As soon as the user starts filling the form it starts getting solved

> By the time the user is ready to submit, the puzzle is probably already solved.

The time spent solving reCAPTCHA is active user involvement. The time being spent on Friendly Captcha is passive and can overlap with time being spent filling out a form.

"up to 20 seconds" was also seemingly presented as a worst-case scenario. Most users' devices would presumably be faster than that, but I don't know how the author researched that conclusion on how performance scales. Friendly Captcha does report back some information on how long it is taking users to solve the captcha, and it looks like website owners could use that to adjust the difficulty based on the needs of their specific audience and how tolerant they are of untargeted spam.

The stuff you point out about Blake2b seems entirely legitimate, and I wonder if an Argon variant would be more appropriate to avoid specialized hardware being quite so problematic.

Personally, I really like the idea of Friendly Captcha. Certainly, there are problems with any captcha implementation. People can rant for many, many paragraphs about websites that use reCAPTCHA... I'm not surprised to see someone ripping apart a different captcha system. The ideal solution would be for spammers to just stop being so obnoxious... but good luck with that plan.


> The time being spent on Friendly Captcha is passive and can overlap with time being spent filling out a form.

Great point!

> I wonder if an Argon variant would be more appropriate

The creators of Argon2 actually also created a memory-hard proof-of-work function they call MTP (for "Merkle Tree Proof", which is a terrible name, totally un-Googleable; I always have to search for the title of their paper, "Egalitarian Computing"): https://arxiv.org/pdf/1606.03588.pdf

A bug bounty for it was sponsored by Zcoin, which is nice. Zcoin is actually considering moving away from it, but mainly because the proof size of 200kb is prohibitive, which is less of a concern for a captcha system: https://forum.zcoin.io/t/should-we-change-pow-algorithm/477

> I'm not surprised to see someone ripping apart a different captcha system

I really don't mean to rip it apart. I just wanted to see some discussion, any discussion, of the well-known flaws with the idea and what ideas OP has to address them.


It is also important to note that the 6-12 seconds and 7-14 seconds reported in the paper is for the garbled text CAPTCHAs, not for image labeling tasks (fire hydrants, cars, etc).


I'll try to provide my thoughts on each of the issues you've mentioned, let me know if there's something I missed.

On using blake2b: I chose blake2b as I was looking for a hash function that is small in implementation, readily available, and already optimized. With WebAssembly the solver can achieve (close to native) speeds and at least be an order of magnitude or two closer to optimized GPU algorithms.
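For readers who haven't seen this style of captcha: the core is a Hashcash-style search for a nonce whose blake2b hash clears a difficulty target. A toy version (hashlib ships blake2b; the target encoding is simplified relative to the real widget):

    import hashlib
    import os

    def solve(puzzle: bytes, difficulty_bits: int) -> int:
        """Find a nonce such that blake2b(puzzle || nonce) starts with
        `difficulty_bits` zero bits. Verification is a single hash; the cost
        is all on the solver's side, which is the point of proof-of-work."""
        target = 1 << (256 - difficulty_bits)
        nonce = 0
        while True:
            digest = hashlib.blake2b(puzzle + nonce.to_bytes(8, "big"),
                                     digest_size=32).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce
            nonce += 1

    puzzle = os.urandom(16)
    print(solve(puzzle, 16))   # ~65k hashes on average at 16 bits of difficulty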

As for specialized hardware: image tasks (and even more so audio tasks, which must be present for accessibility reasons) have the same issue in that they can be solved by GPU algorithms (i.e. machine learning, where even a low success rate would already be enough). If you search on GitHub you will find there are more ML captcha-cracking repos than captcha implementations - they are probably even easier to get started with than adapting GPU miner code.

Image/audio captcha vs. ML is an arms race that can be beaten for split seconds of compute (even on CPU) or cheap human labeling: it's just as broken. FriendlyCaptcha optimizes for the end user (privacy + effort + accessibility) by not engaging in the arms race - I think it makes a better trade-off. As the sibling comment pointed out, the captcha solving can happen entirely in the background so that hopefully it doesn't even make the user wait.

As for rate limiting/difficulty adjustment: it's not perfect and it could lead to problems if you share the IP with a spammer (and let's be realistic: even with a million users on one IP there won't be tens of users signing up to some forum per minute). Normal captchas also have problems here, though: users from these locales already get presented with much more difficult and more frequent recaptcha tasks (I also doubt they are localized: American sidewalks are harder to label if you've never seen one in real life). Setting a reasonable upper limit to difficulty may be good enough here.

On not using blake2b: I have considered mutating the hashing algorithm every day randomly to make writing an optimized solver for it all the more difficult - but that would mean one could no longer self-serve the JS+WASM and be done with it. I won't rule it out for FriendlyCaptcha v2 if this does ever become a real problem.

Swapping out the hash function should be easy (the puzzles are versioned to allow for this). If you have a different function in mind and someone implements it in Assemblyscript (so we also have a JS fallback) then we can definitely consider it.


Thanks for your detailed response.

I've seen all the projects claiming to have broken ReCAPTCHA—often using Google's own ML services, hilariously—but it's unclear to me how broken image/audio CAPTCHAs are in practice (and the number of GitHub repos doesn't seem like a good measure to me). If they really are completely broken, then why are they still so widely used? If they really are completely broken by ML, how do human CAPTCHA-solving services stay in business?

> FriendlyCaptcha optimizes for the end user (privacy + effort + accessibility)

Good point. I am concerned though that burning CPU cycles on the proof-of-work uses battery life if the end-user is on mobile, without their getting any choice in the matter. What if, given an informed choice, they would have preferred an image CAPTCHA? (On the other hand, that could use more cellular data. Might be good to run the numbers on this too.)

> even with a million users on one IP there won't be tens of users signing up to some forum per minute

I think this is a bad choice of threat model. "Some forum" would likely be better off with simpler measures, like a hidden honeypot textbox: https://dev.to/felipperegazio/how-to-create-a-simple-honeypo...


Cool project, but I do find it quite ironic that it's named friendly captcha when it's not a captcha.


How would you define "CAPTCHA"?


The original expansion was "Completely Automated Public Turing test to tell Computers and Humans Apart".


CAPTCHA: a computer program or system intended to distinguish human from machine input, typically as a way of thwarting spam and automated extraction of data from websites

I would say this Oxford Languages dictionary definition is close enough.


Really nice! Finally someone is using the blockchain technology in a meaningful way!


This doesn't use a blockchain, it uses a Hashcash-style proof-of-work function (an idea that predates Bitcoin by more than a decade): https://en.wikipedia.org/wiki/Hashcash


Awesome work, I will be giving this a try in my next project


> up to 20 seconds on old smartphones

That sounds like a very battery-unfriendly idea.


It's not perfect, but maxing a single core for 20 seconds on an older smartphone is a necessary evil for this kind of captcha.

The alternative: loading a third party script and multiple images (~2MB) to label for ReCAPTCHA and spending time performing the task also takes some battery (and mental) power.


> We do have Recaptcha support which can be configured - are you seeing these kinds of issues with that enabled/configured?

Thanks, I have used Recaptcha for a long time now. It made no difference.

> One item that is on the roadmap that is coming and may be of interest is `Optional Admin Approval for local user sign up` - https://gitlab.com/groups/gitlab-org/-/epics/4491.

Yes, that would be a very sensible solution and welcome feature for my use case here.

Unfortunately, from the bottom of that issue tracker:

"Yikes. I'm glad we did the further breakdown and pre-work. It's a bit cringeworthy looking back and seeing I estimated a 5"


Hi! I'm a PM at GitLab. Please see my reply above for more details but TL;DR we shipped the first iteration of the `Optional Admin Approval for local user sign up` feature in 13.5. I'd love your feedback! Please comment on the epic if there are other changes for this feature that would help your use case https://gitlab.com/groups/gitlab-org/-/epics/4491


Thanks for the update. I can certainly manage user sign-up from the admin tab for the time being. Once it's hooked into email, I believe that will make things maintainable again for me.

From a UX standpoint it's still sub-par. Someone who wants to report an issue doesn't want to wait an arbitrary amount of time to be allowed to report an issue. They are ready to report it at that moment.

And as an admin, I don't want to have to approve new users on a schedule to ensure the delay is low enough that they are still willing to submit the issue after I approve them. I'd much prefer they go ahead and submit the content, especially so that I can use it in my review of whether to approve the sign up or not.

I seem to remember some pattern in Gitlab where my login period timed out before I finished making a comment. When I logged back in, Gitlab had somehow saved my comment content so that I could then post it for others to see. Is there any way to use that pattern for users who haven't been approved yet? So that they can post content, but with a warning shown to them that other users won't see it until the sign-up is approved.


That's a really interesting idea! Users could have limited interactions with the instance and content queued up until approved by an administrator. I created an issue to capture this. https://gitlab.com/gitlab-org/gitlab/-/issues/273542


Relying on Google's Spying-as-a-Service tooling is not very FOSS at all.

There need to be other ways to reach out to users who block Google.


I immediately back out whenever I encounter Recaptcha.

The other day I was forced to endure it, because I wanted to delete my ancient Minecraft account, since Microsoft pulled a Facebook and is going to require a Microsoft account to play going forward. Without exaggeration, it took me 15 minutes of training Google's surveillance AI (I had to solve it three times) for Recaptcha to let me in. I guess Google really hates me.


Yesterday I spent the longest ever with a recaptcha, about 2-3 minutes, at a frigging checkout page. I decided to endure it just because I really needed that ergonomic kb+mouse combo.

Hopefully they'll let me solve captchas for longer without my getting an RSI.


Are you sure you are human?


I'm human enough, and I've been a licensed driver long enough, to recognize that rumble strips at the side of a road are not crosswalks. But apparently enough bots thought they were that the system is now trained on that 'fact', and I as a human am forced to misidentify rumble strips as crosswalks to pass as human.

It's bizarre.


ReCaptcha also thinks that mailboxes are parking meters, for some reason.



Try reCAPTCHA’s audio version (the headphones icon), it’s much easier than guessing what images it wants you to click (if you speak English, have headphones, and are not hearing-impaired).


This sounds like it has the potential to be a modern version of the credit score: avoid it enough, and you become persona non grata. That is, for more than 15 minutes.


I do the same thing.


You're doing something very wrong if you take 15 minutes to solve these and aren't on Tor. Even on public VPN and Firefox this doesn't happen usually. I know people that pick the wrong options to fuck with their models though, and then go on HN to complain about recaptcha being annoying.


I have similar issues. I do not pick the wrong options. It also doesn't take me too long to solve the captchas, leading to "too many queries from your ip address". This is what internet users deal with when blocking most google services.


Thanks for bringing up this epic in the conversation, phikai. I'm a PM at GitLab for our Auth group and am working on the `Optional Admin Approval for local user sign up` feature. I'm happy to tell y'all that we shipped the first iteration of this in our 13.5 release. You can find more information in our release blog https://about.gitlab.com/releases/2020/10/22/gitlab-13-5-rel... . I've also updated the epic with more information about its current status https://gitlab.com/groups/gitlab-org/-/epics/4491#status-upd....


For this specific case, the Wikimedia Foundation has explicitly stated that "It is the Free Software release of GitLab that runs optional non-free software such as Google Recaptcha to block abuse, which we do not plan to use." So, not incredibly helpful at the moment.

Also, is manual approval for new signups a good idea for a large FOSS project? It seems like a pretty big barrier to legitimate discussion.


We (at torproject.org) also adopted GitLab CE recently and we had to close down registrations because of abuse. Tens (hundreds?) of seemingly fake accounts were created in the two weeks we had registrations open, and we had to go through each one of those to make sure they were legitimate. In our case, snippets were not directly the problem: user profiles were used as spam directly.

We can't use ReCAPTCHA or Akismet for obvious privacy reasons. The new "admin approval" process in 13.5 is interesting, but doesn't work so well for us, because it's hard to judge if an account should be allowed or not.

As a workaround, we implemented a "lobby": a simple Django app that sits in front of gitlab to moderate admissions.

https://gitlab.torproject.org/tpo/tpa/gitlab-lobby/

The idea is people have to provide a reason (free form text field) to justify their account. We'd also like people to be able to file bugs from there directly, in one shot.

We're also thinking of enabling the service desk to have that lower bar for entry, but we're worried about abuse there as well.

Having alternatives to ReCAPTCHA would be quite useful for us as well.


You have to remove incentives. Block the viewing of these snippets by logged-out users by default, require opt-in, and provide a way to whitelist snippets per snippet or per user. Same for user profiles.


I don't think this is targeting human views - it's targeting Google for a SERP (search engine results page) boost.


That's the point. Having a way to disable search engines would also work, but wouldn't be obvious to spammers, so they would still try to spam. Disabling all users by default works to remove the incentive to try.


Is this something that we will have in the CE version (the openly licensed one), or will it only go to the enterprise one?


None of your captcha settings work, not even the invisible captcha setting that requires enabling a feature flag.


Have you thought of the option of disabling links? That would make SEO spam impossible


Just adding the attribute rel="nofollow ugc" to any links in submitted content may be good enough. This tells search engines not to index them, or to tag them as suspicious, allowing them to identify SEO spam more easily. [1]

Having both options would be great.

[1] https://support.google.com/webmasters/answer/96569
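To illustrate what that could look like on the rendering side, here is a naive sketch that stamps the attribute onto anchors in user-submitted HTML (illustrative only, not an existing GitLab hook; a real implementation would go through an HTML sanitizer rather than a regex):

    import re

    def add_ugc_rel(html: str) -> str:
        """Stamp rel="nofollow ugc" onto every anchor in user-generated HTML.

        A naive regex pass for illustration; a sanitizer would merge with any
        existing rel attribute instead of blindly prepending one.
        """
        return re.sub(r"<a\s", '<a rel="nofollow ugc" ', html)

    print(add_ugc_rel('see <a href="https://example.com/spam">here</a>'))
    # see <a rel="nofollow ugc" href="https://example.com/spam">here</a>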


The spam is infuriating (not GitLab's fault, of course). At least on our instance at https://git.cloudron.io, we got massive snippet spam. After we disabled snippets, we got massive spam on the issue tracker (!). The way we "fixed" it was by turning on mandatory 2FA for all users.

As a general lesson, what we learnt is that these are not bots. These are real humans working in some poor country, manually creating accounts (always gmail accounts) and pasting all sorts of random text. Some of these people even set up 2FA and open issues with junk text; it's amazing. Unfortunately, GitLab from what I can tell cannot make issues read-only to non project members (i.e., I only want project members to open issues; others can just read and watch issues).

Currently, our forum spam (https://forum.cloudron.io) is way more than GitLab spam. On the forum, we even have Captcha enabled (something we despise) but even that doesn't help when there are real humans at work.


We had one of the "real humans" write to us (in issues) asking us to leave his spam up for "just a few hours".

We implemented a filter anyway.

(This was not Gitlab, but a specific form on our unique website.)


> asking us to leave his spam up for "just a few hours"

What... why? What is their goal???


Feeding their family?


But like... who's paying for that kind of spam??


In our case, mostly pirate TV streaming services. (Watch NBA games live etc.)


A lot of people unfortunately seem to bite on those "SEO experts" kind of emails. I had a few clients ask me if they should give it a try, since, "why not, it's cheap".


Why are they posting random text in Gitlab?


This is a typical spam profile. Usually they contain links, which search engines follow.

https://forum.cloudron.io/user/cardioaseg


The link contains rel=nofollow.


That doesn't matter; see other comments below on Google's changing treatment of this attribute.

Also you'll find spambots posting on any open form on the internet even if it doesn't do them any good, because much of it is automated, so even if you hide the results the spam will still come in.


I don’t know for sure, but I think our Markdown implementation adds nofollow.


I used to think that spammers would stop if their spamming didn't win them any results. But they don't care. They spread their spam as widely as possible without trying to prune out the places where it does them no good.


I am not entirely sure. See https://forum.cloudron.io/users; if you go to, say, page 10 or so you will see all sorts of nonsense. I am still trying to figure out the best way to fight this spam (because captcha is enabled and required to even create accounts). But these are real people and not bots. I know this because they even post new messages all the time.


Definitely the SEO backlinks: for example, one profile I see is linking to an Indian escort service in the profile.


Maybe GitLab needs an option to disable external linking, and filter any comment that contains an external link automatically


Or a nofollow option (add rel=nofollow)


That's a great idea. We have discussed ways of adding a trust level and enabling this for specific groups. Discourse uses the same system for preventing spam. "Good" bots detect the rel=nofollow and do not come back.

See my proposal here: https://gitlab.com/gitlab-org/gitlab/-/issues/14156#note_258...


Iterating on my original thought, here is a smaller feature request for self-hosted GitLab instances. This can help GitLab.com too: https://gitlab.com/gitlab-org/gitlab/-/issues/273618


I still think an even better path to success is to allow entirely disabling linking for non-admins.

Google no longer treats "nofollow" as strongly as it used to: https://webmasters.googleblog.com/2019/09/evolving-nofollow-...


Thanks for sharing, I have added it to the issue, maybe you want to join the discussion there :) https://gitlab.com/gitlab-org/gitlab/-/issues/273618#note_43...

Just so that I can follow - URLs posted by non-admins should not render as HTML URLs at all? Wouldn't that be quite limiting for OSS project members for example?


My opinion on the topic isn't definitive by any means, but I think a lot of projects would do just fine without allowing arbitrary hyperlinks to be added by non-admins.

I think being able to link to related issues and link into the code is still important, for example.

It's certainly a trade off, but spammers want it to be rendered as a link.


getting that sweet sweet seo backlink juice


Isn’t that why we add rel=nofollow to low friction user submitted links on our platforms?


Google changed the interpretation of those a year ago. https://webmasters.googleblog.com/2019/09/evolving-nofollow-...


> Looking at all the links we encounter can also help us better understand unnatural linking patterns.

It appears as though they want to mark these links in order to prevent inorganic SEO, not help it.


I don't get it. They post all this spam in the hopes that people click on the links therein, thereby boosting the ranking of those sites? Does that actually work at all?


It doesn’t actually require anyone clicking on the links. Google sees inbound links and uses that as a factor when calculating the ranking of the linked page.


I thought that was how it worked like a decade or more ago, but not today.


Regardless of whether it works, people still pay for it. I have a Facebook ad right now that says "Get over 500,000 backlinks for $29.99". No doubt it's someone with a bot that spams comment forms.


A service like Stop Forum Spam might be a solution to this. It checks the IP address and email address and gives each a score based on how likely it is to belong to a spammer.

When they have to set up a new email account and maybe even a new IP address for every few accounts, it gets to be a lot of work soon.

https://www.stopforumspam.com/


How do you know they are real humans? I imagine bots doing 2FA would still be cheaper.


I know this because in our forum we have LOTS of "spam" users - https://forum.cloudron.io/users . These users will go into posts and actually make "helpful" comments. Like, "Oh I tried this solution but I found that my disk my full. Deleting data fixed my problem". It almost seems genuine but they build reputation and once they have some votes, they go back and edit all the comments to have links.


Banning entire countries helps a lot. I don't want to name certain countries, but let's assume it's one where it's common to see human corpses floating on a big river.


That doesn't help narrow it down. I live in Seattle and the first thing I thought of was a popular tiktok of teens finding a corpse in the river last month.


The word you missed was "common."


Apart from the fact that banning an entire country from contributing to their code would be antithetical to the Wikimedia foundation, if you're implying the country which I think you're implying (which is also where I live, btw) you'll:

1. Ban a burgeoning tech industry which has produced over 20 unicorns, receives billions in funding from across the world, and produces world-class tech talent;

2. Ban millions of other OSS developers from contributing; and

3. Just lead to SEO spammers picking out other impoverished countries to spam from, which means you'll finally end up with only people from the "west" being able to contribute in any way.


Many bots are likely still powered under the hood by humans.

On my backlog of projects to do is a browser extension that solves the more obnoxious captchas for me, as I'm regularly behind a VPN and fall into ridiculously long solve loops.

On the most popular API I could find, $10 buys you a shocking number of solves (not that I've tested it yet). It is automatable but ultimately still powered by humans.


It’s incredibly sad how the open web is being destroyed by google’s recaptcha.


Without google's recaptcha, do you think there would be less spam?

Personally, I suspect there would be more without at least some speed bumps to raise the cost of spamming. I would absolutely love for there to be better options than recaptcha that meets the same needs around bot-detection, price, implementation effort, and accessibility. It is, sadly, the best option I've seen on offer.

You're right. The scenario we're in is incredibly sad. It would be wonderful if the individual actors involved had better options to meet their needs.


I'd argue that it's equally sad to see the open web get destroyed by massive DDoS attacks and malicious actors. How would you keep your own website up if it was constantly being attacked?


You're barking up the wrong tree. Bad actors create abuse and spam which they can do because of fundamental weaknesses in the design of the internet. People trying to solve that reality with Recaptcha (and Cloudflare for that matter) aren't the ones destroying the internet.


I don't think it's so much the wrong tree as it is but one tree in a forest to be barking up.

All the maturely developed bot filters frequently throw me in an endless battery of tests that have me giving up in frustration before finally making it through to content I'm requesting.

> aren't the ones destroying the internet

IMO they are every bit as much destroying it as the abusers they're claiming to fend off.


I'm totally in that camp of opinion, although I'll acknowledge the escalating abuses carried out by both "sides."

In the meantime, I hope to have the savviness to program my own way out of unsolvable captchas.


Already exists: https://github.com/dessant/buster

Edit: on re-read, you meant solving using humans. Buster uses speech-to-text APIs to solve.


Every lead to solve (e: "solve", lol gboard) my problems is probably worth a peek. I'll take a look, thank you.


Could add a disallow rule to your robots.txt and advertise the fact on the signup page that search engines won't find this content.


At GitLab Inc. we have a Trust and Safety team https://about.gitlab.com/handbook/engineering/security/opera... that prevents spam.

So far that functionality has lived in separate repositories from the core codebase since few people needed it, the cycle time was quicker, and it is an advantage to not have the spammers see the code.

If there is strong interest in collaborating on this I'm sure they will be happy to engage. I'll ask them how best to structure this.


There's been a gitlab bug open for almost 3 years to stop relying on recaptcha: https://gitlab.com/gitlab-org/gitlab-foss/-/issues/45684. Debian, KDE and Gnome have never wanted to make their users run Google's nonfree javascript blob to contribute on their gitlab instances. There's been interest; Gitlab has done very little about it. Edit: other bugs about this can be found here https://gitlab.com/gitlab-org/gitlab-foss/-/issues/46548


We have a team currently working on improving the detection and mitigation of spam. We continue to look for ways to improve the security and user experience of our product. Our product includes the Akismet spam filter, which you can read more about in our handbook: https://about.gitlab.com/handbook/support/workflows/managing.... Further, Gitlab.com includes the ability to report abuse directly to our trust & safety team here: https://about.gitlab.com/handbook/engineering/security/opera...; however, the report-abuse feature on self-managed instances reports back to the instance admin. We are also currently developing an anti-spam feature intended to further improve spam detection and mitigation. This is set to be enabled on GitLab.com within 3 months.


As mentioned above in the thread, multiple times, maybe a simpler solution to reduce spam is to remove incentives by:

- removing links (rendering them as plain text, forcing users to copy-paste them)
- hiding links from non-registered users (plain text to non-registered users, clickable for registered users)
- blocking links from search engine crawlers (robots.txt / rel=nofollow, etc.).

Maybe these fall into the "for each complex problem there is a simple but wrong solution" category, but it sounds like it's worth a try.


(I already replied on a different thread but this might make more sense)

A service like Stop Forum Spam might be a solution to this. It checks the IP address and email address and gives each a score based on how likely it is to belong to a spammer.

When they have to set up a new email account and maybe even a new IP address for every few accounts, it gets to be a lot of work soon.

https://www.stopforumspam.com/

It has a very simple API and is not that hard to implement (really, I have done it myself :) )
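
For anyone curious, here is a rough sketch of that kind of check in Python (not GitLab's code; the endpoint and field names follow the public SFS usage docs as I remember them, so verify before relying on it):

    import requests

    SFS_API = "https://api.stopforumspam.org/api"

    def looks_like_spammer(ip, email, min_confidence=25.0):
        # Ask Stop Forum Spam about the signup's IP and email address.
        resp = requests.get(
            SFS_API,
            params={"ip": ip, "email": email, "json": "1"},
            timeout=5,
        )
        data = resp.json()
        if not data.get("success"):
            return False  # fail open if the service is unreachable
        for field in ("ip", "email"):
            entry = data.get(field, {})
            if entry.get("appears") and float(entry.get("confidence", 0)) >= min_confidence:
                return True
        return False

    # Call it when the registration form is submitted, e.g.
    # if looks_like_spammer(request_ip, form_email): reject_or_flag_signup()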


Appreciate the response - I'll look into it now


Okay, thank you. I see GitLab is mostly Ruby. Just to give a general idea of the code, this is a simple PHP function that uses it:

https://plugins.trac.wordpress.org/browser/gwolle-gb/trunk/f...

That function can be called when the registration form has been submitted. It will return true or false. Forget about the transient stuff, that is just WordPress caching.

You don't need an API key like with Akismet. You would only need one if you want to add or remove entries from the SFS database. It really is much simpler. Of course you might want to have a checkbox in the settings. But still, in an afternoon you might be able to finish this :)

Wish you the best.


Great suggestion, this looks like a very straightforward service and implementation. All open source as well.


I think the core of this problem is that it is hard to identify whether a user is a bot or a human. I've not seen any elegant free solutions to this.


That is not the core of the problem. Spammers are humans, and sometimes they will solve recaptchas in large quantities to get their spam through. It's about giving administrators a multi-pronged approach to stay ahead of them. For some examples of free solutions see https://www.mediawiki.org/wiki/Manual:Combating_spam. It's even possible to connect SpamAssassin to forms. GitLab needs tools and automation that detect and roll back spam, ban users, and knobs to tune restrictions and rate limits based on how spammers are acting. GitLab Inc. just hasn't seemed to care much about helping people who are trying to use GitLab and keep their software freedom.


I think the focus of our Trust and Safety team has been on GitLab.com and not on all GitLab instances. We'll discuss changing this.


Thank you.


GitLab team member here. We just added a new page to our Handbook where we share approaches to preventing, detecting and mitigating spam on self-managed instances of GitLab. https://about.gitlab.com/handbook/engineering/security/opera...

We want to hear from you! Instructions on how to contact us: https://about.gitlab.com/handbook/engineering/security/opera...


I'm curious about the spamassassin integration. Do you know of any open source projects currently using it for a web application?


I'll be curious to see whether they even use GitLab user auth. For Gerrit (and Phabricator), Wikimedia already requires contributors to have a dev account on Wikimedia's LDAP system: https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedia_...


Can you say more about how it's a problem if people can view things without logging in? Naively I would have seen that as a plus.


If you allow new users to create user profiles with links, and those user profiles are visible to Google, spammers will create a bunch of new user accounts and fill them with spam links.

The easiest way to prevent this is to block Google from seeing user profiles by requiring login to see the profiles.


> "and those user profiles are visible to Google"

googlebot adheres to robots.txt, right?

in which case couldn't self-hosted gitlab admins add a robots.txt entry for the profile page url?
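
Something like this, say (the paths are illustrative guesses rather than GitLab's actual routes, so check them against your instance):

    User-agent: *
    Disallow: /-/snippets/
    Disallow: /users/

Though as others note below, that only keeps well-behaved crawlers from indexing the spam; it doesn't stop the spam from being posted.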


That requires the spammers to notice that it's blocked in robots.txt, which seems optimistic


There is also the option of adding rel="nofollow ugc" to user-submitted links, which removes the benefits of linking for spammers.

https://support.google.com/webmasters/answer/96569
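
i.e. rendering user-submitted links roughly like this (example.com is a placeholder):

    <a href="https://example.com/spam" rel="nofollow ugc">https://example.com/spam</a>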


Most bots aren't going to bother checking if you've done that, so while they'll not get the expected benefit you'll still get the spam.


Because spammers fill the publicly viewable things-- like snippets and user profiles-- with spam. If it can be viewed without logging in, then they get it indexed with Google and it dilutes the search results.

Nobody views snippets and user profiles as a common part of daily development, so it takes time away from development to investigate those things to prune them. And if you don't prune it fast enough, it gets into the search results at which point it's even more of a pain in the ass to remove (even using Google's webmaster tools).


> Because spammers fill the publicly viewable things-- like snippets and user profiles-- with spam. If it can be viewed without logging in, then they get it indexed with Google and it dilutes the search results.

Isn't then the problem not that it is viewable but that it's not excluded from indexing by robots.txt?


Anything visible without login will be visible to people who want to follow a bare URL to your tracker and visible to a search engine crawler, making it visible to people without a URL who just search for the issue. That is, indeed, a plus.

But even if you require login to post stuff to the issue tracker, creating a login and posting a comment has been trivially automated.

You're no longer running a useful issue tracker, you're running a free ad network: you're hosting a dozen useful issues and a thousand advertisements and blackhat SEO comments for spammers.

If it's not visible to search engines, and your repo doesn't get much traffic, it's not nearly as valuable to spammers. It's kind of like cutting off your nose to spite your face, but those basic economics of cost to spammers, cost to users, value to spammers, and value to users are the only rules that you can really apply when hosting content on the Internet.


> creating a login and posting a comment has been trivially automated

Isn't this what various CAPTCHA tools handle?

You can also require email address validation.


1. Create user account

2. Create spam content (snippets, profile, etc.)

3. Get spam content search indexed

4. Profit


a free user sign up is not going to prevent scraping...


I think the idea is that if you can't view issues without logging in, then google won't index your issues (because it can't view them), so you won't get people spamming in order to get into google


Over the years I've frequently seen Google search results showing things that require login to the indexed site. Has that changed?


I believe that requires the site to give Google's bot a logged-in view rather than something Google does themselves.


It also means 95% of people will stop casually viewing issues.


The problem isn't scraping, it's spam.


Strange, I never saw this behaviour on our GitLab instance, invent.kde.org.


You seem to be using a central login system (https://identity.kde.org/) that requires going to a separate website to create an account which presumably is non-standard enough to throw off most bots.


invent.kde.org uses the nonfree Google reCAPTCHA, which mostly prevents it. Not very nice of KDE to make people run a nonfree software blob in their browser that gives up their freedom, gives up their privacy to Google, and trains Google's proprietary machine learning models.


Where does it use that?


Personally, I also think GitHub's Checks API and its GitHub bots give a much better experience compared to GitLab.

On a daily basis I am confused about how diffs are rendered in GitLab merge requests, as it has a weird way of rendering ${blah} inside strings.

Also, when you want to check if an issue exists in GitLab's own repository, you always end up in a jungle of redirected tickets. Just now I got redirected to three different tickets, as they switched projects or something. Really annoying.


Can you share an example or screenshot of what you mean around the rendering issue in merge requests?

As for the issue redirects - it does stink. It's an artifact of our move to a single code base for CE and EE [1]. A lot of issues have long-standing SEO so the "old" issue often comes up in a Google search.

[1] https://about.gitlab.com/blog/2019/08/23/a-single-codebase-f...


Yes, for example, in today's pull request, I added a new file and now in my Typescript file it renders things like this, which I find confusing: https://imgur.com/a/C82hcMK


In general settings, you can check "Public" under "Restricted visibility levels". According to the blurb, "selected levels cannot be used by non-admin users for groups, projects or snippets".

Is that not what you want with #2?


It's what I want for #2, but it has the unfortunate side-effect of restricting visibility for my main public repo. My #1 goal above is for people to be able to clone from my main repo without logging in.


I see. And as soon as you make the repo public you end up with public issues again, unless you restrict them to project members...

And a workaround of auto-syncing just the code to a public repo, with issues and other features disabled, isn't available natively in CE.


If you can, place your GitLab CE instance behind an LDAP server. Have another site handle signups. (Admittedly, setting up something with LDAP is often a massive pain. I duct-tape around it by using LdapJS on top of a CMS.)

I've had a handful of projects where human spammers will bother to create an account and jump through the hoops, but in the 2-3 years of running a GitLab instance, which has 1300 users, I've only had 2-3 incidents (we keep an eye on recent projects, snippets, etc).


The GitLab LDAP config is pretty easy.


I would encourage folks to look at Gitea.io. I run that on Kubernetes alongside Drone and it basically replicates all the most important parts of GitHub.


You'd think Wikimedia in particular has experience with the issue of spam bots polluting the site's search results.


Is this mainly a concern for the Gitlab issue tracker? Wikimedia will continue to use Phabricator for issue tracking, Gitlab CE will only be used for CI and code review/hosting...


No. Spammers will create repos and user profiles and snippets and anything they can with spam in them.


I would imagine authentication being done through Wikimedia's existing LDAP or MediaWiki solution, and I hope that features that already exist in Phabricator (such as snippets) will be disabled.


Is it possible to configure a robots.txt file to accomplish #2?


No, robots.txt is for well behaving bots like bing bot and google bot, not bots that will spam your forums (and Git repo apparently).


The suggestion, I guess, was that it doesn't stop the spam from being created, but it does stop the spam from ruining your site's reputation in search results.


But the goal is to prevent spam in the first place. I don't think these bots will verify robots.txt to see if the spamming is effective. They just spam anything they can get their hands on.


They probably don't have general code that spams any form; it's more likely that they have code specific to GitLab CE instances that knows to post snippets. If GitLab changes their default configuration so that those snippets are no longer indexed by Google, the spammers are likely to stop using that GitLab CE spamming script after a while.


But if the issue is SEO bots then robots.txt would block the search engines thus meaning the spam content is of no importance (it's effectively private) and doesn't cause issues for the main sites SEO (nor help the spammers).


To be honest, I'm not certain of the purpose of the spam.

Some portion of it would end up in the search results, sure.

But I don't know if there's some secondary benefit to, say, a casino showing a link coming from my site even if my site has a robots.txt saying that the address for that link isn't to be directly indexed.

Is there such a benefit? If not then I'll just set up the robots.txt and observe whether that does indeed solve the problem. But I'd much prefer to just set up the permissions I know I want on my own running instance than spend time making inferences about the reasons bots are abusing my instance's inputs.


That's right. I'm talking about SEO spam. Basically anything that has a url where the content includes input from the user will be spammed.

I'm fine with, say, the spammers hammering the main repo's merge requests and issue tracker. Those are things any healthy project will check regularly-- I'm even fine just pruning the spam there by hand (and historically I haven't gotten a lot there anyway).

But I don't regularly look at the global view of snippets, and I don't want to regularly prune the global user list for SEO spam in the user profiles. There's no good reason most FOSS projects need those things to be publicly viewable, anyway. But AFAICT Gitlab's admin settings only have a single setting that affects all these things across the board. So if you make snippets viewable only to logged in users, then nobody can clone from the main repo without logging in.

It's quite frustrating, and Gitlab shows no interest in disabling or hiding features like snippets and user profiles.


robots.txt is nothing more than a request. It's essentially a "no dog on lawn" sign in the yard of a vacation home.


I wonder how effective QuestyCaptcha would be on GitLab.


Generally speaking, QuestyCaptcha works great if you fly under the radar enough that nobody is putting any effort in, and it's just automated bots. It tends to fall apart the more high profile you are.


Can't you just put the gitlab instance behind an nginx proxy to achieve this? Like, if you are requesting ^/user/$, check for a cookie; if invalid, return 403
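
Something like this, as a rough sketch (untested; the profile path and GitLab's session cookie name are assumptions that would need checking against your version):

    # deny anonymous requests to user profile pages at the proxy
    location ~ ^/users/ {
        if ($cookie__gitlab_session = "") {
            return 403;
        }
        proxy_pass http://gitlab_backend;  # assumed upstream pointing at the GitLab instance
    }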


I honestly can't see why someone would go through the trouble of making sure their instance is correctly configured and available when there are solutions (like GitHub) that just work out of the box.


Control. Github can change the behavior, pricing, availability, or security of their offerings at any time. If they get hacked then you could suffer. GH is also closed source.

For many a self hosted solution is better, despite the costs.


It's not a lot of trouble. To be fair, my GitLab instance isn't set up for large numbers of public contributors. I have a somewhat limited network connection and work on projects that often have large-ish codebases, building Docker images, etc.... and I do 90% of that on my home network (local servers, storage, etc....). So running GitLab locally allows me (and a few other folks) to get all those nice features without relying on the world-facing internet connection and without lots of delays moving large files up and down...


That's the use-case I can understand (if someone has a large number of machines or a fast internal network behind a VPN). My comment was more aimed at public facing projects that are open and accept contributions from anyone.


In the past, I've found Gerrit to be reasonably good. Phabricator, on the other hand, not so much.

Having worked with MediaWiki in the past on CRs, I think this will be a good move to modernize things for them.

When faced with a similar task around the same time at Wikia (now Fandom), we chose GitHub while we were moving off of SVN. I'm glad we did at the time, even without all the additional features GitHub has.

I understand why WMF didn't choose GitHub. Compared to their current stack, Gitlab is going to feel like a serious upgrade.


What issues did you have with Phabricator? I'm maintaining phabricator for the WMF and I'm interested in anything that could improve the user experience.


I was reviewing code in Mercurial's Phabricator and it was awful. The most notable issue was that when a new version was uploaded, the comments stayed on the same line number instead of sticking to the same code.

There were other annoyances but it would move at least to "ok", maybe even "good" if that was fixed.


This is not (and has never been) the behavior of Phabricator.

See <https://secure.phabricator.com/T7447> for discussion of why this feature can never work the way you think it should work in the general case and why I believe other implementations, particularly GitHub's implementation, make the wrong tradeoffs (GitHub simply discards comments it can't find an exact matching line for).

If you believe this feature is possible to implement the way you imagine, I invite you to suggest an implementation. I am confident I can easily provide a counterexample which your implementation gets wrong (by either porting the inline forward to a line a human user would not choose, or by failing to port an inline which is still relevant forward).


I don't need perfect, I just need good. GitHub and GitLab both have good implementations, as does every other good code review system I have used. GitHub annoyingly tries its hardest to hide the "outdated" comments, but GitLab has the option to keep them open (they are no longer visible in the code, but remain on the discussion tab).

So I appreciate your opinion that it is impossible, but as a reviewer I much prefer when the tool tries.


GitHub's implementation does not do what you claim it does. GitHub has no behavior around porting and placing comments (while Phabricator does), GitHub just hides anything it can't place exactly. See my link above for a detailed description of GitHub's very simple implementation. I believe this is absolutely the wrong tradeoff.


I use GitHub every day, I've definitely seen it preserve some comments. Sure, it drops a lot. But I still prefer this to dropping them all, or showing the comments on the wrong lines.


I mean that GitHub does not "try", in the sense of looking at the interdiff, doing fuzzy matching, trying to identify line-by-line similarity, etc. It places comments only if the hunk is exactly unchanged and gives up otherwise.

Phabricator does "try", in the sense that it examines the interdiff and attempts (of course, imperfectly, because no implementation can be perfect) to track line movement across hunk mutations.

My claim is that all comments which GitHub places correctly, Phabricator also places correctly. And some comments which GitHub drops, Phabricator places correctly (on the same line a human would select)! However, some comments which GitHub drops, Phabricator places incorrectly (on a line other than the line a human would select).

So the actual implementation you prefer is not one that tries, but one that doesn't try! Phabricator could have approximately GitHub's behavior by just deleting a bunch of code.

That's perfectly fine: many other users also prefer comments be discarded rather than tracked to a possibly-wrong line, too. I strongly believe this isn't a good behavior for code review software, which is why Phabricator doesn't do it -- but Phabricator puts substantially more effort into trying to track and place comments correctly than GitHub does.


In particular, see <https://secure.phabricator.com/T7447#112231> for a specific example which I believe GitHub's implementation gets egregiously wrong, by silently discarding an inline which is highly relevant to discussing the change.


After reviewing some tools we went with Phabricator ourselves; it's not ideal, but it's open source (read: free, we can't afford a $x / seat license) and self-hosted.


Phab has been great for my side projects.

- You can host Phabricator on a $5/mo VPS and have CPU to spare, whereas GitLab Ruby is a big hog that requires a $20/mo box minimum.

- Able to deploy + configure it within a few hours, even with fancy features like emails via mailgun, using pygmentize to highlight code, and observed + managed repos. Defaults are all reasonable and get you moving quickly. Haven't had to touch config since I set it up.

- The Kanban + stories + PR flow is wholly sufficient. Arcanist grows on you fast. It totally abstracts the PR workflow for most any VCS and can help enforce practices (e.g. reviews, sign-off, merging, etc). "Projects as tags" feels weird at first but ends up giving you fantastic cross-sectional views of your issues.


Phabricator is amazing as an all-in-one solution for a small shop.


I think Phabricator is a really powerful tool for engineering teams, but when you try to do more cross-functional team collaboration, it's not as user-friendly as GitLab.

I used Phabricator at a previous company and miss some functionality, like Phabricator's ability to show issue dependencies in a more intuitive and granular way -- but at that company, we had a lot of trouble getting the Design team to use Phabricator, for example.

As OSS communities continue to onboard newcomers, they're faced with a generation that expects modern interfaces that are user-friendly. Having user-friendly tooling also helps promote diversity of OSS communities since it's easier to onboard people with all sorts of backgrounds, since the technical adoption barrier is lowered.

I think GitLab is a clear winner here since it's user friendly and designed for cross-functional team collaboration (GitLab dog foods their own product in all departments of the team, so you have HR, Marketing, Finance, etc all using it, in addition to the full product teams).

Full disclosure: I work at GitLab as the OSS Program Manager. Part of the reason I joined was because I feel really strongly about GitLab's ability to lower the contribution barrier and get more people involved in OSS.


I'm curious about this because I've wanted to get off Phab as soon as I started using it

- What are your thoughts on "arc"? It seems like a whole can of worms of problems you can run into with basic branch flows. I know teams that have complex branch flows and it is a nightmare. Same for Windows users.

- What do you use for a CI? How well does the integration work for you?

- How is it with tracking conversations on Diffs?

- Any particular plugins or bots for it that help make the difference?


@epage -- Unfortunately I can't speak to the branch flow or bots question. Perhaps someone else here can? Other answers are below.

-- Re: CI -- We dogfood GitLab CI via gitlab.com -- so no need for integrations.

GitLab non-engineering teams use CI all the time because we constantly update the handbook to document all of our work.

I would love to see this practice more often in OSS orgs, and other companies for that matter. Having a handbook-first approach (https://about.gitlab.com/company/culture/all-remote/handbook...) really helps enable remote team collaboration and makes it easier for newcomers to jump in. I think OSS orgs have done a good job of recognizing the importance of documentation for development projects, but there's an opportunity to increase documentation around workflows and community operations.

-- Re: tracking conversations on Diffs -- Admittedly, I don't have a lot of experience with this outside of GitLab. But maybe that's the point. It's easy to chime in on diffs on merge requests on GitLab, and one of my favorite features is "suggesting changes" where you can add in a suggested update to a diff and the author can choose whether or not to apply it.

Here's a link with info about suggested changes: https://docs.gitlab.com/ee/user/discussions/#suggest-changes

It's within the larger doc addressing discussions in GitLab in general: https://docs.gitlab.com/ee/user/discussions/

-- Btw, for anyone interested, here's more info on using GitLab for project management:

- https://about.gitlab.com/solutions/project-management/
- https://www.buggycoder.com/project-management-with-gitlab/
- https://thenewstack.io/gitlab-issue-board-project-management...

I gave a presentation about cross-functional team collaboration using GitLab at GNOME's GUADEC this year. Here are the slides: https://events.gnome.org/event/1/contributions/70/ .. As a program manager, I'm generally really excited about this topic!

-- Some of the features I talk about are not available as part of the Community Edition, but there's the GitLab for Open Source program which gives OSS projects access to our top tiers, plus 50K CI mins per month, for free.

I'm hoping to make the program's page more discoverable, but in the meantime, here's the link: https://about.gitlab.com/solutions/open-source/


What features do you think Gitlab lacks compared to GitHub?

I haven't used the CI/CD features of either, but PR/MR features seem comparable. Is it the advanced workflow stuff and CI/CD integration where GitHub is better? Bots?

I think git in general should copy the approach of Fossil and include issue management and wikis along with the repo, to keep things consistent and avoid vendor lock-in.

But I would be a lot more worried about being locked-in to GitHub than Gitlab.


> I think git in general should copy the approach of Fossil and include issue management and wikis along with the repo, to keep things consistent and avoid vendor lock-in.

A few paragraphs I recently wrote elsewhere:

The entire state of code forges as a general thing in 2020 is all the evidence you could possibly want that version control systems (Git, I'm talking about Git) are themselves massively deficient in design.

I rant about this all the time, but there is an entire class of argument about how & whether to use GitHub / GitLab / Gitea / Phabricator / Gerrit / sourcehut / mailing lists / whatever that would mostly vanish if the underlying data model in the de facto standard was rich enough to support the actual work of software development. Because it's not, we find ourselves in a situation where no widely used DVCS is actually distributed in practice, and the tooling around version control is subject to platform monopolization by untrustworthy actors and competitive moats.

Code review should itself be distributed/federated, but few of the people involved have incentives to make that happen. It's possible something like https://github.com/forgefed/forgefed will eventually get traction, and Git has been dominant for long enough that I wonder all the time when we might see a viable successor that learns from its fundamental mistake. In the meantime we're forced to choose from a frankly pretty terrible lot of options in the broad structural sense.

(For clarity, I'm a WMF employee and am involved in the decision to migrate to GitLab.)


I feel like the git model makes a lot of sense when viewed as an extension to the mailing list code review system. But most people don't want that model. However, trying to fit git to other models is a bit of a round peg in a slightly square hole, imo.


Yeah, from that angle and from the perspective of 2005 it's a reasonable design, and I think what I describe above as a massive deficiency only really becomes visible in the light of everything that's happened since.


To me, it sounds like the issue is that you need a central source of truth that everyone can pull from for their purposes, and distributing the code review part doesn't sound like it'll add much. In the current climate, most anyone requesting code review is probably trying to merge into the main central source of truth anyways, so what actual benefit does it bring to either the maintainers or the contributors?


Version control for a genuinely long-lived project is a problem that often outlasts:

- Dominant version control and code review system(s) / paradigms.

- The current configuration of institutional owners.

- Users' trust in an owner / sponsor / maintainer. (Forks happen for reasons.)

- The involvement of developers who remember why and how decisions were made.

- The trustworthiness of the entities that control services, applications, and network real estate used for development.

Some central source of truth is usually necessary, but maintainers and contributors don't benefit when that source of truth is subject to vendor lock-in or can otherwise only migrate at great cost. For all the collaborative benefit that GitHub has undeniably wrought, platform monopolies are eventually a failure mode for end users, at least as for-profit enterprises. With the exception of the dominant silo vendors, nobody in the ecosystem really benefits from being forced to choose a silo that will be hard (and lossy) to escape later. The silos are engineered to limit mobility and channel interoperability to their own ends, for business reasons that run directly contrary to the interests of their users.

If the protocol at hand were actually up to the task, we'd spend less effort and anxiety on the problems of all the non-protocol platform tooling that's been built up around it.


> an entire class of argument [...] mostly vanish if the underlying data model in the de facto standard was rich enough to support the actual work of software development.

Interesting idea. You think we could develop a unified data model that covers source code, static files, documentation, project management and community management as a single unified thing?

That’s certainly ambitious, and I’d love to see it. For the moment it seem that Git has won for source code (in a pretty crowded field) because just that part was hard and it was a big improvement. The collaboration tools it includes, mostly around email, appear to be inadequate for most projects. So now we see a healthy ecosystem that adds rich collaboration on top of / next to Git.

> no widely used DVCS is actually distributed in practice

I think this is due to economic and social factors rather than technical ones. Fully distributing a Git repo is very doable, but harder to think about than the Github model. Plus you have all the normal P2P problems around who’s online and how good their connection is.

> tooling around version control is subject to platform monopolization

Again, I think this is simply the social network effect more than anything else. Making a website for your project let’s people find it, use it, and contribute to it. The bar to entry is lowered further if it’s a common platform, where people already have accounts and know how it works, and where they can get a consolidated view of all their activity.

Centralized hosting makes even more sense as projects grow and you only want a subset of the code on any given development machine. Eventually big monorepos present serious scaling challenges.

Still... I completely agree that it would be awesome to have a more self-sovereign computing architecture writ large. I’m just pessimistic we can get there from here.


> You think we could develop a unified data model that covers source code, static files, documentation, project management and community management as a single unified thing?

Realistically, not exactly, given how much space some of those things cover.

I do think that entities like code review are as much a part of the history of a project as the deltas to code. Reviews not being first-class objects in the VCS itself has turned out to be a crack into which you can wedge an entire GitHub.

I won't claim I know where best to draw the line here. Better handling of large static files by default and a robust way to model relationships between projects obviously belong within the VCS. On the other hand, relationships modeled in issue tracking systems and the like are also part of the software's history, but past some level of complexity it gets much harder to imagine wedging them into something that you can pass around like you clone a Git repo. All I can really say for sure is that it feels broken that all of this stuff lives in competing application silos.

(As a sidebar: Not that you can't jam things like review data into git-as-data-store. Gerrit does just that. But nobody's going to mistake that for a usable interface to code review.)

Anyhow, I don't think you're wrong about the social & economic factors, but I think a different landscape with less concentration of power could have shaken out if (for example) easy code review had been baked in and host-agnostic early on. Fully p2p architectures aren't feasible, or even necessarily desirable, for a lot of problems - but it shouldn't be too much to ask that things are able to be federated and resistant to capture by a single vendor.

> Still... I completely agree that it would be awesome to have a more self-sovereign computing architecture writ large. I’m just pessimistic we can get there from here.

Yeah, fair enough. I am myself boundlessly pessimistic about the future of computing generally.


> I think git in general should copy the approach of Fossil and include issue management and wikis along with the repo, to keep things consistent and avoid vendor lock-in.

It does include git send-email, and I think Sourcehut’s use of that for issues is nice (and they quote customer claims that “SourceHut mailing lists are the best thing since the invention of reviewing patches.”).


Let's not kid ourselves here, it's because GH is owned by MS.


It's not. This isn't 2000s /. "M$ is teh evil!!!11".

It's because GH is not available as self-hosted open source. Doesn't matter who owns it. GitHub was discussed and rejected by Wikimedia back in 2012 as well, which was before MS bought them.


Gitlab is worse on almost every angle compared to Github

It simply lacks the attention to detail...you can tell that Github walks the extra mile to get the UX right.

We used Gitlab for a year and then migrated to Github...it’s a joy!


Having used both a fair bit, I don't know what you're talking about. If anything my experience has been the opposite. Gitlab had the second mover advantage on a few things, while Github's interface has some weird oddities that seem to stem from the fact that that's how they've always been.


Interesting... we used GHE for 7 years and have now switched to gitlab. Gitlab CI, container repos, and Kubernetes integration has been amazing.


WMF is only replacing Gerrit for now, Phabricator will continue to be the issue tracker.


This is sad. In my experience, Gerrit is a much better code review system than Gitlab merge requests. But it is different from what people are used to.


Probably because you got used to Critique at Google ;)

I agree though. I think the most important thing in a code review system is inline comments in the diff itself, and that’s something you get from Gerrit, Phabricator (Differential), etc. It encourages people to discuss the particulars of a diff. Merge approval can be made contingent on resolving minor issues within a diff. Diffs are also approved on a per-diff basis, and it’s less typical to merge a stack of diffs.

I think the pull request / merge request makes sense with the “trusted lieutenants” development model that the Linux kernel uses, but for other projects you would be more likely to want a work flow where someone submits a single commit/diff and then someone approves it (after comments).

When I review PRs on services like GitHub I very often think, “This should be several different reviews” and the discussion thread in a PR is often not a high-quality discussion. I don’t use GitLab as much but my experience is that it has the same problems. What I would love is to review a stack of commits and approve / make comments on the commits individually.

(For those reading: Mondrian -> Rietveld -> Gerrit, and also Mondrian -> Critique. Mondrian and Critique are internal tools at Google. Phabricator originated at Facebook, which has a lot of ex-Google engineers on staff.)


I don't use Github much, but Gitlab allows for multiple threads in a merge request. These threads may reference diffs/commits, but can also be directed at the merge request in general. Each thread has to be explicitly resolved before merging.


I think the pull request model still makes sense. Of course, if you stick to small changes it tends towards the patch model. However, there are still some cases where two or three commits at once make sense. Even rarer, there are cases where merging a bunch of changes into a "staging" branch before merging to master makes sense. I think this added flexibility is valuable; of course, keeping the "patch style" single-commit review great should probably be the priority.


Do you think that gitlab will ever add inline diff comments? I don't know if it would even be feasible to add to gitlab


More inline than comments on the changes in the MR? Like: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/2...

(there's some old discussion on there, random example)


I guess you can get the hang of Gerrit, but I tried it out a couple of times (I occasionally do work on projects in the Wikimedia git repo) and it's pretty evident that the interface is written by developers with little knowledge of user experience. Moving stuff to GitLab will probably increase the number of volunteer contributions; at least I'm more interested in contributing now.


I really don't mind the Gerrit UX - it seems to be optimized for daily use by programmers, not for onboarding speed. That's a tradeoff I'm very much okay with.


Gerrit is so off-putting to new users that many never get over the learning curve. At Wikimedia we want to be welcoming to new contributors. We also have to consider the on-boarding experience of new staff members, as well as the productivity of staff and long-time contributors. Gerrit satisfies some people who have used it for a long time, but it is almost universally disliked by newcomers. When your users are volunteers, you can't force them to use Gerrit until they get used to it. If the experience is bad enough then they will choose to spend their time on something else instead.


I'm personally somewhat doubtful that that is really the aspect of the new dev experience at wikimedia that is actually turning off newbies.


Curious, did you think of moving also the repos and reviews to Phabricator, as I recall WMF using that for task management?


You're okay with the tool being difficult to use because you already know how to use it.


I'm okay with it being slightly more difficult to get started with, in return for higher productivity in the long run, yes. Right now, after spending a similar amount of time both with GitHub PRs (and Gitlab MRs) and Gerrit, I still find Gerrit much easier and faster to use.


I'm generally okay with that tradeoff too, but we tried it early on at our company and the juice was not worth the squeeze, at least in our case.

Designers and even many developers found it essentially impossible to use and the developers who were reasonably comfortable with it spent way too much time assisting others in attempting to use it.

(fwiw I found myself somewhere in the middle - I like the model and understood the ideas but also found it annoying to work with in practice)


Same. Cool idea in concept. Not something I have enough time to be interested in using heavily.


I'm okay with Vim being slightly harder to learn to use than VS Code. A tougher learning curve in exchange for more powerful tools can be a good tradeoff.


I think this is a good comparison, and exactly shows the problem: Vim is only used by a minority of developers, the majority use some kind of graphical editor (like VS Code). That doesn't mean learning Vim isn't a good tradeoff, it's just not a good tradeoff for the majority of people.


I already know how to use it and I hate how proprietary it is, I have so many other problems on my plate than customizing my Git to work a certain way. I like using the off the shelf tools that work nicely with the normal git workflows using temporary branches for MRs. And a nice UI that anyone can use with minimal effort or training.


Difficult to learn != difficult to use.


I agree with you - for tools I use on a daily basis.

However, since my interests vary considerably, and therefore I dabble with lots of different tools, the difficult-to-learn tools never get enough traction in my limited human memory to get me to the easy-to-use stage.

If a community doesn't want to engage occasional users, it's probably fine (maybe even desirable) to have a higher barrier to entry to make daily use really fast.

If a community benefits meaningfully from occasional users, a high learning barrier may not be a good thing.


"Because it's hard" is a bad reason to shy away from something.


No it's not; you have limited time, and devoting X hours to gain back X/10 hours' worth of productivity in the future is a bad investment. Don't do something hard for the sake of doing it unless the gains outweigh the cost.

https://xkcd.com/1205/


Software should be written for users, not non-users. You'd think this would be self-evident and yet here we are.


How do you then convert a non-user into a user with the least friction?


Force. I for one never looked at Gerrit and thought "I should push this at my employer". I'll probably never use it unless I'm forced to.


Gerrit's UX is full of bugs that get in the way of daily use (maybe improved over time). Things like overriding ctrl-f in the browser but then having the overridden search bar not work, or not being able to effectively type inline comments on mobile.

I don't really think the choices they did make, the ones that work, are really any better for optimized daily use than intuitive choices would have been.

(Yes probably much of this is fixed)


The git-review [1] tool makes it trivial to interface with Gerrit (it's likely packaged for your distro, see [2]). I've found many people struggling with Gerrit don't know about it and it has made their life significantly easier. It handles all the magic of pushing to refs so that you never need to know about it. You drop a .gitreview in your project and then your work-flow is literally

    $ git checkout -b my-feature-branch
    $ emacs ... <edit edit edit>
    $ git add -i ...
    $ git commit
    $ git review
     read reviews, edit
    $ git add -i ...
    $ git commit --amend
    $ git review # push new change revisions
You can download an upstream change to use locally with "git review -d 123456"

[1] https://docs.openstack.org/infra/git-review/ [2] https://www.mediawiki.org/wiki/Gerrit/git-review


Honestly, I find git-review much more annoying than just memorizing `git push origin HEAD:refs/for/master` (if that's too much, it's easy to create a normal alias), and the Gerrit web interface gives you the command to download a specific changeset. git-review tends to break in unclear ways and sometimes does things other than what I expect it to.
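
For example, a one-time alias covers it (the name push-review is made up here, chosen so it doesn't shadow the git-review tool):

    $ git config --global alias.push-review 'push origin HEAD:refs/for/master'
    $ git push-review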


> Moving stuff to Gitlab will probably increase the number of volunteer contributions, at least i'm more interested in contributing more now.

Is there a project out there that originally used something like Gerrit, Phabricator, Reviewboard, or a mailing list that moved to Gitlab or Github where the number of contributions increased after the change?


What advantages do you see in Gerrit? Do they require a lot of experience in order to be realized?


Clear mapping of change request == commit, allowing for easy building of multiple in-flight change requests, rearranging them with git rebase, and updating with a push to refs/for/master. Easy diffing between states of the CR (patch sets), so you can see what changed since the last round of comments, even if it spanned multiple updates. This is the feature I miss the most from other code review systems - being able to easily work on another commit that builds on one that I just sent out for review (even starting review on the new one while the parent still hasn't finished being reviewed!), and rebasing my current change as the parent change gets reviewed/updated.

Possibility to send out a PR for review using just a push (change message in commit, push to refs/for/master%r=foo).
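
For example (the reviewer address and topic name are just placeholders):

    $ git push origin HEAD:refs/for/master%r=alice@example.com,topic=my-feature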

Snappy and compact code review experience (no space wasted for whitespace, avatars, pretty buttons). Full coverage with keyboard shortcuts.

Powerful review rule system based in Prolog, allowing for things like code owners, experimental subdirectories without the need for review, etc.


> Clear mapping of change request == commit,

I was forced to use Gerrit by a client. I could never get the hang of this; I like to do frequent commits on short-lived branches, using vanilla git. I never wanted any more features other than a nice UI to encourage people to review.


The nice UI for review works well exactly because it limits the functionality available to users and enforces a particular commit model. If you don't do that, you get the code review mess that are Github/GitLab PRs/MRs - difficult to tell apart how commits relate to the change and how it progresses through review, because the entire branch history is free-form.


It's possible to allow users to have multiple commits but still show review between the submitted "tip" commits. If you want you don't even need to show the other commits in the UI.


One thing one has to get away from is the idea that one change request == one issue. Multiple small commits as separate change requests for one story are fine as long as they work standalone (which is generally a good idea anyway, e.g. to enable bisecting).


I much prefer the model of force-pushing your development branch to create change sets. It lets you more easily see how development evolves in response to feedback. And the final state of the branch which gets merged leaves all the in-progress work that no one cares about behind only in Gerrit.

With Github/lab's model, if you force push your PR, you lose the ability to view its previous state and diff against that. Alternately, if you just keep adding commits, then the final branch that gets merged (unless you squash) has all the in-progress work which pollutes the repo's history.

Gerrit also has a finer grained permission model, but I don't care as much about that.

Gerrit definitely expects the user to understand how git works conceptually a bit more than Github/lab.


> With Github/lab's model, if you force push your PR, you lose the ability to view its previous state and diff against that.

That's not quite true. Gitlab lets you compare any two "versions" of the force pushed branch.


Thanks. I haven’t used gitlab. I assumed it worked the same as GitHub. That’s good to know.


Yup, GitLab works as expected here. It's always surprised me how quickly the old commit is garbage-collected when you force-push a branch on GitHub. It causes weird errors in CI runs and breaks viewing the old commits.

Seriously I'll pay for those couple of KiB of space, just keep it around. (at least until the PR is closed)


Gitlab DOES let you compare different versions of changesets in a merge request: https://docs.gitlab.com/ee/user/project/merge_requests/versi...


> With Github/lab's model, if you force push your PR, you lose the ability to view its previous state and diff against that.

I'm not sure about Gitlab, but Github has recently added a feature where you can view the diff between the old branch head and new one. But, as far as I'm aware, there's no way to check out the previous branch head from the repo due to a lack of a remote branch pointing to it.

At least git itself provides a range-diff command that allows you do see a diff between the commits between two versions of a given branch.
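
For example, to compare a branch before and after a rewrite, while the old tip is still in the reflog:

    $ git range-diff master my-feature@{1} my-feature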


But you don't force push with Gerrit?

Just update your change set, then push to refs/for/<branch_name> again.


It's been a while since I used it, but I guess the more important point is that it's a rebase-based workflow.

My memory was you had to force push your working branch to refs/for. Thank you for the correction.

I've actually set up and run instances of it at two companies, but as I say, it's been a while.


Yes, I very much agree: the rebase-based workflow is what makes Gerrit superior to all other systems I've tried (of which I find Github and Bitbucket particularly loathsome).

I felt I needed to correct you because with Gerrit you reserve the concept of force pushing for exceptional cases, which I think is the correct mental model. Force pushing should not be done frivolously.


It's effectively similar to a force-push.


I would say it's quite different, since you don't overwrite anything. Old patch sets are still available should you wish to roll back/reference either a particular commit, or a whole chain of commits.


It is a force push under the hood. You’re removing an old commit and creating a new one.


It's not a force push under the hood. Under the hood a new branch is created (first patch set is /refs/changes/nnnn/0, second is /refs/changes/nnnn/1, etc.). From the end user perspective the result is quite plausibly similar to force push.


The branch model must have changed. I used to push directly to a branch that was the change number (not the revision number) and of course had to force because it was the same.

So what happens when you push to changes/nnnn/12 when revision 11 hasn’t been created?


At least the way our installation works we don't push to changes/nnnn/12, rather changes are pushed to refs/for/{branch} (often refs/for/master). This endpoint isn't an actual branch. It triggers gerrit to look up the nnnn for the Change-Id in the commit message and create the appropriate branch for the next patch set.


It's depressing how git tooling has to squish out all the flexibility that makes git attractive in the name of being unsurprising to someone who doesn't understand git.

I was excited to finally bring in git as we start ramping down on a legacy project onto something new. Then I started thinking about the developers that have never touched git and I need to support. I looked at the tools available and what workflows they dictate. Then there's the drive to do something similar to the rest of the company, autonomy only goes so far without a good reason.

Fuck me. I'm going to pick GitLab and hate it.


I had to use Gerrit in a previous job, and _hated_ it. The UX is abysmal. Some folks loved it though, especially engineers working mostly on the backend. People with more of a frontend focus couldn't get past the awful user experience.


Can confirm, it's extremely off-putting to say the least. Having previously used Github, Gitlab, Bitbucket I find Gerrit very unusual. I have to Google on how to do basic stuff.


Why is it sad? They've identified a better experience for their target audience, new community devs; it's just perhaps not better for you, haha.


I've been using Gitlab and honestly it's pretty awesome; using a mono-repo setup and everything just works. The integration of CI jobs is also excellent.


It was interesting to read their reasons for not using GitHub, especially related to having no control over bans or sanctions. Microsoft could pull the rug out from under them at any time if they got pressure from somebody like China.


Interesting to see China used as the example. GitHub recently took action on DMCA requests. Meanwhile, GitHub has been actively used as a safe haven from Chinese censors when it came to the 996 protests and COVID information.


For sure, Github should be lauded for its policy towards China, they have absolutely done the right thing here. Even so, there's no guarantee that that would last forever, so it makes sense for a company like Wikimedia to host this themselves.


[flagged]


I downvoted you because it wasn't the government that filed the takedown. Perhaps you mean "a private company using laws enacted by the US government?"


It's naive to believe that Microsoft would stand up to the CCP and not the US government.


[flagged]


> It's more naive to treat the CCP like some kind of horrible boogeyman

This is a valid point some of the time, but I don't think it really applies here. First off, Wikipedia has had problems with censorship by various states, China being the most notable by a long shot (https://en.wikipedia.org/wiki/Censorship_of_Wikipedia).

It's very much true that the US government (or some other non-China state) could be a threat to Wikipedia in the future, and I'm sure the folks at Wikimedia are aware of that too.


I'm not sure how GitLab is a better option. China could pressure them too.


GitLab is open source and can be self-hosted by Wikimedia.


They're self-hosting.


I was able to see Wikipedia uses something called openstack, but not the details of what infrastructure they host on. Anyone know what facilities these repos will ultimately be served from?


OpenStack is used for Wikimedia Cloud, which is a project to give volunteers compute resources for cool projects (this includes the domains wmcloud.org, toolserver.org and wmflabs.org). Production does not use OpenStack.

Things like code review tools would be hosted in Virginia [eqiad] (with a backup in Texas [codfw]) on hardware owned by the Wikimedia Foundation.

Docs about how gerrit is hosted https://wikitech.wikimedia.org/wiki/Gerrit if you want to know nitty gritty details see also https://github.com/wikimedia/puppet


Wikimedia uses colocated boxes around the world at different providers, and the user facing stuff is backed by Cloudflare.

https://meta.wikimedia.org/wiki/Wikimedia_servers


The user-facing stuff is not backed by Cloudflare. (Cloudflare's Magic Transit has at some points been used to mitigate DDoS attacks at the IP layer, but otherwise Cloudflare is not used. Wikimedia operates its own Varnish servers in Virginia, Texas, San Francisco, Singapore and Amsterdam to do frontend caching.)


Is this GitLab open source, or is it GitLab enterprise on the free plan?

Because they are different products, even if the GitLab marketing pages make it appear as if they are the same.


Community Edition.


Oh I get it now thank you. I had read only the outcome part.


You can self host GitHub too


It's not open source, at best they ship you a locked down VM image to run yourself.

Edit: Confirmed, "GitHub Enterprise is delivered as a virtual appliance that includes all software required to get up and running. The only additional software required is a compatible virtual machine environment." https://enterprise.github.com/faq


This is not true, they publish the code at https://gitlab.com/gitlab-org/gitlab.

If you want to use premium features, you do have to pay to unlock them, and depending on your deployment you would need to get the gitlab-ee image instead of the gitlab-ce one.


We're talking about GitHub.



Which costs $21/user/month versus $0 for the free plan and $4 for the team plan. That's a steep price increase.


If you pay the enterprise fee, which is likely an X amount per seat. They MIGHT give it out for free to Wikimedia as a favor / goodwill, but there will be strings attached.

(I'm not saying Gitlab doesn't have strings attached)


Source? I don't think that's true. Github has on-premises enterprise solutions but one should not confuse that with running free and open software on your own machines - https://enterprise.github.com/faq


I personally don't think anyone would confuse it, but the fact remains, you can self-host GitHub. Maybe it costs, maybe it has strings attached, but ultimately the functionality is there... which makes it true.


Gitlab has no business in China.


It's a shame that Gerrit doesn't get more TLC. Now that GitHub is owned by Microsoft, it would be a good way for Google to combat the ongoing centralization of open source.

Fundamentally, the Gerrit changes model is superior to the pull-request model. I also really like how Gerrit stores all its information (comments, settings and everything else) in git repos.

The issue is everything that surrounds it, speaking as somebody who maintains an instance for a customer. The UI has all sorts of UX problems and takes a while to get into. The notification system is super noisy by default; we had to build a system on top and it's still not as good as GitHub's (which is also bad in its own right).


I'll generally support using gitlab over github on principle, but have found gitlab to be basically useless without javascript enabled while github still kind of works at least for read-only visiting of shared github URLs. Whenever someone shares a gitlab URL with me, it's very rare I can make any use of it without enabling js, it's quite annoying.


If a simple page that works without JavaScript enabled is of interest to you, you might want to check out sourcehut.


If I could compel everyone migrating to gitlab to instead use sourcehut, I would.

Alas this isn't about my personal choice of git hosting, I already have my needs covered with a dedicated server.


One thing stopping me from moving my company's code to GitLab's cloud offering is that the storage sizes for repos are extremely small. I heard from a rep that this will change in November. I'm wondering if this purchase-more-storage change relates to that?


Wikimedia will self-host Gitlab. They can use whatever limits they want.


They are still very limited when it comes to functionality though - only the first column in https://about.gitlab.com/pricing/self-managed/feature-compar... , right?


They probably use the free Open Source option, which gives access to the top tiers for free.

https://about.gitlab.com/solutions/open-source/


That means they are running non-free code, which seems to be one of their main points against Github.


It's community edition.


The community edition doesn't have any tiers or features, those are only in the Enterprise Edition.


This requires asking for a specific number of license seats for your open source project and that's impossible to work with. Every possible user account/contributor takes up a license. How many contributors will I have tomorrow? Don't know. How many spam accounts will I have that waste license seats? Too many, and they're impossible to clean up.


All of those are self-hosted.

"Core" is the free community edition. The others are not open source/free and require payment.

Core is plenty for most use cases. That table makes it look like it has almost no features, but most items in the list are either advanced or pretty niche.


Right, so is Wikimedia going to pay $$$ to GitLab for one of the more advanced licenses?


According to their pricing page: "We provide free Gold and Ultimate licenses to qualifying open source projects and educational institutions. Find out more by visiting our GitLab for Open Source and GitLab for Education program pages."

I'm sure Wikimedia has the dosh to pay for licenses themselves, but it's hard to see how the per/user pricing model would work for any open source project.


No, Wikimedia is going to be using the Community Edition (CE) of GitLab, which is free and open source under an MIT license. This decision and the reasons for it are described in more detail in the FAQ section of the linked article.


My understanding is you can self host and pay a sub and get access to all the other features. I may be wrong.


That's correct - with the GitLab Enterprise Edition you can self-host GitLab and get access to all of the features - both those from the Core open source version as well as our proprietary features.


Yes, but that's only if you self-host the Enterprise Edition which contains non-free code, not the Community Edition which only contains free code.


The limit is 10GB, while GitHub's limit is "ideally less than 1 GB, and less than 5 GB is strongly recommended".

I don't consider 10GB to be "extremely small", especially since that's a larger than usual limit for a hosted solution.


You can buy tons more space for Git LFS on GitHub. 10GB seems to be a hard limit on Gitlab.


The repository storage size is 10GB [1]. I wouldn't consider that "extremely small".

[1]: https://docs.gitlab.com/ee/user/gitlab_com/index.html#accoun...


That's for container registry, packages, code, artifacts, everything. I have a single project in my monorepo that produces a 512MB binary file and stores it as an asset. In 20 CI runs, assuming there was 0 code in repo or anywhere else, we'd use up the entire budget. We make more than 20 commits/day.
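(Rough arithmetic: 20 runs × 512 MB ≈ 10 GB, so artifact retention alone would exhaust the quota before any code, packages, or images are counted.)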


I don't know about those other features, but that definitely does not include the container registry.


They're hosting Gitlab themselves.


Where can one find the repo sizes?


"This raises the question: if Gerrit has identifiable problems, why can't we solve those problems in Gerrit? Gerrit is open source (Apache licensed) software; modifications are a simple matter of programming."

... And nuclear power is a simple matter of splitting atoms.


Things move so fast! git was initially developed on the back of a fag (cigarette) packet by Mr Torvalds and his mates for Linux code development. It seems to work quite well because we have a lot of gits these days.

github or lab or gerrit? I don't have a favourite, but I did notice a lot more coloured lines with gerrit stuff and a feeling of bewilderment. However, I also felt like the adults seemed to know what was going on.

Let's see how this pans out. It'll probably be fine but I suspect we will lose something by creeping towards the "mainstream" and ignoring diversity. That sounds a bit odd for a tool designed for Linux 8)


I need clarification: is this Mediawiki or Wikimedia moving to gitlab? It seems the Wikimedia foundation is moving their code, so why is this on mediawiki.org?


> so why is this on mediawiki.org?

Probably because mediawiki is the main Wikimedia software project.


I'd be interested in the CI aspects of this transition, which seem to be glossed over.

The combination of Gerrit with Jenkins and Jenkins Job Builder blows everything else I've seen out of the water with how easy it is to make both per-patchset and post-integration changes in an infrastructure-as-code manner across multiple repos, once you get over the learning curve.
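For a sense of the shape of that, a minimal Jenkins Job Builder sketch might look roughly like this (project name, job name, and report path are all hypothetical):

    - job-template:
        name: '{name}-unit-tests'
        triggers:
          # run the job on every new patchset uploaded to Gerrit
          - gerrit:
              trigger-on:
                - patchset-created-event
              projects:
                - project-compare-type: PLAIN
                  project-pattern: '{name}'
                  branches:
                    - branch-compare-type: PLAIN
                      branch-pattern: 'master'
        builders:
          - shell: 'make test'
        publishers:
          - junit:
              results: 'results.xml'

    - project:
        name: example-project
        jobs:
          - '{name}-unit-tests'

The same template can then be attached to any number of repos simply by listing it under additional project entries, which is what makes the infrastructure-as-code aspect so pleasant.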


We recently conducted an extensive evaluation of CI options[0], concluding at the time that GitLab could meet our needs but didn't make a whole lot of practical sense unless we also migrated to it for code review. Other considerations (Gerrit user experience and sustainability, onboarding costs, a de facto migration of many projects from our Gerrit instance to GitHub, etc.) led us to re-evaluate whether a migration for code review would make sense, and that's what the decision linked here addresses.

I am on the team that maintains our existing CI system[1] (Zuul, Jenkins, JJB), though I mostly work on other things. While this system is certainly quite powerful, I would not personally describe it as easy. We have a lot of work in front of us in migrating it to GitLab, but so far I've found the experience there quite a bit more pleasant than grepping through JJB definitions and the like.

At any rate, if you're interested in how all of this pans out, we will as ever be doing the work in a very public fashion.

[0]. https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering...

[1]. https://www.mediawiki.org/wiki/Continuous_integration


To clarify, is your GitLab-based job system done with configuration-as-code and does it share definitions of jobs across repos?

The solution we came up with using Gerrit/Jenkins was to have a common test invocation (in our case `make test`) that glossed over all the details of a project's build process and was expected to output test results and coverage in specific formats Jenkins could consume (JUnit/xUnit and Cobertura). We have jobs that run `make test` no matter if the code is C, Go, Python, JavaScript, etc.

This also had the beneficial side effect of lowering the barrier to entry for someone working on any random project - make papers over all those toolchain differences.
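A minimal sketch of that convention (purely illustrative, assuming a Python repo using pytest and pytest-cov; other languages would fill in the same target with their own tools):

    # Every repo exposes the same entry point; CI only ever runs `make test`
    # and collects results.xml (JUnit) and coverage.xml (Cobertura-compatible).
    .PHONY: test
    test:
    	pytest --junitxml=results.xml \
    	       --cov=. --cov-report=xml:coverage.xml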


> To clarify, is your GitLab-based job system done with configuration-as-code and does it share definitions of jobs across repos?

We announced our decision to migrate to GitLab on Monday, so we don't exactly have a GitLab-based job system yet.

Nevertheless, yes, GitLab CI jobs will be defined by files checked into version control, and we'll reuse things where appropriate.


Google's non-free JavaScript reCAPTCHA is an absolute showstopper! At least roll out a custom-made solution if nothing else.


We are not going to use reCAPTCHA.[0]

[0]. https://www.mediawiki.org/wiki/GitLab_consultation/Discussio...


Would it make sense to implement a peer-to-peer protocol for sharing git repositories, similar to torrenting, with the goal of overcoming these kinds of issues?


Since Microsoft is pro open-source now, they probably should move too.


why is this news


Is the deal with Wikimedia the reason GitLab blocked users from Iran?


Microsoft will slowly kill github


Depends. MS used to have its own code hosting service (Codeplex I believe), and it wasn't that bad, but Github was more "social".

I think it's mostly a question of how independent Github will be under MS. Now to be fair, MS has good services. Azure is nice, and office 365 is pretty good.


They might make money off of it, but I believe GP was referring to github's prominence as the home of so many free software projects.


A company moves their code repository to a different provider. Honest question: why is this newsworthy? Am I missing something?


Wikimedia is very cautious in making changes because they take very seriously the values of sustainability and predictability.

That they are moving from Gerrit to Gitlab is a blow against Gerrit and a boon for Gitlab (assuming it goes well).


I don’t know about Wikimedia being super cautious when adopting technologies... I host a MediaWiki instance and there’s been a lot of “not so cautious” tech decisions in the past. Jumping early on the HHVM train (which they eventually had to leave); adopting Lua for wiki modules; developing Parsoid as a Node service (now rewritten in PHP)... None of these was the “safe option”; in some cases it worked out well but in others it didn’t.


None of that was forced on anyone, though. Everything you mention are purely optional components. I think it makes sense for them to explore new technologies like that while still keeping their general requirements conservative.


Yeah, but those components were heavily used on Wikipedia and other Wikimedia websites, and some - especially Lua modules - are fundamental once the wiki grows to a certain size, since wikicode templates are quite limited and suffer from performance problems.


What's wrong with HHVM? Honestly asking, I have no real context beyond knowing it is a FB invention


HHVM got very little community adoption and never had strong promises to stay compatible with the rest of the PHP ecosystem.


Thanks. I'm not a developer so I guess it's relevant, just not for me.


It helps to follow trends. For a while I thought Gitlab was part of GitHub.

When I see enough news about something on HN, I look into it.

I even did a little Rust tutorial recently.

Buzz makes a difference


^F gittorrent.

It is like working in a big corporation. Eventually you will do or say something that won't be liked by a person in power or a competitor, and you will be cancelled. Community and publicly supported projects, and free speech, are only safe from influence on community infrastructure. It's time to start decentralizing.


Canceled by whom? They are self-hosting their own instance of Gitlab CE. If Gitlab the company disappears tomorrow, they can fork the project and continue using it.


I never checked out gitlab, but the name always made it seem like a github copycat. If you want to differentiate your product, why not choose a name that sounds different from your competitor's?

And now that I've finally gone and browsed some repositories, it's clear that they were very much "inspired" by Github's design; it's practically a github clone, at least in that regard (although something is a bit off in the smoothness).

Maybe one day I'll find myself in a similar position and see things differently but this blatant copying always seemed ridiculous to the point where I would feel ashamed to lead this type of strategy.


I don't want to sound pedantic but this comment makes you look extremely ignorant and unaware of this space and its solutions, use cases, etc. Basically you sound completely oblivious of what's going on here.

If today was the first day that you saw GitLab, perhaps you don't have the domain knowledge to make any of these claims. Nothing you said here makes sense. The Git prefix in the name is just a technicality, the same way I could create an operating system called YeahOS and everyone would understand that OS is a suffix that blends into the branding.

But what makes you sound particularly ignorant is calling GitLab a clone of GitHub. They are both git platforms, that's it. By that metric BitBucket, Gitea, Gogs, CodeCommit, Phabricator, etc., are all clones of GitHub.

If anything, recent GitHub functionalities like GitHub Actions are a clone of mature GitLab functionality like GitLab's CI.


>the same way I could create an operating system called YeahOS and everyone would understand that OS is a suffix that blends into the branding

Yet if you named it myOS and copied the look and feel from iOS everyone could see that you created an iOS clone.

That's the only point I was making, but it seems that many feelings were hurt in the process.


It's all fair. Looking back at my comment I sounded a little bit harsh and I apologize for that.

What I was trying to establish is that there's really no solid ground to claim that any Git platform is a copy of another one since they are all essentially productivity and team-work wrappers around Git.

If you're talking about the general information architecture that's how SourceForge was even before Git existed, so hardly an original idea.


>no solid ground to claim that any Git platform is a copy of another

I think that's fairly obvious. I guess I didn't explain clearly enough the point I was trying to make in the original post. I kind of regret writing it now.


I think it's fair to say that a few years ago GitLab was very much inspired by Github's design. However the project has had a focus on adding additional, tightly-integrated features and I'd say in the last couple of years Github has been more inspired by GitLab than the other way around.


GitHub and GitLab, Gitea etc are all centred around git, so including "Git" in the name seems like an obvious, sensible even, idea.

Personally I don't think the GitLab UI looks that similar to GitHub's, and to me it looks and feels kind of clunky, and lacks contrast.


The important difference is that GitLab is open source! That's a bit like criticizing LibreOffice for "copying" MS Office and using the word "Office" in its name.


Hi, Developer Evangelist at GitLab here.

GitLab Core is open source which is the base for the Community Edition. The paid tiers are based on Core and add Enterprise licensed proprietary code, following our open core business model.

More about our Open Source stewardship: https://about.gitlab.com/company/stewardship/


You could not be farther from the truth. GitLab is an improvement on GitHub in every way; they offer a suite of features that GitHub doesn't have natively and only offers through third-party integrations.

Gitlab is not a Github clone.


A few years ago, Gitlab was indeed mostly a clone of Github. The thing is, over those few years I personally think Gitlab became far superior to Github as a "turn key" solution for managing your software. Now, Github is the one "stealing" ideas from Gitlab, with, for example, their Actions.


That two services based on git have git in their names seems reasonable. That's the only commonality... "hub" and "lab" are both one syllable, but very different in sound and connotation.


Both are just fancy wrappers and tools built around Git. It's not surprising that one looks like the other. Given historic exoduses from one, it's only logical the other would build a similar toolset with the intention of allowing for easy migrations between platforms.

While similar, the companies differ in their approaches, and the platforms differ in their pricing (both having impacted the other, which has ultimately helped consumers).

Your argument would hold true of any word processor. I think it's an unfair assessment of two products targeting the same specific users.


People wanted a decentralized GitHub. That's why they built it like GitHub, except decentralized.


If you think Gitlab’s a Github clone you must not have seen Gitea.


I agree that Gitea looks a lot more like Github (but with a dark theme!), and I think this is a good thing. I also like Gitea because it doesn't have reCaptcha or other proprietary components, but it doesn't seem to have the advanced ('Enterprise') features that Gitlab does (I program as a hobby so I've never needed these). When Gitea has grown and is more accepted, I hope the Wikimedia Foundation will consider using it.



