I always like to see how an API is used in real projects. Sadly, GitHub search is mostly useless for this because of the number of duplicates. Google Code Search was great; it even supported regexps. Then there was koders.com, and now there's also something from Ohloh, which is better than GitHub, AFAIR.
EDIT: Ohloh became Open Hub, and now the code search is discontinued. So there is the nonfunctional GitHub search and an open niche for other projects...
Disclaimer: I'm the founder of GitSense (https://gitsense.com) that indexes code and Git history.
Indexing and retrieving code at scale is actually a really challenging problem, because there is a lot of code, on a lot of branches, in a lot of forked repos. GitSense doesn't even try to determine the authoritative source (repo/branch), since I personally think this is a lost cause given current AI technology.
With GitSense, everything is context driven, which is how you can reasonably remove duplication. To search, you have to define which branches/repositories to consider, which can be a few to a few thousand. Note that once a search context has been defined it can be reused, so this isn't something you have to create every time you want to search.
I sort of envision a Yahoo-type approach (the first incarnation) to searching for code. The basic idea is to provide a curated search experience, where domain experts can share what they believe to be the relevant branches to consider for a given problem.
Without some human intervention, I think duplication is a given and, as you point out, can lead to useless results.
No, what sucks is that they made a clickbaity article by not removing checked-in dependencies right at the beginning.
I've even seen Java projects check in their dependencies (in conservative industries), so this greatly erodes the value of the numbers.
We have a pretty big issue with duplication of effort. I was hoping for an article that would be a wake-up call, but instead it's just measurement artifacts.
Copy-and-pasted libraries are a direct cause of duplicated effort.
If you all grab it from the same repo (possibly through some sort of 'repo link' mechanism that keeps a local clone but is clearly marked as a clone, of course), then when a bug is fixed, it's fixed.
If you copy and paste it, suddenly every fix has to be done once per project using the library.
This reminds me why I hate AWS Lambda. I'm not sure if anything has changed since my last use (around spring of 2017). I wrote my code in Python and wanted to use psycopg2 to connect to my PG RDS instance. No such luck, yikes! I was told to commit some psycopg2 files to my repository. I don't remember whether S3 was a feasible option; I think it wasn't. Even then, I would hate having a separate place for storing dependencies just because AWS couldn't solve customer problems.
And no, I refuse to believe such a requirement is hard to solve at all. I refuse to believe AWS doesn't create some container-like environment behind the scenes for launching and running a function.
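For what it's worth, the workaround at the time looked roughly like this: build psycopg2 against Amazon Linux, commit the resulting files next to your handler, and zip it all into the deployment package. A minimal sketch in Python (the environment variable names are just placeholders):

    # Minimal sketch of the vendoring workaround described above; psycopg2 is
    # resolved from the copy committed alongside the handler.
    # DB_HOST etc. are hypothetical environment variables.
    import os
    import psycopg2

    def handler(event, context):
        conn = psycopg2.connect(
            host=os.environ["DB_HOST"],      # RDS endpoint
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
        )
        with conn, conn.cursor() as cur:
            cur.execute("SELECT 1")
            return {"ok": cur.fetchone()[0] == 1}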
I'm not sure I understand the reasoning behind that. Does it have to do with dependency versions, or the assumption that npm might not be available, or what?
There were/are a few factors. npm availability is definitely one - before caching, and without the overhead of running your own npm replica. It also didn't use to have things like lock files. Vendoring gives you a deterministic build and removes availability concerns. In that respect, it's not the worst thing ever; it mostly just leads to noisy diffs (and maybe C extension issues if your team works on a variety of OSes?)
This is pretty much what the golang world does (though now there are some tools that do a better job).
Yeah, running a registry is the better option, although another solution is forking / committing the dependencies in a separate repository and using a branch + shallow clone to keep the download size small.
What I ended up doing at work is to have a Docker image of a registry, with the package archives committed in the repo (and a small script to publish all the packages into it). That way I can rebuild a running registry in a few moments.
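The publish script doesn't have to be fancy; something like this loop over the committed package archives is roughly what I mean (the registry URL, directory layout, and .tgz extension are just examples, not a requirement):

    # Sketch: publish every committed npm package tarball to a local registry
    # running in a container. Assumes you're already authenticated against it
    # (or that it allows anonymous publishes).
    import pathlib
    import subprocess

    REGISTRY = "http://localhost:4873"       # example local registry URL
    PACKAGE_DIR = pathlib.Path("packages")   # example directory of committed .tgz files

    for tarball in sorted(PACKAGE_DIR.glob("*.tgz")):
        # `npm publish` accepts a tarball path and a --registry override.
        subprocess.run(
            ["npm", "publish", str(tarball), "--registry", REGISTRY],
            check=True,
        )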
Seriously. I've been on a deployment that failed because China blocked access to GitHub for a little while over some piece of code in some repository. Make sure your dependencies are available.
What's wrong is that the package manager and repo manager aren't talking to one another. "Keeping a copy of the dependency with your code" and "copy and pasting the code" should not be the same thing.
I agree; ideally artifacts should be stored and cached, e.g. as a saved container image or as a tarball.
Other than a local npm mirror, what do you think is the best way to store artifacts for new builds? Vendoring code is not a bad solution if it permits new builds to occur even in the event of a package manager outage. I see vendoring code as the simplest solution for small companies who don't want the management overhead of running a local mirror of package managers.
That said, I'm very much open to learning a new technique for handling this problem.
Wow, GitHub could save a lot of storage space if they dedup'd across projects/files explicitly, rather than storing Git repos, which is what I'm assuming they do.
Even with a good deduping/compressing filesystem, the way git history is stored means that they're probably missing out on a ton of savings here. Eh, it's probably not worth the complexity/deviation from standard Git tooling.
GitHub only shares objects among forks. Source: I used to work on GitHub's Git Infrastructure team, but this is publicly available information.
You can read about their architecture in their discussions of Spokes, which replicates repository networks (the original repository and its forks) across data centers. eg: https://githubengineering.com/stretching-spokes/
Trying to put all of GitHub's object files in a single packfile - even just putting them on a single server - would be impossible.
But even on hosting providers with a bespoke implementation - that do not use core git to manage Git repositories - this would be challenging. We have a custom Git server implementation in Visual Studio Team Services, but it still makes sense to shard object storage with the repository: you have to worry about scalability and performance, but also things like data sovereignty. We can't just put a user's git repository in some global SQL Azure database that contains all the repositories in Visual Studio Team Services: the repositories need to be geographically located with the VSTS account they created.
I vaguely remember somebody from GitHub claiming in an HN comment that all objects are shared.
Putting the whole of GitHub in a single packfile is obviously impractical, but having all of GitHub on some bespoke Venti/IPFS-style content-addressable object store is not.
It's not impossible, but it too is impractical for performance reasons. Providers that run core Git repack repositories regularly because packfiles are efficient; loose objects are horribly inefficient, even on the local filesystem. Moving them to a network filesystem is impractical.
That's really neat. Is there any documentation/discussion available on that technology? It sounds like something that would be fascinating to learn about.
There is the concept of submodules, which allows for multiple repos while maintaining the checksum mechanics that allow sharing the same bit of information between branches and across commits: https://git-scm.com/book/en/v2/Git-Tools-Submodules
The trick is that Git maintains an abstract file system (i.e. a graph) across commits. The graph consists of pointers to content, so a new version of the graph doesn't require a clone of the actual content... it gets a little dizzying to explain, but it is really not too complicated :)
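A toy sketch of the idea (a simplification, not Git's actual object format): content is stored once under its hash, and each version is just a mapping from paths to hashes, so unchanged files are shared for free.

    # Toy content-addressable store: identical content is stored once and
    # shared between versions. Simplified; not Git's real trees/packfiles.
    import hashlib

    objects = {}  # hash -> content (the shared blob store)

    def put_blob(content: bytes) -> str:
        digest = hashlib.sha1(content).hexdigest()
        objects[digest] = content   # writing the same content again is a no-op
        return digest

    # Two "commits" are just path -> hash mappings (roughly, trees).
    commit1 = {"lib/util.py": put_blob(b"def pad(s): ..."), "README": put_blob(b"v1")}
    commit2 = {"lib/util.py": put_blob(b"def pad(s): ..."), "README": put_blob(b"v2")}

    # util.py didn't change, so both commits point at the same object.
    assert commit1["lib/util.py"] == commit2["lib/util.py"]
    print(len(objects))  # 3 blobs stored, not 4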
Not necessarily. To share blobs, GitHub would need to replace the standard git filesystem blob storage with a distributed database of blobs. All Git repos would share this distributed database.
This is very interesting. I would have liked to see the results for JavaScript when you ignore the node_modules folder. If that's going to count for code duplication then pip dependencies should be included as well.
This should definitely be taken as a lesson though: JS needs a better deployment solution. That, or better education on the current solution(s).
They need a way better deployment method. Pip dependencies are usually just enumerated in a file (much like the JSON for npm), but I think there are fundamental differences between how the Python Foundation and npm Inc. handle their repositories. And if something isn't a nicely bundled wheel, I can still go out and install it (and any dependencies) the old-fashioned way. With some of the dependency chains for various JS modules, you're really forced to use a package manager of some sort for anything beyond the basics with minimal dependencies; otherwise you'll be pulling your hair out and looking for that virgin goat to sacrifice.
FWIW, and I know it's not much, I really don't use Node unless I have to (or JavaScript beyond the basics, jQuery and Lodash for that matter); I was turned off it when I was told to download and install Node via a copy-paste of some short command-line wget script from their website. That's shoddy at best, so the current state of affairs can be linked back to early practices. It's nice that Node has been cleaning up its act, but it's still kind of a crap fest; and now that is the standard they've provided for their community.
To wrap up this meandering train of thought: the paper actually addresses this nicely, because when 70% of the fluff and cruft in JS repos is node_modules, you end up with hidden dependencies, which is how things like left-pad happen. With pip I know exactly what all of my dependencies are (explicit dependencies); with Node, you have to dig into node_modules for every dependency of that initial dependency (implicit dependencies).
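To make the explicit/implicit distinction concrete, roughly (the paths and parsing are simplified, and scoped @org packages are ignored):

    # Rough illustration: pip dependencies are one flat, explicit file, while
    # the packages actually installed by npm are whatever ends up nested under
    # node_modules at any depth.
    import pathlib

    explicit = [
        line.split("==")[0].strip()
        for line in pathlib.Path("requirements.txt").read_text().splitlines()
        if line.strip() and not line.startswith("#")
    ]

    implicit = {
        d.name
        for d in pathlib.Path("node_modules").rglob("*")
        if d.is_dir() and d.parent.name == "node_modules"
        and not d.name.startswith((".", "@"))
    }

    print(len(explicit), "declared in requirements.txt")
    print(len(implicit), "installed under node_modules (including transitive)")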
Apparently some people do. They really shouldn't. I can only imagine this is ignorance in some places and, in others, fear of another left-pad scenario.
That's exactly what I'm saying. The author even states that the node_modules folder makes up 70% of the files in the JS section. Seems like a poor way to measure.
Would love to see a follow-up showing how much duplication exists once you control for common dependencies and autogenerated code, along with data on how many repositories are full clones (i.e. all code is near-identical to another repository).
Very interesting from a security perspective. So much potentially dangerous code is copy-pasted, and most of it is probably never updated either. I've personally found C vulnerabilities in code that I could easily find used in many projects by Googling the vulnerable line... Usually there's not much to do about it, either.
Trying to frame this as a security problem is a stretch, IMO.
My impression is that most public projects on GitHub are only of interest to the author and maybe a small handful of people. I, for example, have over 100 non-forked public repos and, except for 3 or 4 projects, nobody even looks at most of them, much less clones them and uses them. Even for the ~4 that do get attention, it's usually not because people are using the code itself - it's because they're doing something similar and want to see how I did it.
On the other hand, I only have anecdotal evidence to back up that claim, so who knows.
You’re being sarcastic, but sitting down and looking at which subject areas appear most frequently, and talking about why, would be useful for any language.
Are they even getting it right, or do they all have the same bugs? Are there no existing libraries? Are the downsides worse? Can we fix that? Should this functionality live in the core language (did we miss a feature)?