I always like to see how an API is used in real projects. Sadly, GitHub search is mostly useless for this because of the number of duplicates. Google Code Search was great; it even supported regexps. Then there was koders.com, and now there's also something from Ohloh, which is better than GitHub, AFAIR.
EDIT: Ohloh became Open Hub, and now the code search is discontinued. So there is the nonfunctional GitHub search and an open niche for other projects...
Disclaimer: I'm the founder of GitSense (https://gitsense.com) that indexes code and Git history.
Indexing and retrieving code at scale is actually a really challenging problem, because there is a lot of code, on a lot of branches, in a lot of forked repos. GitSense doesn't even try to determine the authoritative source (repo/branch), since I personally think this is a lost cause given current AI technology.
With GitSense, everything is context driven, which is how you can reasonably remove duplication. To search, you have to define which branches/repositories to consider, which can be a few to a few thousand. Note that once a search context has been defined it can be reused, so this isn't something you have to create every time you want to search.
I sort of envision a Yahoo-type approach (the first incarnation) to searching for code. The basic idea is to provide a curated search experience, where domain experts can share what they believe to be the relevant branches to consider for a given problem.
Without some human intervention, I think duplication is a given and, as you point out, can lead to useless results.
No, what sucks is that they made a clickbaity article by not removing checked-in dependencies right at the beginning.
I've even seen Java projects check in their dependencies (in conservative industries), so this greatly erodes the value of the numbers.
We have a pretty big issue with duplication of effort. I was hoping for an article that would be a wake-up call, but instead it's just measurement artifacts.
Copy-and-pasted libraries are a direct cause of duplicated effort.
If you all grab it from the same repo (possibly through some sort of 'repo link' mechanism that keeps a local clone but is clearly marked as a clone, of course), then when a bug is fixed, it's fixed.
If you copy and paste it, suddenly every fix has to be done once per project using the library.
This reminds me why I hate AWS Lambda. I'm not sure if anything has changed since my last use (around spring of 2017). I wrote my code in Python and wanted to use psycopg2 to connect to my PG RDS instance. No such luck, yikes! I was told to commit some psycopg2 files to my repository. I don't remember whether S3 was a feasible option; I think it wasn't. Even then, I would hate having a separate place for storing dependencies just because AWS couldn't solve customer problems.
And no, I refuse to believe such a requirement is hard to solve at all. I refuse to believe AWS doesn't create some container-like environment behind the scenes for launching and running a function.
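For what it's worth, the workaround at the time looked roughly like this: build psycopg2 against Amazon Linux, commit the resulting files next to your handler, and zip it all into the deployment package. A minimal sketch in Python (the environment variable names are just placeholders):

    # Minimal sketch of the vendoring workaround described above; psycopg2 is
    # resolved from the copy committed alongside the handler.
    # DB_HOST etc. are hypothetical environment variables.
    import os
    import psycopg2

    def handler(event, context):
        conn = psycopg2.connect(
            host=os.environ["DB_HOST"],      # RDS endpoint
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
        )
        with conn, conn.cursor() as cur:
            cur.execute("SELECT 1")
            return {"ok": cur.fetchone()[0] == 1}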
I'm not sure I understand the reasoning behind that. Does it have to do with dependency versions, or the assumption that npm might not be available, or what?
There were/are a few factors. npm availability is definitely one - before caching, and without the overhead of running your own npm replica. It also didn't use to have things like lock files. Vendoring gives you a deterministic build and removes availability concerns. In that respect, it's not the worst thing ever; it mostly just leads to noisy diffs (and maybe C extension issues if your team works on a variety of OSes?)
This is pretty much what the golang world does (though now there are some tools that do a better job).
Yeah, running a registry is the better option, although another solution is forking / committing the dependencies in a separate repository and using a branch + shallow clone to keep the download size small.
What I ended up doing at work is to have a Docker image of a registry, with the package archives committed in the repo (and a small script to publish all the packages into it). That way I can rebuild a running registry in a few moments.
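The publish script doesn't have to be fancy; something like this loop over the committed package archives is roughly what I mean (the registry URL, directory layout, and .tgz extension are just examples, not a requirement):

    # Sketch: publish every committed npm package tarball to a local registry
    # running in a container. Assumes you're already authenticated against it
    # (or that it allows anonymous publishes).
    import pathlib
    import subprocess

    REGISTRY = "http://localhost:4873"       # example local registry URL
    PACKAGE_DIR = pathlib.Path("packages")   # example directory of committed .tgz files

    for tarball in sorted(PACKAGE_DIR.glob("*.tgz")):
        # `npm publish` accepts a tarball path and a --registry override.
        subprocess.run(
            ["npm", "publish", str(tarball), "--registry", REGISTRY],
            check=True,
        )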
Seriously. I've been on a deployment that failed because China blocked access to GitHub for a little while over some piece of code in some repository. Make sure your dependencies are available.
What's wrong is that the package manager and repo manager aren't talking to one another. "Keeping a copy of the dependency with your code" and "copy and pasting the code" should not be the same thing.
I agree; ideally artifacts should be stored and cached, e.g. as a saved container image or as a tarball.
Other than a local npm mirror, what do you think is the best way to store artifacts for new builds? Vendoring code is not a bad solution if it permits new builds to occur even in the event of a package manager outage. I see vendoring code as the simplest solution for small companies who don't want the management overhead of running a local mirror of package managers.
That said, I'm very much open to learning a new technique for handling this problem.
Wow, GitHub could save a lot of storage space if they dedup'd across projects/files explicitly, rather than storing Git repos, which is what I'm assuming they do.
Even with a good deduping/compressing filesystem, the way git history is stored means that they're probably missing out on a ton of savings here. Eh, it's probably not worth the complexity/deviation from standard Git tooling.
GitHub only shares objects among forks. Source: I used to work on GitHub's Git Infrastructure team, but this is publicly available information.
You can read about their architecture in their discussions of Spokes, which replicates repository networks (the original repository and its forks) across data centers. eg: https://githubengineering.com/stretching-spokes/
Trying to put all of GitHub's object files in a single packfile - even just putting them on a single server - would be impossible.
But even on hosting providers with a bespoke implementation - that do not use core git to manage Git repositories - this would be challenging. We have a custom Git server implementation in Visual Studio Team Services, but it still makes sense to shard object storage with the repository: you have to worry about scalability and performance, but also things like data sovereignty. We can't just put a user's git repository in some global SQL Azure database that contains all the repositories in Visual Studio Team Services: the repositories need to be geographically located with the VSTS account they created.
I vaguely remember somebody from GitHub claiming in an HN comment that all objects are shared.
Putting the whole of GitHub in a single packfile is obviously impractical, but having all of GitHub on some bespoke Venti/IPFS-style content-addressable object store is not.
It's not impossible, but it too is impractical for performance reasons. Providers that run core Git repack repositories regularly because packfiles are efficient; loose objects are horribly inefficient, even on the local filesystem. Moving them to a network filesystem is impractical.
That's really neat. Is there any documentation/discussion available on that technology? It sounds like something that would be fascinating to learn about.
There is the concept of submodules, which allows for multiple repos while maintaining the checksum mechanics that allow sharing the same bit of information between branches and across commits: https://git-scm.com/book/en/v2/Git-Tools-Submodules
The trick is that Git maintains an abstract file system (i.e. a graph) across commits. The graph consists of pointers to content, so a new version of the graph doesn't require a clone of the actual content... it gets a little dizzying to explain, but it is really not too complicated :)
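A toy sketch of the idea (a simplification, not Git's actual object format): content is stored once under its hash, and each version is just a mapping from paths to hashes, so unchanged files are shared for free.

    # Toy content-addressable store: identical content is stored once and
    # shared between versions. Simplified; not Git's real trees/packfiles.
    import hashlib

    objects = {}  # hash -> content (the shared blob store)

    def put_blob(content: bytes) -> str:
        digest = hashlib.sha1(content).hexdigest()
        objects[digest] = content   # writing the same content again is a no-op
        return digest

    # Two "commits" are just path -> hash mappings (roughly, trees).
    commit1 = {"lib/util.py": put_blob(b"def pad(s): ..."), "README": put_blob(b"v1")}
    commit2 = {"lib/util.py": put_blob(b"def pad(s): ..."), "README": put_blob(b"v2")}

    # util.py didn't change, so both commits point at the same object.
    assert commit1["lib/util.py"] == commit2["lib/util.py"]
    print(len(objects))  # 3 blobs stored, not 4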
Not necessarily. To share blobs, GitHub would need to replace the standard git filesystem blob storage with a distributed database of blobs. All Git repos would share this distributed database.
This is very interesting. I would have liked to see the results for JavaScript when you ignore the node_modules folder. If that's going to count for code duplication then pip dependencies should be included as well.
This should definitely be taken as a lesson though: JS needs a better deployment solution. That, or better education on the current solution(s).
They need a way better deployment method. Pip dependencies are usually just enumerated in a file (much like the JSON for npm), but I think there are fundamental differences between how the Python Foundation and npm Inc. handle their repositories. And if something isn't a nicely bundled wheel, I can still go out and install it (and any dependencies) the old-fashioned way. With some of the dependency chains for various JS modules, you're really forced to use a package manager of some sort for anything beyond the basics with minimal dependencies; otherwise you'll be pulling your hair out and looking for that virgin goat to sacrifice.
FWIW, and I know it's not much, I really don't use Node unless I have to (or JavaScript beyond the basics, jQuery and Lodash for that matter); I was turned off it when I was told to download and install Node via a copy-paste of some short command-line wget script from their website. That's shoddy at best, so the current state of affairs can be linked back to early practices. It's nice that Node has been cleaning up its act, but it's still kind of a crap fest; and now that is the standard they've provided for their community.
To wrap up this meandering train of thought: the paper actually addresses this nicely, because when 70% of the fluff and cruft in JS repos is node_modules, you end up with hidden dependencies, which is how things like left-pad happen. With pip I know exactly what all of my dependencies are (explicit dependencies); with Node, you have to dig into node_modules for every dependency of that initial dependency (implicit dependencies).
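To make the explicit/implicit distinction concrete, roughly (the paths and parsing are simplified, and scoped @org packages are ignored):

    # Rough illustration: pip dependencies are one flat, explicit file, while
    # the packages actually installed by npm are whatever ends up nested under
    # node_modules at any depth.
    import pathlib

    explicit = [
        line.split("==")[0].strip()
        for line in pathlib.Path("requirements.txt").read_text().splitlines()
        if line.strip() and not line.startswith("#")
    ]

    implicit = {
        d.name
        for d in pathlib.Path("node_modules").rglob("*")
        if d.is_dir() and d.parent.name == "node_modules"
        and not d.name.startswith((".", "@"))
    }

    print(len(explicit), "declared in requirements.txt")
    print(len(implicit), "installed under node_modules (including transitive)")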
Apparently some people do. They really shouldn't. I can only imagine this is ignorance in some places and, in others, fear of another left-pad scenario.
That's exactly what I'm saying. The author even states that the node_modules folder makes up 70% of the files in the JS section. Seems like a poor way to measure.
Would love to see a follow-up showing how much duplication exists once you control for common dependencies and autogenerated code, along with data on how many repositories are full clones (i.e. all code is near-identical to another repository).
Very interesting from a security perspective. So much potentially dangerous code is copy-pasted, and most of it is probably never updated either. I've personally found C vulnerabilities in code that I could easily find used in many projects by Googling the vulnerable line... Usually there's not much to do about it, either.
Trying to frame this as a security problem is a stretch, IMO.
My impression is that most public projects on GitHub are only of interest to the author and maybe a small handful of people. I, for example, have over 100 non-forked public repos and, except for 3 or 4 projects, nobody even looks at most of them, much less clones them and uses them. Even for the ~4 that do get attention, it's usually not because people are using the code itself - it's because they're doing something similar and want to see how I did it.
On the other hand, I only have anecdotal evidence to back up that claim, so who knows.
You’re being sarcastic, but sitting down and looking at which subject areas appear most frequently, and talking about why, would be useful for any language.
Are they even getting it right, or do they all have the same bugs? Are there no existing libraries? Are the downsides worse? Can we fix that? Should this functionality live in the core language (did we miss a feature)?