Keeping unrelated projects in a single repo is every bit as bad as keeping all the source code for one project in a single 500k-line C file. CVS modules are a nightmare to work with, and I consider the lack of such abominations a feature of git et al, not a shortcoming.
The author's straw man of a git repo per file is also quite unconvincing.
Git may have some shortcomings, but this is not one of them.
Not really.
The problem is the way that projects are dependent on other projects.
It is especially illustrative if you use a package manager like apt-get: installing something pulls in tons of other libraries and dependencies.
When you want to build something, you want to build against its dependencies, which you would want to tag in some manner.
For example, v1.0 of ProjA needs v0.8 of ProjB and v3.2 of ProjC. A nice real-world case is NodeJS, which needs the Google V8 engine. Take a look at the number of changesets that simply manage revisions of V8 inside the NodeJS tree (http://github.com/ry/node/commits/master/deps/v8).
In an ideal world, you could have tagged a version of NodeJS source code with a particular version of V8 (which lives in a different repository), all under the same directory structure.
If v8 were available as a git repository, node could have gone the submodules route, which works perfectly and solves the exact problem the parent article is talking about.
To stay with node/v8, you would add v8 as a submodule and check out a specific revision or tag. The parent repository (node) would keep track of that revision and whenever you clone the parent, you'd be able to fetch the submodule and check out the correct revision.
If you want to update v8, you cd into the v8 dep directory and check out the newer tag. The parent repository will see this and let you record the new submodule revision as a commit in the main repository.
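A rough sketch of that workflow, written as Python that only assembles the git command lines; nothing runs unless you hand the list to `run_all`. The URL, the `deps/v8` path, and the tag names are placeholders I made up for illustration, not the coordinates of a real v8 git repository.

```python
import subprocess


def pin_submodule(url, path, tag):
    """Commands to add `url` as a submodule at `path`, pinned to `tag`."""
    return [
        ["git", "submodule", "add", url, path],
        # Inside the submodule, move to the revision we want to pin.
        ["git", "-C", path, "checkout", tag],
        # Staging the path records the submodule's current commit in the parent.
        ["git", "add", path],
        ["git", "commit", "-m", f"Add {path} at {tag}"],
    ]


def bump_submodule(path, tag):
    """Commands to move an existing submodule at `path` to a newer `tag`."""
    return [
        ["git", "-C", path, "fetch", "--tags"],
        ["git", "-C", path, "checkout", tag],
        ["git", "add", path],
        ["git", "commit", "-m", f"Update {path} to {tag}"],
    ]


def run_all(commands, cwd="."):
    """Execute each command in the parent repository, stopping on failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)


# Hypothetical usage; the URL and tags are placeholders:
# run_all(pin_submodule("https://example.invalid/v8.git", "deps/v8", "1.0"))
# run_all(bump_submodule("deps/v8", "1.1"))
```

Someone cloning the parent afterwards would run `git submodule update --init` to fetch the submodule and check out the recorded revision.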
We're using this facility internally with some projects, sometimes adding internally developed modules as submodules, sometimes adding public libraries.
Public libraries as submodules are really cool, as you can use git's awesome merging features to keep around modifications you made to said libraries even if you update the target library.
I'm having a bit of difficulty explaining the concept, but once you do it, it's simple and beautiful.
This entire article is about how, at the scale of entire repositories, we are seeing issues that are analogous to the issues CVS saw on individual files. I would even go so far as to say this entire article is about submodules, how git's architecture causes there to be more of them, and how changesets aren't effectively managed between them, given that you have to make separate commits between repositories and then attempt to knit the changes back together at a higher level of scale where they have become mostly unrelated.
> Checking out a Git repository involves downloading not only the entire current revision, but the entire history. So this creates pressure against putting two partially-related projects together in the same repository, especially if one of the projects is huge.
See git clone's "--depth" option. There are limitations to what you can do with a shallow clone, but you do not, in fact, need to pull down an entire history.
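For illustration, a tiny helper that builds the shallow-clone command line. The `--depth` option is git's own; the function name and the example URL are mine.

```python
def shallow_clone_command(url, depth=1, dest=None):
    """Build a `git clone` command that fetches only the last `depth` commits."""
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    cmd = ["git", "clone", "--depth", str(depth), url]
    if dest is not None:
        cmd.append(dest)
    return cmd


# Placeholder URL; pass the result to subprocess.run(..., check=True) to execute.
command = shallow_clone_command("https://example.invalid/big-project.git", depth=1)
```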
Regarding the rest of TFA, the author should look at git submodules, as already mentioned here in the comments. However, submodules should not be used without understanding their implications and limitations (it is worth searching the Git mailing list).
There are at least three alternatives to submodules I'm aware of:
The latter two are not really designed to be used outside of the projects for which they were created, but both are Python code so they can probably be adapted pretty easily.
The problem is not CVS or Git or SVN. The problem is fundamental and actually very widespread. Within a given semantic domain, one program, one database manipulated by one program, one OS process, we can enforce all sorts of useful guarantees, from consistency of the database to guaranteeing interesting and useful properties about the data to strong typing (in the program case) and so on. Once you leave the island, you are back in the untyped swamp of arbitrary code. Git makes progress by enforcing more interesting properties on a larger chunk of stuff than CVS, but nobody can enforce those interesting properties on the entire universe.
When you learn to see this problem, you start to see it everywhere; it's so fundamental and pervasive you can hardly even see it.
Mercurial's subrepo is a pretty good stab at fixing this, IMO. It's unpolished, right now, in that it doesn't deal with some of the more complex scenarios well, but the idea is really getting there (and it works with SVN subrepos, too -- we have git in the works).
* _The DAG-based systems don't represent changesets that cross repositories. They don't have a type of object for representing a snapshot across repositories._
* _Creating a tag across repositories would involve visiting every repository to add a tag to it._
These two, at least, are technically accounted for by git submodules, though they are a bit of a pain to use.
I don't know. Personally, I love submodules. Once you "get it", it's really straightforward.
It really shines if you use them for third-party libraries that you are keeping patches for. In the old days, updating said libraries was nearly impossible once you began patching them. With submodules, you can use git's really cool merging capabilities to help you.
Even better: The workflow when updating a library patched by you to a newer upstream version stays the same as if you'd just do a traditional merge in your own codebase. So there's no need to learn even more commands or new workflows - it just works.
Until you have a merge conflict. If someone could explain to me which file contains the conflict markers so I can fix them, I'd be really happy. It's not the .gitmodules file.
If it's a merge conflict that happens while updating the submodule, then the conflict is inside the working copy of that submodule. Go there, resolve it using the usual tools, commit, and then commit the updated submodule revision in the parent repository.
Except that doesn't work, because that just changes the hash of the submodule (as stored by the supermodule) again. The problem is that the supermodule just stores the hash of the submodule. The only way I found of fixing it was to:
1. Create a new repo.
2. Get the branched submodule.
3. Check that in as head: Note, now HEAD will NOT compile - I've committed broken code.
4. Go back to the original repo.
5. Get HEAD: this pulls the hash I need.
6. Do the merge. The submodule doesn't merge, because it has the right hash.
Surely there must be some way to just modify the hash. But I don't know where to change it. I found it in several locations, but I don't know how it works, and hacking my VCS isn't my idea of good sense.
The command you are looking for is git diff. git status will show you which files changed; git diff will show you where the conflicts are in the files during the merge.
I am pretty certain that in this situation git diff claims that the submodule directory itself has somehow been modified from the previous commit hash to the new one. Please recognize that the case here is a merge conflict between two people who each advanced the submodule pointer to a different revision of the subproject, not merge conflicts inside of the subproject (which really should not be happening).
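For that pointer-versus-pointer case, one commonly suggested resolution is to check out the commit you want inside the submodule and then stage the submodule path in the parent, which records that commit and marks the conflict resolved. The sketch below only builds the command list; the path and commit id are hypothetical, and I have not checked it against the exact situation described above.

```python
def resolve_submodule_pointer(path, chosen_commit):
    """Commands to settle a conflict over which commit a submodule points to."""
    return [
        # List the competing index entries for the submodule path.
        ["git", "ls-files", "--unmerged", "--", path],
        # Move the submodule's working copy to the commit you want to keep.
        ["git", "-C", path, "checkout", chosen_commit],
        # Staging the path records that commit in the parent repository.
        ["git", "add", path],
        # Conclude the merge using the prepared merge message.
        ["git", "commit", "--no-edit"],
    ]


# Hypothetical path and commit id:
steps = resolve_submodule_pointer("deps/v8", "abc1234")
```

If neither side's revision is right on its own, the chosen commit would first have to be created inside the submodule, for example by merging the two candidate revisions there.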
> In the DAG-based systems, branching is done at the level of a repository. You cannot branch and merge subdirectories of a repository independently: you cannot create a commit that only partially merges two parent commits.
Please inform me, how can you ever possibly create a commit that is a "partial merge" from two parent commits? At some point, you had to explicitly choose what you wanted to merge or not merge, and you can do exactly that in Git, and I would certainly assume that you could do that in Mercurial and Bazaar too. Or am I missing something?
I don't get it. What prevents him from writing a simple shell/python/whatever script that uses a simple hash-table with tag associations for different repositories for synced checkout via git hooks?
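Something like the following is presumably what such a script would boil down to: a dictionary that maps one cross-repository tag to a revision per repository, plus a function that turns it into checkout commands. The manifest layout, the function name, and the directory layout are assumptions of mine; the version numbers reuse the ProjA/ProjB/ProjC example from earlier in the thread, and the git hook wiring is left out.

```python
import posixpath

# One cross-repository "tag" maps each repository to the revision it needs.
MANIFEST = {
    "projA-v1.0": {"ProjA": "v1.0", "ProjB": "v0.8", "ProjC": "v3.2"},
}


def checkout_commands(manifest, tag, root="."):
    """Return one `git checkout` command per repository pinned by `tag`."""
    if tag not in manifest:
        raise KeyError(f"unknown cross-repository tag: {tag}")
    pins = manifest[tag]
    return [
        # Assumes each repository is already cloned at <root>/<name>.
        ["git", "-C", posixpath.join(root, repo), "checkout", rev]
        for repo, rev in sorted(pins.items())
    ]


commands = checkout_commands(MANIFEST, "projA-v1.0", root="src")
```

Note that this only covers the synced checkout; it does nothing about commits that logically span several repositories, which is the other half of the article's complaint.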