RFC: Best practices to improve Git logs #1

admbtlr · 2020-08-24T11:51:05Z

Proposal

The git log can be a clear and powerful record of the work that has been undertaken within a repo, and as such can be a vital resource. Or it can be a spaghetti junction box of noise and cruft that requires some serious sed fu to yield anything useful. I prefer the former, and propose some simple best practices to make it happen, while also allowing for exceptions in cases where developers make extra effort to avoid crufty spaghetti. These best practices are:

A PR does one and only one thing
A PR's title should use the semantic commit message format
If you want to keep the individual commits in your PR, you must specify this in the PR description so that it can be taken into account in the code review
If you want to keep the individual commits in your PR, they should also use the semantic commit message format
It doesn't really matter whether you assign yourself or someone else to the PR. What matters is that the assignee is the one and only person who is responsible for merging it.
Unless it's your PR and it has been approved by someone who knows that you intend to do otherwise, squash the commits

Reasoning

Here's a recent chunk of the output of git log in force, (please note that I'm not saying this is anyone's fault! I just want to point out that this is how it looks if you don't prioritise keeping it tidy):

| * | | 4490a1d40 Complete desktop header notifications
| * | |   2b9f24686 Merge branch 'master' into desktop-header-notif
| |\ \ \
| * | | | 764897b93 Use new logged in actions
| * | | |   bd7e8b0b6 Merge branch 'master' into desktop-header-notif
| |\ \ \ \
| * | | | | 9ea1d0951 paired with xxx, created new component to hold logged in items
* | | | | |   00e44f18a Merge pull request #xxx from xxx/close-expanded-input
|\ \ \ \ \ \
| |_|_|_|_|/
|/| | | | |
| * | | | | 54aa4fab7 added ts-ignore
| * | | | | e5f6217c5 text area autoresizes on message send
* | | | | |   6eb5bb8de Merge pull request #xxx from xxx/inbox-text-formatting

By my reckoning there are about three lines in that log that tell me what's been happening in the force repo; the rest of it is noise and spaghetti. As a comparison, here's the git log on Aether, an open source project for data curation from the last place I worked, where we put a lot of thought into our git workflows:

* aad07e79 (HEAD -> develop, origin/develop, origin/HEAD) chore: upgrade dependencies (#881)
* 3516d97b fix(extractor): generate unique ids per mapping (#879)
* 86314e16 docs: python 3.8 [ci-skip]
* 83650a84 chore: upgrade dependencies (#878)
* d9bbb746 fix(producer): remove "callable" warnings (#876)
* 2f47fe41 chore: upgrade dependencies (#874)
* bbc417f3 fix: empty realm (#871)
* dda3bed6 fix: the presence of artifacts without realms were causing an issue when connecting via database (#870)
* 79c4b564 fix: limit producer concurrency (#867)
* eefed57c refactor: move producer to aether.producer (#865)
* 50c08b42 docs: typo (#866)
* f457fd2c test: set dynamic priority to task execution (#862)
* 0e873fe4 chore: LOGGING_LEVEL and named volumes in docker (#861)

To my mind, this is orders of magnitude more useful: I can quickly and easily see what's been happening in the repo. Generating release notes becomes trivial – in fact the git log is itself an ongoing, real-time set of change notes.

A specific example of the generalised motivation for this RFC is something that happened to me a few weeks ago: I had just finished a new feature on Force, and I was doing a final merge of master into my working branch before creating a PR – at which point I discovered that some new commits had completely refactored a lot of the lines of code that I had just worked on. In order to do a sensible merge, I had to try and figure out the motivation for the upstream changes so that I could work out how many adjustments to make to my code, and how many to make to the incoming code. I always have git blame displayed for the current line, so it's trivial to see info about the last commit that touched each line. But it turned out that the title of the commit in question was... Merge master. To find out the purpose of the PR of which this Merge master commit was a constituent part required me to untangle some git log spaghetti, and then refer to the closed pull requests in Github.

Imagine how much easier the process would have been if the commit message was a description of the actual PR, informing me that it was, say, a refactor and not a feature; knowing this fact would have made it easier for me to handcraft a reasonable merge. Now try to imagine a situation in which it would be useful to know that the author of that PR merged master into their branch at that point in time. I can't think of one. Of course that developer was right to record that particular action in their branch's commit log – but in the context of the main branch, it was just unnecessary obfuscation.

This crufty spaghetti has been a constant problem in every place I've worked at since git became the dominant version control system, and the only way that I have seen the situation improve is by introducing the best practices I'm outlining here. Git is one of the most powerful and graceful pieces of software ever created, but it isn't automatically perfect for every use case; in order to really leverage that power and grace, you need to introduce some consistent practices.

A little more detail on some of the best practices listed above:

A PR should only do one thing. This is, I hope, non-controversial. A single PR should represent a single feature, bugfix, refactor, config change, etc. Now admittedly, if I'm working on a feature and I find a bug that needs fixing, I find it hard to resist the temptation to just fix it and commit it on the feature branch I'm working on. This is easier for me, but liable to cause problems for the codebase, the git log and everyone who interactts with either of them. The right thing to do is to reverse out of where I am, fix the bug in a separate branch and open a PR, and then come back to the feature I was working on. If I get lazy, or just forget, then I would hope that a code reviewer picks me up on it and asks me to cherry pick the offending commit(s) to a separate branch and open a separate PR.
PR titles should use the semantic commit message format. These kind of conventions ultimately do double duty: not only does the reader of the title need to think less in order to understand it, but the author needs to think less when writing it too. Luckily we already have a convention that is available and in widespread usage: semantic commit messages (SCM). The important part here is that we only enforce SCM formatting on entries that will end up in the git log, so in the case of a developer who is not artisanally crafting their individual commits, this means that it is the PR title that uses SCM. This kind of standardisation also makes it easy to do all kinds of cool automation, e.g. a fastlane script that is triggered by a commit to the main branch can decide whether or not to build a new beta release, depending on what kind of commit it was. Semantic commit messages also make semantic versioning really easy.
When merging a PR, (almost) always squash the commits. This has been discussed at great length in an earlier RFC. The many great comments in that PR have informed the Exceptions section below. In general though, Squash and merge has two advantages: it gets rid of those Merge pull request commit messages, and it removes the spaghetti: it hides the how behind the what.

Philosophical aside: PRs as Git Molecules

Note that this represents a shift in git usage away from being commit-focused towards being PR-focused. I like to think of this as being "molecular" as opposed to "atomic". You hear about the idea of "atomic commits" in git usage, which generally means that you should commit often, in small chunks of work. This is good practice for many reasons, not least of which because it allows you to undo, interleave, and generally time travel within the work that you're doing - i.e. workflow considerations. But workflow considerations are only important when you're working. Once the work is done, there is usually no real need to keep these atoms around. It's more useful to group them into a molecule.

Atoms are interesting for understanding how stuff works, but if you want to see what is really happening, you're much better off studying things at a molecular level. Once the work is done, reviewed and merged, I am almost always only interested in the "what".

Exceptions

This shift from "how" to "what" is of course not uncontroversial. I have talked to various people who maintain that git logs can provide huge value by allowing others to see how someone went about implementing a particular piece of functionality (or fixing a gnarly bug). This is without doubt a valuable approach that transforms the log into a powerful learning tool. My contention here would be that although this is hugely useful in the right context (what I call an "exemplary implementation"; a great example would be Roop's recent NextJs experiment), you can only ever be working in such a context some of the time. On the other hand, at Artsy we are working in a product-focused context 100% of the time. When the common use case is so much more common than other, rarer (nonetheless valuable!) use cases, it makes sense to instil a set of practices that are aimed at making that common use case smoother, easier and more effective.

In order to address the question of exemplary implementations that can be used for educational purposes – or that are so complex as to merit more detailed log entries – I would propose one of the following two solutions:

The closed PR lives on the repo (and is linked to somewhere), its branch undeleted, with its original commit log intact so that others can study the code within that context and learn about the implementation details
For developers who find these practices too restrictive and prefer to tidy up a PR's commit log using interactive rebasing, it should be acceptable to merge or rebase (depending on how valuable it would be to group the commits) these commits onto the main branch.

If the second solution is preferred, I would propose that the author of the PR is required to:

Mention this in the PR description, so that reviewers know that they should also be reviewing the structure and descriptions of the commits
Assign themselves, so that they are the ones who will ultimately merge the approved PR. The upshot of this is the simple rule that you always squash commits, unless it's your own PR and you have specified otherwise.
Include the PR number in the titles of each of the commits (in the same format that is used automatically when squashing commits), so that it's easy to see that they're grouped together, and it's easy to dig out the context of the PR.

Additional context

There have been several RFCs on this and adjacent topics:

Defining PR roles
Make each commit a working version and have linear history (which has a lot of overlap – but is not coterminous – with this RFC)
Standardize "who merges the PR"

I can see from the many great comments in Eloy's RFC in particular that this is a controversial topic. But at the same time I feel like – as a comparative outsider – the current state of git logs in most of the repos that I've worked in since I joined is suboptimal and introduces friction in day-to-day work. I'd therefore like to try and reach some consensus on a coherent approach to improve this situation, even if it's not necessarily the one I've outlined here.

How is this RFC resolved?

By agreeing on a set of best practices and then socialising them. This would mean sharing the documented best practices, possibly a lunch and learn session. Based on previous experience, I would favour just randomly dropping in on open PRs to see whether people are adhering to the best practices, and nudging them if not. Hopefully this could be done in such a way as to not be too surveillance-y.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Best practices to improve Git logs #1

RFC: Best practices to improve Git logs #1

admbtlr commented Aug 24, 2020

RFC: Best practices to improve Git logs #1

RFC: Best practices to improve Git logs #1

Comments

admbtlr commented Aug 24, 2020

Proposal

Reasoning

Philosophical aside: PRs as Git Molecules

Exceptions

Additional context

How is this RFC resolved?