Remove LLVM from the repository, and clean up history.
Table of contents
Problem
We have tried both Git submodules and subtrees to manage accessing the LLVM source code as part of Carbon. We have had significant issues with both of these over time.
Git submodules are well supported in general, but make the repository significantly less user-friendly. There are a number of common operations, not least the initial clone of the repository, that are made frustratingly more complex in the presence of a submodule. In some cases (switching branches), it can even cause hard to diagnose errors for users. It also doesn’t directly help with carrying local patches. Because of all of these, we switched to subtrees.
Git subtrees in some ways work much better once set up as they are simply a normal directory for most users. They also make it especially easy to make and carry local patches. However, setting up the subtree and updating the subtree are bug prone and interact very poorly with GitHub’s pull request workflow. The result is that updating LLVM is currently both very difficult and error prone.
Another fundamental problem subtrees expose is that the LLVM repository is huge. It includes large and complex projects that Carbon is unlikely to ever use, and it doesn’t make sense for us to start our repository off with all of that history and space if we can avoid it. Including LLVM causes things like git status
or cloning the repository to be dramatically more expensive without significant benefit.
Proposal
We should switch to using Bazel to download a snapshot of LLVM during the build, and do a destructive history re-write to the Carbon repository to completely remove LLVM from it. All of the building against LLVM will continue to work seamlessly. However, the repository will become tiny and fast to work with, and the snapshot download is significantly cheaper than cloning or working with the full-history of LLVM.
The only way to capture the benefit here is to do a destructive update to the repository’s history. This is unfortunate, but essential to do as soon as possible and before we shift to be public. Despite the cost of making this change, the cost of not making it will grow without bound for the life of the project.
Important: This will require a manual update of some kind for every fork and clone. The steps to update them are provided below.
Details
Implementation
This will be accomplished using the git-filter-repo
tool. This makes it very easy to prune any trees from the history, replaying the history left and resulting in a clean and clear result.
We will remove three other trees while here that are no longer used:
src/jekyll/
src/firebase/
website/
Once we’re taking a history break, we should capture the value we can and have a minimal history of the repository we are actually using. The largest of these is src/jekyll
(988kb even packed), but it seems cleanest to remove the three together.
Patching LLVM
We still need to support patching LLVM. Initially, we will start with a simple approach of applying a collection of patch files from the Carbon repository to the LLVM snapshot. This is especially convenient while Carbon’s repositories are not public. Eventually, we can switch to having a fork of LLVM that we snapshot and developing any needed patches there. Having a more robust long-term solution is expected to be important if larger scale development is needed on Clang when building interoperability support.
Developing patches is also very easy when using Bazel. You can pass --override_repository=llvm-project=<directory>
(docs) to have Bazel use a local checkout of LLVM (with any patches you are testing) rather than downloading it. If doing significant development, this can be added to your user.bazelrc
file to consistently use your development area.
Updating forks and clones
If you have a fork and any clones of the repository with work that you want to save, we suggest the following process:
Archive your existing fork and clones
-
Go to the archived repository: (broken link:
https://github.com/carbon-language/archived-carbon-lang
) -
Fork this archived repository just like you normally would (into your personal space). It is important to fork the
carbon-language
hosted archived repository so you pick up the ACLs. You should not create your own personal repository. -
Update any clones to use these archived remotes to avoid accidentally trying to interact with the new repository. Example commands assuming the remote
origin
points to your fork andupstream
points to the main repository. You may also need to adjust from the SSH-style URLs to HTTPS ones to match:git remote set-url upstream git@github.com:carbon-language/archived-carbon-lang.git git@github.com:carbon-language/carbon-lang.git git remote set-url origin git@github.com:$USER/archived-carbon-lang.git git@github.com:$USER/carbon-lang.git
-
For each clone, mirror everything into the new
origin
(your fork):git push --mirror origin
If you have multiple clones with different but overlapping branches, this may require some extra steps to avoid collisions when doing the mirror push.
At this point, your clones should all point to an archival fork and should be able to function “normally” cleanly. The goal here is just archival and making sure you don’t lose anything.
Create a fresh fork and clone
Delete your existing fork of carbon-lang
(not archived-carbon-lang
):
- Go to your fork’s settings page:
https://github.com/$USER/carbon-lang/settings
. Be certain this is your fork and not thecarbon-language
organization. - Scroll to the bottom, and click
Delete this repository
.
Once deleted, create a fresh fork from the now-filtered main repository: https://github.com/carbon-language/carbon-lang/fork
Clone this fork and you’re ready to go with the new repository structure.
Porting in-progress branches from your archived clone
For any branch in your archive clones that you want to import, there are two approaches.
The simplest way is to extract the branch as a patch file and apply it in the new repository:
# If your old clone is in the `archived-carbon-lang` directory and your new
# clone is in `carbon-lang`, both of them on the `trunk` branch:
cd archived-carbon-lang
git switch $BRANCH
git diff upstream/trunk... > ../$BRANCH.patch
cd ../carbon-lang
git switch -c $BRANCH
patch -p1 < ../$BRANCH.patch
git commit
This will just flatten the branch into a single commit in the new clone.
If you want to preserve the commit history, you can do that using git format-patch
. Some points to remember here:
- Make sure you’ve rebased the branch into a clean series of commits on top of the latest trunk in the archived repository.
- Use
git format-patch
to get a series of patches in a directory. See its help for detailed usage. - Use
git am
to apply the series of patches in your new clone under some branch. Again, see its help for detailed usage.
Updating pending PRs
The pending PRs should end up closed automatically when you delete your fork above. If not, the simplest approach is to close them and just create a new PR. In-flight discussion comment threads will be awkward, but there isn’t much else to do. Rename the proposal document if needed.
We will explore whether we can re-open PRs with a fresh forced push from the fresh fork, but aren’t confident in this working. It also isn’t necessary, and we only have 12 PRs open at the moment.
Review comments may be disrupted
When we edit the repository, some PR comments (especially pending PRs that are updated to follow a freshly created fork) may lose their association with the line of code they were made against, and it is possible we may be unable to find some comments. This will at most apply to inline comments within the code of PRs, but that is the common case for PRs. We expect most of these to still be somewhat visible in the conversation view of the pull request, but they may be hard to find due to no longer being attached to a line.
Rationale
- Community and culture
- This will make it even easier and smoother for folks to clone, build, and even contribute.
- It will make clones significantly cheaper, especially those not building code but just working on documentation.
- Language tools and ecosystem
- Preserving the ability to build with LLVM and develop patches will be important as we expand the set of language tools and ecosystem being developed.
Alternatives considered
Do nothing
We could simply continue using the moderately broken Git subtree approach.
Advantages:
- No need to change our approach.
- No destructive operation on the repository requiring everyone to update.
Disadvantages:
- Every update of LLVM remains difficult, manual, and error prone.
- We’re currently carrying an LLVM patch, and it’s easy to lose (and has been lost) in updates.
- Will continually hit bugs in Git subtree as it seems both brittle and not a priority. We will have to struggle to understand and fix or work around them.
- The repository remains massive, slow, and contains significant LLVM code that we will never use.
- If we ever have users that wish to import the Carbon repository into some other environment, they will have to pay the cost of LLVM or remove it somehow. It may even cause them to have two copies of LLVM.
We think this problem is worth solving.
Don’t rewrite the repository history.
We could fix this without rewriting history. If we choose not to rewrite history now, it should be noted that the cost of rewriting history only grows and so we should expect to never rewrite history.
Advantages:
- No need for manual steps to update forks, clones, in-flight patches, etc.
- Commit hashes remain stable.
- All code review comment information associated with commit hashes will be retained.
Disadvantages:
- We pay the cost of having imported LLVM forever. Even if this cost is incrementally small (some amount of repository space when cloning with history), it is a cost we will never stop paying.
While the immediate costs are high, the unbounded time frame for which we will pay for leaving LLVM in the history means that will eventually be the dominant cost and we should just rewrite history, and the sooner the better.
Go back to submodules
Rather than switching to use Bazel downloaded snapshots, we could go back to using Git submodules. The original motivation to move away from submodules was needing to carry a local patch. Submodules and the proposed direction have the same options there, and our approach is discussed above.
Advantages:
- Somewhat more of a Git-native approach.
- Would avoid needing to invent another approach if we add a non-Bazel build system.
Disadvantages:
- Much of the cloning cost and other Git command costs we see with subtrees would still be present.
- It makes working with the Git repository even more tricky to get right.
- While less esoteric than subtrees, it still exposes a less polished surface of Git that we will have to cope with.
- Most notably, this will re-introduce the stumbling block of users first encountering Carbon not having a seamless experience.
We think we should try something simpler here, even if a bit less of a native-Git solution. It is also especially important for us to optimize the initial new contributor flow, and the Bazel approach is expected to be more seamless.
Rename the repository, and create a new one
Rather than editing the repository in place, this would move it aside and create a new one with the edited history.
Advantages:
- Some disruptive aspects of the in-place history edit would be avoided, specifically parts of PRs that are associated with commit hashes would likely continue working due to the commit hashes being preserved.
Disadvantages:
- For most cases, this would be equally disruptive – forks and clones would have the same update needed as they reference the repository by name.
- We would either need to build a system to carefully re-create the exact issue numbering and PR numbering, which may not even be possible, or we would need to allow all issue numbers and PR numbers to churn.
- The PR numbers churning is especially disruptive as the commit log references them to connect commits to the code review that led to the commit. They are also the basis of the proposal numbers.
- There isn’t likely a way to move the PRs back to the main repository, so browsing the historical code reviews would become problematic.
This seems to largely trade off preservation of some of the comment history within PRs for preserving the location, numbers, and links of all PRs and issues. That doesn’t seem like the right tradeoff.
Manually extract and archive some review comments
We could attempt to extract and archive review comments in case they are lost or made hard to find by the change.
Advantages:
- Defense in depth against any information loss here.
Disadvantages:
- Would be a decent amount of work to get this right.
- May not lose much information here. We think the comments will still be in the conversation view, but maybe hard to find.
- Unclear whether any of these will actually have value, especially extracted and out of context from the review where they were made.
- Serious concerns about arbitrarily scraping and moving comments other folks authored outside of the system (GitHub) that they authored them within.
Overall, while there is some risk here, we don’t think it is too high and the cost of trying to mitigate this seems unreasonable given the relatively small total amount of data at issue here.