BuildKit in depth: Docker's build engine explained (depot.dev)
174 points by kylegalbraith 12 months ago | 44 comments



Buildx also has a completely different build output, and the default is useless for anything. Use `--progress=plain`.

Also, buildx won't commit the intermediate layers during the build. So if something fails, you can't just grab the previous intermediate layer and do `docker run --entrypoint /bin/bash` on that layer to poke around.
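
For reference, a rough sketch of both workarounds (image names are placeholders; the legacy-builder fallback assumes a Docker version that still ships it):

    # readable buildx output
    docker buildx build --progress=plain -t myapp:dev .

    # fall back to the classic builder, which does commit intermediate layers
    DOCKER_BUILDKIT=0 docker build -t myapp:dev .
    # the output prints "---> <image id>" after each step; grab the last good one
    docker run --rm -it --entrypoint /bin/bash <intermediate-image-id>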

I've had cases where BuildKit gets stuck and `docker buildx prune -a` doesn't clean everything up. The way to fix it is `systemctl restart containerd`. This seems to happen a lot when I interrupt the build (e.g., I know it will fail and need to restart it with a fix), but it's also happened on successful builds.

All of that is on top of docker itself having extremely poor error messages.

I really do not like docker or buildx.


"BUILDX_EXPERIMENTAL=1 docker buildx build <args> --invoke /bin/sh"

There's actually a whole debugging client available.


This is great! I’ve been waiting a long time for this and it seems like more debug features are in the pipeline too: https://github.com/docker/buildx/issues/1104

Docs: https://docs.docker.com/engine/reference/commandline/buildx_...


If you build your Dockerfile with Earthly, we have a `-i` flag that pops you into a shell on failure, and we use BuildKit, so it's totally possible.


> Also, buildx won't commit the intermediate layers during the build. So if something fails, you can't just grab the previous intermediate layer and do `docker run --entrypoint /bin/bash` on that layer to poke around.

`buildg debug` (Dockerfile debugger based on BuildKit) to rescue: https://github.com/ktock/buildg


and its "oh, you want multi-arch, do you?" friend. While prosecuting this <https://gitlab.com/gitlab-org/gitlab/-/issues/339567> I learned that they bundle binfmt into the buildkit docker image <https://github.com/moby/buildkit/blob/v0.12.5/Dockerfile#L11...> which itself is compiled against a static copy of qemu <https://github.com/tonistiigi/binfmt/blob/deploy/v6.2.0-26/D...> with various /proc/sys/fs/binfmt_misc wizardry <https://github.com/tonistiigi/binfmt/blob/deploy/v6.2.0-26/c...>

We ended up switching to multiarch/qemu-user-static:7.2.0-1 <https://hub.docker.com/layers/multiarch/qemu-user-static/7.2...>, which also mutates binfmt_misc in buildx's context in order to exec the static copy of qemu in it, but I find their shell script much more legible about what's going on: <https://github.com/multiarch/qemu-user-static/blob/v7.2.0-1/...>. Not to mention it more clearly represents the actual qemu version, versus secreting it away inside moby/buildkit:heheh-good-luck-friend
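
For reference, their approach boils down to roughly this (image/tag names here are illustrative):

    # register the statically linked qemu binaries with binfmt_misc on the host
    docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

    # after that, cross-arch runs/builds work via emulation
    docker run --rm --platform linux/arm64 alpine uname -m
    docker buildx build --platform linux/arm64 -t myapp:arm64 .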

They have an entire GH label for that: https://github.com/docker/buildx/labels/area%2Fqemu and my journey into the buildx sewers started from https://github.com/docker/buildx/issues/1170


I see you're trying to do multi-arch builds. At reasonable (on bare metal) speeds you say?

Delete the entire docker stack. It's utter shite. Then install Podman and Buildah, and rejoice! For all the evil buildx magic can now be done explicitly.

This solution was born when trying to do native builds on CircleCI - where you do not have a buildx cluster with multiple architectures. You have to build the archs as independent build steps, then pull them together afterwards. This is something that Docker never considered, and Docker only functions on the happy path. Buildah has multiarch manipulation commands: https://danmanners.com/posts/2022-01-buildah-multi-arch/ (not a plug, first blog on search)
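
Roughly, the Buildah flow looks like this (registry/image names are placeholders): build each arch as an independent step, then stitch the results into one manifest list:

    # per-arch builds, each can run on its own native runner
    buildah bud --arch amd64 -t registry.example.com/myapp:amd64 .
    buildah bud --arch arm64 -t registry.example.com/myapp:arm64 .
    buildah push registry.example.com/myapp:amd64
    buildah push registry.example.com/myapp:arm64

    # pull them together into a single tag
    buildah manifest create myapp:multi \
        registry.example.com/myapp:amd64 \
        registry.example.com/myapp:arm64
    buildah manifest push --all myapp:multi docker://registry.example.com/myapp:latest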

We migrated to GHA, but still use separate builders to avoid the Docker tripe. We also ditched DCT (Docker signing) for Cosign because - again - DCT is home-grown garbage.


Well said, these are the reasons I use podman/buildah now.


Was it a drop-in replacement or did you need to give things a jiggle to get it going?


Similar to sibling comment, mostly just works out of the box. Things I've noticed that are different:

- There's no docker.sock API socket by default, which makes sense because Podman doesn't have Docker's daemon-based architecture. You can run one with `podman system service [...]` if you have non-CLI clients expecting to connect to some $DOCKER_HOST (sketch after this list)

- Mounts/volumes behave subtly differently IIRC; I had a bunch of scripts that didn't work with podman-as-docker out of the box. I think the issue was that Docker will create some directories automatically that Podman won't.

Aside from that, rootless Podman is great and I've had far fewer issues with it than I did with rootless Docker. The only real caveat is that `-p <some port>` doesn't actually connect to the host; it's still namespaced. You have to use `--net=host` for actual host networking.
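
To expand on the socket point above, a minimal sketch for rootless Podman (the socket path assumes a systemd-style /run/user layout):

    # expose a Docker-compatible API socket from rootless Podman
    podman system service --time=0 unix:///run/user/$UID/podman/podman.sock &

    # point Docker API clients (compose, SDKs, even the docker CLI) at it
    export DOCKER_HOST=unix:///run/user/$UID/podman/podman.sock
    docker ps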


The only reason I have used buildx is the easy multi-arch for the same tag (one big manifest). How do you do this with Podman/Buildah?

The mount/volume issues are probably caused by SELinux / other security decisions


I recently took the plunge in a couple of my CI pipelines and had little trouble. The only thing that actually broke was that the `--cache-from` option doesn't accept a tag, whereas it does for docker. Everything else pretty much Just Worked.

That said, my pipelines are fairly simple, as are my Dockerfiles, so...

It shaved significant time off my pipelines, though, not having to wait for a docker-in-docker service to spin up (I use Gitlab with Kubernetes runners).


I've had a good experience with Podman as a Docker alternative. The Podman CLI is designed to be API-compatible with Docker's CLI, so most things should just work. The only thing I've had to change with my Dockerfiles (or Containerfiles) is the `SHELL` instruction, since Podman by default builds OCI images, and that instruction is not supported by the OCI standard. But you can still specify the shell individually for the `RUN` instructions, or even just use `--format=docker` if you don't want OCI images.
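
Concretely, the two options look something like this (image name is a placeholder):

    # keep Docker-format images so SHELL keeps working
    podman build --format=docker -t myapp:dev .

    # or stay with the default OCI format and pick the shell per RUN instruction,
    # e.g. in the Containerfile:
    #   RUN ["/bin/bash", "-c", "echo hello"]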


Honestly, when I dug into the podman/buildah code it was just differently cursed. It was really disappointing.


I've spent a bunch of time playing around with Docker's API directly. BuildKit is notable because its API is completely undocumented AFAICT - certainly the API listed under https://docs.docker.com/engine/api/v1.44/#tag/Image/operatio... is just the API for a normal non-BuildKit build. If you want a BuildKit build, you have to use the Docker CLI.

I did some digging and wiresharking, and I'm pretty sure it's undocumented because the API is _insane_. It starts as HTTP/1 from the Docker client to the Docker engine, but a key BuildKit feature is that the engine pulls files & data from the client on-demand, which is hard in a normal REST API, so how does it do that? By renegotiating the connection to flip the direction after the client connects.

That means: the client sends an HTTP/1 request, the server upgrades the connection to HTTP/2, and then the roles flip: the server becomes an HTTP client and the client becomes an HTTP server, still on the same existing connection. All actual communication then happens as gRPC, but backwards.

Absolute madness, and very difficult to document or support in 3rd party SDKs (and so they haven't) but it's very clever. Some more context here: https://twitter.com/i/web/status/1423353288129396740


This is one of the things that led us to write this post! It's almost all undocumented, as you mention, and there are quite a lot of complexities beneath the surface that often feel very inefficient as well.

We're working on another project in this realm that might interest you. Happy to send over more details via email if you're interested in better build APIs. My contact info is in my profile.


Does anyone have a good reference on how exactly Docker/BuildKit/Podman/… build Dockerfiles or, more generally, run containers using the underlying OCI container runtime (runc/crun/…)? I think I have a rough understanding but I'm particularly interested in how exactly the rootless setup works, the details of layer caching (beyond "it uses overlayfs"), etc.


Not to shill another blog post of ours in here, but maybe this is helpful.

It's not exactly what you're asking for as it relates to running containers via runc. But this walks through how OCI layers are actually built up behind the scenes. In case it's helpful: https://depot.dev/blog/building-container-layers-from-scratc...


Thanks, I had come across that post already but I find the structure of OCI images to be the easy part to understand. The tougher part is to get runc to run everything without root. :)

Anyway, fortunately I seem to have found a solution for now (running runc with an overlay rootfs without root), see the link in the other sibling/nephew comment I posted.


I got familiar with container internals via, basically, "building images without docker"

One example: https://ravichaganti.com/blog/2022-11-28-building-container-...


Thanks for the link!

> I got familiar with container internals via, basically, "building images without docker"

That's been my approach, as well. However, right now I'm stuck at getting overlayfs to work without privileges (easy: use `unshare`), while not breaking rootless runc (apparently not so easy).
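
For anyone following along, the "easy" part is roughly this on a kernel new enough to allow unprivileged overlayfs inside a user namespace (around 5.11+); directory names are arbitrary:

    # become (mapped) root in a fresh user + mount namespace
    unshare --map-root-user --mount sh -c '
      mkdir -p lower upper work merged
      mount -t overlay overlay \
        -o lowerdir=lower,upperdir=upper,workdir=work merged
      ls merged
    '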


EDIT: I solved my issue. Looks like I don't need to `unshare` by hand; runc actually does this already:

https://github.com/psanford/runc-examples/blob/master/rootle...


Did you ever try running any containers without pushing to a registry? That is the part I can't wrap my head around yet. How do I get the runtime to start the container locally?


You can build an image with docker, and run it locally.

You can also use `docker save` to get a tarball and ship that file to another machine, where it can be loaded with `docker load` and then run as if it was built locally.
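
The save/load round trip is just this (names are placeholders):

    docker save -o myapp.tar myapp:latest    # on the build machine
    # copy myapp.tar across however you like (scp, rsync, ...)
    docker load -i myapp.tar                 # on the target machine
    docker run --rm myapp:latest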

If you have an oci bundle, you might look at runc instead: https://github.com/opencontainers/runc


I'm not sure I'm following. What part of running a container would entail pushing anything to a registry?


So how did you "further optimize [buildkit] to build Docker images up to 40x faster"?


Hey, thanks for the question! Depot co-founder here.

We've optimized BuildKit for remote container builds with Depot. We've added things like a different `--load` that pulls back only the layers that have actually changed between the build and what's already on the client. We've also added automatic eStargz support, the ability to `--push` and `--load` at the same time, and the ability to push to multiple registries in parallel.

We've removed saving/loading layer cache over the network. Instead, the BuildKit builder is ephemeral, and we orchestrate the cache across builds by persisting the layer cache to Ceph and reattaching it on the next build.

The largest speedup with Depot is that we build on native CPUs so that you can avoid emulation. We run native Intel and Arm builders with 16 CPUs and 32GB of memory inside of AWS. We also have the ability to run these builders in your own cloud account with a self-hosted data plane.

So the bulk of the speed comes from persisting the layer cache across builds with Ceph and from building on native CPUs. The optimized portions of BuildKit mostly help post-build today. That said, we are working on some things in the middle of the build, related to BuildKit's DAG structure, that will also optimize the front of the build.


> with Ceph

Seeing that reminded me of some healthy discussion in https://news.ycombinator.com/item?id=39235593 (SeaweedFS fast distributed storage system for blobs, objects, files and datalake) that may interest you. control-f for "Ceph" to see why it caught my eye


Hopefully Depot will reply, but from my perspective it's mostly laid out on their homepage: they're comparing against builds in other CI products that use network-backed disks, virtualized hardware, and don't keep a layer cache around. Depot provides fast hardware and disks and is good at making the layer cache available for subsequent builds.

You could likely get very similar performance by provisioning a single host with good hardware and simply leverage the on-host cache.
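
If you go the single-host route, a rough sketch with vanilla buildx would be a long-lived builder plus a local cache directory that survives between builds (names/paths are illustrative):

    docker buildx create --name ci-builder --driver docker-container --use

    docker buildx build \
      --cache-to type=local,dest=/var/cache/buildkit,mode=max \
      --cache-from type=local,src=/var/cache/buildkit \
      -t myapp:ci .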


I did not know about the CargoWharf frontend!

Writing your own frontend for BuildKit can be pretty simple. At my work, this is sort of a starter task for everyone who joins the team.

My frontend was based on intercal and I don't recommend anyone use it but it was fun to play around with[1].

[1]: https://github.com/adamgordonbell/compiling-containers/tree/...
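
For anyone curious about the plumbing, invoking a custom frontend with buildctl looks roughly like this (the frontend image name is made up):

    buildctl build \
      --frontend gateway.v0 \
      --opt source=ghcr.io/example/my-frontend:latest \
      --local context=. \
      --local dockerfile=.

    # Dockerfile-compatible frontends can also be selected via the syntax directive
    # at the top of the file:  # syntax=ghcr.io/example/my-frontend:latest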


Really? Just busy work? Senior engineers incapable of thinking of good onboarding tasks?


Nope, I mean I can see how you'd think that, but familiarity with BuildKit is just part of the job.

We are building on top of it, and sometimes adding to it.


Hi Adam! If I may ask, why did you go with BuildKit (instead of, say, just calling runc on your own) and what has your experience been like?


We use a lot of what buildkit does.

For instance, building up the DAG of all the build steps and then scheduling things so that various parts can be built in parallel. BuildKit does a lot that we use beyond just being a way to build things inside a runc container.
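
As a concrete illustration (not Earthly's actual setup), two stages with no dependency between them get scheduled concurrently by BuildKit's DAG solver, and only what the final stage references gets exported:

    # two independent stages; BuildKit builds them in parallel
    FROM golang:1.22 AS api
    WORKDIR /src
    COPY api/ .
    RUN mkdir -p /out && go build -o /out/api .

    FROM node:20 AS web
    WORKDIR /src
    COPY web/ .
    RUN npm ci && npm run build

    # the final stage depends on both
    FROM alpine:3.19
    COPY --from=api /out/api /usr/local/bin/api
    COPY --from=web /src/dist /srv/www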

That said, we have a fork of buildkit for the various things we add that don't fit well in the upstream.

Our auto-skip feature and our branching are already implemented on top of BuildKit rather than using it. As we grow and add more build-centric features, we'll probably continue to diverge.

I'm just a DevRel person though, so that's just my 2 cents. The core team may disagree with me.


> building up the DAG of all the build steps and then scheduling things so that various parts can be built in parallel

This is one of my hobbies as a programmer and I'm extremely interested in this area. Can you share more or is it a trade secret?

Furthermore, do you have links to tools or papers that deal with this?


Just saw this now.

I don't have any great references, to be honest. It's not that it's a trade secret; it's just an area I don't work on.

The AWK book describes how to build your own Make using AWK. You might like that. An explanation of it, and a Python implementation, can be found here:

https://benhoyt.com/writings/awk-make/

Hopefully that helps.


Should be great, thank you.

I am considering writing a small library/framework to optimize scheduled tasks by building a DAG and inferring which ones are safe to run in parallel and which aren't. So this could be useful.


It seems sane. He works at Earthly, a containerized build platform: https://earthly.dev/


Dagger (https://dagger.io) is a great way to use BuildKit through language SDKs. It's such a better paradigm than Dockerfiles, I cannot imagine going back.

Dagger is by the same folks who brought us Docker. This is their fresh take on solving the problem of container building and much more. BuildKit can do more than build images, and Dagger unlocks that for you.


Dagger is quite cool, and we've been doing some collaboration with their new Zenith stuff. We actually have a module where you can route an image build in your Dagger pipeline to Depot and do interesting things with the result as we stage your image in an ephemeral registry [0].

[0] https://github.com/depot/daggerverse/tree/main/depot


I like the idea of Dagger and have been following it for a while, but it seems like it's tailored to get people to use their Cloud product, e.g. caching.

Is that not the case?


Caching is definitely their product and one of the main benefits.

1. You can get the developer UX improvements without shared caching

2. You can run it in k8s and get shared caches there. I believe it should support any BuildKit caching solutions, and they will help you figure that out in Discord


Staged docker builds seem like a bad idea. Why would you want to tie your build specification to your runtime specification when they are inherently different concerns? You lose out on independent evolution and isolation between the two. And for organizations, it forces a structure where a single team needs to manage both.

What problems does it solve that cannot be easily achieved with separate build and runtime files?


I use both, so here is what I'm doing and why I find value in staged Docker builds.

I am running a (relatively new) NX monorepo, which deploys ~6 microservices to various K8s environments using Skaffold. All services are Typescript, Node 20.11, with pnpm as a package manager.

I have separate build files for certain common base images, which mostly are Node-($v)-alpine with a few CI/common dependencies installed. I use Skaffold to compose the image builds.

When I Dockerize each service, I need to install dependencies for that specific service only, and in that case the install/build specification must be the same as the runtime specification. Rather than creating separate build image files and composing them with Skaffold, I simply use a build stage to install the microservice dependencies (using one of the base images with NX and PNPM), then copy them to the final image (which does not have those deps). This is a pretty common pattern, and I like working with it. The advantages to build stages are readability and simplicity (no extra orchestration needed), and the whole setup makes it very difficult to make mistakes.
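
A stripped-down sketch of that pattern (image tags, file names, and paths here are illustrative, and the real base images described above already carry the NX/pnpm tooling):

    # build stage: has pnpm, installs only this service's production deps
    FROM node:20.11-alpine AS deps
    RUN corepack enable && corepack prepare pnpm@latest --activate
    WORKDIR /app
    COPY package.json pnpm-lock.yaml ./
    RUN pnpm install --frozen-lockfile --prod

    # runtime stage: no package manager, just the app plus the copied deps
    FROM node:20.11-alpine
    WORKDIR /app
    COPY --from=deps /app/node_modules ./node_modules
    COPY dist/ ./dist
    CMD ["node", "dist/main.js"]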



