The focus of this release lies on:

- a better developer experience, allowing users to debug any installed package without extra setup steps
- performance improvements in all areas (starting programs, building distri packages, generating distri images)
- better tooling for keeping track of upstream versions
See the release notes for more details.
The distri research linux distribution project was started in 2019 to research whether a few architectural changes could enable drastically faster package management.
While the package managers in common Linux distributions (e.g. apt, dnf, …) top out at data rates of only a few MB/s, distri effortlessly saturates 1 Gbit, 10 Gbit and even 40 Gbit connections, resulting in fast installation and update speeds.
Packages (e.g. emacs) are hermetic. By hermetic, I mean that the dependencies a package uses (e.g. libusb) don’t change, even when newer versions are installed.
For example, if package libusb-amd64-1.0.22-7 is available at build time, the package will always use that same version, even after the newer libusb-amd64-1.0.23-8 is installed into the package store.
Another way of saying the same thing is: packages in distri are always co-installable.
This makes the package store more robust: additions to it will not break the system. On a technical level, the package store is implemented as a directory containing distri SquashFS images and metadata files, into which packages are installed in an atomic way.
One exception where hermeticity is not desired is plugin mechanisms: optionally loading out-of-tree code at runtime obviously is not hermetic.
As an example, consider glibc’s Name Service Switch (NSS) mechanism. Section 29.4.1 “Adding another Service to NSS” of the glibc manual describes how glibc searches $prefix/lib for shared libraries at runtime.
Debian ships about a dozen NSS libraries for a variety of purposes, and enterprise setups might add their own into the mix.
systemd (as of v245) accounts for 4 NSS libraries, e.g. nss-systemd for user/group name resolution for users allocated through systemd’s DynamicUser= option.
Having packages be as hermetic as possible remains a worthwhile goal despite any exceptions: I will gladly use a 99% hermetic system over a 0% hermetic system any day.
Side note: Xorg’s driver model (which can be characterized as a plugin mechanism) does not fall under this category because of its tight API/ABI coupling! For this case, where drivers are only guaranteed to work with precisely the Xorg version for which they were compiled, distri uses per-package exchange directories.
On a technical level, the requirement is: all paths used by the program must always result in the same contents. This is implemented in distri via the read-only package store mounted at /ro, e.g. files underneath /ro/emacs-amd64-26.3-15 never change.
In practice, three strategies cover most of the paths a program uses:
Programs on Linux use the ELF file format, which contains two kinds of references:
First, the ELF interpreter (PT_INTERP segment), which is used to start the program. For dynamically linked programs on 64-bit systems, this is typically ld.so(8).
Many distributions use system-global paths such as /lib64/ld-linux-x86-64.so.2, but distri compiles programs with -Wl,--dynamic-linker=/ro/glibc-amd64-2.31-4/out/lib/ld-linux-x86-64.so.2 so that the full path ends up in the binary.
The ELF interpreter is shown by file(1), but you can also use readelf -a $BINARY | grep 'program interpreter' to display it.
And secondly, the rpath, a run-time search path for dynamic libraries. Instead of storing full references to all dynamic libraries, we set the rpath so that ld.so(8) will find the correct dynamic libraries.
Originally, we used to just set a long rpath, containing one entry for each dynamic library dependency. However, we have since switched to using a single lib subdirectory per package as its rpath, and placing symlinks with full path references into that lib directory, e.g. using -Wl,-rpath=/ro/grep-amd64-3.4-4/lib. This is better for performance, as ld.so uses a per-directory cache.
Note that program load times are significantly influenced by how quickly you can locate the dynamic libraries. distri uses a FUSE file system to load programs from, so getting proper -ENOENT caching into place drastically sped up program load times.
Instead of compiling software with the -Wl,--dynamic-linker and -Wl,-rpath flags, one can also modify these fields after the fact using patchelf(1). For closed-source programs, this is the only possibility.

The rpath can be inspected by using e.g. readelf -a $BINARY | grep RPATH.
Many programs are influenced by environment variables: to start another program, said program is often found by checking each directory in the PATH environment variable.

Such search paths are prevalent in scripting languages, too, to find modules. Python has PYTHONPATH, Perl has PERL5LIB, and so on.
To set up these search path environment variables at run time, distri employs an indirection. Instead of e.g. teensy-loader-cli, you run a small wrapper program that makes precisely one execve system call with the desired environment variables.
Initially, I used shell scripts as wrapper programs because they are easily inspectable. This turned out to be too slow, so I switched to compiled programs. I’m linking them statically for fast startup, and I’m linking them against musl libc for significantly smaller file sizes than glibc (per-executable overhead adds up quickly in a distribution!).
Note that the wrapper programs prepend to the PATH environment variable; they don’t replace it in its entirety. This is important so that users have a way to extend the PATH (and other variables) if they so choose. This doesn’t hurt hermeticity because it is only relevant for programs that were not present at build time, i.e. plugin mechanisms which, by design, cannot be hermetic.
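The wrapper idea can be sketched in Go (the real wrappers are statically linked against musl libc, and the target path is embedded at build time; prependVar is a name made up for this sketch):

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// prependVar prepends dir to the search path variable key (e.g. PATH),
// keeping any existing value so that user extensions survive.
func prependVar(env []string, key, dir string) []string {
	prefix := key + "="
	for i, kv := range env {
		if strings.HasPrefix(kv, prefix) {
			env[i] = prefix + dir + ":" + strings.TrimPrefix(kv, prefix)
			return env
		}
	}
	return append(env, prefix+dir)
}

func main() {
	// Fully qualified target path, embedded at build time in real wrappers.
	const target = "/ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli"
	env := prependVar(os.Environ(), "PATH", "/ro/teensy-loader-cli-amd64-2.1+g20180927-7/bin")
	if _, err := os.Stat(target); err == nil {
		// The wrapper's single system call: replace ourselves with the target.
		if err := syscall.Exec(target, append([]string{target}, os.Args[1:]...), env); err != nil {
			panic(err)
		}
	}
	// Outside of distri the target does not exist; show the effect instead.
	fmt.Println(prependVar([]string{"PATH=/usr/bin:/bin"}, "PATH", "/ro/x/bin")[0])
}
```

Because execve(2) replaces the wrapper process entirely, the indirection leaves no extra process behind.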
The shebang line of scripts contains a path, too, and hence needs to be changed.
We don’t do this in distri yet (the number of packaged scripts is small), but we should.
The performance improvements in the previous sections are not just good to have, but practically required when many processes are involved: without them, you’ll encounter second-long delays in magit, which spawns many git processes under the covers, or in dracut, which spawns one cp(1) process per file.
Linux distributions such as Debian consider it an advantage to roll out security fixes to the entire system by updating a single shared library package (e.g. openssl).
The flip side of that coin is that changes to a single critical package can break the entire system.
With hermetic packages, all reverse dependencies must be rebuilt when a library’s changes should be picked up by the whole system. E.g., when openssl changes, curl must be rebuilt to pick up the new version of openssl.
This approach trades off using more bandwidth and more disk space (temporarily) against reducing the blast radius of any individual package update.
This can be partially mitigated by removing empty directories at build time, which results in shorter search path variables.
In general, there is no getting around this. One little trick is to use tr : '\n', e.g.:
distri0# echo $PATH
/usr/bin:/bin:/usr/sbin:/sbin:/ro/openssh-amd64-8.2p1-11/out/bin
distri0# echo $PATH | tr : '\n'
/usr/bin
/bin
/usr/sbin
/sbin
/ro/openssh-amd64-8.2p1-11/out/bin
The implementation outlined above works well in hundreds of packages, and only a small handful exhibited problems of any kind. Here are some issues I encountered:
NSS libraries built against glibc 2.28 and newer cannot be loaded by glibc 2.27. In all likelihood, such changes do not happen too often, but it does illustrate that glibc’s published interface spec is not sufficient for forwards and backwards compatibility.
In distri, we could likely use a per-package exchange directory for glibc’s NSS mechanism to prevent the above problem from happening in the future.
Some programs try to arrange for themselves to be re-executed outside of their
current process tree. For example, consider building a program with the meson
build system:
When meson first configures the build, it generates ninja files (think Makefiles) which contain command lines that run the meson --internal helper.

Once meson returns, ninja is called as a separate process, so it will not have the environment which the meson wrapper sets up. ninja then runs the previously persisted meson command line. Since the command line uses the full path to meson (not to its wrapper), it bypasses the wrapper.
Luckily, not many programs try to arrange for other process trees to run them. Here is a table summarizing how affected programs might try to arrange for re-execution, whether the technique results in a wrapper bypass, and what we do about it in distri:
| technique to execute itself | uses wrapper | mitigation |
|---|---|---|
| run-time: find own basename in PATH | yes | wrapper program |
| compile-time: embed expected path | no; bypass! | configure or patch |
| run-time: argv[0] or /proc/self/exe | no; bypass! | patch |
One might think that setting argv[0] to the wrapper location seems like a way to side-step this problem. We tried doing this in distri, but had to revert and go the other way.
For example, login shells are started with a - character prepended to argv[0], so shells like bash or zsh cannot use wrapper programs.

At a very high level, adopting hermetic packages will require two steps:
Using fully qualified paths whose contents don’t change (e.g. /ro/emacs-amd64-26.3-15) generally requires rebuilding programs, e.g. with --prefix set.
Once you use fully qualified paths, you need to make the packages able to exchange data. distri solves this with exchange directories, implemented in the /ro file system which is backed by a FUSE daemon.
The first step is pretty simple, whereas the second step is where I expect controversy around any suggested mechanism.
This appendix contains commands and their outputs, run on the upcoming distri version supersilverhaze, but verified to work on older versions, too.
Large outputs have been collapsed and can be expanded by clicking on the output.
The /bin directory contains symlinks for the union of all packages’ bin subdirectories:
distri0# readlink -f /bin/teensy_loader_cli
/ro/teensy-loader-cli-amd64-2.1+g20180927-7/bin/teensy_loader_cli
The wrapper program in the bin subdirectory is small:
distri0# ls -lh $(readlink -f /bin/teensy_loader_cli)
-rwxr-xr-x 1 root root 46K Apr 21 21:56 /ro/teensy-loader-cli-amd64-2.1+g20180927-7/bin/teensy_loader_cli
Wrapper programs execute quickly:
distri0# strace -fvy /bin/teensy_loader_cli |& head | cat -n
1 execve("/bin/teensy_loader_cli", ["/bin/teensy_loader_cli"], ["USER=root", "LOGNAME=root", "HOME=/root", "PATH=/ro/bash-amd64-5.0-4/bin:/r"..., "SHELL=/bin/zsh", "TERM=screen.xterm-256color", "XDG_SESSION_ID=c1", "XDG_RUNTIME_DIR=/run/user/0", "DBUS_SESSION_BUS_ADDRESS=unix:pa"..., "XDG_SESSION_TYPE=tty", "XDG_SESSION_CLASS=user", "SSH_CLIENT=10.0.2.2 42556 22", "SSH_CONNECTION=10.0.2.2 42556 10"..., "SSH_TTY=/dev/pts/0", "SHLVL=1", "PWD=/root", "OLDPWD=/root", "_=/usr/bin/strace", "LD_LIBRARY_PATH=/ro/bash-amd64-5"..., "PERL5LIB=/ro/bash-amd64-5.0-4/ou"..., "PYTHONPATH=/ro/bash-amd64-5.0-4/"...]) = 0
2 arch_prctl(ARCH_SET_FS, 0x40c878) = 0
3 set_tid_address(0x40ca9c) = 715
4 brk(NULL) = 0x15b9000
5 brk(0x15ba000) = 0x15ba000
6 brk(0x15bb000) = 0x15bb000
7 brk(0x15bd000) = 0x15bd000
8 brk(0x15bf000) = 0x15bf000
9 brk(0x15c1000) = 0x15c1000
10 execve("/ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli", ["/ro/teensy-loader-cli-amd64-2.1+"...], ["USER=root", "LOGNAME=root", "HOME=/root", "PATH=/ro/bash-amd64-5.0-4/bin:/r"..., "SHELL=/bin/zsh", "TERM=screen.xterm-256color", "XDG_SESSION_ID=c1", "XDG_RUNTIME_DIR=/run/user/0", "DBUS_SESSION_BUS_ADDRESS=unix:pa"..., "XDG_SESSION_TYPE=tty", "XDG_SESSION_CLASS=user", "SSH_CLIENT=10.0.2.2 42556 22", "SSH_CONNECTION=10.0.2.2 42556 10"..., "SSH_TTY=/dev/pts/0", "SHLVL=1", "PWD=/root", "OLDPWD=/root", "_=/usr/bin/strace", "LD_LIBRARY_PATH=/ro/bash-amd64-5"..., "PERL5LIB=/ro/bash-amd64-5.0-4/ou"..., "PYTHONPATH=/ro/bash-amd64-5.0-4/"...]) = 0
Confirm which ELF interpreter is set for a binary using readelf(1):
distri0# readelf -a /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli | grep 'program interpreter'
[Requesting program interpreter: /ro/glibc-amd64-2.31-4/out/lib/ld-linux-x86-64.so.2]
Confirm the rpath is set to the package’s lib subdirectory using readelf(1):
distri0# readelf -a /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli | grep RPATH
0x000000000000000f (RPATH) Library rpath: [/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib]
…and verify the lib subdirectory has the expected symlinks and target versions:
distri0# find /ro/teensy-loader-cli-amd64-*/lib -type f -printf '%P -> %l\n'
libc.so.6 -> /ro/glibc-amd64-2.31-4/out/lib/libc-2.31.so
libpthread.so.0 -> /ro/glibc-amd64-2.31-4/out/lib/libpthread-2.31.so
librt.so.1 -> /ro/glibc-amd64-2.31-4/out/lib/librt-2.31.so
libudev.so.1 -> /ro/libudev-amd64-245-11/out/lib/libudev.so.1.6.17
libusb-0.1.so.4 -> /ro/libusb-compat-amd64-0.1.5-7/out/lib/libusb-0.1.so.4.4.4
libusb-1.0.so.0 -> /ro/libusb-amd64-1.0.23-8/out/lib/libusb-1.0.so.0.2.0
To verify the correct libraries are actually loaded, you can set the LD_DEBUG environment variable for ld.so(8):
distri0# LD_DEBUG=libs teensy_loader_cli
[…]
678: find library=libc.so.6 [0]; searching
678: search path=/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib (RPATH from file /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli)
678: trying file=/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib/libc.so.6
678:
[…]
NSS libraries that distri ships:
find /lib/ -name "libnss_*.so.2" -type f -printf '%P -> %l\n'
libnss_myhostname.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_myhostname.so.2
libnss_mymachines.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_mymachines.so.2
libnss_resolve.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_resolve.so.2
libnss_systemd.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_systemd.so.2
libnss_compat.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_compat.so.2
libnss_db.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_db.so.2
libnss_dns.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_dns.so.2
libnss_files.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_files.so.2
libnss_hesiod.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_hesiod.so.2
“[…] initrd is a scheme for loading a temporary root file system into memory, which may be used as part of the Linux startup process […] to make preparations before the real root file system can be mounted.”
Many Linux distributions do not compile all file system drivers into the kernel, but instead load them on-demand from an initramfs, which saves memory.
Another common scenario, in which an initramfs is required, is full-disk encryption: the disk must be unlocked from userspace, but since userspace is encrypted, an initramfs is used.
Thus far, building a distri disk image was quite slow:
This is on an AMD Ryzen 3900X 12-core processor (2019):
distri % time make cryptimage serial=1
80.29s user 13.56s system 186% cpu 50.419 total # 19s image, 31s initrd
Of these 50 seconds, dracut’s initramfs generation accounts for 31 seconds (62%)!
Initramfs generation time drops to 8.7 seconds once dracut no longer needs to use the single-threaded gzip(1), but can use the multi-threaded replacement pigz(1).
This brings the total time to build a distri disk image down to:
distri % time make cryptimage serial=1
76.85s user 13.23s system 327% cpu 27.509 total # 19s image, 8.7s initrd
Clearly, when you use dracut on any modern computer, you should make pigz available. dracut should fail to compile unless one explicitly opts into the known-slower gzip. For more thoughts on optional dependencies, see “Optional dependencies don’t work”.
But why does it take 8.7 seconds still? Can we go faster?
The answer is Yes! I recently built a distri-specific initramfs I’m calling minitrd. I wrote both big parts from scratch:

- the initramfs generator (distri initrd)
- a custom init program (cmd/minitrd), running as /init in the initramfs

minitrd generates the initramfs image in ≈400ms, bringing the total time down to:
distri % time make cryptimage serial=1
50.09s user 8.80s system 314% cpu 18.739 total # 18s image, 400ms initrd
(The remaining time is spent in preparing the file system, then installing and configuring the distri system, i.e. preparing a disk image you can run on real hardware.)
How can minitrd be 20 times faster than dracut?
dracut is mainly written in shell, with a C helper program. It drives the generation process by spawning lots of external dependencies (e.g. ldd or the dracut-install helper program). I assume that the combination of using an interpreted language (shell) that spawns lots of processes and precludes a concurrent architecture is to blame for the poor performance.
minitrd is written in Go, with speed as a goal. It leverages concurrency and uses no external dependencies; everything happens within a single process (but with enough threads to saturate modern hardware).
Measuring early boot time using qemu, the dracut-generated initramfs took 588ms to display the full disk encryption passphrase prompt, whereas minitrd took only 195ms.
The rest of this article dives deeper into how minitrd works.
Ultimately, the job of an initramfs is to make the root file system available and continue booting the system from there. Depending on the system setup, this involves the following 5 steps:
Depending on the system, the block devices with the root file system might already be present when the initramfs runs, or some kernel modules might need to be loaded first. On my Dell XPS 9360 laptop, the NVMe system disk is already present when the initramfs starts, whereas in qemu, we need to load the virtio_pci module, followed by the virtio_scsi module.
How will our userland program know which kernel modules to load? Linux kernel modules declare patterns for their supported hardware as an alias, e.g.:
initrd# grep virtio_pci lib/modules/5.4.6/modules.alias
alias pci:v00001AF4d*sv*sd*bc*sc*i* virtio_pci
Devices in sysfs have a modalias file whose content can be matched against these declarations to identify the module to load:
initrd# cat /sys/devices/pci0000:00/*/modalias
pci:v00001AF4d00001005sv00001AF4sd00000004bc00scFFi00
pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00
[…]
Hence, for the initial round of module loading, it is sufficient to locate all modalias files within sysfs and load the responsible modules.
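The matching step can be sketched as follows (not minitrd’s actual code; filepath.Match stands in for fnmatch-style glob matching, and matchAlias is a hypothetical helper):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// matchAlias returns the module names whose alias pattern (a glob from
// modules.alias) matches the given modalias string read from sysfs.
func matchAlias(aliasLines []string, modalias string) []string {
	var modules []string
	for _, line := range aliasLines {
		fields := strings.Fields(line) // "alias <pattern> <module>"
		if len(fields) != 3 || fields[0] != "alias" {
			continue
		}
		// modalias strings contain no '/', so filepath.Match behaves
		// like fnmatch(3) here.
		if ok, _ := filepath.Match(fields[1], modalias); ok {
			modules = append(modules, fields[2])
		}
	}
	return modules
}

func main() {
	aliases := []string{
		"alias pci:v00001AF4d*sv*sd*bc*sc*i* virtio_pci",
	}
	// modalias as read from /sys/devices/pci0000:00/*/modalias:
	fmt.Println(matchAlias(aliases, "pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00")) // [virtio_pci]
}
```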
Loading a kernel module can result in new devices appearing. When that happens, the kernel sends a uevent, which the uevent consumer in userspace receives via a netlink socket. Typically, this consumer is udev(7), but in our case, it’s minitrd.
For each uevent message that comes with a MODALIAS variable, minitrd will load the relevant kernel module(s).
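A uevent datagram is a header like add@/devicepath followed by NUL-separated KEY=VALUE pairs; extracting MODALIAS can be sketched as follows (a hypothetical helper, not minitrd’s code):

```go
package main

import (
	"fmt"
	"strings"
)

// parseUevent parses a kernel uevent datagram as received from a
// NETLINK_KOBJECT_UEVENT socket: a header such as "add@/devices/...",
// followed by NUL-separated KEY=VALUE pairs.
func parseUevent(buf []byte) map[string]string {
	vars := make(map[string]string)
	for _, field := range strings.Split(string(buf), "\x00") {
		if k, v, ok := strings.Cut(field, "="); ok {
			vars[k] = v
		}
	}
	return vars
}

func main() {
	msg := []byte("add@/devices/pci0000:00/0000:00:04.0\x00" +
		"ACTION=add\x00" +
		"MODALIAS=pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00\x00")
	vars := parseUevent(msg)
	fmt.Println(vars["MODALIAS"]) // pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00
}
```

The extracted MODALIAS value is then matched against modules.alias exactly like in the initial sysfs round.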
When loading a kernel module, its dependencies need to be loaded first. Dependency information is stored in the modules.dep file in a Makefile-like syntax:
initrd# grep virtio_pci lib/modules/5.4.6/modules.dep
kernel/drivers/virtio/virtio_pci.ko: kernel/drivers/virtio/virtio_ring.ko kernel/drivers/virtio/virtio.ko
To load a module, we can open its file and then call the Linux-specific finit_module(2) system call. Some modules are expected to return an error code, e.g. ENODEV or ENOENT when some hardware device is not actually present.
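Deriving a load order from a modules.dep line can be sketched as follows (an assumption in this sketch: depmod lists transitive dependencies such that loading them right-to-left, deepest dependency first, is a valid insertion order):

```go
package main

import (
	"fmt"
	"strings"
)

// parseDepLine parses one modules.dep line of the form
// "module.ko: dep1.ko dep2.ko".
func parseDepLine(line string) (string, []string) {
	mod, deps, ok := strings.Cut(line, ":")
	if !ok {
		return strings.TrimSpace(line), nil
	}
	return strings.TrimSpace(mod), strings.Fields(deps)
}

// loadOrder returns the order in which to call finit_module(2):
// listed dependencies in reverse, then the module itself.
func loadOrder(line string) []string {
	mod, deps := parseDepLine(line)
	var order []string
	for i := len(deps) - 1; i >= 0; i-- {
		order = append(order, deps[i])
	}
	return append(order, mod)
}

func main() {
	line := "kernel/drivers/virtio/virtio_pci.ko: kernel/drivers/virtio/virtio_ring.ko kernel/drivers/virtio/virtio.ko"
	fmt.Println(loadOrder(line))
}
```

Already-loaded modules simply return EEXIST from finit_module(2), so re-loading a shared dependency is harmless.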
Side note: next to the textual versions, there are also binary versions of the modules.alias and modules.dep files. Presumably, those can be queried more quickly, but for simplicity, I have not (yet?) implemented support in minitrd.
Setting a legible font is necessary for hi-dpi displays. On my Dell XPS 9360 (3200 x 1800 QHD+ display), the following works well:
initrd# setfont latarcyrheb-sun32
Setting the user’s keyboard layout is necessary for entering the LUKS full-disk encryption passphrase in their preferred keyboard layout. I use the NEO layout:
initrd# loadkeys neo
In the Linux kernel, block device enumeration order is not necessarily the same on each boot. Even if it was deterministic, device order could still be changed when users modify their computer’s device topology (e.g. connect a new disk to a formerly unused port).
Hence, it is good style to refer to disks and their partitions with stable identifiers. This also applies to boot loader configuration, and so most distributions will set a kernel parameter such as root=UUID=1fa04de7-30a9-4183-93e9-1b0061567121.
Identifying the block device or partition with the specified UUID is the initramfs’s job.
Depending on what the device contains, the UUID comes from a different place. For example, ext4 file systems have a UUID field in their file system superblock, whereas LUKS volumes have a UUID in their LUKS header.
Canonically, probing a device to extract the UUID is done by libblkid from the util-linux package, but the logic can easily be re-implemented in other languages and changes rarely. minitrd comes with its own implementation to avoid cgo or running the blkid(8) program.
Unlocking a LUKS-encrypted volume is done in userspace. The kernel handles the crypto, but reading the metadata, obtaining the passphrase (or e.g. key material from a file) and setting up the device mapper table entries are done in user space.
initrd# modprobe algif_skcipher
initrd# cryptsetup luksOpen /dev/sda4 cryptroot1
After the user has entered their passphrase, the root file system can be mounted:
initrd# mount /dev/dm-0 /mnt
Now that everything is set up, we need to pass execution to the init program on the root file system with a careful sequence of chdir(2), mount(2), chroot(2), chdir(2) and execve(2) system calls that is explained in this busybox switch_root comment.
initrd# mount -t devtmpfs dev /mnt/dev
initrd# exec switch_root -c /dev/console /mnt /init
To conserve RAM, the files in the temporary file system to which the initramfs archive is extracted are typically deleted.
An initramfs “image” (more accurately: archive) is a compressed cpio archive. Typically, gzip compression is used, but the kernel supports a bunch of different algorithms and distributions such as Ubuntu are switching to lz4.
Generators typically prepare a temporary directory and feed it to the cpio(1) program. In minitrd, we read the files into memory and generate the cpio archive using the go-cpio package. We use the pgzip package for parallel gzip compression.
The following files need to go into the cpio archive:
The minitrd binary is copied into the cpio archive as /init and will be run by the kernel after extracting the archive.

Like the rest of distri, minitrd is built statically without cgo, which means it can be copied as-is into the cpio archive.
Aside from the modules.alias and modules.dep metadata files, the kernel modules themselves reside in e.g. /lib/modules/5.4.6/kernel and need to be copied into the cpio archive.
Copying all modules results in a ≈80 MiB archive, so it is common to only copy modules that are relevant to the initramfs’s features. This reduces archive size to ≈24 MiB.
The filtering relies on hard-coded patterns and module names. For example, disk encryption related modules are all kernel modules underneath kernel/crypto, plus kernel/drivers/md/dm-crypt.ko.
When generating a host-only initramfs (works on precisely the computer that generated it), some initramfs generators look at the currently loaded modules and just copy those.
The kbd package’s setfont(8) and loadkeys(1) programs load console fonts and keymaps from /usr/share/consolefonts and /usr/share/keymaps, respectively.
Hence, these directories need to be copied into the cpio archive. Depending on whether the initramfs should be generic (work on many computers) or host-only (works on precisely the computer/settings that generated it), the entire directories are copied, or only the required font/keymap.
These programs are (currently) required because minitrd does not implement their functionality.
As they are dynamically linked, not only the programs themselves need to be copied, but also the ELF dynamic linking loader (path stored in the .interp ELF section) and any ELF library dependencies.
For example, cryptsetup in distri declares the ELF interpreter /ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2 and declares dependencies on shared libraries libcryptsetup.so.12, libblkid.so.1 and others. Luckily, in distri, packages contain a lib subdirectory containing symbolic links to the resolved shared library paths (hermetic packaging), so it is sufficient to mirror the lib directory into the cpio archive, recursing into shared library dependencies of shared libraries.
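The inspection part can be sketched with Go’s standard debug/elf package (not minitrd’s actual code; the recursion over each library’s own DT_NEEDED entries is omitted here):

```go
package main

import (
	"bytes"
	"debug/elf"
	"fmt"
)

// elfInfo returns a binary's ELF interpreter (the .interp section)
// and its directly imported shared libraries (DT_NEEDED entries).
func elfInfo(path string) (interp string, needed []string, err error) {
	f, err := elf.Open(path)
	if err != nil {
		return "", nil, err
	}
	defer f.Close()
	if s := f.Section(".interp"); s != nil {
		b, err := s.Data()
		if err != nil {
			return "", nil, err
		}
		interp = string(bytes.TrimRight(b, "\x00"))
	}
	needed, err = f.ImportedLibraries()
	return interp, needed, err
}

func main() {
	// Inspect whichever common binaries exist on this machine.
	for _, p := range []string{"/bin/sh", "/usr/bin/env"} {
		if interp, needed, err := elfInfo(p); err == nil {
			fmt.Println(p, interp, needed)
		}
	}
}
```

A generator would then resolve each DT_NEEDED entry via the package’s lib symlink directory and repeat the lookup for the libraries it finds.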
cryptsetup also requires the GCC runtime library libgcc_s.so.1 to be present at runtime, and will abort with an error message about not being able to call pthread_cancel(3) if it is unavailable.
To print log messages in the correct time zone, we copy /etc/localtime from the host into the cpio archive.
I currently have no desire to make minitrd available outside of distri. While the technical challenges (such as extending the generator to not rely on distri’s hermetic packages) are surmountable, I don’t want to support people’s initramfs remotely.
Also, I think that people’s efforts should in general be spent on rallying behind dracut and making it work faster, thereby benefiting all Linux distributions that use dracut (an increasing number). With minitrd, I have demonstrated that significant speed-ups are achievable.
It was interesting to dive into how an initramfs really works. I had been working with the concept for many years, from small tasks such as “debug why the encrypted root file system is not unlocked” to more complicated tasks such as “set up a root file system on DRBD for a high-availability setup”. But even with that sort of experience, I didn’t know all the details, until I was forced to implement every little thing.
As I suspected going into this exercise, dracut is much slower than it needs to be. Re-implementing its generation stage in a modern language instead of shell helps a lot.
Of course, my minitrd does a bit less than dracut, but not drastically so. The overall architecture is the same.
I hope my effort helps with two things:
As a teaching implementation: instead of wading through the various components that make up a modern initramfs (udev, systemd, various shell scripts, …), people can learn about how an initramfs works in a single place.
I hope the significant time difference motivates people to improve dracut.
Before writing any Go code, I did some manual prototyping. Learning how other people prototype is often immensely useful to me, so I’m sharing my notes here.
First, I copied all kernel modules and a statically built busybox binary:
% mkdir -p lib/modules/5.4.6
% cp -Lr /ro/lib/modules/5.4.6/* lib/modules/5.4.6/
% cp ~/busybox-1.22.0-amd64/busybox sh
To generate an initramfs from the current directory, I used:
% find . | cpio -o -H newc | pigz > /tmp/initrd
In distri’s Makefile, I append these flags to the QEMU invocation:
-kernel /tmp/kernel \
-initrd /tmp/initrd \
-append "root=/dev/mapper/cryptroot1 rdinit=/sh ro console=ttyS0,115200 rd.luks=1 rd.luks.uuid=63051f8a-54b9-4996-b94f-3cf105af2900 rd.luks.name=63051f8a-54b9-4996-b94f-3cf105af2900=cryptroot1 rd.vconsole.keymap=neo rd.vconsole.font=latarcyrheb-sun32 init=/init systemd.setenv=PATH=/bin rw vga=836"
The vga= mode parameter is required for loading the latarcyrheb-sun32 font.
Once in the busybox shell, I manually prepared the required mount points and kernel modules:
ln -s sh mount
ln -s sh lsmod
mkdir /proc /sys /run /mnt
mount -t proc proc /proc
mount -t sysfs sys /sys
mount -t devtmpfs dev /dev
modprobe virtio_pci
modprobe virtio_scsi
As a next step, I copied cryptsetup and dependencies into the initramfs directory:
% for f in /ro/cryptsetup-amd64-2.0.4-6/lib/*; do full=$(readlink -f $f); rel=$(echo $full | sed 's,^/,,g'); mkdir -p $(dirname $rel); install $full $rel; done
% ln -s ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2
% cp /ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so
% cp -r /ro/cryptsetup-amd64-2.0.4-6/lib ro/cryptsetup-amd64-2.0.4-6/
% mkdir -p ro/gcc-libs-amd64-8.2.0-3/out/lib64/
% cp /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1
% ln -s /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/cryptsetup-amd64-2.0.4-6/lib
% cp -r /ro/lvm2-amd64-2.03.00-6/lib ro/lvm2-amd64-2.03.00-6/
In busybox, I used the following commands to unlock the root file system:
modprobe algif_skcipher
./cryptsetup luksOpen /dev/sda4 cryptroot1
mount /dev/dm-0 /mnt
This article focuses on the package format and its advantages, but there is more to distri, which I will cover in upcoming blog posts.
I was a Debian Developer for the 7 years from 2012 to 2019, but using the distribution often left me frustrated, ultimately resulting in me winding down my Debian work.
Frequently, I was noticing a large gap between the actual speed of an operation (e.g. doing an update) and the possible speed based on back of the envelope calculations. I wrote more about this in my blog post “Package managers are slow”.
To me, this observation means that either there is potential to optimize the package manager itself (e.g. apt), or what the system does is just too complex. While I remember seeing some low-hanging fruit①, through my work on distri, I wanted to explore whether all the complexity we currently have in Linux distributions such as Debian or Fedora is inherent to the problem space.
I have completed enough of the experiment to conclude that the complexity is not inherent: I can build a Linux distribution for general-enough purposes which is much less complex than existing ones.
① Those were low-hanging fruit from a user perspective. I’m not saying that fixing them is easy in the technical sense; I know too little about apt’s code base to make such a statement.
One key idea is to switch from using archives to using images for package contents. Common package managers such as dpkg(1) use tar(1) archives with various compression algorithms.
distri uses SquashFS images, a comparatively simple file system image format that I happen to be familiar with from my work on the gokrazy Raspberry Pi 3 Go platform.
This idea is not novel: AppImage and snappy also use images, but only for individual, self-contained applications. distri however uses images for distribution packages with dependencies. In particular, there is no duplication of shared libraries in distri.
A nice side effect of using read-only image files is that applications are immutable and can hence not be broken by accidental (or malicious!) modification.
Package contents are made available under a fully-qualified path. E.g., all files provided by package zsh-amd64-5.6.2-3 are available under /ro/zsh-amd64-5.6.2-3. The mountpoint /ro stands for read-only, which is short yet descriptive.
Perhaps surprisingly, building software with custom prefix values of e.g. /ro/zsh-amd64-5.6.2-3 is widely supported, thanks to:

Linux distributions, which build software with prefix set to /usr, whereas FreeBSD (and the autotools default) builds with prefix set to /usr/local.
Enthusiast users in corporate or research environments, who install software into their home directories.
Because using a custom prefix is a common scenario, upstream awareness for prefix-correctness is generally high, and the rarely required patch will be quickly accepted.
Software packages often exchange data by placing or locating files in well-known directories. Here are just a few examples:
- gcc(1) locates the libusb(3) headers via /usr/include
- man(1) locates the nginx(1) manpage via /usr/share/man
- zsh(1) locates executable programs via PATH components such as /bin
In distri, these locations are called exchange directories and are provided via FUSE in /ro.
Exchange directories come in two different flavors:
global. The exchange directory, e.g. /ro/share
, provides the union of the
share
subdirectory of all packages in the package store.
Global exchange directories are largely used for compatibility, see
below.
per-package. Useful for tight coupling: e.g. irssi(1)
does not provide any ABI guarantees, so a plugin such as irssi-robustirc
can declare that it wants
e.g. /ro/irssi-amd64-1.1.1-1/out/lib/irssi/modules
to be a per-package
exchange directory containing files from its lib/irssi/modules
.
Programs which use exchange directories sometimes use search paths to access
multiple exchange directories. In fact, the examples above were taken from gcc(1)
’s INCLUDEPATH
, man(1)
’s MANPATH
and zsh(1)
’s PATH
. These are
prominent ones, but more examples are easy to find: zsh(1)
loads completion functions from its FPATH
.
Some search path values are derived from --datadir=/ro/share
and require no
further attention, but others might derive from
e.g. --prefix=/ro/zsh-amd64-5.6.2-3/out
and need to be pointed to an exchange
directory via a specific command line flag.
Global exchange directories are used to make distri provide enough of the Filesystem Hierarchy Standard (FHS) that third-party software largely just works. This includes a C development environment.
I successfully ran a few programs from their binary packages such as Google Chrome, Spotify, or Microsoft’s Visual Studio Code.
I previously wrote about how Linux distribution package managers are too slow.
distri’s package manager is extremely fast. Its main bottleneck is typically the network link, even on high-speed links (I tested with a 100 Gbps link).
Its speed comes largely from an architecture which allows the package manager to do less work. Specifically:
Package images can be added atomically to the package store, so we can safely
skip fsync(2)
. Corruption will be cleaned up
automatically, and durability is not important: if an interactive
installation is interrupted, the user can just repeat it, as it will be fresh
in their mind.
Because all packages are co-installable thanks to separate hierarchies, there
are no conflicts at the package store level, and no dependency resolution (an
optimization problem requiring SAT
solving) is required at all.
In exchange directories, we resolve conflicts by selecting the package with the
highest monotonically increasing distri revision number.
distri proves that we can build a useful Linux distribution entirely without hooks and triggers. Not having to serialize hook execution allows us to download packages into the package store with maximum concurrency.
Because we are using images instead of archives, we do not need to unpack anything. This means installing a package is really just writing its package image and metadata to the package store. Sequential writes are typically the fastest kind of storage usage pattern.
Fast installation also makes other use cases more bearable, such as creating disk
images, be it for testing them in qemu(1)
, booting
them on real hardware from a USB drive, or for cloud providers such as Google
Cloud.
Contrary to how distribution package builders are usually implemented, the distri package builder does not actually install any packages into the build environment.
Instead, distri makes available a filtered view of the package store (only
declared dependencies are available) at /ro
in the build environment.
This means that even for large dependency trees, setting up a build environment happens in a fraction of a second! Such a low latency really makes a difference in how comfortable it is to iterate on distribution packages.
In distri, package images are installed from a remote package store into the
local system package store /roimg
, which backs the /ro
mount.
A package store is implemented as a directory of package images and their associated metadata files.
You can easily make a package store available by using distri export
.
To provide a mirror for your local network, you can periodically distri update
from the package store you want to mirror, and then distri export
your local
copy. Special tooling (e.g. debmirror
in Debian) is not required because
distri install
is atomic (and update
uses install
).
Producing derivatives is easy: just add your own packages to a copy of the package store.
The package store is intentionally kept simple to manage and distribute. Its files could be exchanged via peer-to-peer file systems, or synchronized from an offline medium.
distri works well enough to demonstrate the ideas explained above. I have
branched this state into branch
jackherer
, distri’s first
release code name. This way, I can keep experimenting in the distri repository
without breaking your installation.
From the branch contents, our autobuilder creates:
a package repository. Installations can pick up new packages with
distri update
.
The project website can be found at https://distr1.org. The website is just the README for now, but we can improve that later.
The repository can be found at https://github.com/distr1/distri
Right now, distri is mainly a vehicle for my spare-time Linux distribution research. I don’t recommend anyone use distri for anything but research, and there are no medium-term plans of that changing. At the very least, please contact me before basing anything serious on distri so that we can talk about limitations and expectations.
I expect the distri project to live for as long as I have blog posts to publish, and we’ll see what happens afterwards. Note that this is a hobby for me: I will continue to explore, at my own pace, parts that I find interesting.
My hope is that established distributions might get a useful idea or two from distri.
I don’t want to make this post too long, but there is much more!
Please subscribe to the following URL in your feed reader to get all posts about distri:
https://michael.stapelberg.ch/posts/tags/distri/feed.xml
Next in my queue are articles about hermetic packages and good package maintainer experience (including declarative packaging).
I’d love to discuss these ideas in case you’re interested!
Please send feedback to the distri mailing list so that everyone can participate!
Pending feedback: Allan McRae pointed out that I should be more precise with my terminology: strictly speaking, distributions are slow, and package managers are only part of the puzzle.
I’ll try to be clearer in future revisions/posts.
I measured how long the package managers of the most popular Linux distributions
take to install small and large packages (the
ack(1p)
source code search Perl script
and qemu, respectively).
Where required, my measurements include metadata updates such as transferring an up-to-date package list. For me, requiring a metadata update is the more common case, particularly on live systems or within Docker containers.
All measurements were taken on an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
running Docker 20.10.8 on Linux 5.13.10, backed by a Corsair Force MP600 NVMe
drive boasting many hundreds of MB/s write performance. The machine is located
in Zürich and connected to the Internet with a 1 Gigabit fiber connection, so
the expected top download speed is ≈115 MB/s.
See Appendix D for details on the measurement method and command outputs.
Keep in mind that these are one-time measurements. They should be indicative of actual performance, but your experience may vary.
Small package (ack):

| distribution | package manager | data  | wall-clock time | rate      |
|--------------|-----------------|-------|-----------------|-----------|
| Fedora       | dnf             | 84 MB | 25s             | 3.4 MB/s  |
| NixOS        | Nix             | 15 MB | 7s              | 2.3 MB/s  |
| Debian       | apt             | 16 MB | 3s              | 4.9 MB/s  |
| Arch Linux   | pacman          | 25 MB | 1s              | 18.4 MB/s |
| Alpine       | apk             | 10 MB | 1s              | 11.9 MB/s |
Large package (qemu):

| distribution | package manager | data   | wall-clock time | rate      |
|--------------|-----------------|--------|-----------------|-----------|
| Fedora       | dnf             | 350 MB | 56s             | 6.25 MB/s |
| Debian       | apt             | 256 MB | 39s             | 6.5 MB/s  |
| NixOS        | Nix             | 251 MB | 36s             | 6.8 MB/s  |
| Arch Linux   | pacman          | 128 MB | 10s             | 12.1 MB/s |
| Alpine       | apk             | 34 MB  | 1.8s            | 18.6 MB/s |
(Looking for older measurements? See Appendix B (2019) or Appendix C (2020)).
The difference between the slowest and fastest package managers is 30x!
How can Alpine’s apk and Arch Linux’s pacman be an order of magnitude faster than the rest? They are doing a lot less than the others, and more efficiently, too.
For example, Fedora transfers a lot more data than others because its main
package list is 60 MB (compressed!) alone. Compare that with Alpine’s 734 KB
APKINDEX.tar.gz
.
Of course the extra metadata which Fedora provides helps some use cases, otherwise it would hopefully have been removed altogether. But the amount of metadata seems excessive for installing a single package, which I consider the main use case of an interactive package manager.
I expect any modern Linux distribution to only transfer absolutely required data to complete my task.
Because they need to sequence executing arbitrary package maintainer-provided code (hooks and triggers), all tested package managers need to install packages sequentially (one after the other) instead of concurrently (all at the same time).
In my blog post “Can we do without hooks and triggers?”, I outline that hooks and triggers are not strictly necessary to build a working Linux distribution.
Strictly speaking, the only required feature of a package manager is to make available the package contents so that the package can be used: a program can be started, a kernel module can be loaded, etc.
By only implementing what’s needed for this feature, and nothing more, a package
manager could likely beat apk
’s performance.
Here’s a table outlining how the various package managers listed on Wikipedia’s list of software package management systems fare:
| name       | scope  | package file format               | hooks/triggers     |
|------------|--------|-----------------------------------|--------------------|
| AppImage   | apps   | image: ISO9660, SquashFS          | no                 |
| snappy     | apps   | image: SquashFS                   | yes: hooks         |
| FlatPak    | apps   | archive: OSTree                   | no                 |
| 0install   | apps   | archive: tar.bz2                  | no                 |
| nix, guix  | distro | archive: nar.{bz2,xz}             | activation script  |
| dpkg       | distro | archive: tar.{gz,xz,bz2} in ar(1) | yes                |
| rpm        | distro | archive: cpio.{bz2,lz,xz}         | scriptlets         |
| pacman     | distro | archive: tar.xz                   | install            |
| slackware  | distro | archive: tar.{gz,xz}              | yes: doinst.sh     |
| apk        | distro | archive: tar.gz                   | yes: .post-install |
| Entropy    | distro | archive: tar.bz2                  | yes                |
| ipkg, opkg | distro | archive: tar{,.gz}                | yes                |
In the current landscape, there is no distribution-scoped package manager which uses images and leaves out hooks and triggers, not even in smaller Linux distributions.
I think that space is really interesting, as it uses a minimal design to achieve significant real-world speed-ups.
I have explored this idea in much more detail, and am happy to talk more about it in my post distri: a Linux distribution to research fast package management.
There are a couple of recent developments going into the same direction:
% docker run --security-opt=seccomp:unconfined -t -i fedora /bin/bash
[root@62d3cae2e2f9 /]# time dnf install -y ack
Fedora 35 - x86_64 25 MB/s | 61 MB
Fedora 35 openh264 (From Cisco) - x86_64 3.5 kB/s | 2.5 kB
Fedora Modular 35 - x86_64 5.0 MB/s | 2.6 MB
Fedora 35 - x86_64 - Updates 6.0 MB/s | 9.3 MB
Fedora Modular 35 - x86_64 - Updates 4.1 MB/s | 3.3 MB
Dependencies resolved.
[…]
real 0m24.882s
user 0m17.377s
sys 0m0.835s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.ack'
unpacking channels...
created 1 symlinks in user environment
installing 'perl5.34.0-ack-3.5.0'
these paths will be fetched (15.78 MiB download, 86.82 MiB unpacked):
/nix/store/11xpmmwy95396nkhih3qc3814lqhqb8f-libunistring-0.9.10
/nix/store/1h18nl3gisw89znbzbmnxhd7jk20xlff-perl5.34.0-File-Next-1.18
/nix/store/1mpxs3109cjrbhmi3q1vmvc0djz102pl-libidn2-2.3.2
/nix/store/jr35z7n8jbv9q89my50vhyndqd3y541i-attr-2.5.1
/nix/store/krc4xirbvjnff8m62snqdbayg46z5l5b-acl-2.3.1
/nix/store/mij848h2x5wiqkwhg027byvmf9x3gx7y-glibc-2.33-50
/nix/store/wq38iqzdh40dzfsndb927kh7y5bqh457-perl5.34.0-ack-3.5.0-man
/nix/store/xyn0240zrpprnspg3n0fi8c8aw5bq0mr-coreutils-8.32
/nix/store/y8r9ymbz59yjm1bwr3fdvd23jvcb2bzj-perl5.34.0-ack-3.5.0
/nix/store/ypr273yvmr07n5n1w1gbcqnhpw7lbbvz-perl-5.34.0
copying path '/nix/store/wq38iqzdh40dzfsndb927kh7y5bqh457-perl5.34.0-ack-3.5.0-man' from 'https://cache.nixos.org'...
copying path '/nix/store/11xpmmwy95396nkhih3qc3814lqhqb8f-libunistring-0.9.10' from 'https://cache.nixos.org'...
copying path '/nix/store/1h18nl3gisw89znbzbmnxhd7jk20xlff-perl5.34.0-File-Next-1.18' from 'https://cache.nixos.org'...
copying path '/nix/store/1mpxs3109cjrbhmi3q1vmvc0djz102pl-libidn2-2.3.2' from 'https://cache.nixos.org'...
copying path '/nix/store/mij848h2x5wiqkwhg027byvmf9x3gx7y-glibc-2.33-50' from 'https://cache.nixos.org'...
copying path '/nix/store/jr35z7n8jbv9q89my50vhyndqd3y541i-attr-2.5.1' from 'https://cache.nixos.org'...
copying path '/nix/store/krc4xirbvjnff8m62snqdbayg46z5l5b-acl-2.3.1' from 'https://cache.nixos.org'...
copying path '/nix/store/xyn0240zrpprnspg3n0fi8c8aw5bq0mr-coreutils-8.32' from 'https://cache.nixos.org'...
copying path '/nix/store/ypr273yvmr07n5n1w1gbcqnhpw7lbbvz-perl-5.34.0' from 'https://cache.nixos.org'...
copying path '/nix/store/y8r9ymbz59yjm1bwr3fdvd23jvcb2bzj-perl5.34.0-ack-3.5.0' from 'https://cache.nixos.org'...
building '/nix/store/pwlxhy7kry56z6593rh397fc49x5avlw-user-environment.drv'...
created 49 symlinks in user environment
real 0m 6.82s
user 0m 3.47s
sys 0m 2.11s
% docker run -t -i debian:sid
root@40a3899b1f2f:/# time (apt update && apt install -y ack-grep)
Get:1 http://deb.debian.org/debian sid InRelease [165 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8800 kB]
Fetched 8965 kB in 1s (9495 kB/s)
[…]
The following NEW packages will be installed:
ack libfile-next-perl libgdbm-compat4 libgdbm6 libperl5.32 netbase perl perl-modules-5.32
0 upgraded, 8 newly installed, 0 to remove and 24 not upgraded.
Need to get 7479 kB of archives.
After this operation, 47.7 MB of additional disk space will be used.
[…]
real 0m3.260s
user 0m2.463s
sys 0m0.352s
% docker run -t -i archlinux:base
[root@9f6672688a64 /]# time (pacman -Sy && pacman -S --noconfirm ack)
:: Synchronizing package databases...
core 138.8 KiB 1542 KiB/s
extra 1569.8 KiB 26.9 MiB/s
community 5.8 MiB 92.2 MiB/s
resolving dependencies...
looking for conflicting packages...
Packages (5) db-5.3.28-5 gdbm-1.22-1 perl-5.34.0-2 perl-file-next-1.18-3 ack-3.5.0-2
Total Download Size: 16.77 MiB
Total Installed Size: 66.21 MiB
[…]
real 0m1.403s
user 0m0.484s
sys 0m0.211s
% docker run -t -i alpine
# time apk add ack
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
(1/4) Installing libbz2 (1.0.8-r1)
(2/4) Installing perl (5.32.1-r0)
(3/4) Installing perl-file-next (1.18-r2)
(4/4) Installing ack (3.5.0-r1)
Executing busybox-1.33.1-r3.trigger
OK: 43 MiB in 18 packages
real 0m 0.76s
user 0m 0.27s
sys 0m 0.09s
% docker run -t -i fedora /bin/bash
[root@6a52ecfc3afa /]# time dnf install -y qemu
Fedora 35 - x86_64 15 MB/s | 61 MB
Fedora 35 openh264 (From Cisco) - x86_64 3.0 kB/s | 2.5 kB
Fedora Modular 35 - x86_64 5.2 MB/s | 2.6 MB
Fedora 35 - x86_64 - Updates 6.6 MB/s | 9.3 MB
Fedora Modular 35 - x86_64 - Updates 2.2 MB/s | 3.3 MB
Dependencies resolved.
[…]
Total download size: 274 M
Downloading Packages:
[…]
real 0m56.031s
user 0m31.275s
sys 0m3.868s
% docker run -t -i nixos/nix
83971cf79f7e:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.qemu'
unpacking channels...
created 1 symlinks in user environment
installing 'qemu-6.1.0'
these paths will be fetched (230.72 MiB download, 1424.84 MiB unpacked):
[…]
real 0m 36.55s
user 0m 19.83s
sys 0m 3.34s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y qemu-system-x86)
Get:1 http://deb.debian.org/debian sid InRelease [146 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8400 kB]
Fetched 8965 kB in 1s (9048 kB/s)
[…]
Fetched 247 MB in 4s (64.9 MB/s)
[…]
real 0m38.875s
user 0m21.282s
sys 0m5.298s
% docker run -t -i archlinux:base
[root@58c78bda08e8 /]# time (pacman -Sy && pacman -S --noconfirm qemu)
:: Synchronizing package databases...
core 138.7 KiB 1541 KiB/s
extra 1569.8 KiB 35.7 MiB/s
community 5.8 MiB 92.2 MiB/s
[…]
Total Download Size: 118.97 MiB
Total Installed Size: 586.68 MiB
[…]
real 0m10.542s
user 0m3.092s
sys 0m1.569s
% docker run -t -i alpine
/ # time apk add qemu-system-x86_64
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
[…]
OK: 281 MiB in 66 packages
real 0m 1.83s
user 0m 0.77s
sys 0m 0.24s
% docker run -t -i fedora /bin/bash
[root@62d3cae2e2f9 /]# time dnf install -y ack
Fedora 32 openh264 (From Cisco) - x86_64 1.9 kB/s | 2.5 kB 00:01
Fedora Modular 32 - x86_64 6.8 MB/s | 4.9 MB 00:00
Fedora Modular 32 - x86_64 - Updates 5.6 MB/s | 3.7 MB 00:00
Fedora 32 - x86_64 - Updates 9.9 MB/s | 23 MB 00:02
Fedora 32 - x86_64 39 MB/s | 70 MB 00:01
[…]
real 0m32.898s
user 0m25.121s
sys 0m1.408s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.ack'
unpacking channels...
created 1 symlinks in user environment
installing 'perl5.32.0-ack-3.3.1'
these paths will be fetched (15.55 MiB download, 85.51 MiB unpacked):
/nix/store/34l8jdg76kmwl1nbbq84r2gka0kw6rc8-perl5.32.0-ack-3.3.1-man
/nix/store/9df65igwjmf2wbw0gbrrgair6piqjgmi-glibc-2.31
/nix/store/9fd4pjaxpjyyxvvmxy43y392l7yvcwy1-perl5.32.0-File-Next-1.18
/nix/store/czc3c1apx55s37qx4vadqhn3fhikchxi-libunistring-0.9.10
/nix/store/dj6n505iqrk7srn96a27jfp3i0zgwa1l-acl-2.2.53
/nix/store/ifayp0kvijq0n4x0bv51iqrb0yzyz77g-perl-5.32.0
/nix/store/w9wc0d31p4z93cbgxijws03j5s2c4gyf-coreutils-8.31
/nix/store/xim9l8hym4iga6d4azam4m0k0p1nw2rm-libidn2-2.3.0
/nix/store/y7i47qjmf10i1ngpnsavv88zjagypycd-attr-2.4.48
/nix/store/z45mp61h51ksxz28gds5110rf3wmqpdc-perl5.32.0-ack-3.3.1
copying path '/nix/store/34l8jdg76kmwl1nbbq84r2gka0kw6rc8-perl5.32.0-ack-3.3.1-man' from 'https://cache.nixos.org'...
copying path '/nix/store/czc3c1apx55s37qx4vadqhn3fhikchxi-libunistring-0.9.10' from 'https://cache.nixos.org'...
copying path '/nix/store/9fd4pjaxpjyyxvvmxy43y392l7yvcwy1-perl5.32.0-File-Next-1.18' from 'https://cache.nixos.org'...
copying path '/nix/store/xim9l8hym4iga6d4azam4m0k0p1nw2rm-libidn2-2.3.0' from 'https://cache.nixos.org'...
copying path '/nix/store/9df65igwjmf2wbw0gbrrgair6piqjgmi-glibc-2.31' from 'https://cache.nixos.org'...
copying path '/nix/store/y7i47qjmf10i1ngpnsavv88zjagypycd-attr-2.4.48' from 'https://cache.nixos.org'...
copying path '/nix/store/dj6n505iqrk7srn96a27jfp3i0zgwa1l-acl-2.2.53' from 'https://cache.nixos.org'...
copying path '/nix/store/w9wc0d31p4z93cbgxijws03j5s2c4gyf-coreutils-8.31' from 'https://cache.nixos.org'...
copying path '/nix/store/ifayp0kvijq0n4x0bv51iqrb0yzyz77g-perl-5.32.0' from 'https://cache.nixos.org'...
copying path '/nix/store/z45mp61h51ksxz28gds5110rf3wmqpdc-perl5.32.0-ack-3.3.1' from 'https://cache.nixos.org'...
building '/nix/store/m0rl62grplq7w7k3zqhlcz2hs99y332l-user-environment.drv'...
created 49 symlinks in user environment
real 0m 5.60s
user 0m 3.21s
sys 0m 1.66s
% docker run -t -i debian:sid
root@1996bb94a2d1:/# time (apt update && apt install -y ack-grep)
Get:1 http://deb.debian.org/debian sid InRelease [146 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8400 kB]
Fetched 8546 kB in 1s (8088 kB/s)
[…]
The following NEW packages will be installed:
ack libfile-next-perl libgdbm-compat4 libgdbm6 libperl5.30 netbase perl perl-modules-5.30
0 upgraded, 8 newly installed, 0 to remove and 23 not upgraded.
Need to get 7341 kB of archives.
After this operation, 46.7 MB of additional disk space will be used.
[…]
real 0m9.544s
user 0m2.839s
sys 0m0.775s
% docker run -t -i archlinux/base
[root@9f6672688a64 /]# time (pacman -Sy && pacman -S --noconfirm ack)
:: Synchronizing package databases...
core 130.8 KiB 1090 KiB/s 00:00
extra 1655.8 KiB 3.48 MiB/s 00:00
community 5.2 MiB 6.11 MiB/s 00:01
resolving dependencies...
looking for conflicting packages...
Packages (2) perl-file-next-1.18-2 ack-3.4.0-1
Total Download Size: 0.07 MiB
Total Installed Size: 0.19 MiB
[…]
real 0m2.936s
user 0m0.375s
sys 0m0.160s
% docker run -t -i alpine
/ # time apk add ack
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/4) Installing libbz2 (1.0.8-r1)
(2/4) Installing perl (5.30.3-r0)
(3/4) Installing perl-file-next (1.18-r0)
(4/4) Installing ack (3.3.1-r0)
Executing busybox-1.31.1-r16.trigger
OK: 43 MiB in 18 packages
real 0m 1.24s
user 0m 0.40s
sys 0m 0.15s
% docker run -t -i fedora /bin/bash
[root@6a52ecfc3afa /]# time dnf install -y qemu
Fedora 32 openh264 (From Cisco) - x86_64 3.1 kB/s | 2.5 kB 00:00
Fedora Modular 32 - x86_64 6.3 MB/s | 4.9 MB 00:00
Fedora Modular 32 - x86_64 - Updates 6.0 MB/s | 3.7 MB 00:00
Fedora 32 - x86_64 - Updates 334 kB/s | 23 MB 01:10
Fedora 32 - x86_64 33 MB/s | 70 MB 00:02
[…]
Total download size: 181 M
Downloading Packages:
[…]
real 4m37.652s
user 0m38.239s
sys 0m6.321s
% docker run -t -i nixos/nix
83971cf79f7e:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.qemu'
unpacking channels...
created 1 symlinks in user environment
installing 'qemu-5.1.0'
these paths will be fetched (180.70 MiB download, 1146.92 MiB unpacked):
[…]
real 0m 33.64s
user 0m 16.96s
sys 0m 3.05s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y qemu-system-x86)
Get:1 http://deb.debian.org/debian sid InRelease [146 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8400 kB]
Fetched 8546 kB in 1s (5998 kB/s)
[…]
Fetched 216 MB in 43s (5006 kB/s)
[…]
real 1m25.375s
user 0m29.163s
sys 0m12.835s
% docker run -t -i archlinux/base
[root@58c78bda08e8 /]# time (pacman -Sy && pacman -S --noconfirm qemu)
:: Synchronizing package databases...
core 130.8 KiB 1055 KiB/s 00:00
extra 1655.8 KiB 3.70 MiB/s 00:00
community 5.2 MiB 7.89 MiB/s 00:01
[…]
Total Download Size: 135.46 MiB
Total Installed Size: 661.05 MiB
[…]
real 0m43.901s
user 0m4.980s
sys 0m2.615s
% docker run -t -i alpine
/ # time apk add qemu-system-x86_64
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
[…]
OK: 78 MiB in 95 packages
real 0m 2.43s
user 0m 0.46s
sys 0m 0.09s
% docker run -t -i fedora /bin/bash
[root@722e6df10258 /]# time dnf install -y ack
Fedora Modular 30 - x86_64 4.4 MB/s | 2.7 MB 00:00
Fedora Modular 30 - x86_64 - Updates 3.7 MB/s | 2.4 MB 00:00
Fedora 30 - x86_64 - Updates 17 MB/s | 19 MB 00:01
Fedora 30 - x86_64 31 MB/s | 70 MB 00:02
[…]
Install 44 Packages
Total download size: 13 M
Installed size: 42 M
[…]
real 0m29.498s
user 0m22.954s
sys 0m1.085s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -i perl5.28.2-ack-2.28'
unpacking channels...
created 2 symlinks in user environment
installing 'perl5.28.2-ack-2.28'
these paths will be fetched (14.91 MiB download, 80.83 MiB unpacked):
/nix/store/57iv2vch31v8plcjrk97lcw1zbwb2n9r-perl-5.28.2
/nix/store/89gi8cbp8l5sf0m8pgynp2mh1c6pk1gk-attr-2.4.48
/nix/store/gkrpl3k6s43fkg71n0269yq3p1f0al88-perl5.28.2-ack-2.28-man
/nix/store/iykxb0bmfjmi7s53kfg6pjbfpd8jmza6-glibc-2.27
/nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31
/nix/store/svgkibi7105pm151prywndsgvmc4qvzs-acl-2.2.53
/nix/store/x4knf14z1p0ci72gl314i7vza93iy7yc-perl5.28.2-File-Next-1.16
/nix/store/zfj7ria2kwqzqj9dh91kj9kwsynxdfk0-perl5.28.2-ack-2.28
copying path '/nix/store/gkrpl3k6s43fkg71n0269yq3p1f0al88-perl5.28.2-ack-2.28-man' from 'https://cache.nixos.org'...
copying path '/nix/store/iykxb0bmfjmi7s53kfg6pjbfpd8jmza6-glibc-2.27' from 'https://cache.nixos.org'...
copying path '/nix/store/x4knf14z1p0ci72gl314i7vza93iy7yc-perl5.28.2-File-Next-1.16' from 'https://cache.nixos.org'...
copying path '/nix/store/89gi8cbp8l5sf0m8pgynp2mh1c6pk1gk-attr-2.4.48' from 'https://cache.nixos.org'...
copying path '/nix/store/svgkibi7105pm151prywndsgvmc4qvzs-acl-2.2.53' from 'https://cache.nixos.org'...
copying path '/nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31' from 'https://cache.nixos.org'...
copying path '/nix/store/57iv2vch31v8plcjrk97lcw1zbwb2n9r-perl-5.28.2' from 'https://cache.nixos.org'...
copying path '/nix/store/zfj7ria2kwqzqj9dh91kj9kwsynxdfk0-perl5.28.2-ack-2.28' from 'https://cache.nixos.org'...
building '/nix/store/q3243sjg91x1m8ipl0sj5gjzpnbgxrqw-user-environment.drv'...
created 56 symlinks in user environment
real 0m 14.02s
user 0m 8.83s
sys 0m 2.69s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y ack-grep)
Get:1 http://cdn-fastly.deb.debian.org/debian sid InRelease [233 kB]
Get:2 http://cdn-fastly.deb.debian.org/debian sid/main amd64 Packages [8270 kB]
Fetched 8502 kB in 2s (4764 kB/s)
[…]
The following NEW packages will be installed:
ack ack-grep libfile-next-perl libgdbm-compat4 libgdbm5 libperl5.26 netbase perl perl-modules-5.26
The following packages will be upgraded:
perl-base
1 upgraded, 9 newly installed, 0 to remove and 60 not upgraded.
Need to get 8238 kB of archives.
After this operation, 42.3 MB of additional disk space will be used.
[…]
real 0m9.096s
user 0m2.616s
sys 0m0.441s
% docker run -t -i archlinux/base
[root@9604e4ae2367 /]# time (pacman -Sy && pacman -S --noconfirm ack)
:: Synchronizing package databases...
core 132.2 KiB 1033K/s 00:00
extra 1629.6 KiB 2.95M/s 00:01
community 4.9 MiB 5.75M/s 00:01
[…]
Total Download Size: 0.07 MiB
Total Installed Size: 0.19 MiB
[…]
real 0m3.354s
user 0m0.224s
sys 0m0.049s
% docker run -t -i alpine
/ # time apk add ack
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
(1/4) Installing perl-file-next (1.16-r0)
(2/4) Installing libbz2 (1.0.6-r7)
(3/4) Installing perl (5.28.2-r1)
(4/4) Installing ack (3.0.0-r0)
Executing busybox-1.30.1-r2.trigger
OK: 44 MiB in 18 packages
real 0m 0.96s
user 0m 0.25s
sys 0m 0.07s
% docker run -t -i fedora /bin/bash
[root@722e6df10258 /]# time dnf install -y qemu
Fedora Modular 30 - x86_64 3.1 MB/s | 2.7 MB 00:00
Fedora Modular 30 - x86_64 - Updates 2.7 MB/s | 2.4 MB 00:00
Fedora 30 - x86_64 - Updates 20 MB/s | 19 MB 00:00
Fedora 30 - x86_64 31 MB/s | 70 MB 00:02
[…]
Install 262 Packages
Upgrade 4 Packages
Total download size: 172 M
[…]
real 1m7.877s
user 0m44.237s
sys 0m3.258s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -i qemu-4.0.0'
unpacking channels...
created 2 symlinks in user environment
installing 'qemu-4.0.0'
these paths will be fetched (262.18 MiB download, 1364.54 MiB unpacked):
[…]
real 0m 38.49s
user 0m 26.52s
sys 0m 4.43s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y qemu-system-x86)
Get:1 http://cdn-fastly.deb.debian.org/debian sid InRelease [149 kB]
Get:2 http://cdn-fastly.deb.debian.org/debian sid/main amd64 Packages [8426 kB]
Fetched 8574 kB in 1s (6716 kB/s)
[…]
Fetched 151 MB in 2s (64.6 MB/s)
[…]
real 0m51.583s
user 0m15.671s
sys 0m3.732s
% docker run -t -i archlinux/base
[root@9604e4ae2367 /]# time (pacman -Sy && pacman -S --noconfirm qemu)
:: Synchronizing package databases...
core 132.2 KiB 751K/s 00:00
extra 1629.6 KiB 3.04M/s 00:01
community 4.9 MiB 6.16M/s 00:01
[…]
Total Download Size: 123.20 MiB
Total Installed Size: 587.84 MiB
[…]
real 1m2.475s
user 0m9.272s
sys 0m2.458s
% docker run -t -i alpine
/ # time apk add qemu-system-x86_64
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
[…]
OK: 78 MiB in 95 packages
real 0m 2.43s
user 0m 0.46s
sys 0m 0.09s
postgres
), or generating/updating cache files.
Triggers are a kind of hook which run when other packages are installed. For
example, on Debian, the man(1)
package
comes with a trigger which regenerates the search database index whenever any
package installs a manpage. When, for example, the
nginx(8)
package is installed, a
trigger provided by the man(1)
package
runs.
Over the past few decades, Open Source software has become more and more uniform: instead of each piece of software defining its own rules, a small number of build systems are now widely adopted.
Hence, I think it makes sense to revisit whether offering extension via hooks and triggers is a net win or net loss.
Package managers can commonly make very few assumptions about what hooks do, what preconditions they require, and which conflicts might be caused by running multiple packages’ hooks concurrently.
Hence, package managers cannot concurrently install packages. At least the hook/trigger part of the installation needs to happen in sequence.
While it seems technically feasible to retrofit package manager hooks with concurrency primitives such as locks for mutual exclusion between different hook processes, the required overhaul of all hooks¹ seems like such a daunting task that it might be better to just get rid of the hooks instead. Only deleting code frees you from the burden of maintenance, automated testing and debugging.
① In Debian, there are 8620 non-generated maintainer scripts, as reported by
find shard*/src/*/debian -regex ".*\(pre\|post\)\(inst\|rm\)$"
on a Debian
Code Search instance.
Personally, I never use the
apropos(1)
command, so I don’t
appreciate the man(1)
package’s trigger
which updates the database used by
apropos(1)
. The process takes a long
time and, because hooks and triggers must be executed serially (see previous
section), blocks my installation or update.
When I tell people this, they are often surprised to learn about the existence
of the apropos(1)
command. I suggest
adopting an opt-in model.
Hooks run when packages are installed. If a package’s contents are not used between two updates, running the hook in the first update could have been skipped. Running the hook lazily when the package contents are used reduces unnecessary work.
As a welcome side-effect, lazy hook evaluation automatically makes the hook work in operating system images, such as live USB thumb drives or SD card images for the Raspberry Pi. Such images must not ship the same crypto keys (e.g. OpenSSH host keys) to all machines, but instead generate a different key on each machine.
Why do users keep packages installed they don’t use? It’s extra work to remember and clean up those packages after use. Plus, users might not realize or value that having fewer packages installed has benefits such as faster updates.
I can also imagine that there are people for whom the cost of re-installing packages incentivizes them to just keep packages installed—you never know when you might need the program again…
While working on hermetic packages (more on that in another blog post), where
the contained programs are started with modified environment variables
(e.g. PATH
) via a wrapper bash script, I noticed that the overhead of those
wrapper bash scripts quickly becomes significant. For example, when using the
excellent magit interface for Git in Emacs, I encountered
second-long delays² when using hermetic packages compared to standard
packages. Re-implementing wrappers in a compiled language provided a significant
speed-up.
Similarly, getting rid of an extension point which mandates using shell scripts allows us to build an efficient and fast implementation of a predefined set of primitives, where you can reason about their effects and interactions.
② magit needs to run git a few times for displaying the full status, so small overhead quickly adds up.
Hooks are an escape hatch for distribution maintainers to express anything which their packaging system cannot express.
Distributions should only rely on well-established interfaces such as autoconf’s
classic ./configure && make && make install
(including commonly used flags) to
build a distribution package. Integrating upstream software into a distribution
should not require custom hooks. For example, instead of requiring a hook which
updates a cache of schema files, the library used to interact with those files
should transparently (re-)generate the cache or fall back to a slower code path.
Distribution maintainers are hard to come by, so we should value their time. In particular, there is a 1:n relationship of packages to distribution package maintainers (software is typically available in multiple Linux distributions), so it makes sense to spend the work in the 1 and have the n benefit.
If we want to get rid of hooks, we need another mechanism to achieve what we currently achieve with hooks.
If the hook is not specific to the package, it can be moved to the package
manager. The desired system state should either be derived from the package
contents (e.g. required system users can be discovered from systemd service
files) or declaratively specified in the package build instructions—more on that
in another blog post. This turns hooks (arbitrary code) into configuration,
which allows the package manager to collapse and sequence the required state
changes. E.g., when 5 packages are installed which each need a new system user,
the package manager could update /etc/passwd
just once.
If the hook is specific to the package, it should be moved into the package
contents. This typically means moving the functionality into the program start
(or the systemd service file if we are talking about a daemon). If (while?)
upstream is not convinced, you can either wrap the program or patch it. Note
that this case is relatively rare: I have worked with hundreds of packages and
the only package-specific functionality I came across was automatically
generating host keys before starting OpenSSH’s
sshd(8)
³.
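For the sshd case, moving the hook into the service file amounts to something like the following sketch (not the actual Fedora unit): `ssh-keygen -A` generates any missing host key types before the daemon starts, so no install-time hook is needed and the same image works on every machine.

```ini
# Sketch of a service-start replacement for an install-time hook.
[Unit]
Description=OpenSSH server
After=network.target

[Service]
# ssh-keygen -A creates any host keys that do not exist yet, so first
# boot of an image generates per-machine keys automatically.
ExecStartPre=/usr/bin/ssh-keygen -A
ExecStart=/usr/sbin/sshd -D

[Install]
WantedBy=multi-user.target
```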
There is one exception where moving the hook doesn’t work: packages which modify state outside of the system, such as bootloaders or kernel images.
③ Even that can be moved out of a package-specific hook, as Fedora demonstrates.
Global state modifications performed as part of package installation today use hooks, an overly expressive extension mechanism.
Instead, all modifications should be driven by configuration. This is feasible because there are only a few different kinds of desired state modifications. This makes it possible for package managers to optimize package installation.