The focus of this release lies on:

- a better developer experience, allowing users to debug any installed package without extra setup steps
- performance improvements in all areas (starting programs, building distri packages, generating distri images)
- better tooling for keeping track of upstream versions
See the release notes for more details.
The distri research linux distribution project was started in 2019 to research whether a few architectural changes could enable drastically faster package management.
While the package managers in common Linux distributions (e.g. apt, dnf, …) top out at data rates of only a few MB/s, distri effortlessly saturates 1 Gbit, 10 Gbit and even 40 Gbit connections, resulting in fast installation and update speeds.
Packages (e.g. emacs) are hermetic. By hermetic, I mean that the dependencies a package uses (e.g. libusb) don’t change, even when newer versions are installed.
For example, if package libusb-amd64-1.0.22-7 is available at build time, the package will always use that same version, even after the newer libusb-amd64-1.0.23-8 is installed into the package store.
Another way of saying the same thing is: packages in distri are always co-installable.
This makes the package store more robust: additions to it will not break the system. On a technical level, the package store is implemented as a directory containing distri SquashFS images and metadata files, into which packages are installed in an atomic way.
One exception where hermeticity is not desired is plugin mechanisms: optionally loading out-of-tree code at runtime obviously is not hermetic.
As an example, consider glibc’s Name Service Switch (NSS) mechanism. Section 29.4.1 “Adding another Service to NSS” of the glibc manual describes how glibc searches $prefix/lib for shared libraries at runtime.
Debian ships about a dozen NSS libraries for a variety of purposes, and enterprise setups might add their own into the mix.
systemd (as of v245) accounts for 4 NSS libraries, e.g. nss-systemd for user/group name resolution for users allocated through systemd’s DynamicUser= option.
Having packages be as hermetic as possible remains a worthwhile goal despite any exceptions: I will gladly use a 99% hermetic system over a 0% hermetic system any day.
Side note: Xorg’s driver model (which can be characterized as a plugin mechanism) does not fall under this category because of its tight API/ABI coupling! For this case, where drivers are only guaranteed to work with precisely the Xorg version for which they were compiled, distri uses per-package exchange directories.
On a technical level, the requirement is: all paths used by the program must always result in the same contents. This is implemented in distri via the read-only package store mounted at /ro, e.g. files underneath /ro/emacs-amd64-26.3-15 never change.
In practice, three strategies cover most of the paths a program uses:
Programs on Linux use the ELF file format, which contains two kinds of references:
First, the ELF interpreter (PT_INTERP segment), which is used to start the program. For dynamically linked programs on 64-bit systems, this is typically ld.so(8).
Many distributions use system-global paths such as /lib64/ld-linux-x86-64.so.2, but distri compiles programs with -Wl,--dynamic-linker=/ro/glibc-amd64-2.31-4/out/lib/ld-linux-x86-64.so.2 so that the full path ends up in the binary.
The ELF interpreter is shown by file(1), but you can also use readelf -a $BINARY | grep 'program interpreter' to display it.
And secondly, the rpath, a run-time search path for dynamic libraries. Instead of storing full references to all dynamic libraries, we set the rpath so that ld.so(8) will find the correct dynamic libraries.
Originally, we used to just set a long rpath, containing one entry for each dynamic library dependency. However, we have since switched to using a single lib subdirectory per package as its rpath, and placing symlinks with full path references into that lib directory, e.g. using -Wl,-rpath=/ro/grep-amd64-3.4-4/lib. This is better for performance, as ld.so uses a per-directory cache.
Note that program load times are significantly influenced by how quickly you can locate the dynamic libraries. distri uses a FUSE file system to load programs from, so getting proper -ENOENT caching into place drastically sped up program load times.
Instead of compiling software with the -Wl,--dynamic-linker and -Wl,-rpath flags, one can also modify these fields after the fact using patchelf(1). For closed-source programs, this is the only possibility.

The rpath can be inspected by using e.g. readelf -a $BINARY | grep RPATH.
Many programs are influenced by environment variables: to start another program, said program is often found by checking each directory in the PATH environment variable.

Such search paths are prevalent in scripting languages, too, to find modules. Python has PYTHONPATH, Perl has PERL5LIB, and so on.
To set up these search path environment variables at run time, distri employs an indirection. Instead of e.g. teensy-loader-cli, you run a small wrapper program that makes precisely one execve system call with the desired environment variables.
Initially, I used shell scripts as wrapper programs because they are easily inspectable. This turned out to be too slow, so I switched to compiled programs. I’m linking them statically for fast startup, and I’m linking them against musl libc for significantly smaller file sizes than glibc (per-executable overhead adds up quickly in a distribution!).
Note that the wrapper programs prepend to the PATH environment variable; they don’t replace it in its entirety. This is important so that users have a way to extend the PATH (and other variables) if they so choose. This doesn’t hurt hermeticity because it is only relevant for programs that were not present at build time, i.e. plugin mechanisms which, by design, cannot be hermetic.
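The wrapper idea can be sketched in Go (the real wrappers are statically linked against musl libc, and the target path is embedded at build time; prependVar is a name made up for this sketch):

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// prependVar prepends dir to the search path variable key (e.g. PATH),
// keeping any existing value so that user extensions survive.
func prependVar(env []string, key, dir string) []string {
	prefix := key + "="
	for i, kv := range env {
		if strings.HasPrefix(kv, prefix) {
			env[i] = prefix + dir + ":" + strings.TrimPrefix(kv, prefix)
			return env
		}
	}
	return append(env, prefix+dir)
}

func main() {
	// Fully qualified target path, embedded at build time in real wrappers.
	const target = "/ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli"
	env := prependVar(os.Environ(), "PATH", "/ro/teensy-loader-cli-amd64-2.1+g20180927-7/bin")
	if _, err := os.Stat(target); err == nil {
		// The wrapper's single system call: replace ourselves with the target.
		if err := syscall.Exec(target, append([]string{target}, os.Args[1:]...), env); err != nil {
			panic(err)
		}
	}
	// Outside of distri the target does not exist; show the effect instead.
	fmt.Println(prependVar([]string{"PATH=/usr/bin:/bin"}, "PATH", "/ro/x/bin")[0])
}
```

Because execve(2) replaces the wrapper process entirely, the indirection leaves no extra process behind.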
The shebang line of scripts contains a path, too, and hence needs to be changed.
We don’t do this in distri yet (the number of packaged scripts is small), but we should.
The performance improvements in the previous sections are not just good to have, but practically required when many processes are involved: without them, you’ll encounter second-long delays in magit, which spawns many git processes under the covers, or in dracut, which spawns one cp(1) process per file.
Linux distributions such as Debian consider it an advantage to roll out security fixes to the entire system by updating a single shared library package (e.g. openssl).
The flip side of that coin is that changes to a single critical package can break the entire system.
With hermetic packages, all reverse dependencies must be rebuilt when a library’s changes should be picked up by the whole system. E.g., when openssl changes, curl must be rebuilt to pick up the new version of openssl.
This approach trades off using more bandwidth and more disk space (temporarily) against reducing the blast radius of any individual package update.
This can be partially mitigated by removing empty directories at build time, which results in shorter search path variables.
In general, there is no getting around this. One little trick is to use tr : '\n', e.g.:
distri0# echo $PATH
/usr/bin:/bin:/usr/sbin:/sbin:/ro/openssh-amd64-8.2p1-11/out/bin
distri0# echo $PATH | tr : '\n'
/usr/bin
/bin
/usr/sbin
/sbin
/ro/openssh-amd64-8.2p1-11/out/bin
The implementation outlined above works well in hundreds of packages, and only a small handful exhibited problems of any kind. Here are some issues I encountered:
NSS libraries built against glibc 2.28 and newer cannot be loaded by glibc 2.27. In all likelihood, such changes do not happen too often, but it does illustrate that glibc’s published interface spec is not sufficient for forwards and backwards compatibility.
In distri, we could likely use a per-package exchange directory for glibc’s NSS mechanism to prevent the above problem from happening in the future.
Some programs try to arrange for themselves to be re-executed outside of their
current process tree. For example, consider building a program with the meson
build system:
When meson first configures the build, it generates ninja files (think Makefiles) which contain command lines that run the meson --internal helper.

Once meson returns, ninja is called as a separate process, so it will not have the environment which the meson wrapper sets up. ninja then runs the previously persisted meson command line. Since the command line uses the full path to meson (not to its wrapper), it bypasses the wrapper.
Luckily, not many programs try to arrange for other process trees to run them. Here is a table summarizing how affected programs might try to arrange for re-execution, whether the technique results in a wrapper bypass, and what we do about it in distri:
| technique to execute itself | uses wrapper | mitigation |
|---|---|---|
| run-time: find own basename in PATH | yes | wrapper program |
| compile-time: embed expected path | no; bypass! | configure or patch |
| run-time: argv[0] or /proc/self/exe | no; bypass! | patch |
One might think that setting argv[0] to the wrapper location seems like a way to side-step this problem. We tried doing this in distri, but had to revert and go the other way.
For example, login shells are started with a - character prepended to argv[0], so shells like bash or zsh cannot use wrapper programs.

At a very high level, adopting hermetic packages will require two steps:
Using fully qualified paths whose contents don’t change (e.g. /ro/emacs-amd64-26.3-15) generally requires rebuilding programs, e.g. with --prefix set.
Once you use fully qualified paths, you need to make the packages able to exchange data. distri solves this with exchange directories, implemented in the /ro file system which is backed by a FUSE daemon.
The first step is pretty simple, whereas the second step is where I expect controversy around any suggested mechanism.
This appendix contains commands and their outputs, run on the upcoming distri version supersilverhaze, but verified to work on older versions, too.
Large outputs have been collapsed and can be expanded by clicking on the output.
The /bin directory contains symlinks for the union of all packages’ bin subdirectories:
distri0# readlink -f /bin/teensy_loader_cli
/ro/teensy-loader-cli-amd64-2.1+g20180927-7/bin/teensy_loader_cli
The wrapper program in the bin subdirectory is small:
distri0# ls -lh $(readlink -f /bin/teensy_loader_cli)
-rwxr-xr-x 1 root root 46K Apr 21 21:56 /ro/teensy-loader-cli-amd64-2.1+g20180927-7/bin/teensy_loader_cli
Wrapper programs execute quickly:
distri0# strace -fvy /bin/teensy_loader_cli |& head | cat -n
1 execve("/bin/teensy_loader_cli", ["/bin/teensy_loader_cli"], ["USER=root", "LOGNAME=root", "HOME=/root", "PATH=/ro/bash-amd64-5.0-4/bin:/r"..., "SHELL=/bin/zsh", "TERM=screen.xterm-256color", "XDG_SESSION_ID=c1", "XDG_RUNTIME_DIR=/run/user/0", "DBUS_SESSION_BUS_ADDRESS=unix:pa"..., "XDG_SESSION_TYPE=tty", "XDG_SESSION_CLASS=user", "SSH_CLIENT=10.0.2.2 42556 22", "SSH_CONNECTION=10.0.2.2 42556 10"..., "SSH_TTY=/dev/pts/0", "SHLVL=1", "PWD=/root", "OLDPWD=/root", "_=/usr/bin/strace", "LD_LIBRARY_PATH=/ro/bash-amd64-5"..., "PERL5LIB=/ro/bash-amd64-5.0-4/ou"..., "PYTHONPATH=/ro/bash-amd64-5.0-4/"...]) = 0
2 arch_prctl(ARCH_SET_FS, 0x40c878) = 0
3 set_tid_address(0x40ca9c) = 715
4 brk(NULL) = 0x15b9000
5 brk(0x15ba000) = 0x15ba000
6 brk(0x15bb000) = 0x15bb000
7 brk(0x15bd000) = 0x15bd000
8 brk(0x15bf000) = 0x15bf000
9 brk(0x15c1000) = 0x15c1000
10 execve("/ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli", ["/ro/teensy-loader-cli-amd64-2.1+"...], ["USER=root", "LOGNAME=root", "HOME=/root", "PATH=/ro/bash-amd64-5.0-4/bin:/r"..., "SHELL=/bin/zsh", "TERM=screen.xterm-256color", "XDG_SESSION_ID=c1", "XDG_RUNTIME_DIR=/run/user/0", "DBUS_SESSION_BUS_ADDRESS=unix:pa"..., "XDG_SESSION_TYPE=tty", "XDG_SESSION_CLASS=user", "SSH_CLIENT=10.0.2.2 42556 22", "SSH_CONNECTION=10.0.2.2 42556 10"..., "SSH_TTY=/dev/pts/0", "SHLVL=1", "PWD=/root", "OLDPWD=/root", "_=/usr/bin/strace", "LD_LIBRARY_PATH=/ro/bash-amd64-5"..., "PERL5LIB=/ro/bash-amd64-5.0-4/ou"..., "PYTHONPATH=/ro/bash-amd64-5.0-4/"...]) = 0
Confirm which ELF interpreter is set for a binary using readelf(1):
distri0# readelf -a /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli | grep 'program interpreter'
[Requesting program interpreter: /ro/glibc-amd64-2.31-4/out/lib/ld-linux-x86-64.so.2]
Confirm the rpath is set to the package’s lib subdirectory using readelf(1):
distri0# readelf -a /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli | grep RPATH
0x000000000000000f (RPATH) Library rpath: [/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib]
…and verify the lib subdirectory has the expected symlinks and target versions:
distri0# find /ro/teensy-loader-cli-amd64-*/lib -type f -printf '%P -> %l\n'
libc.so.6 -> /ro/glibc-amd64-2.31-4/out/lib/libc-2.31.so
libpthread.so.0 -> /ro/glibc-amd64-2.31-4/out/lib/libpthread-2.31.so
librt.so.1 -> /ro/glibc-amd64-2.31-4/out/lib/librt-2.31.so
libudev.so.1 -> /ro/libudev-amd64-245-11/out/lib/libudev.so.1.6.17
libusb-0.1.so.4 -> /ro/libusb-compat-amd64-0.1.5-7/out/lib/libusb-0.1.so.4.4.4
libusb-1.0.so.0 -> /ro/libusb-amd64-1.0.23-8/out/lib/libusb-1.0.so.0.2.0
To verify the correct libraries are actually loaded, you can set the LD_DEBUG environment variable for ld.so(8):
distri0# LD_DEBUG=libs teensy_loader_cli
[…]
678: find library=libc.so.6 [0]; searching
678: search path=/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib (RPATH from file /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli)
678: trying file=/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib/libc.so.6
678:
[…]
NSS libraries that distri ships:
find /lib/ -name "libnss_*.so.2" -type f -printf '%P -> %l\n'
libnss_myhostname.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_myhostname.so.2
libnss_mymachines.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_mymachines.so.2
libnss_resolve.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_resolve.so.2
libnss_systemd.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_systemd.so.2
libnss_compat.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_compat.so.2
libnss_db.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_db.so.2
libnss_dns.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_dns.so.2
libnss_files.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_files.so.2
libnss_hesiod.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_hesiod.so.2
“[…] initrd is a scheme for loading a temporary root file system into memory, which may be used as part of the Linux startup process […] to make preparations before the real root file system can be mounted.”
Many Linux distributions do not compile all file system drivers into the kernel, but instead load them on-demand from an initramfs, which saves memory.
Another common scenario, in which an initramfs is required, is full-disk encryption: the disk must be unlocked from userspace, but since userspace is encrypted, an initramfs is used.
Thus far, building a distri disk image was quite slow:
This is on an AMD Ryzen 3900X 12-core processor (2019):
distri % time make cryptimage serial=1
80.29s user 13.56s system 186% cpu 50.419 total # 19s image, 31s initrd
Of these 50 seconds, dracut’s initramfs generation accounts for 31 seconds (62%)!
Initramfs generation time drops to 8.7 seconds once dracut no longer needs to use the single-threaded gzip(1), but can use the multi-threaded replacement pigz(1).
This brings the total time to build a distri disk image down to:
distri % time make cryptimage serial=1
76.85s user 13.23s system 327% cpu 27.509 total # 19s image, 8.7s initrd
Clearly, when you use dracut on any modern computer, you should make pigz available. dracut should fail to compile unless one explicitly opts into the known-slower gzip. For more thoughts on optional dependencies, see “Optional dependencies don’t work”.
But why does it take 8.7 seconds still? Can we go faster?
The answer is Yes! I recently built a distri-specific initramfs I’m calling minitrd. I wrote both big parts from scratch:

- the initramfs generator (distri initrd)
- a custom init program (cmd/minitrd), running as /init in the initramfs

minitrd generates the initramfs image in ≈400ms, bringing the total time down to:
distri % time make cryptimage serial=1
50.09s user 8.80s system 314% cpu 18.739 total # 18s image, 400ms initrd
(The remaining time is spent in preparing the file system, then installing and configuring the distri system, i.e. preparing a disk image you can run on real hardware.)
How can minitrd be 20 times faster than dracut?
dracut is mainly written in shell, with a C helper program. It drives the generation process by spawning lots of external dependencies (e.g. ldd or the dracut-install helper program). I assume that the combination of using an interpreted language (shell) that spawns lots of processes and precludes a concurrent architecture is to blame for the poor performance.
minitrd is written in Go, with speed as a goal. It leverages concurrency and uses no external dependencies; everything happens within a single process (but with enough threads to saturate modern hardware).
Measuring early boot time using qemu, the dracut-generated initramfs took 588ms to display the full disk encryption passphrase prompt, whereas minitrd took only 195ms.
The rest of this article dives deeper into how minitrd works.
Ultimately, the job of an initramfs is to make the root file system available and continue booting the system from there. Depending on the system setup, this involves the following 5 steps:
Depending on the system, the block devices with the root file system might already be present when the initramfs runs, or some kernel modules might need to be loaded first. On my Dell XPS 9360 laptop, the NVMe system disk is already present when the initramfs starts, whereas in qemu, we need to load the virtio_pci module, followed by the virtio_scsi module.
How will our userland program know which kernel modules to load? Linux kernel modules declare patterns for their supported hardware as an alias, e.g.:
initrd# grep virtio_pci lib/modules/5.4.6/modules.alias
alias pci:v00001AF4d*sv*sd*bc*sc*i* virtio_pci
Devices in sysfs have a modalias file whose content can be matched against these declarations to identify the module to load:
initrd# cat /sys/devices/pci0000:00/*/modalias
pci:v00001AF4d00001005sv00001AF4sd00000004bc00scFFi00
pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00
[…]
Hence, for the initial round of module loading, it is sufficient to locate all modalias files within sysfs and load the responsible modules.
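The matching step can be sketched as follows (not minitrd’s actual code; filepath.Match stands in for fnmatch-style glob matching, and matchAlias is a hypothetical helper):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// matchAlias returns the module names whose alias pattern (a glob from
// modules.alias) matches the given modalias string read from sysfs.
func matchAlias(aliasLines []string, modalias string) []string {
	var modules []string
	for _, line := range aliasLines {
		fields := strings.Fields(line) // "alias <pattern> <module>"
		if len(fields) != 3 || fields[0] != "alias" {
			continue
		}
		// modalias strings contain no '/', so filepath.Match behaves
		// like fnmatch(3) here.
		if ok, _ := filepath.Match(fields[1], modalias); ok {
			modules = append(modules, fields[2])
		}
	}
	return modules
}

func main() {
	aliases := []string{
		"alias pci:v00001AF4d*sv*sd*bc*sc*i* virtio_pci",
	}
	// modalias as read from /sys/devices/pci0000:00/*/modalias:
	fmt.Println(matchAlias(aliases, "pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00")) // [virtio_pci]
}
```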
Loading a kernel module can result in new devices appearing. When that happens, the kernel sends a uevent, which the uevent consumer in userspace receives via a netlink socket. Typically, this consumer is udev(7), but in our case, it’s minitrd.
For each uevent message that comes with a MODALIAS variable, minitrd will load the relevant kernel module(s).
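A uevent datagram is a header like add@/devicepath followed by NUL-separated KEY=VALUE pairs; extracting MODALIAS can be sketched as follows (a hypothetical helper, not minitrd’s code):

```go
package main

import (
	"fmt"
	"strings"
)

// parseUevent parses a kernel uevent datagram as received from a
// NETLINK_KOBJECT_UEVENT socket: a header such as "add@/devices/...",
// followed by NUL-separated KEY=VALUE pairs.
func parseUevent(buf []byte) map[string]string {
	vars := make(map[string]string)
	for _, field := range strings.Split(string(buf), "\x00") {
		if k, v, ok := strings.Cut(field, "="); ok {
			vars[k] = v
		}
	}
	return vars
}

func main() {
	msg := []byte("add@/devices/pci0000:00/0000:00:04.0\x00" +
		"ACTION=add\x00" +
		"MODALIAS=pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00\x00")
	vars := parseUevent(msg)
	fmt.Println(vars["MODALIAS"]) // pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00
}
```

The extracted MODALIAS value is then matched against modules.alias exactly like in the initial sysfs round.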
When loading a kernel module, its dependencies need to be loaded first. Dependency information is stored in the modules.dep file in a Makefile-like syntax:
initrd# grep virtio_pci lib/modules/5.4.6/modules.dep
kernel/drivers/virtio/virtio_pci.ko: kernel/drivers/virtio/virtio_ring.ko kernel/drivers/virtio/virtio.ko
To load a module, we can open its file and then call the Linux-specific finit_module(2) system call. Some modules are expected to return an error code, e.g. ENODEV or ENOENT when some hardware device is not actually present.
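Deriving a load order from a modules.dep line can be sketched as follows (an assumption in this sketch: depmod lists transitive dependencies such that loading them right-to-left, deepest dependency first, is a valid insertion order):

```go
package main

import (
	"fmt"
	"strings"
)

// parseDepLine parses one modules.dep line of the form
// "module.ko: dep1.ko dep2.ko".
func parseDepLine(line string) (string, []string) {
	mod, deps, ok := strings.Cut(line, ":")
	if !ok {
		return strings.TrimSpace(line), nil
	}
	return strings.TrimSpace(mod), strings.Fields(deps)
}

// loadOrder returns the order in which to call finit_module(2):
// listed dependencies in reverse, then the module itself.
func loadOrder(line string) []string {
	mod, deps := parseDepLine(line)
	var order []string
	for i := len(deps) - 1; i >= 0; i-- {
		order = append(order, deps[i])
	}
	return append(order, mod)
}

func main() {
	line := "kernel/drivers/virtio/virtio_pci.ko: kernel/drivers/virtio/virtio_ring.ko kernel/drivers/virtio/virtio.ko"
	fmt.Println(loadOrder(line))
}
```

Already-loaded modules simply return EEXIST from finit_module(2), so re-loading a shared dependency is harmless.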
Side note: next to the textual versions, there are also binary versions of the modules.alias and modules.dep files. Presumably, those can be queried more quickly, but for simplicity, I have not (yet?) implemented support in minitrd.
Setting a legible font is necessary for hi-dpi displays. On my Dell XPS 9360 (3200 x 1800 QHD+ display), the following works well:
initrd# setfont latarcyrheb-sun32
Setting the user’s keyboard layout is necessary for entering the LUKS full-disk encryption passphrase in their preferred keyboard layout. I use the NEO layout:
initrd# loadkeys neo
In the Linux kernel, block device enumeration order is not necessarily the same on each boot. Even if it was deterministic, device order could still be changed when users modify their computer’s device topology (e.g. connect a new disk to a formerly unused port).
Hence, it is good style to refer to disks and their partitions with stable identifiers. This also applies to boot loader configuration, and so most distributions will set a kernel parameter such as root=UUID=1fa04de7-30a9-4183-93e9-1b0061567121.
Identifying the block device or partition with the specified UUID is the initramfs’s job.
Depending on what the device contains, the UUID comes from a different place. For example, ext4 file systems have a UUID field in their file system superblock, whereas LUKS volumes have a UUID in their LUKS header.
Canonically, probing a device to extract the UUID is done by libblkid from the util-linux package, but the logic can easily be re-implemented in other languages and changes rarely. minitrd comes with its own implementation to avoid cgo or running the blkid(8) program.
Unlocking a LUKS-encrypted volume is done in userspace. The kernel handles the crypto, but reading the metadata, obtaining the passphrase (or e.g. key material from a file) and setting up the device mapper table entries are done in user space.
initrd# modprobe algif_skcipher
initrd# cryptsetup luksOpen /dev/sda4 cryptroot1
After the user has entered their passphrase, the root file system can be mounted:
initrd# mount /dev/dm-0 /mnt
Now that everything is set up, we need to pass execution to the init program on the root file system with a careful sequence of chdir(2), mount(2), chroot(2), chdir(2) and execve(2) system calls that is explained in this busybox switch_root comment.
initrd# mount -t devtmpfs dev /mnt/dev
initrd# exec switch_root -c /dev/console /mnt /init
To conserve RAM, the files in the temporary file system to which the initramfs archive is extracted are typically deleted.
An initramfs “image” (more accurately: archive) is a compressed cpio archive. Typically, gzip compression is used, but the kernel supports a bunch of different algorithms and distributions such as Ubuntu are switching to lz4.
Generators typically prepare a temporary directory and feed it to the cpio(1) program. In minitrd, we read the files into memory and generate the cpio archive using the go-cpio package. We use the pgzip package for parallel gzip compression.
The following files need to go into the cpio archive:
The minitrd binary is copied into the cpio archive as /init and will be run by the kernel after extracting the archive.

Like the rest of distri, minitrd is built statically without cgo, which means it can be copied as-is into the cpio archive.
Aside from the modules.alias and modules.dep metadata files, the kernel modules themselves reside in e.g. /lib/modules/5.4.6/kernel and need to be copied into the cpio archive.
Copying all modules results in a ≈80 MiB archive, so it is common to only copy modules that are relevant to the initramfs’s features. This reduces archive size to ≈24 MiB.
The filtering relies on hard-coded patterns and module names. For example, disk encryption related modules are all kernel modules underneath kernel/crypto, plus kernel/drivers/md/dm-crypt.ko.
When generating a host-only initramfs (works on precisely the computer that generated it), some initramfs generators look at the currently loaded modules and just copy those.
The kbd package’s setfont(8) and loadkeys(1) programs load console fonts and keymaps from /usr/share/consolefonts and /usr/share/keymaps, respectively.
Hence, these directories need to be copied into the cpio archive. Depending on whether the initramfs should be generic (work on many computers) or host-only (works on precisely the computer/settings that generated it), the entire directories are copied, or only the required font/keymap.
These programs are (currently) required because minitrd does not implement their functionality.
As they are dynamically linked, not only the programs themselves need to be copied, but also the ELF dynamic linking loader (path stored in the .interp ELF section) and any ELF library dependencies.
For example, cryptsetup in distri declares the ELF interpreter /ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2 and declares dependencies on shared libraries libcryptsetup.so.12, libblkid.so.1 and others. Luckily, in distri, packages contain a lib subdirectory containing symbolic links to the resolved shared library paths (hermetic packaging), so it is sufficient to mirror the lib directory into the cpio archive, recursing into shared library dependencies of shared libraries.
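The inspection part can be sketched with Go’s standard debug/elf package (not minitrd’s actual code; the recursion over each library’s own DT_NEEDED entries is omitted here):

```go
package main

import (
	"bytes"
	"debug/elf"
	"fmt"
)

// elfInfo returns a binary's ELF interpreter (the .interp section)
// and its directly imported shared libraries (DT_NEEDED entries).
func elfInfo(path string) (interp string, needed []string, err error) {
	f, err := elf.Open(path)
	if err != nil {
		return "", nil, err
	}
	defer f.Close()
	if s := f.Section(".interp"); s != nil {
		b, err := s.Data()
		if err != nil {
			return "", nil, err
		}
		interp = string(bytes.TrimRight(b, "\x00"))
	}
	needed, err = f.ImportedLibraries()
	return interp, needed, err
}

func main() {
	// Inspect whichever common binaries exist on this machine.
	for _, p := range []string{"/bin/sh", "/usr/bin/env"} {
		if interp, needed, err := elfInfo(p); err == nil {
			fmt.Println(p, interp, needed)
		}
	}
}
```

A generator would then resolve each DT_NEEDED entry via the package’s lib symlink directory and repeat the lookup for the libraries it finds.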
cryptsetup also requires the GCC runtime library libgcc_s.so.1 to be present at runtime, and will abort with an error message about not being able to call pthread_cancel(3) if it is unavailable.
To print log messages in the correct time zone, we copy /etc/localtime from the host into the cpio archive.
I currently have no desire to make minitrd available outside of distri. While the technical challenges (such as extending the generator to not rely on distri’s hermetic packages) are surmountable, I don’t want to support people’s initramfs remotely.
Also, I think that people’s efforts should in general be spent on rallying behind dracut and making it work faster, thereby benefiting all Linux distributions that use dracut (an increasing number). With minitrd, I have demonstrated that significant speed-ups are achievable.
It was interesting to dive into how an initramfs really works. I had been working with the concept for many years, from small tasks such as “debug why the encrypted root file system is not unlocked” to more complicated tasks such as “set up a root file system on DRBD for a high-availability setup”. But even with that sort of experience, I didn’t know all the details, until I was forced to implement every little thing.
As I suspected going into this exercise, dracut is much slower than it needs to be. Re-implementing its generation stage in a modern language instead of shell helps a lot.
Of course, my minitrd does a bit less than dracut, but not drastically so. The overall architecture is the same.
I hope my effort helps with two things:
As a teaching implementation: instead of wading through the various components that make up a modern initramfs (udev, systemd, various shell scripts, …), people can learn about how an initramfs works in a single place.
I hope the significant time difference motivates people to improve dracut.
Before writing any Go code, I did some manual prototyping. Learning how other people prototype is often immensely useful to me, so I’m sharing my notes here.
First, I copied all kernel modules and a statically built busybox binary:
% mkdir -p lib/modules/5.4.6
% cp -Lr /ro/lib/modules/5.4.6/* lib/modules/5.4.6/
% cp ~/busybox-1.22.0-amd64/busybox sh
To generate an initramfs from the current directory, I used:
% find . | cpio -o -H newc | pigz > /tmp/initrd
In distri’s Makefile, I append these flags to the QEMU invocation:
-kernel /tmp/kernel \
-initrd /tmp/initrd \
-append "root=/dev/mapper/cryptroot1 rdinit=/sh ro console=ttyS0,115200 rd.luks=1 rd.luks.uuid=63051f8a-54b9-4996-b94f-3cf105af2900 rd.luks.name=63051f8a-54b9-4996-b94f-3cf105af2900=cryptroot1 rd.vconsole.keymap=neo rd.vconsole.font=latarcyrheb-sun32 init=/init systemd.setenv=PATH=/bin rw vga=836"
The vga= mode parameter is required for loading the latarcyrheb-sun32 font.
Once in the busybox shell, I manually prepared the required mount points and kernel modules:
ln -s sh mount
ln -s sh lsmod
mkdir /proc /sys /run /mnt
mount -t proc proc /proc
mount -t sysfs sys /sys
mount -t devtmpfs dev /dev
modprobe virtio_pci
modprobe virtio_scsi
As a next step, I copied cryptsetup and dependencies into the initramfs directory:
% for f in /ro/cryptsetup-amd64-2.0.4-6/lib/*; do full=$(readlink -f $f); rel=$(echo $full | sed 's,^/,,g'); mkdir -p $(dirname $rel); install $full $rel; done
% ln -s ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2
% cp /ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so
% cp -r /ro/cryptsetup-amd64-2.0.4-6/lib ro/cryptsetup-amd64-2.0.4-6/
% mkdir -p ro/gcc-libs-amd64-8.2.0-3/out/lib64/
% cp /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1
% ln -s /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/cryptsetup-amd64-2.0.4-6/lib
% cp -r /ro/lvm2-amd64-2.03.00-6/lib ro/lvm2-amd64-2.03.00-6/
In busybox, I used the following commands to unlock the root file system:
modprobe algif_skcipher
./cryptsetup luksOpen /dev/sda4 cryptroot1
mount /dev/dm-0 /mnt
This article focuses on the package format and its advantages, but there is more to distri, which I will cover in upcoming blog posts.
I was a Debian Developer for the 7 years from 2012 to 2019, but using the distribution often left me frustrated, ultimately resulting in me winding down my Debian work.
Frequently, I was noticing a large gap between the actual speed of an operation (e.g. doing an update) and the possible speed based on back of the envelope calculations. I wrote more about this in my blog post “Package managers are slow”.
To me, this observation means that either there is potential to optimize the package manager itself (e.g. apt), or what the system does is just too complex. While I remember seeing some low-hanging fruit①, through my work on distri, I wanted to explore whether all the complexity we currently have in Linux distributions such as Debian or Fedora is inherent to the problem space.
I have completed enough of the experiment to conclude that the complexity is not inherent: I can build a Linux distribution for general-enough purposes which is much less complex than existing ones.
① Those were low-hanging fruit from a user perspective. I’m not saying that fixing them is easy in the technical sense; I know too little about apt’s code base to make such a statement.
One key idea is to switch from using archives to using images for package contents. Common package managers such as dpkg(1) use tar(1) archives with various compression algorithms.
distri uses SquashFS images, a comparatively simple file system image format that I happen to be familiar with from my work on the gokrazy Raspberry Pi 3 Go platform.
This idea is not novel: AppImage and snappy also use images, but only for individual, self-contained applications. distri however uses images for distribution packages with dependencies. In particular, there is no duplication of shared libraries in distri.
A nice side effect of using read-only image files is that applications are immutable and can hence not be broken by accidental (or malicious!) modification.
Package contents are made available under a fully-qualified path. E.g., all files provided by package zsh-amd64-5.6.2-3 are available under /ro/zsh-amd64-5.6.2-3. The mountpoint /ro stands for read-only, which is short yet descriptive.
Perhaps surprisingly, building software with custom prefix values of e.g. /ro/zsh-amd64-5.6.2-3 is widely supported, thanks to:

Linux distributions, which build software with prefix set to /usr, whereas FreeBSD (and the autotools default) builds with prefix set to /usr/local.
Enthusiast users in corporate or research environments, who install software into their home directories.
Because using a custom prefix is a common scenario, upstream awareness for prefix-correctness is generally high, and the rarely required patch will be quickly accepted.
Software packages often exchange data by placing or locating files in well-known directories. Here are just a few examples:
- gcc(1) locates the libusb(3) headers via /usr/include
- man(1) locates the nginx(1) manpage via /usr/share/man
- zsh(1) locates executable programs via PATH components such as /bin
In distri, these locations are called exchange directories and are provided via FUSE in /ro.
Exchange directories come in two different flavors:
global. The exchange directory, e.g. /ro/share
, provides the union of the
share
subdirectory of all packages in the package store.
Global exchange directories are largely used for compatibility, see
below.
per-package. Useful for tight coupling: e.g. irssi(1)
does not provide any ABI guarantees, so a plugin such as irssi-robustirc
can declare that it wants
e.g. /ro/irssi-amd64-1.1.1-1/out/lib/irssi/modules
to be a per-package
exchange directory containing files from its lib/irssi/modules
.
Programs which use exchange directories sometimes use search paths to access
multiple exchange directories. In fact, the examples above were taken from gcc(1)
’s INCLUDEPATH
, man(1)
’s MANPATH
and zsh(1)
’s PATH
. These are
prominent ones, but more examples are easy to find: zsh(1)
loads completion functions from its FPATH
.
Some search path values are derived from --datadir=/ro/share
and require no
further attention, but others might derive from
e.g. --prefix=/ro/zsh-amd64-5.6.2-3/out
and need to be pointed to an exchange
directory via a specific command line flag.
Global exchange directories are used to make distri provide enough of the Filesystem Hierarchy Standard (FHS) that third-party software largely just works. This includes a C development environment.
I successfully ran a few programs from their binary packages such as Google Chrome, Spotify, or Microsoft’s Visual Studio Code.
I previously wrote about how Linux distribution package managers are too slow.
distri’s package manager is extremely fast. Its main bottleneck is typically the network link, even on high-speed links (I tested with a 100 Gbps link).
Its speed comes largely from an architecture which allows the package manager to do less work. Specifically:
Package images can be added atomically to the package store, so we can safely
skip fsync(2)
. Corruption will be cleaned up
automatically, and durability is not important: if an interactive
installation is interrupted, the user can just repeat it, as it will be fresh
in their mind.
Because all packages are co-installable thanks to separate hierarchies, there
are no conflicts at the package store level, and no dependency resolution (an
optimization problem requiring SAT
solving) is required at all.
In exchange directories, we resolve conflicts by selecting the package with the
highest monotonically increasing distri revision number.
distri proves that we can build a useful Linux distribution entirely without hooks and triggers. Not having to serialize hook execution allows us to download packages into the package store with maximum concurrency.
Because we are using images instead of archives, we do not need to unpack anything. This means installing a package is really just writing its package image and metadata to the package store. Sequential writes are typically the fastest kind of storage usage pattern.
Fast installation also makes other use cases more bearable, such as creating disk
images, be it for testing them in qemu(1)
, booting
them on real hardware from a USB drive, or for cloud providers such as Google
Cloud.
Contrary to how distribution package builders are usually implemented, the distri package builder does not actually install any packages into the build environment.
Instead, distri makes available a filtered view of the package store (only
declared dependencies are available) at /ro
in the build environment.
This means that even for large dependency trees, setting up a build environment happens in a fraction of a second! Such a low latency really makes a difference in how comfortable it is to iterate on distribution packages.
In distri, package images are installed from a remote package store into the
local system package store /roimg
, which backs the /ro
mount.
A package store is implemented as a directory of package images and their associated metadata files.
You can easily make a package store available by using distri export
.
To provide a mirror for your local network, you can periodically distri update
from the package store you want to mirror, and then distri export
your local
copy. Special tooling (e.g. debmirror
in Debian) is not required because
distri install
is atomic (and update
uses install
).
Producing derivatives is easy: just add your own packages to a copy of the package store.
The package store is intentionally kept simple to manage and distribute. Its files could be exchanged via peer-to-peer file systems, or synchronized from an offline medium.
distri works well enough to demonstrate the ideas explained above. I have
branched this state into branch
jackherer
, distri’s first
release code name. This way, I can keep experimenting in the distri repository
without breaking your installation.
From the branch contents, our autobuilder creates:
a package repository. Installations can pick up new packages with
distri update
.
The project website can be found at https://distr1.org. The website is just the README for now, but we can improve that later.
The repository can be found at https://github.com/distr1/distri
Right now, distri is mainly a vehicle for my spare-time Linux distribution research. I don’t recommend anyone use distri for anything but research, and there are no medium-term plans of that changing. At the very least, please contact me before basing anything serious on distri so that we can talk about limitations and expectations.
I expect the distri project to live for as long as I have blog posts to publish, and we’ll see what happens afterwards. Note that this is a hobby for me: I will continue to explore, at my own pace, parts that I find interesting.
My hope is that established distributions might get a useful idea or two from distri.
I don’t want to make this post too long, but there is much more!
Please subscribe to the following URL in your feed reader to get all posts about distri:
https://michael.stapelberg.ch/posts/tags/distri/feed.xml
Next in my queue are articles about hermetic packages and good package maintainer experience (including declarative packaging).
I’d love to discuss these ideas in case you’re interested!
Please send feedback to the distri mailing list so that everyone can participate!
Pending feedback: Allan McRae pointed out that I should be more precise with my terminology: strictly speaking, distributions are slow, and package managers are only part of the puzzle.
I’ll try to be clearer in future revisions/posts.
I measured how long the package managers of the most popular Linux distributions
take to install small and large packages (the
ack(1p)
source code search Perl script
and qemu, respectively).
Where required, my measurements include metadata updates such as transferring an up-to-date package list. For me, requiring a metadata update is the more common case, particularly on live systems or within Docker containers.
All measurements were taken on an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
running Docker 20.10.8 on Linux 5.13.10, backed by a Corsair Force MP600 NVMe
drive boasting many hundreds of MB/s write performance. The machine is located
in Zürich and connected to the Internet with a 1 Gigabit fiber connection, so
the expected top download speed is ≈115 MB/s.
See Appendix D for details on the measurement method and command outputs.
Keep in mind that these are one-time measurements. They should be indicative of actual performance, but your experience may vary.
Small package (ack):

| distribution | package manager | data  | wall-clock time | rate      |
|--------------|-----------------|-------|-----------------|-----------|
| Fedora       | dnf             | 84 MB | 25s             | 3.4 MB/s  |
| NixOS        | Nix             | 15 MB | 7s              | 2.3 MB/s  |
| Debian       | apt             | 16 MB | 3s              | 4.9 MB/s  |
| Arch Linux   | pacman          | 25 MB | 1s              | 18.4 MB/s |
| Alpine       | apk             | 10 MB | 1s              | 11.9 MB/s |
Large package (qemu):

| distribution | package manager | data   | wall-clock time | rate      |
|--------------|-----------------|--------|-----------------|-----------|
| Fedora       | dnf             | 350 MB | 56s             | 6.25 MB/s |
| Debian       | apt             | 256 MB | 39s             | 6.5 MB/s  |
| NixOS        | Nix             | 251 MB | 36s             | 6.8 MB/s  |
| Arch Linux   | pacman          | 128 MB | 10s             | 12.1 MB/s |
| Alpine       | apk             | 34 MB  | 1.8s            | 18.6 MB/s |
(Looking for older measurements? See Appendix B (2019) or Appendix C (2020)).
The difference between the slowest and fastest package managers is 30x!
How can Alpine’s apk and Arch Linux’s pacman be an order of magnitude faster than the rest? They are doing a lot less than the others, and more efficiently, too.
For example, Fedora transfers a lot more data than others because its main
package list is 60 MB (compressed!) alone. Compare that with Alpine’s 734 KB
APKINDEX.tar.gz
.
Of course the extra metadata which Fedora provides helps some use cases, otherwise it would hopefully have been removed altogether. But the amount of metadata seems excessive for installing a single package, which I consider the main use case of an interactive package manager.
I expect any modern Linux distribution to only transfer absolutely required data to complete my task.
Because they need to sequence executing arbitrary package maintainer-provided code (hooks and triggers), all tested package managers need to install packages sequentially (one after the other) instead of concurrently (all at the same time).
In my blog post “Can we do without hooks and triggers?”, I outline that hooks and triggers are not strictly necessary to build a working Linux distribution.
Strictly speaking, the only required feature of a package manager is to make available the package contents so that the package can be used: a program can be started, a kernel module can be loaded, etc.
By only implementing what’s needed for this feature, and nothing more, a package
manager could likely beat apk
’s performance.
Here’s a table outlining how the various package managers listed on Wikipedia’s list of software package management systems fare:
| name       | scope  | package file format               | hooks/triggers     |
|------------|--------|-----------------------------------|--------------------|
| AppImage   | apps   | image: ISO9660, SquashFS          | no                 |
| snappy     | apps   | image: SquashFS                   | yes: hooks         |
| FlatPak    | apps   | archive: OSTree                   | no                 |
| 0install   | apps   | archive: tar.bz2                  | no                 |
| nix, guix  | distro | archive: nar.{bz2,xz}             | activation script  |
| dpkg       | distro | archive: tar.{gz,xz,bz2} in ar(1) | yes                |
| rpm        | distro | archive: cpio.{bz2,lz,xz}         | scriptlets         |
| pacman     | distro | archive: tar.xz                   | install            |
| slackware  | distro | archive: tar.{gz,xz}              | yes: doinst.sh     |
| apk        | distro | archive: tar.gz                   | yes: .post-install |
| Entropy    | distro | archive: tar.bz2                  | yes                |
| ipkg, opkg | distro | archive: tar{,.gz}                | yes                |
In the current landscape, there is no distribution-scoped package manager which uses images and leaves out hooks and triggers, not even in smaller Linux distributions.
I think that space is really interesting, as it uses a minimal design to achieve significant real-world speed-ups.
I have explored this idea in much more detail, and am happy to talk more about it in my post distri: a Linux distribution to research fast package management.
There are a couple of recent developments going into the same direction:
% docker run --security-opt=seccomp:unconfined -t -i fedora /bin/bash
[root@62d3cae2e2f9 /]# time dnf install -y ack
Fedora 35 - x86_64 25 MB/s | 61 MB
Fedora 35 openh264 (From Cisco) - x86_64 3.5 kB/s | 2.5 kB
Fedora Modular 35 - x86_64 5.0 MB/s | 2.6 MB
Fedora 35 - x86_64 - Updates 6.0 MB/s | 9.3 MB
Fedora Modular 35 - x86_64 - Updates 4.1 MB/s | 3.3 MB
Dependencies resolved.
[…]
real 0m24.882s
user 0m17.377s
sys 0m0.835s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.ack'
unpacking channels...
created 1 symlinks in user environment
installing 'perl5.34.0-ack-3.5.0'
these paths will be fetched (15.78 MiB download, 86.82 MiB unpacked):
/nix/store/11xpmmwy95396nkhih3qc3814lqhqb8f-libunistring-0.9.10
/nix/store/1h18nl3gisw89znbzbmnxhd7jk20xlff-perl5.34.0-File-Next-1.18
/nix/store/1mpxs3109cjrbhmi3q1vmvc0djz102pl-libidn2-2.3.2
/nix/store/jr35z7n8jbv9q89my50vhyndqd3y541i-attr-2.5.1
/nix/store/krc4xirbvjnff8m62snqdbayg46z5l5b-acl-2.3.1
/nix/store/mij848h2x5wiqkwhg027byvmf9x3gx7y-glibc-2.33-50
/nix/store/wq38iqzdh40dzfsndb927kh7y5bqh457-perl5.34.0-ack-3.5.0-man
/nix/store/xyn0240zrpprnspg3n0fi8c8aw5bq0mr-coreutils-8.32
/nix/store/y8r9ymbz59yjm1bwr3fdvd23jvcb2bzj-perl5.34.0-ack-3.5.0
/nix/store/ypr273yvmr07n5n1w1gbcqnhpw7lbbvz-perl-5.34.0
copying path '/nix/store/wq38iqzdh40dzfsndb927kh7y5bqh457-perl5.34.0-ack-3.5.0-man' from 'https://cache.nixos.org'...
copying path '/nix/store/11xpmmwy95396nkhih3qc3814lqhqb8f-libunistring-0.9.10' from 'https://cache.nixos.org'...
copying path '/nix/store/1h18nl3gisw89znbzbmnxhd7jk20xlff-perl5.34.0-File-Next-1.18' from 'https://cache.nixos.org'...
copying path '/nix/store/1mpxs3109cjrbhmi3q1vmvc0djz102pl-libidn2-2.3.2' from 'https://cache.nixos.org'...
copying path '/nix/store/mij848h2x5wiqkwhg027byvmf9x3gx7y-glibc-2.33-50' from 'https://cache.nixos.org'...
copying path '/nix/store/jr35z7n8jbv9q89my50vhyndqd3y541i-attr-2.5.1' from 'https://cache.nixos.org'...
copying path '/nix/store/krc4xirbvjnff8m62snqdbayg46z5l5b-acl-2.3.1' from 'https://cache.nixos.org'...
copying path '/nix/store/xyn0240zrpprnspg3n0fi8c8aw5bq0mr-coreutils-8.32' from 'https://cache.nixos.org'...
copying path '/nix/store/ypr273yvmr07n5n1w1gbcqnhpw7lbbvz-perl-5.34.0' from 'https://cache.nixos.org'...
copying path '/nix/store/y8r9ymbz59yjm1bwr3fdvd23jvcb2bzj-perl5.34.0-ack-3.5.0' from 'https://cache.nixos.org'...
building '/nix/store/pwlxhy7kry56z6593rh397fc49x5avlw-user-environment.drv'...
created 49 symlinks in user environment
real 0m 6.82s
user 0m 3.47s
sys 0m 2.11s
% docker run -t -i debian:sid
root@40a3899b1f2f:/# time (apt update && apt install -y ack-grep)
Get:1 http://deb.debian.org/debian sid InRelease [165 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8800 kB]
Fetched 8965 kB in 1s (9495 kB/s)
[…]
The following NEW packages will be installed:
ack libfile-next-perl libgdbm-compat4 libgdbm6 libperl5.32 netbase perl perl-modules-5.32
0 upgraded, 8 newly installed, 0 to remove and 24 not upgraded.
Need to get 7479 kB of archives.
After this operation, 47.7 MB of additional disk space will be used.
[…]
real 0m3.260s
user 0m2.463s
sys 0m0.352s
% docker run -t -i archlinux:base
[root@9f6672688a64 /]# time (pacman -Sy && pacman -S --noconfirm ack)
:: Synchronizing package databases...
core 138.8 KiB 1542 KiB/s
extra 1569.8 KiB 26.9 MiB/s
community 5.8 MiB 92.2 MiB/s
resolving dependencies...
looking for conflicting packages...
Packages (5) db-5.3.28-5 gdbm-1.22-1 perl-5.34.0-2 perl-file-next-1.18-3 ack-3.5.0-2
Total Download Size: 16.77 MiB
Total Installed Size: 66.21 MiB
[…]
real 0m1.403s
user 0m0.484s
sys 0m0.211s
% docker run -t -i alpine
# time apk add ack
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
(1/4) Installing libbz2 (1.0.8-r1)
(2/4) Installing perl (5.32.1-r0)
(3/4) Installing perl-file-next (1.18-r2)
(4/4) Installing ack (3.5.0-r1)
Executing busybox-1.33.1-r3.trigger
OK: 43 MiB in 18 packages
real 0m 0.76s
user 0m 0.27s
sys 0m 0.09s
% docker run -t -i fedora /bin/bash
[root@6a52ecfc3afa /]# time dnf install -y qemu
Fedora 35 - x86_64 15 MB/s | 61 MB
Fedora 35 openh264 (From Cisco) - x86_64 3.0 kB/s | 2.5 kB
Fedora Modular 35 - x86_64 5.2 MB/s | 2.6 MB
Fedora 35 - x86_64 - Updates 6.6 MB/s | 9.3 MB
Fedora Modular 35 - x86_64 - Updates 2.2 MB/s | 3.3 MB
Dependencies resolved.
[…]
Total download size: 274 M
Downloading Packages:
[…]
real 0m56.031s
user 0m31.275s
sys 0m3.868s
% docker run -t -i nixos/nix
83971cf79f7e:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.qemu'
unpacking channels...
created 1 symlinks in user environment
installing 'qemu-6.1.0'
these paths will be fetched (230.72 MiB download, 1424.84 MiB unpacked):
[…]
real 0m 36.55s
user 0m 19.83s
sys 0m 3.34s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y qemu-system-x86)
Get:1 http://deb.debian.org/debian sid InRelease [146 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8400 kB]
Fetched 8965 kB in 1s (9048 kB/s)
[…]
Fetched 247 MB in 4s (64.9 MB/s)
[…]
real 0m38.875s
user 0m21.282s
sys 0m5.298s
% docker run -t -i archlinux:base
[root@58c78bda08e8 /]# time (pacman -Sy && pacman -S --noconfirm qemu)
:: Synchronizing package databases...
core 138.7 KiB 1541 KiB/s
extra 1569.8 KiB 35.7 MiB/s
community 5.8 MiB 92.2 MiB/s
[…]
Total Download Size: 118.97 MiB
Total Installed Size: 586.68 MiB
[…]
real 0m10.542s
user 0m3.092s
sys 0m1.569s
% docker run -t -i alpine
/ # time apk add qemu-system-x86_64
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
[…]
OK: 281 MiB in 66 packages
real 0m 1.83s
user 0m 0.77s
sys 0m 0.24s
% docker run -t -i fedora /bin/bash
[root@62d3cae2e2f9 /]# time dnf install -y ack
Fedora 32 openh264 (From Cisco) - x86_64 1.9 kB/s | 2.5 kB 00:01
Fedora Modular 32 - x86_64 6.8 MB/s | 4.9 MB 00:00
Fedora Modular 32 - x86_64 - Updates 5.6 MB/s | 3.7 MB 00:00
Fedora 32 - x86_64 - Updates 9.9 MB/s | 23 MB 00:02
Fedora 32 - x86_64 39 MB/s | 70 MB 00:01
[…]
real 0m32.898s
user 0m25.121s
sys 0m1.408s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.ack'
unpacking channels...
created 1 symlinks in user environment
installing 'perl5.32.0-ack-3.3.1'
these paths will be fetched (15.55 MiB download, 85.51 MiB unpacked):
/nix/store/34l8jdg76kmwl1nbbq84r2gka0kw6rc8-perl5.32.0-ack-3.3.1-man
/nix/store/9df65igwjmf2wbw0gbrrgair6piqjgmi-glibc-2.31
/nix/store/9fd4pjaxpjyyxvvmxy43y392l7yvcwy1-perl5.32.0-File-Next-1.18
/nix/store/czc3c1apx55s37qx4vadqhn3fhikchxi-libunistring-0.9.10
/nix/store/dj6n505iqrk7srn96a27jfp3i0zgwa1l-acl-2.2.53
/nix/store/ifayp0kvijq0n4x0bv51iqrb0yzyz77g-perl-5.32.0
/nix/store/w9wc0d31p4z93cbgxijws03j5s2c4gyf-coreutils-8.31
/nix/store/xim9l8hym4iga6d4azam4m0k0p1nw2rm-libidn2-2.3.0
/nix/store/y7i47qjmf10i1ngpnsavv88zjagypycd-attr-2.4.48
/nix/store/z45mp61h51ksxz28gds5110rf3wmqpdc-perl5.32.0-ack-3.3.1
copying path '/nix/store/34l8jdg76kmwl1nbbq84r2gka0kw6rc8-perl5.32.0-ack-3.3.1-man' from 'https://cache.nixos.org'...
copying path '/nix/store/czc3c1apx55s37qx4vadqhn3fhikchxi-libunistring-0.9.10' from 'https://cache.nixos.org'...
copying path '/nix/store/9fd4pjaxpjyyxvvmxy43y392l7yvcwy1-perl5.32.0-File-Next-1.18' from 'https://cache.nixos.org'...
copying path '/nix/store/xim9l8hym4iga6d4azam4m0k0p1nw2rm-libidn2-2.3.0' from 'https://cache.nixos.org'...
copying path '/nix/store/9df65igwjmf2wbw0gbrrgair6piqjgmi-glibc-2.31' from 'https://cache.nixos.org'...
copying path '/nix/store/y7i47qjmf10i1ngpnsavv88zjagypycd-attr-2.4.48' from 'https://cache.nixos.org'...
copying path '/nix/store/dj6n505iqrk7srn96a27jfp3i0zgwa1l-acl-2.2.53' from 'https://cache.nixos.org'...
copying path '/nix/store/w9wc0d31p4z93cbgxijws03j5s2c4gyf-coreutils-8.31' from 'https://cache.nixos.org'...
copying path '/nix/store/ifayp0kvijq0n4x0bv51iqrb0yzyz77g-perl-5.32.0' from 'https://cache.nixos.org'...
copying path '/nix/store/z45mp61h51ksxz28gds5110rf3wmqpdc-perl5.32.0-ack-3.3.1' from 'https://cache.nixos.org'...
building '/nix/store/m0rl62grplq7w7k3zqhlcz2hs99y332l-user-environment.drv'...
created 49 symlinks in user environment
real 0m 5.60s
user 0m 3.21s
sys 0m 1.66s
% docker run -t -i debian:sid
root@1996bb94a2d1:/# time (apt update && apt install -y ack-grep)
Get:1 http://deb.debian.org/debian sid InRelease [146 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8400 kB]
Fetched 8546 kB in 1s (8088 kB/s)
[…]
The following NEW packages will be installed:
ack libfile-next-perl libgdbm-compat4 libgdbm6 libperl5.30 netbase perl perl-modules-5.30
0 upgraded, 8 newly installed, 0 to remove and 23 not upgraded.
Need to get 7341 kB of archives.
After this operation, 46.7 MB of additional disk space will be used.
[…]
real 0m9.544s
user 0m2.839s
sys 0m0.775s
% docker run -t -i archlinux/base
[root@9f6672688a64 /]# time (pacman -Sy && pacman -S --noconfirm ack)
:: Synchronizing package databases...
core 130.8 KiB 1090 KiB/s 00:00
extra 1655.8 KiB 3.48 MiB/s 00:00
community 5.2 MiB 6.11 MiB/s 00:01
resolving dependencies...
looking for conflicting packages...
Packages (2) perl-file-next-1.18-2 ack-3.4.0-1
Total Download Size: 0.07 MiB
Total Installed Size: 0.19 MiB
[…]
real 0m2.936s
user 0m0.375s
sys 0m0.160s
% docker run -t -i alpine
/ # time apk add ack
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/4) Installing libbz2 (1.0.8-r1)
(2/4) Installing perl (5.30.3-r0)
(3/4) Installing perl-file-next (1.18-r0)
(4/4) Installing ack (3.3.1-r0)
Executing busybox-1.31.1-r16.trigger
OK: 43 MiB in 18 packages
real 0m 1.24s
user 0m 0.40s
sys 0m 0.15s
% docker run -t -i fedora /bin/bash
[root@6a52ecfc3afa /]# time dnf install -y qemu
Fedora 32 openh264 (From Cisco) - x86_64 3.1 kB/s | 2.5 kB 00:00
Fedora Modular 32 - x86_64 6.3 MB/s | 4.9 MB 00:00
Fedora Modular 32 - x86_64 - Updates 6.0 MB/s | 3.7 MB 00:00
Fedora 32 - x86_64 - Updates 334 kB/s | 23 MB 01:10
Fedora 32 - x86_64 33 MB/s | 70 MB 00:02
[…]
Total download size: 181 M
Downloading Packages:
[…]
real 4m37.652s
user 0m38.239s
sys 0m6.321s
% docker run -t -i nixos/nix
83971cf79f7e:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.qemu'
unpacking channels...
created 1 symlinks in user environment
installing 'qemu-5.1.0'
these paths will be fetched (180.70 MiB download, 1146.92 MiB unpacked):
[…]
real 0m 33.64s
user 0m 16.96s
sys 0m 3.05s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y qemu-system-x86)
Get:1 http://deb.debian.org/debian sid InRelease [146 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8400 kB]
Fetched 8546 kB in 1s (5998 kB/s)
[…]
Fetched 216 MB in 43s (5006 kB/s)
[…]
real 1m25.375s
user 0m29.163s
sys 0m12.835s
% docker run -t -i archlinux/base
[root@58c78bda08e8 /]# time (pacman -Sy && pacman -S --noconfirm qemu)
:: Synchronizing package databases...
core 130.8 KiB 1055 KiB/s 00:00
extra 1655.8 KiB 3.70 MiB/s 00:00
community 5.2 MiB 7.89 MiB/s 00:01
[…]
Total Download Size: 135.46 MiB
Total Installed Size: 661.05 MiB
[…]
real 0m43.901s
user 0m4.980s
sys 0m2.615s
% docker run -t -i alpine
/ # time apk add qemu-system-x86_64
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
[…]
OK: 78 MiB in 95 packages
real 0m 2.43s
user 0m 0.46s
sys 0m 0.09s
% docker run -t -i fedora /bin/bash
[root@722e6df10258 /]# time dnf install -y ack
Fedora Modular 30 - x86_64 4.4 MB/s | 2.7 MB 00:00
Fedora Modular 30 - x86_64 - Updates 3.7 MB/s | 2.4 MB 00:00
Fedora 30 - x86_64 - Updates 17 MB/s | 19 MB 00:01
Fedora 30 - x86_64 31 MB/s | 70 MB 00:02
[…]
Install 44 Packages
Total download size: 13 M
Installed size: 42 M
[…]
real 0m29.498s
user 0m22.954s
sys 0m1.085s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -i perl5.28.2-ack-2.28'
unpacking channels...
created 2 symlinks in user environment
installing 'perl5.28.2-ack-2.28'
these paths will be fetched (14.91 MiB download, 80.83 MiB unpacked):
/nix/store/57iv2vch31v8plcjrk97lcw1zbwb2n9r-perl-5.28.2
/nix/store/89gi8cbp8l5sf0m8pgynp2mh1c6pk1gk-attr-2.4.48
/nix/store/gkrpl3k6s43fkg71n0269yq3p1f0al88-perl5.28.2-ack-2.28-man
/nix/store/iykxb0bmfjmi7s53kfg6pjbfpd8jmza6-glibc-2.27
/nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31
/nix/store/svgkibi7105pm151prywndsgvmc4qvzs-acl-2.2.53
/nix/store/x4knf14z1p0ci72gl314i7vza93iy7yc-perl5.28.2-File-Next-1.16
/nix/store/zfj7ria2kwqzqj9dh91kj9kwsynxdfk0-perl5.28.2-ack-2.28
copying path '/nix/store/gkrpl3k6s43fkg71n0269yq3p1f0al88-perl5.28.2-ack-2.28-man' from 'https://cache.nixos.org'...
copying path '/nix/store/iykxb0bmfjmi7s53kfg6pjbfpd8jmza6-glibc-2.27' from 'https://cache.nixos.org'...
copying path '/nix/store/x4knf14z1p0ci72gl314i7vza93iy7yc-perl5.28.2-File-Next-1.16' from 'https://cache.nixos.org'...
copying path '/nix/store/89gi8cbp8l5sf0m8pgynp2mh1c6pk1gk-attr-2.4.48' from 'https://cache.nixos.org'...
copying path '/nix/store/svgkibi7105pm151prywndsgvmc4qvzs-acl-2.2.53' from 'https://cache.nixos.org'...
copying path '/nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31' from 'https://cache.nixos.org'...
copying path '/nix/store/57iv2vch31v8plcjrk97lcw1zbwb2n9r-perl-5.28.2' from 'https://cache.nixos.org'...
copying path '/nix/store/zfj7ria2kwqzqj9dh91kj9kwsynxdfk0-perl5.28.2-ack-2.28' from 'https://cache.nixos.org'...
building '/nix/store/q3243sjg91x1m8ipl0sj5gjzpnbgxrqw-user-environment.drv'...
created 56 symlinks in user environment
real 0m 14.02s
user 0m 8.83s
sys 0m 2.69s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y ack-grep)
Get:1 http://cdn-fastly.deb.debian.org/debian sid InRelease [233 kB]
Get:2 http://cdn-fastly.deb.debian.org/debian sid/main amd64 Packages [8270 kB]
Fetched 8502 kB in 2s (4764 kB/s)
[…]
The following NEW packages will be installed:
ack ack-grep libfile-next-perl libgdbm-compat4 libgdbm5 libperl5.26 netbase perl perl-modules-5.26
The following packages will be upgraded:
perl-base
1 upgraded, 9 newly installed, 0 to remove and 60 not upgraded.
Need to get 8238 kB of archives.
After this operation, 42.3 MB of additional disk space will be used.
[…]
real 0m9.096s
user 0m2.616s
sys 0m0.441s
% docker run -t -i archlinux/base
[root@9604e4ae2367 /]# time (pacman -Sy && pacman -S --noconfirm ack)
:: Synchronizing package databases...
core 132.2 KiB 1033K/s 00:00
extra 1629.6 KiB 2.95M/s 00:01
community 4.9 MiB 5.75M/s 00:01
[…]
Total Download Size: 0.07 MiB
Total Installed Size: 0.19 MiB
[…]
real 0m3.354s
user 0m0.224s
sys 0m0.049s
% docker run -t -i alpine
/ # time apk add ack
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
(1/4) Installing perl-file-next (1.16-r0)
(2/4) Installing libbz2 (1.0.6-r7)
(3/4) Installing perl (5.28.2-r1)
(4/4) Installing ack (3.0.0-r0)
Executing busybox-1.30.1-r2.trigger
OK: 44 MiB in 18 packages
real 0m 0.96s
user 0m 0.25s
sys 0m 0.07s
% docker run -t -i fedora /bin/bash
[root@722e6df10258 /]# time dnf install -y qemu
Fedora Modular 30 - x86_64 3.1 MB/s | 2.7 MB 00:00
Fedora Modular 30 - x86_64 - Updates 2.7 MB/s | 2.4 MB 00:00
Fedora 30 - x86_64 - Updates 20 MB/s | 19 MB 00:00
Fedora 30 - x86_64 31 MB/s | 70 MB 00:02
[…]
Install 262 Packages
Upgrade 4 Packages
Total download size: 172 M
[…]
real 1m7.877s
user 0m44.237s
sys 0m3.258s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -i qemu-4.0.0'
unpacking channels...
created 2 symlinks in user environment
installing 'qemu-4.0.0'
these paths will be fetched (262.18 MiB download, 1364.54 MiB unpacked):
[…]
real 0m 38.49s
user 0m 26.52s
sys 0m 4.43s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y qemu-system-x86)
Get:1 http://cdn-fastly.deb.debian.org/debian sid InRelease [149 kB]
Get:2 http://cdn-fastly.deb.debian.org/debian sid/main amd64 Packages [8426 kB]
Fetched 8574 kB in 1s (6716 kB/s)
[…]
Fetched 151 MB in 2s (64.6 MB/s)
[…]
real 0m51.583s
user 0m15.671s
sys 0m3.732s
% docker run -t -i archlinux/base
[root@9604e4ae2367 /]# time (pacman -Sy && pacman -S --noconfirm qemu)
:: Synchronizing package databases...
core 132.2 KiB 751K/s 00:00
extra 1629.6 KiB 3.04M/s 00:01
community 4.9 MiB 6.16M/s 00:01
[…]
Total Download Size: 123.20 MiB
Total Installed Size: 587.84 MiB
[…]
real 1m2.475s
user 0m9.272s
sys 0m2.458s
% docker run -t -i alpine
/ # time apk add qemu-system-x86_64
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
[…]
OK: 78 MiB in 95 packages
real 0m 2.43s
user 0m 0.46s
sys 0m 0.09s
postgres
), or generating/updating cache files.
Triggers are a kind of hook which run when other packages are installed. For
example, on Debian, the man(1)
package
comes with a trigger which regenerates the search database index whenever any
package installs a manpage. When, for example, the
nginx(8)
package is installed, a
trigger provided by the man(1)
package
runs.
Over the past few decades, Open Source software has become more and more uniform: instead of each piece of software defining its own rules, a small number of build systems are now widely adopted.
Hence, I think it makes sense to revisit whether offering extension via hooks and triggers is a net win or net loss.
Package managers can commonly make very few assumptions about what hooks do, what preconditions they require, and which conflicts might be caused by running multiple packages’ hooks concurrently.
Hence, package managers cannot concurrently install packages. At least the hook/trigger part of the installation needs to happen in sequence.
While it seems technically feasible to retrofit package manager hooks with concurrency primitives such as locks for mutual exclusion between different hook processes, the required overhaul of all hooks¹ seems like such a daunting task that it might be better to just get rid of the hooks instead. Only deleting code frees you from the burden of maintenance, automated testing and debugging.
① In Debian, there are 8620 non-generated maintainer scripts, as reported by
find shard*/src/*/debian -regex ".*\(pre\|post\)\(inst\|rm\)$"
on a Debian
Code Search instance.
Personally, I never use the
apropos(1)
command, so I don’t
appreciate the man(1)
package’s trigger
which updates the database used by
apropos(1)
. The process takes a long
time and, because hooks and triggers must be executed serially (see previous
section), blocks my installation or update.
When I tell people this, they are often surprised to learn about the existence
of the apropos(1)
command. I suggest
adopting an opt-in model.
Hooks run when packages are installed. If a package’s contents are not used between two updates, running the hook in the first update could have been skipped. Running the hook lazily when the package contents are used reduces unnecessary work.
As a welcome side-effect, lazy hook evaluation automatically makes the hook work in operating system images, such as live USB thumb drives or SD card images for the Raspberry Pi. Such images must not ship the same crypto keys (e.g. OpenSSH host keys) to all machines, but instead generate a different key on each machine.
Why do users keep packages installed they don’t use? It’s extra work to remember and clean up those packages after use. Plus, users might not realize or value that having fewer packages installed has benefits such as faster updates.
I can also imagine that there are people for whom the cost of re-installing packages incentivizes them to just keep packages installed—you never know when you might need the program again…
While working on hermetic packages (more on that in another blog post), where
the contained programs are started with modified environment variables
(e.g. PATH
) via a wrapper bash script, I noticed that the overhead of those
wrapper bash scripts quickly becomes significant. For example, when using the
excellent magit interface for Git in Emacs, I encountered
second-long delays² when using hermetic packages compared to standard
packages. Re-implementing wrappers in a compiled language provided a significant
speed-up.
Similarly, getting rid of an extension point which mandates using shell scripts allows us to build an efficient and fast implementation of a predefined set of primitives, where you can reason about their effects and interactions.
② magit needs to run git a few times for displaying the full status, so small overhead quickly adds up.
Hooks are an escape hatch for distribution maintainers to express anything which their packaging system cannot express.
Distributions should only rely on well-established interfaces such as autoconf’s
classic ./configure && make && make install
(including commonly used flags) to
build a distribution package. Integrating upstream software into a distribution
should not require custom hooks. For example, instead of requiring a hook which
updates a cache of schema files, the library used to interact with those files
should transparently (re-)generate the cache or fall back to a slower code path.
Distribution maintainers are hard to come by, so we should value their time. In particular, there is a 1:n relationship of packages to distribution package maintainers (software is typically available in multiple Linux distributions), so it makes sense to spend the work in the 1 and have the n benefit.
If we want to get rid of hooks, we need another mechanism to achieve what we currently achieve with hooks.
If the hook is not specific to the package, it can be moved to the package
manager. The desired system state should either be derived from the package
contents (e.g. required system users can be discovered from systemd service
files) or declaratively specified in the package build instructions—more on that
in another blog post. This turns hooks (arbitrary code) into configuration,
which allows the package manager to collapse and sequence the required state
changes. E.g., when 5 packages are installed which each need a new system user,
the package manager could update /etc/passwd
just once.
If the hook is specific to the package, it should be moved into the package
contents. This typically means moving the functionality into the program start
(or the systemd service file if we are talking about a daemon). If (while?)
upstream is not convinced, you can either wrap the program or patch it. Note
that this case is relatively rare: I have worked with hundreds of packages and
the only package-specific functionality I came across was automatically
generating host keys before starting OpenSSH’s
sshd(8)
³.
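For the sshd case, moving the hook into the service file amounts to something like the following sketch (not the actual Fedora unit): `ssh-keygen -A` generates any missing host key types before the daemon starts, so no install-time hook is needed and the same image works on every machine.

```ini
# Sketch of a service-start replacement for an install-time hook.
[Unit]
Description=OpenSSH server
After=network.target

[Service]
# ssh-keygen -A creates any host keys that do not exist yet, so first
# boot of an image generates per-machine keys automatically.
ExecStartPre=/usr/bin/ssh-keygen -A
ExecStart=/usr/sbin/sshd -D

[Install]
WantedBy=multi-user.target
```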
There is one exception where moving the hook doesn’t work: packages which modify state outside of the system, such as bootloaders or kernel images.
③ Even that can be moved out of a package-specific hook, as Fedora demonstrates.
Global state modifications performed as part of package installation today use hooks, an overly expressive extension mechanism.
Instead, all modifications should be driven by configuration. This is feasible because there are only a few different kinds of desired state modifications. This makes it possible for package managers to optimize package installation.