Various developers have created ad-hoc client libraries based on how the web interface works.
The goal of offering an OpenAPI-based API is to provide developers with automatically generated client libraries for a large number of programming languages, targeting a stable interface that is independent of the web interface’s implementation details.
Visit https://codesearch.debian.net/apikeys/ to download your personal API key. Log in via Debian’s GitLab instance salsa.debian.org; register there if you have no account yet.
Find the Debian Code Search client library for your programming language. If none exists yet, auto-generate a client library on editor.swagger.io: click “Generate Client”.
Search all code in Debian from your own analysis tool, migration tracking dashboard, etc.
curl \
-H "x-dcs-apikey: $(cat dcs-apikey-stapelberg.txt)" \
-X GET \
"https://codesearch.debian.net/api/v1/search?query=i3Font&match_mode=regexp"
You can try out the API in your web browser in the OpenAPI documentation.
Here’s an example program that demonstrates how to set up an auto-generated Go client for the Debian Code Search OpenAPI, run a query, and aggregate the results:
func burndown() error {
	cfg := openapiclient.NewConfiguration()
	cfg.AddDefaultHeader("x-dcs-apikey", apiKey)
	client := openapiclient.NewAPIClient(cfg)
	ctx := context.Background()
	// Search through the full Debian Code Search corpus, blocking until all
	// results are available:
	results, _, err := client.SearchApi.Search(ctx, "fmt.Sprint(err)", &openapiclient.SearchApiSearchOpts{
		// Literal searches are faster and do not require escaping special
		// characters; regular expression searches are more powerful.
		MatchMode: optional.NewString("literal"),
	})
	if err != nil {
		return err
	}
	// Print to stdout a CSV file with the path and number of occurrences:
	wr := csv.NewWriter(os.Stdout)
	header := []string{"path", "number of occurrences"}
	if err := wr.Write(header); err != nil {
		return err
	}
	occurrences := make(map[string]int)
	for _, result := range results {
		occurrences[result.Path]++
	}
	for _, result := range results {
		o, ok := occurrences[result.Path]
		if !ok {
			continue
		}
		// Print one CSV record per path:
		delete(occurrences, result.Path)
		record := []string{result.Path, strconv.Itoa(o)}
		if err := wr.Write(record); err != nil {
			return err
		}
	}
	wr.Flush()
	return wr.Error()
}
The full example can be found under burndown.go.

If you run into any issues, please file a GitHub issue on github.com/Debian/dcs!
I’m aware of the following third-party projects using Debian Code Search:
| Tool | Migration status |
|---|---|
| Debian Code Search CLI tool | Updated to OpenAPI |
| identify-incomplete-xs-go-import-path | Update pending |
| gnome-codesearch | makes no API queries |
If you find any others, please point them to this post in case they are not using Debian Code Search’s OpenAPI yet.
The focus of this release lies on:
- a better developer experience, allowing users to debug any installed package without extra setup steps
- performance improvements in all areas (starting programs, building distri packages, generating distri images)
- better tooling for keeping track of upstream versions
See the release notes for more details.
The distri research linux distribution project was started in 2019 to research whether a few architectural changes could enable drastically faster package management.
While the package managers in common Linux distributions (e.g. apt, dnf, …) top out at data rates of only a few MB/s, distri effortlessly saturates 1 Gbit, 10 Gbit and even 40 Gbit connections, resulting in fast installation and update speeds.
Packages in distri (e.g. emacs) are hermetic. By hermetic, I mean that the dependencies a package uses (e.g. libusb) don’t change, even when newer versions are installed.

For example, if package libusb-amd64-1.0.22-7 is available at build time, the package will always use that same version, even after the newer libusb-amd64-1.0.23-8 is installed into the package store.
Another way of saying the same thing is: packages in distri are always co-installable.
This makes the package store more robust: additions to it will not break the system. On a technical level, the package store is implemented as a directory containing distri SquashFS images and metadata files, into which packages are installed in an atomic way.
One exception where hermeticity is not desired is plugin mechanisms: optionally loading out-of-tree code at runtime obviously is not hermetic.
As an example, consider glibc’s Name Service Switch (NSS) mechanism. Section “29.4.1 Adding another Service to NSS” of the glibc manual describes how glibc searches $prefix/lib for shared libraries at runtime.

Debian ships about a dozen NSS libraries for a variety of purposes, and enterprise setups might add their own into the mix.

systemd (as of v245) accounts for 4 NSS libraries, e.g. nss-systemd for user/group name resolution for users allocated through systemd’s DynamicUser= option.
Having packages be as hermetic as possible remains a worthwhile goal despite any exceptions: I will gladly use a 99% hermetic system over a 0% hermetic system any day.
Side note: Xorg’s driver model (which can be characterized as a plugin mechanism) does not fall under this category because of its tight API/ABI coupling! For this case, where drivers are only guaranteed to work with precisely the Xorg version for which they were compiled, distri uses per-package exchange directories.
On a technical level, the requirement is: all paths used by the program must always result in the same contents. This is implemented in distri via the read-only package store mounted at /ro, e.g. files underneath /ro/emacs-amd64-26.3-15 never change.
To make all paths used by a program fully qualified, three strategies cover most paths in practice:
Programs on Linux use the ELF file format, which contains two kinds of references:

First, the ELF interpreter (PT_INTERP segment), which is used to start the program. For dynamically linked programs on 64-bit systems, this is typically ld.so(8).

Many distributions use system-global paths such as /lib64/ld-linux-x86-64.so.2, but distri compiles programs with -Wl,--dynamic-linker=/ro/glibc-amd64-2.31-4/out/lib/ld-linux-x86-64.so.2 so that the full path ends up in the binary.

The ELF interpreter is shown by file(1), but you can also use readelf -a $BINARY | grep 'program interpreter' to display it.
And secondly, the rpath, a run-time search path for dynamic libraries. Instead of storing full references to all dynamic libraries, we set the rpath so that ld.so(8) will find the correct dynamic libraries.

Originally, we used to just set a long rpath, containing one entry for each dynamic library dependency. However, we have since switched to using a single lib subdirectory per package as its rpath, and placing symlinks with full path references into that lib directory, e.g. using -Wl,-rpath=/ro/grep-amd64-3.4-4/lib. This is better for performance, as ld.so uses a per-directory cache.
Note that program load times are significantly influenced by how quickly you can locate the dynamic libraries. distri uses a FUSE file system to load programs from, so getting proper -ENOENT caching into place drastically sped up program load times.
Instead of compiling software with the -Wl,--dynamic-linker and -Wl,-rpath flags, one can also modify these fields after the fact using patchelf(1). For closed-source programs, this is the only possibility.

The rpath can be inspected by using e.g. readelf -a $BINARY | grep RPATH.
Many programs are influenced by environment variables: to start another program, said program is often found by checking each directory in the PATH environment variable.

Such search paths are prevalent in scripting languages, too, to find modules: Python has PYTHONPATH, Perl has PERL5LIB, and so on.
To set up these search path environment variables at run time, distri employs an indirection. Instead of e.g. teensy-loader-cli, you run a small wrapper program that calls precisely one execve system call with the desired environment variables.
Initially, I used shell scripts as wrapper programs because they are easily inspectable. This turned out to be too slow, so I switched to compiled programs. I’m linking them statically for fast startup, and I’m linking them against musl libc for significantly smaller file sizes than glibc (per-executable overhead adds up quickly in a distribution!).
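To illustrate the wrapper idea, here is a minimal sketch in Go (distri’s real wrappers are the statically linked programs described above; the target path below is taken from the appendix, and the prepend helper is purely illustrative):

```go
// Illustrative wrapper: exec the real binary with package-specific
// search paths prepended to the environment.
package main

import (
	"log"
	"os"
	"strings"
	"syscall"
)

// prepend puts dir in front of the existing value of key (creating the
// variable if it does not exist yet).
func prepend(env []string, key, dir string) []string {
	for i, kv := range env {
		if strings.HasPrefix(kv, key+"=") {
			env[i] = key + "=" + dir + ":" + strings.TrimPrefix(kv, key+"=")
			return env
		}
	}
	return append(env, key+"="+dir)
}

func main() {
	const target = "/ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli"
	env := prepend(os.Environ(), "PATH", "/ro/teensy-loader-cli-amd64-2.1+g20180927-7/bin")
	// Precisely one execve(2): replace this process with the target.
	if err := syscall.Exec(target, append([]string{target}, os.Args[1:]...), env); err != nil {
		log.Fatal(err)
	}
}
```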
Note that the wrapper programs prepend to the PATH environment variable; they don’t replace it in its entirety. This is important so that users have a way to extend the PATH (and other variables) if they so choose. This doesn’t hurt hermeticity because it is only relevant for programs that were not present at build time, i.e. plugin mechanisms which, by design, cannot be hermetic.
The shebang line of scripts contains a path, too, and hence needs to be changed.
We don’t do this in distri yet (the number of packaged scripts is small), but we should.
The performance improvements in the previous sections are not just nice to have, but practically required when many processes are involved: without them, you’ll encounter second-long delays in magit, which spawns many git processes under the covers, or in dracut, which spawns one cp(1) process per file.
Linux distributions such as Debian consider it an advantage to roll out security fixes to the entire system by updating a single shared library package (e.g. openssl).

The flip side of that coin is that changes to a single critical package can break the entire system.

With hermetic packages, all reverse dependencies must be rebuilt when a library’s changes should be picked up by the whole system. E.g., when openssl changes, curl must be rebuilt to pick up the new version of openssl.
This approach trades off using more bandwidth and more disk space (temporarily) against reducing the blast radius of any individual package update.
One consequence of fully qualified paths is that search path environment variables such as PATH grow long. This can be partially mitigated by removing empty directories at build time, which results in shorter variables. In general, there is no getting around this. One little trick to make long variables readable is to use tr : '\n', e.g.:
distri0# echo $PATH
/usr/bin:/bin:/usr/sbin:/sbin:/ro/openssh-amd64-8.2p1-11/out/bin
distri0# echo $PATH | tr : '\n'
/usr/bin
/bin
/usr/sbin
/sbin
/ro/openssh-amd64-8.2p1-11/out/bin
The implementation outlined above works well in hundreds of packages, and only a small handful exhibited problems of any kind. Here are some issues I encountered:
NSS libraries built against glibc 2.28 and newer cannot be loaded by glibc 2.27. In all likelihood, such changes do not happen too often, but it does illustrate that glibc’s published interface spec is not sufficient for forwards and backwards compatibility.
In distri, we could likely use a per-package exchange directory for glibc’s NSS mechanism to prevent the above problem from happening in the future.
Some programs try to arrange for themselves to be re-executed outside of their current process tree. For example, consider building a program with the meson build system:

When meson first configures the build, it generates ninja files (think Makefiles) which contain command lines that run the meson --internal helper.

Once meson returns, ninja is called as a separate process, so it will not have the environment which the meson wrapper sets up. ninja then runs the previously persisted meson command line. Since the command line uses the full path to meson (not to its wrapper), it bypasses the wrapper.
Luckily, not many programs try to arrange for other process trees to run them. Here is a table summarizing how affected programs might try to arrange for re-execution, whether the technique results in a wrapper bypass, and what we do about it in distri:

| technique to execute itself | uses wrapper | mitigation |
|---|---|---|
| run-time: find own basename in PATH | yes | wrapper program |
| compile-time: embed expected path | no; bypass! | configure or patch |
| run-time: argv[0] or /proc/self/exe | no; bypass! | patch |
One might think that setting argv[0] to the wrapper location seems like a way to side-step this problem. We tried doing this in distri, but had to revert and go the other way. (Login shells are indicated by a - character prepended to argv[0], so shells like bash or zsh cannot use wrapper programs.)

At a very high level, adopting hermetic packages will require two steps:
1. Using fully qualified paths whose contents don’t change (e.g. /ro/emacs-amd64-26.3-15) generally requires rebuilding programs, e.g. with --prefix set.
2. Once you use fully qualified paths, you need to make the packages able to exchange data. distri solves this with exchange directories, implemented in the /ro file system, which is backed by a FUSE daemon.

The first step is pretty simple, whereas the second step is where I expect controversy around any suggested mechanism.
This appendix contains commands and their outputs, run on the upcoming distri version supersilverhaze, but verified to work on older versions, too.
The /bin directory contains symlinks for the union of all packages’ bin subdirectories:
distri0# readlink -f /bin/teensy_loader_cli
/ro/teensy-loader-cli-amd64-2.1+g20180927-7/bin/teensy_loader_cli
The wrapper program in the bin subdirectory is small:
distri0# ls -lh $(readlink -f /bin/teensy_loader_cli)
-rwxr-xr-x 1 root root 46K Apr 21 21:56 /ro/teensy-loader-cli-amd64-2.1+g20180927-7/bin/teensy_loader_cli
Wrapper programs execute quickly:
distri0# strace -fvy /bin/teensy_loader_cli |& head | cat -n
1 execve("/bin/teensy_loader_cli", ["/bin/teensy_loader_cli"], ["USER=root", "LOGNAME=root", "HOME=/root", "PATH=/ro/bash-amd64-5.0-4/bin:/r"..., "SHELL=/bin/zsh", "TERM=screen.xterm-256color", "XDG_SESSION_ID=c1", "XDG_RUNTIME_DIR=/run/user/0", "DBUS_SESSION_BUS_ADDRESS=unix:pa"..., "XDG_SESSION_TYPE=tty", "XDG_SESSION_CLASS=user", "SSH_CLIENT=10.0.2.2 42556 22", "SSH_CONNECTION=10.0.2.2 42556 10"..., "SSH_TTY=/dev/pts/0", "SHLVL=1", "PWD=/root", "OLDPWD=/root", "_=/usr/bin/strace", "LD_LIBRARY_PATH=/ro/bash-amd64-5"..., "PERL5LIB=/ro/bash-amd64-5.0-4/ou"..., "PYTHONPATH=/ro/bash-amd64-5.0-4/"...]) = 0
2 arch_prctl(ARCH_SET_FS, 0x40c878) = 0
3 set_tid_address(0x40ca9c) = 715
4 brk(NULL) = 0x15b9000
5 brk(0x15ba000) = 0x15ba000
6 brk(0x15bb000) = 0x15bb000
7 brk(0x15bd000) = 0x15bd000
8 brk(0x15bf000) = 0x15bf000
9 brk(0x15c1000) = 0x15c1000
10 execve("/ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli", ["/ro/teensy-loader-cli-amd64-2.1+"...], ["USER=root", "LOGNAME=root", "HOME=/root", "PATH=/ro/bash-amd64-5.0-4/bin:/r"..., "SHELL=/bin/zsh", "TERM=screen.xterm-256color", "XDG_SESSION_ID=c1", "XDG_RUNTIME_DIR=/run/user/0", "DBUS_SESSION_BUS_ADDRESS=unix:pa"..., "XDG_SESSION_TYPE=tty", "XDG_SESSION_CLASS=user", "SSH_CLIENT=10.0.2.2 42556 22", "SSH_CONNECTION=10.0.2.2 42556 10"..., "SSH_TTY=/dev/pts/0", "SHLVL=1", "PWD=/root", "OLDPWD=/root", "_=/usr/bin/strace", "LD_LIBRARY_PATH=/ro/bash-amd64-5"..., "PERL5LIB=/ro/bash-amd64-5.0-4/ou"..., "PYTHONPATH=/ro/bash-amd64-5.0-4/"...]) = 0
Confirm which ELF interpreter is set for a binary using readelf(1):
distri0# readelf -a /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli | grep 'program interpreter'
[Requesting program interpreter: /ro/glibc-amd64-2.31-4/out/lib/ld-linux-x86-64.so.2]
Confirm the rpath is set to the package’s lib subdirectory using readelf(1):
distri0# readelf -a /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli | grep RPATH
0x000000000000000f (RPATH) Library rpath: [/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib]
…and verify the lib subdirectory has the expected symlinks and target versions:
distri0# find /ro/teensy-loader-cli-amd64-*/lib -type f -printf '%P -> %l\n'
libc.so.6 -> /ro/glibc-amd64-2.31-4/out/lib/libc-2.31.so
libpthread.so.0 -> /ro/glibc-amd64-2.31-4/out/lib/libpthread-2.31.so
librt.so.1 -> /ro/glibc-amd64-2.31-4/out/lib/librt-2.31.so
libudev.so.1 -> /ro/libudev-amd64-245-11/out/lib/libudev.so.1.6.17
libusb-0.1.so.4 -> /ro/libusb-compat-amd64-0.1.5-7/out/lib/libusb-0.1.so.4.4.4
libusb-1.0.so.0 -> /ro/libusb-amd64-1.0.23-8/out/lib/libusb-1.0.so.0.2.0
To verify the correct libraries are actually loaded, you can set the LD_DEBUG environment variable for ld.so(8):
distri0# LD_DEBUG=libs teensy_loader_cli
[…]
678: find library=libc.so.6 [0]; searching
678: search path=/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib (RPATH from file /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli)
678: trying file=/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib/libc.so.6
678:
[…]
NSS libraries that distri ships:
distri0# find /lib/ -name "libnss_*.so.2" -type f -printf '%P -> %l\n'
libnss_myhostname.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_myhostname.so.2
libnss_mymachines.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_mymachines.so.2
libnss_resolve.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_resolve.so.2
libnss_systemd.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_systemd.so.2
libnss_compat.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_compat.so.2
libnss_db.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_db.so.2
libnss_dns.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_dns.so.2
libnss_files.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_files.so.2
libnss_hesiod.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_hesiod.so.2
“[…] initrd is a scheme for loading a temporary root file system into memory, which may be used as part of the Linux startup process […] to make preparations before the real root file system can be mounted.”
Many Linux distributions do not compile all file system drivers into the kernel, but instead load them on-demand from an initramfs, which saves memory.
Another common scenario, in which an initramfs is required, is full-disk encryption: the disk must be unlocked from userspace, but since userspace is encrypted, an initramfs is used.
Thus far, building a distri disk image was quite slow:
This is on an AMD Ryzen 3900X 12-core processor (2019):
distri % time make cryptimage serial=1
80.29s user 13.56s system 186% cpu 50.419 total # 19s image, 31s initrd
Of these 50 seconds, dracut’s initramfs generation accounts for 31 seconds (62%)!
Initramfs generation time drops to 8.7 seconds once dracut no longer needs to use the single-threaded gzip(1), but the multi-threaded replacement pigz(1) instead:
This brings the total time to build a distri disk image down to:
distri % time make cryptimage serial=1
76.85s user 13.23s system 327% cpu 27.509 total # 19s image, 8.7s initrd
Clearly, when you use dracut on any modern computer, you should make pigz available. dracut should fail to compile unless one explicitly opts into the known-slower gzip. For more thoughts on optional dependencies, see “Optional dependencies don’t work”.
But why does it take 8.7 seconds still? Can we go faster?
The answer is Yes! I recently built a distri-specific initramfs I’m calling minitrd. I wrote both big parts from scratch:

- the initramfs generator (distri initrd)
- the init program (cmd/minitrd), running as /init in the initramfs

minitrd generates the initramfs image in ≈400ms, bringing the total time down to:
distri % time make cryptimage serial=1
50.09s user 8.80s system 314% cpu 18.739 total # 18s image, 400ms initrd
(The remaining time is spent in preparing the file system, then installing and configuring the distri system, i.e. preparing a disk image you can run on real hardware.)
How can minitrd be 20 times faster than dracut?
dracut is mainly written in shell, with a C helper program. It drives the generation process by spawning lots of external dependencies (e.g. ldd or the dracut-install helper program). I assume that the combination of using an interpreted language (shell), spawning lots of processes, and precluding a concurrent architecture is to blame for the poor performance.
minitrd is written in Go, with speed as a goal. It leverages concurrency and uses no external dependencies; everything happens within a single process (but with enough threads to saturate modern hardware).
Measuring early boot time using qemu, the dracut-generated initramfs took 588ms to display the full disk encryption passphrase prompt, whereas minitrd took only 195ms.
The rest of this article dives deeper into how minitrd works.
Ultimately, the job of an initramfs is to make the root file system available and continue booting the system from there. Depending on the system setup, this involves the following 5 steps:
Depending on the system, the block devices with the root file system might already be present when the initramfs runs, or some kernel modules might need to be loaded first. On my Dell XPS 9360 laptop, the NVMe system disk is already present when the initramfs starts, whereas in qemu, we need to load the virtio_pci module, followed by the virtio_scsi module.
How will our userland program know which kernel modules to load? Linux kernel modules declare patterns for their supported hardware as an alias, e.g.:
initrd# grep virtio_pci lib/modules/5.4.6/modules.alias
alias pci:v00001AF4d*sv*sd*bc*sc*i* virtio_pci
Devices in sysfs have a modalias file whose content can be matched against these declarations to identify the module to load:
initrd# cat /sys/devices/pci0000:00/*/modalias
pci:v00001AF4d00001005sv00001AF4sd00000004bc00scFFi00
pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00
[…]
Hence, for the initial round of module loading, it is sufficient to locate all modalias files within sysfs and load the responsible modules.
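As an illustration of how little logic this requires, here is a hedged sketch in Go that walks sysfs for modalias files and matches them against modules.alias patterns (not minitrd’s actual code; the kernel version path is hard-coded for brevity):

```go
package main

import (
	"bufio"
	"fmt"
	"io/fs"
	"os"
	"path"
	"path/filepath"
	"strings"
)

// loadAliases parses lines of the form “alias <pattern> <module>”.
func loadAliases(file string) (map[string]string, error) {
	f, err := os.Open(file)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	aliases := make(map[string]string) // pattern → module name
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // lines can be long
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 3 && fields[0] == "alias" {
			aliases[fields[1]] = fields[2]
		}
	}
	return aliases, scanner.Err()
}

func main() {
	aliases, err := loadAliases("/lib/modules/5.4.6/modules.alias")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Walk sysfs and match each modalias file against all patterns.
	// A linear scan per device is fine for a sketch; a real
	// implementation would index the patterns.
	filepath.WalkDir("/sys/devices", func(p string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || d.Name() != "modalias" {
			return nil
		}
		b, err := os.ReadFile(p)
		if err != nil {
			return nil
		}
		modalias := strings.TrimSpace(string(b))
		for pattern, module := range aliases {
			// fnmatch-style patterns; path.Match suffices because
			// modalias strings contain no “/” separators.
			if ok, _ := path.Match(pattern, modalias); ok {
				fmt.Printf("%s → load %s\n", modalias, module)
			}
		}
		return nil
	})
}
```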
Loading a kernel module can result in new devices appearing. When that happens, the kernel sends a uevent, which the uevent consumer in userspace receives via a netlink socket. Typically, this consumer is udev(7), but in our case, it’s minitrd.
For each uevent message that comes with a MODALIAS variable, minitrd will load the relevant kernel module(s).
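A minimal uevent consumer might look like the following sketch, assuming the golang.org/x/sys/unix package (the multicast group mask and buffer size are assumptions):

```go
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

func main() {
	// Subscribe to kernel uevents via a netlink socket.
	fd, err := unix.Socket(unix.AF_NETLINK, unix.SOCK_DGRAM, unix.NETLINK_KOBJECT_UEVENT)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	sa := &unix.SockaddrNetlink{
		Family: unix.AF_NETLINK,
		Groups: 1, // kernel uevent multicast group
		Pid:    uint32(os.Getpid()),
	}
	if err := unix.Bind(fd, sa); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	buf := make([]byte, 64*1024)
	for {
		n, _, err := unix.Recvfrom(fd, buf, 0)
		if err != nil {
			continue
		}
		// A uevent is a NUL-separated list: “action@devpath”, then
		// KEY=VALUE pairs, one of which may be MODALIAS.
		for _, kv := range strings.Split(string(buf[:n]), "\x00") {
			if alias, ok := strings.CutPrefix(kv, "MODALIAS="); ok {
				fmt.Println("would load module for", alias)
			}
		}
	}
}
```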
When loading a kernel module, its dependencies need to be loaded first. Dependency information is stored in the modules.dep file in a Makefile-like syntax:
initrd# grep virtio_pci lib/modules/5.4.6/modules.dep
kernel/drivers/virtio/virtio_pci.ko: kernel/drivers/virtio/virtio_ring.ko kernel/drivers/virtio/virtio.ko
To load a module, we can open its file and then call the Linux-specific finit_module(2) system call. Some modules are expected to return an error code, e.g. ENODEV or ENOENT when some hardware device is not actually present.
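For illustration, here is a hedged sketch of the loading step, again assuming golang.org/x/sys/unix (which wraps finit_module(2)); the module path is an example:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// insmod loads a single kernel module; dependencies from modules.dep
// must already have been loaded by the caller.
func insmod(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// finit_module(2) takes an open file descriptor to the .ko file.
	err = unix.FinitModule(int(f.Fd()), "", 0)
	switch err {
	case unix.ENODEV, unix.ENOENT:
		return nil // expected: hardware not actually present
	case unix.EEXIST:
		return nil // module already loaded
	}
	return err
}

func main() {
	if err := insmod("/lib/modules/5.4.6/kernel/drivers/virtio/virtio_pci.ko"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```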
Side note: next to the textual versions, there are also binary versions of the modules.alias and modules.dep files. Presumably, those can be queried more quickly, but for simplicity, I have not (yet?) implemented support in minitrd.
Setting a legible font is necessary for hi-dpi displays. On my Dell XPS 9360 (3200 x 1800 QHD+ display), the following works well:
initrd# setfont latarcyrheb-sun32
Setting the user’s keyboard layout is necessary for entering the LUKS full-disk encryption passphrase in their preferred keyboard layout. I use the NEO layout:
initrd# loadkeys neo
In the Linux kernel, block device enumeration order is not necessarily the same on each boot. Even if it was deterministic, device order could still be changed when users modify their computer’s device topology (e.g. connect a new disk to a formerly unused port).
Hence, it is good style to refer to disks and their partitions with stable identifiers. This also applies to boot loader configuration, and so most distributions will set a kernel parameter such as root=UUID=1fa04de7-30a9-4183-93e9-1b0061567121.

Identifying the block device or partition with the specified UUID is the initramfs’s job.
Depending on what the device contains, the UUID comes from a different place. For example, ext4 file systems have a UUID field in their file system superblock, whereas LUKS volumes have a UUID in their LUKS header.

Canonically, probing a device to extract the UUID is done by libblkid from the util-linux package, but the logic can easily be re-implemented in other languages and changes rarely. minitrd comes with its own implementation to avoid cgo or running the blkid(8) program.
Unlocking a LUKS-encrypted volume is done in userspace. The kernel handles the crypto, but reading the metadata, obtaining the passphrase (or e.g. key material from a file) and setting up the device mapper table entries are done in user space.
initrd# modprobe algif_skcipher
initrd# cryptsetup luksOpen /dev/sda4 cryptroot1
After the user has entered their passphrase, the root file system can be mounted:
initrd# mount /dev/dm-0 /mnt
Now that everything is set up, we need to pass execution to the init program on the root file system with a careful sequence of chdir(2), mount(2), chroot(2), chdir(2) and execve(2) system calls that is explained in this busybox switch_root comment.
initrd# mount -t devtmpfs dev /mnt/dev
initrd# exec switch_root -c /dev/console /mnt /init
To conserve RAM, the files in the temporary file system to which the initramfs archive is extracted are typically deleted.
An initramfs “image” (more accurately: archive) is a compressed cpio archive. Typically, gzip compression is used, but the kernel supports a bunch of different algorithms and distributions such as Ubuntu are switching to lz4.
Generators typically prepare a temporary directory and feed it to the cpio(1) program. In minitrd, we read the files into memory and generate the cpio archive using the go-cpio package. We use the pgzip package for parallel gzip compression.
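Here is a hedged sketch of the core of such a generator, assuming the github.com/cavaliercoder/go-cpio and github.com/klauspost/pgzip packages (the archive contents are illustrative):

```go
package main

import (
	"fmt"
	"os"

	cpio "github.com/cavaliercoder/go-cpio"
	pgzip "github.com/klauspost/pgzip"
)

func check(err error) {
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

func main() {
	out, err := os.Create("/tmp/initrd")
	check(err)
	defer out.Close()
	gz := pgzip.NewWriter(out) // parallel gzip compression
	cw := cpio.NewWriter(gz)
	body := []byte("#!/bin/sh\necho hello from the initramfs\n")
	// One header + body per file; /init is what the kernel executes.
	check(cw.WriteHeader(&cpio.Header{
		Name: "init",
		Mode: 0o755,
		Size: int64(len(body)),
	}))
	_, err = cw.Write(body)
	check(err)
	// Flush the cpio trailer first, then the gzip stream.
	check(cw.Close())
	check(gz.Close())
}
```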
The following files need to go into the cpio archive:
The minitrd binary is copied into the cpio archive as /init and will be run by the kernel after extracting the archive.

Like the rest of distri, minitrd is built statically without cgo, which means it can be copied as-is into the cpio archive.
Aside from the modules.alias and modules.dep metadata files, the kernel modules themselves reside in e.g. /lib/modules/5.4.6/kernel and need to be copied into the cpio archive.
Copying all modules results in a ≈80 MiB archive, so it is common to only copy modules that are relevant to the initramfs’s features. This reduces archive size to ≈24 MiB.
The filtering relies on hard-coded patterns and module names. For example, disk encryption related modules are all kernel modules underneath kernel/crypto, plus kernel/drivers/md/dm-crypt.ko.
When generating a host-only initramfs (works on precisely the computer that generated it), some initramfs generators look at the currently loaded modules and just copy those.
The kbd package’s setfont(8) and loadkeys(1) programs load console fonts and keymaps from /usr/share/consolefonts and /usr/share/keymaps, respectively.
Hence, these directories need to be copied into the cpio archive. Depending on whether the initramfs should be generic (work on many computers) or host-only (works on precisely the computer/settings that generated it), the entire directories are copied, or only the required font/keymap.
These programs are (currently) required because minitrd does not implement their functionality.
As they are dynamically linked, not only the programs themselves need to be copied, but also the ELF dynamic linking loader (path stored in the .interp ELF section) and any ELF library dependencies.

For example, cryptsetup in distri declares the ELF interpreter /ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2 and declares dependencies on shared libraries libcryptsetup.so.12, libblkid.so.1 and others. Luckily, in distri, packages contain a lib subdirectory containing symbolic links to the resolved shared library paths (hermetic packaging), so it is sufficient to mirror the lib directory into the cpio archive, recursing into shared library dependencies of shared libraries.
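Go’s standard library debug/elf package can extract both kinds of ELF references; a minimal sketch:

```go
package main

import (
	"bytes"
	"debug/elf"
	"fmt"
	"os"
)

func main() {
	f, err := elf.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	// The ELF interpreter lives in the .interp section, NUL-terminated:
	if s := f.Section(".interp"); s != nil {
		if b, err := s.Data(); err == nil {
			fmt.Printf("interpreter: %s\n", bytes.TrimRight(b, "\x00"))
		}
	}
	// Shared library dependencies are DT_NEEDED dynamic entries:
	if needed, err := f.DynString(elf.DT_NEEDED); err == nil {
		for _, lib := range needed {
			fmt.Printf("needs: %s\n", lib)
		}
	}
}
```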
cryptsetup also requires the GCC runtime library libgcc_s.so.1 to be present at runtime, and will abort with an error message about not being able to call pthread_cancel(3) if it is unavailable.
To print log messages in the correct time zone, we copy /etc/localtime from the host into the cpio archive.
I currently have no desire to make minitrd available outside of distri. While the technical challenges (such as extending the generator to not rely on distri’s hermetic packages) are surmountable, I don’t want to support people’s initramfs remotely.
Also, I think that people’s efforts should in general be spent on rallying behind dracut and making it work faster, thereby benefiting all Linux distributions that use dracut (increasingly more). With minitrd, I have demonstrated that significant speed-ups are achievable.
It was interesting to dive into how an initramfs really works. I had been working with the concept for many years, from small tasks such as “debug why the encrypted root file system is not unlocked” to more complicated tasks such as “set up a root file system on DRBD for a high-availability setup”. But even with that sort of experience, I didn’t know all the details, until I was forced to implement every little thing.
As I suspected going into this exercise, dracut is much slower than it needs to be. Re-implementing its generation stage in a modern language instead of shell helps a lot.

Of course, my minitrd does a bit less than dracut, but not drastically so. The overall architecture is the same.
I hope my effort helps with two things:
As a teaching implementation: instead of wading through the various components that make up a modern initramfs (udev, systemd, various shell scripts, …), people can learn about how an initramfs works in a single place.
I hope the significant time difference motivates people to improve dracut.
Before writing any Go code, I did some manual prototyping. Learning how other people prototype is often immensely useful to me, so I’m sharing my notes here.
First, I copied all kernel modules and a statically built busybox binary:
% mkdir -p lib/modules/5.4.6
% cp -Lr /ro/lib/modules/5.4.6/* lib/modules/5.4.6/
% cp ~/busybox-1.22.0-amd64/busybox sh
To generate an initramfs from the current directory, I used:
% find . | cpio -o -H newc | pigz > /tmp/initrd
In distri’s Makefile, I append these flags to the QEMU invocation:
-kernel /tmp/kernel \
-initrd /tmp/initrd \
-append "root=/dev/mapper/cryptroot1 rdinit=/sh ro console=ttyS0,115200 rd.luks=1 rd.luks.uuid=63051f8a-54b9-4996-b94f-3cf105af2900 rd.luks.name=63051f8a-54b9-4996-b94f-3cf105af2900=cryptroot1 rd.vconsole.keymap=neo rd.vconsole.font=latarcyrheb-sun32 init=/init systemd.setenv=PATH=/bin rw vga=836"
The vga= mode parameter is required for loading the font latarcyrheb-sun32.
Once in the busybox shell, I manually prepared the required mount points and kernel modules:
ln -s sh mount
ln -s sh lsmod
mkdir /proc /sys /run /mnt
mount -t proc proc /proc
mount -t sysfs sys /sys
mount -t devtmpfs dev /dev
modprobe virtio_pci
modprobe virtio_scsi
As a next step, I copied cryptsetup and its dependencies into the initramfs directory:
% for f in /ro/cryptsetup-amd64-2.0.4-6/lib/*; do full=$(readlink -f $f); rel=$(echo $full | sed 's,^/,,g'); mkdir -p $(dirname $rel); install $full $rel; done
% ln -s ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2
% cp /ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so
% cp -r /ro/cryptsetup-amd64-2.0.4-6/lib ro/cryptsetup-amd64-2.0.4-6/
% mkdir -p ro/gcc-libs-amd64-8.2.0-3/out/lib64/
% cp /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1
% ln -s /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/cryptsetup-amd64-2.0.4-6/lib
% cp -r /ro/lvm2-amd64-2.03.00-6/lib ro/lvm2-amd64-2.03.00-6/
In busybox, I used the following commands to unlock the root file system:
modprobe algif_skcipher
./cryptsetup luksOpen /dev/sda4 cryptroot1
mount /dev/dm-0 /mnt
See the Conclusion for a summary if you’re impatient :-)
Over the last few months, I have been developing a new index format for Debian Code Search. This required a lot of careful refactoring, re-implementation, debug tool creation and debugging.
Multiple factors motivated my work on a new index format:
The existing index format has a 2G size limit, into which we have bumped a few times, requiring manual intervention to keep the system running.
Debugging the existing system required creating ad-hoc debugging tools, which made debugging sessions unnecessarily lengthy and painful.
I wanted to check whether switching to a different integer compression format would improve performance (it does not).
I wanted to check whether storing positions with the posting lists would improve performance of identifier queries (= queries which are not using any regular expression features), which make up 78.2% of all Debian Code Search queries (it does).
I figured building a new index from scratch was the easiest approach, compared to refactoring the existing index to increase the size limit (point ①).
I also figured it would be a good idea to develop the debugging tool in lock step with the index format so that I can be sure the tool works and is useful (point ②).
As a quick refresher, search engines typically store document IDs (representing source code files, in our case) in an ordered list (“posting list”). It usually makes sense to apply at least a rudimentary level of compression: our existing system used variable integer encoding.
TurboPFor, the self-proclaimed “Fastest Integer Compression” library, combines an advanced on-disk format with a carefully tuned SIMD implementation to reach better speeds (in micro benchmarks) at less disk usage than Russ Cox’s varint implementation in github.com/google/codesearch.
If you are curious about its inner workings, check out my “TurboPFor: an analysis”.
Applied on the Debian Code Search index, TurboPFor indeed compresses integers better:
Switching to TurboPFor (via cgo) for storing and reading the index results in a slight speed-up of a dcs replay benchmark, which is more pronounced the more i/o is required.
Overall, TurboPFor is an all-around improvement in efficiency, albeit with a high cost in implementation complexity.
This section builds on the previous section: all figures come from the TurboPFor index, which can optionally support positions.
Conceptually, we’re going from:
type docid uint32
type index map[trigram][]docid
…to:
type occurrence struct {
	doc docid
	pos uint32 // byte offset in doc
}
type index map[trigram][]occurrence
The resulting index consumes more disk space, but can be queried faster:
We can do fewer queries: instead of reading all the posting lists for all the trigrams, we can read the posting lists for the query’s first and last trigram only.
This is one of the tricks described in the paper “AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures” (PDF), and goes a long way without incurring the complexity, computational cost and additional disk usage of calculating algebraic signatures.
Verifying that the delta between the last and first position matches the length of the query term significantly reduces the number of files to read (lower false positive rate); see the sketch after this list.
The matching phase is quicker: instead of locating the query term in the file, we only need to compare a few bytes at a known offset for equality.
More data is read sequentially (from the index), which is faster.
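To make the position check concrete, here is a hedged Go sketch of the verification step for an identifier query, reusing the occurrence type from above (the posting-list decoding is omitted):

```go
// candidates returns the occurrences where the query’s first and last
// trigram appear in the same document at the right distance from one
// another. first and last are the decoded posting lists (with
// positions) for the query’s first and last trigram.
func candidates(query string, first, last []occurrence) []occurrence {
	// The last trigram starts len(query)-3 bytes after the first one.
	want := uint32(len(query) - 3)
	byKey := make(map[occurrence]bool, len(last))
	for _, o := range last {
		byKey[o] = true
	}
	var out []occurrence
	for _, o := range first {
		if byKey[occurrence{doc: o.doc, pos: o.pos + want}] {
			out = append(out, o)
		}
	}
	return out
}
```

Only the files listed in the result still need to be checked, and only at known byte offsets.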
A positional index consumes significantly more disk space, but not so much as to pose a challenge: a Hetzner EX61-NVME dedicated server (≈ 64 €/month) provides 1 TB worth of fast NVMe flash storage.
The idea behind the positional index (posrel) is to not store a (doc,pos) tuple on disk, but to store positions, accompanied by a stream of doc/pos relationship bits: 1 means this position belongs to the next document, 0 means this position belongs to the current document.

This is an easy way of saving some space without modifying the TurboPFor on-disk format: the posrel technique reduces the index size to about ¾.
With the increase in size, the Linux page cache hit ratio will be lower for the positional index, i.e. more data will need to be fetched from disk for querying the index.
As long as the disk can deliver data as fast as you can decompress posting lists, this only translates into one disk seek’s worth of additional latency. This is the case with modern NVMe disks that deliver thousands of MB/s, e.g. the Samsung 960 Pro (used in Hetzner’s aforementioned EX61-NVME server).
The values were measured by running dcs du -h /srv/dcs/shard*/full without and with the -pos argument.
A positional index requires fewer queries: reading only the first and last trigram’s posting lists and positions is sufficient to achieve a lower (!) false positive rate than evaluating all trigrams’ posting lists in a non-positional index.
As a consequence, fewer files need to be read, resulting in fewer bytes required to read from disk overall.
As an additional bonus, in a positional index, more data is read sequentially (index), which is faster than random i/o, regardless of the underlying disk.
The values were measured by running iostat -d 25 just before running bench.zsh on an otherwise idle system.
Even though the positional index is larger and requires more data to be read at query time (see above), thanks to the C TurboPFor library, the 2 queries on a positional index are roughly as fast as the n queries on a non-positional index (≈4s instead of ≈3s).
This is more than made up for by the combined i/o matching stage, which shrinks from ≈18.5s (7.1s i/o + 11.4s matching) to ≈1.3s.
Note that identifier query i/o was sped up not just by needing to read fewer bytes, but also by only having to verify bytes at a known offset instead of needing to locate the identifier within the file.
The new index format is overall slightly more efficient. This disk space efficiency allows us to introduce a positional index section for the first time.
Most Debian Code Search queries are positional queries (78.2%) and will be answered much quicker by leveraging the positions.
Bottom line: it is beneficial to use a positional index on disk over a non-positional index in RAM.
This article focuses on the package format and its advantages, but there is more to distri, which I will cover in upcoming blog posts.
I was a Debian Developer for the 7 years from 2012 to 2019, but using the distribution often left me frustrated, ultimately resulting in me winding down my Debian work.
Frequently, I was noticing a large gap between the actual speed of an operation (e.g. doing an update) and the possible speed based on back of the envelope calculations. I wrote more about this in my blog post “Package managers are slow”.
To me, this observation means that either there is potential to optimize the package manager itself (e.g. apt), or what the system does is just too complex. While I remember seeing some low-hanging fruit¹, through my work on distri, I wanted to explore whether all the complexity we currently have in Linux distributions such as Debian or Fedora is inherent to the problem space.
I have completed enough of the experiment to conclude that the complexity is not inherent: I can build a Linux distribution for general-enough purposes which is much less complex than existing ones.
① Those were low-hanging fruit from a user perspective. I’m not saying that fixing them is easy in the technical sense; I know too little about apt’s code base to make such a statement.
One key idea is to switch from using archives to using images for package contents. Common package managers such as dpkg(1) use tar(1) archives with various compression algorithms.
distri uses SquashFS images, a comparatively simple file system image format that I happen to be familiar with from my work on the gokrazy Raspberry Pi 3 Go platform.
This idea is not novel: AppImage and snappy also use images, but only for individual, self-contained applications. distri however uses images for distribution packages with dependencies. In particular, there is no duplication of shared libraries in distri.
A nice side effect of using read-only image files is that applications are immutable and can hence not be broken by accidental (or malicious!) modification.
Package contents are made available under a fully qualified path. E.g., all files provided by package zsh-amd64-5.6.2-3 are available under /ro/zsh-amd64-5.6.2-3. The mountpoint /ro stands for read-only, which is short yet descriptive.
Perhaps surprisingly, building software with custom prefix values of e.g. /ro/zsh-amd64-5.6.2-3 is widely supported, thanks to:

- Linux distributions, which build software with prefix set to /usr, whereas FreeBSD (and the autotools default) builds with prefix set to /usr/local.
- Enthusiast users in corporate or research environments, who install software into their home directories.
Because using a custom prefix is a common scenario, upstream awareness for prefix-correctness is generally high, and the rarely required patch will be quickly accepted.
Software packages often exchange data by placing or locating files in well-known directories. Here are just a few examples:
- gcc(1) locates the libusb(3) headers via /usr/include.
- man(1) locates the nginx(1) manpage via /usr/share/man.
- zsh(1) locates executable programs via PATH components such as /bin.
In distri, these locations are called exchange directories and are provided via FUSE in /ro.
Exchange directories come in two different flavors:
- global. The exchange directory, e.g. /ro/share, provides the union of the share subdirectory of all packages in the package store. Global exchange directories are largely used for compatibility; see below.
- per-package. Useful for tight coupling: e.g. irssi(1) does not provide any ABI guarantees, so plugins such as irssi-robustirc can declare that they want e.g. /ro/irssi-amd64-1.1.1-1/out/lib/irssi/modules to be a per-package exchange directory and contain files from their lib/irssi/modules.
Programs which use exchange directories sometimes use search paths to access multiple exchange directories. In fact, the examples above were taken from gcc(1)’s INCLUDEPATH, man(1)’s MANPATH and zsh(1)’s PATH. These are prominent ones, but more examples are easy to find: zsh(1) loads completion functions from its FPATH.
Some search path values are derived from --datadir=/ro/share and require no further attention, but others might derive from e.g. --prefix=/ro/zsh-amd64-5.6.2-3/out and need to be pointed to an exchange directory via a specific command line flag.
Global exchange directories are used to make distri provide enough of the Filesystem Hierarchy Standard (FHS) that third-party software largely just works. This includes a C development environment.
I successfully ran a few programs from their binary packages such as Google Chrome, Spotify, or Microsoft’s Visual Studio Code.
I previously wrote about how Linux distribution package managers are too slow.
distri’s package manager is extremely fast. Its main bottleneck is typically the network link, even on high-speed links (I tested with a 100 Gbps link).
Its speed comes largely from an architecture which allows the package manager to do less work. Specifically:
Package images can be added atomically to the package store, so we can safely skip fsync(2). Corruption will be cleaned up automatically, and durability is not important: if an interactive installation is interrupted, the user can just repeat it, as it will be fresh on their mind. (A sketch of this atomic, fsync-free installation follows below.)
Because all packages are co-installable thanks to separate hierarchies, there are no conflicts at the package store level, and no dependency resolution (an optimization problem requiring SAT solving) is required at all.

In exchange directories, we resolve conflicts by selecting the package with the highest monotonically increasing distri revision number.
distri proves that we can build a useful Linux distribution entirely without hooks and triggers. Not having to serialize hook execution allows us to download packages into the package store with maximum concurrency.
Because we are using images instead of archives, we do not need to unpack anything. This means installing a package is really just writing its package image and metadata to the package store. Sequential writes are typically the fastest kind of storage usage pattern.
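Here is a hedged Go sketch of what atomic, fsync-free installation into the package store can look like (rename(2) within one file system is atomic; paths are illustrative):

```go
package main

import (
	"io"
	"os"
	"path/filepath"
	"strings"
)

// installImage atomically adds a package image to the package store.
// A crash can leave a stray *.tmp file, but never a half-visible
// package; stray files can be cleaned up on the next run.
func installImage(store, pkg string, r io.Reader) error {
	tmp, err := os.CreateTemp(store, pkg+".*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op after a successful rename
	if _, err := io.Copy(tmp, r); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// No fsync: durability is not important here (see above), but the
	// rename makes the package visible only once it is complete.
	return os.Rename(tmp.Name(), filepath.Join(store, pkg+".squashfs"))
}

func main() {
	if err := installImage("/roimg", "zsh-amd64-5.6.2-3", strings.NewReader("…image bytes…")); err != nil {
		panic(err)
	}
}
```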
Fast installation also makes other use-cases more bearable, such as creating disk images, be it for testing them in qemu(1), booting them on real hardware from a USB drive, or for cloud providers such as Google Cloud.
Contrary to how distribution package builders are usually implemented, the distri package builder does not actually install any packages into the build environment.
Instead, distri makes available a filtered view of the package store (only declared dependencies are available) at /ro in the build environment.
This means that even for large dependency trees, setting up a build environment happens in a fraction of a second! Such a low latency really makes a difference in how comfortable it is to iterate on distribution packages.
In distri, package images are installed from a remote package store into the local system package store /roimg, which backs the /ro mount.
A package store is implemented as a directory of package images and their associated metadata files.
You can easily make a package store available by using distri export.
.
To provide a mirror for your local network, you can periodically run distri update from the package store you want to mirror, and then distri export your local copy. Special tooling (e.g. debmirror in Debian) is not required because distri install is atomic (and update uses install).
Producing derivatives is easy: just add your own packages to a copy of the package store.
The package store is intentionally kept simple to manage and distribute. Its files could be exchanged via peer-to-peer file systems, or synchronized from an offline medium.
distri works well enough to demonstrate the ideas explained above. I have branched this state into branch jackherer, distri’s first release code name. This way, I can keep experimenting in the distri repository without breaking your installation.
From the branch contents, our autobuilder creates:
- a package repository. Installations can pick up new packages with distri update.
The project website can be found at https://distr1.org. The website is just the README for now, but we can improve that later.
The repository can be found at https://github.com/distr1/distri
Right now, distri is mainly a vehicle for my spare-time Linux distribution research. I don’t recommend anyone use distri for anything but research, and there are no medium-term plans of that changing. At the very least, please contact me before basing anything serious on distri so that we can talk about limitations and expectations.
I expect the distri project to live for as long as I have blog posts to publish, and we’ll see what happens afterwards. Note that this is a hobby for me: I will continue to explore, at my own pace, parts that I find interesting.
My hope is that established distributions might get a useful idea or two from distri.
I don’t want to make this post too long, but there is much more!
Please subscribe to the following URL in your feed reader to get all posts about distri:
https://michael.stapelberg.ch/posts/tags/distri/feed.xml
Next in my queue are articles about hermetic packages and good package maintainer experience (including declarative packaging).
I’d love to discuss these ideas in case you’re interested!
Please send feedback to the distri mailing list so that everyone can participate!
Pending feedback: Allan McRae pointed out that I should be more precise with my terminology: strictly speaking, distributions are slow, and package managers are only part of the puzzle. I’ll try to be clearer in future revisions/posts.
I measured how long the package managers of the most popular Linux distributions take to install small and large packages (the ack(1p) source code search Perl script and qemu, respectively).
Where required, my measurements include metadata updates such as transferring an up-to-date package list. For me, requiring a metadata update is the more common case, particularly on live systems or within Docker containers.
All measurements were taken on an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
running Docker 20.10.8 on Linux 5.13.10, backed by a Corsair Force MP600 NVMe
drive boasting many hundreds of MB/s write performance. The machine is located
in Zürich and connected to the Internet with a 1 Gigabit fiber connection, so
the expected top download speed is ≈115 MB/s.
See Appendix D for details on the measurement method and command outputs.
Keep in mind that these are one-time measurements. They should be indicative of actual performance, but your experience may vary.
Small package (ack):

| distribution | package manager | data | wall-clock time | rate |
|---|---|---|---|---|
| Fedora | dnf | 84 MB | 25s | 3.4 MB/s |
| NixOS | Nix | 15 MB | 7s | 2.3 MB/s |
| Debian | apt | 16 MB | 3s | 4.9 MB/s |
| Arch Linux | pacman | 25 MB | 1s | 18.4 MB/s |
| Alpine | apk | 10 MB | 1s | 11.9 MB/s |
Large package (qemu):

| distribution | package manager | data | wall-clock time | rate |
|---|---|---|---|---|
| Fedora | dnf | 350 MB | 56s | 6.25 MB/s |
| Debian | apt | 256 MB | 39s | 6.5 MB/s |
| NixOS | Nix | 251 MB | 36s | 6.8 MB/s |
| Arch Linux | pacman | 128 MB | 10s | 12.1 MB/s |
| Alpine | apk | 34 MB | 1.8s | 18.6 MB/s |
(Looking for older measurements? See Appendix B (2019) or Appendix C (2020)).
The difference between the slowest and fastest package managers is 30x!
How can Alpine’s apk and Arch Linux’s pacman be an order of magnitude faster than the rest? They are doing a lot less than the others, and more efficiently, too.
For example, Fedora transfers a lot more data than others because its main package list alone is 60 MB (compressed!). Compare that with Alpine’s 734 KB APKINDEX.tar.gz.
Of course the extra metadata which Fedora provides helps some use cases; otherwise they hopefully would have removed it altogether. But the amount of metadata seems excessive for the use case of installing a single package, which I consider the main use case of an interactive package manager.
I expect any modern Linux distribution to only transfer absolutely required data to complete my task.
Because they need to sequence executing arbitrary package maintainer-provided code (hooks and triggers), all tested package managers need to install packages sequentially (one after the other) instead of concurrently (all at the same time).
In my blog post “Can we do without hooks and triggers?”, I outline that hooks and triggers are not strictly necessary to build a working Linux distribution.
Strictly speaking, the only required feature of a package manager is to make available the package contents so that the package can be used: a program can be started, a kernel module can be loaded, etc.
By only implementing what’s needed for this feature, and nothing more, a package manager could likely beat apk’s performance. It could, for example, use compression which is light on CPU, as networks are fast (like apk does).

Here’s a table outlining how the various package managers listed on Wikipedia’s list of software package management systems fare:
| name | scope | package file format | hooks/triggers |
|---|---|---|---|
| AppImage | apps | image: ISO9660, SquashFS | no |
| snappy | apps | image: SquashFS | yes: hooks |
| FlatPak | apps | archive: OSTree | no |
| 0install | apps | archive: tar.bz2 | no |
| nix, guix | distro | archive: nar.{bz2,xz} | activation script |
| dpkg | distro | archive: tar.{gz,xz,bz2} in ar(1) | yes |
| rpm | distro | archive: cpio.{bz2,lz,xz} | scriptlets |
| pacman | distro | archive: tar.xz | install |
| slackware | distro | archive: tar.{gz,xz} | yes: doinst.sh |
| apk | distro | archive: tar.gz | yes: .post-install |
| Entropy | distro | archive: tar.bz2 | yes |
| ipkg, opkg | distro | archive: tar{,.gz} | yes |
As per the current landscape, there is no distribution-scoped package manager which uses images and leaves out hooks and triggers, not even in smaller Linux distributions.
I think that space is really interesting, as it uses a minimal design to achieve significant real-world speed-ups.
I have explored this idea in much more detail, and am happy to talk more about it in my post distri: a Linux distribution to research fast package management.
There are a couple of recent developments going into the same direction:
Here are the command outputs for installing the small package (ack) on each distribution:
% docker run --security-opt=seccomp:unconfined -t -i fedora /bin/bash
[root@62d3cae2e2f9 /]# time dnf install -y ack
Fedora 35 - x86_64 25 MB/s | 61 MB
Fedora 35 openh264 (From Cisco) - x86_64 3.5 kB/s | 2.5 kB
Fedora Modular 35 - x86_64 5.0 MB/s | 2.6 MB
Fedora 35 - x86_64 - Updates 6.0 MB/s | 9.3 MB
Fedora Modular 35 - x86_64 - Updates 4.1 MB/s | 3.3 MB
Dependencies resolved.
[…]
real 0m24.882s
user 0m17.377s
sys 0m0.835s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.ack'
unpacking channels...
created 1 symlinks in user environment
installing 'perl5.34.0-ack-3.5.0'
these paths will be fetched (15.78 MiB download, 86.82 MiB unpacked):
/nix/store/11xpmmwy95396nkhih3qc3814lqhqb8f-libunistring-0.9.10
/nix/store/1h18nl3gisw89znbzbmnxhd7jk20xlff-perl5.34.0-File-Next-1.18
/nix/store/1mpxs3109cjrbhmi3q1vmvc0djz102pl-libidn2-2.3.2
/nix/store/jr35z7n8jbv9q89my50vhyndqd3y541i-attr-2.5.1
/nix/store/krc4xirbvjnff8m62snqdbayg46z5l5b-acl-2.3.1
/nix/store/mij848h2x5wiqkwhg027byvmf9x3gx7y-glibc-2.33-50
/nix/store/wq38iqzdh40dzfsndb927kh7y5bqh457-perl5.34.0-ack-3.5.0-man
/nix/store/xyn0240zrpprnspg3n0fi8c8aw5bq0mr-coreutils-8.32
/nix/store/y8r9ymbz59yjm1bwr3fdvd23jvcb2bzj-perl5.34.0-ack-3.5.0
/nix/store/ypr273yvmr07n5n1w1gbcqnhpw7lbbvz-perl-5.34.0
copying path '/nix/store/wq38iqzdh40dzfsndb927kh7y5bqh457-perl5.34.0-ack-3.5.0-man' from 'https://cache.nixos.org'...
copying path '/nix/store/11xpmmwy95396nkhih3qc3814lqhqb8f-libunistring-0.9.10' from 'https://cache.nixos.org'...
copying path '/nix/store/1h18nl3gisw89znbzbmnxhd7jk20xlff-perl5.34.0-File-Next-1.18' from 'https://cache.nixos.org'...
copying path '/nix/store/1mpxs3109cjrbhmi3q1vmvc0djz102pl-libidn2-2.3.2' from 'https://cache.nixos.org'...
copying path '/nix/store/mij848h2x5wiqkwhg027byvmf9x3gx7y-glibc-2.33-50' from 'https://cache.nixos.org'...
copying path '/nix/store/jr35z7n8jbv9q89my50vhyndqd3y541i-attr-2.5.1' from 'https://cache.nixos.org'...
copying path '/nix/store/krc4xirbvjnff8m62snqdbayg46z5l5b-acl-2.3.1' from 'https://cache.nixos.org'...
copying path '/nix/store/xyn0240zrpprnspg3n0fi8c8aw5bq0mr-coreutils-8.32' from 'https://cache.nixos.org'...
copying path '/nix/store/ypr273yvmr07n5n1w1gbcqnhpw7lbbvz-perl-5.34.0' from 'https://cache.nixos.org'...
copying path '/nix/store/y8r9ymbz59yjm1bwr3fdvd23jvcb2bzj-perl5.34.0-ack-3.5.0' from 'https://cache.nixos.org'...
building '/nix/store/pwlxhy7kry56z6593rh397fc49x5avlw-user-environment.drv'...
created 49 symlinks in user environment
real 0m 6.82s
user 0m 3.47s
sys 0m 2.11s
% docker run -t -i debian:sid
root@40a3899b1f2f:/# time (apt update && apt install -y ack-grep)
Get:1 http://deb.debian.org/debian sid InRelease [165 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8800 kB]
Fetched 8965 kB in 1s (9495 kB/s)
[…]
The following NEW packages will be installed:
ack libfile-next-perl libgdbm-compat4 libgdbm6 libperl5.32 netbase perl perl-modules-5.32
0 upgraded, 8 newly installed, 0 to remove and 24 not upgraded.
Need to get 7479 kB of archives.
After this operation, 47.7 MB of additional disk space will be used.
[…]
real 0m3.260s
user 0m2.463s
sys 0m0.352s
% docker run -t -i archlinux:base
[root@9f6672688a64 /]# time (pacman -Sy && pacman -S --noconfirm ack)
:: Synchronizing package databases...
core 138.8 KiB 1542 KiB/s
extra 1569.8 KiB 26.9 MiB/s
community 5.8 MiB 92.2 MiB/s
resolving dependencies...
looking for conflicting packages...
Packages (5) db-5.3.28-5 gdbm-1.22-1 perl-5.34.0-2 perl-file-next-1.18-3 ack-3.5.0-2
Total Download Size: 16.77 MiB
Total Installed Size: 66.21 MiB
[…]
real 0m1.403s
user 0m0.484s
sys 0m0.211s
% docker run -t -i alpine
# time apk add ack
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
(1/4) Installing libbz2 (1.0.8-r1)
(2/4) Installing perl (5.32.1-r0)
(3/4) Installing perl-file-next (1.18-r2)
(4/4) Installing ack (3.5.0-r1)
Executing busybox-1.33.1-r3.trigger
OK: 43 MiB in 18 packages
real 0m 0.76s
user 0m 0.27s
sys 0m 0.09s
Here are the command outputs for installing the large package (qemu) on each distribution:
% docker run -t -i fedora /bin/bash
[root@6a52ecfc3afa /]# time dnf install -y qemu
Fedora 35 - x86_64 15 MB/s | 61 MB
Fedora 35 openh264 (From Cisco) - x86_64 3.0 kB/s | 2.5 kB
Fedora Modular 35 - x86_64 5.2 MB/s | 2.6 MB
Fedora 35 - x86_64 - Updates 6.6 MB/s | 9.3 MB
Fedora Modular 35 - x86_64 - Updates 2.2 MB/s | 3.3 MB
Dependencies resolved.
[…]
Total download size: 274 M
Downloading Packages:
[…]
real 0m56.031s
user 0m31.275s
sys 0m3.868s
% docker run -t -i nixos/nix
83971cf79f7e:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.qemu'
unpacking channels...
created 1 symlinks in user environment
installing 'qemu-6.1.0'
these paths will be fetched (230.72 MiB download, 1424.84 MiB unpacked):
[…]
real 0m 36.55s
user 0m 19.83s
sys 0m 3.34s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y qemu-system-x86)
Get:1 http://deb.debian.org/debian sid InRelease [146 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8400 kB]
Fetched 8965 kB in 1s (9048 kB/s)
[…]
Fetched 247 MB in 4s (64.9 MB/s)
[…]
real 0m38.875s
user 0m21.282s
sys 0m5.298s
% docker run -t -i archlinux:base
[root@58c78bda08e8 /]# time (pacman -Sy && pacman -S --noconfirm qemu)
:: Synchronizing package databases...
core 138.7 KiB 1541 KiB/s
extra 1569.8 KiB 35.7 MiB/s
community 5.8 MiB 92.2 MiB/s
[…]
Total Download Size: 118.97 MiB
Total Installed Size: 586.68 MiB
[…]
real 0m10.542s
user 0m3.092s
sys 0m1.569s
% docker run -t -i alpine
/ # time apk add qemu-system-x86_64
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
[…]
OK: 281 MiB in 66 packages
real 0m 1.83s
user 0m 0.77s
sys 0m 0.24s
You can expand each of these:
% docker run -t -i fedora /bin/bash
[root@62d3cae2e2f9 /]# time dnf install -y ack
Fedora 32 openh264 (From Cisco) - x86_64 1.9 kB/s | 2.5 kB 00:01
Fedora Modular 32 - x86_64 6.8 MB/s | 4.9 MB 00:00
Fedora Modular 32 - x86_64 - Updates 5.6 MB/s | 3.7 MB 00:00
Fedora 32 - x86_64 - Updates 9.9 MB/s | 23 MB 00:02
Fedora 32 - x86_64 39 MB/s | 70 MB 00:01
[…]
real 0m32.898s
user 0m25.121s
sys 0m1.408s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.ack'
unpacking channels...
created 1 symlinks in user environment
installing 'perl5.32.0-ack-3.3.1'
these paths will be fetched (15.55 MiB download, 85.51 MiB unpacked):
/nix/store/34l8jdg76kmwl1nbbq84r2gka0kw6rc8-perl5.32.0-ack-3.3.1-man
/nix/store/9df65igwjmf2wbw0gbrrgair6piqjgmi-glibc-2.31
/nix/store/9fd4pjaxpjyyxvvmxy43y392l7yvcwy1-perl5.32.0-File-Next-1.18
/nix/store/czc3c1apx55s37qx4vadqhn3fhikchxi-libunistring-0.9.10
/nix/store/dj6n505iqrk7srn96a27jfp3i0zgwa1l-acl-2.2.53
/nix/store/ifayp0kvijq0n4x0bv51iqrb0yzyz77g-perl-5.32.0
/nix/store/w9wc0d31p4z93cbgxijws03j5s2c4gyf-coreutils-8.31
/nix/store/xim9l8hym4iga6d4azam4m0k0p1nw2rm-libidn2-2.3.0
/nix/store/y7i47qjmf10i1ngpnsavv88zjagypycd-attr-2.4.48
/nix/store/z45mp61h51ksxz28gds5110rf3wmqpdc-perl5.32.0-ack-3.3.1
copying path '/nix/store/34l8jdg76kmwl1nbbq84r2gka0kw6rc8-perl5.32.0-ack-3.3.1-man' from 'https://cache.nixos.org'...
copying path '/nix/store/czc3c1apx55s37qx4vadqhn3fhikchxi-libunistring-0.9.10' from 'https://cache.nixos.org'...
copying path '/nix/store/9fd4pjaxpjyyxvvmxy43y392l7yvcwy1-perl5.32.0-File-Next-1.18' from 'https://cache.nixos.org'...
copying path '/nix/store/xim9l8hym4iga6d4azam4m0k0p1nw2rm-libidn2-2.3.0' from 'https://cache.nixos.org'...
copying path '/nix/store/9df65igwjmf2wbw0gbrrgair6piqjgmi-glibc-2.31' from 'https://cache.nixos.org'...
copying path '/nix/store/y7i47qjmf10i1ngpnsavv88zjagypycd-attr-2.4.48' from 'https://cache.nixos.org'...
copying path '/nix/store/dj6n505iqrk7srn96a27jfp3i0zgwa1l-acl-2.2.53' from 'https://cache.nixos.org'...
copying path '/nix/store/w9wc0d31p4z93cbgxijws03j5s2c4gyf-coreutils-8.31' from 'https://cache.nixos.org'...
copying path '/nix/store/ifayp0kvijq0n4x0bv51iqrb0yzyz77g-perl-5.32.0' from 'https://cache.nixos.org'...
copying path '/nix/store/z45mp61h51ksxz28gds5110rf3wmqpdc-perl5.32.0-ack-3.3.1' from 'https://cache.nixos.org'...
building '/nix/store/m0rl62grplq7w7k3zqhlcz2hs99y332l-user-environment.drv'...
created 49 symlinks in user environment
real 0m 5.60s
user 0m 3.21s
sys 0m 1.66s
% docker run -t -i debian:sid
root@1996bb94a2d1:/# time (apt update && apt install -y ack-grep)
Get:1 http://deb.debian.org/debian sid InRelease [146 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8400 kB]
Fetched 8546 kB in 1s (8088 kB/s)
[…]
The following NEW packages will be installed:
ack libfile-next-perl libgdbm-compat4 libgdbm6 libperl5.30 netbase perl perl-modules-5.30
0 upgraded, 8 newly installed, 0 to remove and 23 not upgraded.
Need to get 7341 kB of archives.
After this operation, 46.7 MB of additional disk space will be used.
[…]
real 0m9.544s
user 0m2.839s
sys 0m0.775s
% docker run -t -i archlinux/base
[root@9f6672688a64 /]# time (pacman -Sy && pacman -S --noconfirm ack)
:: Synchronizing package databases...
core 130.8 KiB 1090 KiB/s 00:00
extra 1655.8 KiB 3.48 MiB/s 00:00
community 5.2 MiB 6.11 MiB/s 00:01
resolving dependencies...
looking for conflicting packages...
Packages (2) perl-file-next-1.18-2 ack-3.4.0-1
Total Download Size: 0.07 MiB
Total Installed Size: 0.19 MiB
[…]
real 0m2.936s
user 0m0.375s
sys 0m0.160s
% docker run -t -i alpine
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/4) Installing libbz2 (1.0.8-r1)
(2/4) Installing perl (5.30.3-r0)
(3/4) Installing perl-file-next (1.18-r0)
(4/4) Installing ack (3.3.1-r0)
Executing busybox-1.31.1-r16.trigger
OK: 43 MiB in 18 packages
real 0m 1.24s
user 0m 0.40s
sys 0m 0.15s
You can expand each of these:
% docker run -t -i fedora /bin/bash
[root@6a52ecfc3afa /]# time dnf install -y qemu
Fedora 32 openh264 (From Cisco) - x86_64 3.1 kB/s | 2.5 kB 00:00
Fedora Modular 32 - x86_64 6.3 MB/s | 4.9 MB 00:00
Fedora Modular 32 - x86_64 - Updates 6.0 MB/s | 3.7 MB 00:00
Fedora 32 - x86_64 - Updates 334 kB/s | 23 MB 01:10
Fedora 32 - x86_64 33 MB/s | 70 MB 00:02
[…]
Total download size: 181 M
Downloading Packages:
[…]
real 4m37.652s
user 0m38.239s
sys 0m6.321s
% docker run -t -i nixos/nix
83971cf79f7e:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.qemu'
unpacking channels...
created 1 symlinks in user environment
installing 'qemu-5.1.0'
these paths will be fetched (180.70 MiB download, 1146.92 MiB unpacked):
[…]
real 0m 33.64s
user 0m 16.96s
sys 0m 3.05s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y qemu-system-x86)
Get:1 http://deb.debian.org/debian sid InRelease [146 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8400 kB]
Fetched 8546 kB in 1s (5998 kB/s)
[…]
Fetched 216 MB in 43s (5006 kB/s)
[…]
real 1m25.375s
user 0m29.163s
sys 0m12.835s
% docker run -t -i archlinux/base
[root@58c78bda08e8 /]# time (pacman -Sy && pacman -S --noconfirm qemu)
:: Synchronizing package databases...
core 130.8 KiB 1055 KiB/s 00:00
extra 1655.8 KiB 3.70 MiB/s 00:00
community 5.2 MiB 7.89 MiB/s 00:01
[…]
Total Download Size: 135.46 MiB
Total Installed Size: 661.05 MiB
[…]
real 0m43.901s
user 0m4.980s
sys 0m2.615s
% docker run -t -i alpine
/ # time apk add qemu-system-x86_64
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
[…]
OK: 78 MiB in 95 packages
real 0m 2.43s
user 0m 0.46s
sys 0m 0.09s
You can expand each of these:
% docker run -t -i fedora /bin/bash
[root@722e6df10258 /]# time dnf install -y ack
Fedora Modular 30 - x86_64 4.4 MB/s | 2.7 MB 00:00
Fedora Modular 30 - x86_64 - Updates 3.7 MB/s | 2.4 MB 00:00
Fedora 30 - x86_64 - Updates 17 MB/s | 19 MB 00:01
Fedora 30 - x86_64 31 MB/s | 70 MB 00:02
[…]
Install 44 Packages
Total download size: 13 M
Installed size: 42 M
[…]
real 0m29.498s
user 0m22.954s
sys 0m1.085s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -i perl5.28.2-ack-2.28'
unpacking channels...
created 2 symlinks in user environment
installing 'perl5.28.2-ack-2.28'
these paths will be fetched (14.91 MiB download, 80.83 MiB unpacked):
/nix/store/57iv2vch31v8plcjrk97lcw1zbwb2n9r-perl-5.28.2
/nix/store/89gi8cbp8l5sf0m8pgynp2mh1c6pk1gk-attr-2.4.48
/nix/store/gkrpl3k6s43fkg71n0269yq3p1f0al88-perl5.28.2-ack-2.28-man
/nix/store/iykxb0bmfjmi7s53kfg6pjbfpd8jmza6-glibc-2.27
/nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31
/nix/store/svgkibi7105pm151prywndsgvmc4qvzs-acl-2.2.53
/nix/store/x4knf14z1p0ci72gl314i7vza93iy7yc-perl5.28.2-File-Next-1.16
/nix/store/zfj7ria2kwqzqj9dh91kj9kwsynxdfk0-perl5.28.2-ack-2.28
copying path '/nix/store/gkrpl3k6s43fkg71n0269yq3p1f0al88-perl5.28.2-ack-2.28-man' from 'https://cache.nixos.org'...
copying path '/nix/store/iykxb0bmfjmi7s53kfg6pjbfpd8jmza6-glibc-2.27' from 'https://cache.nixos.org'...
copying path '/nix/store/x4knf14z1p0ci72gl314i7vza93iy7yc-perl5.28.2-File-Next-1.16' from 'https://cache.nixos.org'...
copying path '/nix/store/89gi8cbp8l5sf0m8pgynp2mh1c6pk1gk-attr-2.4.48' from 'https://cache.nixos.org'...
copying path '/nix/store/svgkibi7105pm151prywndsgvmc4qvzs-acl-2.2.53' from 'https://cache.nixos.org'...
copying path '/nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31' from 'https://cache.nixos.org'...
copying path '/nix/store/57iv2vch31v8plcjrk97lcw1zbwb2n9r-perl-5.28.2' from 'https://cache.nixos.org'...
copying path '/nix/store/zfj7ria2kwqzqj9dh91kj9kwsynxdfk0-perl5.28.2-ack-2.28' from 'https://cache.nixos.org'...
building '/nix/store/q3243sjg91x1m8ipl0sj5gjzpnbgxrqw-user-environment.drv'...
created 56 symlinks in user environment
real 0m 14.02s
user 0m 8.83s
sys 0m 2.69s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y ack-grep)
Get:1 http://cdn-fastly.deb.debian.org/debian sid InRelease [233 kB]
Get:2 http://cdn-fastly.deb.debian.org/debian sid/main amd64 Packages [8270 kB]
Fetched 8502 kB in 2s (4764 kB/s)
[…]
The following NEW packages will be installed:
ack ack-grep libfile-next-perl libgdbm-compat4 libgdbm5 libperl5.26 netbase perl perl-modules-5.26
The following packages will be upgraded:
perl-base
1 upgraded, 9 newly installed, 0 to remove and 60 not upgraded.
Need to get 8238 kB of archives.
After this operation, 42.3 MB of additional disk space will be used.
[…]
real 0m9.096s
user 0m2.616s
sys 0m0.441s
% docker run -t -i archlinux/base
[root@9604e4ae2367 /]# time (pacman -Sy && pacman -S --noconfirm ack)
:: Synchronizing package databases...
core 132.2 KiB 1033K/s 00:00
extra 1629.6 KiB 2.95M/s 00:01
community 4.9 MiB 5.75M/s 00:01
[…]
Total Download Size: 0.07 MiB
Total Installed Size: 0.19 MiB
[…]
real 0m3.354s
user 0m0.224s
sys 0m0.049s
% docker run -t -i alpine
/ # time apk add ack
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
(1/4) Installing perl-file-next (1.16-r0)
(2/4) Installing libbz2 (1.0.6-r7)
(3/4) Installing perl (5.28.2-r1)
(4/4) Installing ack (3.0.0-r0)
Executing busybox-1.30.1-r2.trigger
OK: 44 MiB in 18 packages
real 0m 0.96s
user 0m 0.25s
sys 0m 0.07s
You can expand each of these:
% docker run -t -i fedora /bin/bash
[root@722e6df10258 /]# time dnf install -y qemu
Fedora Modular 30 - x86_64 3.1 MB/s | 2.7 MB 00:00
Fedora Modular 30 - x86_64 - Updates 2.7 MB/s | 2.4 MB 00:00
Fedora 30 - x86_64 - Updates 20 MB/s | 19 MB 00:00
Fedora 30 - x86_64 31 MB/s | 70 MB 00:02
[…]
Install 262 Packages
Upgrade 4 Packages
Total download size: 172 M
[…]
real 1m7.877s
user 0m44.237s
sys 0m3.258s
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -i qemu-4.0.0'
unpacking channels...
created 2 symlinks in user environment
installing 'qemu-4.0.0'
these paths will be fetched (262.18 MiB download, 1364.54 MiB unpacked):
[…]
real 0m 38.49s
user 0m 26.52s
sys 0m 4.43s
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y qemu-system-x86)
Get:1 http://cdn-fastly.deb.debian.org/debian sid InRelease [149 kB]
Get:2 http://cdn-fastly.deb.debian.org/debian sid/main amd64 Packages [8426 kB]
Fetched 8574 kB in 1s (6716 kB/s)
[…]
Fetched 151 MB in 2s (64.6 MB/s)
[…]
real 0m51.583s
user 0m15.671s
sys 0m3.732s
% docker run -t -i archlinux/base
[root@9604e4ae2367 /]# time (pacman -Sy && pacman -S --noconfirm qemu)
:: Synchronizing package databases...
core 132.2 KiB 751K/s 00:00
extra 1629.6 KiB 3.04M/s 00:01
community 4.9 MiB 6.16M/s 00:01
[…]
Total Download Size: 123.20 MiB
Total Installed Size: 587.84 MiB
[…]
real 1m2.475s
user 0m9.272s
sys 0m2.458s
% docker run -t -i alpine
/ # time apk add qemu-system-x86_64
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
[…]
OK: 78 MiB in 95 packages
real 0m 2.43s
user 0m 0.46s
sys 0m 0.09s
[…] (e.g. postgres), or generating/updating cache files.
Triggers are a kind of hook which runs when other packages are installed. For example, on Debian, the man(1) package comes with a trigger which regenerates the search database index whenever any package installs a manpage. When, for example, the nginx(8) package is installed, a trigger provided by the man(1) package runs.
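For the curious: on Debian, such a trigger is declared in the owning package’s debian/triggers file. The declaration for the manpage case looks roughly like this (see deb-triggers(5); the exact contents may differ between versions):
interest-noawait /usr/share/man
Any package shipping files under /usr/share/man then activates the trigger, and dpkg runs the trigger-owning package’s maintainer script in trigger mode.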
Over the past few decades, Open Source software has become more and more uniform: instead of each piece of software defining its own rules, a small number of build systems are now widely adopted.
Hence, I think it makes sense to revisit whether offering extension via hooks and triggers is a net win or net loss.
Package managers can commonly make only very few assumptions about what hooks do, what preconditions they require, and which conflicts might be caused by running multiple packages’ hooks concurrently.
Hence, package managers cannot install packages concurrently: at least the hook/trigger part of the installation needs to happen in sequence.
While it seems technically feasible to retrofit package manager hooks with concurrency primitives such as locks for mutual exclusion between different hook processes, the required overhaul of all hooks¹ seems like such a daunting task that it might be better to just get rid of the hooks instead. Only deleting code frees you from the burden of maintenance, automated testing and debugging.
① In Debian, there are 8620 non-generated maintainer scripts, as reported by find shard*/src/*/debian -regex ".*\(pre\|post\)\(inst\|rm\)$" on a Debian Code Search instance.
Personally, I never use the apropos(1) command, so I don’t appreciate the man(1) package’s trigger which updates the database used by apropos(1). The process takes a long time and, because hooks and triggers must be executed serially (see previous section), blocks my installation or update.
When I tell people this, they are often surprised to learn about the existence of the apropos(1) command. I suggest adopting an opt-in model.
Hooks run when packages are installed. If a package’s contents are not used between two updates, running the hook in the first update could have been skipped. Running the hook lazily when the package contents are used reduces unnecessary work.
As a welcome side-effect, lazy hook evaluation automatically makes the hook work in operating system images, such as live USB thumb drives or SD card images for the Raspberry Pi. Such images must not ship the same crypto keys (e.g. OpenSSH host keys) to all machines, but instead generate a different key on each machine.
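As a sketch of what lazy evaluation could look like for the host key case: instead of an installation-time hook, the service could start via a small compiled wrapper that generates missing keys on first start. This is an illustration, not an existing Debian mechanism; the paths are standard OpenSSH paths, and ssh-keygen -A really does create all missing default host keys.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Lazily do what would otherwise be an installation-time hook:
	// generate host keys the first time sshd actually starts.
	if _, err := os.Stat("/etc/ssh/ssh_host_ed25519_key"); os.IsNotExist(err) {
		// ssh-keygen -A creates all missing default host keys.
		if err := exec.Command("ssh-keygen", "-A").Run(); err != nil {
			os.Exit(1)
		}
	}
	// Replace this process with the real daemon.
	syscall.Exec("/usr/sbin/sshd", os.Args, os.Environ())
}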
Why do users keep packages installed that they don’t use? It’s extra work to remember and clean up those packages after use. Plus, users might not realize or value that having fewer packages installed has benefits such as faster updates.
I can also imagine that there are people for whom the cost of re-installing packages incentivizes them to just keep packages installed—you never know when you might need the program again…
While working on hermetic packages (more on that in another blog post), where the contained programs are started with modified environment variables (e.g. PATH) via a wrapper bash script, I noticed that the overhead of those wrapper bash scripts quickly becomes significant. For example, when using the excellent magit interface for Git in Emacs, I encountered second-long delays² when using hermetic packages compared to standard packages. Re-implementing wrappers in a compiled language provided a significant speed-up.
Similarly, getting rid of an extension point which mandates using shell scripts allows us to build an efficient and fast implementation of a predefined set of primitives, where you can reason about their effects and interactions.
② magit needs to run git a few times for displaying the full status, so small overhead quickly adds up.
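For illustration, a compiled wrapper doing the same job as such a bash script can be tiny. The paths below are made up; the point is that the environment setup and exec happen without spawning a shell:
package main

import (
	"os"
	"syscall"
)

func main() {
	// Same job as the wrapper bash script, but without a shell:
	// adjust the environment, then exec the real binary.
	os.Setenv("PATH", "/ro/git/bin:"+os.Getenv("PATH"))
	syscall.Exec("/ro/git/bin/git.real", os.Args, os.Environ())
}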
Hooks are an escape hatch for distribution maintainers to express anything which their packaging system cannot express.
Distributions should only rely on well-established interfaces such as autoconf’s classic ./configure && make && make install (including commonly used flags) to build a distribution package. Integrating upstream software into a distribution should not require custom hooks. For example, instead of requiring a hook which updates a cache of schema files, the library used to interact with those files should transparently (re-)generate the cache or fall back to a slower code path.
Distribution maintainers are hard to come by, so we should value their time. In particular, there is a 1:n relationship of packages to distribution package maintainers (software is typically available in multiple Linux distributions), so it makes sense to spend the work in the 1 and have the n benefit.
If we want to get rid of hooks, we need another mechanism to achieve what we currently achieve with hooks.
If the hook is not specific to the package, it can be moved to the package manager. The desired system state should either be derived from the package contents (e.g. required system users can be discovered from systemd service files) or declaratively specified in the package build instructions—more on that in another blog post. This turns hooks (arbitrary code) into configuration, which allows the package manager to collapse and sequence the required state changes. E.g., when 5 packages are installed which each need a new system user, the package manager could update /etc/passwd just once.
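As a sketch of that collapsing step (the metadata schema here is hypothetical), a package manager could gather all declared system users of one installation transaction and apply them in a single pass:
package main

import (
	"fmt"
	"sort"
)

// pkg models declaratively specified package metadata (hypothetical schema).
type pkg struct {
	name        string
	systemUsers []string
}

// collectUsers deduplicates the system users required by all packages of one
// installation transaction, so that /etc/passwd is updated a single time.
func collectUsers(pkgs []pkg) []string {
	seen := make(map[string]bool)
	var users []string
	for _, p := range pkgs {
		for _, u := range p.systemUsers {
			if !seen[u] {
				seen[u] = true
				users = append(users, u)
			}
		}
	}
	sort.Strings(users)
	return users
}

func main() {
	pkgs := []pkg{
		{name: "postgresql", systemUsers: []string{"postgres"}},
		{name: "nginx", systemUsers: []string{"www-data"}},
	}
	// One batched state change instead of one hook per package:
	fmt.Println(collectUsers(pkgs))
}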
If the hook is specific to the package, it should be moved into the package contents. This typically means moving the functionality into the program start (or the systemd service file if we are talking about a daemon). If (while?) upstream is not convinced, you can either wrap the program or patch it. Note that this case is relatively rare: I have worked with hundreds of packages and the only package-specific functionality I came across was automatically generating host keys before starting OpenSSH’s sshd(8)³.
There is one exception where moving the hook doesn’t work: packages which modify state outside of the system, such as bootloaders or kernel images.
③ Even that can be moved out of a package-specific hook, as Fedora demonstrates.
Global state modifications performed as part of package installation today use hooks, an overly expressive extension mechanism.
Instead, all modifications should be driven by configuration. This is feasible because there are only a few different kinds of desired state modifications. This makes it possible for package managers to optimize package installation.
]]>When building software from source, most programming languages and build systems support conditional compilation: different parts of the source code are compiled based on certain conditions.
An optional dependency is conditional compilation hooked up directly to a knob (e.g. command line flag, configuration file, …), with the effect that the software can now be built without an otherwise required dependency.
Let’s walk through a few issues with optional dependencies.
Software is usually not built by end users, but by packagers, at least when we are talking about Open Source.
Hence, end users don’t see the knob for the optional dependency, they are just presented with the fait accompli: their version of the software behaves differently than other versions of the same software.
Depending on the kind of software, this situation can be made obvious to the user: for example, if the optional dependency is needed to print documents, the program can produce an appropriate error message when the user tries to print a document.
Sometimes, this isn’t possible: when i3 introduced an optional dependency on cairo and pangocairo, the behavior itself (rendering window titles) worked in all configurations, but non-ASCII characters might break depending on whether i3 was compiled with cairo.
For users, it is frustrating to only discover in conversation that a program has a feature that the user is interested in, but it’s not available on their computer. For support, this situation can be hard to detect, and even harder to resolve to the user’s satisfaction.
Unfortunately, many build systems don’t stop the build when optional dependencies are not present. Instead, you sometimes end up with a broken build, or, even worse: with a successful build that does not work correctly at runtime.
This means that packagers need to closely examine the build output to know which dependencies to make available. In the best case, there is a summary of available and enabled options, clearly outlining what this build will contain. In the worst case, you need to infer the features from the checks that are done, or work your way through the --help output.
The better alternative is to configure your build system such that it stops when any dependency was not found, and thereby have packagers acknowledge each optional dependency by explicitly disabling the option.
Code paths which are not used will inevitably bit rot. If you have optional dependencies, you need to test both the code path without the dependency and the code path with the dependency. It doesn’t matter whether the tests are automated or manual, the test matrix must cover both paths.
Interestingly enough, this principle seems to apply to all kinds of software projects (but it slows down as change slows down): one might think that important Open Source building blocks should have enough users to cover all sorts of configurations.
However, consider this example: building cairo without libxrender results in all GTK application windows, menus, etc. being displayed as empty grey surfaces. Cairo does not fail to build without libxrender, but the code path clearly is broken without libxrender.
I’m not saying optional dependencies should never be used. In fact, for bootstrapping, disabling dependencies can save a lot of work and can sometimes allow breaking circular dependencies. For example, in an early bootstrapping stage, binutils can be compiled with --disable-nls to disable internationalization.
However, optional dependencies are broken so often that I conclude they are overused. Read on and see for yourself whether you would rather commit to best practices or not introduce an optional dependency.
If you do decide to make dependencies optional, please:
- make the build fail when an optional dependency is not found, so that packagers must explicitly acknowledge each one via a --disable flag.
- make it easy to see which optional features a build includes, e.g. in the output of --version.
]]>Debian has been in my life for well over 10 years at this point.
A few weeks ago, I visited some old friends at the Zürich Debian meetup after a multi-year period of absence. On my bike ride home, it occurred to me that the topics of our discussions had remarkable overlap with my last visit. We had a discussion about the merits of systemd, which took a detour to respect in open source communities, returned to processes in Debian and eventually culminated in democracies and their theoretical/practical failings. Admittedly, that last one might be a Swiss thing.
I say this not to knock on the Debian meetup, but because it prompted me to reflect on what feelings Debian evokes in me lately and whether it’s still a good fit for me.
So I’m finally making a decision that I should have made a long time ago: I am winding down my involvement in Debian to a minimum.
Over the coming weeks, I will:
- remove myself from the Uploaders field on packages with other maintainers

I will try to keep up best-effort maintenance of the manpages.debian.org service and the codesearch.debian.net service, but any help would be much appreciated.
For all intents and purposes, please treat me as permanently on vacation. I will try to be around for administrative issues (e.g. permission transfers) and questions addressed directly to me, provided they are easy enough to answer.
When I joined Debian, I was still studying, i.e. I had luxurious amounts of spare time. Now, over 5 years of full-time work later, my day job taught me a lot, both about what works in large software engineering projects and how I personally like my computer systems. I am very conscious of how I spend the little spare time that I have these days.
The following sections each deal with what I consider a major pain point, in no particular order. Some of them influence each other—for example, if changes worked better, we could have a chance at transitioning packages to be more easily machine readable.
Over the last few years, my current team at work conducted various smaller and larger refactorings across the entire code base (touching thousands of projects), so we have learnt a lot of valuable lessons about how to effectively do these changes. It irks me that Debian works almost the opposite way in every regard. I appreciate that every organization is different, but I think a lot of my points do actually apply to Debian.
In Debian, packages are nudged in the right direction by a document called the Debian Policy, or its programmatic embodiment, lintian.
While it is great to have a lint tool (for quick, local/offline feedback), it is even better to not require a lint tool at all. The team conducting the change (e.g. the C++ team introduces a new hardening flag for all packages) should be able to do their work transparent to me.
Instead, currently, all packages become lint-unclean, all maintainers need to read up on what the new thing is, how it might break, whether/how it affects them, manually run some tests, and finally decide to opt in. This causes a lot of overhead and manually executed mechanical changes across packages.
Notably, the cost of each change is distributed onto the package maintainers in the Debian model. At work, we have found that the opposite works better: if the team behind the change is put in power to do the change for as many users as possible, they can be significantly more efficient at it, which reduces the total cost and time a lot. Of course, exceptions (e.g. a large project abusing a language feature) should still be taken care of by the respective owners, but the important bit is that the default should be the other way around.
Debian is lacking tooling for large changes: it is hard to programmatically deal with packages and repositories (see the section below). The closest to “sending out a change for review” is to open a bug report with an attached patch. I thought the workflow for accepting a change from a bug report was too complicated and started mergebot, but only Guido ever signaled interest in the project.
Culturally, reviews and reactions are slow. There are no deadlines. I literally sometimes get emails notifying me that a patch I sent out a few years ago (!!) is now merged. This turns projects from a small number of weeks into many years, which is a huge demotivator for me.
Interestingly enough, you can see artifacts of the slow online activity manifest itself in the offline culture as well: I don’t want to be discussing systemd’s merits 10 years after I first heard about it.
Lastly, changes can easily be slowed down significantly by holdouts who refuse to collaborate. My canonical example for this is rsync, whose maintainer refused my patches to make the package use debhelper purely out of personal preference.
Granting so much personal freedom to individual maintainers prevents us as a project from raising the abstraction level for building Debian packages, which in turn makes tooling harder.
What would things look like in a better world?
To learn more about what successful large changes can look like, I recommend my colleague Hyrum Wright’s talk “Large-Scale Changes at Google: Lessons Learned From 5 Yrs of Mass Migrations”.
Debian generally seems to prefer decentralized approaches over centralized ones. For example, individual packages are maintained in separate repositories (as opposed to in one repository), each repository can use any SCM (git and svn are common ones) or no SCM at all, and each repository can be hosted on a different site. Of course, what you do in such a repository also varies subtly from team to team, and even within teams.
In practice, non-standard hosting options are used rarely enough to not justify their cost, but frequently enough to be a huge pain when trying to automate changes to packages. Instead of using GitLab’s API to create a merge request, you have to design an entirely different, more complex system, which deals with intermittently (or permanently!) unreachable repositories and abstracts away differences in patch delivery (bug reports, merge requests, pull requests, email, …).
Wildly diverging workflows are not just a temporary problem either. I participated in long discussions about different git workflows during DebConf 13, and gather that there were similar discussions in the meantime.
Personally, I cannot keep enough details of the different workflows in my head. Every time I touch a package that works differently than mine, it frustrates me immensely to re-learn aspects of my day-to-day.
After noticing workflow fragmentation in the Go packaging team (which I started), I tried fixing this with the workflow changes proposal, but did not succeed in implementing it. The lack of effective automation and slow pace of changes in the surrounding tooling despite my willingness to contribute time and energy killed any motivation I had.
When you want to make a package available in Debian, you upload GPG-signed files via anonymous FTP. There are several batch jobs (the queue daemon, unchecked, dinstall, possibly others) which run on fixed schedules (e.g. dinstall runs at 01:52 UTC, 07:52 UTC, 13:52 UTC and 19:52 UTC).
Depending on timing, I estimated that you might wait for over 7 hours (!!) before your package is actually installable.
What’s worse for me is that feedback to your upload is asynchronous. I like to do one thing, be done with it, move to the next thing. The current setup requires a many-minute wait and costly task switch for no good technical reason. You might think a few minutes aren’t a big deal, but when all the time I can spend on Debian per day is measured in minutes, this makes a huge difference in perceived productivity and fun.
The last communication I can find about speeding up this process is ganneff’s post from 2008.
What would things look like in a better world?
I dread interacting with the Debian bug tracker. debbugs is a piece of software (from 1994) which is only used by Debian and the GNU project these days.
Debbugs processes emails, which is to say it is asynchronous and cumbersome to deal with. Despite running on the fastest machines we have available in Debian (or so I was told when the subject last came up), its web interface loads very slowly.
Notably, the web interface at bugs.debian.org is read-only. Setting up a working email configuration for reportbug(1) or manually dealing with attachments is a rather big hurdle.
For reasons I don’t understand, every interaction with debbugs results in many different email threads.
Aside from the technical implementation, I also can never remember the different ways that Debian uses pseudo-packages for bugs and processes. I need them rarely enough that I never establish a mental model of how they are set up, or working memory of how they are used, but frequently enough to be annoyed by this.
What would things look like in a better world?
It baffles me that in 2019, we still don’t have a conveniently browsable threaded archive of mailing list discussions. Email and threading are more widely used in Debian than anywhere else, so this is somewhat ironic. Gmane used to paper over this issue, but Gmane’s availability over the last few years has been spotty, to say the least (it is down as I write this).
I tried to contribute a threaded list archive, but our listmasters didn’t seem to care or want to support the project.
While it is obviously possible to deal with Debian packages programmatically, the experience is far from pleasant. Everything seems slow and cumbersome. I have picked just 3 quick examples to illustrate my point.
debiman needs help from piuparts in analyzing the alternatives mechanism of each package in order to display the manpages of e.g. psql(1). This is because maintainer scripts modify the alternatives database by calling shell scripts. Without actually installing a package, you cannot know which changes it makes to the alternatives database.
pk4 needs to maintain its own cache to look up package metadata based on the package name. Other tools parse the apt database from scratch on every invocation. A proper database format, or at least a binary interchange format, would go a long way.
Debian Code Search wants to ingest new packages as quickly as possible. There used to be a fedmsg instance for Debian, but it no longer seems to exist. It is unclear where to get notifications from for new packages, and where best to fetch those packages.
See my “Debian package build tools” post. It really bugs me that the sprawl of tools is not seen as a problem by others.
Most of the points discussed so far deal with the experience in developing Debian, but as I recently described in my post “Debugging experience in Debian”, the experience when developing using Debian leaves a lot to be desired, too.
At this point, the article is getting pretty long, and hopefully you got a rough idea of my motivation.
While I described a number of specific shortcomings above, the final nail in the coffin is actually the lack of a positive outlook. I have more ideas that seem really compelling to me, but, based on how my previous projects have been going, I don’t think I can make any of these ideas happen within the Debian project.
I intend to publish a few more posts about specific ideas for improving operating systems here. Stay tuned.
Lastly, I hope this post inspires someone, ideally a group of people, to improve the developer experience within Debian.
]]>I copied the latest Raspberry Pi Debian image onto an SD card, booted it, and was able to reproduce the issue.
Conceptually, at this point, I should be able to install and start gdb, set a breakpoint and step through the code.
Debian, by default, strips debug symbols when building packages to conserve disk space and network bandwidth. The motivation is very reasonable: most users will never need the debug symbols.
Unfortunately, obtaining debug symbols when you do need them is unreasonably hard.
We begin by configuring an additional apt repository which contains automatically generated debug packages:
raspi# cat >>/etc/apt/sources.list.d/debug.list <<'EOT'
deb http://deb.debian.org/debian-debug buster-debug main contrib non-free
EOT
raspi# apt update
Notably, not all Debian packages have debug packages. As the DebugPackage Debian Wiki page explains, debhelper/9.20151219 started generating debug packages (ending in -dbgsym) automatically. Packages which have not been updated might come with their own debug packages (ending in -dbg) or might not preserve debug symbols at all!
Now that we can install debug packages, how do we know which ones we need?
For debugging i3, we obviously need at least the i3-dbgsym package, but i3 uses a number of other libraries through whose code we may need to step.
The debian-goodies package ships a tool called find-dbgsym-packages which prints the required packages to debug an executable, core dump or running process:
raspi# apt install debian-goodies
raspi# apt install $(find-dbgsym-packages $(which i3))
Now we should have symbol names and line number information available in gdb. But for effectively stepping through the program, access to the source code is required.
Naively, one would assume that apt source should be sufficient for obtaining the source code of any Debian package. However, apt source defaults to the package candidate version, not the version you have installed on your system.
I have addressed this issue with the pk4 tool, which defaults to the installed version.
Before we can extract any sources, we need to configure yet another apt repository:
raspi# cat >>/etc/apt/sources.list.d/source.list <<'EOT'
deb-src http://deb.debian.org/debian buster main contrib non-free
EOT
raspi# apt update
Regardless of whether you use apt source or pk4, one remaining problem is the directory mismatch: the debug symbols contain a certain path, and that path is typically not where you extracted your sources to. While debugging, you will need to tell gdb about the location of the sources. This is tricky when you debug a call across different source packages:
(gdb) pwd
Working directory /usr/src/i3.
(gdb) list main
229 * the main loop. */
230 ev_unref(main_loop);
231 }
232 }
233
234 int main(int argc, char *argv[]) {
235 /* Keep a symbol pointing to the I3_VERSION string constant so that
236 * we have it in gdb backtraces. */
237 static const char *_i3_version __attribute__((used)) = I3_VERSION;
238 char *override_configpath = NULL;
(gdb) list xcb_connect
484 ../../src/xcb_util.c: No such file or directory.
See Specifying Source Directories in the gdb manual for the dir command, which allows you to add multiple directories to the source path. This is pretty tedious, though, and does not work for all programs.
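For example, to add libxcb’s extracted sources (the path below is hypothetical) to the search path:
(gdb) directory /usr/src/libxcb/src
Source directories searched: /usr/src/libxcb/src:$cdir:$cwd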
While Fedora conceptually shares all the same steps, the experience on Fedora is so much better: when you run gdb /usr/bin/i3, it will tell you what the next step is:
# gdb /usr/bin/i3
[…]
Reading symbols from /usr/bin/i3...(no debugging symbols found)...done.
Missing separate debuginfos, use: dnf debuginfo-install i3-4.16-1.fc28.x86_64
Watch what happens when we run the suggested command:
# dnf debuginfo-install i3-4.16-1.fc28.x86_64
enabling updates-debuginfo repository
enabling fedora-debuginfo repository
[…]
Installed:
i3-debuginfo.x86_64 4.16-1.fc28
i3-debugsource.x86_64 4.16-1.fc28
Complete!
A single command understood our intent, enabled the required repositories and installed the required packages, both for debug symbols and source code (stored in e.g. /usr/src/debug/i3-4.16-1.fc28.x86_64). Unfortunately, gdb doesn’t seem to locate the sources, which seems like a bug to me.
One downside of Fedora’s approach is that gdb will only print all required dependencies once you actually run the program, so you may need to run multiple dnf commands.
Ideally, none of the manual steps described above would be necessary. It seems absurd to me that so much knowledge is required to efficiently debug programs in Debian. Case in point: I only learnt about find-dbgsym-packages a few days ago when talking to one of its contributors.
Installing gdb should be all that a user needs to do. Debug symbols and sources can be transparently provided through a lazy-loading FUSE file system. If our build/packaging infrastructure assured predictable paths and automated debug symbol extraction, we could have transparent, quick and reliable debugging of all programs within Debian.
NixOS’s dwarffs is an implementation of this idea: https://github.com/edolstra/dwarffs
While I agree with the removal of debug symbols as a general optimization, I think every Linux distribution should strive to provide an entirely transparent debugging experience: you should not even have to know that debug symbols are not present by default. Debian really falls short in this regard.
Getting Debian to a fully transparent debugging experience requires a lot of technical work and a lot of social convincing. In my experience, programmatically working with the Debian archive and packages is tricky, and ensuring that all packages in a Debian release have debug packages (let alone predictable paths) seems entirely unachievable due to the fragmentation of packaging infrastructure and holdouts blocking any progress.
My go-to example is rsync’s debian/rules, which intentionally (!) still has not adopted debhelper. It is not a surprise that there are no debug symbols for rsync in Debian.
]]>I have recently been looking into speeding up Debian Code Search. As a quick reminder, search engines answer queries by consulting an inverted index: a map from term to documents containing that term (called a “posting list”). See the Debian Code Search Bachelor Thesis (PDF) for a lot more details.
Currently, Debian Code Search does not store positional information in its index, i.e. the index can only reveal that a certain trigram is present in a document, not where or how often.
From analyzing Debian Code Search queries, I knew that identifier queries (70%) massively outnumber regular expression queries (30%). When processing identifier queries, storing positional information in the index enables a significant optimization: instead of identifying the possibly-matching documents and having to read them all, we can determine matches from querying the index alone, no document reads required.
This moves the bottleneck: having to read all possibly-matching documents requires a lot of expensive random I/O, whereas having to decode long posting lists requires a lot of cheap sequential I/O.
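To make this concrete, here is a sketch in Go of answering an identifier query from a positional trigram index alone (the index layout is hypothetical): a document matches if the query’s trigrams occur at consecutive positions, so no document contents need to be read.
package main

import "fmt"

// posting records that a trigram occurs in document docid at offset pos.
type posting struct {
	docid, pos uint32
}

// matchingDocs returns the documents containing query (len(query) >= 3),
// using only the positional index.
func matchingDocs(index map[string][]posting, query string) map[uint32]bool {
	// Start with every occurrence of the first trigram as a candidate…
	candidates := make(map[posting]bool)
	for _, p := range index[query[:3]] {
		candidates[p] = true
	}
	// …and keep only candidates whose later trigrams line up.
	for i := 1; i+3 <= len(query); i++ {
		next := make(map[posting]bool)
		for _, p := range index[query[i:i+3]] {
			start := posting{p.docid, p.pos - uint32(i)}
			if candidates[start] {
				next[start] = true
			}
		}
		candidates = next
	}
	docs := make(map[uint32]bool)
	for p := range candidates {
		docs[p.docid] = true
	}
	return docs
}

func main() {
	index := map[string][]posting{
		"i3F": {{1, 100}}, "3Fo": {{1, 101}}, "Fon": {{1, 102}}, "ont": {{1, 103}},
	}
	fmt.Println(matchingDocs(index, "i3Font")) // map[1:true]
}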
Of course, storing positions comes with a downside: the index is larger, and a larger index takes more time to decode when querying.
Hence, I have been looking at various posting list compression/decoding techniques, to figure out whether we could switch to a technique which would retain (or improve upon!) current performance despite much longer posting lists and produce a small enough index to fit on our current hardware.
I started looking into this space because of Daniel Lemire’s Stream VByte post. As usual, Daniel’s work is well presented, easily digestible and accompanied by not just one, but multiple implementations.
I also looked for scientific papers to learn about the state of the art and classes of different approaches in general. The best I could find is Compression, SIMD, and Postings Lists. If you don’t have access to the paper, I hear that Sci-Hub is helpful.
The paper is from 2014, and doesn’t include all algorithms. If you know of a better paper, please let me know and I’ll include it here.
Eventually, I stumbled upon an algorithm/implementation called TurboPFor, which the rest of the article tries to shine some light on.
If you’re wondering: PFor stands for Patched Frame Of Reference and describes a family of algorithms. The principle is explained e.g. in SIMD Compression and the Intersection of Sorted Integers (PDF).
The TurboPFor project’s README file claims that TurboPFor256 compresses with a rate of 5.04 bits per integer, and can decode with 9400 MB/s on a single thread of an Intel i7-6700 CPU.
For Debian Code Search, we use unsigned integers of 32 bit (uint32), which TurboPFor will compress into as few bits as required.
Dividing Debian Code Search’s file sizes by the total number of integers, I get similar values, at least for the docid index section:
I can confirm the order of magnitude of the decoding speed, too. My benchmark calls TurboPFor from Go via cgo, which introduces some overhead. To exclude disk speed as a factor, data comes from the page cache. The benchmark sequentially decodes all posting lists in the specified index, using as many threads as the machine has cores¹:
I think the numbers differ because the position index section contains larger integers (requiring more bits). I repeated both benchmarks, capped to 1 GiB, and decoding speeds still differed, so it is not just the size of the index.
Compared to Streaming VByte, a TurboPFor256 index comes in at just over half the size, while still reaching 83% of Streaming VByte’s decoding speed. This seems like a good trade-off for my use-case, so I decided to have a closer look at how TurboPFor works.
① See cmd/gp4-verify/verify.go run on an Intel i9-9900K.
To confirm my understanding of the details of the format, I implemented a pure-Go TurboPFor256 decoder. Note that it is intentionally not optimized as its main goal is to use simple code to teach the TurboPFor256 on-disk format.
If you’re looking to use TurboPFor from Go, I recommend using cgo. cgo’s function call overhead is about 51ns as of Go 1.8, which will easily be offset by TurboPFor’s carefully optimized, vectorized (SSE/AVX) code.
With that caveat out of the way, you can find my teaching implementation at https://github.com/stapelberg/goturbopfor
I verified that it produces the same results as TurboPFor’s p4ndec256v32 function for all posting lists in the Debian Code Search index.
Note that TurboPFor does not fully define an on-disk format on its own. When encoding, it turns a list of integers into a byte stream:
size_t p4nenc256v32(uint32_t *in, size_t n, unsigned char *out);
When decoding, it decodes the byte stream into an array of integers, but needs to know the number of integers in advance:
size_t p4ndec256v32(unsigned char *in, size_t n, uint32_t *out);
Hence, you’ll need to keep track of the number of integers and length of the generated byte streams separately. When I talk about on-disk format, I’m referring to the byte stream which TurboPFor returns.
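For illustration, here is a sketch of a minimal container plus a cgo wrapper around p4ndec256v32. The linker flags and library name are assumptions about how you built TurboPFor, and the prototype is declared inline (matching the signature quoted above) to avoid depending on a specific header name:
package main

/*
#cgo LDFLAGS: -L. -lic
// Assumption: TurboPFor was built as a library named libic in the
// current directory.
#include <stddef.h>
#include <stdint.h>
size_t p4ndec256v32(unsigned char *in, size_t n, uint32_t *out);
*/
import "C"

import (
	"fmt"
	"unsafe"
)

// postingList is a minimal container: TurboPFor records neither the integer
// count nor the byte length, so the container must.
type postingList struct {
	n    int    // number of encoded integers
	data []byte // TurboPFor-encoded byte stream
}

func (pl *postingList) decode() []uint32 {
	out := make([]uint32, pl.n)
	if pl.n == 0 {
		return out
	}
	C.p4ndec256v32(
		(*C.uchar)(unsafe.Pointer(&pl.data[0])),
		C.size_t(pl.n),
		(*C.uint32_t)(unsafe.Pointer(&out[0])))
	return out
}

func main() {
	pl := postingList{} // decode an empty list, just to exercise the wrapper
	fmt.Println(pl.decode())
}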
The TurboPFor256 format uses blocks of 256 integers each, followed by a trailing block — if required — which can contain fewer than 256 integers:
SIMD bitpacking is used for all blocks but the trailing block (which uses regular bitpacking). This is not merely an implementation detail for decoding: the on-disk structure is different for blocks which can be SIMD-decoded.
Each block starts with a 2 bit header, specifying the type of the block:
Each block type is explained in more detail in the following sections.
Note that none of the block types store the number of elements: you will always need to know how many integers you need to decode. Also, you need to know in advance how many bytes you need to feed to TurboPFor, so you will need some sort of container format.
Further, TurboPFor automatically chooses the best block type for each block.
A constant block (all integers of the block have the same value) consists of a single value of a specified bit width ≤ 32. This value will be stored in each output element for the block. E.g., after calling decode(input, 3, output) with input being the constant block depicted below, output is {0xB8912636, 0xB8912636, 0xB8912636}.
The example shows the maximum number of bytes (5). Smaller integers will use fewer bytes: e.g. an integer which can be represented in 3 bits will only use 2 bytes.
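Decoding is then trivial; a sketch (reading the bit width and the value itself from the byte stream, shown in the illustration above, is omitted):
// decodeConstant replicates the block’s single value into all n output
// slots.
func decodeConstant(value uint32, n int) []uint32 {
	out := make([]uint32, n)
	for i := range out {
		out[i] = value // e.g. 0xB8912636 in the example above
	}
	return out
}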
A bitpacking block specifies a bit width ≤ 32, followed by a stream of bits. Each value starts at the Least Significant Bit (LSB), i.e. the 3-bit values 0 (000b) and 5 (101b) are encoded as 101000b.
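A pure-Go sketch of this LSB-first unpacking (matching the example: unpack([]byte{0x28}, 3, 2) returns {0, 5}):
// unpack reads n integers of the given bit width from data, starting at the
// least significant bit, as in the (non-SIMD) bitpacking block type.
func unpack(data []byte, width uint, n int) []uint32 {
	mask := uint64(1)<<width - 1
	var acc uint64 // bit accumulator
	var have uint  // number of valid bits in acc
	out := make([]uint32, 0, n)
	pos := 0
	for len(out) < n {
		for have < width {
			acc |= uint64(data[pos]) << have
			pos++
			have += 8
		}
		out = append(out, uint32(acc&mask))
		acc >>= width
		have -= width
	}
	return out
}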
The constant and bitpacking block types work well for integers which don’t exceed a certain width, e.g. for a series of integers of width ≤ 5 bits.
For a series of integers where only a few values exceed an otherwise common width (say, two values require 7 bits, the rest requires 5 bits), it makes sense to cut the integers into two parts: value and exception.
In the example below, decoding the third integer out2 (000b) requires combination with exception ex0 (10110b), resulting in 10110000b.
The number of exceptions can be determined by summing the 1 bits in the bitmap using the popcount instruction.
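A sketch of the recombination step, assuming the packed low bits, the bitmap and the exception values have already been decoded separately (the bit order within the bitmap words is an assumption here):
// applyExceptions patches each value whose bitmap bit is set: the exception
// supplies the bits above the block’s common bit width. In the example,
// 000b combined with exception 10110b at width 3 yields 10110000b.
func applyExceptions(values []uint32, bitmap []uint64, exceptions []uint32, width uint) {
	e := 0
	for i := range values {
		if bitmap[i/64]&(uint64(1)<<(uint(i)%64)) != 0 {
			values[i] |= exceptions[e] << width
			e++
		}
	}
}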
When the exceptions are not uniform enough, it makes sense to switch from bitpacking to a variable byte encoding:
The variable byte encoding used by the TurboPFor format is similar to the one used by SQLite, which is described, alongside other common variable byte encodings, at github.com/stoklund/varint.
Instead of using individual bits for dispatching, this format classifies the first byte (b[0]) into ranges:
- 1 byte: the value is stored in b[0] directly
- 2 bytes: the value is stored in b[0] (6 high bits) and b[1] (8 low bits)
- 3 bytes: the value is stored in b[0] (3 high bits), b[1] and b[2] (16 low bits)
- 4 or 5 bytes: the value is stored in b[1], b[2], b[3] and possibly b[4]
Here is the space usage of different values:
An overflow marker will be used to signal that encoding the values would be less space-efficient than simply copying them (e.g. if all values require 5 bytes).
This format is very space-efficient: it packs 0-176 into a single byte, as opposed to 0-127 (most others). At the same time, it can be decoded very quickly, as only the first byte needs to be compared to decode a value (similar to PrefixVarint).
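To show the shape of such a decoder, here is a runnable sketch in the same spirit. Note that the range boundaries and offsets below are reconstructed from the stated bit widths and the 0-176 single-byte range; they are not TurboPFor’s actual constants:
// decodeVByte decodes one value, returning it together with the number of
// bytes consumed. Only b[0] must be inspected to determine the length.
func decodeVByte(b []byte) (uint32, int) {
	b0 := uint32(b[0])
	switch {
	case b0 <= 176: // 1 byte: value stored in b[0] directly
		return b0, 1
	case b0 <= 240: // 2 bytes: 6 high bits from b[0], 8 low bits from b[1]
		return 177 + ((b0-177)<<8 | uint32(b[1])), 2
	case b0 <= 248: // 3 bytes: 3 high bits from b[0], 16 low bits from b[1..2]
		return 16561 + ((b0-241)<<16 | uint32(b[1])<<8 | uint32(b[2])), 3
	default: // 4 or 5 bytes: the full value lives in b[1..4]
		return uint32(b[1]) | uint32(b[2])<<8 | uint32(b[3])<<16 | uint32(b[4])<<24, 5
	}
}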
In regular (non-SIMD) bitpacking, integers are stored on disk one after the other, padded to a full byte, as a byte is the smallest addressable unit when reading data from disk. For example, if you bitpack only one 3 bit int, you will end up with 5 bits of padding.
SIMD bitpacking works like regular bitpacking, but processes 8 uint32 little-endian values at the same time, leveraging the AVX instruction set. The following illustration shows the order in which 3-bit integers are decoded from disk:
For a Debian Code Search index, 85% of posting lists are short enough to only consist of a trailing block, i.e. no SIMD instructions can be used for decoding.
The distribution of block types looks as follows:
Constant blocks are mostly used for posting lists with just one entry.
The TurboPFor on-disk format is very flexible: with its 4 different kinds of blocks, chances are high that a very efficient encoding will be used for most integer series.
Of course, the flip side of covering so many cases is complexity: the format and implementation take quite a bit of time to understand — hopefully this article helps a little! For environments where the C TurboPFor implementation cannot be used, smaller algorithms might be simpler to implement.
That said, if you can use the TurboPFor implementation, you will benefit from a highly optimized SIMD code base, which will most likely be an improvement over what you’re currently using.
]]>(Cross-posting this message I sent to pkg-raspi-maintainers for broader visibility.)
I started building Raspberry Pi images because I thought there should be an easy, official way to install Debian on the Raspberry Pi.
I still believe that, but I’m not actually using Debian on any of my Raspberry Pis anymore¹, so my personal motivation to do any work on the images is gone.
On top of that, I realize that my commitments exceed my spare time capacity, so I need to get rid of responsibilities.
Therefore, I’m looking for someone to take up maintainership of the Raspberry Pi images. Numerous people have reached out to me with thank you notes and questions, so I think the user interest is there. Also, I’ll be happy to answer any questions that you might have and that I can easily answer. Please reply here (or in private) if you’re interested.
If I can’t find someone within the next 7 days, I’ll put up an announcement message in the raspi3-image-spec README, wiki page, and my blog posts, stating that the image is unmaintained and looking for a new maintainer.
Thanks for your understanding,
① just in case you’re curious, I’m now running cross-compiled Go programs directly under a Linux kernel and minimal userland, see https://gokrazy.org/
]]>To reduce the hurdles to using and contributing to Debian, I wanted to make sbuild easier to set up.
sbuild ≥ 0.74.0 provides a Debian package called sbuild-debian-developer-setup. Once installed, run the sbuild-debian-developer-setup(1) command to create a chroot suitable for building packages for Debian unstable.
On a system without any sbuild/schroot bits installed, a transcript of the full setup looks like this:
% sudo apt install -t unstable sbuild-debian-developer-setup
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  libsbuild-perl sbuild schroot
Suggested packages:
  deborphan btrfs-tools aufs-tools | unionfs-fuse qemu-user-static
Recommended packages:
  exim4 | mail-transport-agent autopkgtest
The following NEW packages will be installed:
  libsbuild-perl sbuild sbuild-debian-developer-setup schroot
0 upgraded, 4 newly installed, 0 to remove and 1454 not upgraded.
Need to get 1.106 kB of archives.
After this operation, 3.556 kB of additional disk space will be used.
Do you want to continue? [Y/n]
Get:1 http://localhost:3142/deb.debian.org/debian unstable/main amd64 libsbuild-perl all 0.74.0-1 [129 kB]
Get:2 http://localhost:3142/deb.debian.org/debian unstable/main amd64 sbuild all 0.74.0-1 [142 kB]
Get:3 http://localhost:3142/deb.debian.org/debian testing/main amd64 schroot amd64 1.6.10-4 [772 kB]
Get:4 http://localhost:3142/deb.debian.org/debian unstable/main amd64 sbuild-debian-developer-setup all 0.74.0-1 [62,6 kB]
Fetched 1.106 kB in 0s (5.036 kB/s)
Selecting previously unselected package libsbuild-perl.
(Reading database ... 276684 files and directories currently installed.)
Preparing to unpack .../libsbuild-perl_0.74.0-1_all.deb ...
Unpacking libsbuild-perl (0.74.0-1) ...
Selecting previously unselected package sbuild.
Preparing to unpack .../sbuild_0.74.0-1_all.deb ...
Unpacking sbuild (0.74.0-1) ...
Selecting previously unselected package schroot.
Preparing to unpack .../schroot_1.6.10-4_amd64.deb ...
Unpacking schroot (1.6.10-4) ...
Selecting previously unselected package sbuild-debian-developer-setup.
Preparing to unpack .../sbuild-debian-developer-setup_0.74.0-1_all.deb ...
Unpacking sbuild-debian-developer-setup (0.74.0-1) ...
Processing triggers for systemd (236-1) ...
Setting up schroot (1.6.10-4) ...
Created symlink /etc/systemd/system/multi-user.target.wants/schroot.service → /lib/systemd/system/schroot.service.
Setting up libsbuild-perl (0.74.0-1) ...
Processing triggers for man-db (2.7.6.1-2) ...
Setting up sbuild (0.74.0-1) ...
Setting up sbuild-debian-developer-setup (0.74.0-1) ...
Processing triggers for systemd (236-1) ...

% sudo sbuild-debian-developer-setup
The user `michael' is already a member of `sbuild'.
I: SUITE: unstable
I: TARGET: /srv/chroot/unstable-amd64-sbuild
I: MIRROR: http://localhost:3142/deb.debian.org/debian
I: Running debootstrap --arch=amd64 --variant=buildd --verbose --include=fakeroot,build-essential,eatmydata --components=main --resolve-deps unstable /srv/chroot/unstable-amd64-sbuild http://localhost:3142/deb.debian.org/debian
I: Retrieving InRelease
I: Checking Release signature
I: Valid Release signature (key id 126C0D24BD8A2942CC7DF8AC7638D0442B90D010)
I: Retrieving Packages
I: Validating Packages
I: Found packages in base already in required: apt
I: Resolving dependencies of required packages...
[…]
I: Successfully set up unstable chroot.
I: Run "sbuild-adduser" to add new sbuild users.
ln -s /usr/share/doc/sbuild/examples/sbuild-update-all /etc/cron.daily/sbuild-debian-developer-setup-update-all
Now run `newgrp sbuild', or log out and log in again.
% newgrp sbuild
% sbuild -d unstable hello
sbuild (Debian sbuild) 0.74.0 (14 Mar 2018) on x1
+==============================================================================+
| hello (amd64)                                Mon, 19 Mar 2018 07:46:14 +0000 |
+==============================================================================+
Package: hello
Distribution: unstable
Machine Architecture: amd64
Host Architecture: amd64
Build Architecture: amd64
Build Type: binary
[…]
I hope you’ll find this useful.
]]>
With these changes, after building a package, you just need to type dput (in the correct directory of course) to sign and upload it.