Table of contents
In case you are not yet familiar with why an initramfs (or initrd, or initial ramdisk) is typically used when starting Linux, let me quote the wikipedia definition:
“[…] initrd is a scheme for loading a temporary root file system into memory, which may be used as part of the Linux startup process […] to make preparations before the real root file system can be mounted.”
Many Linux distributions do not compile all file system drivers into the kernel, but instead load them on-demand from an initramfs, which saves memory.
Another common scenario, in which an initramfs is required, is full-disk encryption: the disk must be unlocked from userspace, but since userspace is encrypted, an initramfs is used.
Motivation
Thus far, building a distri disk image was quite slow:
This is on an AMD Ryzen 3900X 12-core processor (2019):
distri % time make cryptimage serial=1
80.29s user 13.56s system 186% cpu 50.419 total # 19s image, 31s initrd
Of these 50 seconds,
dracut
’s initramfs
generation accounts for 31 seconds (62%)!
Initramfs generation time drops to 8.7 seconds once dracut
no longer needs to
use the single-threaded gzip(1)
, but the
multi-threaded replacement pigz(1)
:
This brings the total time to build a distri disk image down to:
distri % time make cryptimage serial=1
76.85s user 13.23s system 327% cpu 27.509 total # 19s image, 8.7s initrd
Clearly, when you use dracut
on any modern computer, you should make pigz
available. dracut
should fail to compile unless one explicitly opts into the
known-slower gzip. For more thoughts on optional dependencies, see “Optional
dependencies don’t work”.
But why does it take 8.7 seconds still? Can we go faster?
The answer is Yes! I recently built a distri-specific initramfs I’m calling
minitrd
. I wrote both big parts from scratch:
- the initramfs generator program (
distri initrd
) - a custom Go userland (
cmd/minitrd
), running as/init
in the initramfs.
minitrd
generates the initramfs image in ≈400ms, bringing the total time down
to:
distri % time make cryptimage serial=1
50.09s user 8.80s system 314% cpu 18.739 total # 18s image, 400ms initrd
(The remaining time is spent in preparing the file system, then installing and configuring the distri system, i.e. preparing a disk image you can run on real hardware.)
How can minitrd
be 20 times faster than dracut
?
dracut
is mainly written in shell, with a C helper program. It drives the
generation process by spawning lots of external dependencies (e.g. ldd
or the
dracut-install
helper program). I assume that the combination of using an
interpreted language (shell) that spawns lots of processes and precludes a
concurrent architecture is to blame for the poor performance.
minitrd
is written in Go, with speed as a goal. It leverages concurrency and
uses no external dependencies; everything happens within a single process (but
with enough threads to saturate modern hardware).
Measuring early boot time using qemu, I measured the dracut
-generated
initramfs taking 588ms to display the full disk encryption passphrase prompt,
whereas minitrd
took only 195ms.
The rest of this article dives deeper into how minitrd
works.
What does an initramfs do?
Ultimately, the job of an initramfs is to make the root file system available and continue booting the system from there. Depending on the system setup, this involves the following 5 steps:
1. Load kernel modules to access the block devices with the root file system
Depending on the system, the block devices with the root file system might
already be present when the initramfs runs, or some kernel modules might need to
be loaded first. On my Dell XPS 9360 laptop, the NVMe system disk is already
present when the initramfs starts, whereas in qemu, we need to load the
virtio_pci
module, followed by the virtio_scsi
module.
How will our userland program know which kernel modules to load? Linux kernel modules declare patterns for their supported hardware as an alias, e.g.:
initrd# grep virtio_pci lib/modules/5.4.6/modules.alias
alias pci:v00001AF4d*sv*sd*bc*sc*i* virtio_pci
Devices in sysfs
have a modalias
file whose content can be matched against
these declarations to identify the module to load:
initrd# cat /sys/devices/pci0000:00/*/modalias
pci:v00001AF4d00001005sv00001AF4sd00000004bc00scFFi00
pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00
[…]
Hence, for the initial round of module loading, it is sufficient to locate all
modalias
files within sysfs
and load the responsible modules.
Loading a kernel module can result in new devices appearing. When that happens,
the kernel sends a
uevent,
which the uevent consumer in userspace receives via a netlink socket. Typically,
this consumer is udev(7)
, but in our case, it’s
minitrd
.
For each uevent messages that comes with a MODALIAS
variable, minitrd
will
load the relevant kernel module(s).
When loading a kernel module, its dependencies need to be loaded
first. Dependency information is stored in the modules.dep
file in a
Makefile
-like syntax:
initrd# grep virtio_pci lib/modules/5.4.6/modules.dep
kernel/drivers/virtio/virtio_pci.ko: kernel/drivers/virtio/virtio_ring.ko kernel/drivers/virtio/virtio.ko
To load a module, we can open its file and then call the Linux-specific finit_module(2)
system call. Some modules are expected to
return an error code, e.g. ENODEV
or ENOENT
when some hardware device is not
actually present.
Side note: next to the textual versions, there are also binary versions of the
modules.alias
and modules.dep
files. Presumably, those can be queried more
quickly, but for simplicitly, I have not (yet?) implemented support in
minitrd
.
2. Console settings: font, keyboard layout
Setting a legible font is necessary for hi-dpi displays. On my Dell XPS 9360 (3200 x 1800 QHD+ display), the following works well:
initrd# setfont latarcyrheb-sun32
Setting the user’s keyboard layout is necessary for entering the LUKS full-disk encryption passphrase in their preferred keyboard layout. I use the NEO layout:
initrd# loadkeys neo
3. Block device identification
In the Linux kernel, block device enumeration order is not necessarily the same on each boot. Even if it was deterministic, device order could still be changed when users modify their computer’s device topology (e.g. connect a new disk to a formerly unused port).
Hence, it is good style to refer to disks and their partitions with stable
identifiers. This also applies to boot loader configuration, and so most
distributions will set a kernel parameter such as
root=UUID=1fa04de7-30a9-4183-93e9-1b0061567121
.
Identifying the block device or partition with the specified UUID
is the
initramfs’s job.
Depending on what the device contains, the UUID comes from a different
place. For example, ext4
file systems have a UUID field in their file system
superblock, whereas LUKS volumes have a UUID in their LUKS header.
Canonically, probing a device to extract the UUID is done by libblkid
from the
util-linux
package, but the logic can easily be re-implemented in other
languages
and changes rarely. minitrd
comes with its own implementation to avoid
cgo or running the blkid(8)
program.
4. LUKS full-disk encryption unlocking (only on encrypted systems)
Unlocking a LUKS-encrypted volume is done in userspace. The kernel handles the crypto, but reading the metadata, obtaining the passphrase (or e.g. key material from a file) and setting up the device mapper table entries are done in user space.
initrd# modprobe algif_skcipher
initrd# cryptsetup luksOpen /dev/sda4 cryptroot1
After the user entered their passphrase, the root file system can be mounted:
initrd# mount /dev/dm-0 /mnt
5. Continuing the boot process (switch_root)
Now that everything is set up, we need to pass execution to the init program on
the root file system with a careful sequence of chdir(2)
, mount(2)
, chroot(2)
, chdir(2)
and execve(2)
system calls that is explained in this busybox switch_root
comment.
initrd# mount -t devtmpfs dev /mnt/dev
initrd# exec switch_root -c /dev/console /mnt /init
To conserve RAM, the files in the temporary file system to which the initramfs archive is extracted are typically deleted.
How is an initramfs generated?
An initramfs “image” (more accurately: archive) is a compressed cpio archive. Typically, gzip compression is used, but the kernel supports a bunch of different algorithms and distributions such as Ubuntu are switching to lz4.
Generators typically prepare a temporary directory and feed it to the cpio(1)
program. In minitrd
, we read the files into memory
and generate the cpio archive using the
go-cpio package. We use the
pgzip package for parallel gzip
compression.
The following files need to go into the cpio archive:
minitrd Go userland
The minitrd
binary is copied into the cpio archive as /init
and will be run
by the kernel after extracting the archive.
Like the rest of distri, minitrd
is built statically without cgo, which means
it can be copied as-is into the cpio archive.
Linux kernel modules
Aside from the modules.alias
and modules.dep
metadata files, the kernel
modules themselves reside in e.g. /lib/modules/5.4.6/kernel
and need to be
copied into the cpio archive.
Copying all modules results in a ≈80 MiB archive, so it is common to only copy modules that are relevant to the initramfs’s features. This reduces archive size to ≈24 MiB.
The filtering relies on hard-coded patterns and module names. For example, disk
encryption related modules are all kernel modules underneath kernel/crypto
,
plus kernel/drivers/md/dm-crypt.ko
.
When generating a host-only initramfs (works on precisely the computer that generated it), some initramfs generators look at the currently loaded modules and just copy those.
Console Fonts and Keymaps
The kbd
package’s setfont(8)
and loadkeys(1)
programs load console fonts and keymaps from
/usr/share/consolefonts
and /usr/share/keymaps
, respectively.
Hence, these directories need to be copied into the cpio archive. Depending on whether the initramfs should be generic (work on many computers) or host-only (works on precisely the computer/settings that generated it), the entire directories are copied, or only the required font/keymap.
cryptsetup, setfont, loadkeys
These programs are (currently) required because minitrd
does not implement
their functionality.
As they are dynamically linked, not only the programs themselves need to be
copied, but also the ELF dynamic linking loader (path stored in the .interp
ELF section) and any ELF library dependencies.
For example, cryptsetup
in distri declares the ELF interpreter
/ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2
and declares dependencies
on shared libraries libcryptsetup.so.12
, libblkid.so.1
and others. Luckily,
in distri, packages contain a lib
subdirectory containing symbolic links to
the resolved shared library paths (hermetic packaging), so it is sufficient to
mirror the lib directory into the cpio archive, recursing into shared library
dependencies of shared libraries.
cryptsetup
also requires the GCC runtime library libgcc_s.so.1
to be present
at runtime, and will abort with an error message about not being able to call
pthread_cancel(3)
if it is unavailable.
time zone data
To print log messages in the correct time zone, we copy /etc/localtime
from
the host into the cpio archive.
minitrd outside of distri?
I currently have no desire to make minitrd
available outside of
distri. While the technical challenges (such as extending
the generator to not rely on distri’s hermetic packages) are surmountable, I
don’t want to support people’s initramfs remotely.
Also, I think that people’s efforts should in general be spent on rallying
behind dracut
and making it work faster, thereby benefiting all Linux
distributions that use dracut (increasingly more). With minitrd
, I have
demonstrated that significant speed-ups are achievable.
Conclusion
It was interesting to dive into how an initramfs really works. I had been working with the concept for many years, from small tasks such as “debug why the encrypted root file system is not unlocked” to more complicated tasks such as “set up a root file system on DRBD for a high-availability setup”. But even with that sort of experience, I didn’t know all the details, until I was forced to implement every little thing.
As I suspected going into this exercise, dracut
is much slower than it needs
to be. Re-implementing its generation stage in a modern language instead of
shell helps a lot.
Of course, my minitrd
does a bit less than dracut
, but not drastically
so. The overall architecture is the same.
I hope my effort helps with two things:
-
As a teaching implementation: instead of wading through the various components that make up a modern initramfs (udev, systemd, various shell scripts, …), people can learn about how an initramfs works in a single place.
-
I hope the significant time difference motivates people to improve
dracut
.
Appendix: qemu development environment
Before writing any Go code, I did some manual prototyping. Learning how other people prototype is often immensely useful to me, so I’m sharing my notes here.
First, I copied all kernel modules and a statically built busybox binary:
% mkdir -p lib/modules/5.4.6
% cp -Lr /ro/lib/modules/5.4.6/* lib/modules/5.4.6/
% cp ~/busybox-1.22.0-amd64/busybox sh
To generate an initramfs from the current directory, I used:
% find . | cpio -o -H newc | pigz > /tmp/initrd
In distri’s Makefile
, I append these flags to the QEMU
invocation:
-kernel /tmp/kernel \
-initrd /tmp/initrd \
-append "root=/dev/mapper/cryptroot1 rdinit=/sh ro console=ttyS0,115200 rd.luks=1 rd.luks.uuid=63051f8a-54b9-4996-b94f-3cf105af2900 rd.luks.name=63051f8a-54b9-4996-b94f-3cf105af2900=cryptroot1 rd.vconsole.keymap=neo rd.vconsole.font=latarcyrheb-sun32 init=/init systemd.setenv=PATH=/bin rw vga=836"
The vga=
mode parameter is required for loading font latarcyrheb-sun32
.
Once in the busybox
shell, I manually prepared the required mount points and
kernel modules:
ln -s sh mount
ln -s sh lsmod
mkdir /proc /sys /run /mnt
mount -t proc proc /proc
mount -t sysfs sys /sys
mount -t devtmpfs dev /dev
modprobe virtio_pci
modprobe virtio_scsi
As a next step, I copied cryptsetup
and dependencies into the initramfs directory:
% for f in /ro/cryptsetup-amd64-2.0.4-6/lib/*; do full=$(readlink -f $f); rel=$(echo $full | sed 's,^/,,g'); mkdir -p $(dirname $rel); install $full $rel; done
% ln -s ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2
% cp /ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so
% cp -r /ro/cryptsetup-amd64-2.0.4-6/lib ro/cryptsetup-amd64-2.0.4-6/
% mkdir -p ro/gcc-libs-amd64-8.2.0-3/out/lib64/
% cp /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1
% ln -s /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/cryptsetup-amd64-2.0.4-6/lib
% cp -r /ro/lvm2-amd64-2.03.00-6/lib ro/lvm2-amd64-2.03.00-6/
In busybox
, I used the following commands to unlock the root file system:
modprobe algif_skcipher
./cryptsetup luksOpen /dev/sda4 cryptroot1
mount /dev/dm-0 /mnt
I run a blog since 2005, spreading knowledge and experience for almost 20 years! :)
If you want to support my work, you can buy me a coffee.
Thank you for your support! ❤️