Before the BSD Kernel Starts: Part Two ARMv8

Note: This article was written in 2020 as a companion to Part One (AMD64) but was never published. The technical content reflects the state of NetBSD and the RK3399 platform at that time. Published here now without major changes.

Introduction

In Part One we covered legacy initialization of an AMD64-based system. This time we look at ARM processors.

ARM architecture is well known from smaller devices — embedded systems, IoT, smartphones — but ARMv8 already powers modern servers and laptops. The architecture has many revisions with backward compatibility considerations. To keep things focused, we will concentrate on ARMv8 (64-bit), which can also execute the 32-bit instruction set in compatibility mode. We will run 64-bit NetBSD throughout.

There are two main ways manufacturers obtain ARM processors: a core license, where ARM supplies a finished core design as RTL that the licensee integrates into its own SoC, or an architecture license, where the company licenses the instruction set specification and designs its own compatible core. This business model makes the architecture highly flexible and implementation-specific, which also makes learning it harder: many chip-specific details are left undefined by the architecture itself.

For this article we look at generic, high-level ARMv8 details first, then go deeper using a specific hardware platform. I originally considered Raspberry Pi 3 or 4, but the Broadcom GPU plays an essential and poorly-documented role in its boot process — after reset it is the GPU that starts first and later releases the ARM cores. The lack of public documentation makes it a poor teaching example.

Instead I use the RK3399 from Rockchip, the chip behind PINE64 devices such as the ROCKPro64 single-board computer and the Pinebook Pro laptop. Its documentation is publicly available, and it has real-world relevance.

This article intentionally avoids the Secure Boot process and UEFI internals — those will get their own dedicated part.

The Bigger Picture

Exception Levels

ARMv8 introduces four privilege levels called Exception Levels (EL), numbered 0 to 3. Every implementation must support EL0 and EL1; EL2 and EL3 are optional.

The name “exception level” can be confusing — “privilege level” is the clearer term, and was used in earlier ARM specifications. The best way to understand them is by what runs at each level:

EL0 — unprivileged user-space applications. No direct access to system registers, page tables, or hardware devices.
EL1 — the OS kernel. Responsible for hardware configuration, page table setup, and managing EL0 processes.
EL2 — hypervisor. UEFI also runs here during boot — not because UEFI itself needs hypervisor privileges, but because running at EL2 preserves EL2 availability for the OS. If UEFI ran at EL1, the OS would have nowhere to put a hypervisor later.
EL3 — Secure Monitor. The most privileged level, used by ARM Trusted Firmware.

EL0 and EL1 are the minimum required to run a modern OS such as NetBSD or Linux.

ARMv8 only allows a decrease in execution state going toward less privileged levels: a 64-bit EL2 can host a 64-bit or 32-bit EL1, a 64-bit EL1 can run 64-bit or 32-bit EL0 applications, but a 32-bit OS cannot run 64-bit applications. The execution state of each level is controlled by bits in the higher level’s configuration registers: SCR_EL3.RW determines whether EL2 runs AArch64 or AArch32, and HCR_EL2.RW determines the same for EL1. The reset execution state of EL3 itself is controlled by RMR_EL3.AA64 — but that register affects the state after a warm reset, not the running state of lower ELs.
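The rule above can be sketched in a few lines of C. This is only an illustration: the helper names are hypothetical, while the bit positions (SCR_EL3.RW at bit 10, HCR_EL2.RW at bit 31) come from the architecture manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* RW = 1 selects AArch64 for the next lower exception level, RW = 0 AArch32. */
#define SCR_EL3_RW  (UINT64_C(1) << 10)
#define HCR_EL2_RW  (UINT64_C(1) << 31)

/* A lower level may be 64-bit only if the level above it is 64-bit. */
static bool lower_state_allowed(bool upper_is_aarch64, bool lower_is_aarch64)
{
    return upper_is_aarch64 || !lower_is_aarch64;
}

/* Hypothetical helper: value EL3 would write to SCR_EL3 to pick the EL2 state. */
static uint64_t scr_el3_select_el2_state(uint64_t scr, bool el2_is_aarch64)
{
    return el2_is_aarch64 ? (scr | SCR_EL3_RW) : (scr & ~SCR_EL3_RW);
}

/* Hypothetical helper: value EL2 would write to HCR_EL2 to pick the EL1 state. */
static uint64_t hcr_el2_select_el1_state(uint64_t hcr, bool el1_is_aarch64)
{
    return el1_is_aarch64 ? (hcr | HCR_EL2_RW) : (hcr & ~HCR_EL2_RW);
}
```

Real firmware writes these values with msr and follows up with an isb; the sketch only models the bit manipulation.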

[Figure: Exception Levels]

Out of Reset

Unlike AMD64, where the reset vector is a fixed physical address (0xFFFFFFF0), ARMv8 does not define a fixed reset vector. After reset, the processor starts fetching instructions from an implementation-defined address; software can read that address back from RVBAR_ELx, where x is the highest implemented exception level (typically EL3).

For the rest of this article, “EL3” refers to EL3 if implemented, or the highest available exception level otherwise. Most ARMv8 chips do implement EL3, but it is not architecturally mandatory.

After reset, the processor state is architecturally defined as follows:

All interrupt masks are set (DAIF = 0b1111 — all asynchronous exceptions masked)
MMU is off; instruction and data caches are disabled (SCTLR_ELx.M, .I, .C = 0)
The CPU executes physical addresses directly
TLB and cache contents are implementation-defined (boot ROM typically invalidates both)
Memory is unconfigured; the interrupt controller state is unknown

General Boot Sequence

Many architectures — including ARMv7 — start code execution from the first entry of an exception table (the reset vector). In ARMv8 the reset vector is no longer part of the exception table. Instead, after reset the processor fetches from the implementation-defined address in RVBAR_ELx (where x is the highest implemented level: 3–1).

The first instructions executed are typically initialization code in a small on-chip ROM (or OTP — One-Time Programmable memory). Most peripherals — memory controllers, bus controllers — are disabled right after reset, so the processor needs some immediately accessible memory. That memory must be inside the chip itself, which makes it expensive and therefore small (order of kilobytes).

Boot ROM prepares the processor and handles chip-specific details: memory and cache initialization, branch predictor configuration. Since TLB and cache state after reset is implementation-defined, boot ROM typically performs explicit invalidations before enabling them.

Boot Stages

The ARMv8 boot process typically has three stages:

First stage — Boot ROM: read-only code on-chip, executes immediately after reset
Second stage — Boot Loader: loaded by Boot ROM from external storage
Third stage — Advanced Boot Loader: loads the operating system (where UEFI lives)

[Figure: Boot Stages]

BootROM vs. Bootloader

BootROM is the small program in read-only memory (ROM or OTP), typically provided by the chip manufacturer since it handles low-level, platform-specific initialization. Its main job is to prepare the CPU to execute external code from storage (flash, SD card, NVMe). It may also provide cryptographic features for a chain of trust and handle partition layout parsing. ARM Trusted Firmware defines requirements for this stage, but for non-secure boot there is no mandated BootROM interface standard.

Bootloader is a program (or chain of programs) whose primary task is to load the OS. It typically lives on external storage, is modifiable, and can be upgraded. Some systems have a BootROM advanced enough to load a kernel or UEFI image directly; others have a minimal BootROM that can only fetch and execute from a fixed memory location, requiring the bootloader to be split into multiple stages due to size constraints.

The second-stage bootloader, loaded by BootROM from external storage, is platform-specific and sometimes split further into sub-programs due to hardware constraints. Its goal is to initialize system RAM so the third-stage bootloader has full platform access.

The third stage prepares the entire platform to boot the OS. UEFI operates here at EL2. DRAM is operational at this point. The bootloader reads the OS from storage, verifies the image if secure boot is active, passes boot parameters, and jumps to the kernel entry point.

Before the OS: Device Tree and Platform Differences

ARM hardware covers a wide range — embedded devices through enterprise servers — and the boot environment differs accordingly.

For embedded devices there is usually no universal hardware description standard. What is common is a device tree: a structured description of the peripherals connected to the chip (I2C, SPI, UART, CAN buses and their devices) that the OS cannot discover at runtime. The kernel reads this binary at boot and configures peripherals accordingly. The binary format is defined by the Devicetree Specification; the Embedded Base Boot Requirements (EBBR) specification describes how firmware on embedded platforms provides it to the OS.
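To make "the kernel reads this binary" a bit more concrete, here is a minimal sketch of the first sanity check a consumer might perform on a flattened device tree blob. The function names are hypothetical; the magic value 0xd00dfeed, the big-endian encoding, and the totalsize field in the second header word come from the Devicetree Specification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FDT_MAGIC 0xd00dfeedu   /* first 32-bit word of the header */

/* Read a 32-bit big-endian value regardless of host endianness. */
static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns true if the buffer starts with a plausible device tree header. */
static bool fdt_looks_valid(const uint8_t *blob, size_t len)
{
    if (blob == NULL || len < 8)
        return false;
    uint32_t totalsize = be32_read(blob + 4);   /* second header word */
    return be32_read(blob) == FDT_MAGIC && totalsize <= len;
}
```

A real parser goes on to check the version fields and walk the structure block; this only shows the entry check.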

For server-class systems, the Server Base Boot Requirements (SBBR) apply instead. SBBR-compliant systems must not expose a device tree to the OS — hardware discovery follows ACPI and UEFI conventions.

ROCKPro64 and RK3399

[Figure: RK3399 Diagram]

The Rockchip RK3399 is a System-on-Chip with two ARM processor clusters:

Dual-core Cortex-A72 (big cores)
Quad-core Cortex-A53 (little cores)

The clusters are connected via an ARM Cache Coherent Interconnect (CCI). Both have L1 and L2 caches and support all exception levels (EL0–EL3). The SoC contains a main interconnect and a peripheral interconnect (with low- and high-performance domains), connected through a 128-bit/64-bit/32-bit multi-layer AXI/AHB/APB bus architecture.

Embedded SRAM comes in two units: 8 KB (accessible via the PMU for power management) and 192 KB (accessible via the peripheral bus, protected by TZMA for security). The larger block is where the bootloader stages execute; after remapping it is accessible at 0xFFFF_0000–0xFFFF_FFFF, a window that spans 64 KB.

ROCKPro64 Boot Sequence

RK3399 supports normal and secure boot. This article covers normal boot only.

After power-on reset, the RK3399 Boot ROM starts at 0xFFFF0000. It searches for a valid ID block header at the following locations, in order:

Offset 0          in SPI flash
Offset 0x8000     on eMMC   (sector 64, where a sector is 512 bytes)
Offset 0x8000     on SD card (sector 64)

These are fixed offsets — not partition-table entries. The Boot ROM looks for an ID header at each location and executes from the first one that matches.
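The sector and byte figures are easy to mix up when writing images by hand, so here is the conversion as a tiny sketch. The helper names are hypothetical; the 512-byte sector size is the one stated in the list above.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u   /* bytes per sector */

/* Convert a sector number to a byte offset on the medium. */
static uint64_t sector_to_byte_offset(uint64_t sector)
{
    return sector * SECTOR_SIZE;
}

/* Convert a byte offset back to the sector that contains it. */
static uint64_t byte_offset_to_sector(uint64_t offset)
{
    return offset / SECTOR_SIZE;
}
```

Sector 64 therefore lands at byte offset 64 * 512 = 32768 = 0x8000, matching the eMMC and SD card entries.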

Full boot chain using U-Boot:

Boot ROM loads U-Boot TPL into SRAM. TPL initializes main system RAM.
Control returns from TPL to Boot ROM (return-to-bootrom).
Boot ROM loads U-Boot SPL.
SPL loads ARM Trusted Firmware (ATF) and U-Boot into main memory.
ATF runs U-Boot.
U-Boot loads bootaa64.efi from the FAT boot partition (\EFI\BOOT\).
The UEFI loader finds and executes the NetBSD kernel.

[Figure: Boot Sequence]

Glossary for this chain:

TPL (Tertiary Program Loader): A minimal subset of SPL. Some boards have tight size constraints on early-stage code; TPL handles DDR initialization only, then hands off to SPL.
SPL (Secondary Program Loader): A small binary generated from U-Boot source that fits in SRAM and loads the main U-Boot into system RAM.
MLO (Memory Loader): A broader term for any second-stage program that loads the next bootloader into memory. SPL is the U-Boot-specific variant; some boards use their own MLO.
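ATF (ARM Trusted Firmware): The reference implementation of secure-world firmware for ARM. Divides the boot process into stages: BL1 (runs from Boot ROM), BL2 (trusted boot firmware, verified by BL1), BL31 (EL3 runtime firmware), BL32 (optional secure OS), and BL33 (non-secure firmware such as U-Boot or a UEFI implementation). Establishes the chain of trust.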

A note on return-to-bootrom: this is a Rockchip-specific mechanism. The Boot ROM leaves a function pointer in memory before transferring control to TPL. TPL can call this pointer — roughly equivalent to a longjmp — to return control to the Boot ROM, which then continues to the next stage. If the second-stage loader fails and returns to Boot ROM with the appropriate error code, Boot ROM enters an upgrade mode accessible over USB. In the normal boot path, TPL uses this mechanism to hand off to SPL cleanly.
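Since the mechanism is compared to a longjmp, a toy model in portable C may help. This is not Rockchip or U-Boot code: every name below is hypothetical, and the jmp_buf simply stands in for whatever context the Boot ROM leaves behind.

```c
#include <setjmp.h>

static jmp_buf bootrom_ctx;   /* stands in for the context saved by the Boot ROM */
static int dram_ready;        /* set by the TPL model once "DRAM" is initialized */

/* Model of TPL: initialize DRAM, then return control to the Boot ROM. */
static void tpl_entry(void)
{
    dram_ready = 1;
    longjmp(bootrom_ctx, 1);  /* does not return to the caller's next statement */
}

/* Model of the Boot ROM: returns the number of stages it launched. */
static int bootrom_run(void)
{
    volatile int stages = 0;  /* volatile: modified between setjmp and longjmp */

    if (setjmp(bootrom_ctx) == 0) {
        stages = 1;           /* first stage: TPL */
        tpl_entry();
    }
    /* Execution resumes here after TPL "returns to bootrom". */
    stages = stages + 1;      /* second stage: SPL would be loaded now */
    return stages;
}
```

The point of the model is the control flow: TPL never returns through a normal function epilogue, yet the Boot ROM continues as if the call had completed and moves on to the next stage.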

Disk Layout

U-Boot is flexible and can boot from many sources. For NetBSD on ROCKPro64 we format the boot medium with an MBR partition table containing two partitions: a FAT32 system partition holding the EFI loader, and a NetBSD partition (ID 0xa9) holding the kernel and root filesystem.

file ./NetBSD-9-aarch64-202101091800Z-pinebook-pro.img:
DOS/MBR boot sector;
partition 1 : ID=0xc, active, start-CHS (0x2,10,9),  end-CHS (0xc,60,48),   startsector 32768, 163840 sectors;
partition 2 : ID=0xa9,        start-CHS (0xc,60,49), end-CHS (0x92,123,41), startsector 196608, 2156672 sectors

The full sector layout of the medium, including the regions that precede the first partition:

Start (sectors) | Size (sectors) | Name                   | Description
----------------|----------------|------------------------|---------------------------
64              | 16320          | IDBLoader              | SoC init code (TPL + SPL)
16384           | 8192           | OS loader              | Main U-Boot
24576           | 8192           | TrustedFirmware-A      | ARM Trusted Firmware
32768           | 163840         | FAT32 system partition | EFI loader
196608          | 2156672        | NetBSD partition       | FFS filesystem with kernel
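As a quick cross-check, the regions in the table are back to back: each one starts exactly where the previous one ends. A small sketch that encodes the table and verifies this (the struct and function names are made up for illustration):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct region {
    const char *name;
    uint32_t start;   /* first sector */
    uint32_t size;    /* length in sectors */
};

/* Values copied from the layout table above. */
static const struct region layout[] = {
    { "IDBLoader",              64,     16320   },
    { "OS loader",              16384,  8192    },
    { "TrustedFirmware-A",      24576,  8192    },
    { "FAT32 system partition", 32768,  163840  },
    { "NetBSD partition",       196608, 2156672 },
};

/* True if every region begins where its predecessor ends. */
static bool layout_is_contiguous(const struct region *r, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        if (r[i].start + r[i].size != r[i + 1].start)
            return false;
    }
    return true;
}
```

The same check is handy when building a custom image, because an overlapping U-Boot or ATF region tends to fail silently at boot.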

The EFI loader (provided in NetBSD sources) finds the kernel file, loads it into RAM, and calls aarch64_exec_kernel. Before transferring control, it cleans the data cache and invalidates the instruction cache, so that the kernel image is actually present in RAM and no stale instructions are fetched, and it disables the MMU. The kernel therefore starts with physical addresses and sets up its own address space from scratch.

start.S, locore.S and Machine-Dependent Code

From UEFI to the Kernel

When a multiprocessor SoC powers on, all cores become active after the power-on reset. Early-stage firmware designates one as the boot core and holds the others with a wfi (wait for interrupt) instruction. NetBSD’s early initialization assumes exactly this: one active core, the rest idle.

Execution begins in start.S inside /sys/arch/aarch64. This file is a thin proxy to aarch64_start in locore.S; its main job is extracting boot arguments.

The processor arrives here at EL1 — the EFI loader has already dropped from EL2 to EL1 before jumping to the kernel. This drop is done via the ERET instruction: the EFI loader sets SPSR_EL2 to describe the target PSTATE (EL1h, all interrupts masked), sets ELR_EL2 to the kernel entry address, and executes ERET. The processor atomically switches to EL1 and begins fetching from the kernel entry point. The MMU is off at this point — the EFI loader disabled it before the jump, so the kernel executes physical addresses until it builds its own page tables.
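The SPSR_EL2 value described above can be computed by hand. The sketch below is an illustration with hypothetical macro names; the field positions (M[3:0] in bits [3:0], the F, I, A, D mask bits in bits [9:6]) are architectural.

```c
#include <stdint.h>

#define SPSR_M_EL1H  UINT64_C(0x5)        /* M[3:0] = 0b0101: EL1 using SP_EL1 */
#define SPSR_F       (UINT64_C(1) << 6)   /* FIQ masked */
#define SPSR_I       (UINT64_C(1) << 7)   /* IRQ masked */
#define SPSR_A       (UINT64_C(1) << 8)   /* SError masked */
#define SPSR_D       (UINT64_C(1) << 9)   /* Debug exceptions masked */

/* Target PSTATE for an ERET into EL1h with all interrupts masked. */
static uint64_t spsr_for_el1h_masked(void)
{
    return SPSR_D | SPSR_A | SPSR_I | SPSR_F | SPSR_M_EL1H;
}
```

The result is 0x3c5. In assembly the value would be written to spsr_el2 with msr, the entry address to elr_el2, and then eret executed.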

The steps ahead:

Initialize system registers
Build MMU translation tables
Enable the MMU
Jump into virtual address space

The PState Register

PState (process state, written PSTATE in the ARM documentation) holds the processor’s current execution state, including the active exception level. In AArch64, PState is not a single accessible register. It is a collection of fields accessed through special-purpose registers such as CurrentEL, DAIF, NZCV, and SPSel. When an exception is taken, the fields are saved into the SPSR_ELx of the target exception level, and ERET restores them from there.
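Two of those fields are worth decoding explicitly, because early boot code checks them all the time. A minimal sketch, assuming the raw register values have already been read with mrs (the function names are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* CurrentEL holds the exception level in bits [3:2]. */
static unsigned current_el(uint64_t currentel_reg)
{
    return (unsigned)((currentel_reg >> 2) & 0x3);
}

/* DAIF holds the four mask bits D, A, I, F in bits [9:6]. */
static bool all_exceptions_masked(uint64_t daif_reg)
{
    const uint64_t mask = UINT64_C(0xf) << 6;
    return (daif_reg & mask) == mask;
}
```

A CurrentEL value of 0x4 therefore means EL1, 0x8 means EL2, and 0xc means EL3.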

[Figure: PState Register]

The overall sequence in locore.S before main is called:

        /* Disable MMU (ensure clean state) */
        bl      mmu_disable

        /* Initialize system registers and build MMU tables */
        bl      init_sysregs
        bl      init_mmutable
        bl      save_ttbrs

        /* Enable MMU */
        bl      mmu_enable

        /* Load virtual address of vstart into x20 */
        ldr     x20, =vstart

        /* Jump to virtual address space */
        br      x20

/*
 * vstart executes in kernel virtual address space
 */
vstart:

mmu_disable is called first even though the EFI loader already disabled the MMU — it is a defensive measure in case the kernel is entered through a different path (e.g. a kexec-like mechanism or a bootloader that does not follow the EFI handoff convention).

save_ttbrs stores the translation table base registers TTBR0_EL1 and TTBR1_EL1 after they are configured. ARMv8 uses two TTBRs: TTBR0 covers the lower virtual address range (user space, configured per-process), and TTBR1 covers the upper range (kernel space, shared). Saving them at this point lets secondary CPUs reuse the same page tables when they are brought online later.

ARMv8 branch instructions quick reference:

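B label: unconditional branch to a PC-relative target; the 26-bit signed immediate is scaled by 4, giving a range of ±128 MB
BL label: branch and link; same encoding range as B, saves the return address in x30
BR xN: branch to the address in register xN without saving a return address (use BLR xN for an indirect subroutine call, and RET to return)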
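To see where the 128 MB limit comes from, here is a small encoder sketch for B and BL. The function names are hypothetical; the opcodes (0b000101 and 0b100101 in bits [31:26]) and the imm26 field are architectural.

```c
#include <stdbool.h>
#include <stdint.h>

#define B_OPCODE   0x14000000u   /* bits [31:26] = 0b000101 */
#define BL_OPCODE  0x94000000u   /* bits [31:26] = 0b100101 */

/* A target is reachable if it is word-aligned and within +/-128 MB of pc. */
static bool branch_in_range(uint64_t pc, uint64_t target)
{
    int64_t delta = (int64_t)(target - pc);
    return (delta & 3) == 0 &&
           delta >= -(INT64_C(1) << 27) && delta < (INT64_C(1) << 27);
}

/* Encode B or BL; the caller is expected to check branch_in_range first. */
static uint32_t encode_branch(uint32_t opcode, uint64_t pc, uint64_t target)
{
    int64_t delta = (int64_t)(target - pc);
    uint32_t imm26 = (uint32_t)((uint64_t)delta >> 2) & 0x03ffffffu;
    return opcode | imm26;
}
```

This is also why locore.S loads the address of vstart into a register and uses br: a virtual kernel address is far outside the PC-relative range of a plain b issued from physical address space.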

Initializing System Registers

init_sysregs configures several EL1 system registers before the MMU is enabled:

init_sysregs:
        stp     x0, lr, [sp, #-16]!

        /* Configure debug event register */
        ldr     x0, mdscr_setting
        msr     mdscr_el1, x0

        /* Unlock OS lock (allows external debugger access) */
        msr     oslar_el1, xzr

        /* Clear context ID register */
        msr     contextidr_el1, xzr

        /* Trap FP/SIMD at both EL0 and EL1 until FP context is ready */
        msr     cpacr_el1, xzr

        /* Allow EL0 to read the virtual counter and frequency */
        mrs     x0, cntkctl_el1
        orr     x0, x0, #CNTKCTL_EL0VCTEN
        msr     cntkctl_el1, x0

        /* Unmask all exceptions */
        msr     daif, xzr

        ldp     x0, lr, [sp], #16
        ret

Key registers touched here:

Register       | Purpose
---------------|--------------------------------------------------------------
MDSCR_EL1      | Monitor Debug System Control: configures debug event behavior
OSLAR_EL1      | OS Lock Access Register: writing zero clears the OS Lock, enabling external debugger access to OS save/restore registers
CONTEXTIDR_EL1 | Process context identifier used by debug and trace infrastructure; zeroed here as no process context exists yet
CPACR_EL1      | Architectural Feature Access Control: FPEN=0b00 traps all FP/SIMD instructions at both EL0 and EL1 until the kernel sets up FP context management
CNTKCTL_EL1    | Counter-timer Kernel Control: setting EL0VCTEN allows user space to read the virtual counter directly without trapping to EL1
DAIF           | Interrupt mask bits (Debug, SError, IRQ, FIQ): zeroing unmasks all

A note on CPACR_EL1: setting it to zero means FPEN = 0b00, which causes any FP or SIMD instruction at either EL0 or EL1 to trap. This is intentional — until the kernel has set up FP register save/restore in its exception handlers, allowing FP instructions would silently corrupt the FP register file across context switches. The kernel re-enables FP access later once the infrastructure is ready.
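The FPEN field has four encodings, and only one of them lets FP/SIMD instructions through everywhere. A small decoder sketch (the enum and function names are hypothetical; the field position, bits [21:20], and its meanings are architectural):

```c
#include <stdint.h>

#define CPACR_FPEN_SHIFT  20
#define CPACR_FPEN_MASK   (UINT64_C(3) << CPACR_FPEN_SHIFT)

enum fp_trap {
    FP_TRAP_EL0_AND_EL1,   /* FPEN = 0b00 or 0b10 */
    FP_TRAP_EL0_ONLY,      /* FPEN = 0b01 */
    FP_TRAP_NONE           /* FPEN = 0b11 */
};

/* Decode which exception levels trap on FP/SIMD instructions. */
static enum fp_trap cpacr_fp_trap(uint64_t cpacr)
{
    switch ((cpacr & CPACR_FPEN_MASK) >> CPACR_FPEN_SHIFT) {
    case 1:  return FP_TRAP_EL0_ONLY;
    case 3:  return FP_TRAP_NONE;
    default: return FP_TRAP_EL0_AND_EL1;
    }
}
```

Writing xzr to cpacr_el1 selects the first case, which is exactly the "trap everything" state init_sysregs wants.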

A note on CONTEXTIDR_EL1: this register holds a process identifier for debug and trace hardware (e.g. ETM). It does not control ASID-based TLB tagging — ASIDs live in bits [63:48] of TTBR0_EL1 and TTBR1_EL1.

Enabling the MMU

Out of reset (and after the EFI loader’s handoff), the MMU is off. The CPU executes physical addresses directly with no access permission enforcement and no memory attribute differentiation. Both instruction and data caches are architecturally disabled after reset (SCTLR_EL1.I = 0, SCTLR_EL1.C = 0). The instruction cache can be enabled independently of the MMU. Enabling the data cache without the MMU gains nothing: while stage 1 translation is off, all data accesses are treated as Device-nGnRnE memory, which is never cached, so data caching only becomes effective once translation tables supply Normal cacheable attributes.

mmu_enable performs the following steps:

mmu_enable:
        dsb     sy

        /* Invalidate all EL1 TLB entries */
        dsb     ishst
        tlbi    vmalle1
        dsb     ish
        isb

        /* Set memory attributes */
        ldr     x0, mair_setting
        msr     mair_el1, x0

        /* Configure translation control; set IPS from hardware capability */
        ldr     x0, tcr_setting
        mrs     x1, id_aa64mmfr0_el1
        bfi     x0, x1, #32, #3
        msr     tcr_el1, x0

        /* Configure and enable via SCTLR_EL1 */
        mrs     x0, sctlr_el1
        ldr     x1, sctlr_clear
        bic     x0, x0, x1
        ldr     x1, sctlr_pac       /* disable PAC */
        bic     x0, x0, x1
        ldr     x1, sctlr_set
        orr     x0, x0, x1

        ldr     x1, sctlr_ee
#ifdef __AARCH64EB__
        orr     x0, x0, x1          /* big-endian */
#else
        bic     x0, x0, x1          /* little-endian */
#endif
        msr     sctlr_el1, x0       /* M bit set — MMU now enabled */
        isb

        ret

Breaking down the key parts:

MAIR_EL1 (Memory Attribute Indirection Register): defines up to 8 memory attribute slots that page table entries reference by index (AttrIndx field). NetBSD sets up five:

mair_setting:
        .quad (                                          \
            __SHIFTIN(MAIR_NORMAL_WB,     MAIR_ATTR0) | \
            __SHIFTIN(MAIR_NORMAL_NC,     MAIR_ATTR1) | \
            __SHIFTIN(MAIR_NORMAL_WT,     MAIR_ATTR2) | \
            __SHIFTIN(MAIR_DEVICE_MEM,    MAIR_ATTR3) | \
            __SHIFTIN(MAIR_DEVICE_MEM_SO, MAIR_ATTR4))

Normal Write-Back, Normal Non-Cacheable, Normal Write-Through, Device memory (Device-nGRE), and MAIR_DEVICE_MEM_SO — the most restrictive device type, corresponding to AArch64 Device-nGnRnE (non-Gathering, non-Reordering, non-Early-write-acknowledgement). This is the AArch64 equivalent of what ARMv7 called “Strongly Ordered.” Page table descriptors point to these slots via their AttrIndx[2:0] field.
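The packing itself is simple: one attribute byte per slot, slot n at bits [8n+7:8n]. The sketch below rebuilds such a value in C. The macro names and the exact byte values are assumptions chosen from the architectural encodings to match the description above; they are not copied from the NetBSD headers.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural attribute encodings, one byte per MAIR slot (assumed values). */
#define ATTR_NORMAL_WB      0xffu   /* Normal, Write-Back, read/write-allocate */
#define ATTR_NORMAL_NC      0x44u   /* Normal, Non-Cacheable */
#define ATTR_NORMAL_WT      0xbbu   /* Normal, Write-Through, read/write-allocate */
#define ATTR_DEVICE_nGRE    0x08u   /* Device-nGRE */
#define ATTR_DEVICE_nGnRnE  0x00u   /* Device-nGnRnE ("Strongly Ordered") */

/* Place an attribute byte into MAIR slot `index` (0..7). */
static uint64_t mair_slot(uint8_t attr, unsigned index)
{
    assert(index < 8);
    return (uint64_t)attr << (8 * index);
}

/* Build a MAIR_EL1 value with the five slots in the order used above. */
static uint64_t mair_build(void)
{
    return mair_slot(ATTR_NORMAL_WB, 0) | mair_slot(ATTR_NORMAL_NC, 1) |
           mair_slot(ATTR_NORMAL_WT, 2) | mair_slot(ATTR_DEVICE_nGRE, 3) |
           mair_slot(ATTR_DEVICE_nGnRnE, 4);
}

/* AttrIndx occupies bits [4:2] of a stage 1 block or page descriptor. */
static uint64_t desc_attrindx(unsigned index)
{
    return (uint64_t)(index & 0x7) << 2;
}
```

With these values the register comes out as 0x0000000008bb44ff, and a descriptor that wants the strictest device type carries AttrIndx = 4, that is, 0x10 in its low bits.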

TCR_EL1 (Translation Control Register): controls granule size, address space size, and the intermediate physical address size (IPS) at bits [34:32]. The instruction bfi x0, x1, #32, #3 inserts bits [2:0] of ID_AA64MMFR0_EL1 (the low bits of its PARange field) into bits [34:32] of TCR_EL1, matching the IPS configuration to the physical address range the hardware actually supports. Setting IPS too small would cause address size faults as soon as a translation produces an output address beyond the configured range, for example on systems with memory above 4 GB.
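For readers who do not have the bfi semantics in their head, here is the same operation modelled in C. The helper names are hypothetical; the operation mirrors bfi x0, x1, #32, #3 from the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Model of BFI: copy the low `width` bits of src into dst at bit `lsb`. */
static uint64_t bfi(uint64_t dst, uint64_t src, unsigned lsb, unsigned width)
{
    assert(width > 0 && width < 64 && lsb + width <= 64);
    uint64_t mask = ((UINT64_C(1) << width) - 1) << lsb;
    return (dst & ~mask) | ((src << lsb) & mask);
}

/* Insert ID_AA64MMFR0_EL1 bits [2:0] into TCR_EL1.IPS (bits [34:32]). */
static uint64_t tcr_with_ips(uint64_t tcr, uint64_t id_aa64mmfr0)
{
    return bfi(tcr, id_aa64mmfr0, 32, 3);
}
```

All other TCR_EL1 bits are left untouched, which is why the code can start from a compile-time tcr_setting constant and patch in only the hardware-dependent field.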

SCTLR_EL1 (System Control Register): the M bit (bit 0) enables the MMU. The code also clears PAC (Pointer Authentication Code) bits and configures endianness before writing the final value. The isb (instruction synchronization barrier) after the write flushes the pipeline, so every subsequent instruction is fetched and executed with the MMU fully active.

After mmu_enable returns, the CPU is translating virtual addresses. The br x20 instruction then jumps to vstart, which is a virtual address — this is the moment the kernel fully enters its own virtual address space.

Inside Virtual Memory

Once the MMU is enabled and execution is in virtual address space, vstart completes the remaining initialization before calling main:

Exception Vector — installs the kernel exception vector table by writing to VBAR_EL1
Process 0 stack — sets up the initial kernel stack for the idle/init process
PAC setup — enables Pointer Authentication if ID_AA64ISAR1_EL1 indicates support
CPU topology — arm_cpu_topology_set reads MPIDR_EL1 to determine cluster and core IDs
Cache info — aarch64_getcacheinfo reads CLIDR_EL1 and CCSIDR_EL1 to discover cache geometry (levels, associativity, line size)
initarm — machine-dependent initialization: GIC interrupt controller, clocks, device tree parsing, early console
main — the C entry point for the rest of kernel initialization
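As an example of what the topology step has to work with, the sketch below splits an MPIDR_EL1 value into its affinity fields. The struct and function names are hypothetical, not NetBSD's; the field positions (Aff0 in bits [7:0], Aff1 in [15:8], Aff2 in [23:16], Aff3 in [39:32]) are architectural.

```c
#include <stdint.h>

struct cpu_affinity {
    uint8_t aff0;   /* usually the core number within a cluster */
    uint8_t aff1;   /* usually the cluster number */
    uint8_t aff2;
    uint8_t aff3;
};

/* Split a raw MPIDR_EL1 value into its four affinity fields. */
static struct cpu_affinity mpidr_decode(uint64_t mpidr)
{
    struct cpu_affinity a = {
        .aff0 = (uint8_t)(mpidr & 0xff),
        .aff1 = (uint8_t)((mpidr >> 8) & 0xff),
        .aff2 = (uint8_t)((mpidr >> 16) & 0xff),
        .aff3 = (uint8_t)((mpidr >> 32) & 0xff),
    };
    return a;
}
```

How the affinity levels map onto clusters and cores is up to the implementation; on a two-cluster big.LITTLE part one would typically expect Aff1 to distinguish the clusters and Aff0 the cores inside each.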

Summary

This part covered the ARMv8 boot process from reset to the kernel’s C entry point, using NetBSD on the RK3399-based ROCKPro64 as a concrete example.

The key structural difference from AMD64 is that ARMv8 provides a formal privilege hierarchy (exception levels) from the start, and the boot chain reflects this: BootROM runs at EL3, ATF establishes secure world services, UEFI runs at EL2, and the OS kernel settles into EL1. The transition between each level is explicit — done via ERET with carefully prepared SPSR and ELR registers. On AMD64, privilege rings exist but the boot chain does not engage them until the OS sets them up.

The other major difference is the flexibility — and the resulting complexity. Rockchip’s return-to-bootrom mechanism, the TPL/SPL split, the fixed-offset sector layout: none of this is mandated by the ARM architecture. Each vendor solves these problems differently, which is both ARM’s strength and the main reason ARM boot sequences are notoriously hard to document generically.

In Part Three we will look at ARM Secure Boot and ATF in more detail, and after that compare both architectures in the context of UEFI.
