Before the BSD Kernel Starts: Part Two ARMv8
Note: This article was written in 2020 as a companion to Part One (AMD64) but was never published. The technical content reflects the state of NetBSD and the RK3399 platform at that time. Published here now without major changes.
Introduction
In Part One we covered legacy initialization of an AMD64-based system. This time we look at ARM processors.
ARM architecture is well known from smaller devices — embedded systems, IoT, smartphones — but ARMv8 already powers modern servers and laptops. The architecture has many revisions with backward compatibility considerations. To keep things focused, we will concentrate on ARMv8 (64-bit), which can also execute the 32-bit instruction set in compatibility mode. We will run 64-bit NetBSD throughout.
There are two main ways manufacturers obtain ARM processors: by licensing a ready-made core (ARM delivers the core design as RTL) or by taking an architecture license (the company licenses the specification and designs its own core). This business model makes the architecture highly flexible and implementation-specific, which also makes learning it harder: there are many chip-specific details that the architecture itself leaves undefined.
For this article we look at generic, high-level ARMv8 details first, then go deeper using a specific hardware platform. I originally considered Raspberry Pi 3 or 4, but the Broadcom GPU plays an essential and poorly-documented role in its boot process — after reset it is the GPU that starts first and later releases the ARM cores. The lack of public documentation makes it a poor teaching example.
Instead I use the RK3399 from Rockchip, the chip behind PINE64 devices such as the ROCKPro64 single-board computer and the Pinebook Pro laptop. It has publicly available documentation and real-world relevance.
This article intentionally avoids the Secure Boot process and UEFI internals — those will get their own dedicated part.
The Bigger Picture
Exception Levels
ARMv8 introduces four privilege levels called Exception Levels (EL), numbered 0 to 3. Every implementation must support EL0 and EL1; EL2 and EL3 are optional.
The name “exception level” can be confusing — “privilege level” is the clearer term, and was used in earlier ARM specifications. The best way to understand them is by what runs at each level:
- EL0 — unprivileged user-space applications. No direct access to system registers, page tables, or hardware devices.
- EL1 — the OS kernel. Responsible for hardware configuration, page table setup, and managing EL0 processes.
- EL2 — hypervisor. UEFI also runs here during boot — not because UEFI itself needs hypervisor privileges, but because running at EL2 preserves EL2 availability for the OS. If UEFI ran at EL1, the OS would have nowhere to put a hypervisor later.
- EL3 — Secure Monitor. The most privileged level, used by ARM Trusted Firmware.
EL0 and EL1 are the minimum required to run a modern OS such as NetBSD or Linux.
Going toward less privileged levels, the execution state can only stay the same or narrow: a
64-bit EL2 can host a 64-bit or 32-bit EL1, and a 64-bit EL1 can run 64-bit or 32-bit EL0
applications, but a 32-bit OS cannot run 64-bit applications. The execution state of each
level is controlled by a bit in a higher level’s configuration register: SCR_EL3.RW
determines whether the next lower level (EL2, when implemented) runs AArch64 or AArch32,
and HCR_EL2.RW determines the same for EL1. The reset execution state of EL3 itself is
controlled by RMR_EL3.AA64, but that register takes effect only at the next warm reset;
it does not change the running state of lower ELs.
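The narrowing rule can be written down as a small check. The following Python sketch is only a model of the rule described above; the function name and the idea of representing execution states as register widths are assumptions made for illustration.

```python
def valid_execution_states(widths):
    """Check a list of register widths ordered from EL3 down to EL0.

    Each entry is 64 (AArch64) or 32 (AArch32). A lower level may keep
    the width of the level above it or narrow to 32 bits, never widen.
    """
    if any(w not in (32, 64) for w in widths):
        raise ValueError("execution state must be 32 or 64")
    return all(lower <= higher for higher, lower in zip(widths, widths[1:]))


# 64-bit EL3/EL2/EL1 with a 32-bit application at EL0: allowed.
print(valid_execution_states([64, 64, 64, 32]))
# 32-bit kernel at EL1 with a 64-bit application at EL0: not allowed.
print(valid_execution_states([64, 64, 32, 64]))
```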
Out of Reset
Unlike AMD64, where the reset vector is a fixed physical address (0xFFFFFFF0), ARMv8
does not fix the reset vector architecturally. The reset address is implementation-defined,
and the read-only RVBAR_ELx register (where x is the highest implemented exception level,
typically EL3) reports it. After reset, the processor starts fetching instructions from
that address.
For the rest of this article, “EL3” refers to EL3 if implemented, or the highest available exception level otherwise. Most ARMv8 chips do implement EL3, but it is not architecturally mandatory.
After reset, the processor state is architecturally defined as follows:
- All interrupt masks are set (DAIF = 0b1111 — all asynchronous exceptions masked)
- MMU is off; instruction and data caches are disabled (SCTLR_ELx.M, .I, .C = 0)
- The CPU executes physical addresses directly
- TLB and cache contents are implementation-defined (boot ROM typically invalidates both)
- Memory is unconfigured; the interrupt controller state is unknown
General Boot Sequence
Many architectures, including ARMv7, start code execution from the first entry of an
exception table (the reset vector). In ARMv8 the reset vector is no longer part of the
exception table. Instead, after reset the processor fetches from the implementation-defined
address reported by RVBAR_ELx (where x is the highest implemented exception level: 3, 2, or 1).
The first instructions executed are typically initialization code in a small on-chip ROM (or OTP — One-Time Programmable memory). Most peripherals — memory controllers, bus controllers — are disabled right after reset, so the processor needs some immediately accessible memory. That memory must be inside the chip itself, which makes it expensive and therefore small (order of kilobytes).
Boot ROM prepares the processor and handles chip-specific details: memory and cache initialization, branch predictor configuration. Since TLB and cache state after reset is implementation-defined, boot ROM typically performs explicit invalidations before enabling them.
Boot Stages
The ARMv8 boot process typically has three stages:
- First stage — Boot ROM: read-only code on-chip, executes immediately after reset
- Second stage — Boot Loader: loaded by Boot ROM from external storage
- Third stage — Advanced Boot Loader: loads the operating system (where UEFI lives)
BootROM vs. Bootloader
BootROM is the small program in read-only memory (ROM or OTP), typically provided by the chip manufacturer since it handles low-level, platform-specific initialization. Its main job is to prepare the CPU to execute external code from storage (flash, SD card, NVMe). It may also provide cryptographic features for a chain of trust and handle partition layout parsing. ARM Trusted Firmware defines requirements for this stage, but for non-secure boot there is no mandated BootROM interface standard.
Bootloader is a program (or chain of programs) whose primary task is to load the OS. It typically lives on external storage, is modifiable, and can be upgraded. Some systems have a BootROM advanced enough to load a kernel or UEFI image directly; others have a minimal BootROM that can only fetch and execute from a fixed memory location, requiring the bootloader to be split into multiple stages due to size constraints.
The second-stage bootloader, loaded by BootROM from external storage, is platform-specific and sometimes split further into sub-programs due to hardware constraints. Its goal is to initialize system RAM so the third-stage bootloader has full platform access.
The third stage prepares the entire platform to boot the OS. UEFI operates here at EL2. DRAM is operational at this point. The bootloader reads the OS from storage, verifies the image if secure boot is active, passes boot parameters, and jumps to the kernel entry point.
Before the OS: Device Tree and Platform Differences
ARM hardware covers a wide range — embedded devices through enterprise servers — and the boot environment differs accordingly.
For embedded devices there is usually no universal hardware description standard. What is common is a device tree: a structured description of the peripherals connected to the chip (I2C, SPI, UART, CAN buses and their devices) that the OS cannot discover at runtime. The kernel reads this binary at boot and configures peripherals accordingly. The binary format itself is defined by the Devicetree Specification; the Embedded Base Boot Requirements (EBBR) specification covers the firmware side, that is, what embedded boot firmware is expected to provide to the OS.
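To make "this binary" a little more concrete, here is a minimal Python sketch that reads the fixed header at the start of a flattened device tree blob. The field list follows the header layout in the Devicetree Specification; the sample bytes are synthetic and the function name is my own, so treat it as an illustration rather than a description of what NetBSD does.

```python
import struct

FDT_MAGIC = 0xD00DFEED
# Ten big-endian 32-bit fields, as laid out in the Devicetree Specification.
FDT_HEADER = struct.Struct(">10I")
FDT_FIELDS = (
    "magic", "totalsize", "off_dt_struct", "off_dt_strings",
    "off_mem_rsvmap", "version", "last_comp_version",
    "boot_cpuid_phys", "size_dt_strings", "size_dt_struct",
)


def parse_fdt_header(blob):
    """Return the header fields of a flattened device tree as a dict."""
    if len(blob) < FDT_HEADER.size:
        raise ValueError("blob too short for an FDT header")
    header = dict(zip(FDT_FIELDS, FDT_HEADER.unpack_from(blob)))
    if header["magic"] != FDT_MAGIC:
        raise ValueError("not a flattened device tree")
    return header


# Synthetic header for illustration only (not taken from a real board).
sample = FDT_HEADER.pack(FDT_MAGIC, 40, 40, 40, 40, 17, 16, 0, 0, 0)
print(parse_fdt_header(sample)["version"])
```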
For server-class systems, the Server Base Boot Requirements (SBBR) apply instead. SBBR-compliant systems must not expose a device tree to the OS — hardware discovery follows ACPI and UEFI conventions.
ROCKPro64 and RK3399
The Rockchip RK3399 is a System-on-Chip with two ARM processor clusters:
- Dual-core Cortex-A72 (big cores)
- Quad-core Cortex-A53 (little cores)
The clusters are connected via the ARM Cache Coherent Interconnect (CCI). Both have L1 and L2 caches and support all exception levels (EL0–EL3). The SoC contains a main interconnect and a peripheral interconnect (with low- and high-performance domains), connected through a 128-bit/64-bit/32-bit multi-layer AXI/AHB/APB bus architecture.
Embedded SRAM comes in two units: 8 KB (accessible via the PMU for power management) and
192 KB (accessible via the peripheral bus, protected by TZMA for security). The larger
192 KB block is remapped and accessible through the window at 0xFFFF_0000–0xFFFF_FFFF
(64 KB of address space) and is where the bootloader stages execute.
RockPro64 Boot Sequence
RK3399 supports normal and secure boot. This article covers normal boot only.
After power-on reset, the RK3399 Boot ROM starts at 0xFFFF0000. It searches for a valid
ID block header at the following locations, in order:
- Offset 0 in SPI flash
- Offset 0x8000 on eMMC (sector 64, where a sector is 512 bytes)
- Offset 0x8000 on SD card (sector 64)
These are fixed offsets — not partition-table entries. The Boot ROM looks for an ID header at each location and executes from the first one that matches.
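The search order can be modelled in a few lines of Python. This is a sketch under stated assumptions: the device names and the has_id_block callback are hypothetical stand-ins, since the Boot ROM's actual header check is not described in this article. Only the offsets and their order come from the list above.

```python
SECTOR_SIZE = 512  # bytes per sector, as stated above

# Search order and offsets taken from the list above.
CANDIDATES = [
    ("spi", 0),
    ("emmc", 64 * SECTOR_SIZE),
    ("sd", 64 * SECTOR_SIZE),
]


def find_boot_source(has_id_block):
    """Return the first (device, byte_offset) whose ID block is valid.

    has_id_block is a hypothetical callback standing in for the Boot ROM's
    header check.
    """
    for device, offset in CANDIDATES:
        if has_id_block(device, offset):
            return device, offset
    return None


# Example: only the SD card carries a valid ID block.
print(find_boot_source(lambda dev, off: dev == "sd"))
```

Note that sector 64 times 512 bytes gives exactly the 0x8000 byte offset quoted for eMMC and SD.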
Full boot chain using U-Boot:
- Boot ROM loads U-Boot TPL into SRAM. TPL initializes main system RAM.
- Control returns from TPL to Boot ROM (return-to-bootrom).
- Boot ROM loads U-Boot SPL.
- SPL loads ARM Trusted Firmware (ATF) and U-Boot into main memory.
- ATF runs U-Boot.
- U-Boot loads bootaa64.efi from the FAT boot partition (\EFI\BOOT\).
- The UEFI loader finds and executes the NetBSD kernel.
Glossary for this chain:
- TPL (Tertiary Program Loader): A minimal subset of SPL. Some boards have tight size constraints on early-stage code; TPL handles DDR initialization only, then hands off to SPL.
- SPL (Secondary Program Loader): A small binary generated from U-Boot source that fits in SRAM and loads the main U-Boot into system RAM.
- MLO (Memory Loader): A broader term for any second-stage program that loads the next bootloader into memory. SPL is the U-Boot-specific variant; some boards use their own MLO.
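- ATF (ARM Trusted Firmware): The reference implementation of secure-world firmware on ARM. Divides the boot process into stages: BL1 (Boot ROM), BL2 (trusted boot firmware, verified by BL1), and the BL3x images, namely BL31 (EL3 runtime firmware), BL32 (an optional secure OS), and BL33 (the non-secure payload, such as U-Boot). Establishes the chain of trust.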
A note on return-to-bootrom: this is a Rockchip-specific mechanism. The Boot ROM
leaves a function pointer in memory before transferring control to TPL. TPL can call this
pointer — roughly equivalent to a longjmp — to return control to the Boot ROM, which then
continues to the next stage. If the second-stage loader fails and returns to Boot ROM with
the appropriate error code, Boot ROM enters an upgrade mode accessible over USB. In the
normal boot path, TPL uses this mechanism to hand off to SPL cleanly.
Disk Layout
U-Boot is flexible and can boot from many sources. For NetBSD on RockPro64 we format the
boot medium with an MBR partition table containing two partitions: a FAT32 system partition
holding the EFI loader, and a NetBSD partition (ID 0xa9) holding the kernel and root filesystem.
file ./NetBSD-9-aarch64-202101091800Z-pinebook-pro.img:
DOS/MBR boot sector;
partition 1 : ID=0xc, active, start-CHS (0x2,10,9), end-CHS (0xc,60,48), startsector 32768, 163840 sectors;
partition 2 : ID=0xa9, start-CHS (0xc,60,49), end-CHS (0x92,123,41), startsector 196608, 2156672 sectors
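The partition fields in that output live in the classic 16-byte MBR entries at byte 446 of the first sector. The Python sketch below parses them from a synthetic sector that I built from the start sectors, sizes, and type IDs printed above; the CHS fields are zeroed for brevity, and the helper name is my own.

```python
import struct

ENTRY = struct.Struct("<B3sB3sII")  # one 16-byte MBR partition entry


def parse_mbr(sector):
    """Return (type_id, active, start_lba, sector_count) for used entries."""
    if len(sector) != 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("not an MBR boot sector")
    parts = []
    for i in range(4):
        status, _chs0, ptype, _chs1, lba, count = ENTRY.unpack_from(sector, 446 + 16 * i)
        if ptype != 0:
            parts.append((ptype, status == 0x80, lba, count))
    return parts


# Synthetic sector carrying the two entries from the output above (CHS zeroed).
sector = bytearray(512)
ENTRY.pack_into(sector, 446, 0x80, b"\0\0\0", 0x0C, b"\0\0\0", 32768, 163840)
ENTRY.pack_into(sector, 462, 0x00, b"\0\0\0", 0xA9, b"\0\0\0", 196608, 2156672)
sector[510:512] = b"\x55\xaa"

for ptype, active, lba, count in parse_mbr(bytes(sector)):
    print(hex(ptype), active, lba, count)
```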
The layout of sectors on the medium before the partitions:
Start (sectors) | Size (sectors) | Name | Description
----------------|----------------|-----------------------|---------------------------
64 | 16320 | IDBLoader | SoC init code (TPL + SPL)
16384 | 8192 | OS loader | Main U-Boot
24576 | 8192 | TrustedFirmware-A | ARM Trusted Firmware
32768 | 163840 | FAT32 system partition | EFI loader
196608 | 2156672 | NetBSD partition | FFS filesystem with kernel
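A quick way to sanity-check a layout like this is to confirm that each region starts where the previous one ends. The sketch below copies the numbers from the table and assumes 512-byte sectors, as elsewhere in this article; the helper name is my own.

```python
SECTOR_SIZE = 512

# (start sector, size in sectors, name), copied from the table above.
LAYOUT = [
    (64, 16320, "IDBLoader"),
    (16384, 8192, "OS loader"),
    (24576, 8192, "TrustedFirmware-A"),
    (32768, 163840, "FAT32 system partition"),
    (196608, 2156672, "NetBSD partition"),
]


def check_contiguous(layout):
    """Return True if every region starts where the previous one ends."""
    return all(
        start + size == next_start
        for (start, size, _), (next_start, _, _) in zip(layout, layout[1:])
    )


print(check_contiguous(LAYOUT))
for start, size, name in LAYOUT:
    print(f"{name:24} byte offset {start * SECTOR_SIZE:#x}, {size * SECTOR_SIZE // 1024} KiB")
```

The first region starts at sector 64, which is the same 0x8000 byte offset the Boot ROM probes on eMMC and SD.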
The EFI loader (provided in NetBSD sources) finds the kernel file, loads it into RAM, and
calls aarch64_exec_kernel. Before transferring control, it invalidates instruction and data
caches to ensure the kernel image is coherently placed in RAM, and disables the MMU — the
kernel starts with physical addresses and sets up its own address space from scratch.
start.S, locore.S and Machine-Dependent Code
From UEFI to the Kernel
When a multiprocessor SoC powers on, all cores become active after the power-on reset.
Early-stage firmware designates one as the boot core and holds the others with a
wfi (wait for interrupt) instruction. NetBSD’s early initialization assumes exactly this:
one active core, the rest idle.
Execution begins in start.S inside /sys/arch/aarch64. This file is a thin proxy to
aarch64_start in locore.S; its main job is extracting boot arguments.
The processor arrives here at EL1 — the EFI loader has already dropped from EL2 to EL1
before jumping to the kernel. This drop is done via the ERET instruction: the EFI loader
sets SPSR_EL2 to describe the target PSTATE (EL1h, all interrupts masked), sets ELR_EL2
to the kernel entry address, and executes ERET. The processor atomically switches to EL1
and begins fetching from the kernel entry point. The MMU is off at this point — the EFI
loader disabled it before the jump, so the kernel executes physical addresses until it
builds its own page tables.
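The SPSR_EL2 value for such a drop is easy to compute by hand. As an illustration only, this Python sketch assembles it from the architectural bit positions (mode in bits [3:0], the F, I, A, D masks in bits 6 to 9); the constant and function names are mine and are not taken from the NetBSD EFI loader.

```python
# PSTATE.M encodings for AArch64 targets (SPSR bits [3:0]).
MODE_EL0T, MODE_EL1T, MODE_EL1H = 0b0000, 0b0100, 0b0101

# Mask bits in SPSR_ELx.
SPSR_F, SPSR_I, SPSR_A, SPSR_D = 1 << 6, 1 << 7, 1 << 8, 1 << 9


def spsr_for(mode, mask_all=True):
    """Build an SPSR value for an ERET into the given AArch64 mode."""
    value = mode
    if mask_all:
        value |= SPSR_D | SPSR_A | SPSR_I | SPSR_F
    return value


# EL1h with Debug, SError, IRQ and FIQ all masked.
print(hex(spsr_for(MODE_EL1H)))
```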
The steps ahead:
- Initialize system registers
- Build MMU translation tables
- Enable the MMU
- Jump into virtual address space
The PSTATE Register
PSTATE (Process State) holds the processor’s current execution state, including the
active exception level. In AArch64, PSTATE is not a single accessible register: it is a
collection of fields, some of which are accessible through dedicated special-purpose
registers (CurrentEL, DAIF, NZCV, SPSel). On exception entry the hardware saves PSTATE
into the SPSR_ELx of the target exception level, and ERET restores it from there; each
of EL1 to EL3 has its own SPSR and ELR.
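As a small illustration of how these fields are laid out, the Python sketch below decodes raw values as they would be read with mrs from CurrentEL and DAIF. The bit positions are architectural; the function names and sample values are made up for the example.

```python
def current_el(value):
    """Decode a CurrentEL register value; the level sits in bits [3:2]."""
    return (value >> 2) & 0b11


def daif_masked(value):
    """Decode a DAIF register value; the mask bits sit in bits [9:6]."""
    names = {"D": 9, "A": 8, "I": 7, "F": 6}
    return {name: bool((value >> bit) & 1) for name, bit in names.items()}


print(current_el(0x4))      # raw value as read at EL1
print(daif_masked(0x3C0))   # all four masks set
```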
The overall sequence in locore.S before main is called:
/* Disable MMU (ensure clean state) */
bl mmu_disable
/* Initialize system registers and build MMU tables */
bl init_sysregs
bl init_mmutable
bl save_ttbrs
/* Enable MMU */
bl mmu_enable
/* Load virtual address of vstart into x20 */
ldr x20, =vstart
/* Jump to virtual address space */
br x20
/*
* vstart executes in kernel virtual address space
*/
vstart:
mmu_disable is called first even though the EFI loader already disabled the MMU — it is
a defensive measure in case the kernel is entered through a different path (e.g. a kexec-like
mechanism or a bootloader that does not follow the EFI handoff convention).
save_ttbrs stores the translation table base registers TTBR0_EL1 and TTBR1_EL1 after
they are configured. ARMv8 uses two TTBRs: TTBR0 covers the lower virtual address range
(user space, configured per-process), and TTBR1 covers the upper range (kernel space, shared).
Saving them at this point lets secondary CPUs reuse the same page tables when they are
brought online later.
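Which TTBR handles a given address follows from the size fields in TCR_EL1. The following Python sketch models that selection for untagged addresses; the default of 16 for both size fields (48-bit ranges) is an assumption for illustration and not necessarily what NetBSD configures.

```python
def ttbr_for(va, t0sz=16, t1sz=16):
    """Return which TTBR translates a 64-bit virtual address.

    t0sz and t1sz mirror TCR_EL1.T0SZ and T1SZ; 16 gives 48-bit ranges.
    """
    low_limit = 1 << (64 - t0sz)                 # first address above the TTBR0 range
    high_base = (1 << 64) - (1 << (64 - t1sz))   # first address of the TTBR1 range
    if va < low_limit:
        return "TTBR0_EL1"
    if va >= high_base:
        return "TTBR1_EL1"
    return None  # hole between the two ranges: translation fault


print(ttbr_for(0x0000_0000_0040_0000))
print(ttbr_for(0xFFFF_C000_0000_0000))
print(ttbr_for(0x0001_0000_0000_0000))
```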
ARMv8 branch instructions quick reference:
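- B label: unconditional branch to a PC-relative target (26-bit signed word offset, roughly ±128 MB)
- BL label: branch and link; saves the return address in x30
- BR xN: branch to the address in register xN without saving a return address (use BLR xN for an indirect call and RET to return from a subroutine)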
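To see why the direct branches are limited to a 26-bit offset while the register forms are not, it helps to look at the instruction encodings. This Python sketch encodes the four forms using the A64 base opcodes; it is an illustration of the encoding, not a replacement for an assembler, and the function names are my own.

```python
def encode_b(offset, link=False):
    """Encode B (or BL when link=True) for a byte offset from the instruction."""
    if offset % 4:
        raise ValueError("branch targets are word aligned")
    imm26 = offset // 4
    if not -(1 << 25) <= imm26 < (1 << 25):
        raise ValueError("target outside the +/-128 MB range")
    opcode = 0x94000000 if link else 0x14000000
    return opcode | (imm26 & 0x03FFFFFF)


def encode_br(reg, link=False):
    """Encode BR xN (or BLR xN when link=True)."""
    if not 0 <= reg <= 30:
        raise ValueError("register must be x0..x30")
    opcode = 0xD63F0000 if link else 0xD61F0000
    return opcode | (reg << 5)


print(hex(encode_b(8)))              # b  .+8
print(hex(encode_b(-4, link=True)))  # bl .-4
print(hex(encode_br(20)))            # br x20
```

The br x20 form is the one locore.S uses to jump to vstart: the target is a full 64-bit virtual address, which no PC-relative immediate could express from the physical addresses the code is still running at.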
Initializing System Registers
init_sysregs configures several EL1 system registers before the MMU is enabled:
init_sysregs:
stp x0, lr, [sp, #-16]!
/* Configure debug event register */
ldr x0, mdscr_setting
msr mdscr_el1, x0
/* Unlock OS lock (allows external debugger access) */
msr oslar_el1, xzr
/* Clear context ID register */
msr contextidr_el1, xzr
/* Trap FP/SIMD at both EL0 and EL1 until FP context is ready */
msr cpacr_el1, xzr
/* Allow EL0 to read the virtual counter and frequency */
mrs x0, cntkctl_el1
orr x0, x0, #CNTKCTL_EL0VCTEN
msr cntkctl_el1, x0
/* Unmask all exceptions */
msr daif, xzr
ldp x0, lr, [sp], #16
ret
Key registers touched here:
| Register | Purpose |
|---|---|
| MDSCR_EL1 | Monitor Debug System Control: configures debug event behavior |
| OSLAR_EL1 | OS Lock Access Register: writing zero clears the OS Lock, enabling external debugger access to OS save/restore registers |
| CONTEXTIDR_EL1 | Process context identifier used by debug and trace infrastructure; zeroed here as no process context exists yet |
| CPACR_EL1 | Architectural Feature Access Control: FPEN=0b00 traps all FP/SIMD instructions at both EL0 and EL1 until the kernel sets up FP context management |
| CNTKCTL_EL1 | Counter-timer Kernel Control: setting EL0VCTEN allows user-space to read the virtual counter directly without trapping to EL1 |
| DAIF | Interrupt mask bits (Debug, SError, IRQ, FIQ): zeroing unmasks all |
A note on CPACR_EL1: setting it to zero means FPEN = 0b00, which causes any FP or SIMD
instruction at either EL0 or EL1 to trap. This is intentional — until the kernel has set
up FP register save/restore in its exception handlers, allowing FP instructions would
silently corrupt the FP register file across context switches. The kernel re-enables FP
access later once the infrastructure is ready.
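The FPEN field has four encodings, and only one of them disables trapping entirely. The Python sketch below models the field as defined by the architecture (bits [21:20] of CPACR_EL1); the function name is hypothetical and the code only mirrors the trapping rule, not NetBSD's FP context handling.

```python
FPEN_SHIFT = 20  # CPACR_EL1.FPEN occupies bits [21:20]


def fp_trapped(cpacr, el):
    """Return True if FP/SIMD instructions executed at EL0 or EL1 trap."""
    if el not in (0, 1):
        raise ValueError("CPACR_EL1.FPEN only governs EL0 and EL1")
    fpen = (cpacr >> FPEN_SHIFT) & 0b11
    if fpen == 0b11:
        return False        # no trapping
    if fpen == 0b01:
        return el == 0      # only EL0 traps
    return True             # 0b00 and 0b10 trap both levels


print(fp_trapped(0, 1))                   # the value written by init_sysregs
print(fp_trapped(0b11 << FPEN_SHIFT, 0))  # FP fully enabled
```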
A note on CONTEXTIDR_EL1: this register holds a process identifier for debug and trace
hardware (e.g. ETM). It does not control ASID-based TLB tagging — ASIDs live in bits
[63:48] of TTBR0_EL1 and TTBR1_EL1.
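For comparison, here is how an ASID is carried in a TTBR value. This is a simplified Python sketch: it assumes 16-bit ASIDs, treats the low 48 bits as the table base address (ignoring the CnP bit at bit 0), and uses a made-up table address.

```python
ASID_SHIFT = 48
BADDR_MASK = (1 << 48) - 1  # low 48 bits, treated here as the table base


def make_ttbr(table_pa, asid):
    """Combine a translation table base address and an ASID into a TTBR value."""
    if not 0 <= asid < (1 << 16):
        raise ValueError("ASID must fit in 16 bits")
    return (asid << ASID_SHIFT) | (table_pa & BADDR_MASK)


def ttbr_asid(ttbr):
    """Extract the ASID from bits [63:48] of a TTBR value."""
    return ttbr >> ASID_SHIFT


ttbr = make_ttbr(0x4008_0000, asid=42)   # hypothetical table address
print(hex(ttbr), ttbr_asid(ttbr))
```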
Enabling the MMU
Out of reset (and after the EFI loader’s handoff), the MMU is off. The CPU executes physical
addresses directly with no access permission enforcement and no memory attribute
differentiation. Both instruction and data caches are disabled after reset
(SCTLR_EL1.I = 0, SCTLR_EL1.C = 0). The instruction cache can be enabled independently
of the MMU, but the data cache is of no use without it: while stage 1 translation is off,
every data access is treated as Device-nGnRnE memory and is therefore never cached,
whatever the value of SCTLR_EL1.C.
mmu_enable performs the following steps:
mmu_enable:
dsb sy
/* Invalidate all EL1 TLB entries */
dsb ishst
tlbi vmalle1
dsb ish
isb
/* Set memory attributes */
ldr x0, mair_setting
msr mair_el1, x0
/* Configure translation control; set IPS from hardware capability */
ldr x0, tcr_setting
mrs x1, id_aa64mmfr0_el1
bfi x0, x1, #32, #3
msr tcr_el1, x0
/* Configure and enable via SCTLR_EL1 */
mrs x0, sctlr_el1
ldr x1, sctlr_clear
bic x0, x0, x1
ldr x1, sctlr_pac /* disable PAC */
bic x0, x0, x1
ldr x1, sctlr_set
orr x0, x0, x1
ldr x1, sctlr_ee
#ifdef __AARCH64EB__
orr x0, x0, x1 /* big-endian */
#else
bic x0, x0, x1 /* little-endian */
#endif
msr sctlr_el1, x0 /* M bit set — MMU now enabled */
isb
ret
Breaking down the key parts:
MAIR_EL1 (Memory Attribute Indirection Register): defines up to 8 memory attribute
slots that page table entries reference by index (AttrIndx field). NetBSD sets up five:
mair_setting:
.quad ( \
__SHIFTIN(MAIR_NORMAL_WB, MAIR_ATTR0) | \
__SHIFTIN(MAIR_NORMAL_NC, MAIR_ATTR1) | \
__SHIFTIN(MAIR_NORMAL_WT, MAIR_ATTR2) | \
__SHIFTIN(MAIR_DEVICE_MEM, MAIR_ATTR3) | \
__SHIFTIN(MAIR_DEVICE_MEM_SO, MAIR_ATTR4))
Normal Write-Back, Normal Non-Cacheable, Normal Write-Through, Device memory
(Device-nGRE), and MAIR_DEVICE_MEM_SO — the most restrictive device type,
corresponding to AArch64 Device-nGnRnE (non-Gathering, non-Reordering,
non-Early-write-acknowledgement). This is the AArch64 equivalent of what ARMv7 called
“Strongly Ordered.” Page table descriptors point to these slots via their AttrIndx[2:0]
field.
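The indirection is easier to see with numbers. The Python sketch below packs five attribute bytes into a MAIR_EL1 value and looks one up by index. The byte values are the architectural encodings for these memory types; whether NetBSD's MAIR_* macros expand to exactly these bytes is an assumption I have not verified here.

```python
# Architectural attribute encodings (one byte per MAIR_EL1 slot).
DEVICE_nGnRnE = 0x00
DEVICE_nGRE = 0x08
NORMAL_NC = 0x44   # inner and outer non-cacheable
NORMAL_WT = 0xBB   # write-through, read/write allocate
NORMAL_WB = 0xFF   # write-back, read/write allocate


def build_mair(slots):
    """Pack up to eight attribute bytes into a MAIR_EL1 value (slot 0 lowest)."""
    if len(slots) > 8:
        raise ValueError("MAIR_EL1 has only eight slots")
    value = 0
    for index, attr in enumerate(slots):
        value |= (attr & 0xFF) << (8 * index)
    return value


def attr_for(mair, attr_indx):
    """Look up the attribute byte a descriptor's AttrIndx[2:0] selects."""
    return (mair >> (8 * (attr_indx & 0b111))) & 0xFF


mair = build_mair([NORMAL_WB, NORMAL_NC, NORMAL_WT, DEVICE_nGRE, DEVICE_nGnRnE])
print(hex(mair), hex(attr_for(mair, 2)))
```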
TCR_EL1 (Translation Control Register): controls granule size, address space size, and
the intermediate physical address size (IPS) at bits [34:32]. The instruction
bfi x0, x1, #32, #3 inserts bits [2:0] of ID_AA64MMFR0_EL1 (the low three bits of the
PARange field) into bits [34:32] of TCR_EL1, matching the IPS configuration to the
physical address range the hardware actually supports. Setting IPS too small would cause
address size faults on systems whose physical addresses extend beyond the configured range.
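The bit-field insert can be modelled directly. In this Python sketch, bfi mimics the instruction's effect on plain integers and the table maps PARange encodings to physical address widths as defined by the architecture; the register values passed in are made up for the example.

```python
# ID_AA64MMFR0_EL1.PARange encodings and the physical address width they denote.
PARANGE_BITS = {0: 32, 1: 36, 2: 40, 3: 42, 4: 44, 5: 48, 6: 52}


def bfi(dest, src, lsb, width):
    """Model of BFI: copy the low `width` bits of src into dest at bit `lsb`."""
    mask = ((1 << width) - 1) << lsb
    return (dest & ~mask) | ((src << lsb) & mask)


def set_ips(tcr, id_aa64mmfr0):
    """Mirror of `bfi x0, x1, #32, #3` applied to a TCR_EL1 value."""
    return bfi(tcr, id_aa64mmfr0, 32, 3)


tcr = set_ips(0x0000_0000_0000_1234, 0x1125)   # both inputs are made up
ips = (tcr >> 32) & 0b111
print(ips, PARANGE_BITS[ips])
```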
SCTLR_EL1 (System Control Register): the M bit (bit 0) enables the MMU. The code
also clears PAC (Pointer Authentication Code) bits and configures endianness before writing
the final value. The isb instruction barrier after the write ensures the pipeline is
flushed and the MMU is fully active before any subsequent instruction fetch proceeds.
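The clear, set, and endianness steps boil down to a few mask operations. This Python sketch mirrors that pattern with the architectural bit positions for M, C, I, and EE. The masks passed in are hypothetical: NetBSD's sctlr_set and sctlr_clear cover more bits, and the PAC step is left out.

```python
SCTLR_M = 1 << 0     # MMU enable
SCTLR_C = 1 << 2     # data cache enable
SCTLR_I = 1 << 12    # instruction cache enable
SCTLR_EE = 1 << 25   # EL1 data endianness (1 = big-endian)


def update_sctlr(current, clear_mask, set_mask, big_endian=False):
    """Mirror the bic/orr sequence: clear, set, then pick endianness."""
    value = (current & ~clear_mask) | set_mask
    if big_endian:
        value |= SCTLR_EE
    else:
        value &= ~SCTLR_EE
    return value


# Hypothetical masks chosen for illustration.
new = update_sctlr(current=SCTLR_EE, clear_mask=0, set_mask=SCTLR_M | SCTLR_C | SCTLR_I)
print(hex(new), bool(new & SCTLR_M))
```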
After mmu_enable returns, the CPU is translating virtual addresses. The br x20
instruction then jumps to vstart, which is a virtual address — this is the moment the
kernel fully enters its own virtual address space.
Inside Virtual Memory
Once the MMU is enabled and execution is in virtual address space, vstart completes
the remaining initialization before calling main:
- Exception Vector: installs the kernel exception vector table by writing to VBAR_EL1
- Process 0 stack: sets up the initial kernel stack for the idle/init process
- PAC setup: enables Pointer Authentication if ID_AA64ISAR1_EL1 indicates support
- CPU topology: arm_cpu_topology_set reads MPIDR_EL1 to determine cluster and core IDs
- Cache info: aarch64_getcacheinfo reads CLIDR_EL1 and CCSIDR_EL1 to discover cache geometry (levels, associativity, line size)
- initarm: machine-dependent initialization (GIC interrupt controller, clocks, device tree parsing, early console)
- main: the C entry point for the rest of kernel initialization
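Two of the steps above are plain register decoding. As a rough Python sketch, the functions below split MPIDR_EL1 into its affinity fields and compute cache geometry from a CCSIDR_EL1 value in its 32-bit layout. The register values are made up and the function names are mine; NetBSD's own routines do more than this.

```python
def decode_mpidr(mpidr):
    """Split MPIDR_EL1 into its four affinity fields."""
    return {
        "aff0": mpidr & 0xFF,
        "aff1": (mpidr >> 8) & 0xFF,
        "aff2": (mpidr >> 16) & 0xFF,
        "aff3": (mpidr >> 32) & 0xFF,
    }


def decode_ccsidr(ccsidr):
    """Decode the 32-bit CCSIDR_EL1 layout (used when FEAT_CCIDX is absent)."""
    line_size = 1 << ((ccsidr & 0x7) + 4)        # bytes per line
    ways = ((ccsidr >> 3) & 0x3FF) + 1
    sets = ((ccsidr >> 13) & 0x7FFF) + 1
    return {"line": line_size, "ways": ways, "sets": sets,
            "size": line_size * ways * sets}


# Made-up values: Aff1 = 1 and Aff0 = 1, and a 64-byte-line, 4-way, 128-set cache.
print(decode_mpidr(0x8000_0101))
print(decode_ccsidr((127 << 13) | (3 << 3) | 2))
```

On the Cortex-A53 and Cortex-A72, Aff0 identifies the core within a cluster and Aff1 the cluster, which is how the two RK3399 clusters are told apart.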
Summary
This part covered the ARMv8 boot process from reset to the kernel’s C entry point, using NetBSD on the RK3399-based ROCKPro64 as a concrete example.
The key structural difference from AMD64 is that ARMv8 provides a formal privilege hierarchy
(exception levels) from the start, and the boot chain reflects this: BootROM runs at EL3,
ATF establishes secure world services, UEFI runs at EL2, and the OS kernel settles into EL1.
The transition between each level is explicit — done via ERET with carefully prepared
SPSR and ELR registers. On AMD64, privilege rings exist but the boot chain does not engage
them until the OS sets them up.
The other major difference is the flexibility — and the resulting complexity. Rockchip’s
return-to-bootrom mechanism, the TPL/SPL split, the fixed-offset sector layout: none of
this is mandated by the ARM architecture. Each vendor solves these problems differently,
which is both ARM’s strength and the main reason ARM boot sequences are notoriously hard
to document generically.
In Part Three we will look at ARM Secure Boot and ATF in more detail, and after that compare both architectures in the context of UEFI.