Introduction

Most peripherals in today’s computers and servers are PCIe devices. PCI Express has become the standard interconnect for device manufacturers, yet despite its broad adoption it remains a surprisingly difficult protocol to make sense of when things go wrong.

When a device misbehaves, it can disappear from the system, or it can malfunction and flood the kernel logs with cryptic errors. The most common of these are reported via Advanced Error Reporting (AER). In this post we won’t go deep into the protocol. Instead, we’ll focus on how to make sense of those error messages and extract useful information from the raw TLP headers they expose.

PCI Express Advanced Error Reporting

PCI Express Advanced Error Reporting (AER) is a capability built into PCIe devices that allows hardware to report errors to the OS in a structured way. When an error occurs (say a malformed TLP, a completion timeout, or an unsupported request), the device records details in its AER registers and, for errors tied to a specific TLP, a copy of the offending TLP header in the Header Log.

AER errors are split into two categories:

  • Correctable errors — the hardware recovered on its own (e.g. BadTLP, RxErr)
  • Uncorrectable errors — these require OS or driver intervention (e.g. MalfTLP, FCP)

The Header Log is where it gets interesting: it holds up to 16 bytes of the raw header of the TLP that caused the error (a header is either 3 or 4 DWORDs long), and from that you can figure out exactly what the device was trying to do.

Where TLP Headers can be found

Kernel logs

The first place to check when a PCIe device misbehaves is the kernel log. You can read it with:

dmesg | grep -i pcie
dmesg | grep -i aer

A typical AER report in dmesg looks something like:

pcieport 0000:00:1c.0: AER: Uncorrected (Non-Fatal) error received
nvme 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0100(Requester ID)
nvme 0000:01:00.0:   device [144d:a80a] error status/mask=00100000/00000000
nvme 0000:01:00.0:    [20] UnsupReq               (First)

This tells you the type of error and which device triggered it, but not the full picture. For the raw TLP header, you need to look at the AER registers directly.
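The status word and the id field in those log lines can be decoded mechanically. The following is a minimal sketch, not a complete AER parser: the function names are made up for illustration, the bit table covers only the Uncorrectable Error Status register (using the short names lspci prints), and the regular expressions assume the log format shown above, which varies between kernel versions.

```python
import re

# Bit positions in the AER Uncorrectable Error Status register,
# labelled with the short names that lspci prints.
UNCORRECTABLE_BITS = {
    4: "DLP", 5: "SDES", 12: "TLP", 13: "FCP", 14: "CmpltTO",
    15: "CmpltAbrt", 16: "UnxCmplt", 17: "RxOF", 18: "MalfTLP",
    19: "ECRC", 20: "UnsupReq", 21: "ACSViol",
}

def decode_status(status):
    """Return the names of all known bits set in an uncorrectable status word."""
    return [name for bit, name in sorted(UNCORRECTABLE_BITS.items())
            if status & (1 << bit)]

def format_bdf(device_id):
    """Split a 16-bit ID into bus:device.function (conventional, non-ARI layout)."""
    bus = (device_id >> 8) & 0xFF
    dev = (device_id >> 3) & 0x1F
    fn = device_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"

# Patterns written against the sample log lines above (an assumption).
STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")
ID_RE = re.compile(r"\bid=([0-9a-fA-F]{4})")

log = """nvme 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0100(Requester ID)
nvme 0000:01:00.0:   device [144d:a80a] error status/mask=00100000/00000000"""

status, mask = (int(x, 16) for x in STATUS_RE.search(log).groups())
source = int(ID_RE.search(log).group(1), 16)

# Only report bits that are set and not masked.
print(decode_status(status & ~mask), format_bdf(source))
```

Here status 00100000 has only bit 20 set, which matches the "[20] UnsupReq" line in the log, and id=0100 maps back to the device at 01:00.0.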

Header Log via lspci

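The Header Log register stores up to four DWORDs (16 bytes) of the header of the TLP that caused the error. You can read it with lspci -vv; run it as root, since unprivileged users typically cannot read the extended capabilities. The lspci utility is part of the pciutils package, which you can install with your distro’s package manager (apt, dnf, yum).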

In this example we have an NVMe device at BDF 01:00.0:

lspci -s 01:00.0 -vv
01:00.0 Non-Volatile memory controller: Phison Electronics Corporation E16 PCIe4 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
...
        Capabilities: [200 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000001 0000220f 01070000 9eece789

The last line is what we care about: HeaderLog: 00000001 0000220f 01070000 9eece789. Those are four 32-bit DWORDs (16 bytes total) of raw TLP header.
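To work with that line programmatically, the four DWORDs first have to be pulled out of the lspci text. A minimal sketch, assuming the "HeaderLog:" label and the eight-hex-digit layout shown above; parse_header_log is a hypothetical helper name, not part of any tool:

```python
import re

# Matches "HeaderLog:" followed by four 8-digit hex words.
HEADER_LOG_RE = re.compile(r"HeaderLog:\s*((?:[0-9a-fA-F]{8}\s*){4})")

def parse_header_log(text):
    """Return the four Header Log DWORDs as ints, or None if absent."""
    match = HEADER_LOG_RE.search(text)
    if match is None:
        return None
    return [int(word, 16) for word in match.group(1).split()]

sample = "                HeaderLog: 00000001 0000220f 01070000 9eece789"
dws = parse_header_log(sample)
print([f"{dw:08x}" for dw in dws])

# Header byte 0 is the most significant byte of the first DWORD,
# so a big-endian dump gives the header bytes in wire order.
header_bytes = b"".join(dw.to_bytes(4, "big") for dw in dws)
print(header_bytes.hex())
```

Keeping both the list of DWORDs and the flat byte string is a convenience: field layouts are usually described per DWORD, while byte offsets are easier to follow in a hex dump.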

Parsing the Header Log

Let’s break down 00000001 0000220f 01070000 9eece789 manually.

DW0 — 00000001

The first byte (0x00) encodes the TLP format and type:

Byte 0: 0x00 = 0b00000000
  [7:5] FMT  = 0b000  → 3DW header, no data
  [4:0] TYPE = 0b00000 → Memory Read (MRd)

The remaining bytes of DW0:

Byte 1: 0x00  → TC=0, Attr[2]=0, LN=0, TH=0
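Byte 2: 0x00  → TD=0, EP=0 (not poisoned), Attr[1:0]=00, AT=00, Length[9:8]=00
Byte 3: 0x01  → Length[7:0]=0x01, so Length = 1 DW (4 bytes)

So this is a 3DW Memory Read Request, TC0, asking for 1 DW of data. A read request carries no payload; the Length field is the amount of data being requested.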
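The same field extraction can be written as a few shifts and masks. This is a sketch under stated assumptions: decode_dw0 and its dictionary keys are illustrative names, TLP prefixes (Fmt 0b100) are not handled, and the bits used for 10-bit tags are ignored.

```python
def decode_dw0(dw0):
    """Split the first header DWORD into its fields."""
    byte0 = (dw0 >> 24) & 0xFF
    byte1 = (dw0 >> 16) & 0xFF
    byte2 = (dw0 >> 8) & 0xFF
    byte3 = dw0 & 0xFF

    fmt = byte0 >> 5
    length = ((byte2 & 0x3) << 8) | byte3  # Length[9:8] lives in byte 2

    return {
        "fmt": fmt,
        "type": byte0 & 0x1F,
        # Fmt bit 0 selects a 4DW header, Fmt bit 1 says a payload follows.
        "header_dwords": 4 if fmt & 0x1 else 3,
        "has_data": bool(fmt & 0x2),
        "tc": (byte1 >> 4) & 0x7,
        "th": byte1 & 0x1,
        "td": (byte2 >> 7) & 0x1,
        "ep": (byte2 >> 6) & 0x1,
        # Attr[2] sits in byte 1, Attr[1:0] in byte 2.
        "attr": (((byte1 >> 2) & 0x1) << 2) | ((byte2 >> 4) & 0x3),
        "at": (byte2 >> 2) & 0x3,
        # A Length field of 0 encodes 1024 DW.
        "length_dw": length if length else 1024,
    }

print(decode_dw0(0x00000001))
```

For the example DWORD this reports Fmt 0, Type 0, a 3DW header without data, and a length of 1 DW, in line with the manual breakdown.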

DW1 — 0000220f

For a Memory Read, DW1 contains the Requester ID, the Tag, and the byte enables. Bytes are numbered as in DW0, with byte 0 being the leftmost one:

Bytes 0-1: 0x0000 → Requester ID: Bus 0x00, Dev 0x00, Fn 0x0  (Root Complex)
Byte 2:    0x22   → Tag: 0x22
Byte 3:    0x0f   → Last DW BE: 0x0 (none), First DW BE: 0xF (all 4 bytes valid)
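A short sketch of the same DW1 split for a memory request follows. The helper names are hypothetical, the Requester ID is interpreted with the conventional bus/device/function layout (ARI devices use a different split), and only the 8-bit tag in DW1 is considered.

```python
def decode_request_dw1(dw1):
    """Decode DW1 of a memory request: Requester ID, Tag, byte enables."""
    requester_id = (dw1 >> 16) & 0xFFFF
    bus = requester_id >> 8
    dev = (requester_id >> 3) & 0x1F
    fn = requester_id & 0x7
    return {
        "requester": f"{bus:02x}:{dev:02x}.{fn}",
        "tag": (dw1 >> 8) & 0xFF,
        "last_be": (dw1 >> 4) & 0xF,
        "first_be": dw1 & 0xF,
    }

def enabled_bytes(be):
    """List the byte lanes (0-3) selected by a 4-bit byte-enable mask."""
    return [lane for lane in range(4) if be & (1 << lane)]

fields = decode_request_dw1(0x0000220F)
print(fields)
print("first DW lanes:", enabled_bytes(fields["first_be"]))
print("last DW lanes:", enabled_bytes(fields["last_be"]))
```

For a request of a single DW, the Last DW BE field is expected to be zero, which is what the example shows.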

DW2 — 01070000

For a 3DW header, DW2 is the 32-bit address:

Address: 0x01070000

DW3 — 9eece789

This is outside the 3DW header. The Header Log is always four DWORDs wide, but a 3DW header only fills the first three, so DW3 here is not part of the TLP and can be ignored.
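Whether DW3 matters depends on the Fmt field in DW0: with a 4DW header, DW2 carries the upper 32 address bits and DW3 the lower ones. A minimal sketch of that choice, with request_address as an illustrative name; it assumes a memory request and clears the two low bits, which are not address bits:

```python
def request_address(dws):
    """Return (address, header_dwords_used) for a memory request header."""
    fmt = (dws[0] >> 29) & 0x7
    if fmt & 0x1:
        # 4DW header: DW2 = Address[63:32], DW3 = Address[31:2].
        raw = (dws[2] << 32) | dws[3]
        used = 4
    else:
        # 3DW header: DW2 = Address[31:2], DW3 is not part of the header.
        raw = dws[2]
        used = 3
    return raw & ~0x3, used

address, used = request_address([0x00000001, 0x0000220F, 0x01070000, 0x9EECE789])
print(f"address=0x{address:08x}, header DWORDs used={used}")
```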

Summary

Field        Value        Meaning
Type         MRd          3DW Memory Read, 32-bit address
Requester    00:00.0      Root Complex
Tag          0x22         Outstanding request ID
Length       1 DW         4 bytes
Address      0x01070000   Target address
First DW BE  0xF          All 4 bytes requested

The Root Complex issued a 1 DW Memory Read to 0x01070000. If this header ended up in a device’s Header Log, the device most likely flagged an error while handling that request, for example by treating it as an Unsupported Request or by aborting it.

Automating TLP Parsing

Doing this by hand for every log entry gets tedious quickly. I wrote a small Rust library to handle TLP parsing programmatically: rust_tlplib.

It covers the common TLP types — Memory Read/Write, Configuration, Completions, Messages — and gives you structured access to all the header fields without bit-shifting by hand. Note that it targets the classic non-flit framing (PCIe 1.0–5.0); flit mode introduced in PCIe 6.0 is a different story.
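To give a feel for what such a parser has to distinguish, here is a small Python sketch that maps the first header byte to a TLP mnemonic. It is not the rust_tlplib API and does not mirror its design; the table and the classify function are illustrative only and cover just a handful of common types.

```python
# (fmt, type) -> mnemonic for a few common, non-flit TLP types.
TLP_TYPES = {
    (0b000, 0b00000): "MRd",    (0b001, 0b00000): "MRd",
    (0b010, 0b00000): "MWr",    (0b011, 0b00000): "MWr",
    (0b000, 0b00010): "IORd",   (0b010, 0b00010): "IOWr",
    (0b000, 0b00100): "CfgRd0", (0b010, 0b00100): "CfgWr0",
    (0b000, 0b00101): "CfgRd1", (0b010, 0b00101): "CfgWr1",
    (0b000, 0b01010): "Cpl",    (0b010, 0b01010): "CplD",
}

def classify(byte0):
    """Name the TLP type encoded in header byte 0, or 'unknown'."""
    fmt, tlp_type = byte0 >> 5, byte0 & 0x1F
    # Messages use Type 0b10rrr, where rrr selects the routing,
    # and always have a 4DW header.
    if tlp_type >> 3 == 0b10 and fmt in (0b001, 0b011):
        return "MsgD" if fmt == 0b011 else "Msg"
    return TLP_TYPES.get((fmt, tlp_type), "unknown")

for byte0 in (0x00, 0x40, 0x04, 0x0A, 0x4A, 0x30):
    print(f"0x{byte0:02x} -> {classify(byte0)}")
```

A lookup table keeps the mapping easy to audit against the specification; anything not listed falls through to "unknown" rather than being guessed.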

Writing the parser in Rust was an interesting experience — the type system pushes you towards making invalid TLP states unrepresentable, which is exactly what you want when decoding hardware protocols. That topic deserves its own post though.

Wrapping up

When you see a raw HeaderLog in lspci or dmesg, it’s not as opaque as it looks. Four DWORDs, and the first byte already tells you the TLP type and addressing mode. From there it’s just field extraction.

The tricky part is usually the context — understanding why that particular TLP caused an error, and what the device was supposed to be doing. That requires correlating with driver state, device registers, and sometimes a logic analyzer. But at least now you know what you’re looking at.