How to Parse PCIe TLPs
Introduction
Most peripherals in today’s computers and servers are PCIe devices. PCI Express has become the standard interconnect for device manufacturers, yet despite its broad adoption it remains a surprisingly complicated protocol to debug when things go wrong.
When a device causes problems, it can disappear from the system or malfunction and flood the kernel logs with cryptic errors. The most common of these are reported via AER. In this post we won’t go deep into the protocol — we’ll focus on how to make sense of those error messages and extract useful information from the raw TLP headers they expose.
PCI Express Advanced Error Reporting
PCI Express Advanced Error Reporting (AER) is an optional extended capability that lets PCIe hardware report errors to the OS in a structured way. When an error occurs (say a malformed TLP, a completion timeout, or an unsupported request), the device records details in its AER registers, including a copy of the offending TLP header in the Header Log.
AER errors are split into two categories:
- Correctable errors: the hardware recovered on its own (e.g. BadTLP, RxErr)
- Uncorrectable errors: these require OS or driver intervention (e.g. MalfTLP, FCP)
The Header Log is where it gets interesting: it gives you the raw 16-byte TLP header that caused the error, and from that you can figure out exactly what the device was trying to do.
Where TLP Headers can be found
Kernel logs
The first place to check when a PCIe device misbehaves is the kernel log. You can read it with:
dmesg | grep -i pcie
dmesg | grep -i aer
A typical AER report in dmesg looks something like:
pcieport 0000:00:1c.0: AER: Uncorrected (Non-Fatal) error received
nvme 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0100(Requester ID)
nvme 0000:01:00.0: device [144d:a80a] error status/mask=00100000/00000000
nvme 0000:01:00.0: [20] UnsupReq (First)
This tells you the type of error and which device reported it, but not the full picture. For TLP-related errors the kernel may also print a TLP Header line with the raw header; if it is missing, read the Header Log from the AER registers directly.
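The status/mask pair in that report is a plain bitmask, so the bit number in brackets ([20] above) can be recovered by hand or with a few lines of code. The following is a minimal sketch, assuming the line comes from an Uncorrected report (correctable errors use a different bit layout). The helper names `parse_status_line` and `decode_uncorrectable` are made up for this example, and the labels are the abbreviations lspci prints.

```python
import re

# Bit positions in the AER Uncorrectable Error Status register,
# labelled with the abbreviations used by lspci.
UNCORRECTABLE_BITS = {
    4: "DLP",         # Data Link Protocol Error
    5: "SDES",        # Surprise Down Error
    12: "TLP",        # Poisoned TLP
    13: "FCP",        # Flow Control Protocol Error
    14: "CmpltTO",    # Completion Timeout
    15: "CmpltAbrt",  # Completer Abort
    16: "UnxCmplt",   # Unexpected Completion
    17: "RxOF",       # Receiver Overflow
    18: "MalfTLP",    # Malformed TLP
    19: "ECRC",       # ECRC Error
    20: "UnsupReq",   # Unsupported Request
    21: "ACSViol",    # ACS Violation
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def parse_status_line(line):
    """Return (status, mask) as integers from an 'error status/mask=' line."""
    match = STATUS_RE.search(line)
    if match is None:
        raise ValueError("no 'error status/mask=' field in line")
    return int(match.group(1), 16), int(match.group(2), 16)


def decode_uncorrectable(status, mask=0):
    """Return the names of the status bits that are set and not masked."""
    active = status & ~mask
    return [name for bit, name in sorted(UNCORRECTABLE_BITS.items())
            if active & (1 << bit)]


line = "nvme 0000:01:00.0: device [144d:a80a] error status/mask=00100000/00000000"
status, mask = parse_status_line(line)
print(decode_uncorrectable(status, mask))  # ['UnsupReq']
```

Bits that are not in the table are silently ignored here; a stricter version could report them as unknown.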
Header Log via lspci
The Header Log register stores the first 16 bytes of the TLP that caused the error.
You can read it with lspci -vv. The lspci utility is part of the pciutils package —
install it with your distro’s package manager (apt, dnf, yum).
In this example we have an NVMe device at BDF 01:00.0:
lspci -s 01:00.0 -vv
01:00.0 Non-Volatile memory controller: Phison Electronics Corporation E16 PCIe4 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
...
Capabilities: [200 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000001 0000220f 01070000 9eece789
The last line is what we care about: HeaderLog: 00000001 0000220f 01070000 9eece789.
Those are four 32-bit DWORDs (16 bytes total) of raw TLP header.
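Pulling those four DWORDs out of the lspci text is easy to script. The sketch below assumes the output format shown above (a `HeaderLog:` label followed by four 8-digit hex words). `extract_header_log` and `header_bytes` are illustrative names, not part of pciutils.

```python
import re

# Four 8-digit hex words after the "HeaderLog:" label.
HEADER_LOG_RE = re.compile(
    r"HeaderLog:\s*((?:[0-9a-fA-F]{8}\s+){3}[0-9a-fA-F]{8})"
)


def extract_header_log(lspci_text):
    """Return the Header Log as a list of four integers, or None if absent."""
    match = HEADER_LOG_RE.search(lspci_text)
    if match is None:
        return None
    return [int(word, 16) for word in match.group(1).split()]


def header_bytes(dwords):
    """Flatten the DWORDs into 16 bytes, leftmost printed byte first.

    This follows the convention used in this post: byte 0 of the TLP
    header is the most significant byte of the first DWORD.
    """
    return b"".join(dw.to_bytes(4, "big") for dw in dwords)


sample = """
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap+ ECRCChkEn-
        HeaderLog: 00000001 0000220f 01070000 9eece789
"""
dwords = extract_header_log(sample)
print([f"{dw:08x}" for dw in dwords])
print(header_bytes(dwords).hex())
```

In practice the text would come from `lspci -s 01:00.0 -vv`, which usually needs root to show the extended capabilities.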
Parsing the Header Log
Let’s break down 00000001 0000220f 01070000 9eece789 manually.
DW0 — 00000001
The first byte (0x00) encodes the TLP format and type:
Byte 0: 0x00 = 0b00000000
[7:5] FMT = 0b000 → 3DW header, no data
[4:0] TYPE = 0b00000 → Memory Read (MRd)
The remaining bytes of DW0:
Byte 1: 0x00 → TC=0, Attr[2]=0, LN=0, TH=0
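Byte 2: 0x00 → TD=0, EP=0 (not poisoned), Attr[1:0]=00, AT=00, Length[9:8]=00
Byte 3: 0x01 → Length[7:0]=0x01, so Length = 1 DW (4 bytes)
So this is a 3DW Memory Read Request on TC0 asking for 1 DW of data. A read request carries no payload; the Length field states how much data the completion should return.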
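The same DW0 breakdown can be written as shifts and masks. This is a minimal sketch rather than a full decoder: the name table covers only a handful of common Fmt/Type combinations, and `decode_dw0` is a name chosen for this example.

```python
# (Fmt, Type) pairs for a few common TLPs; anything else is "unknown".
TLP_NAMES = {
    (0b000, 0b00000): "MRd",
    (0b001, 0b00000): "MRd",
    (0b010, 0b00000): "MWr",
    (0b011, 0b00000): "MWr",
    (0b000, 0b00100): "CfgRd0",
    (0b010, 0b00100): "CfgWr0",
    (0b000, 0b00101): "CfgRd1",
    (0b010, 0b00101): "CfgWr1",
    (0b000, 0b01010): "Cpl",
    (0b010, 0b01010): "CplD",
}


def decode_dw0(dw0):
    """Split the first header DWORD into its fields."""
    fmt = (dw0 >> 29) & 0x7        # byte 0, bits [7:5]
    tlp_type = (dw0 >> 24) & 0x1F  # byte 0, bits [4:0]
    length = dw0 & 0x3FF           # byte 2 bits [1:0] + byte 3
    return {
        "fmt": fmt,
        "type": tlp_type,
        "name": TLP_NAMES.get((fmt, tlp_type), "unknown"),
        # Fmt bit 0 selects a 4DW header, Fmt bit 1 means data follows.
        # (Fmt 0b100 marks a TLP prefix and is not handled here.)
        "header_dwords": 4 if fmt & 0b001 else 3,
        "has_data": bool(fmt & 0b010),
        "tc": (dw0 >> 20) & 0x7,
        "attr2": (dw0 >> 18) & 0x1,
        "ln": (dw0 >> 17) & 0x1,
        "th": (dw0 >> 16) & 0x1,
        "td": (dw0 >> 15) & 0x1,
        "ep": (dw0 >> 14) & 0x1,
        "attr": (dw0 >> 12) & 0x3,
        "at": (dw0 >> 10) & 0x3,
        # A Length field of 0 encodes 1024 DW.
        "length_dw": length if length else 1024,
    }


fields = decode_dw0(0x00000001)
print(fields["name"], fields["header_dwords"], fields["length_dw"])  # MRd 3 1
```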
DW1 — 0000220f
For a Memory Read, DW1 contains the Requester ID, the tag, and the byte enables. Using the same numbering as DW0 (byte 0 is the leftmost byte as printed):
Bytes 0-1: 0x0000 → Requester ID: Bus 0x00, Dev 0x00, Fn 0x0 (Root Complex)
Byte 2: 0x22 → Tag: 0x22
Byte 3: 0x0f → Last DW BE: 0x0 (none), First DW BE: 0xF (all 4 bytes valid)
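As a sketch of the DW1 layout for memory requests, the snippet below splits out the Requester ID, tag, and byte enables. It assumes the traditional bus/device/function split of the Requester ID (8/5/3 bits), which does not hold for ARI devices, and it ignores the extra tag bits used by 10-bit tags. The function names are hypothetical.

```python
def format_bdf(requester_id):
    """Render a 16-bit Requester ID as bus:dev.fn (non-ARI layout)."""
    bus = (requester_id >> 8) & 0xFF
    device = (requester_id >> 3) & 0x1F
    function = requester_id & 0x7
    return f"{bus:02x}:{device:02x}.{function}"


def decode_request_dw1(dw1):
    """Split DW1 of a memory request into its fields."""
    requester_id = (dw1 >> 16) & 0xFFFF
    return {
        "requester_id": requester_id,
        "requester": format_bdf(requester_id),
        "tag": (dw1 >> 8) & 0xFF,
        "last_be": (dw1 >> 4) & 0xF,   # byte enables for the last DW
        "first_be": dw1 & 0xF,         # byte enables for the first DW
    }


fields = decode_request_dw1(0x0000220F)
print(fields["requester"], hex(fields["tag"]), hex(fields["first_be"]))
# 00:00.0 0x22 0xf

# The same helper decodes the id=0100 field from the dmesg report.
print(format_bdf(0x0100))  # 01:00.0
```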
DW2 — 01070000
For a 3DW header, DW2 is the 32-bit address:
Address: 0x01070000
DW3 — 9eece789
This is outside the 3DW header — for a Memory Read there is no data payload, so DW3 is undefined and can be ignored.
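Whether DW3 matters depends on the Fmt field in DW0: with a 4DW header the address is 64 bits wide and spans DW2 and DW3. A small sketch of that decision, assuming a memory request and using the hypothetical helper name `decode_address`:

```python
def decode_address(dwords):
    """Return the target address of a memory request header.

    dwords is the Header Log as four integers, DW0 first.
    """
    fmt = (dwords[0] >> 29) & 0x7
    if fmt & 0b001:
        # 4DW header: DW2 holds address[63:32], DW3 holds address[31:2].
        raw = (dwords[2] << 32) | dwords[3]
    else:
        # 3DW header: DW2 holds address[31:2]; DW3 is not part of the header.
        raw = dwords[2]
    # The two lowest bits are not address bits (reserved, or a processing
    # hint when TH is set), so mask them off.
    return raw & ~0x3


header_log = [0x00000001, 0x0000220F, 0x01070000, 0x9EECE789]
print(hex(decode_address(header_log)))  # 0x1070000
```

The byte-level start of the access within that DWORD comes from the First DW BE field, not from the address bits.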
Summary
| Field | Value | Meaning |
|---|---|---|
| Type | MRd 3DW | Memory Read, 32-bit address |
| Requester | 00:00.0 | Root Complex |
| Tag | 0x22 | Outstanding request ID |
| Length | 1 DW (4 bytes) | Amount of data requested |
| Address | 0x01070000 | Target address |
| First DW BE | 0xF | All 4 bytes requested |
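The Root Complex issued a 4-byte Memory Read to 0x01070000. Because this header was captured in the endpoint's own Header Log, the device most likely flagged the request on arrival, for example as an Unsupported Request or a Completer Abort. A completion timeout, by contrast, is detected by the requester rather than by the device that received the request.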
Automating TLP Parsing
Doing this by hand for every log entry gets tedious quickly. I wrote a small Rust library to handle TLP parsing programmatically: rust_tlplib.
It covers the common TLP types — Memory Read/Write, Configuration, Completions, Messages — and gives you structured access to all the header fields without bit-shifting by hand. Note that it targets the classic non-flit framing (PCIe 1.0–5.0); flit mode introduced in PCIe 6.0 is a different story.
Writing the parser in Rust was an interesting experience — the type system pushes you towards making invalid TLP states unrepresentable, which is exactly what you want when decoding hardware protocols. That topic deserves its own post though.
Wrapping up
When you see a raw HeaderLog in lspci or dmesg, it’s not as opaque as it looks. Four DWORDs, and the first byte already tells you the TLP type and addressing mode. From there it’s just field extraction.
The tricky part is usually the context — understanding why that particular TLP caused an error, and what the device was supposed to be doing. That requires correlating with driver state, device registers, and sometimes a logic analyzer. But at least now you know what you’re looking at.