Title: Understanding and Profiling CXL.mem Using PathFinder
Authors: Xiao Li (University of Wisconsin-Madison, Beihang University); Zerui Guo (University of Wisconsin-Madison); Yuebin Bai (Beihang University); Mahesh Ketkar, Hugh Wilkinson (Intel); Ming Liu (University of Wisconsin-Madison)
Scribe: Siyong Huang (Xiamen University)
Introduction
This paper studies the problem of understanding and analyzing CXL.mem execution. CXL.mem, a core protocol of the Compute Express Link (CXL) standard, allows CPUs to directly access remote memory, which is promising for memory pooling and resource elasticity in datacenters. However, CXL.mem has much higher latency than local memory, which leads to processor pipeline stalls, disrupted cache hierarchy behavior, and underutilized computing resources. Existing tools are fragmented: each focuses on the microarchitecture, the memory subsystem, or PCIe, and none provides a holistic, end-to-end view. Hence, there is a strong need for a systematic profiler that can diagnose CXL.mem bottlenecks comprehensively.
Key idea and contribution
The paper presents PathFinder, a systematic profiler that diagnoses CXL.mem bottlenecks end to end rather than one hardware layer at a time. Specifically, PathFinder introduces four key techniques:
PFBuilder – Path Reconstruction
PFBuilder leverages diverse PMU counters across cores, caches, uncore, and CXL devices to reconstruct the exact request paths that load/store instructions take. Unlike traditional profilers that only give aggregate statistics, PFBuilder can map demand reads (DRd), writes (DWr), read-for-ownership (RFO), and prefetches (HW/SW PF) onto precise hardware trajectories. It essentially builds a path map, showing how requests propagate through core caches, CHA, interconnect, and DIMMs, providing the foundation for end-to-end profiling.
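To make the idea of a path map concrete, here is a minimal Python sketch. It is not PathFinder's implementation: the hop list, the sample format, and the function name `build_path_map` are all assumptions made for illustration, and real PMU event names are deliberately left out. Each sample records how far a request of a given type travelled before it was served, and the sketch aggregates these into per-path counts.

```python
from collections import defaultdict

# Hypothetical hop order for a request that misses all the way to a CXL device.
# These labels are illustrative and do not correspond to real PMU event names.
HOPS = ["L1D", "L2", "LLC/CHA", "FlexBus", "CXL DIMM"]


def build_path_map(samples):
    """Aggregate (request_type, deepest_hop, count) samples into path counts.

    The path of a request is assumed to be the prefix of HOPS that ends at
    the hop where the request was finally served.
    """
    path_map = defaultdict(int)
    for req_type, deepest_hop, count in samples:
        depth = HOPS.index(deepest_hop) + 1
        path = tuple(HOPS[:depth])
        path_map[(req_type, path)] += count
    return dict(path_map)


# Made-up sample counts, only to show the shape of the input.
example_samples = [
    ("DRd", "L2", 10),
    ("DRd", "CXL DIMM", 3),
    ("RFO", "LLC/CHA", 5),
    ("DRd", "L2", 2),
]
example_map = build_path_map(example_samples)
```

The resulting dictionary is keyed by request type and path, so demand reads served from L2 and demand reads served by the CXL DIMM show up as separate entries instead of one aggregate count.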
PFEstimator – Stall Attribution
CXL accesses introduce long-latency stalls that propagate back through the pipeline. PFEstimator develops a back-propagation algorithm that walks from the CXL DIMM backwards through FlexBus, uncore, caches, and finally to cores, proportionally attributing observed stall cycles to specific modules. This enables fine-grained localization of bottlenecks—for example, determining whether stalls stem from cache congestion, interconnect contention, or the memory device itself.
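The proportional attribution step can be illustrated with a small Python sketch. This is a simplified stand-in, not the authors' back-propagation algorithm: it assumes a per-module delay estimate is already available (the `module_delay` input is hypothetical) and simply splits the observed stall cycles in proportion to those delays, visiting modules from the device back towards the core.

```python
def attribute_stalls(total_stall_cycles, module_delay):
    """Split observed stall cycles across modules in proportion to delay.

    module_delay is an ordered mapping from the core side to the device
    side, e.g. {"core": ..., "L2": ..., "CXL DIMM": ...}. The values are
    hypothetical per-module delay estimates, not real counter readings.
    """
    total_delay = sum(module_delay.values())
    attribution = {}
    # Walk backwards, starting at the device and ending at the core.
    for module in reversed(list(module_delay)):
        share = module_delay[module] / total_delay if total_delay else 0.0
        attribution[module] = total_stall_cycles * share
    return attribution


# Made-up numbers: the CXL DIMM accounts for half of the total delay.
example = attribute_stalls(
    1000,
    {"core": 10, "L2": 20, "uncore": 30, "FlexBus": 40, "CXL DIMM": 100},
)
```

With these made-up inputs, the device receives half of the stall cycles because it contributes half of the total delay; a real attribution would need to account for overlap between outstanding requests, which this sketch ignores.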
PFAnalyzer – Contention and Culprit Flow Detection
When multiple memory streams (local + CXL) coexist, they compete for shared hardware resources. PFAnalyzer introduces a delay-based queueing model, inspired by Little’s Law and networking analysis, to quantify the queue occupancy per path. It can identify which flow is the culprit (causing congestion) versus which is the victim (suffering from interference). This allows PathFinder to not only detect that contention exists, but also pinpoint who is responsible.
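Little's Law states that the average number of requests in a queue equals the arrival rate times the average time each request spends there. The Python sketch below applies it per flow. The function names and the rule that the flow with the largest occupancy is the culprit are assumptions made for illustration; the paper's actual model and culprit criterion may differ.

```python
def queue_occupancy(arrival_rate, mean_delay):
    """Little's Law: average occupancy = arrival rate * mean delay."""
    return arrival_rate * mean_delay


def find_culprit(flows):
    """Return (culprit_name, occupancy_by_flow) for a shared queue.

    flows maps a flow name to (arrival_rate, mean_delay), for example in
    requests per nanosecond and nanoseconds. Picking the flow with the
    highest occupancy as the culprit is a simplifying assumption.
    """
    occupancy = {
        name: queue_occupancy(rate, delay)
        for name, (rate, delay) in flows.items()
    }
    culprit = max(occupancy, key=occupancy.get)
    return culprit, occupancy


# Made-up flows: a local stream and a CXL stream sharing one queue.
example_culprit, example_occupancy = find_culprit(
    {"local": (0.05, 80.0), "cxl": (0.02, 400.0)}
)
```

In this made-up example the CXL flow issues fewer requests per nanosecond, yet it holds about twice as many queue entries on average because each of its requests stays in the queue five times longer.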
PFMaterializer – Multi-Snapshot and Temporal Analysis
Execution behaviors change over time. PFMaterializer introduces a time-series database (e.g., InfluxDB) to store per-snapshot digests of path-level telemetry. It then performs cross-snapshot analysis to uncover persistent patterns such as data locality shifts, recurring contention phases, and underutilized resources. It also supports time-series clustering and correlation analysis, allowing users to relate memory behaviors to application phases or co-located workloads.
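One form of cross-snapshot analysis, finding recurring contention phases, can be sketched in a few lines of Python. This is only an illustration: an in-memory list stands in for the time-series database, and the snapshot fields (`ts`, `occupancy`), the threshold, and the function name are hypothetical rather than taken from PathFinder.

```python
def contention_phases(snapshots, metric, threshold):
    """Return (start_ts, end_ts) ranges where metric stays above threshold.

    snapshots is a time-ordered list of dicts, each with a "ts" key and
    the named metric. The field names are assumptions for this sketch.
    """
    phases = []
    start = None
    prev_ts = None
    for snap in snapshots:
        ts = snap["ts"]
        hot = snap[metric] > threshold
        if hot and start is None:
            start = ts                      # a new phase begins
        elif not hot and start is not None:
            phases.append((start, prev_ts))  # the phase ended at prev_ts
            start = None
        prev_ts = ts
    if start is not None:                   # phase still open at the end
        phases.append((start, prev_ts))
    return phases


# Made-up per-snapshot queue occupancy values.
example_snapshots = [
    {"ts": t, "occupancy": v} for t, v in enumerate([1, 9, 8, 2, 7, 3])
]
example_phases = contention_phases(example_snapshots, "occupancy", 5)
```

Each returned range marks a stretch of consecutive snapshots above the threshold, which could then be lined up against application phases or the activity of co-located workloads.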
Together, these techniques allow PathFinder to dissect CXL.mem accesses end-to-end and reveal insights into latency, contention, and locality.
Evaluation
The authors evaluate PathFinder on Intel Sapphire Rapids and Emerald Rapids platforms with 77 applications (SPEC CPU2017, PARSEC, Redis, etc.). Results show that PathFinder successfully classifies different memory paths, breaks down stall cycles, analyzes interference between local and remote memory flows, and diagnoses bandwidth allocation issues.
This result is significant because it provides the community with the first systematic, end-to-end tool to understand CXL.mem behaviors, enabling both researchers and practitioners to optimize applications and system design in disaggregated memory environments.
Q: Can PathFinder's techniques be applied to other complex networked memory systems, such as RDMA?
A: The authors clarified that their current work focuses only on CXL memory directly connected to the host via PCIe. They have not explored RDMA or IDM scenarios yet, although they acknowledged it as an interesting possible direction.
Personal thoughts
I particularly like that this paper bridges microarchitectural profiling and networking-inspired analysis techniques, offering a fresh perspective on how to study emerging memory systems. PathFinder is not only a tool but also a generalizable framework that could be extended to other interconnects and heterogeneous memory setups. On the downside, the reliance on PMU support may limit portability across vendors and platforms.

