Zymtrace GPU Profiler Architecture
Overview​
zymtrace provides visibility down to CUDA kernels, disassembles CUDA SASS mnemonics, exposes GPU stall reasons, and correlates it all back to the CPU traces that launched the kernels, without requiring recompilation.
Correlating CPU stack traces to CUDA kernels is a key differentiator in our approach, because most GPU idle cycles happen when the GPU is waiting on the CPU to finish execution.
GPU profiling provides detailed, CUDA kernel-level analysis for deep optimization insights. For high-level monitoring, consider starting with GPU Metrics. Metrics help identify performance trends and anomalies, while profiling validates hypotheses with detailed CUDA kernel execution analysis. Together, they enable a complete GPU performance analysis workflow.
How the CUDA profiler works​
The CUDA profiler is not a standalone component. It's a module within the zymtrace profiler, which can be enabled by simply setting a flag. It consists of libzymtracecudaprofiler.so, a library that is shipped with the zymtrace profiler. If enabled, the library is extracted into a specified directory upon startup and can be loaded into your CUDA workload via CUDA's CUDA_INJECTION64_PATH environment variable. Refer to the Kubernetes and Docker examples here.
The zymtrace CUDA profiler collects information about kernel launches and completions. The profiler also samples high-granularity information about precisely which GPU instructions (SASS) are running on the GPU's compute cores and the reasons for what is currently preventing the kernel from making progress (stall reasons). These stall reasons provide clear indications on why the kernel is slow. For example, there are stall reasons indicating that the GPU is waiting for a slow read from global memory or when it is waiting for an oversubscribed math pipeline. This information is pre-aggregated within the CUDA profiler and then sent out to zymtrace-profiler.
Within libzymtracecudaprofiler.so, we place callbacks on all CUDA functions that can launch a CUDA kernel. zymtrace profiler detects the CUDA profiler and places a uprobe on an exported function that is called each time that a kernel is launched. The uprobe, in turn, invokes the CPU stack unwinder that we're also using for CPU profiling in general. The BPF code additionally collects the necessary information about the kernel launch that is sent along with the stack trace to enable correlation with the data that is collected in the CUDA profiler library.
Once both the stack traces collected in BPF and the kernel launch/completion data arrive in zymtrace profiler, they are merged together and sent out as a single combined stack trace that spans from the user-mode code that enqueued the kernel into the CUDA kernel that got invoked on the GPU and, if enabled, the individual instructions and stall reasons.
These traces are then sent out to our backend, ready to be visualized as a flamegraph within our UI.