# zymtrace profiler resource guide
Our profiler consists of two main components:
zymtrace profiler
: The host agent that manages our BPF unwinders and implements CPU profiling. It is the zymtrace distribution of the OTel eBPF agent.

zymtrace cuda profiler
: The GPU profiler, a library loaded into your CUDA workload via the `CUDA_INJECTION64_PATH` environment variable.
The zymtrace profiler ships with the CUDA profiler, so you only need to enable GPU profiling during installation. Refer to the install zymtrace profiler guide for details.
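As a minimal sketch of how the injection mechanism works, the CUDA profiler is enabled by pointing `CUDA_INJECTION64_PATH` at the profiler library before launching the workload. The library path below is an assumption for illustration; use the path from your actual installation:

```shell
# Hypothetical install path -- adjust to where your zymtrace installation
# placed the CUDA profiler library.
export CUDA_INJECTION64_PATH=/opt/zymtrace/lib/libzymtrace-cudaprofiler.so
echo "profiler library: ${CUDA_INJECTION64_PATH}"

# Then launch the CUDA workload as usual; the CUDA runtime loads the
# library during initialization, e.g.:
#   python3 train.py
```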
## Resource requirements
Our agents are designed to run with minimal overhead. The resource impact of each component is listed below.
### zymtrace profiler
| Resource | zymtrace profiler |
|---|---|
| CPU usage | Maximum 1% overhead in testing, typically much lower |
| Host memory | Up to 256 MB; Java workloads use slightly more |
| Storage | ~8 bytes/event, ~13.8 MB/day/core (at 20 Hz sampling) |
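The daily storage figure follows directly from the event size and sampling rate; a quick back-of-the-envelope check:

```shell
# 8 bytes/event * 20 samples/sec/core * 86400 sec/day
bytes_per_day_per_core=$(( 8 * 20 * 86400 ))
echo "${bytes_per_day_per_core} bytes/day/core"   # 13824000 bytes ~= 13.8 MB
```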
### zymtrace cuda profiler
On lightly loaded systems or small-to-medium workloads, the impact is usually negligible. For high-throughput or multi-GPU systems, this overhead may be more noticeable and should be factored into performance planning.
| Resource | zymtrace cuda profiler |
|---|---|
| CPU usage | One thread (up to ~1 logical core); ~25 µs per GPU kernel launch (e.g. 0.25 cores at 10k kernels/sec) |
| Host memory | ~314 MB (hard limit for profiler heap) |
| GPU memory | — |
| Storage | ~17.4 bytes/event, ~105.6 MB/day (at ~70 events/second) |
### Illustrative example
Consider a high-throughput system launching 10,000 GPU kernels per second:
- At ~25 µs overhead per kernel, the GPU profiler introduces approximately 250 ms of extra CPU time per second, or ~0.25 additional CPU cores
- This is in addition to the one dedicated thread the profiler always uses
- CUPTI may add some host memory overhead, though this depends on the workload type
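The arithmetic above can be checked directly:

```shell
kernels_per_sec=10000
overhead_us_per_kernel=25

# Extra CPU time the profiler adds each second, in microseconds.
extra_cpu_us_per_sec=$(( kernels_per_sec * overhead_us_per_kernel ))
echo "${extra_cpu_us_per_sec} us/sec"   # 250000 us = 250 ms -> ~0.25 cores
```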