zymtrace profiler resource guide

Our profiler consists of two main components:

  • zymtrace profiler: The host agent that manages our BPF unwinders and implements CPU profiling. It's the zymtrace distribution of the OTel eBPF agent.
  • zymtrace cuda profiler: The GPU profiler, a library loaded into your CUDA workload via the CUDA_INJECTION64_PATH environment variable.

The zymtrace profiler ships with the CUDA profiler, so you only need to enable GPU profiling during installation. Refer to install zymtrace profiler for setup instructions.
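
For reference, the injection mechanism amounts to setting CUDA_INJECTION64_PATH in the workload's environment before the CUDA runtime initializes. The sketch below illustrates this in Python; the library path and workload command are placeholders, not actual zymtrace install locations.

```python
import os
import subprocess

# Placeholder path: substitute the location of the zymtrace CUDA profiler
# library from your installation.
PROFILER_LIB = "/opt/zymtrace/libcudaprofiler.so"

env = dict(os.environ)
# CUDA_INJECTION64_PATH instructs CUDA to load the given shared library
# into the process, which is how the GPU profiler attaches to the workload.
env["CUDA_INJECTION64_PATH"] = PROFILER_LIB

# Launch the CUDA workload (placeholder command) with the profiler injected.
subprocess.run(["python", "train.py"], env=env, check=True)
```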

Resource requirements

Our agents are designed to run with minimal overhead. Here are the resource impacts for each component:

zymtrace profiler

| Resource    | zymtrace profiler                                      |
| ----------- | ------------------------------------------------------ |
| CPU Usage   | Maximum 1% overhead in testing, typically much lower   |
| Host Memory | Up to 256 MB, with Java workloads using slightly more  |
| Storage     | ~8 bytes/event, ~13.8 MB/day/core (at 20 Hz sampling)  |

zymtrace cuda profiler

On lightly loaded systems or small-to-medium workloads, the impact is usually negligible. For high-throughput or multi-GPU systems, this overhead may be more noticeable and should be factored into performance planning.

| Resource    | zymtrace cuda profiler                                                                                  |
| ----------- | ------------------------------------------------------------------------------------------------------- |
| CPU Usage   | One thread (up to ~1 logical core); ~25 µs per GPU kernel launch (e.g. 0.25 cores for 10k kernels/sec)   |
| Host Memory | ~314 MB (hard limit for profiler heap)                                                                    |
| GPU Memory  | —                                                                                                         |
| Storage     | ~17.4 bytes/event, ~105.6 MB/day (at ~70 events/second)                                                   |
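
The storage figures in both tables follow directly from the event size and event rate. A minimal sketch of the arithmetic, using the values quoted above:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_storage_mb(bytes_per_event: float, events_per_second: float) -> float:
    """Estimated profiling data volume in MB per day."""
    return bytes_per_event * events_per_second * SECONDS_PER_DAY / 1e6

# zymtrace profiler: ~8 bytes/event at 20 Hz sampling, per core.
print(daily_storage_mb(8, 20))     # ~13.8 MB/day/core

# zymtrace cuda profiler: ~17.4 bytes/event at ~70 events/second.
print(daily_storage_mb(17.4, 70))  # ~105 MB/day
```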

Illustrative Example

Consider a high-throughput system launching 10,000 GPU kernels per second:

  • At ~25 µs overhead per kernel, the GPU profiler introduces approximately 250 ms of extra CPU time per second, or ~0.25 additional CPU cores (see the sketch after this list)
  • This is in addition to the one dedicated thread the profiler always uses
  • CUPTI may add some host memory overhead, though this depends on the workload type
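
The CPU figure in this example can be reproduced directly from the per-launch overhead and the launch rate:

```python
# Reproducing the illustrative example above.
kernel_launches_per_second = 10_000
overhead_per_launch_seconds = 25e-6  # ~25 µs per GPU kernel launch

extra_cpu_time_per_second = kernel_launches_per_second * overhead_per_launch_seconds
print(extra_cpu_time_per_second)  # 0.25 -> ~250 ms of CPU time per second,
                                  # i.e. ~0.25 additional CPU cores
```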