# zymtrace profiler resource guide
Our profiler consists of two main components:
zymtrace profiler
: The host agent that manages our BPF unwinders and implements CPU profiling. It is the zymtrace distribution of the OTel eBPF agent.

zymtrace cuda profiler
: The GPU profiler, a library loaded into your CUDA workload via the `CUDA_INJECTION64_PATH` environment variable.
The zymtrace profiler ships with the CUDA profiler, so you only need to enable GPU profiling during installation. Refer to the install zymtrace profiler guide for details.
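As a minimal sketch of how the injection mechanism works, the CUDA profiler is enabled by pointing `CUDA_INJECTION64_PATH` at the profiler library before launching the workload. The library path below is an assumption for illustration; use the path from your actual installation:

```shell
# Hypothetical install path -- adjust to where your zymtrace installation
# placed the CUDA profiler library.
export CUDA_INJECTION64_PATH=/opt/zymtrace/lib/libzymtrace-cudaprofiler.so
echo "profiler library: ${CUDA_INJECTION64_PATH}"

# Then launch the CUDA workload as usual; the CUDA runtime loads the
# library during initialization, e.g.:
#   python3 train.py
```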
## Resource requirements
Our agents are designed to run with minimal overhead. The resource impact of each component is listed below.
### zymtrace profiler
| Resource | zymtrace profiler |
|---|---|
| CPU usage | Maximum 1% overhead in testing, typically much lower |
| Host memory | Up to 256 MB; Java workloads use slightly more |
| Storage | ~8 bytes/event, ~13.8 MB/day/core (at 20 Hz sampling) |
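The daily storage figure follows directly from the event size and sampling rate; a quick back-of-the-envelope check:

```shell
# 8 bytes/event * 20 samples/sec/core * 86400 sec/day
bytes_per_day_per_core=$(( 8 * 20 * 86400 ))
echo "${bytes_per_day_per_core} bytes/day/core"   # 13824000 bytes ~= 13.8 MB
```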
### zymtrace cuda profiler
On lightly loaded systems or small-to-medium workloads, the impact is usually negligible. For high-throughput or multi-GPU systems, this overhead may be more noticeable and should be factored into performance planning.
| Resource | zymtrace cuda profiler |
|---|---|
| CPU usage | One thread (up to ~1 logical core); ~25 µs per GPU kernel launch (e.g. 0.25 cores at 10k kernels/sec) |
| Host memory | ~314 MB (hard limit for profiler heap) |
| GPU memory | — |
| Storage | ~17.4 bytes/event, ~105.6 MB/day (at ~70 events/second) |
### Illustrative example
Consider a high-throughput system launching 10,000 GPU kernels per second:
- At ~25 µs overhead per kernel, the GPU profiler introduces approximately 250 ms of extra CPU time per second, or ~0.25 additional CPU cores
- This is in addition to the one dedicated thread the profiler always uses
- CUPTI may add some host memory overhead, though this depends on the workload type
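The arithmetic above can be checked directly:

```shell
kernels_per_sec=10000
overhead_us_per_kernel=25

# Extra CPU time the profiler adds each second, in microseconds.
extra_cpu_us_per_sec=$(( kernels_per_sec * overhead_us_per_kernel ))
echo "${extra_cpu_us_per_sec} us/sec"   # 250000 us = 250 ms -> ~0.25 cores
```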