Manage storage

All zymtrace backend services are designed to be stateless, so you can use Horizontal Pod Autoscaling (HPA) through the zymtrace Helm chart to adjust compute resources dynamically based on demand. Scaling compute is straightforward; scaling storage requires more careful planning and configuration. This guide provides recommendations for sizing and scaling your zymtrace storage across different workload sizes and retention requirements.

Storage Estimation

Use the following guidelines to estimate your storage needs during deployment planning:

CPU Profiling Only

The zymtrace profiler stores ~13.8 MB per core per day at a 20 Hz sampling rate.

Formula:

Storage = Number of CPU cores × 13.8 MB × Days of retention

For example, to profile 1000 cores with 14 days of data retention:

1000 cores × 13.8 MB × 14 days = 193,200 MB ≈ 188.7 GB

Data accumulates gradually until the retention window is full, so treat this figure as a steady-state estimate for capacity planning rather than an up-front requirement.

GPU Profiling Enabled

The zymtrace CUDA profiler stores ~105.6 MB per GPU per day under typical workloads (~70 events/second).

Formula:

Storage = Number of GPUs × 105.6 MB × Days of retention

For example, to profile 10 GPUs with 14 days of data retention:

10 GPUs × 105.6 MB × 14 days = 14,784 MB ≈ 14.4 GB
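The two formulas above can be combined into a small planning helper. The following is a minimal sketch, not part of zymtrace: the function names are hypothetical, the per-unit constants are the figures quoted in this guide, and it assumes that GPU profiling data is stored in addition to CPU profiling data, so the two estimates are simply added.

```python
# Per-unit daily figures quoted in this guide.
CPU_MB_PER_CORE_PER_DAY = 13.8   # at 20 Hz sampling
GPU_MB_PER_GPU_PER_DAY = 105.6   # typical workloads, ~70 events/second


def estimate_storage_mb(cpu_cores: int, gpus: int, retention_days: int) -> float:
    """Estimate steady-state storage in MB (hypothetical helper).

    Assumption: CPU and GPU profiling data are additive.
    """
    daily_mb = cpu_cores * CPU_MB_PER_CORE_PER_DAY + gpus * GPU_MB_PER_GPU_PER_DAY
    return daily_mb * retention_days


def mb_to_gb(mb: float) -> float:
    """Convert MB to GB using 1 GB = 1,024 MB, as in the examples above."""
    return mb / 1024


if __name__ == "__main__":
    total_mb = estimate_storage_mb(cpu_cores=1000, gpus=10, retention_days=14)
    print(f"{total_mb:,.0f} MB = {mb_to_gb(total_mb):.1f} GB")
```

With the example values from this section (1000 cores, 10 GPUs, 14 days), the estimate is the sum of the two worked examples: 193,200 MB + 14,784 MB = 207,984 MB, or about 203.1 GB.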

Tip: Per unit, GPU profiling generates roughly 7.7× as much data per GPU as CPU profiling does per core (105.6 MB vs. 13.8 MB per day).

Configure max data retention

Use the Helm chart to define how long profiling data is retained:

--set global.dataRetentionDays=7  # Set to 0 to retain data indefinitely

Reduce generated data

There are two effective strategies for controlling the volume of profiling data generated by the zymtrace profiler:

Reduce the sampling rate

Lowering the sampling rate is a simple and direct way to cut data volume. For example, reducing the rate from the default 20 Hz to 10 Hz halves the amount of data collected. The main trade-off is a loss of granularity, especially when inspecting short time windows on individual hosts. For instance, zooming into a 5-minute window on a specific machine may not yield a statistically significant flamegraph at lower sampling rates. However, if your goal is a general understanding of where your fleet spends its CPU cycles, the reduction is unlikely to affect the overall insights.
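To put numbers on this trade-off, the sketch below scales the 13.8 MB per core per day figure linearly with the sampling rate and computes an upper bound on the samples available in a time window. Both helpers are hypothetical and not part of zymtrace; linear scaling is an assumption that matches the "10 Hz halves the data" statement above, and the sample count is only an upper bound because it assumes every core is busy for the whole window.

```python
BASELINE_HZ = 20                      # sampling rate used for the 13.8 MB figure
BASELINE_MB_PER_CORE_PER_DAY = 13.8   # figure quoted in this guide


def cpu_mb_per_core_per_day(sampling_hz: float) -> float:
    """Scale the baseline storage figure linearly with the sampling rate.

    Assumption: data volume is proportional to the sampling rate.
    """
    if sampling_hz <= 0:
        raise ValueError("sampling_hz must be positive")
    return BASELINE_MB_PER_CORE_PER_DAY * sampling_hz / BASELINE_HZ


def max_samples_in_window(sampling_hz: float, window_seconds: float, cores: int = 1) -> int:
    """Upper bound on samples in a window, assuming all cores are fully busy."""
    return int(sampling_hz * window_seconds * cores)


if __name__ == "__main__":
    for hz in (20, 10, 5):
        print(
            f"{hz:>2} Hz: {cpu_mb_per_core_per_day(hz):.2f} MB/core/day, "
            f"at most {max_samples_in_window(hz, 5 * 60)} samples per core in 5 minutes"
        )
```

Under these assumptions, a single fully busy core yields at most 6,000 samples in a 5-minute window at 20 Hz and at most 3,000 at 10 Hz, which is why short windows on individual hosts lose detail first.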

Enable Probabilistic sampling

Probabilistic sampling allows you to reduce storage costs by collecting a representative sample of profiling data. This method decreases storage costs with a visibility trade-off, as not all Profiling Host Agents will have profile collection enabled at all times.

When configured, the agent rolls a die every n seconds (configurable) and, based on a configurable percentage chance, decides whether profiling is enabled on the host for that period. The sampling rate stays at 20 Hz, but the profiler is switched on and off for random periods. An upside is that this also reduces the amount of data the ingest service receives; the cost is that sample graphs look jumpy if you zoom too far into the time axis. This approach also requires that your workload is balanced uniformly across your fleet, i.e. all hosts run essentially the same workload. Otherwise, the statistical accuracy of the results may be skewed.

Configure probabilistic sampling

To configure probabilistic sampling, set the -probabilistic-threshold and -probabilistic-interval options.

Set the -probabilistic-threshold option to an unsigned integer between 1 and 99 to enable probabilistic profiling. At every probabilistic interval, a random number between 0 and 99 is chosen. If the threshold you have set is greater than this random number, the agent collects profiles from the system for the duration of that interval. The default value is 100, which means profiles are always collected.

Set the -probabilistic-interval option to a time duration to define the time interval for which probabilistic profiling is either enabled or disabled. The default value is 1 minute.

For example:

sudo ./zymtrace-profiler -probabilistic-threshold=70 -probabilistic-interval=5m30s

This sets a threshold of 70% and an interval of 5 minutes and 30 seconds.

It is also possible to use the environment variables ZYMTRACE_PROBABILISTIC_THRESHOLD=70 and ZYMTRACE_PROBABILISTIC_INTERVAL=5m30s to define this configuration.
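The threshold rule described in this section can be modeled in a few lines. The sketch below is an illustrative simulation of the decision rule only, not the agent's actual implementation; the function names and the fixed seed are hypothetical. It draws a random number between 0 and 99 for each interval and enables profiling when the threshold is greater than that number.

```python
import random


def profiling_enabled(threshold: int, rng: random.Random) -> bool:
    """One decision: enabled if the threshold exceeds a random number in 0..99."""
    return threshold > rng.randrange(100)


def simulate_enabled_fraction(threshold: int, intervals: int, seed: int = 0) -> float:
    """Fraction of intervals with profiling enabled on a single simulated host."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    enabled = sum(profiling_enabled(threshold, rng) for _ in range(intervals))
    return enabled / intervals


if __name__ == "__main__":
    fraction = simulate_enabled_fraction(threshold=70, intervals=10_000)
    print(f"profiling enabled in {fraction:.1%} of intervals")
```

Over many intervals the enabled fraction converges to threshold/100, so a threshold of 70 should keep profiling active about 70% of the time. If data volume is proportional to the time profiling is enabled (an assumption), storage shrinks by roughly the same factor.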
