Platform Architecture
The zymtrace platform is only available for on-premises/self-hosted installation, allowing you to host and manage zymtrace entirely within your infrastructure for full control over your data and setup. The platform consists of two main components:
- zymtrace profiler: An eBPF-based agent installed on each machine running your applications. It collects CPU and GPU performance profiles and host metrics, all via eBPF rather than by polling /proc, which is expensive due to repeated filesystem reads. Data is sent to the backend using either a custom stateful protocol designed for the high-throughput demands of fleet-wide profiling or the stateless OTel Profiles protocol. Learn more →
- Backend services: The backend services store, process, and analyze performance profiles and metrics. All core backend services are written in Rust 🦀; the front-end is a combination of ReactJS and WASM. All data is also accessible via a REST API and MCP, directly from your zymtrace instance.
The diagram below depicts a high-level architecture of the components:
Need a hosted zymtrace backend?
We can provision a dedicated cloud deployment for you. Email us at [email protected] or request access here
Components overview​
zymtrace profiler​
The zymtrace profiler runs on each node, deployed either as a Kubernetes DaemonSet or as a standalone binary on a standard VM. It collects performance profiles of resource-intensive processes on the node, aggregates and compresses them, and sends them to the backend via gRPC. TLS is supported by default, with an option to disable it if needed.
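On Kubernetes, a healthy rollout means one profiler pod per node. Here is a minimal sketch of how to verify that; the namespace, DaemonSet name, and label are assumptions, and the real values come from the installation guide:
```bash
# Verify the DaemonSet rollout: one profiler pod should run on every node.
# Namespace, DaemonSet name, and label are assumptions; use the values from
# the profiler installation guide.
kubectl -n zymtrace get daemonset zymtrace-profiler
kubectl -n zymtrace get pods -l app=zymtrace-profiler -o wide
```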
Here's a more detailed description of how the profiler works:
- Unwinder eBPF programs are loaded into the kernel.
- The kernel verifies the safety of the BPF program. If approved, the program is attached to probes and triggered upon specific events.
- The eBPF programs collect data and pass it to userspace via maps.
- The agent retrieves the collected data from maps. This data includes process-specific and interpreter-specific meta-information, which helps the eBPF unwinder programs perform mixed-stack unwinding across different languages (e.g., Python calling into C libraries).
- The agent pushes stack traces, metrics, and metadata to the zymtrace gateway service for routing to the appropriate backend services for analysis.
- With this data in the backend, you can easily identify and optimize the most inefficient functions across your entire infrastructure.
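To observe the kernel-side pieces described above (the loaded unwinder programs and the maps they share with userspace), bpftool can list them on a running node. This is a generic, illustrative check rather than a zymtrace-specific command; program and map names vary by kernel and profiler release:
```bash
# List loaded eBPF programs and maps (requires root). The unwinder programs
# described above show up here; exact names vary by profiler release.
sudo bpftool prog list
sudo bpftool map list
# Dump a specific map by id to inspect the data handed to userspace.
sudo bpftool map dump id <MAP_ID>
```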
zymtrace GPU profiler​
GPU utilization, temperature, power: those metrics are easy to collect, but they don't explain the inner workings of the GPU or why it is idling. That's why we didn't start with NVML or DCGM metrics. We started deeper, closer to the hardware. That's where real insight lives, not just in metrics.
zymtrace profiles down to CUDA kernels, disassembles CUDA SASS mnemonics, exposes GPU stall reasons, and correlates it all back to the CPU traces that launched the kernels, without requiring recompilation.
Support for additional accelerators is actively in progress, including AWS Neuron (Inferentia/Trainium), AMD ROCm, and TPUs.
Learn more: GPU Profiler deep dive →
zymtrace backend services​
The zymtrace backend is designed to store, process and visualize profiling data efficiently. Below is an overview of the key backend services and their roles:
gateway service​
The gateway service acts as the main entry point for all user requests, using Envoy Proxy for request routing and load balancing. It routes requests to the appropriate backend services (UI, ingest, symDB) and handles mTLS when configured.
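Because the gateway is the single entry point, a quick TLS handshake check against it is a useful first step when the profiler or UI cannot connect. A minimal sketch, assuming the gateway is reachable at your-zymtrace-instance.com:443:
```bash
# Inspect the certificate the gateway (Envoy) presents. Helpful when deciding
# whether to enable mTLS, or when debugging TLS errors on the profiler side.
# The hostname is a placeholder for your own deployment.
openssl s_client -connect your-zymtrace-instance.com:443 -brief </dev/null
```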
UI service​
The UI service serves the web-based user interface for zymtrace, built with ReactJS and WASM. Users access the UI through the gateway service, which routes web requests to this service.
ingest service​
The ingest service receives profiling data and metrics from the zymtrace profiler via the gateway service, and stores them in ClickHouse for ultra-fast querying and analysis.
In addition to CPU and GPU profiling data, the ingest service handles metrics from:
- GPU performance metrics: utilization, memory, temperature, SM efficiency, SM occupancy, Tensor Core utilization, PCIe throughput
- AWS Neuron: metrics from Inferentia and Trainium accelerators
- NVIDIA Dynamo / Triton: inference server metrics
- vLLM: LLM serving metrics
- SGLang: structured generation serving metrics
- Host metrics: CPU, memory, disk throughput and latency, collected via eBPF rather than by polling /proc, which is expensive due to repeated filesystem reads
Every metric data point is correlated to the underlying CPU and GPU profiles, so you can move directly from a metric anomaly to the exact code path responsible. Data is received over the wire protocol described below.
symDB service​
The symDB service handles symbol resolution upon request. It retrieves native symbols stored in S3/Minio or fetches them from the global symbolization service. This service is critical for converting raw profiling data into meaningful stack traces by resolving both native and interpreted symbols. Learn more: Symbolization deep dive →
identity service​
zymtrace lets you segregate your profiling data into separate projects within an organization. The identity service currently manages these projects: it associates incoming profiling data from the ingest service with the correct project and lays the foundation for future user authentication and role-based access control.
Wire protocol​
By default, the profiler communicates with the backend using a custom stateful wire protocol that is 6x more compute-efficient than the OpenTelemetry Profiles proto. Because the protocol is stateful, the agent only transmits new information rather than re-sending context with every payload, which is what the stateless OTel proto requires. This keeps agent CPU overhead and network egress minimal at fleet scale.
You can switch between the custom zymtrace protocol and OTel Profiles protocol at any time without losing data. See Profiler ENV & CLI Args. The OTel Profiles specification recently reached alpha; the zymtrace founders were core contributors, part of the team that donated the profiler and helped define the spec inside the OpenTelemetry project.
zymtrace is fully OTel-compliant for resource attributes and metadata. All profiling data is tagged using OpenTelemetry resource attributes, usable directly in filters and queries.
To switch to the OTel Profiles protocol, set -zymtrace-protocol=false. See Profiler ENV & CLI Args for all configuration options.
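For example, a minimal sketch of switching a standalone profiler to the OTel Profiles protocol; only the -zymtrace-protocol flag is taken from the documentation above, and the binary name is a placeholder:
```bash
# Switch the profiler to the stateless OTel Profiles protocol.
# The binary name is a placeholder; -zymtrace-protocol=false is the documented
# flag (see Profiler ENV & CLI Args for the full list).
./zymtrace-profiler -zymtrace-protocol=false
```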
Storage​
| Storage | Purpose | Docs |
|---|---|---|
| ClickHouse | Profiling events and analytics | Guide |
| PostgreSQL | User data, metadata, project configuration | Guide |
| S3 / MinIO | Native debug symbols | Guide |
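When debugging the backend, it can help to confirm each store is reachable. A minimal sketch using the stock client tools; hostnames, credentials, database, and bucket names are placeholders, and the real values live in the guides linked above:
```bash
# Connectivity checks for the three stores. All names are placeholders.
clickhouse-client --host clickhouse.internal --query "SELECT 1"
psql "postgresql://zymtrace@postgres.internal/zymtrace" -c "SELECT 1;"
mc ls minio/zymtrace-symbols/
```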
MCP​
zymtrace implements the Model Context Protocol, letting AI agents in your IDE query profiling data using natural language. Learn more →
```bash
claude mcp add zymtrace \
  --transport http \
  https://your-zymtrace-instance.com/mcp \
  --header "Authorization: Bearer YOUR_TOKEN_HERE"
```
API​
Every piece of data in zymtrace is accessible via a REST API. Click API Explorer inside your zymtrace instance to browse and test every endpoint interactively.
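As a sketch, any HTTP client works once you have a token; the path below is intentionally left generic, since the actual routes are listed in the API Explorer:
```bash
# Hypothetical request shape; find real endpoints in the API Explorer of
# your zymtrace instance.
curl -H "Authorization: Bearer YOUR_TOKEN_HERE" \
  "https://your-zymtrace-instance.com/api/..."
```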

Get started​
zymtrace backend (On-Premises)​
Refer to our on-premises installation guide for detailed instructions.
zymtrace profiler​
Refer to the profiler host agent installation guide for more details.