Benchmarks
This document describes the benchmark methodology, how to run each benchmark, and the target performance numbers.
Overview
ZigBolt ships a comprehensive benchmark suite covering latency, throughput, codec performance, and data structure operations:
| Benchmark | Binary | What it Measures |
|---|---|---|
| Ping-Pong | bench_ping_pong | IPC round-trip latency (RTT) |
| Throughput | bench_throughput | IPC single-direction message rate |
| UDP RTT | bench_udp_rtt | UDP loopback round-trip latency |
| Codec Throughput | bench_codec_throughput | WireCodec encode/decode rate (single + batch) |
| SPSC Latency | bench_spsc_latency | SPSC ring buffer write/read latency |
| MPSC Latency | bench_mpsc_latency | MPSC ring buffer contention latency |
| LogBuffer Throughput | bench_logbuffer_throughput | LogBuffer claim/commit/read rate |
| IPC Multi-Size | bench_ipc_multisize | IPC latency across message sizes |
| Full Suite | bench_run_all | All-in-one suite with JSON output |
All benchmarks are compiled with -OReleaseFast for maximum optimization.
Building
    zig build bench
This compiles all benchmarks and places them in zig-out/bin/.
To build and run a single benchmark:
    zig build && ./zig-out/bin/bench_ping_pong
    zig build && ./zig-out/bin/bench_throughput
    zig build && ./zig-out/bin/bench_udp_rtt
    zig build && ./zig-out/bin/bench_codec_throughput
    zig build && ./zig-out/bin/bench_spsc_latency
    zig build && ./zig-out/bin/bench_mpsc_latency
    zig build && ./zig-out/bin/bench_logbuffer_throughput
    zig build && ./zig-out/bin/bench_ipc_multisize
    zig build && ./zig-out/bin/bench_run_all
The bench_run_all binary runs all benchmarks in sequence and outputs a summary
table plus a bench/results.json file for CI integration.
Methodology
Ping-Pong (IPC RTT)
What: Measures the time between publishing a message into an IPC channel and immediately polling it back in the same process. This captures the raw shared memory write/read latency without cross-process scheduling overhead.
Procedure:
- Create an IPC channel (/zigbolt_bench_pp) with 1 MB term length, pre-faulted pages
- Warm up with 10,000 messages (discarded)
- Recreate the channel to start clean
- For each of 100,000 measurement iterations:
  - Record send_time via timestampNs()
  - Publish a 32-byte message containing the timestamp
  - Poll it back immediately
  - Record recv_time, compute rtt = recv_time - send_time
  - Add RTT to HDR histogram
- Report percentiles: min, mean, p50, p90, p99, p99.9, p99.99, max
Configuration:
- Message size: 32 bytes
- Term length: 1 MB (1,048,576 bytes)
- Warmup: 10,000 messages
- Measurement: 100,000 messages
- Pre-fault: enabled
Target:
- p50 < 200 ns
- p99 < 1,000 ns
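The procedure above can be sketched as follows. This is an illustration in Python, not ZigBolt's Zig code: a `collections.deque` stands in for the IPC channel, `time.perf_counter_ns()` for `timestampNs()`, and a sorted list for the HDR histogram. The names `run_ping_pong` and `percentile` are assumptions, and the absolute numbers it produces say nothing about ZigBolt's latency.

```python
import math
import struct
import time
from collections import deque

MSG_SIZE = 32  # bytes, as in the configuration above


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted list."""
    n = len(sorted_samples)
    rank = math.ceil(round(n * pct / 100.0, 6))  # round() guards float noise
    return sorted_samples[min(max(rank, 1), n) - 1]


def run_ping_pong(warmup=10_000, iterations=100_000):
    channel = deque()  # stand-in for the shared-memory channel
    pad = bytes(MSG_SIZE - 8)

    def one_round_trip():
        send_time = time.perf_counter_ns()
        channel.append(struct.pack("<q", send_time) + pad)  # publish
        msg = channel.popleft()                             # poll it back
        recv_time = time.perf_counter_ns()
        return recv_time - struct.unpack_from("<q", msg)[0]

    for _ in range(warmup):
        one_round_trip()  # warmup samples are discarded
    channel.clear()       # mirrors "recreate the channel to start clean"
    return sorted(one_round_trip() for _ in range(iterations))


if __name__ == "__main__":
    samples = run_ping_pong()
    for pct in (50, 90, 99, 99.9, 99.99):
        print(f"p{pct}: {percentile(samples, pct)} ns")
```

Embedding the timestamp in the message, rather than keeping it in a local variable, mirrors the procedure and makes the same loop reusable when sender and receiver are separate.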
Throughput (IPC)
What: Measures the maximum sustained message publish rate through an IPC channel, with periodic polling to prevent buffer exhaustion.
Procedure:
- Create an IPC channel (/zigbolt_bench_tp) with 4 MB term length
- Record start timestamp
- Publish 10,000,000 messages of 64 bytes each
- On publish failure (buffer full): poll 1,024 messages, retry
- Every 10,000 publishes: poll up to 10,000 messages
- Record end timestamp
- Compute: msg/sec = count / elapsed, MB/sec = msg/sec * msg_size / 1 MB
Configuration:
- Message size: 64 bytes
- Term length: 4 MB
- Message count: 10,000,000
- Pre-fault: enabled
Target:
- > 50 million messages/second
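A minimal sketch of the publish loop with back-pressure and of the rate formulas, written in Python for illustration. `BoundedChannel` is a hypothetical toy, not ZigBolt's IPC channel, and the sketch assumes that "1 MB" means 1,048,576 bytes, as stated for the ping-pong term length.

```python
import time
from collections import deque

MB = 1_048_576  # assumption: 1 MB = 1,048,576 bytes


class BoundedChannel:
    """Toy stand-in for an IPC channel with a fixed term length."""

    def __init__(self, term_length):
        self.term_length = term_length
        self.used = 0
        self.queue = deque()

    def publish(self, msg):
        if self.used + len(msg) > self.term_length:
            return False  # buffer full
        self.queue.append(msg)
        self.used += len(msg)
        return True

    def poll(self, limit):
        polled = 0
        while self.queue and polled < limit:
            self.used -= len(self.queue.popleft())
            polled += 1
        return polled


def rates(count, elapsed_s, msg_size):
    msg_per_sec = count / elapsed_s
    return msg_per_sec, msg_per_sec * msg_size / MB


def run_throughput(count=10_000_000, msg_size=64, term_length=4 * MB):
    channel = BoundedChannel(term_length)
    msg = bytes(msg_size)
    start = time.perf_counter()
    for i in range(1, count + 1):
        while not channel.publish(msg):
            channel.poll(1_024)   # buffer full: drain, then retry
        if i % 10_000 == 0:
            channel.poll(10_000)  # periodic drain
    return rates(count, time.perf_counter() - start, msg_size)
```

For example, `rates(10_000_000, 0.2, 64)` gives 50,000,000 msg/sec and about 3051.8 MB/sec, which is the target rate expressed as bandwidth.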
UDP RTT (Loopback)
What: Measures UDP round-trip latency over the loopback interface. Sends a datagram from one socket and receives it on another, both bound to localhost.
Procedure:
- Create sender UDP channel (port 44445, non-blocking)
- Create receiver UDP channel (port 44444, non-blocking)
- Warm up with 5,000 messages (discarded)
- Drain any remaining datagrams
- For each of 50,000 measurement iterations:
  - Record send_time, embed in 32-byte message
  - Send via sender socket to receiver’s port
  - Busy-poll receiver socket (up to 10,000 attempts)
  - Record recv_time, compute RTT
  - Add to HDR histogram
- Report percentiles
Configuration:
- Message size: 32 bytes
- Ports: 44444 (receiver), 44445 (sender)
- Warmup: 5,000 messages
- Measurement: 50,000 messages
- Non-blocking: enabled
Target:
- p50 < 5 us (expected to be lower with io_uring on Linux)
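The measurement loop can be approximated with plain sockets. The sketch below uses Python's standard `socket` module rather than ZigBolt's UDP channel, binds to OS-assigned ports instead of 44444/44445 to avoid collisions, and omits the warmup and drain steps for brevity. The function name is an assumption.

```python
import socket
import struct
import time

MSG_SIZE = 32  # bytes


def udp_loopback_rtt(iterations=50_000, max_polls=10_000):
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        sender.bind(("127.0.0.1", 0))
        receiver.setblocking(False)
        sender.setblocking(False)
        target = receiver.getsockname()
        pad = bytes(MSG_SIZE - 8)
        rtts, lost = [], 0
        for _ in range(iterations):
            send_time = time.perf_counter_ns()
            sender.sendto(struct.pack("<q", send_time) + pad, target)
            for _attempt in range(max_polls):  # busy-poll, bounded
                try:
                    data, _addr = receiver.recvfrom(MSG_SIZE)
                except BlockingIOError:
                    continue  # nothing queued yet
                recv_time = time.perf_counter_ns()
                rtts.append(recv_time - struct.unpack_from("<q", data)[0])
                break
            else:
                lost += 1  # gave up after max_polls attempts
        return sorted(rtts), lost
    finally:
        receiver.close()
        sender.close()
```

Bounding the busy-poll keeps a dropped datagram from hanging the benchmark; the iteration is counted as lost instead.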
WireCodec Throughput
What: Measures the raw encode/decode throughput of the comptime WireCodec
for TickMessage (32B) and OrderMessage (48B), including both single-message
and batch (64-message) modes.
Procedure:
- Warm up with 100,000 encode operations (discarded)
- For 10,000,000 iterations:
- Encode a message with varying fields (prevents constant-folding)
- Accumulate a sink byte to prevent dead-code elimination
- Repeat for decode with doNotOptimizeAway on the result
- Repeat for batch encode/decode (64 messages per batch)
- Report: ns/msg, M/sec, MB/sec bandwidth
Configuration:
- Message types: TickMessage (32B), OrderMessage (48B)
- Iterations: 10,000,000
- Batch size: 64 messages
- Anti-optimization: varying input fields + sink accumulator
Target:
- Encode: < 10 ns/msg (> 100M msg/sec)
- Decode: < 10 ns/msg (> 100M msg/sec)
- Batch encode: > 150M msg/sec
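The three reported units are related by simple arithmetic, which is why "< 10 ns/msg" and "> 100M msg/sec" are the same target. A small Python sketch of the conversions follows; the helper names are illustrative and 1 MB is assumed to be 1,048,576 bytes.

```python
MB = 1_048_576  # assumption: 1 MB = 1,048,576 bytes


def msgs_per_sec(ns_per_msg):
    """Convert a per-message cost in nanoseconds to a message rate."""
    return 1e9 / ns_per_msg


def bandwidth_mb_per_sec(ns_per_msg, msg_size):
    """Bytes moved per second, expressed in MB/sec."""
    return msgs_per_sec(ns_per_msg) * msg_size / MB


if __name__ == "__main__":
    print(msgs_per_sec(10) / 1e6)  # 10 ns/msg is exactly 100 M msg/sec
    # A 32-byte TickMessage encoded in 3.2 ns:
    print(msgs_per_sec(3.2) / 1e6, bandwidth_mb_per_sec(3.2, 32))
```

At 3.2 ns/msg the rate is 312.5 M msg/sec, and for 32-byte messages that corresponds to roughly 9,536 MB/sec.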
SPSC Ring Buffer Latency
What: Measures the single-producer single-consumer ring buffer write/read round-trip latency across multiple message sizes.
Procedure:
- Initialize a 64K-entry SPSC ring buffer
- Warm up with 10,000 write/read pairs (discarded)
- For 100,000 measurement samples:
- Batch 100 write/read pairs
- Record per-operation average in HDR histogram
- Report percentiles for each message size
Configuration:
- Ring capacity: 65,536 entries
- Message sizes: 8B, 32B, 64B, 256B
- Warmup: 10,000 ops
- Samples: 100,000 (x100 batch = 10M ops)
Target:
- p50 < 50 ns (8B-64B messages)
- p99 < 200 ns
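Batching exists because a single ring-buffer operation is comparable in cost to reading the clock, so timing each one individually would mostly measure the timer. The Python sketch below shows the sampling scheme only; a `deque` stands in for the SPSC ring and the function name is an assumption.

```python
import time
from collections import deque


def batched_latency_samples(samples=100_000, batch=100, msg_size=32):
    ring = deque(maxlen=65_536)  # stand-in for the 64K-entry ring
    msg = bytes(msg_size)
    averages = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        for _op in range(batch):
            ring.append(msg)  # write
            ring.popleft()    # read
        elapsed = time.perf_counter_ns() - start
        averages.append(elapsed / batch)  # per-operation average
    return averages
```

Each recorded value is a mean over one batch, so the reported percentiles describe slow batches rather than individual slow operations; a single outlier is diluted by a factor of the batch size.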
MPSC Ring Buffer Latency
What: Measures the multi-producer single-consumer ring buffer latency under contention from multiple writer threads.
Configuration:
- Multiple producer threads writing concurrently
- Single consumer thread reading
- Measures contention overhead vs SPSC baseline
Target:
- p50 < 100 ns (under moderate contention)
- p99 < 500 ns
LogBuffer Throughput
What: Measures the latency of the claim/commit/read cycle of the LogBuffer, the foundation of the Aeron-style term buffer used by IPC channels.
Procedure:
- Initialize a LogBuffer with 64K term length
- Warm up with 10,000 claim/commit/read cycles
- Reset the buffer
- For 50,000 measurement samples:
- Batch 100 claim/commit/read cycles
- On claim failure: drain 4,096 messages and retry
- Record per-operation average in HDR histogram
- Report percentiles for each message size
Configuration:
- Term length: 65,536 bytes
- Message sizes: 32B, 64B, 256B
- Warmup: 10,000 ops
- Samples: 50,000 (x100 batch = 5M ops)
Target:
- p50 < 100 ns
- p99 < 500 ns
IPC Multi-Size
What: Measures IPC channel latency across different message sizes to characterize how payload size affects publish/poll performance.
Configuration:
- Message sizes: 64B, 256B, 1024B
- Term length: 4 MB
- Pre-fault: enabled
Target:
- 64B: p50 < 200 ns
- 1024B: p50 < 500 ns
Broadcast Buffer Throughput
What: Measures the 1-to-N broadcast buffer transmit/receive throughput. The broadcast buffer uses lossy semantics with lapping detection, making it suitable for market data distribution where the latest value wins.
Configuration:
- Buffer size: 1 MB
- Message sizes: 32B, 64B
- Single transmitter, multiple receivers
- Measures both transmit rate and receive-with-lapping rate
Target:
- Transmit: > 30M msg/sec (64B messages)
- Receive: > 25M msg/sec per receiver
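Lossy semantics with lapping detection can be illustrated with a toy slot-based buffer. This Python sketch is single-threaded and hypothetical: all class and field names are assumptions, and a real implementation would work on byte offsets and need atomic ordering to detect an overwrite that happens during a read.

```python
class BroadcastBuffer:
    """Toy 1-to-N lossy buffer: the transmitter never waits for receivers."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.tail = 0  # total messages transmitted so far

    def transmit(self, msg):
        self.slots[self.tail % self.capacity] = msg  # overwrite oldest slot
        self.tail += 1


class Receiver:
    def __init__(self, buffer):
        self.buffer = buffer
        self.cursor = buffer.tail  # join at the current position
        self.lapped = 0            # messages lost to overwrites

    def receive(self):
        buf = self.buffer
        if self.cursor == buf.tail:
            return None  # nothing new
        oldest = buf.tail - buf.capacity
        if self.cursor < oldest:
            # The transmitter lapped us: skip ahead and count the loss.
            self.lapped += oldest - self.cursor
            self.cursor = oldest
        msg = buf.slots[self.cursor % buf.capacity]
        self.cursor += 1
        return msg
```

With a capacity of 4 and 10 transmitted messages, a receiver that has not polled skips the 6 overwritten messages and resumes at the oldest one still in the buffer.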
SBE Encode/Decode Throughput
What: Measures the SBE (Simple Binary Encoding) codec encode/decode throughput for FIX trading messages. SBE is a FIX Trading Community wire format; the codec provides zero-allocation encoding with a schema-driven field layout.
Configuration:
- Message types: NewOrderSingle, ExecutionReport, MarketDataIncrementalRefresh
- Encoding: schema-driven with MessageHeader (8B) + root block + repeating groups
- Iterations: 1,000,000 per message type
- Measures: encode ns/msg, decode ns/msg, round-trip validation
Target:
- NewOrderSingle encode: < 50 ns/msg
- ExecutionReport decode: < 50 ns/msg
- Full round-trip (encode + decode): < 100 ns/msg
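The 8-byte MessageHeader in the SBE standard consists of four little-endian `uint16` fields: blockLength, templateId, schemaId, and version. The Python sketch below encodes that header followed by a root block into a preallocated buffer. The root block layout and the ID values are hypothetical and are not ZigBolt's NewOrderSingle schema; repeating groups are omitted.

```python
import struct

HEADER = struct.Struct("<HHHH")        # blockLength, templateId, schemaId, version
ROOT_BLOCK = struct.Struct("<QqIB3x")  # hypothetical: order id, price, qty, side, padding

TEMPLATE_ID, SCHEMA_ID, VERSION = 1, 1, 0  # hypothetical identifiers


def encode_into(buf, order_id, price, qty, side):
    """Write header + root block into buf; returns the encoded length."""
    HEADER.pack_into(buf, 0, ROOT_BLOCK.size, TEMPLATE_ID, SCHEMA_ID, VERSION)
    ROOT_BLOCK.pack_into(buf, HEADER.size, order_id, price, qty, side)
    return HEADER.size + ROOT_BLOCK.size


def decode(buf):
    block_length, template_id, _schema_id, _version = HEADER.unpack_from(buf, 0)
    if template_id != TEMPLATE_ID or block_length != ROOT_BLOCK.size:
        raise ValueError("unexpected message header")
    return ROOT_BLOCK.unpack_from(buf, HEADER.size)
```

Packing into a reused `bytearray` is the closest Python analogue to the zero-allocation style described above: the buffer is allocated once and every message is written in place.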
Compression Throughput
What: Measures the LZ4-style compression/decompression throughput for archive segment data. Compression is used by the archive subsystem to reduce storage requirements for recorded message streams.
Configuration:
- Input sizes: 1KB, 4KB, 16KB blocks
- Data patterns: market data (partially compressible), random (incompressible)
- Framed API with CRC32 validation
Target:
- Compression: > 500 MB/sec
- Decompression: > 1 GB/sec
- Compression ratio: > 2x for structured market data
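To show what a framed API with CRC32 validation and a compression-ratio check might look like, here is a Python sketch. It swaps in `zlib` for the LZ4-style codec because LZ4 is not in the Python standard library, so its ratios and speeds are not comparable to ZigBolt's. The frame layout and the record layout in `market_data_block` are hypothetical.

```python
import os
import struct
import zlib

# Hypothetical frame header: raw length, compressed length, CRC32 of raw data.
FRAME_HEADER = struct.Struct("<III")


def compress_frame(raw):
    body = zlib.compress(raw)  # zlib stands in for the LZ4-style codec
    return FRAME_HEADER.pack(len(raw), len(body), zlib.crc32(raw)) + body


def decompress_frame(frame):
    raw_len, body_len, crc = FRAME_HEADER.unpack_from(frame)
    start = FRAME_HEADER.size
    raw = zlib.decompress(bytes(frame[start:start + body_len]))
    if len(raw) != raw_len or zlib.crc32(raw) != crc:
        raise ValueError("corrupt frame")
    return raw


def ratio(raw):
    return len(raw) / len(compress_frame(raw))


def market_data_block(size):
    """Structured, partially compressible records (hypothetical layout)."""
    rec = struct.Struct("<QqI4x")  # timestamp, price, quantity
    out = bytearray()
    i = 0
    while len(out) + rec.size <= size:
        out += rec.pack(1_700_000_000_000 + i, 1_000_000 + (i % 7), 100)
        i += 1
    return bytes(out)


if __name__ == "__main__":
    print("structured:", round(ratio(market_data_block(4096)), 2))
    print("random:    ", round(ratio(os.urandom(4096)), 2))
```

Random input comes out slightly larger than it went in because of the frame header and codec overhead, which is why the two data patterns are benchmarked separately.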
Results Format
All latency benchmarks output HDR histogram percentiles:
    === Results ===
    Total samples: 100000
    Min: 45 ns
    Mean: 132.7 ns
    p50: 120 ns
    p90: 180 ns
    p99: 450 ns
    p99.9: 1200 ns
    p99.99: 3500 ns
    Max: 15000 ns
    [PASS] p50 = 120 ns (target: <200 ns)
    [PASS] p99 = 450 ns (target: <1000 ns)
Throughput benchmark output:
    === Throughput Results ===
    Published: 10000000 msgs
    Elapsed: 0.150 sec
    Throughput: 66.7 M/sec
    Bandwidth: 4053.3 MB/sec
    [PASS] > 50M msg/sec target met!
WireCodec benchmark output:
    === ZigBolt WireCodec Throughput Benchmark ===
    Iterations: 10000000
    Batch size: 64
    [TickMessage (32B)]
    Encode: 3.2 ns/msg (312 M/sec)
    Decode: 2.8 ns/msg (357 M/sec)
    Batch encode: 450 M/sec
    Batch decode: 420 M/sec
    Bandwidth: 9536 MB/sec (encode)
    [PASS] encode < 10 ns/msg
Full suite (bench_run_all) summary output:
    ╔═══════════════════════════════════════════════════════════════════════════════╗
    ║                               Benchmark Summary                               ║
    ╠════════════════╦═══════╦═════════╦═════════╦═════════╦═════════╦══════════════╣
    ║ Transport      ║ Size  ║ p50     ║ p99     ║ p99.9   ║ Max     ║ Throughput   ║
    ╠════════════════╬═══════╬═════════╬═════════╬═════════╬═════════╬══════════════╣
    ║ SPSC           ║ 8B    ║ 12 ns   ║ 45 ns   ║ 120 ns  ║ 500 ns  ║ 83.3 M/s     ║
    ║ SPSC           ║ 32B   ║ 15 ns   ║ 50 ns   ║ 150 ns  ║ 600 ns  ║ 66.7 M/s     ║
    ║ IPC            ║ 64B   ║ 120 ns  ║ 350 ns  ║ 900 ns  ║ 3000 ns ║ 8.3 M/s      ║
    ║ Codec-Enc      ║ 32B   ║ 3 ns    ║ 0 ns    ║ 0 ns    ║ 0 ns    ║ 333.3 M/s    ║
    ║ Codec-Dec      ║ 32B   ║ 2 ns    ║ 0 ns    ║ 0 ns    ║ 0 ns    ║ 500.0 M/s    ║
    ║ LogBuffer      ║ 64B   ║ 35 ns   ║ 120 ns  ║ 300 ns  ║ 1500 ns ║ 28.6 M/s     ║
    ╚════════════════╩═══════╩═════════╩═════════╩═════════╩═════════╩══════════════╝
The full suite also writes bench/results.json with structured data for CI
integration and automated regression detection.
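A CI job could compare the JSON output against the targets in this document. The schema of bench/results.json is not described here, so the Python sketch below assumes a hypothetical one: a list of objects with `name`, `metric`, and `value_ns` keys. The result names are likewise assumptions; only the target values come from this document.

```python
import json

# Target values from this document; the keys are hypothetical result names.
TARGETS_NS = {
    ("ipc_ping_pong", "p50"): 200,
    ("ipc_ping_pong", "p99"): 1_000,
    ("spsc", "p50"): 50,
    ("spsc", "p99"): 200,
}


def check(results_json):
    """Return a list of human-readable target violations."""
    failures = []
    for entry in json.loads(results_json):
        key = (entry["name"], entry["metric"])
        limit = TARGETS_NS.get(key)
        if limit is not None and entry["value_ns"] >= limit:
            failures.append(
                f"{key[0]} {key[1]} = {entry['value_ns']} ns (target: <{limit} ns)"
            )
    return failures


if __name__ == "__main__":
    sample = json.dumps([
        {"name": "ipc_ping_pong", "metric": "p50", "value_ns": 120},
        {"name": "spsc", "metric": "p99", "value_ns": 260},
    ])
    for line in check(sample):
        print("[FAIL]", line)
```

A regression gate would typically also compare against a stored baseline rather than only fixed targets, since a result can degrade noticeably while still passing an absolute threshold.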
Performance Targets vs Expected Actuals
| Benchmark | Metric | Target | Expected (Apple M2) | Expected (Linux x86_64) |
|---|---|---|---|---|
| IPC Ping-Pong | p50 RTT | < 200 ns | ~50-150 ns | ~40-120 ns |
| IPC Ping-Pong | p99 RTT | < 1,000 ns | ~200-500 ns | ~150-400 ns |
| IPC Throughput | msg/sec | > 50M | ~60-80M | ~70-100M |
| IPC Throughput | bandwidth | > 3 GB/s | ~4-5 GB/s | ~5-6 GB/s |
| UDP RTT | p50 | < 5 us | ~2-4 us | ~1-3 us (io_uring) |
| WireCodec Encode | ns/msg | < 10 ns | ~2-5 ns | ~1-4 ns |
| WireCodec Decode | ns/msg | < 10 ns | ~2-5 ns | ~1-4 ns |
| WireCodec Batch | msg/sec | > 150M | ~200-400M | ~300-500M |
| SPSC Ring | p50 | < 50 ns | ~10-30 ns | ~8-25 ns |
| SPSC Ring | p99 | < 200 ns | ~50-150 ns | ~40-100 ns |
| MPSC Ring | p50 | < 100 ns | ~30-80 ns | ~20-60 ns |
| LogBuffer | p50 | < 100 ns | ~30-80 ns | ~25-60 ns |
| Broadcast Tx | msg/sec | > 30M | ~40-60M | ~50-80M |
| SBE Encode | ns/msg | < 50 ns | ~15-30 ns | ~10-25 ns |
| SBE Decode | ns/msg | < 50 ns | ~15-30 ns | ~10-25 ns |
| LZ4 Compress | bandwidth | > 500 MB/s | ~600-900 MB/s | ~800-1200 MB/s |
| LZ4 Decompress | bandwidth | > 1 GB/s | ~1.5-2.5 GB/s | ~2-4 GB/s |
Performance varies by:
- CPU architecture and cache hierarchy
- OS kernel version and scheduler configuration
- NUMA topology (for multi-socket systems)
- Core isolation (isolcpus, nohz_full) on Linux
- Background system load
Tuning for Best Results
Linux:
    # Isolate CPU cores for benchmarks
    sudo grubby --update-kernel=ALL --args="isolcpus=2,3 nohz_full=2,3"
    # Pin benchmark to isolated core
    taskset -c 2 ./zig-out/bin/bench_ping_pong
    # Disable frequency scaling
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    # Increase socket buffer sizes
    sudo sysctl -w net.core.rmem_max=16777216
    sudo sysctl -w net.core.wmem_max=16777216
macOS:
    # Ensure Xcode command-line tools are installed
    xcode-select --install
    # Disable Spotlight indexing on benchmark paths
    sudo mdutil -i off /tmp
    # Close unnecessary applications to reduce noise
HDR Histogram
The benchmarks use a custom lightweight HDR (High Dynamic Range) histogram
implementation in bench/hdr_histogram.zig. It provides:
- Constant memory footprint (bucket array)
- O(1) recording
- Accurate percentile computation
- No allocations during measurement
This avoids measurement perturbation that would occur with a heap-allocating histogram.
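The properties listed above follow from a log-linear bucket layout: each power-of-two range is divided into a fixed number of sub-buckets, so the bucket index is a pure function of the value and the array size is fixed up front. The Python sketch below shows one such layout; it is not the contents of bench/hdr_histogram.zig, and the parameter choices (40-bit range, 32 sub-buckets) are assumptions.

```python
import math


class HdrStyleHistogram:
    def __init__(self, max_bits=40, sub_bits=5):
        self.sub_bits = sub_bits
        self.max_value = (1 << max_bits) - 1
        # Allocated once; record() never grows it.
        self.counts = [0] * (self._index(self.max_value) + 1)
        self.total = 0

    def _index(self, value):
        # Keep the top (sub_bits + 1) significant bits of the value.
        shift = max(0, value.bit_length() - (self.sub_bits + 1))
        return (shift << self.sub_bits) + (value >> shift)

    def _upper_bound(self, index):
        if index < (1 << self.sub_bits):
            return index  # small values are stored exactly
        shift = (index >> self.sub_bits) - 1
        top = (index & ((1 << self.sub_bits) - 1)) + (1 << self.sub_bits)
        return ((top + 1) << shift) - 1

    def record(self, value):
        value = min(max(value, 0), self.max_value)  # clamp into range
        self.counts[self._index(value)] += 1        # O(1), no allocation
        self.total += 1

    def percentile(self, pct):
        rank = max(1, math.ceil(round(self.total * pct / 100.0, 6)))
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                return self._upper_bound(index)
        return self.max_value  # only reached for an empty histogram
```

With 32 sub-buckets per power of two, a reported percentile overestimates the true value by at most about 3%, and the whole histogram for values up to 2^40 ns fits in 1,152 counters.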