Monte Carlo Benchmarking Engine
High-performance SIMD Monte Carlo engine (AVX2/NEON) with custom memory allocators and perf logging.
 

A high-performance SIMD Monte Carlo engine for estimating π, written in C++17, benchmarked via perf, and validated via CI. The engine is optimized through SIMD-aware programming (AVX2 / NEON), multi-threading, a custom bump allocator, cache-aligned memory strategies, and bitmasking.

Benchmarked using an in-house perf suite, and tested via CI.

🔬 Originally based on a Spring 2025 CSULB CECS 325 (Systems Programming) assignment by Neal Terrell. Highly extended and tuned by Leon - just for fun.

π is estimated by randomly sampling (x, y) points in a unit square and counting how many fall inside the unit circle; the ratio of hits to total samples approximates π/4, which is then multiplied by 4 to get the approximation of π.
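In scalar form, the whole idea fits in a few lines. A minimal sketch of the estimator (the function name and seed here are illustrative, not the engine's actual API; the engine's hot path is SIMD and multi-threaded):

#include <cstdint>
#include <random>

// Illustrative scalar estimator: sample (x, y) in [0, 1)^2 and count hits.
double estimate_pi_scalar(std::uint64_t trials, std::uint64_t seed = 42) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::uint64_t hits = 0;
    for (std::uint64_t i = 0; i < trials; ++i) {
        double x = dist(rng);
        double y = dist(rng);
        if (x * x + y * y <= 1.0) ++hits;   // point fell inside the quarter circle
    }
    return 4.0 * static_cast<double>(hits) / static_cast<double>(trials);
}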


→ View Online Documentation


📚 Table of Contents

  1. Features
  2. Core Optimization Techniques
  3. Benchmark-Oriented Design Summary
  4. Tips for Reviewers / Recruiters & Real World Examples
  5. Requirements
  6. Setup Instructions
  7. Building & Running
  8. Benchmark Suite (Optional)
  9. Docker + Grafana Integration
  10. Perf Dashboard Setup (Docker + Makefile)
  11. Environment Configuration
  12. GitHub Actions CI
  13. Sample Benchmark Logs

🧩 Features

  • Execution Models:
    • Sequential
    • Heap-allocated w/ multi-threading
    • Custom bump memory pool allocator (thread-local, reset-based) w/ multi-threading
    • SIMD-accelerated (AVX2 / NEON) w/ memory pool & multi-threading
  • Memory Optimization:
    • Preallocated, thread-local memory pools for allocation-free reuse
    • Structure-of-Arrays layout for SIMD-friendly access patterns (see the layout sketch after this list)
    • Optimized to reduce false sharing through thread-local state and separation of write paths
    • Improved L1 cache locality via contiguous memory and low-fragmentation allocation
  • Performance Profiling (optional):
    • perf stat integration: IPC, cache/TLB misses, branch mispredictions
    • Tracks cycles-per-trial and miss-per-trial metrics
  • Logging & Analysis:
    • Zstandard-compressed Parquet output
    • Auto-generated CSV performance tables for quick inspection
  • CI Tested: GitHub Actions verifies builds and logs per commit
  • Cross-platform:
    • ✅ Linux, Windows (WSL), and macOS supported
    • ⚠️ Windows support (MSVC + Ninja) is experimental
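For the Structure-of-Arrays point above, a minimal illustration of the layout difference (type names are illustrative, not taken from the codebase):

#include <vector>

// Array-of-Structures: x and y interleaved, so SIMD loads need strided access
struct PointAoS { double x, y; };

// Structure-of-Arrays: each coordinate stored contiguously, so a run of x's
// (or y's) maps directly onto one SIMD register load
struct PointsSoA {
    std::vector<double> x;
    std::vector<double> y;
};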

🔧 Core Optimization Techniques

This section breaks down the internal optimizations that power the engine, with special focus on SIMD, memory, and cache-sensitive design.

1. Distance Check Formula

Traditional distance check uses a square root:

if (sqrt(x*x + y*y) <= 1.0)

This is replaced by:

if (x*x + y*y <= 1.0)

Why?

  • Removes expensive sqrt() call
  • Enables vectorization (SIMD) and compiler auto-vectorization

2. SIMD-Accelerated Trial Execution

AVX2 (x86-64)

  • Uses 256-bit registers (__m256d)
  • 4 double-precision values per SIMD batch
  • Circle check is vectorized:

    _mm256_cmp_pd(_mm256_add_pd(_mm256_mul_pd(x, x), _mm256_mul_pd(y, y)),
                  _mm256_set1_pd(1.0), _CMP_LE_OQ)
  • Result: hit counts are derived via bitmask (_mm256_movemask_pd) and __builtin_popcount

NEON (ARM64)

  • 128-bit registers (float64x2_t)
  • 2 double-precision values per batch
  • Performs vcleq_f64 and vmulq_f64 in parallel

Both methods avoid conditional branches, pipeline stalls, and scalar overhead.
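NEON has no direct movemask equivalent, so the per-lane compare results can be reduced to a hit count directly. A minimal sketch of that idea (assuming aarch64; the helper name is illustrative, not the engine's actual function):

#include <arm_neon.h>
#include <cstdint>

// Illustrative NEON hit count for one batch of 2 darts.
inline std::uint64_t neon_batch_hits(float64x2_t x, float64x2_t y) {
    float64x2_t dist2  = vaddq_f64(vmulq_f64(x, x), vmulq_f64(y, y)); // x*x + y*y per lane
    uint64x2_t  inside = vcleq_f64(dist2, vdupq_n_f64(1.0));          // all-ones lanes where <= 1.0
    return vaddvq_u64(vshrq_n_u64(inside, 63));                       // 0/1 per lane, summed
}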


3. Bitmasking: Fast Hit Counting with No Branches

After computing whether each dart is inside the circle, SIMD comparisons produce a bitmask:

// For AVX2
__m256d cmp = _mm256_cmp_pd(...);
int mask = _mm256_movemask_pd(cmp); // e.g., 0b1010 = 2 hits
int hits = __builtin_popcount(mask); // Fast hardware popcount

Why bitmasking?

  • Replaces 4 conditional ifs with a single integer mask
  • Enables branchless counting in constant time
  • Avoids CPU misprediction penalties from random dart throws
  • Maps directly to native CPU instructions (fast + deterministic)

Bitmasking is a classic SIMD pattern, and a perfect fit for Monte Carlo trials where each outcome is independent and binary.


4. Memory Optimization: Pool Allocator

What it does:

  • Allocates from a fixed-size buffer using a bump pointer
  • No malloc, free, or heap metadata
  • Memory is reused every frame via reset()

Why it matters:

  • Heap allocation causes:
    • Fragmentation
    • Lock contention in multithreaded workloads
    • Unpredictable latency due to OS-level bookkeeping
  • PoolAllocator offers:
    • O(1) allocation
    • Zero fragmentation
    • Cache-aligned access (64-byte default)

⚠️ Note: For small types like int, the performance difference between heap and pool allocation is minimal, as the allocation cost is quickly amortized.

However, as object size increases, especially with larger structs, arrays, or cache-heavy data, the heap introduces significant overhead due to metadata, fragmentation, and poor locality.

In contrast, PoolAllocator maintains constant-time, contiguous allocation, making it dramatically faster and more cache-friendly for large or frequently reused types.

Performance Impact:

| Metric | new / malloc | PoolAllocator |
| --- | --- | --- |
| Allocation Latency | Variable | Constant (1 pointer bump) |
| Cache Locality | Unpredictable | Strong (contiguous memory) |
| SIMD Alignment | Manual / fragile | Default (64B, AVX/NEON safe) |
| GC/Freeing | Per-object | Bulk reset (O(1)) |
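A minimal sketch of the bump-allocation idea (interface and names are illustrative, not the engine's actual PoolAllocator):

#include <cstddef>
#include <cstdlib>
#include <new>

// Illustrative bump allocator: one 64B-aligned slab, a moving offset, O(1) reset.
class BumpPool {
public:
    explicit BumpPool(std::size_t bytes)
        : size_((bytes + 63) & ~std::size_t{63}),                       // round slab up to a 64B multiple
          buf_(static_cast<std::byte*>(std::aligned_alloc(64, size_))) {}
    ~BumpPool() { std::free(buf_); }

    void* allocate(std::size_t bytes, std::size_t align = 64) {
        std::size_t start = (offset_ + align - 1) & ~(align - 1);       // bump to the next aligned offset
        if (start + bytes > size_) throw std::bad_alloc{};              // fixed capacity, no heap fallback
        offset_ = start + bytes;
        return buf_ + start;
    }

    void reset() noexcept { offset_ = 0; }                              // "frees" everything in O(1)

private:
    std::size_t size_;
    std::byte*  buf_;
    std::size_t offset_ = 0;
};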

5. Thread-Local Everything (No Locks)

Each simulation thread gets:

  • Its own RNG (std::mt19937_64)
  • Its own PoolAllocator
  • No shared writes, no atomics

Benefits:

  • No false sharing: Allocated data is 64B-aligned (cache line size)
  • No heap contention: Each thread manages its own memory space
  • No synchronization: All computation and memory state is isolated

Threads join only at final result aggregation; no locks or mutexes are required at any step.
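A minimal sketch of that shape (illustrative; the real workers use the pool allocator and SIMD batches rather than this plain scalar loop):

#include <cstdint>
#include <future>
#include <random>
#include <vector>

// Illustrative threading model: each worker owns its RNG and its hit counter;
// results meet only at the final sum.
double estimate_pi_mt(std::uint64_t trials, unsigned threads) {
    auto worker = [](std::uint64_t n, std::uint64_t seed) {
        std::mt19937_64 rng(seed);                      // thread-local RNG, never shared
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        std::uint64_t hits = 0;
        for (std::uint64_t i = 0; i < n; ++i) {
            double x = dist(rng), y = dist(rng);
            hits += (x * x + y * y <= 1.0);
        }
        return hits;                                    // the only value that crosses threads
    };

    std::uint64_t per_thread = trials / threads;
    std::vector<std::future<std::uint64_t>> parts;
    for (unsigned t = 0; t < threads; ++t)
        parts.push_back(std::async(std::launch::async, worker, per_thread, 0x5EEDULL + t));

    std::uint64_t total_hits = 0;
    for (auto& f : parts) total_hits += f.get();        // join point: a simple sum, no locks
    return 4.0 * static_cast<double>(total_hits) / static_cast<double>(per_thread * threads);
}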


6. Scalar Fallback for Remainders

SIMD only works on batches (4 trials for AVX2, 2 for NEON).

To maintain precision:

  • Remainder trials (n % batch) fall back to scalar logic
  • Ensures all n trials are run without vector masking complexity
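In outline (illustrative; simd_batch4 and scalar_trial below are hypothetical stand-ins for the real kernels):

#include <cstdint>

std::uint64_t simd_batch4();   // hypothetical: runs 4 trials with AVX2, returns 0..4 hits
std::uint64_t scalar_trial();  // hypothetical: runs 1 trial, returns 0 or 1

std::uint64_t run_all(std::uint64_t n) {
    std::uint64_t hits = 0, i = 0;
    for (; i + 4 <= n; i += 4) hits += simd_batch4();   // full SIMD batches
    for (; i < n; ++i)         hits += scalar_trial();  // scalar tail covers the n % 4 leftovers
    return hits;
}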

Benchmark-Oriented Design Summary

| Feature | Reason |
| --- | --- |
| PoolAllocator | Minimal latency per alloc, avoids fragmentation |
| SIMD (AVX2/NEON) | Processes 2–4 trials per SIMD instruction |
| Thread-Local PRNG | Avoids locking / shared access |
| No heap in hot path | Eliminates OS allocator variability |
| Aligned memory (64B) | Safe for AVX/NEON, avoids false sharing |
| reset() reuse model | Fast GC-like memory clearing in O(1) |
| Scalar fallback | Completes remaining trials without SIMD masking |

💡 Tips for Reviewers / Recruiters & Real World Examples

The memory, parallelism, and vectorization strategies in this engine directly reflect patterns used in:

High Frequency Trading: HRT (Hudson River Trading), a high-frequency trading firm, uses pre-allocated pools to make better use of huge pages and shave nanoseconds off trading operations. This is crucial because in HFT, every tiny unit of time can determine whether a profit or loss is made.

>HRT carefully manages memory allocation patterns and alignment, recognizing that poor memory management can lead to cache thrashing and significantly impact performance in latency-critical trading operations.

This project mirrors this by using a bump-style memory pool allocator for linear memory growth, avoiding frequent dynamic allocations, as well as aligning memory to cache lines to reduce L1/L2 thrashing.

🔗 Low Latency Optimizations by HRT Part 1 🔗 Low Latency Optimizations by HRT Part 2

Order book matching engines: LMAX, a London-based forex exchange, uses pre-allocated memory for ring buffers instead of traditional queues for event handling.

>As a result of these practices, overhead from GC (garbage collector) pressure and "contention in multi-threaded implementations" has been minimized.

This project applies the same principle through pre-allocated memory, minimizing thread contention through thread-local memory pool allocators.

🔗 LMAX Disruptor's Whitepaper

Machine Learning Compilers (e.g., PyTorch JIT): In the Best Practices for Backends section of the PyTorch developer docs, it is stated that

>"Compiled workloads are usually optimized by Single Instruction Multiple Data (SIMD) instruction sets."

Similarly, this project employs SIMD instruction sets, specifically AVX2 intrinsics (e.g., _mm256_*), to batch-compute dart hits, resulting in a massive performance boost.

🔗 PyTorch's Best Practices for Compiler Design

Similarity Search (Meta's Faiss Library): Faiss, AKA Facebook AI Similarity Search, is a highly optimized library that lets developers search multimedia documents in ways that traditional database engines (SQL) do not support as strongly.

This project follows Meta's low-latency optimization strategies for the very same purpose: computing distances.

>Namely, Meta uses multi-threading, SIMD vectorization, and popcount (bitmasking) to speed up distance computations.

These are exactly the methods this project employs for calculating distances, though Meta's distance computations are more complex!

🔗 Meta's Blog on Faiss Library

Video Processing (Netflix's VMAF): Netflix uses VMAF as a video quality metric, originally designed purely for Netflix's streaming use case before it was open-sourced. Netflix's engineers have taken great efforts to improve VMAF's performance:

>"Through low-level code optimization and vectorization, we sped up VMAF’s execution by more than 4x in the past."

Specifically, these speed improvements were carried out with AVX2 & AVX-512 intrinsics, which operate on 256-bit and 512-bit registers respectively.

Throughout this project, _mm256_* intrinsics from the AVX2 instruction set are heavily used during the Monte Carlo simulation process.

🔗 Netflix's VMAF Optimizations Blog


Requirements

  • CMake 3.15+
  • Ninja
  • Clang++ (recommended) or GCC 12+

Setup Instructions

Note: perf is not supported on macOS or WSL. Use a bare-metal Linux setup if you want benchmarking.

🐧 Arch Linux

sudo pacman -S cmake ninja clang

🌀 Linux (Debian/Ubuntu)

sudo apt update
sudo apt install cmake ninja-build clang

🍎 macOS (with Homebrew)

brew install cmake ninja

🪟 Windows (WSL2 — ✅ Recommended)

WSL (Windows Subsystem for Linux) is fully supported. Setup is identical to Linux:

sudo apt update
sudo apt install cmake ninja-build clang python3-pip

✅ Clang and AVX2 work on WSL with native Linux tooling. ❌ Direct Windows builds are not supported due to lack of std::aligned_alloc and allocator trait compatibility.

🪟 Windows (MSVC — ⚠️ Experimental)

Partial support via Ninja inside Developer Prompt:

cmake -G Ninja -B build
ninja -C build

Some allocators or SIMD intrinsics may require patching.

🪟 Windows (MinGW — ❌ Not Supported)

MinGW is not supported due to lack of std::aligned_alloc and std::allocator_traits compatibility.


Building & Running

Build (Linux/macOS/WSL)

cmake -G Ninja -B build
ninja -C build
./build/montecarlo [TRIALS] [METHOD]

Both [TRIALS] and [METHOD] are optional; running just ./build/montecarlo defaults to 1,000,000,000 trials for each of the execution methods.

You can also test execution models individually:

./build/montecarlo 100000000 Sequential
./build/montecarlo 100000000 Heap
./build/montecarlo 100000000 Pool
./build/montecarlo 100000000 SIMD

📊 Running Benchmark Suite (Optional)

This uses Linux perf and requires bare metal Linux.

Results are logged as .parquet (primarily, for efficiency), with partial support for .csv (for readability).

chmod +x scripts/run_perf.sh
./scripts/run_perf.sh # runs all methods with 1,000,000,000 trials
./scripts/run_perf.sh [TRIALS] [METHODS]
./scripts/run_perf.sh 50000000 SIMD insert_db=false # Skip ClickHouse insert

By default:

  • All methods are benchmarked
  • Results are exported as .parquet files
  • Results are inserted into ClickHouse automatically each run.

Thus you must have ClickHouse already running with the specified configs from your .env file. You can do this by running make init or make init_demo if it's your first time; otherwise run make up (docker-compose up -d).

Pass insert_db=false to skip inserting (e.g., for CI or dry runs).

Note that ./scripts/run_perf.sh [TRIALS] [METHODS] is to be treated the same as running ./build/montecarlo [TRIALS] [METHODS].

πŸ‹ Docker (Optional for ClickHouse + Grafana Setup / Data Visualization)

Used for dashboard visualization and log ingestion.

If you want to use the full monitoring pipeline:

  1. Install Docker + Docker Compose 👉 Follow instructions for your OS:

Benchmark Suite Snapshot


πŸ› οΈ Perf Dashboard Setup (Docker + Makefile)

This project includes a Makefile to manage your local Docker environment: setting up ClickHouse, loading data, and handling log ingestion and visualization via ClickHouse and Grafana.

🧼 These commands manage everything from booting the stack to loading demo data and resetting volumes.

📋 Makefile Command Table

| Command | Description |
| --- | --- |
| make start | 🐳 Start Docker containers (ClickHouse, Grafana) |
| make stop | 📦 Stop containers, preserve data |
| make rebuild | 🔄 Restart + rebuild images (data preserved) |
| make reset_all | 🧼 Full reset (⚠️ deletes volumes) and rebuilds |
| make clean_all | 🧹 Remove Docker volumes + local data (dangerous!) |
| make clear_data | 📁 Deletes local simulation data only (db/) |
| make clear_parquets | 🧽 Deletes all local .parquet logs |
| make logs | 📜 Streams Docker logs from all containers |
| make init | 🌱 Start stack and initialize ClickHouse schema |
| make init_demo | 🌸 Load sample data (db_sample.parquet) into ClickHouse |
| make load_data | 📥 Load your current simulation log (db/db.parquet) into ClickHouse |
| make load_demo | 🧺 Load demo Parquet into db/, then into ClickHouse |
| make setup_clickhouse | 🛠️ Manually reinitialize ClickHouse schema |

⚠️ Important: Any of the following commands will overwrite all current ClickHouse data by reloading from DB_PATH: make init_demo, make load_data, make load_demo, or any command that invokes scripts.setup --load-*.


📄 Environment Configuration (.env)

You must configure your environment before running the Docker stack.

A .env.sample file is included. To get started:

cp .env.sample .env

You may tweak the values, but they should work out of the box if ports 9000, 8123, and 3000 are free.

Default .env Variables

# ClickHouse
CLICKHOUSE_USER=default
CLICKHOUSE_PASSWORD=
CLICKHOUSE_UID=clickhouse-datasource
# For Python client
CLICKHOUSE_HOST=localhost
CLICKHOUSE_TCP_PORT=9000
CLICKHOUSE_HTTP_PORT=8123
# For Docker containers
CLICKHOUSE_HOST_DOCKER=clickhouse
CLICKHOUSE_TCP_PORT_DOCKER=9000
CLICKHOUSE_HTTP_PORT_DOCKER=8123
# Grafana
GRAFANA_PORT=3000
# Data paths
DB_PATH=db/db.parquet
SAMPLE_PATH=samples/db_sample.parquet

If you experience port conflicts (e.g., port 3000 is in use), either:

  • Kill the conflicting service
  • Or edit .env to use alternate ports (and reflect those changes in your docker-compose.yml)

🤖 GitHub Actions CI

Every push or PR to main is automatically tested via GitHub Actions:

  • Builds using Clang + CMake + Ninja on Ubuntu
  • Runs a dry smoke test for all methods:

    ./build/montecarlo 10000 Sequential
    ./build/montecarlo 10000 Heap
    ./build/montecarlo 10000 Pool
    ./build/montecarlo 10000 SIMD
  • CI ensures correctness, not benchmarking

CI badge and logs: View GitHub Actions


📄 Performance Benchmark Snapshot

A sample snapshot is included for reference: 👉 samples/perf_results_all_4ca37783.md

Note that markdown files are not generated automatically; this one was generated for ease of display in this README.

To inspect raw .parquet logs directly, explore files in:

samples/*.parquet

You can also view them visually via https://www.tablab.app; just drag and drop any .parquet file.