A high-performance SIMD Monte Carlo engine to estimate π, written in C++17, benchmarked using an in-house `perf` suite, and validated via CI. The engine is optimized through SIMD-aware programming (AVX2 / NEON), multi-threading, a custom bump allocator, cache-aligned memory strategies, and bitmasking.
💬 Originally based on a Spring 2025 CSULB CECS 325 (Systems Programming) assignment by Neal Terrell. Heavily extended and tuned by Leon, just for fun.
π is estimated by randomly sampling (x, y) points in a unit square and counting how many fall inside the unit circle: the ratio approximates π/4, which is then multiplied by 4 to get our approximation of π.
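A minimal scalar sketch of the idea (illustrative only; the engine's actual implementation is SIMD-batched and multi-threaded, as described below):

```cpp
#include <cstdint>
#include <random>

// Sample n points in the unit square; the hit ratio approximates pi/4.
double estimate_pi_scalar(std::uint64_t n) {
    std::mt19937_64 rng{std::random_device{}()};
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::uint64_t hits = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        const double x = dist(rng), y = dist(rng);
        if (x * x + y * y <= 1.0) ++hits;  // dart landed inside the circle
    }
    return 4.0 * static_cast<double>(hits) / static_cast<double>(n);
}
```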
`perf stat` integration: IPC, cache/TLB misses, branch mispredictions.

This section breaks down the internal optimizations that power the engine, with special focus on SIMD, memory, and cache-sensitive design.
Traditional distance check uses a square root: `sqrt(x*x + y*y) <= 1.0`. This is replaced by: `x*x + y*y <= 1.0`. Why? Since both sides can be compared squared, the expensive `sqrt()` call is skipped entirely.

- AVX2 (`__m256d`): the circle check is vectorized; hits are counted with `_mm256_movemask_pd` and `__builtin_popcount`.
- NEON (`float64x2_t`): `vcleq_f64` and `vmulq_f64` compute the products and comparisons two doubles at a time, in parallel.

Both methods avoid conditional branches, pipeline stalls, and scalar overhead.
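A sketch of the AVX2 path under these assumptions (the function name and shape are illustrative, not the engine's actual code):

```cpp
#include <immintrin.h>

// Given 4 x-coordinates and 4 y-coordinates, count how many of the
// 4 darts land inside the unit circle -- no per-lane branches.
inline int count_hits_avx2(__m256d x, __m256d y) {
    __m256d x2    = _mm256_mul_pd(x, x);
    __m256d y2    = _mm256_mul_pd(y, y);
    __m256d dist2 = _mm256_add_pd(x2, y2);                  // x*x + y*y
    __m256d one   = _mm256_set1_pd(1.0);
    __m256d cmp   = _mm256_cmp_pd(dist2, one, _CMP_LE_OQ); // all-ones lanes where inside
    int mask = _mm256_movemask_pd(cmp);                     // 4-bit mask, one bit per lane
    return __builtin_popcount(mask);                        // number of hits in this batch
}
```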
After computing whether each dart is inside the circle, SIMD comparisons produce a bitmask:
Why bitmasking? It replaces per-lane `if`s with a single integer mask. Bitmasking is a classic SIMD pattern, perfect for Monte Carlo trials where each outcome is independent and binary.
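NEON has no direct `movemask` instruction, so the same branch-free counting can be done by shifting the comparison mask down to 0/1 per lane. A comparable sketch (illustrative only; AArch64):

```cpp
#include <arm_neon.h>
#include <cstdint>

// Given 2 x-coordinates and 2 y-coordinates, count how many of the
// 2 darts land inside the unit circle, branch-free.
inline std::uint64_t count_hits_neon(float64x2_t x, float64x2_t y) {
    float64x2_t dist2 = vaddq_f64(vmulq_f64(x, x), vmulq_f64(y, y));
    uint64x2_t  mask  = vcleq_f64(dist2, vdupq_n_f64(1.0)); // all-ones where inside
    // Shift each 64-bit lane down to 0 or 1, then horizontally add the lanes.
    return vaddvq_u64(vshrq_n_u64(mask, 63));
}
```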
The hot path never touches `malloc`, `free`, or heap metadata; all per-trial memory comes from the pool and is reclaimed in bulk via `reset()`.
⚠️ Note: For small types like `int`, the performance difference between heap and pool allocation is minimal, as the allocation cost is quickly amortized.
However, as object size increases (especially with larger structs, arrays, or cache-heavy data), the heap introduces significant overhead due to metadata, fragmentation, and poor locality.
In contrast, `PoolAllocator` maintains constant-time, contiguous allocation, making it dramatically faster and more cache-friendly for large or frequently reused types.
| Metric | `new` / `malloc` | `PoolAllocator` |
|---|---|---|
| Allocation Latency | Variable | Constant (1 pointer bump) |
| Cache Locality | Unpredictable | Strong (contiguous memory) |
| SIMD Alignment | Manual / fragile | Default (64B, AVX/NEON safe) |
| GC/Freeing | Per-object | Bulk reset (O(1)) |
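A minimal sketch of the bump-allocation pattern described above (hypothetical interface; the project's actual `PoolAllocator` may differ):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Bump allocator over one pre-allocated, 64-byte-aligned buffer.
// alloc() is a single pointer bump; reset() reclaims everything in O(1).
class PoolAllocator {
public:
    explicit PoolAllocator(std::size_t bytes)
        : cap_((bytes + 63) & ~std::size_t{63}),  // round capacity up to a 64B multiple
          buf_(static_cast<std::byte*>(std::aligned_alloc(64, cap_))),
          off_(0) {
        if (!buf_) throw std::bad_alloc{};
    }
    ~PoolAllocator() { std::free(buf_); }

    void* alloc(std::size_t bytes, std::size_t align = 64) {
        std::size_t p = (off_ + align - 1) & ~(align - 1); // round offset up to alignment
        if (p + bytes > cap_) throw std::bad_alloc{};
        off_ = p + bytes;                                  // the "bump"
        return buf_ + p;
    }
    void reset() { off_ = 0; }  // bulk "free" of every allocation

private:
    std::size_t cap_;  // total buffer size (multiple of 64)
    std::byte*  buf_;  // 64-byte-aligned backing buffer
    std::size_t off_;  // current bump offset
};
```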
Each simulation thread gets:
- Its own PRNG (`std::mt19937_64`)
- Its own `PoolAllocator`

Benefits: threads join only at final result aggregation; no locks or mutexes are required at any step.
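A sketch of that per-thread structure (hypothetical names; the real worker loop is SIMD-batched):

```cpp
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Each worker owns its PRNG and a private hit counter, so the hot loop
// touches no shared state; threads meet only at the final join + sum.
std::uint64_t count_hits_parallel(std::uint64_t total, unsigned nthreads) {
    std::vector<std::uint64_t> hits(nthreads, 0);
    std::vector<std::thread> workers;
    const std::uint64_t per_thread = total / nthreads;  // remainder ignored for brevity
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&hits, t, per_thread] {
            std::mt19937_64 rng{std::random_device{}() + t};  // thread-local PRNG
            std::uniform_real_distribution<double> dist(0.0, 1.0);
            std::uint64_t local = 0;                          // stays in a register
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                const double x = dist(rng), y = dist(rng);
                local += (x * x + y * y <= 1.0);              // branch-free accumulate
            }
            hits[t] = local;  // one write-back; avoids false sharing in the loop
        });
    }
    for (auto& w : workers) w.join();  // the only synchronization point
    std::uint64_t sum = 0;
    for (auto h : hits) sum += h;      // lock-free aggregation after join
    return sum;
}
```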
SIMD only works on batches (4 trials for AVX2, 2 for NEON).
To maintain precision:

- Leftover trials (`n % batch`) fall back to scalar logic
- The remaining (fewer than one batch) trials are run without vector masking complexity (see the sketch after the summary table below)

| Feature | Reason |
|---|---|
| `PoolAllocator` | Minimal latency per alloc, avoids fragmentation |
| SIMD (AVX2/NEON) | Process 2–4 trials per instruction cycle |
| Thread-local PRNG | Avoids locking / shared access |
| No heap in hot path | Eliminates OS allocator variability |
| Aligned memory (64B) | Safe for AVX/NEON, avoids false sharing |
| `reset()` reuse model | Fast GC-like memory clearing in O(1) |
| Scalar fallback | Completes remaining trials without SIMD masking |
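A sketch of how the remainder handling might look at the AVX2 batch width (illustrative; `count_hits_avx2` is the hypothetical helper sketched earlier):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <random>

// Process trials in SIMD batches of 4, then finish the n % 4 leftover
// trials with plain scalar logic -- no masked tail handling needed.
std::uint64_t simulate(std::uint64_t n, std::mt19937_64& rng) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::uint64_t hits = 0, i = 0;
    for (; i + 4 <= n; i += 4) {                    // full batches of 4 (AVX2 width)
        alignas(32) double xs[4], ys[4];
        for (int k = 0; k < 4; ++k) { xs[k] = dist(rng); ys[k] = dist(rng); }
        hits += count_hits_avx2(_mm256_load_pd(xs), _mm256_load_pd(ys));
    }
    for (; i < n; ++i) {                            // scalar fallback: n % 4 trials
        const double x = dist(rng), y = dist(rng);
        hits += (x * x + y * y <= 1.0);
    }
    return hits;
}
```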
The memory, parallelism, and vectorization strategies in this engine directly reflect patterns used in:
High-Frequency Trading: HRT (Hudson River Trading), a high-frequency trading firm, uses pre-allocated pools to make better use of huge pages and shave nanoseconds off trading operations. This is crucial: in HFT, every tiny unit of time can determine whether a trade ends in a profit or a loss.
>HRT carefully manages memory allocation patterns and alignment, recognizing that poor memory management can lead to cache thrashing and significantly impact performance in latency-critical trading operations.
This project mirrors that approach by using a bump-style memory pool allocator for linear memory growth, avoiding frequent dynamic allocations, and aligning memory to cache lines to reduce L1/L2 thrashing.
🔗 Low Latency Optimizations by HRT, Part 1 · 🔗 Low Latency Optimizations by HRT, Part 2
Order Book Matching Engines: LMAX, a London-based forex exchange, uses pre-allocated memory for ring buffers instead of traditional queues for event handling.
> As a result of these practices, overhead from GC (garbage collector) pressure and "contention in multi-threaded implementations" has been minimized.
This project applies the same principle through pre-allocated memory, minimizing thread contention via thread-local memory pool allocators.
🔗 LMAX Disruptor's Whitepaper
Machine Learning Compilers (e.g., PyTorch JIT): In the section Best Practices for Backends from the PyTorch developer docs, it is stated that
>"Compiled workloads are usually optimized by Single Instruction Multiple Data (SIMD) instruction sets."
Similarly, this project employs SIMD instruction sets, specifically AVX2 intrinsics (e.g., `_mm256_*`), to batch-compute dart hits, resulting in a massive performance boost.
🔗 PyTorch's Best Practices for Compiler Design
Similarity Search (Meta's Faiss Library): Faiss, AKA Facebook AI Similarity Search, is a highly optimized library that allows developers to search multimedia documents in ways that traditional database engines (SQL) do not support as strongly.
This project follows Meta's low-latency optimization strategies for the very same purpose: computing distances.
>Namely, Meta uses multi-threading, SIMD vectorization, and popcount (bitmasking) to speed up distance computations.
These are exactly the methods this project employs for its distance calculations, though Meta's distance computations are more complex!
🔗 Meta's Blog on the Faiss Library
Video Processing (Netflix's VMAF): Netflix uses VMAF as a video quality metric, originally designed purely for Netflix's streaming use case before it was open-sourced. Netflix's engineers have taken great efforts to improve VMAF's performance:
>"Through low-level code optimization and vectorization, we sped up VMAFβs execution by more than 4x in the past."
Specifically, these speed improvements were achieved with AVX2 and AVX-512 intrinsics. Note that AVX2 and AVX-512 operate on 256-bit and 512-bit registers, respectively.
Throughout this project, `_mm256_*` intrinsics, which come from the AVX2 instruction set, are heavily used during the Monte Carlo simulation process.
🔗 Netflix's VMAF Optimizations Blog
Note: `perf` is not supported on macOS or WSL. Use a bare-metal Linux setup if you want benchmarking.
WSL (Windows Subsystem for Linux) is fully supported. Setup is identical to Linux:
✅ Clang and AVX2 work on WSL with native Linux tooling.
❌ Direct Windows builds are not supported due to lack of `std::aligned_alloc` and allocator-trait compatibility.
Partial support via Ninja inside the Developer Command Prompt:
Some allocators or SIMD intrinsics may require patching.
MinGW is not supported due to lack of `std::aligned_alloc` and `std::allocator_traits` compatibility.
Running just `./build/montecarlo` runs all methods with 1,000,000,000 trials each by default.
Note that `[TRIALS]` and `[METHOD]` are optional parameters; they default to 1,000,000,000 trials and all execution methods, respectively.
You can also singly test other execution models:
This uses Linux `perf` and requires bare-metal Linux.
Results are logged as `.parquet` files (primarily for efficiency), with partial support for `.csv` (for readability).
By default, results are inserted into ClickHouse in addition to the local `.parquet` files. Thus you must have ClickHouse already running with the specified configs from your `.env` file. You can do this by running `make init` or `make init_demo` if it's your first time; otherwise, run `make up` (`docker-compose up -d`).
Pass `insert_db=false` to skip inserting (e.g., for CI or dry runs).
Note that `/scripts/run_perf.sh [TRIALS] [METHODS]` is to be treated the same as running `./build/montecarlo [TRIALS] [METHODS]`.
Used for dashboard visualization and log ingestion.
If you want to use the full monitoring pipeline:
Benchmark Suite Snapshot
This project includes a `Makefile` to manage your local Docker environment: setting up ClickHouse, loading data, and handling log ingestion + visualization via ClickHouse and Grafana.
🧼 These commands manage everything from booting the stack to loading demo data and resetting volumes.
| Command | Description |
|---|---|
| `make start` | 🐳 Start Docker containers (ClickHouse, Grafana) |
| `make stop` | 📦 Stop containers, preserve data |
| `make rebuild` | 🔄 Restart + rebuild images (data preserved) |
| `make reset_all` | 🧼 Full reset (⚠️ deletes volumes) and rebuilds |
| `make clean_all` | 🧹 Remove Docker volumes + local data (dangerous!) |
| `make clear_data` | 🗑️ Deletes local simulation data only (`db/`) |
| `make clear_parquets` | 🧽 Deletes all local `.parquet` logs |
| `make logs` | 📜 Streams Docker logs from all containers |
| `make init` | 🌱 Start stack and initialize ClickHouse schema |
| `make init_demo` | 📸 Load sample data (`db_sample.parquet`) into ClickHouse |
| `make load_data` | 📥 Load your current simulation log (`db/db.parquet`) into ClickHouse |
| `make load_demo` | 🧺 Load demo Parquet into `db/`, then into ClickHouse |
| `make setup_clickhouse` | 🛠️ Manually reinitialize ClickHouse schema |
> ⚠️ Important: Any of the following commands will overwrite all current ClickHouse data by reloading from `DB_PATH`: `make init_demo`, `make load_data`, `make load_demo`, or any command that invokes `scripts.setup --load-*`.
You must configure your environment (`.env`) before running the Docker stack. A `.env.sample` file is included. To get started:
You may tweak the values, but they should work out of the box if ports `9000`, `8123`, and `3000` are free.
`.env` Variables

If you experience port conflicts (e.g., port 3000 is already in use), either:

- Kill the conflicting service
- Or edit `.env` to use alternate ports (and reflect those changes in your `docker-compose.yml`)
Every push or PR to `main` is automatically tested via GitHub Actions:
Runs a dry smoke test for all methods:
CI badge and logs: View GitHub Actions
A sample snapshot is included for reference: 📄 `samples/perf_results_all_4ca37783.md`
Note that Markdown files are not generated automatically; this one was generated for ease of display in this README.
To inspect raw `.parquet` logs directly, explore the files in:
You can also view them visually via https://www.tablab.app: just drag and drop any `.parquet` file.