Combines multiple per-method parquet log files into a single file.

Namespaces
namespace	pipeline

namespace	pipeline.combine_batch_parquets

Variables
	pipeline.combine_batch_parquets.batch_dir = Path(sys.argv[1])

	pipeline.combine_batch_parquets.output_path = Path(sys.argv[2])

	pipeline.combine_batch_parquets.global_db_path = Path(DB_PATH)

list	pipeline.combine_batch_parquets.files = [f for f in batch_dir.glob("perf_results_*.parquet") if f.name != output_path.name]

	pipeline.combine_batch_parquets.merged = pl.concat([pl.read_parquet(f) for f in files], how="vertical_relaxed").sort("Timestamp")

	pipeline.combine_batch_parquets.compression

	pipeline.combine_batch_parquets.db = pl.read_parquet(global_db_path)

Detailed Description

Combines multiple per-method parquet log files into a single file.

Description:

Scans the given batch directory for all perf_results_*.parquet files, excluding the output file itself. Concatenates them vertically (how="vertical_relaxed"), sorts the rows by the "Timestamp" column, and saves the result as a single compressed parquet file.

After merging, the result is also appended to a global historical file specified by DB_PATH, which is configured via .env and loaded in scripts/config.py.

Compression:

All output parquet files are compressed using Zstandard (zstd).

Usage:

$ python3 combine_batch_parquets.py <batch_dir> <output_file>

Arguments:

<batch_dir> Folder containing the individual .parquet logs
<output_file> Path to the final combined .parquet file
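The variables listed on this page read sys.argv[1] and sys.argv[2] directly. A guarded version of that argument handling could look like the following sketch; parse_args and its error messages are hypothetical and not part of the documented script.

```python
import sys
from pathlib import Path


def parse_args(argv: list[str]) -> tuple[Path, Path]:
    """Return (batch_dir, output_path) from a sys.argv-style list."""
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} <batch_dir> <output_file>")
    batch_dir, output_path = Path(argv[1]), Path(argv[2])
    # Assumption: a missing batch directory should stop the run early.
    if not batch_dir.is_dir():
        raise SystemExit(f"not a directory: {batch_dir}")
    return batch_dir, output_path
```

It would be called as batch_dir, output_path = parse_args(sys.argv) at the top of the script.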

Output:

A merged parquet file containing all batch results
Updated global Parquet DB with new rows appended

Notes:

Global Parquet path is loaded from .env via scripts/config.py
This script uses polars for fast DataFrame operations and I/O
Intended to be run after run_perf.sh completes all method benchmarks

Definition in file combine_batch_parquets.py.
