This project provides a comprehensive test framework for running integration tests and benchmarks on Embucket (a Snowflake-compatible database) using industry-standard datasets: ClickBench and TPC-H.
The test framework consists of three main scripts:
- `make.sh` - Core utilities for database setup, Docker management, and data operations
- `clickbench.sh` - ClickBench dataset management and benchmarking
- `tpch.sh` - TPC-H dataset management and benchmarking
The framework uses Docker Compose to orchestrate the following services:
- Embucket (port 3000) - The database under test, exposing a Snowflake-compatible interface
- MinIO (ports 9000, 9001) - S3-compatible object storage for testing cloud data integration
- Toxiproxy (port 8474) - Network proxy for simulating latency and failures (see the sketch below)
- MC (MinIO Client) - Automated MinIO bucket setup
All data is stored in the storage/ directory, which is mounted into Docker containers and gitignored.
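Toxiproxy can be driven with the standard `toxiproxy-cli` client, which by default talks to the API on port 8474. As a minimal sketch of how a failure test might use it (the proxy name, listen port, and upstream address here are hypothetical, not part of this framework's configuration):

```bash
# Sketch only: proxy name, listen port, and upstream are assumptions.
# Create a proxy in front of Embucket, then inject 500 ms of latency.
toxiproxy-cli create -l 0.0.0.0:13000 -u localhost:3000 embucket_slow
toxiproxy-cli toxic add -t latency -a latency=500 embucket_slow
```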
- Docker and Docker Compose
- Python with virtual environment support
- Snowflake CLI
- Start Docker services:

  ```bash
  sh make.sh up
  ```

- Initialize database and schema:

  ```bash
  sh make.sh setup
  ```

- Load benchmark data (choose one):

  ClickBench (web analytics benchmark):

  ```bash
  sh clickbench.sh clickbench_partitioned
  ```

  TPC-H (decision support benchmark):

  ```bash
  # First, manually place TPC-H parquet files in storage/tpch/100/
  sh tpch.sh tpch_setup
  ```
make.sh commands:

- `sh make.sh install_snowflake` - Install Snowflake CLI
- `sh make.sh up` - Start Docker Compose services
- `sh make.sh down` - Stop Docker Compose services
- `sh make.sh volume` - Create S3-based external volume
- `sh make.sh volume_file` - Create file-based external volume
- `sh make.sh database` - Create demo database
- `sh make.sh schema` - Create schema
- `sh make.sh setup` - Run the complete database and schema setup
- `sh make.sh snowsql "query"` - Execute Snowflake SQL
- `sh make.sh sparksql "query"` - Execute Spark SQL
- `sh make.sh equality table1 table2` - Compare data between tables (illustrated below)
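The `equality` command is the primary verification helper. As a rough illustration of the kind of cross-engine check it enables (a sketch, not the actual implementation in make.sh):

```bash
# Sketch: compare row counts between the two engines. The real `equality`
# helper may compare full result sets; `tail -n 1` is a crude way to pull
# the count out of the CLI output.
a=$(sh make.sh snowsql "SELECT COUNT(*) FROM demo.embucket.hits" | tail -n 1)
b=$(sh make.sh sparksql "SELECT COUNT(*) FROM demo.spark.hits" | tail -n 1)
[ "$a" = "$b" ] && echo "row counts match" || echo "row counts differ"
```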
clickbench.sh commands:

- `sh clickbench.sh cp_download_partitioned` - Download partitioned ClickBench data
- `sh clickbench.sh cb_download_single` - Download single ClickBench file
- `sh clickbench.sh cb_create_table` - Create ClickBench table schema
- `sh clickbench.sh cb_copy_into_partitioned` - Load partitioned data
- `sh clickbench.sh cb_copy_into_single` - Load single file data
- `sh clickbench.sh clickbench_partitioned` - Full partitioned setup
- `sh clickbench.sh clickbench_single` - Full single file setup
- `sh clickbench.sh clickbench_spark_partitioned` - Create Spark Iceberg table
- `sh clickbench.sh benchmark` - Run ClickBench queries and measure performance
tpch.sh commands:

- `sh tpch.sh volume_local_file` - Create local file-based external volume
- `sh tpch.sh tpch_create_tables` - Create all TPC-H table schemas
- `sh tpch.sh tpch_copy_into_tables` - Load data from mounted storage (/storage/tpch/100/)
- `sh tpch.sh tpch_copy_into_tables_file` - Load data from local filesystem (tpch/100/)
- `sh tpch.sh tpch_setup` - Create tables and load data (complete setup)
- `sh tpch.sh benchmark` - Run TPC-H queries from tpch/queries/ and measure performance (see the sketch below)
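Conceptually, a benchmark run reduces to timing each query file in turn. A minimal sketch assuming one query per `.sql` file in `tpch/queries/` (the framework's own `benchmark` function may measure and report differently):

```bash
#!/bin/bash
# Sketch of a query-timing loop, not the framework's implementation.
source ./make.sh                  # provides the snowsql helper
for q in tpch/queries/*.sql; do
  SECONDS=0                       # bash builtin timer, integer seconds
  snowsql "$(cat "$q")" > /dev/null
  echo "$q finished in ${SECONDS}s"
done
```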
Data Preparation:
TPC-H data must be manually placed in the `storage/tpch/100/` or `tpch/100/` directory as Parquet files. The script expects these files:
- customer.parquet
- orders.parquet
- lineitem.parquet
- nation.parquet
- region.parquet
- part.parquet
- supplier.parquet
- partsupp.parquet
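Before running the setup it can be worth confirming that all eight files are in place. A small sanity check (not part of the framework; adjust the path if using `tpch/100/`):

```bash
# Report any expected parquet file missing from storage/tpch/100/.
for t in customer orders lineitem nation region part supplier partsupp; do
  [ -f "storage/tpch/100/${t}.parquet" ] || echo "missing: ${t}.parquet"
done
```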
Example Usage:
```bash
# Ensure TPC-H data files are in place
# (manually copy parquet files to storage/tpch/100/ or tpch/100/)

# Create tables and load data
sh tpch.sh tpch_setup

# Run benchmark
sh tpch.sh benchmark
```

Test files should follow the pattern in `tests/example.sh`:
```bash
#!/bin/bash
source ./make.sh
source ./clickbench.sh

# Start services
up

# Initialize Snowflake
setup

# Load test data
clickbench_partitioned
clickbench_spark_partitioned

# Run test queries
snowsql "SELECT watchid FROM demo.spark.hits LIMIT 100;"
sparksql "SELECT watchid FROM demo.embucket.hits LIMIT 100;"

# Verify data equality
equality demo.embucket.hits demo.spark.hits

# Cleanup
down
```

A typical test follows this workflow:

- Start services - Use `sh make.sh up` to start Docker containers
- Initialize - Run `sh make.sh setup` to create Snowflake resources
- Load data - Choose the appropriate data loading function
- Execute tests - Run your specific test queries
- Verify results - Use `sh make.sh equality` or custom validation
- Cleanup - Use `sh make.sh down` to stop services
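Put together, a TPC-H variant of the same pattern might look like the following sketch (it assumes tpch.sh exposes its functions when sourced, as clickbench.sh does in tests/example.sh):

```bash
#!/bin/bash
source ./make.sh
source ./tpch.sh

up           # start services
setup        # create database and schema
tpch_setup   # create tables and load data (parquet files must already be in place)
benchmark    # run the TPC-H queries
down         # stop services
```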
ClickBench data loading options:

- `sh clickbench.sh clickbench_partitioned` - Load all 100 partitioned files
- `sh clickbench.sh clickbench_partitioned_small` - Load only the first partition for testing
- `sh clickbench.sh clickbench_single` - Load single large file
- `sh clickbench.sh clickbench_spark_partitioned` - Create corresponding Spark tables
Two storage types are configured:
- S3 storage (`mybucket`) - MinIO-based object storage
- File storage (`local`) - Local filesystem access
Both point to the same data location for testing different ingestion paths.
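For example, the two TPC-H loading helpers appear to exercise one path each (this mapping is an assumption based on the command descriptions above, not something the scripts document explicitly):

```bash
sh tpch.sh tpch_copy_into_tables       # load via the mounted /storage/tpch/100/ path
sh tpch.sh tpch_copy_into_tables_file  # load via the local tpch/100/ filesystem path
```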
Run the example test:

```bash
sh tests/example.sh
```

Create a new test from the template:

```bash
# Create new test file
cp tests/example.sh tests/my_test.sh

# Edit to add your specific test logic

# Run your test
sh tests/my_test.sh
```

The helpers can also be run step by step for ad-hoc debugging:

```bash
# Start only the infrastructure
sh make.sh up
sh make.sh setup

# Load specific dataset
sh clickbench.sh clickbench_single

# Run custom queries
sh make.sh snowsql "SELECT COUNT(*) FROM demo.embucket.hits"
sh make.sh sparksql "SELECT COUNT(*) FROM demo.spark.hits"

# Compare results
sh make.sh equality demo.embucket.hits demo.spark.hits

# Cleanup
sh make.sh down
```

Snowflake CLI configuration pointing to the local Embucket instance:

```toml
[connections.dev]
host = "localhost"
port = 3000
user = "user"
password = "password"
database = "demo"
schema = "embucket"
warehouse = "warehouse"Defines all services (Embucket, MinIO, Toxiproxy, MC) with port mappings and volume mounts to the storage/ directory.
The `s3.sh` script sets environment variables for S3/MinIO access:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_REGION`
Source this file with `. ./s3.sh` when needed for manual S3 operations.
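After sourcing it, any S3 client can talk to MinIO by pointing at its endpoint. For example, with the AWS CLI (assuming it is installed; the bucket name `mybucket` comes from the volume setup above):

```bash
. ./s3.sh                                        # exports the AWS_* variables
aws --endpoint-url http://localhost:9000 s3 ls s3://mybucket/
```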
For detailed information about the test suite and available test scripts, see TESTING.md.
Available test files:
- `tests/example.sh` - Basic integration test
- `tests/clickbench.sh` - ClickBench benchmark
- `tests/clickbench_file.sh` - File-based storage test
- `tests/tpch.sh` - TPC-H benchmark
- `tests/merge.sh` - MERGE operations test
The scripts automatically handle:
- `SNOWFLAKE_HOME` - Set to the current project directory
- Virtual environment activation via `venv.sh`
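As a rough sketch of what that environment handling amounts to (the real `venv.sh` may differ):

```bash
# Hypothetical equivalent of the automatic environment setup.
export SNOWFLAKE_HOME="$(pwd)"        # point the Snowflake CLI at this project's config
[ -d venv ] || python3 -m venv venv   # create the virtual environment if missing
. venv/bin/activate
```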