Single-node data lake foundation for batch ingestion, running on Docker Compose.
| Component | Role |
|---|---|
| RustFS | S3-compatible object storage, backs all Iceberg table data |
| Nessie | Iceberg REST catalog with git-like versioning |
| Spark | Distributed compute engine, executes PySpark jobs submitted by Airflow |
| Airflow | Pipeline orchestration and scheduling |
| Scrapredis | Dedicated Redis instance used as a job queue between Airflow and external workers |
| Scrapworker | External HTTP ingestion worker. Receives jobs from Scrapredis, fetches data from APIs, and writes raw results to RustFS. Decoupled from Airflow to keep rate limiting and crawl lifecycle outside the orchestration layer |
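The Airflow → Scrapredis → Scrapworker handoff described above amounts to a list-based job queue. Below is a minimal stand-in sketch of that flow; the queue name, payload fields, and LPUSH/BRPOP pattern are assumptions (not the project's actual schema), and a `deque` emulates the live Redis list:

```python
import json
from collections import deque

# Stand-in for the Scrapredis list. The real system would use a Redis
# client (e.g. redis-py LPUSH/BRPOP) against the scrapredis container;
# queue name and payload fields here are illustrative assumptions.
queue = deque()

# Airflow side: enqueue one ingestion job per scheduled slot.
job = {
    "source_url": "https://api.example.com/items",   # hypothetical API
    "target_prefix": "raw/items/2024-01-01",         # hypothetical RustFS key prefix
}
queue.appendleft(json.dumps(job))   # ~ LPUSH scrap:jobs <payload>

# Scrapworker side: pop a job, fetch the URL, write raw bytes to RustFS.
payload = json.loads(queue.pop())   # ~ BRPOP scrap:jobs
print(payload["target_prefix"])
```

Keeping this queue outside Airflow is what lets Scrapworker own rate limiting and crawl lifecycle independently of the scheduler.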
- Cloud VM with 4 vCPUs and 16 GB RAM
- Docker with Compose v2
- sudo access (required for RustFS directory ownership)
- Python >= 3.14 (for Scrapworker, runs on the host)
```bash
# Create the shared Docker network
docker network create data-platform
```
```bash
# Create host directories, set permissions, and download Spark JARs
chmod +x init.sh && ./init.sh
```

Start services in this order (shut down in reverse):
```bash
cd rustfs && docker compose up -d
```

The rustfs-init sidecar runs once after RustFS is healthy and creates the warehouse bucket automatically.
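The sidecar's wait-for-healthy pattern is worth reusing if you script the whole startup sequence. A minimal readiness poll, assuming an HTTP health endpoint such as Nessie's; the URL and timeout are illustrative:

```python
import time
import urllib.request
import urllib.error

def wait_healthy(url: str, timeout_s: int = 60) -> bool:
    """Poll an HTTP health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service not up yet; retry after a short pause
        time.sleep(1)
    return False

# e.g. gate the next startup step on Nessie:
# wait_healthy("http://localhost:9090/q/health")
```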
```bash
cd nessie && docker compose up -d
```

```bash
cd spark
docker compose build   # run once before first start
docker compose up -d
```

```bash
cd scrapredis && docker compose up -d
```

```bash
cd airflow-docker
docker compose build   # run once before first start
docker compose up -d
```

Run once after Nessie is up. Required before triggering any pipeline:
```bash
curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["default"]}'

curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["scraper"]}'
```

| Service | URL | Credentials |
|---|---|---|
| RustFS S3 API | http://localhost:9000 | rustfsadmin / rustfsadmin |
| RustFS Console | http://localhost:9001 | rustfsadmin / rustfsadmin |
| Nessie REST catalog | http://localhost:19120/iceberg | |
| Nessie API | http://localhost:19120/api/v2 | |
| Nessie health | http://localhost:9090/q/health | |
| Spark Master UI | http://localhost:8081 | |
| Spark Worker UI | http://localhost:8082 | |
| Airflow UI | http://localhost:8080 | airflow / airflow |
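A one-shot sweep over a few of the endpoints above can confirm the stack came up. A sketch, with the ports copied from the table and the selection of services arbitrary:

```python
import urllib.request
import urllib.error

# Ports copied from the endpoint table above.
ENDPOINTS = {
    "RustFS S3 API":   "http://localhost:9000",
    "Nessie health":   "http://localhost:9090/q/health",
    "Spark Master UI": "http://localhost:8081",
    "Airflow UI":      "http://localhost:8080",
}

def reachable(url: str) -> bool:
    """True if the URL answers any HTTP response (status code not checked)."""
    try:
        urllib.request.urlopen(url, timeout=2)
        return True
    except urllib.error.HTTPError:
        return True   # server answered, even with a 4xx/5xx
    except (urllib.error.URLError, OSError):
        return False

for name, url in ENDPOINTS.items():
    print(f"{name:16s} {'up' if reachable(url) else 'DOWN'}  ({url})")
```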
All DAGs are paused at creation. Unpause each one in the Airflow UI before triggering.
| DAG | Description |
|---|---|
| spark_static_data_v1_skeleton | Minimal DAG, no Spark. Confirms the Airflow scheduler and worker are healthy |
| spark_static_data_v2_submit | Writes a static dataset to an Iceberg table via Nessie |
| spark_partitioned_data_v1 | Extends spark_static_data_v2_submit with time-based partitioning derived from the scheduled slot |
| scraper_pipeline_v1 | Full ingestion flow via Scrapworker. Requires Scrapworker running (see below) |
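Unpausing can also be scripted against Airflow's stable REST API (`PATCH /api/v1/dags/{dag_id}`), assuming the basic-auth API backend is enabled for the airflow/airflow credentials above. A sketch:

```python
import base64
import json
import urllib.request

def build_unpause_request(dag_id: str) -> urllib.request.Request:
    """Build a PATCH request that flips a DAG's is_paused flag to False."""
    token = base64.b64encode(b"airflow:airflow").decode()
    return urllib.request.Request(
        f"http://localhost:8080/api/v1/dags/{dag_id}",
        data=json.dumps({"is_paused": False}).encode(),
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
        method="PATCH",
    )

req = build_unpause_request("spark_static_data_v1_skeleton")
# urllib.request.urlopen(req)   # uncomment once Airflow is up
```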
Only required for scraper_pipeline_v1. Runs on the host directly (not dockerized):
```bash
cd scrapworker
pip install -e .
CONFIG_PATH=./config/config.local.yaml RUSTFS_ACCESS_KEY=rustfsadmin RUSTFS_SECRET_KEY=rustfsadmin python -m scrapworker
```

Stop in reverse order:
```bash
cd airflow-docker && docker compose down
cd scrapredis && docker compose down
cd spark && docker compose down
cd nessie && docker compose down
cd rustfs && docker compose down
```

To remove all data (irreversible):

```bash
sudo rm -rf data/
```

sudo is required because the RustFS data directories are owned by uid=10001.