
ps-mir/data-platform


Data Platform

Single-node data lake foundation for batch ingestion, running on Docker Compose.

| Component | Role |
|---|---|
| RustFS | S3-compatible object storage; backs all Iceberg table data |
| Nessie | Iceberg REST catalog with git-like versioning |
| Spark | Distributed compute engine; executes PySpark jobs submitted by Airflow |
| Airflow | Pipeline orchestration and scheduling |
| Scrapredis | Dedicated Redis instance used as a job queue between Airflow and external workers |
| Scrapworker | External HTTP ingestion worker. Receives jobs from Scrapredis, fetches data from APIs, and writes raw results to RustFS. Decoupled from Airflow to keep rate limiting and crawl lifecycle outside the orchestration layer |

Prerequisites

  • Cloud VM with 4 vCPUs and 16 GB RAM
  • Docker with Compose v2
  • sudo access (required for RustFS directory ownership)
  • Python >= 3.14 (for Scrapworker, which runs on the host)

One-time setup

# Create the shared Docker network
docker network create data-platform

# Create host directories, set permissions, and download Spark JARs
chmod +x init.sh && ./init.sh

Start Order

Start the services in this order (shut them down in reverse):

1. RustFS

cd rustfs && docker compose up -d

The rustfs-init sidecar runs once after RustFS is healthy and creates the warehouse bucket automatically.

2. Nessie

cd nessie && docker compose up -d

3. Spark

cd spark
docker compose build  # run once before first start
docker compose up -d

4. Scrapredis

cd scrapredis && docker compose up -d

5. Airflow

cd airflow-docker
docker compose build  # run once before first start
docker compose up -d

Bootstrap Namespaces

Run once after Nessie is up. Required before triggering any pipeline:

curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["default"]}'

curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["scraper"]}'

Service URLs

| Service | URL | Credentials |
|---|---|---|
| RustFS S3 API | http://localhost:9000 | rustfsadmin / rustfsadmin |
| RustFS Console | http://localhost:9001 | rustfsadmin / rustfsadmin |
| Nessie REST catalog | http://localhost:19120/iceberg | |
| Nessie API | http://localhost:19120/api/v2 | |
| Nessie health | http://localhost:9090/q/health | |
| Spark Master UI | http://localhost:8081 | |
| Spark Worker UI | http://localhost:8082 | |
| Airflow UI | http://localhost:8080 | airflow / airflow |
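For reference, a Spark session pointed at this stack would use Iceberg REST catalog settings along these lines. This is an illustrative fragment only: the authoritative values live in the spark/ setup in this repo, and the catalog name `nessie` plus the in-network hostnames `nessie` and `rustfs` are assumptions.

```properties
spark.sql.extensions                           org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.nessie                       org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.type                  rest
spark.sql.catalog.nessie.uri                   http://nessie:19120/iceberg/main
spark.sql.catalog.nessie.warehouse             s3://warehouse/
spark.sql.catalog.nessie.io-impl               org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.nessie.s3.endpoint           http://rustfs:9000
spark.sql.catalog.nessie.s3.path-style-access  true
```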

Pipelines

All DAGs are paused at creation. Unpause each one in the Airflow UI before triggering.

| DAG | Description |
|---|---|
| spark_static_data_v1_skeleton | Minimal DAG, no Spark. Confirms the Airflow scheduler and worker are healthy |
| spark_static_data_v2_submit | Writes a static dataset to an Iceberg table via Nessie |
| spark_partitioned_data_v1 | Extends spark_static_data_v2_submit with time-based partitioning derived from the scheduled slot |
| scraper_pipeline_v1 | Full ingestion flow via Scrapworker. Requires Scrapworker running (see below) |

Scrapworker

Only required for scraper_pipeline_v1. Runs on the host directly (not dockerized):

cd scrapworker
pip install -e .
CONFIG_PATH=./config/config.local.yaml RUSTFS_ACCESS_KEY=rustfsadmin RUSTFS_SECRET_KEY=rustfsadmin python -m scrapworker

Stopping

Stop in reverse order:

(cd airflow-docker && docker compose down)
(cd scrapredis && docker compose down)
(cd spark && docker compose down)
(cd nessie && docker compose down)
(cd rustfs && docker compose down)

To remove all data (irreversible):

sudo rm -rf data/

sudo is required because the RustFS data directories are owned by uid 10001.
