Run GPU-based numerical workloads under Slurm using Apptainer, with a focus on reproducible execution, accounting, and reporting rather than peak performance.
This repository is a continuation of my work on GPU-accelerated iterative solvers
(wgpu_solver_backend) and explores how such workloads behave when placed into a
scheduler-driven environment similar to real HPC systems.
Most examples of GPU compute focus on:
- single-node execution
- ad-hoc Docker containers
- manual GPU access
This repository explores a different question:
What does it take to run a custom GPU compute backend as a scheduled job, with resource accounting, isolation, and reproducibility?
To answer that, this repo demonstrates:
- a minimal Slurm setup with accounting enabled
- Apptainer containers suitable for GPU workloads
- batch job submission for GPU compute
- extraction of usage and billing-style metrics from Slurm
The goal is understanding the system mechanics, not building a production cluster.
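The metrics-extraction step can be sketched as a small pipeline. `sacct` and its field names (`JobID`, `JobName`, `Elapsed`, `AllocTRES`) are real, but the sample rows below are invented so the pipeline runs without a cluster:

```shell
#!/bin/sh
# Sketch: turn pipe-separated `sacct` output into billing-style CSV.
# On a live cluster the input would come from:
#   sacct -X -P -o JobID,JobName,Elapsed,AllocTRES
# The rows below are invented sample data.
printf '%s\n' \
  'JobID|JobName|Elapsed|AllocTRES' \
  '101|solver|00:05:00|billing=8,cpu=8,gres/gpu=1,mem=16G' \
| awk -F'|' 'NR > 1 {
    gpus = 0
    # pull the GPU count out of the AllocTRES string
    if (match($4, /gres\/gpu=[0-9]+/))
        gpus = substr($4, RSTART + 9, RLENGTH - 9)
    print $1 "," $3 "," gpus        # JobID,Elapsed,GPUs
  }' | tee gpu_usage.csv            # -> 101,00:05:00,1
```

The same `awk` stage works unchanged on real `sacct -P` output, which is the point of keeping the export in plain text tools.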
- A small, self-contained Slurm + GPU sandbox
- A way to run `wgpu_solver_backend` as a scheduled GPU job
- A testbed for:
- GPU allocation vs. utilization
- Slurm accounting behavior
- containerized GPU execution
- job-level metrics export (CSV / JSON)
- Not a full HPC cluster
- Not a production-grade deployment
- Not optimized for performance or scale
- Not a replacement for real cluster tooling
Everything here is intentionally minimal and explicit.
- Slurm
  - Controller + compute node
  - Accounting enabled via `slurmdbd` and MariaDB
- Apptainer
  - GPU-enabled runtime image
  - Runs the solver backend without Docker
- `wgpu_solver_backend`
  - Invoked as a batch job
  - Reads binary inputs
  - Writes results and metrics
Slurm is responsible for resource allocation and accounting.
The solver is responsible for numerical work only.
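On the Slurm side, treating the GPU as a schedulable, accounted resource comes down to a few configuration lines. A minimal sketch; the node name, device path, CPU/memory figures, and database host are placeholders, not this repo's actual values:

```
# slurm.conf (fragment) -- names and sizes are placeholders
GresTypes=gpu
AccountingStorageType=accounting_storage/slurmdbd
NodeName=gpunode01 Gres=gpu:1 CPUs=8 RealMemory=32000 State=UNKNOWN
PartitionName=gpu Nodes=gpunode01 Default=YES State=UP

# gres.conf on the compute node -- maps the GRES to a device file
NodeName=gpunode01 Name=gpu File=/dev/nvidia0

# slurmdbd.conf (fragment) -- MariaDB speaks the MySQL protocol
StorageType=accounting_storage/mysql
StorageHost=localhost
```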
- `slurm/` – Slurm configuration files and init scripts
- `apptainer/` – Definition files and runtime setup
- `jobs/` – Example batch scripts for GPU jobs
- `scripts/` – Helpers for exporting usage / billing data
- `docs/` – Notes and experiments during setup
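A definition file for the GPU runtime image might look like the sketch below; the base image, Vulkan package, and binary path are assumptions (wgpu typically reaches the GPU through Vulkan on Linux), not the actual `apptainer/` contents:

```
Bootstrap: docker
From: ubuntu:22.04

%files
    # hypothetical: a prebuilt solver binary copied into the image
    wgpu_solver_backend /opt/solver/wgpu_solver_backend

%post
    apt-get update \
      && apt-get install -y --no-install-recommends libvulkan1 ca-certificates \
      && rm -rf /var/lib/apt/lists/*

%runscript
    exec /opt/solver/wgpu_solver_backend "$@"
```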
- Build Apptainer image with GPU support
- Start Slurm controller + compute node
- Submit a GPU job via `sbatch`
- Run `wgpu_solver_backend` inside Apptainer
- Export:
  - job runtime
  - allocated resources
  - GPU seconds (as Slurm reports them)
- Inspect results and accounting data
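A job script covering the submit-and-run steps above might look like this sketch. The image name `solver.sif`, the binary path, and the `--input`/`--output` flags are all assumptions, not the repo's actual layout. The snippet writes the script to disk so it can be inspected before running `sbatch solver_job.sbatch`:

```shell
#!/bin/sh
# Write a sketch of a GPU batch script. All paths and names are placeholders.
cat > solver_job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=wgpu-solver
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --output=%x_%j.out

# --nv maps the host NVIDIA driver stack into the container;
# the solver flags below are hypothetical
apptainer exec --nv solver.sif \
  /opt/solver/wgpu_solver_backend --input problem.bin --output result.bin
EOF
echo "wrote solver_job.sbatch"
```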
An important observation confirmed by this setup:
- Slurm accounts GPU usage by allocation, not by real utilization
- A job that reserves a GPU but does little work still consumes GPU time
- Fine-grained GPU utilization requires external tooling (outside Slurm)
This repo intentionally exposes that behavior rather than hiding it.
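Because billing follows allocation, GPU-seconds reduce to elapsed wall time multiplied by the number of GPUs reserved, however idle they were. A sketch of that arithmetic, again with invented rows standing in for `sacct -X -P -o JobID,Elapsed,AllocTRES` output:

```shell
#!/bin/sh
# Allocation-based GPU-seconds: elapsed seconds x GPUs reserved,
# regardless of how busy the GPU actually was. Sample rows are invented.
printf '%s\n' \
  '201|00:10:00|gres/gpu=1' \
  '202|01:00:00|gres/gpu=2' \
| awk -F'|' '{
    split($2, t, ":")                      # HH:MM:SS -> seconds
    secs = t[1] * 3600 + t[2] * 60 + t[3]
    gpus = 0
    if (match($3, /gres\/gpu=[0-9]+/))
        gpus = substr($3, RSTART + 9, RLENGTH - 9)
    print $1 ",gpu_seconds=" secs * gpus
  }' | tee gpu_seconds.txt
# -> 201,gpu_seconds=600
# -> 202,gpu_seconds=7200
```

Measuring *real* utilization over the same window would mean sampling something like `nvidia-smi` alongside the job, which is exactly the external tooling the bullet above refers to.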
- `wgpu_solver_backend` – GPU compute backend (PCG + Block-Jacobi, wgpu-based)
- `iterative_solvers` – CPU iterative methods (CG / PCG)
- `colsol` – direct-solver experiments (LDLᵀ / column-style elimination)
- `extended_matrix` – sparse matrix structures and utilities
- `finite_element_method` / `fea_app` – FEM pipeline and problem generation
This repository focuses purely on execution and scheduling.
- End-to-end workflow working
- GPU jobs run correctly under Slurm
- Accounting and metrics export validated
- Intended as a learning and demonstration environment
Further extensions (multi-node, MPI, scaling) are intentionally out of scope.
MIT License.
This project is intended as a learning and experimentation platform, not a production-ready scheduler or billing system.