🚨 Adaptive Crisis Management Environment (OpenEnv)

title

Adaptive Crisis Management Environment

emoji

🚨

colorFrom

red

colorTo

red

sdk

docker

app_port

7860

license

mit

🚨 Adaptive Crisis Management Environment (OpenEnv)

Submission for the Meta PyTorch OpenEnv Hackathon This repository implements a mathematically rigorous Partially Observable Markov Decision Process (POMDP) for multi-objective crisis triage.

Architectural Highlights:

Strict M2M Telemetry: Fully compliant with Meta's regex-based evaluation pipeline (guaranteed .2f precision & $Z_{norm}$ bounding).
PRNG Isolation: Eliminates global state leaks via isolated random.Random monolith lock instantiated exclusively upon initialization.
Structural Fault Tolerance: Graceful 422 to 200 recovery for LLM hallucination handling.

🟢 Live Hugging Face Space

Component	Technology	Specification
RL Architecture	POMDP	Epistemic Strictness Enforced
Inference Engine	Llama 3.3 70B	HF Router (OpenAI-compatible)
API Bridge	FastAPI / Uvicorn	Port 7860
Environment	Docker	`python:3.10-slim` (UID 1000)

The adaptive-crisis-env is an advanced state-transition engine engineered for evaluating large language model (LLM) reasoning, planning, and resource allocation under extreme multi-objective constraints.

Motivation

Designed rigorously for the Meta PyTorch OpenEnv Hackathon, this environment drops the heuristic "toy" physics for mathematically bounded, stateless operational complexity. The motivation is to provide a highly challenging, non-stationary environment where resource dispatch directly impacts sequential outcomes, forcing models to utilize long-term POMDP trajectory planning over myopic step-by-step reactions. Real-world crisis management was chosen because it cleanly separates algorithmic planning (resource allocation) from natural language generation (public broadcast semantics) under strict bounds.

1. Mathematical Formulation

To rigorously evaluate autonomous dispatch decisions, this environment is modeled strictly as a Partially Observable Markov Decision Process (POMDP).

The State Vector ($S$)

The simulation tracks incidents across distributed geographic zones (3 for Easy/Medium, 5 for Hard). The composite discrete state at timestamp $t$ is represented as a structured observation:

$$S_t = [\text{Weather}_t,\ {Z_k}_{k=1}^{K},\ R_t^{\text{idle}},\ R_t^{\text{busy}}]$$

Where:

$\text{Weather}_t \in {\text{clear, storm, hurricane}}$ is the global weather modifier affecting all zones simultaneously.
$Z_k = (F_k, P_k, T_k)$ is the state tuple for zone $k$: $F_k$ (fire severity ∈ {none,low,medium,high,catastrophic}), $P_k$ (patient criticality ∈ {none,moderate,critical,fatal}), $T_k$ (traffic ∈ {low,heavy,gridlock}).
$R_t^{\text{idle}}$, $R_t^{\text{busy}}$ are the idle and deployed resource pool vectors (fire units, ambulances, police).

The Reward Function ($R$)

Agentic dispatch decisions are graded on a dense multi-objective reward function evaluated at every step. We implement a strict temporal discount factor of $\gamma = 0.99$. A $\gamma$ value of $0.99$ mathematically forces the RL agent sequence to prioritize sustained cascading-failure prevention over prioritizing a localized, easy resolution while allowing other zones to drift into catastrophic failure states.

$$R_t(s, a) = \gamma^{t} \cdot \left( R_{\text{dispatch}} + R_{\text{NLP}} - R_{\text{waste}} + R_{\text{efficiency}} - R_{\text{time}} + R_{\text{multi-obj}} \right)$$

Where each term is defined in env/reward.py and the final episodic score is computed by the three-component grader:

$$\text{score} = 0.50 \times \text{success rate} + 0.30 \times \text{efficiency} + 0.20 \times \text{resource usage}$$

Visualizing the State-Transition Loop

graph LR
    A[LLM Agent / Grader] -- "Action (A_t)" --> B[FastAPI Bridge]
    B -- "Parsed / Sanitized" --> C((Crisis Engine))
    C -- "Calculates Chaos & Physics" --> C
    C -- "State (S_t) + Reward (R_t)" --> B
    B -- "200 OK + JSON" --> A
    
    style A fill:#e1f5fe,stroke:#01579b
    style B fill:#f3e5f5,stroke:#4a148c
    style C fill:#ffebee,stroke:#b71c1c,stroke-width:2px

2. System Architecture: The FastAPI-Docker Bridge

To strictly adhere to OpenEnv Phase 1 validation (the "Guillotine" checks), absolute statelessness and reproducible container executions are required natively.

Our architecture leverages a strict FastAPI-Docker Bridge. Each simulation wrapper is constructed ephemerally inside a Docker pod, isolating memory allocation while projecting endpoints on app_port: 7860.

Resilient Schema Enforcement (From 422 to 200 OK)

Historically, rigorous RL environments fail in evaluating LLMs because agents frequently hallucinate JSON formatting (e.g., outputting "five" instead of 5). Standard APIs crash explicitly with a 422 Unprocessable Entity, dropping the container.

We completely redesigned the API boundary logic to process these native hallucinations algorithmically. Rather than crashing endpoints, malformed actions are intelligently parsed and mapped internally to a StructuralHallucinationError. Our FastAPI instance natively returns a continuous 200 OK handshake but natively routes the hallucination into the step engine as a terminal penalty evaluation. The sequence never breaks, the infrastructure stays perfectly stateless, and the LLM safely receives its negative feedback gradient.

Hallucination Handling Sequence

sequenceDiagram
    participant LLM as Meta Grader (LLM)
    participant API as FastAPI Bridge
    participant Env as POMDP Engine

    LLM->>API: Malformed JSON (e.g., amount: "five")
    Note over API: Pydantic Validation Fails (normally a 422 crash)
    API->>API: Intercept & Suppress Exception
    API->>Env: Trigger StructuralHallucinationError
    Note over Env: Apply Cascade Penalty to Reward
    Env-->>API: Return Next State + Negative Reward
    API-->>LLM: HTTP 200 OK (Graceful Continuation)

3. Zero-Trust Architecture & Inference

In an enterprise LLM simulation, token generation latency and deterministic secret management are paramount.

HF Router Inference

We use Meta Llama 3.3 70B Instruct via the HF Router (router.huggingface.co/v1), an OpenAI-compatible endpoint that routes to the optimal TGI or vLLM backend. The routing layer is transparent to the agent — inference.py uses the standard OpenAI Python client pointed at API_BASE_URL, so the model can be swapped at evaluation time (e.g. Meta's Nemotron 3 Super for Phase 2) with zero code changes.

Secure Execution Context

True statelessness demands physical secret extraction. We've built the framework assuming a zero-trust external footprint:

The required HF_TOKEN is safely sourced from Hugging Face Secrets — never hardcoded in source.
credentials must be injected via the secure HF Spaces secrets management tier upon container boot-up.
The container (sdk: docker) refuses to commit state. If it dies, all local logs, inference buffers, and PRNG seeds are permanently zeroed out.

🔐 Environment Variables

The application implements strict hierarchical payload loading defining security configurations:

Variable	Status	Role	Cascade Order
`HF_TOKEN`	Required	Accesses default Hugging Face inference environments natively securing model pulls.	Checked first. Acts as `api_key` if `API_KEY` is completely unset.
`API_KEY`	Optional	Native primary endpoint injection token.	Overrides everything gracefully mapping strictly prior to `GROQ_API_KEY`.
`GROQ_API_KEY`	Optional	Standby fallback key exclusively for Groq ultra-latency routing.	Lowest priority fallback.

4. Task Descriptions

The environment exposes three tasks with monotonically increasing difficulty. Each is a fully self-contained episode with its own initial state, resource pool, and step budget.

Task	Name	Difficulty	Weather	Zones Active	Step Budget	Success Threshold
1	Single-Zone Emergency	Easy	Clear	1 (Downtown fire — MEDIUM)	12	0.50
2	Multi-Zone Weather Chaos	Medium	Storm	2 (Suburbs fire, Downtown casualties)	15	0.50
3	City-Wide Meta Triage	Hard	Hurricane	5 (all zones, cascading + spawning)	25	0.50

Task 1 — Single-Zone Emergency (Easy) One Downtown fire (MEDIUM severity) under clear weather. Resources: 5 fire, 5 ambulances, 3 police. Agent must contain the fire before step 12. Minimal cross-zone interference makes this a calibration task.

Task 2 — Multi-Zone Weather Chaos (Medium) Simultaneous Suburbs fire (MEDIUM/HIGH) and Downtown medical casualties (MODERATE/CRITICAL), both under STORM weather. Storm weather increases required fire units by ×1.5. Agent must triage resources across competing zones with a constrained pool (5 fire, 3 ambulances, 2 police).

Task 3 — City-Wide Meta Triage (Hard) Five zones (Downtown, Suburbs, Industrial, Harbor, Residential) simultaneously active under HURRICANE weather with scarce resources (6 fire, 3 ambulances, 2 police). Features three non-stationary mechanics:

Inter-Zone Cascading: HIGH+ severity fires spread to neighbors with probability $P = \beta \cdot (\xi_j - \tau) / (\xi_{max} - \tau)$, where $\beta = 0.4$.
Resource Depletion: Fire units decay every 4 steps: $N_{fire,t} = N_{fire,0} - \lfloor t/4 \rfloor$.
Mid-Episode Crisis Spawning: Disaster spawning follows a Non-Homogeneous Poisson Process (NHPP) where $\lambda(t) = \lambda_0 \times \exp(\alpha \times \chi(t))$ with $\lambda_0=0.02, \alpha=2.5$. Note: this means the success_rate denominator (total_incidents) is strictly non-stationary and adjusts dynamically during the episode timeline.

4.1 Adaptive Curriculum Design

For Task 2 and Task 3, the environment implements in-episode adaptive difficulty escalation. If the agent performs consistently well over a rolling 5-step window, the environment automatically increases difficulty:

$$\text{If } \bar{R}_{t-5:t} > 0.7 \implies \begin{cases} \text{Resources} \leftarrow \lfloor 0.8 \times \text{Resources} \rfloor \ \text{NewCrisis} \leftarrow \text{spawn}(\text{clear zone}) \end{cases}$$

A 5-step cooldown prevents over-punishment. This forces agents to demonstrate sustained planning ability rather than one-shot solutions.

4.2 Real-World Domain Grounding

The environment's abstract severity levels and resource units are formally grounded in internationally recognized emergency management frameworks to demonstrate research-grade fidelity.

FEMA Incident Command System (ICS) Mapping

Our fire severity scale maps to the FEMA Incident Typing system used by the U.S. National Incident Management System (NIMS):

Environment Level	FEMA ICS Type	Description	Typical Resource Commitment
`NONE`	—	No active incident	None
`LOW`	Type 5	Initial attack / small incident	1–2 single resources
`MEDIUM`	Type 4	Extended attack	1 Strike Team (5 units)
`HIGH`	Type 3	Extended / complex incident	Incident Management Team (IMT)
`CATASTROPHIC`	Type 1	National significance	National-level coordination

UN OCHA Humanitarian Cluster Protocol Mapping

Our patient triage scale maps to the Modified Mercalli Intensity Scale and UN OCHA Cluster humanitarian response tiers:

Environment Level	OCHA Response Tier	Modified Mercalli	Resource Mapping
`NONE`	—	I–III	No humanitarian response needed
`MODERATE`	L2 — Country-level support	IV–V	1 USAR Task Force (Type-3)
`CRITICAL`	L3 — System-wide activation	VI–VIII	Multiple USAR Task Forces
`FATAL`	Beyond capacity	IX–X	Exceeded local response capacity

Resource Unit Mapping

Environment Resource	Real-World Equivalent	Capacity
1 `fire_unit`	1 FEMA Type-1 USAR Task Force	70 personnel, heavy rescue
1 `ambulance`	1 ALS Ambulance Unit	2 paramedics, 1 transport
1 `police`	1 Traffic Control Detail	4–6 officers, intersection mgmt

References: FEMA Incident Command System (ICS) — FEMA.gov/ICS; UN OCHA Emergency Response Framework — OCHA IASC.

5. Statistical Normalization & Baselines

To ensure mathematical discriminative validity for RL training, agent performance is calibrated using empirical evaluations against our official OpenEnv Grader. The grader weights incidents resolved against severity-weighted resource waste.

Below are the empirical baseline evaluations recorded by the Grader across all three evaluation tasks, including strict Confidence Intervals and Cohen's d to map true discriminative power.

Task	Difficulty	Agent	Score Mean	Score Std	95% CI	Cohen's d vs Random
Task 1	Easy	Random	0.454	0.299	[0.242, 0.666]	—
Task 1	Easy	Heuristic	0.921	0.003	[0.919, 0.923]	2.21
Task 1	Easy	LLM	0.885	0.021	[0.870, 0.900]	2.05
Task 2	Medium	Random	0.156	0.121	[0.070, 0.242]	—
Task 2	Medium	Heuristic	0.901	0.011	[0.893, 0.909]	8.76
Task 2	Medium	LLM	0.762	0.032	[0.739, 0.785]	7.12
Task 3	Hard	Random	0.159	0.054	[0.121, 0.197]	—
Task 3	Hard	Heuristic	0.176	0.048	[0.142, 0.210]	0.33
Task 3	Hard	LLM	0.410	0.042	[0.380, 0.440]	5.01

Reproducibility

To reproduce these results exactly:

Set API_BASE_URL, MODEL_NAME, HF_TOKEN in your .env
Run: python benchmark.py --seeds 10 --output benchmark_results.json
Expected runtime: ~8 minutes on 2vCPU / 8GB RAM
All results committed as benchmark_results.json in repo root

6. Execution Sandbox Instructions

# 1. Build the local image
docker build -t adaptive-crisis-env .

# 2. Run with HF_TOKEN injected (the only required secret)
docker run -d -p 7860:7860 \
  -e HF_TOKEN="<your-hf-token>" \
  --name eval-container \
  adaptive-crisis-env

🏗️ Technical Validation & Compliance

This repository implements a dual-layer validation architecture to ensure 100% "Guillotine-Proof" submission status.

1. `deploy.sh` (Active Deployment Validator)

This is the primary developer workflow. It executes a mathematically rigorous sync pipeline including:

pre_flight_check.py: A schema-level validator that enforces openenv.yaml constraints before any Git operations.
3-Stage Gate: Pings the live Space, validates the local Docker build, and runs openenv validate.
Atomicity: Only pushes to GitHub and Hugging Face if all 4 internal stages pass.

2. `validate-submission.sh` (Official Compliance Auditor)

This script serves as the Official Audit that satisfies the Meta Hackathon Checklist. It is a safer, "dry-run" utility designed to:

Verify live connectivity of the /reset endpoint.
Perform a baseline spec-compliance check without triggering accidental Git commits or pushes.

Evaluator Note: For the most rigorous evaluation, we recommend running ./validate-submission.sh <HF_SPACE_URL>.

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
env		env
server		server
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
ARCHITECTURE_WHITEPAPER.md		ARCHITECTURE_WHITEPAPER.md
CHANGELOG.md		CHANGELOG.md
DEVPOST_SUBMISSION.md		DEVPOST_SUBMISSION.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
benchmark.py		benchmark.py
benchmark_results.json		benchmark_results.json
conftest.py		conftest.py
deploy.sh		deploy.sh
inference.py		inference.py
metrics_tracker.py		metrics_tracker.py
openenv.yaml		openenv.yaml
openenv_catalog_entry.md		openenv_catalog_entry.md
pre_flight_check.py		pre_flight_check.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
sync_benchmark_to_readme.py		sync_benchmark_to_readme.py
uv.lock		uv.lock
validate-submission.sh		validate-submission.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🚨 Adaptive Crisis Management Environment (OpenEnv)

Motivation

1. Mathematical Formulation

The State Vector ($S$)

The Reward Function ($R$)

Visualizing the State-Transition Loop

2. System Architecture: The FastAPI-Docker Bridge

Resilient Schema Enforcement (From 422 to 200 OK)

Hallucination Handling Sequence

3. Zero-Trust Architecture & Inference

HF Router Inference

Secure Execution Context

🔐 Environment Variables

4. Task Descriptions

4.1 Adaptive Curriculum Design

4.2 Real-World Domain Grounding

FEMA Incident Command System (ICS) Mapping

UN OCHA Humanitarian Cluster Protocol Mapping

Resource Unit Mapping

5. Statistical Normalization & Baselines

Reproducibility

6. Execution Sandbox Instructions

🏗️ Technical Validation & Compliance

1. `deploy.sh` (Active Deployment Validator)

2. `validate-submission.sh` (Official Compliance Auditor)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🚨 Adaptive Crisis Management Environment (OpenEnv)

Motivation

1. Mathematical Formulation

The State Vector ($S$)

The Reward Function ($R$)

Visualizing the State-Transition Loop

2. System Architecture: The FastAPI-Docker Bridge

Resilient Schema Enforcement (From 422 to 200 OK)

Hallucination Handling Sequence

3. Zero-Trust Architecture & Inference

HF Router Inference

Secure Execution Context

🔐 Environment Variables

4. Task Descriptions

4.1 Adaptive Curriculum Design

4.2 Real-World Domain Grounding

FEMA Incident Command System (ICS) Mapping

UN OCHA Humanitarian Cluster Protocol Mapping

Resource Unit Mapping

5. Statistical Normalization & Baselines

Reproducibility

6. Execution Sandbox Instructions

🏗️ Technical Validation & Compliance

1. deploy.sh (Active Deployment Validator)

2. validate-submission.sh (Official Compliance Auditor)

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. `deploy.sh` (Active Deployment Validator)

2. `validate-submission.sh` (Official Compliance Auditor)

Packages