Hive currently relies on local eval + self-reported scores (runs are inserted with `verified = FALSE`: src/hive/server/main.py#L339-L362). With a platform-owned verifier layer, leaderboard results could be verified automatically instead of remaining self-reported.
Example of the change at the submit path:
```python
await conn.execute(
    "INSERT INTO runs (id, task_id, parent_id, agent_id, branch, tldr, message, score, verified, verification_status, created_at, fork_id)"
    " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, FALSE, 'pending_verification', %s, %s)",
    (...),
)
await verification_queue.enqueue(task_id=task_id, sha=sha)
```
Example response shape:
```json
{
  "run": {
    "id": "abc1234",
    "score": 0.81,
    "verified": false,
    "verification_status": "pending_verification"
  }
}
```
Verifier flow:
- clone the canonical task repo
- fetch the submitted SHA
- overlay only task-allowed mutable files from the submission
- run the canonical prepare/eval entrypoints
- store the official score, logs/artifacts, and verification status
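The steps above could be sketched roughly as follows. This is a minimal illustration, not Hive's implementation: the `ALLOWED_MUTABLE` whitelist, `run_verification` helper, and the `prepare.py`/`eval.py` entrypoint names are all assumptions.

```python
import fnmatch
import subprocess
from pathlib import Path

# Hypothetical per-task whitelist of files a submission may overwrite.
ALLOWED_MUTABLE = ["solution/*.py", "config.yaml"]


def select_overlay_files(submitted_files, allowed_patterns):
    """Keep only submission files matching the task's mutable whitelist."""
    return [
        f for f in submitted_files
        if any(fnmatch.fnmatch(f, pat) for pat in allowed_patterns)
    ]


def run_verification(task_repo_url, sha, submission_dir, workdir):
    """Clone the canonical repo, fetch the SHA, overlay allowed files, run eval."""
    repo = Path(workdir) / "task"
    subprocess.run(["git", "clone", task_repo_url, str(repo)], check=True)
    subprocess.run(["git", "-C", str(repo), "fetch", "origin", sha], check=True)
    subprocess.run(["git", "-C", str(repo), "checkout", sha], check=True)

    # Overlay only the task-allowed mutable files from the submission.
    submitted = [
        str(p.relative_to(submission_dir))
        for p in Path(submission_dir).rglob("*")
        if p.is_file()
    ]
    for rel in select_overlay_files(submitted, ALLOWED_MUTABLE):
        dest = repo / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes((Path(submission_dir) / rel).read_bytes())

    # Run the canonical entrypoints; script names here are placeholders.
    subprocess.run(["python", "prepare.py"], cwd=repo, check=True)
    result = subprocess.run(
        ["python", "eval.py"], cwd=repo, capture_output=True, text=True, check=True
    )
    return result.stdout  # caller parses the official score and stores artifacts
```

The key property is that the submission never supplies the eval harness itself, only the whitelisted mutable files, so the stored score comes from the canonical entrypoints.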
Suggested initial scope:
- CPU / API-backed tasks first
- GPU / Apple Silicon / other specialized workloads via later backends
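The staged rollout could be expressed as a small backend registry that accepts CPU/API workloads now and rejects specialized ones until a backend lands. The registry keys and backend names below are hypothetical:

```python
# Hypothetical mapping from workload type to verifier backend; None means
# "not yet supported" under the suggested initial scope.
BACKENDS = {
    "cpu": "local-docker",        # initial scope
    "api": "local-docker",        # initial scope
    "gpu": None,                  # later backend
    "apple-silicon": None,        # later backend
}


def pick_backend(workload: str) -> str:
    """Return the verifier backend for a workload, or fail loudly if unsupported."""
    backend = BACKENDS.get(workload)
    if backend is None:
        raise NotImplementedError(f"no verifier backend for {workload!r} yet")
    return backend
```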
If this direction is useful, I’m happy to turn it into a short design PR.