
undata

A unified neuroscience data element registry that harmonizes schemas across BIDS, NWB, DANDI, openMINDS, and AIND into a searchable, version-tracked system with provenance, mappings, and LinkML interoperability.

Why undata Exists

The Problem

Neuroscience data lives in five major schema ecosystems — BIDS, NWB, DANDI, openMINDS, and AIND — each with its own vocabulary, types, units, and constraints. The same real-world concept (e.g., "subject age") appears differently in each:

| Source | Name        | Type          | Unit  | Representation                                             |
|--------|-------------|---------------|-------|------------------------------------------------------------|
| BIDS   | age         | float         | years | bare numeric field                                         |
| NWB    | age         | string        | —     | ISO 8601 duration string                                   |
| DANDI  | age         | PropertyValue | —     | structured object with BirthReference/GestationalReference |
| AIND   | subject.age | number        | —     | nested JSON Schema property                                |
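To make the fragmentation concrete, here is an illustrative sketch (not taken from the codebase; the helper name and the 365-day-year simplification are assumptions) converting a BIDS-style bare numeric age in years into an NWB-style ISO 8601 duration string:

```python
def bids_age_to_nwb(age_years: float) -> str:
    """Hypothetical helper: BIDS bare numeric age (years) to an
    NWB-style ISO 8601 duration string."""
    whole_years = int(age_years)
    # Express the fractional year as days (365-day year, a simplification)
    days = round((age_years - whole_years) * 365)
    return f"P{whole_years}Y" if days == 0 else f"P{whole_years}Y{days}D"

print(bids_age_to_nwb(25.0))   # → P25Y
print(bids_age_to_nwb(0.25))   # → P0Y91D
```

Each of the five ecosystems would need its own such converter for every shared concept, which is exactly the ad-hoc mapping burden undata removes.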

This fragmentation means:

  • No shared vocabulary — researchers who combine data from multiple sources must build ad-hoc mappings manually, with no systematic way to discover what exists across all sources.
  • Identity conflation — existing systems conflate what a data element is (type, unit, constraints) with where it came from (source name, description). The same concept from two sources looks like two different things; two different concepts with the same name look identical.
  • Invisible transformations — when data moves between formats, there is no record of what changed, who decided it, or whether it's reversible.

The Solution

undata solves these problems with three architectural innovations:

  1. Content-addressed identity — Each data element's identity is the SHA-256 hash of its semantic graph (ontology term + data type + unit + constraints). Same concept from any source = same hash = automatic deduplication. Different type or unit = different hash = distinct element. Identity is stable, dereferenceable, and independent of any source's naming.

  2. Identity ≠ Provenance separation — Each element has one semantic identity block (hashed) and N provenance entries (not hashed). Cross-source elements naturally merge when semantically identical, while preserving full lineage via W3C PROV-O metadata (who ingested it, when, from where, derived from what).

  3. Ingest → Enrich → Align pipeline — Automated extraction from all five sources, followed by embedding-based ontology alignment (each element is embedded as a "{class} {name}: {description}" string and matched against precomputed ontology vectors stored in Parquet) and alias detection with SKOS mapping relations — every step tracked with provenance.
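The first two innovations can be sketched in a few lines. This is a minimal illustration, not the project's actual data model: the field names and the ontology term are hypothetical. The semantic block is canonicalized and hashed, while provenance entries stay outside the hash, so the same concept ingested from two sources collapses to one element:

```python
import hashlib
import json

def element_id(semantic: dict) -> str:
    """Content-addressed identity: SHA-256 of the canonicalized
    semantic block (ontology term, type, unit, constraints)."""
    canonical = json.dumps(semantic, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same semantics from two sources → same hash → automatic deduplication
semantic = {"term": "obo:AGE", "type": "float", "unit": "year",
            "constraints": {"min": 0}}
bids = {"identity": semantic, "provenance": [{"source": "BIDS", "name": "age"}]}
nwb = {"identity": semantic, "provenance": [{"source": "NWB", "name": "age"}]}
assert element_id(bids["identity"]) == element_id(nwb["identity"])

# Different unit → different hash → distinct element
other = dict(semantic, unit="month")
assert element_id(other) != element_id(semantic)
```

Because only the `identity` block feeds the hash, adding a third source's provenance entry to an existing element never changes the element's URI.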

The result is a canonical registry where every neuroscience data concept has a single stable URI, full cross-source provenance, and semantic relationships (exact match, close match, broad/narrow match) to related concepts — making neuroscience data FAIR (Findable, Accessible, Interoperable, Reusable).

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────────┐
│  Ingestion   │────▶│   Backend    │◀────│  Migration API   │
│ (Python CLI) │     │  (FastAPI)   │     │ (FastAPI+Celery) │
└──────────────┘     └──────┬───────┘     └────────┬─────────┘
                            │                      │
                     ┌──────┴───────┐        ┌─────┴─────┐
                     │  PostgreSQL  │        │   Redis   │
                     │  (pgvector)  │        └───────────┘
                     └──────┬───────┘
                            │
                     ┌──────┴───────┐     ┌─────────────┐
                     │   Frontend   │────▶│ Meilisearch │
                     │  (Next.js)   │     └─────────────┘
                     └──────────────┘
| Service       | Port | Description                                                         |
|---------------|------|---------------------------------------------------------------------|
| Backend       | 8002 | Schema registry REST API (elements, schemas, mappings, provenance)  |
| Migration API | 8004 | Async migration execution (pathways, jobs, schema diff)             |
| Frontend      | 3000 | Schema Explorer (search, graph, compare, contribute, migrations)    |
| PostgreSQL    | 5432 | Primary data store with pgvector for embeddings                     |
| Redis         | 6379 | Celery task queue for migration jobs                                |
| Keycloak      | 8080 | OIDC identity provider                                              |
| Meilisearch   | 7700 | Full-text search engine                                             |

Quick Start

# Clone
git clone https://github.com/sensein/undata.git
cd undata

# Start the full stack
cp .env.example .env
docker compose up -d

# Seed sample data
docker compose exec backend bash /app/scripts/seed.sh http://localhost:8002

# Open the frontend
open http://localhost:3000

Project Structure

undata/
├── backend/          # Schema backend REST API (Python 3.14 / FastAPI / SQLAlchemy)
├── migration-api/    # Migration execution API (Python 3.12 / FastAPI / Celery / Redis)
├── frontend/         # Schema Explorer UI (TypeScript / Next.js 15 / React / Cytoscape.js)
├── library/          # Schema library CLI + adapters (Python package — no data files in git)
├── tutorials/        # 7 Jupyter notebooks + JupyterBook site
├── docs/             # Meta-model documentation (LinkML + MkDocs)
├── specs/            # Feature specifications (001-015)
├── scripts/          # Utility scripts (seed data)
├── docker-compose.yml  # Full-stack orchestration (8 services)
└── .github/workflows/  # CI/CD (lint, tests, image builds, Pages deploy)

Development

Backend

cd backend
docker compose up -d          # Start DB + Keycloak
uv sync                       # Install dependencies
uv run alembic upgrade head   # Run migrations
uv run uvicorn src.main:app --reload --port 8002
uv run pytest tests/ -v       # Run tests (276 tests)

Frontend

cd frontend
pnpm install
pnpm dev          # Dev server at http://localhost:3000
pnpm test         # Unit tests (44 tests)
pnpm lint         # ESLint
pnpm build        # Production build

Library

The library CLI handles all ingestion, enrichment, and alignment. Registry output (elements, schemas, transforms) is written to ~/.local/share/undata/registry/ by default (configurable via --output-dir or $UNDATA_REGISTRY_DIR). Output is not committed to git — it's generated data.

cd library
uv sync
uv run undata-library pipeline --source bids   # Full pipeline: ingest → enrich → align → transform
uv run undata-library pipeline --source nwb     # Each source auto-downloads from its upstream repo
uv run undata-library ontology refresh          # Bulk download ontologies from OBO Foundry
uv run undata-library validate-ingestion        # Validate all output
uv run undata-library cache list                # Show cached source downloads

Tutorials

cd tutorials
uv sync
uv run jupyter-book build .     # Build documentation site
open _build/html/index.html     # View locally

CI/CD

| Workflow              | Trigger                 | Purpose                           |
|-----------------------|-------------------------|-----------------------------------|
| lint.yml              | All pushes              | Ruff lint (ingestion + tutorials) |
| frontend.yml          | frontend/ changes       | ESLint + vitest + build           |
| backend-tests.yml     | PRs touching backend/   | PostgreSQL + pytest               |
| build-images.yml      | v* tags                 | Build + push to GHCR              |
| tutorials-site.yml    | main push (tutorials/)  | JupyterBook → GitHub Pages        |
| metamodel-docs.yml    | main push (docs/)       | MkDocs → GitHub Pages             |
| tutorials-offline.yml | tutorials/ changes      | Notebook execution tests          |

Features

18 features implemented across the system:

  1. Neuro Schema Integration — 5 adapters (BIDS, NWB, DANDI, openMINDS, AIND)
  2. Schema Backend — REST API with auth, versioning, unit standardization
  3. Schema Explorer — Search, filter, element detail, relationship graph
  4. Migration API — Async pathway execution with Celery
  5. Schema Enrichment — Validation rules, MRO, provenance, soft-delete
  6. Dual-Path Adapters — Code + file extraction modes
  7. End-to-End Pipeline — Full ingestion + LinkML generation
  8. Schema Import Roundtrip — JSON Schema + LinkML fidelity scoring
  9. Tutorials — 7 interactive Jupyter notebooks
  10. JupyterBook — Rendered tutorial documentation site
  11. Metamodel Provenance — PROV-O JSON-LD, LinkML I/O, mapping accept
  12. Full-Stack Compose — Single-command local development
  13. Migration UI — Pathway browsing, job execution, schema diff
  14. Deployment Pipeline — GHCR images, GitHub Pages, backend CI
  15. undata-library — Standalone flat-file library with CLI
  16. Value Concepts — Categorical values as content-addressed semantic entities
  17. Backend–Library Alignment — Content-addressed element model in backend
  18. Rich Data Model — reproschema alignment, semantic embeddings, ingest→enrich→align pipeline

License

MIT
