
feat(grammar): harvest 10 novel physical_base tokens from imas-codex rotations#16

Merged
Simon-McIntosh merged 14 commits into iterorganization:main from
Simon-McIntosh:w31-harvest-novel-bases
Apr 28, 2026

Conversation

@Simon-McIntosh
Collaborator

Summary

Harvest of 10 novel physical_base tokens from imas-codex W30 rotation evidence.

Context

imas-codex W29 (commits 5f741fc4, aa16a350) made physical_base truly open
in the LLM compose prompt with auto-VocabGap detection running post-LLM. W30B
(edge_plasma_physics rotation) surfaced these candidates organically — the LLM
proposed them without prompting because they fit the underlying physics.

Methodology

  1. Mining: Queried all 830 composed StandardName nodes in the imas-codex
    Neo4j graph, parsed each with imas_standard_names.grammar.parse_standard_name(),
    and collected physical_base tokens not present in the registered vocabulary.

  2. Raw yield: 526 novel base tokens detected (445 genuine + 77 of_ parser
    artifacts + 13 parse failures).

  3. Vetting: Each candidate classified as STRONG (clean physics term, fills
    clear gap), BORDERLINE (needs editorial discussion), or NOISE (LLM artifact,
    synonym of existing token, or grammar misparse). Vetting criteria:

    • Is it a genuine, standalone physics quantity noun?
    • Does it fill a gap not covered by existing tokens (including via
      operator+base decomposition)?
    • Is it used in established plasma physics literature?
    • Does it avoid being a synonym of an existing registered base?

  4. Selection: 10 STRONG candidates included in this PR. ~10 BORDERLINE
    candidates documented below for editorial review.
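
The mining step above can be sketched as follows. This is a minimal illustration, not the script used for the harvest: the PR does not show the return shape of `parse_standard_name()`, so the sketch takes the parser as a callable and uses a hypothetical stand-in (`toy_parse_base`) that only strips a subject prefix. The names and the registered vocabulary in the example are likewise illustrative.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def find_novel_bases(
    names: Iterable[str],
    registered: set[str],
    parse_base: Callable[[str], Optional[str]],
) -> tuple[Counter, list[str]]:
    """Count physical_base tokens that are absent from the registered vocabulary.

    Returns (usage count per novel token, names that failed to parse).
    """
    novel: Counter = Counter()
    failures: list[str] = []
    for name in names:
        base = parse_base(name)
        if base is None:
            failures.append(name)  # parse failure, reported separately
        elif base not in registered:
            novel[base] += 1  # candidate for vetting
    return novel, failures


def toy_parse_base(name: str) -> Optional[str]:
    """Hypothetical stand-in for the real parser: strips a known subject prefix."""
    if not name or " " in name:
        return None
    for subject in ("electron_", "ion_"):
        if name.startswith(subject):
            return name[len(subject):]
    return name


registered = {"temperature", "pressure"}
names = [
    "electron_temperature",
    "ion_pressure_gradient",
    "electron_pressure_gradient",
    "eigenmode_frequency",
    "not a name",
]
novel, failures = find_novel_bases(names, registered, toy_parse_base)
```

The per-token counts correspond to the "Used by" column in the table of added tokens; vetting into STRONG, BORDERLINE, and NOISE remains a manual editorial step.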

Tokens Added (10)

| Token | Kind | Used by | Domains | Notes |
| --- | --- | --- | --- | --- |
| anomalous_current_density | vector | 3 names | edge_plasma | Current density from anomalous/turbulent transport |
| covariant_metric_tensor | tensor | 1 name | MHD | Lower-index metric g_ij; counterpart to existing contravariant_metric_tensor |
| diamagnetic_energy | scalar | 1 name | transport | Plasma stored energy from diamagnetic measurement W_dia |
| distribution_function | scalar | 1 name | edge_plasma | Kinetic distribution function f(x,v); distinct from existing distribution |
| eigenmode_frequency | scalar | 1 name | gyrokinetics | Oscillation frequency of a plasma eigenmode |
| eigenmode_growth_rate | scalar | 1 name | gyrokinetics | Linear growth rate of an unstable eigenmode |
| ionization_potential | scalar | 1 name | edge_plasma | Ionization energy of an atomic species |
| logarithmic_density_gradient | scalar | 1 name | gyrokinetics | d ln(n)/dx, standard gyrokinetic drive parameter (R/Ln) |
| logarithmic_temperature_gradient | scalar | 1 name | gyrokinetics | d ln(T)/dx, standard gyrokinetic drive parameter (R/LT) |
| pressure_gradient | scalar | 1 name | gyrokinetics/MHD | Spatial derivative of pressure ∇p; distinct from pressure_gradient_alpha_parameter |

Borderline Candidates (NOT included in this PR)

| Token | Count | Reason |
| --- | --- | --- |
| particle_source_density | 5 | Likely synonym of existing particle_number_density_source |
| momentum_source_density | 5 | Possibly synonym of existing momentum_source (per-volume implied in transport context) |
| energy_source_density | 4 | Likely covered by existing energy_source / volumetric_energy_source |
| conducted_power | 1 | May decompose as process qualifier + generic power |
| convected_power | 1 | May decompose as process qualifier + generic power |
| wall_temperature | 1 | May decompose as locus wall + generic temperature |
| charge_state | 1 | dominant_charge_state and minimum_charge_state exist; bare form may be too generic |
| plasma_internal_inductance | 1 | internal_inductance exists; may be redundant with subject+base |
| electromagnetic_force_density | 1 | May decompose as modifier + existing force |
| passive_conductor_resistivity | 1 | May decompose as device qualifier + resistivity |

Noise Tokens (excluded)

  • of_* prefix artifacts (77 tokens): parser residue from operator decomposition
  • electric_field_amplitude / electric_field_phase: should be operator+base constructions
  • peak_power_flux: should decompose as maximum operator + power_flux_density
  • Various single-use compound tokens from structural/engineering domains

Test Results

  • Before: 1079 passed, 1 skipped, 68 xfailed
  • After: 1079 passed, 1 skipped, 68 xfailed (identical — vocab additions only)

Cross-repo References

…ase tests

Upgrades the release state-machine CLI to mirror imas-codex's release
command shape:

- --version: explicit version override, bypasses bump computation
- --skip-git: skip git tag creation and push (useful for testing)
- Dirty worktree policy: RC releases warn only, final releases abort
- _check_clean_tree gains strict parameter for RC vs final semantics
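
A minimal sketch of the dirty-worktree policy described above, assuming the check is driven by the output of `git status --porcelain`. The function signature, the error class, and the decision to pass the porcelain text in as an argument are assumptions made for illustration; the actual `_check_clean_tree` is not shown in this PR.

```python
import warnings


class DirtyWorktreeError(RuntimeError):
    """Hypothetical error raised when a final release sees uncommitted changes."""


def check_clean_tree(porcelain_output: str, strict: bool) -> bool:
    """Apply the dirty-worktree policy to `git status --porcelain` output.

    strict=True  (final release): abort on any uncommitted change.
    strict=False (RC release): warn only and carry on.
    Returns True when the tree is clean, False when it is dirty but tolerated.
    """
    dirty = [line for line in porcelain_output.splitlines() if line.strip()]
    if not dirty:
        return True
    message = f"worktree has {len(dirty)} uncommitted change(s)"
    if strict:
        raise DirtyWorktreeError(message)
    warnings.warn(message)
    return False
```

Taking the porcelain text as a parameter keeps the policy testable without a git repository; a caller would obtain it from `subprocess.run(["git", "status", "--porcelain"], capture_output=True, text=True).stdout`.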

Adds comprehensive test suite (36 tests) covering:
- Version parsing, formatting, and bump logic
- State machine transitions: stable→RC (patch/minor/major), RC→RC,
  RC→final, direct release, RC abandon with bump
- Rejection cases: final-from-stable, stable-no-bump, duplicate tag
- CLI integration: dry-run, skip-git, explicit version, status display,
  message required, remote defaults/overrides, dirty worktree policy,
  end-to-end tag creation
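
The core transitions in the list above can be sketched as a small state machine. This is an illustration under stated assumptions, not the CLI's implementation: the `X.Y.ZrcN` version format, the `Version` class, and the function names are hypothetical, and only stable→RC, RC→RC, RC→final, and two of the rejection cases are modelled (direct release, RC abandon, and duplicate-tag detection are left out).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    major: int
    minor: int
    patch: int
    rc: Optional[int] = None  # None marks a stable (final) version

    def __str__(self) -> str:
        core = f"{self.major}.{self.minor}.{self.patch}"
        return core if self.rc is None else f"{core}rc{self.rc}"


def bump(v: Version, part: str) -> Version:
    """Return the stable version obtained by bumping one component."""
    if part == "major":
        return Version(v.major + 1, 0, 0)
    if part == "minor":
        return Version(v.major, v.minor + 1, 0)
    if part == "patch":
        return Version(v.major, v.minor, v.patch + 1)
    raise ValueError(f"unknown bump part: {part!r}")


def next_rc(current: Version, part: Optional[str] = None) -> Version:
    """stable -> RC (needs a bump) or RC -> next RC (no bump)."""
    if current.rc is None:
        if part is None:
            raise ValueError("stable -> RC requires a bump (patch/minor/major)")
        target = bump(current, part)
        return Version(target.major, target.minor, target.patch, rc=1)
    if part is not None:
        raise ValueError("RC abandon with bump is not modelled in this sketch")
    return Version(current.major, current.minor, current.patch, rc=current.rc + 1)


def finalize(current: Version) -> Version:
    """RC -> final; rejected when the current version is already stable."""
    if current.rc is None:
        raise ValueError("cannot cut a final release from a stable version")
    return Version(current.major, current.minor, current.patch)
```

For example, `next_rc(Version(1, 2, 3), "minor")` gives `1.3.0rc1`, a second `next_rc` gives `1.3.0rc2`, and `finalize` then gives `1.3.0`.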
…n 40)

Migrate catalog storage from one-file-per-name nested in physics-domain
directories to one-file-per-physics-domain YAML sequences.

- Loader rejects legacy nested per-file layout with CatalogMigrationError
  (permissive mode preserves single-dict compatibility for tooling).
- ArgumentRef Pydantic model and error_variants field on entry schema.
- Topological ordering includes arguments[].name edges so derived entries
  follow their operands within a domain file.
- Integrity tables track per-entry hash (blake2b of canonical entry YAML)
  so additions, deletions, and modifications are still detectable when
  multiple entries share a file.
- Migrate in-repo example fixtures to new per-domain layout.
- Update test helpers and round-trip tests for semantic equivalence.
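
The topological-ordering bullet above can be illustrated with the standard-library `graphlib` module. This is a sketch only: the entry dictionaries, the entry names, and the choice to ignore operands that live in another domain file are assumptions, and the loader may order entries differently.

```python
from graphlib import TopologicalSorter


def order_entries(entries: list[dict]) -> list[dict]:
    """Order entries so that each one follows the operands it names in arguments[].name."""
    by_name = {entry["name"]: entry for entry in entries}
    sorter: TopologicalSorter = TopologicalSorter()
    for entry in entries:
        operands = [
            arg["name"]
            for arg in entry.get("arguments", [])
            if arg["name"] in by_name  # assumption: skip operands stored in other files
        ]
        # Each operand is registered as a predecessor of the derived entry.
        sorter.add(entry["name"], *operands)
    return [by_name[name] for name in sorter.static_order()]


# Hypothetical entries for one domain file.
entries = [
    {"name": "derived_quantity", "arguments": [{"name": "operand_a"}, {"name": "operand_b"}]},
    {"name": "operand_a"},
    {"name": "operand_b"},
]
ordered_names = [entry["name"] for entry in order_entries(entries)]
```

`static_order()` raises `graphlib.CycleError` if two entries reference each other, which gives a natural place to report circular arguments.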

Part of plan 40 implementation.
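
A minimal sketch of per-entry hashing, to show why it keeps additions, deletions, and modifications detectable when several entries share one file. The commit hashes the canonical entry YAML; this sketch uses sorted-key JSON as a standard-library stand-in for that canonical form, and the function names are hypothetical.

```python
import hashlib
import json


def entry_hash(entry: dict) -> str:
    """blake2b digest of a canonical serialization of one catalog entry."""
    # Assumption: sorted-key JSON stands in for the canonical YAML form.
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.blake2b(canonical.encode("utf-8")).hexdigest()


def diff_entries(old: dict[str, str], new: dict[str, str]) -> dict[str, set[str]]:
    """Compare two {entry name: hash} tables for one domain file."""
    shared = set(old) & set(new)
    return {
        "added": set(new) - set(old),
        "removed": set(old) - set(new),
        "modified": {name for name in shared if old[name] != new[name]},
    }
```

Because each entry carries its own digest, editing one entry in a domain file changes only that entry's hash, whereas a single per-file hash could only report that something in the file changed.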
…ades (plan 41)

- graph/local_graph.py: DiGraph builder over per-domain YAML with 5 edge types
  (HAS_ARGUMENT, HAS_ERROR, HAS_PREDECESSOR, HAS_SUCCESSOR, REFERENCES);
  stub nodes for forward refs; ordering-parent/child closure helpers for
  ancestors/descendants traversal.
- tools/graph.py: 4 FastMCP tools — get_standard_name_neighbours,
  get_standard_name_ancestors, get_standard_name_descendants,
  shortest_standard_name_path. Registered as optional read-only tools
  (gated on networkx availability).
- rendering/catalog.py: Mermaid hierarchy blocks, resolved links,
  cocos_transformation_type emission, per-entry sibling nav
  (Arguments/Wrapped by/Error variants/Deprecates/Superseded by).
- mkdocs.yml: mermaid2 plugin.
- pyproject.toml: [graph-local] networkx extra; mkdocs-mermaid2-plugin in
  docs group.
- AGENTS.md: local-graph module + MCP tools section with edge-convention
  table and HAS_ERROR direction note.
- tests: 27 new tests covering graph build, traversal, MCP tools, and
  renderer output; readonly-server allowlist extended with 4 new tools.
Additions harvested from cross-domain standard-name cycling:

Processes:
- e_cross_b_drift
- heat_viscosity
- ohmic_induction

Subjects (particle classification + polarization):
- trapped, passing, counter_passing, co_current, counter_current
- inertial, sonic
- left_hand_circularly_polarized, right_hand_circularly_polarized
- Add .github/PULL_REQUEST_TEMPLATE.md with required evidence section
  for vocabulary token PRs (N >= 3 distinct DD paths)
- Add Vocabulary Token Policy section to CONTRIBUTING.md with N >= 3
  evidence gate, deprecation rules, and structural exceptions
- Add docs/vocab-retrospective-rc21-rc26.md auditing all 15 tokens
  added between rc21 and rc26 (all pass N >= 3, verdict: keep all)
Add 5 tokens from the imas-codex electromagnetic_wave_diagnostics tier-a
pilot that passed the N>=3 evidence gate (32 VocabGap nodes harvested,
5 eligible, 27 deferred pending Tier B coverage).

Additions:
- physical_bases.yml: diagnostic_latency (N=4), sweep_duration (N=3),
  x1_width (N=3)
- geometry_carriers.yml: x1_coordinate (N=3), x2_coordinate (N=3)

Deferred tokens (N<3) are documented in docs/vocab-retrospective.md.
Grammar model_types.py regenerated via build-grammar.
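
The N>=3 evidence gate used above amounts to a simple filter over distinct DD paths per token. The sketch below is illustrative: the function name and the placeholder path strings are hypothetical, and `example_deferred_token` stands in for the deferred tokens, which this commit message does not name.

```python
MIN_EVIDENCE = 3  # threshold from the Vocabulary Token Policy (N >= 3 distinct DD paths)


def apply_evidence_gate(
    evidence: dict[str, set[str]], threshold: int = MIN_EVIDENCE
) -> tuple[list[str], list[str]]:
    """Split candidate tokens into (eligible, deferred) by distinct-path count."""
    eligible = sorted(t for t, paths in evidence.items() if len(paths) >= threshold)
    deferred = sorted(t for t, paths in evidence.items() if len(paths) < threshold)
    return eligible, deferred


# Placeholder path strings; only the counts mirror the commit message.
evidence = {
    "diagnostic_latency": {"path_1", "path_2", "path_3", "path_4"},  # N=4
    "sweep_duration": {"path_1", "path_2", "path_3"},                # N=3
    "example_deferred_token": {"path_1"},                            # hypothetical, N=1
}
eligible, deferred = apply_evidence_gate(evidence)
```

Storing paths in a set means a token seen several times at the same path still counts once toward N.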
feat: vocab-evidence-gate + rc21-rc26 retrospective
…losure

Add 7 physical_base tokens to close vocabulary gaps identified in W22B
review score analysis:

Class 2 — Physics compound nouns:
  - bootstrap_current_density: j_bootstrap (core_profiles, N=5 IDSs)
  - rotation_frequency: rotation_frequency_tor (core_profiles, N=8+)
  - mach_number: mach_number_parallel (langmuir_probes, N=3)
  - resistivity: wall/*/resistivity (wall, cryostat, N=6)

Class 2 — Viscosity current density compounds:
  - heat_viscosity_current_density: j_heat_viscosity (edge/plasma_profiles, N=4)
  - parallel_viscosity_current_density: j_parallel_viscosity (N=4)
  - perpendicular_viscosity_current_density: j_perpendicular_viscosity (N=4)

All tokens meet the N>=3 evidence gate. Sonic rotation frequency is
not added as a separate base because subject=sonic + base=rotation_frequency
correctly composes to sonic_rotation_frequency via existing grammar.
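
The reasoning for leaving out sonic_rotation_frequency can be sketched as a decomposition check: a candidate base is redundant if it splits into a registered subject plus a registered base. The helper below is hypothetical and models only the subject prefix, not the full grammar.

```python
from typing import Optional


def decompose(
    candidate: str, subjects: set[str], bases: set[str]
) -> Optional[tuple[str, str]]:
    """Return (subject, base) if the candidate is subject + '_' + base, else None."""
    for subject in sorted(subjects):
        prefix = subject + "_"
        remainder = candidate[len(prefix):]
        if candidate.startswith(prefix) and remainder in bases:
            return subject, remainder
    return None


subjects = {"sonic", "inertial"}
bases = {"rotation_frequency", "mach_number"}
split = decompose("sonic_rotation_frequency", subjects, bases)
```

Here `split` is `("sonic", "rotation_frequency")`, so no separate base token is needed, while `decompose("mach_number", subjects, bases)` returns `None` because that name does not start with a registered subject.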
feat(vocab): W23A evidence-gated physical bases for grammar gap closure
Compose-level NC-32 patch in imas-codex (47ed76eb) prevented new
_on_ggd compositions, but the vocab attach pipeline kept resurfacing
pre-existing registry entries with this suffix. W25D + W26B evidence:
13 attach-through names scoring 0.5-0.6 dragged MHD domain mean to
0.645 (YELLOW boundary).

Standard names should be coordinate-system agnostic. The DD ggd/*
path subtree encodes coordinate metadata at the schema level, not the
physics name level.

This PR retires the _on_ggd suffix family by:
- Deleting 4 physical bases with canonical non-GGD twins:
  energy_radial_diffusivity_on_ggd (twin: energy_radial_diffusivity)
  momentum_diffusivity_on_ggd (twin: momentum_diffusivity)
  momentum_radial_diffusivity_on_ggd (twin: momentum_radial_diffusivity)
  particle_radial_diffusivity_on_ggd (twin: particle_radial_diffusivity)
- Deleting on_ggd unary postfix operator from operators.yml
- Regenerating model_types.py and constants.py (ON_GGD enum removed)
- Updating test expectations (2 parametrized entries removed)

All 1106 tests pass. Zero test count delta (removed 2 parametrized
entries, no new tests needed since canonical twins remain).

Cross-references:
- imas-codex commit 47ed76eb (NC-32 compose-level prohibition)
- W26B verdict report (this PR motivation)
- W25D MHD domain rotation evidence
…codex W30 rotations

Adds the following physical_base tokens, all proposed independently by the
imas-codex auto-VocabGap detection mechanism during edge_plasma_physics
rotation (W30B). Each is a clean, well-established physics term that filled
an evident gap in the registry.

  - anomalous_current_density (vector): current density from anomalous/turbulent
    transport; used by 3 names in edge_plasma_physics
  - covariant_metric_tensor (tensor): lower-index metric g_ij, counterpart to
    existing contravariant_metric_tensor; used by 1 name in magnetohydrodynamics
  - diamagnetic_energy (scalar): plasma stored energy from diamagnetic measurement;
    used by 1 name in transport
  - distribution_function (scalar): kinetic distribution function f(x,v) in
    phase space; used by 1 name in edge_plasma_physics
  - eigenmode_frequency (scalar): oscillation frequency of a plasma eigenmode;
    used by 1 name in gyrokinetics
  - eigenmode_growth_rate (scalar): linear growth rate of an unstable eigenmode;
    used by 1 name in gyrokinetics
  - ionization_potential (scalar): ionization energy of an atomic species;
    used by 1 name in edge_plasma_physics
  - logarithmic_density_gradient (scalar): d ln(n)/dx, standard gyrokinetic
    drive parameter; used by 1 name in gyrokinetics
  - logarithmic_temperature_gradient (scalar): d ln(T)/dx, standard gyrokinetic
    drive parameter; used by 1 name in gyrokinetics
  - pressure_gradient (scalar): spatial derivative of pressure, distinct from
    pressure_gradient_alpha_parameter; used by 1 name in gyrokinetics/MHD

Source: imas-codex W29 commit aa16a350 added auto-VocabGap detection;
W30B rotation surfaced these proposals via parse-the-name post-processing.

Verification:
  - Vetted against existing physical_bases.yml — no duplicates
  - All 1079 ISN tests pass (unchanged count)
  - Reviewer-suggested usage in tracked imas-codex StandardName nodes
@imbeauf

imbeauf commented Apr 28, 2026

I guess this is not the final version, but here are a few comments on "Tokens Added".
You may consider changing the word order in some of the proposals, so that the primary physics quantity comes first and the qualifiers follow as suffixes. Although this is less natural in English, it highlights the physics quantity represented and makes it easier to search for a specific quantity in a list. For instance:

  • anomalous_current_density --> current_density_anomalous
  • logarithmic_density_gradient --> density_gradient_logarithmic

@imbeauf

imbeauf commented Apr 28, 2026

What does "kind" mean here? I thought the Standard Names didn't contain information about the number of dimensions of a quantity.

@imbeauf

imbeauf commented Apr 28, 2026

The "Notes" are already quite informative, but I guess they are not the final definition of the Standard Names.
In particular, they don't contain units.

@Simon-McIntosh
Collaborator Author

Thanks Frederic, these comments are useful. The development that you are seeing here relates to our SN vocab (the base name in particular). The generation pipeline is coming on nicely and I should have a set of prototype names with all of their metadata shortly. You raise a good point regarding word order, and I have grappled with this same issue myself. I have made the decision to go with the prefix version for now. What you see above are examples of base names onto which other grammar elements are appended to construct our full names. It will make more sense when you see actual catalog examples. The ordering issue is already addressed: as we store these names in a graph, we can display them relative to their connections, for example with siblings shown adjacent and parents close by. We do not need to rely on alphabetical sorting. The names will all include links, so navigation between them should be simple.

@Simon-McIntosh Simon-McIntosh merged commit 274bbd5 into iterorganization:main Apr 28, 2026
4 checks passed