1,627 changes: 1,627 additions & 0 deletions dev/bench/hash_wrap_repro/lib/Hash/Wrap.pm


21 changes: 21 additions & 0 deletions dev/bench/results/baseline-078e0b3d7.json
{
"git_sha": "078e0b3d7",
"date": "2026-04-21T21:17:48Z",
"runs": 3,
"jperl": "/Users/fglock/projects/PerlOnJava3/jperl",
"perl": "perl",
"perl_version": "5.042000",
"benchmarks": {
"benchmark_anon_simple": { "unit": "s", "jperl": [7.149,7.020,7.213], "perl": [1.435,1.454,1.427] },
"benchmark_closure": { "unit": "s", "jperl": [8.784,9.783,9.768], "perl": [8.108,7.961,7.877] },
"benchmark_eval_string": { "unit": "s", "jperl": [14.766,14.777,14.365], "perl": [3.135,3.164,3.276] },
"benchmark_global": { "unit": "s", "jperl": [14.608,14.579,14.720], "perl": [10.993,11.063,9.400] },
"benchmark_lexical": { "unit": "s", "jperl": [4.059,4.010,3.989], "perl": [10.589,10.581,10.441] },
"benchmark_method": { "unit": "s", "jperl": [2.620,2.537,2.607], "perl": [1.456,1.490,1.511] },
"benchmark_refcount_anon": { "unit": "s", "jperl": [1.792,1.807,1.776], "perl": [0.455,0.447,0.443] },
"benchmark_refcount_bless": { "unit": "s", "jperl": [1.293,1.305,1.311], "perl": [0.197,0.198,0.197] },
"benchmark_regex": { "unit": "s", "jperl": [2.732,2.719,2.701], "perl": [1.974,2.005,2.006] },
"benchmark_string": { "unit": "s", "jperl": [4.131,4.025,4.066], "perl": [6.887,6.867,6.977] },
"life_bitpacked": { "unit": "Mcells/s", "jperl": [8.21,8.12,8.28], "perl": [20.99,20.58,20.70] }
}
}
23 changes: 23 additions & 0 deletions dev/bench/results/baseline-078e0b3d7.md
# Benchmark baseline — 078e0b3d7

**Date:** 2026-04-21T21:17:48Z
**Runs per benchmark:** 3
**jperl:** `/Users/fglock/projects/PerlOnJava3/jperl`
**perl:** `perl` (5.042000)

For "time" benches lower = faster; ratio is `jperl / perl`.
For "Mcells/s" (life_bitpacked) higher = faster; ratio is `perl / jperl`.
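
For concreteness, each table row is derived from the raw runs in the JSON baseline. This is a minimal sketch assuming the aggregation is a plain mean of the 3 runs (which matches the published figures); `RatioSketch` is an illustrative name, not part of the bench tooling:

```java
// Reproduces one row of the table below from baseline-078e0b3d7.json.
final class RatioSketch {
    static double mean(double... xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    public static void main(String[] args) {
        // benchmark_anon_simple runs from the JSON baseline
        double jperl = mean(7.149, 7.020, 7.213);   // 7.127
        double perl  = mean(1.435, 1.454, 1.427);   // 1.439
        System.out.printf("%.2f×%n", jperl / perl); // prints 4.95×
    }
}
```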

| Benchmark | unit | jperl | perl | ratio | parity? |
|---|---|---:|---:|---:|:---:|
| `benchmark_anon_simple` | s | 7.127 | 1.439 | **4.95×** | ❌ |
| `benchmark_closure` | s | 9.445 | 7.982 | **1.18×** | ≈ |
| `benchmark_eval_string` | s | 14.636 | 3.192 | **4.59×** | ❌ |
| `benchmark_global` | s | 14.636 | 10.485 | **1.40×** | ❌ |
| `benchmark_lexical` | s | 4.019 | 10.537 | **0.38×** | ✅ |
| `benchmark_method` | s | 2.588 | 1.486 | **1.74×** | ❌ |
| `benchmark_refcount_anon` | s | 1.792 | 0.448 | **4.00×** | ❌ |
| `benchmark_refcount_bless` | s | 1.303 | 0.197 | **6.61×** | ❌ |
| `benchmark_regex` | s | 2.717 | 1.995 | **1.36×** | ❌ |
| `benchmark_string` | s | 4.074 | 6.910 | **0.59×** | ✅ |
| `life_bitpacked` | Mcells/s | 8.203 | 20.757 | **2.53×** | ❌ |
102 changes: 102 additions & 0 deletions dev/design/classic_experiment_finding.md
# JPERL_CLASSIC experiment — cumulative-tax hypothesis confirmed

**Branch:** `perf/perl-parity-phase1` @ 3c2ca4b6a + CLASSIC gate patches (4 files)
**Date:** 2026-04-18
**Hypothesis:** The master→branch regression (1.67× on life_bitpacked) is NOT attributable to any single hot method. It is the cumulative cost of many small taxes added by the refcount/walker/weaken/DESTROY machinery, each individually invisible in a profile.

## Test

Added `JPERL_CLASSIC` env var (read once at class-init into a `static final boolean`). When set, short-circuits the branch's added machinery to near-master behavior:

| Site | CLASSIC behavior |
|---|---|
| `MortalList.active` | `false` — every `deferDecrement*` / `scopeExitCleanup{Hash,Array}` / `mortalizeForVoidDiscard` early-returns |
| `EmitStatement.emitScopeExitNullStores` Phase 1 (`scopeExitCleanup` per scalar) | Not emitted |
| `EmitStatement.emitScopeExitNullStores` Phase 1b (cleanupHash/Array) | Not emitted |
| `EmitStatement.emitScopeExitNullStores` Phase E (`MyVarCleanupStack.unregister`) | Not emitted |
| `EmitStatement.emitScopeExitNullStores` Phase 3 (`MortalList.flush`) | Not emitted |
| `EmitVariable` `MyVarCleanupStack.register` on every `my` | Not emitted |
| `MyVarCleanupStack.register` / `unregister` | Early-return |
| `RuntimeScalar.scopeExitCleanup` | Early-return |
| `RuntimeScalar.setLargeRefCounted` | Direct field assignment, skipping refcount/WeakRefRegistry/MortalList work |

Correctness: CLASSIC breaks DESTROY, weaken, walker semantics — only useful for measurement, not shipping.
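The gate pattern is simple enough to sketch. This is an illustrative reconstruction, not the actual PerlOnJava source: the class shape, field names, and `deferDecrement` body are assumptions; only the read-once `static final` idiom and the early-return are what the experiment describes.

```java
import java.util.ArrayList;

// Illustrative sketch of the JPERL_CLASSIC gate (not the real PerlOnJava code).
// The env var is read exactly once into a static final, so the JIT can treat
// every `if (CLASSIC)` branch as a constant and fold the dead path away.
final class MortalListSketch {
    static final boolean CLASSIC = "1".equals(System.getenv("JPERL_CLASSIC"));

    private final ArrayList<Object> pending = new ArrayList<>();

    // Branch machinery: defer a refcount decrement until scope-exit flush.
    void deferDecrement(Object referent) {
        if (CLASSIC) return;   // experiment mode: near-master behavior, no tracking
        pending.add(referent);
    }

    int pendingCount() {
        return pending.size();
    }
}
```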

## Result — life_bitpacked

`./jperl examples/life_bitpacked.pl -r none -g 500`, 5 runs each, median:

| Mode | Runs (Mcells/s) | Median |
|---|---|---:|
| Baseline (branch machinery on) | 8.58 / 8.51 / 8.49 / 8.51 / 8.45 | **8.51** |
| `JPERL_CLASSIC=1` | 14.18 / 14.60 / 14.14 / 13.32 / 13.77 | **14.18** |
| System perl (reference) | — | 20.8 – 21.5 |
| Master @ pre-merge (reference) | — | 14.0 |

**Speedup: 14.18 / 8.51 = 1.666×**, essentially recovering master's pre-merge number.

## Result — benchmark_lexical (simple, no refs)

`./jperl dev/bench/benchmark_lexical.pl`, 3 runs each:

| Mode | Runs (iters/s) | Median |
|---|---|---:|
| Baseline | 313484 / 329270 / 314172 | **314172** |
| `JPERL_CLASSIC=1` | 357144 / 347743 / 359080 | **357144** |

**Speedup: 1.14×**

Even on a workload with no references and no blesses, the `my`-variable register/unregister emissions and scope-exit cleanup emissions cost ~14%.

## Interpretation

The hypothesis is definitively confirmed:

1. **The master→branch perf gap is recoverable in full** (1.67× on the most ref-heavy workload) by gating the added machinery.
2. **No single site is the bottleneck.** Phase 1 (MortalList.flush) alone was worth 0.7%. Phase 2's pristine-args stub alone was worth 0%. The 1.67× comes from ~a dozen sites each contributing 2–10%.
3. **The taxes are broadly distributed across the scope-exit / variable-declaration / reference-assignment paths.** Even workloads that never exercise DESTROY/weaken pay them.

## Implication for the plan

The piecewise Phase 2'/3'/4' approach was the wrong framing. The right structural fix:

**Make the machinery per-object-opt-in, not always-on.** Perl 5's design: `SvREFCNT_inc` is free for most SVs because the type tag gates the work. Only objects that need refcount tracking pay the cost.

Concrete proposal (call it Phase R — "refcount by need"):

1. Add a single `needsCleanup` bit to `RuntimeBase`, default `false`.
2. Set it to `true` only when:
- The object is blessed into a class that has `DESTROY`, OR
- The object is targeted by `Scalar::Util::weaken`, OR
- The object is captured by a CODE ref whose refCount we need to track for cycle break.
3. Every CURRENT-BRANCH fast-path site becomes `if (!needsCleanup) return <classic behavior>;`:
- `setLargeRefCounted` → direct assignment if neither side needs cleanup
- `scopeExitCleanup` → no-op if scalar's value doesn't need cleanup
- `MyVarCleanupStack.register` → skip if the var's referent doesn't need cleanup
- `MortalList.deferDecrement*` → skip if referent doesn't need cleanup
- `scopeExitCleanupHash/Array` → skip if container has no needsCleanup descendants

With per-object gating, life_bitpacked (zero blessed objects, zero weaken) pays zero tax and runs at ~14 Mc/s. DBIx::Class / txn_scope_guard / destroy_eval_die (objects that DO need cleanup) still work correctly.
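The per-site gate could take roughly this shape. A sketch under stated assumptions: `RuntimeBase` and `setLargeRefCounted` are names from this plan, but the bodies and class scaffolding here are hypothetical, not the actual implementation:

```java
// Hypothetical sketch of the Phase R "refcount by need" gate.
abstract class RuntimeBaseSketch {
    // Default false: most values never pay for refcount/DESTROY/weaken tracking.
    boolean needsCleanup = false;
}

final class RuntimeScalarSketch extends RuntimeBaseSketch {
    Object value;

    void setLargeRefCounted(RuntimeBaseSketch newValue) {
        if (newValue == null || !newValue.needsCleanup) {
            value = newValue;   // classic fast path: plain field store
            return;
        }
        // Slow path, only for blessed-with-DESTROY / weakened / tracked values:
        // refcount bookkeeping, WeakRefRegistry, MortalList arming would go here.
        value = newValue;
    }
}
```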

This is a **significant refactor** — every site listed above needs a cheap gate check. But:

- The CLASSIC experiment has already implemented those gate checks (just globally rather than per-object). Most of the code is the early-return condition.
- The JIT will fold the `needsCleanup == false` check away to almost nothing once it sees a type-stable call site.
- Correctness is easier to reason about than the current "always-tracked" design, because the gate explicitly matches the semantic condition that requires tracking.

## Files touched in this experiment

```
src/main/java/org/perlonjava/runtime/runtimetypes/MortalList.java (+CLASSIC flag, active init)
src/main/java/org/perlonjava/runtime/runtimetypes/MyVarCleanupStack.java (register/unregister early-return)
src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java (setLargeRefCounted + scopeExitCleanup early-return)
src/main/java/org/perlonjava/backend/jvm/EmitStatement.java (4 emission sites gated)
src/main/java/org/perlonjava/backend/jvm/EmitVariable.java (register emission gated)
```

## Next step

Either:
1. **Commit the CLASSIC gate** as a measurement tool on `perf/perl-parity-phase1` (doesn't ship to users; helps future perf work A/B the full-feature cost).
2. **Move directly to Phase R** (per-object `needsCleanup` bit) based on this evidence, using the CLASSIC gate sites as the map of what needs per-object gating.
3. **Revert** the CLASSIC gate and keep this document as the finding.
211 changes: 211 additions & 0 deletions dev/design/hash_wrap_triage_plan.md
# Hash::Wrap `t/as_return.t` — GC-thrash / infinite-loop triage plan

**Status**: Investigation in progress. PR #536 blocked until this class of failure is resolved.

## Scope

Hash::Wrap's `t/as_return.t` (45 lines) and DBIx::Class exhibit the same class of failure: extremely high CPU + memory, no apparent forward progress, wallclock >> real-Perl expectation. User-visible symptom is "stuck" or "timeout".

This plan picks Hash::Wrap as the minimal reproducer (tight CPAN test, independent of DBIC fixtures).

## Observations (2026-04-23)

### Reproducer captured
```
/Users/fglock/projects/PerlOnJava3/dev/bench/hash_wrap_repro/
t/as_return.t # 45 lines, copied from Hash-Wrap-1.09
lib/Hash/Wrap.pm # upstream pure-Perl
```

Invoke:
```bash
cd dev/bench/hash_wrap_repro
timeout 30 ../../../jperl -Ilib t/as_return.t
```

Baseline: at 15 s of wallclock the main thread has used 13 s of CPU (~89 % of one core — **not** GC-thrash on my machine). On the user's original machine the same code saturates 11+ cores; the GC amplification differs with machine and load, but the correctness-level reproducer is identical.

### First bug localised: `B::NULL::next` self-loop

`jstack` on the stuck process shows the inner loop is:

```
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:952)
NameNormalizer.normalizeVariableName(NameNormalizer.java:144)
InheritanceResolver.findMethodInHierarchy(InheritanceResolver.java:310)
Universal.can(Universal.java:175)
RuntimeCode.callCached(RuntimeCode.java:1780)
anon1485.apply(Test2/Util/Sub.pm:577) <-- $op->can('line') / $op->can('next')
```

Tracing upward: `Test2::Util::Sub::sub_info` walks the OP tree:

```perl
my $op = $cobj->START;
while ($op) {
push @all_lines => $op->line if $op->can('line');
last unless $op->can('next'); # <- termination check
$op = $op->next;
}
```

PerlOnJava's `src/main/perl/lib/B.pm` has:

```perl
package B::NULL {
our @ISA = ('B::OP');
sub new { bless {}, shift }
sub next {
# NULL is terminal -- return self to prevent infinite loops
return $_[0];
}
}
```

**The comment is inverted.** Returning `$_[0]` keeps `$op` as the same B::NULL forever:

* `$op->can('line')` → true (inherited from B::OP)
* `$op->can('next')` → true (inherited from B::OP)
* `$op = $op->next` → same B::NULL
* Loop never exits, `@all_lines` grows unboundedly → GC pressure once array outgrows young gen → user sees the 13 GC threads + 25 % useful CPU.

Hash::Wrap trips this because Test2's structural compare (`meta { prop ... object { call ... } }`) calls `sub_info` on every comparison callback — one infinite loop per check.

DBIx::Class likely trips the same path (its test suite also uses Test2 deep compare, and DBIC itself uses Sub::Defer / B introspection heavily).
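The termination difference is easy to see in a tiny Java analogue of the walker loop above (names and the safety limit are illustrative, not PerlOnJava code): a terminal node whose `next()` returns itself spins forever, while returning null — the analogue of the proposed undef fix — lets `while (op != null)` exit.

```java
final class OpWalkSketch {
    interface Op { Op next(); }

    // Walks the chain like Test2::Util::Sub::sub_info does; `limit` stands in
    // for the unbounded @all_lines growth that becomes GC pressure in practice.
    static int countOps(Op start, int limit) {
        int n = 0;
        Op op = start;
        while (op != null && n < limit) { n++; op = op.next(); }
        return n;
    }

    public static void main(String[] args) {
        Op selfLoop = new Op() { public Op next() { return this; } }; // buggy B::NULL
        Op terminal = () -> null;                                     // fixed B::NULL
        System.out.println(countOps(selfLoop, 1000)); // hits the safety limit: 1000
        System.out.println(countOps(terminal, 1000)); // terminates after 1 node: 1
    }
}
```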

### Fix for the immediate infinite loop

Replace `B::NULL::next` with a sentinel that actually terminates the common walker patterns:

```perl
package B::NULL {
our @ISA = ('B::OP');
sub new { bless {}, shift }

# Every method call on B::NULL returns undef (matches real Perl XS).
# Crucially, `$null->next` returning undef terminates while($op) loops.
sub next { return; }
sub line { return; }
# `can('next')` still returns true via B::OP inheritance; the
# caller's `$op = $op->next` sets $op to undef and while($op) exits.
}
```

Before landing: audit other B.pm sentinel methods (`sibling`, `targ`, `sibparent`, `first`, `last`, etc.) for the same mistake.

## Why this is sufficient for Hash::Wrap but not the full class of problem

The B::NULL fix makes `sub_info` terminate on first invocation. Once it's terminating:

1. The test proceeds into the actual structural compare.
2. Every `is($obj, meta { ... })` still allocates deep `Test2::Compare::Delta` trees.
3. Each Delta node is a blessed hashref → traverses `RuntimeScalar.setLargeRefCounted`, `MortalList.deferDecrement*`, walker arming etc.
4. This is the *real* distributed-tax problem we already confirmed in Phase R.

With just the B::NULL fix, Hash::Wrap completes but still runs an order of magnitude slower than real Perl. That may be acceptable for the test-to-pass gate; it is not acceptable for "perf parity". The full plan below addresses both.

## Plan

Four phases. Each phase has an explicit measurement gate before moving to the next.

### Phase 0 — Unblock the test (same-day)

1. **Fix `B::NULL::next`** and audit other B.pm sentinels (see above).
2. Run Hash::Wrap `t/as_return.t` and `DBIx-Class-0.082844-68/t/storage/base.t` to completion. Record wallclock, CPU ratio, allocation rate via JFR.
3. Acceptance: both complete in finite time, produce TAP with actual pass/fail rather than timeouts. (Pass/fail counts themselves can still regress — that's Phase 1-3 territory.)
4. Commit the fix on `perf/phase-r-needs-cleanup`.

**Risk**: very low. Change is localised to the B.pm shim. Regression surface: code that relied on `$null->next == $null` for some iteration invariant. No known such code.

### Phase 1 — Establish allocation baseline

Goal: turn "slow under GC" from hand-wave into numbers.

1. JFR run on Hash::Wrap `t/as_return.t`:
```
JPERL_OPTS="-XX:+FlightRecorder -XX:StartFlightRecording=\
filename=dev/bench/results/jfr/hash_wrap.jfr,\
settings=profile,duration=60s" \
./jperl -Ilib t/as_return.t
```
Capture `jdk.ObjectAllocationSample` + `jdk.ObjectAllocationInNewTLAB` + `jdk.GCHeapSummary`.

2. Same run with `JPERL_CLASSIC=1` for the upper bound.

3. Top allocators (top 10 by bytes): expected candidates are `RuntimeScalar`, `RuntimeHash`, `RuntimeArray`, `MortalList$Entry`, Test2 Delta/Check/Meta classes (pure Perl packages compiled to our anon classes). Record exact numbers in `dev/design/hash_wrap_alloc_profile.md`.

4. GC metric deltas: young-gen pause %, old-gen promotions/sec, total GC time as % of wallclock. If CLASSIC drops GC time from e.g. 60 % to 10 %, we know our machinery is the allocation driver; if GC stays high under CLASSIC, the allocation source is non-PerlOnJava (upstream Test2 / Hash::Wrap pattern itself).

**Acceptance gate**: an allocation profile committed under `dev/bench/results/` that clearly identifies the top 3 allocation sites contributing >60 % of bytes.

### Phase 2 — Reduce allocation at the top-3 sites

This is concrete engineering work whose scope depends on Phase 1's findings. Candidate targets based on prior profiling work:

| Candidate | Already known from | Expected impact |
|---|---|---|
| `RuntimeList.add` → `ArrayList.grow` from initial capacity 10 | `life_bitpacked_jfr_profile.md` | 5–14 % on life_bitpacked |
| `MortalList.pending` growth (same `ArrayList.grow` pattern) | `classic_experiment_finding.md` (implicit) | varies with callsite density |
| Per-`my` `MyVarCleanupStack.register` list add | Phase R measured | already captured in `1.49×` |
| Intermediate `RuntimeScalar(integer)` boxing in comparison callbacks | `life_bitpacked_jfr_profile.md` (via `RuntimeScalarCache.getScalarInt`) | unknown for Test2 workload |

For each chosen target:

1. Minimal hack that short-circuits the allocation (even if broken) — upper-bound measurement.
2. If upper bound ≥ 5 % wallclock improvement, implement cleanly.
3. If < 5 %, document and move on (Phase 1 Lessons Learned rule).

**Acceptance gate**: Hash::Wrap wallclock within 5 × real Perl and no test failures beyond pre-existing.

### Phase 3 — Conditional machinery (the real Phase R)

`JPERL_CLASSIC=1` proved that removing the machinery globally restores master-era performance. Making the machinery *conditional on need* gives us that speedup without sacrificing DESTROY/weaken correctness.

Proposal restated here for a fresh reader:

* One `public boolean needsCleanup` on `RuntimeBase`, default `false`.
* Set to `true` on: `bless` into a class with `DESTROY`, `Scalar::Util::weaken`, closure-capture of a blessed referent (later — first cut only covers the first two).
* Every CLASSIC-gated site becomes `if (!base.needsCleanup) return <classic fast path>;`:
- `RuntimeScalar.setLargeRefCounted`
- `RuntimeScalar.scopeExitCleanup`
- `MortalList.deferDecrementIfTracked` etc.
- `MortalList.scopeExitCleanupHash` / `scopeExitCleanupArray`
  - `EmitVariable`: MyVarCleanupStack.register emission (still compile-time gated via `CleanupNeededVisitor`; that gating stays)

Test2's `Compare::Delta` nodes are blessed but *don't* have DESTROY — so they land on the fast path. Hash::Wrap's `A1`/`A2` wrappers are blessed but don't have DESTROY — fast path. DBIC's `ResultSet`/`ResultSource` *do* have DESTROY (via `next::can` dispatch under the hood) — slow path, correct.

**Scope**: ~30 gate sites mapped by the CLASSIC patch. Each call site gets a one-line guard. Core invariant change is on `RuntimeBase` — one new bit.

**Acceptance gate** (the PR merge gate):

| Measurement | Gate |
|---|---|
| Hash::Wrap `t/as_return.t` | passes in < 2 × real-Perl wallclock |
| DBIC full suite `./jcpan -t DBIx::Class` | zero timeouts; same pass count as commit `99509c6a0` (13 804 / 13 804) |
| `make test-bundled-modules` | still 176 / 176 |
| `make` unit tests | no new regressions beyond pre-existing `destroy_eval_die.t#4` |
| `life_bitpacked` | Phase R speedup preserved (≥ 1.3 × vs pre-merge baseline) |
| `destroy_eval_die.t` | same pass count (9 / 10 on current branch) |
| DBIx::Class `t/storage/txn_scope_guard.t` | 18 / 18 |

**Risk**: Medium. Per-object bit is simple in principle; the hard part is ensuring every *entry* into the tracked-object set correctly flips the bit. Fortunately the CLASSIC patch already identifies the gates, so we have a map.

### Phase 4 — Validation & documentation

1. Run Phase 3 acceptance gate on a clean machine. Document wallclock/CPU/GC numbers for each benchmark in `dev/bench/results/`.
2. Update `dev/design/perl_parity_plan.md` to reflect Phase R → Phase R+(refcount-by-need) progression.
3. Merge PR #536 once all gates are green.
4. File follow-up tickets for remaining ≤ 5 % per-site optimisations (none are in scope for the merge).

## Sequence / dependencies

```
Phase 0 (immediate fix) ──┐
├─▶ Phase 1 (profile) ──▶ Phase 2 (alloc reductions) ──▶ Phase 3 (conditional machinery) ──▶ Phase 4 (validate + merge)
```

Phase 0 is the sole prerequisite to unblock `./jcpan -t DBIx::Class` from getting stuck in the infinite loop. Phases 2 and 3 are independent of each other — if Phase 2 alone gets us to the merge gate, Phase 3 can slip to a follow-up PR.

## Immediate next step

Apply the B::NULL fix, verify Hash::Wrap completes (doesn't need to *pass*, just complete), commit, rerun `./jcpan -t DBIx::Class` to see whether any tests that were previously timing out now progress to a proper result.