1,627 changes: 1,627 additions & 0 deletions dev/bench/hash_wrap_repro/lib/Hash/Wrap.pm


21 changes: 21 additions & 0 deletions dev/bench/results/baseline-078e0b3d7.json
{
"git_sha": "078e0b3d7",
"date": "2026-04-21T21:17:48Z",
"runs": 3,
"jperl": "/Users/fglock/projects/PerlOnJava3/jperl",
"perl": "perl",
"perl_version": "5.042000",
"benchmarks": {
"benchmark_anon_simple": { "unit": "s", "jperl": [7.149,7.020,7.213], "perl": [1.435,1.454,1.427] },
"benchmark_closure": { "unit": "s", "jperl": [8.784,9.783,9.768], "perl": [8.108,7.961,7.877] },
"benchmark_eval_string": { "unit": "s", "jperl": [14.766,14.777,14.365], "perl": [3.135,3.164,3.276] },
"benchmark_global": { "unit": "s", "jperl": [14.608,14.579,14.720], "perl": [10.993,11.063,9.400] },
"benchmark_lexical": { "unit": "s", "jperl": [4.059,4.010,3.989], "perl": [10.589,10.581,10.441] },
"benchmark_method": { "unit": "s", "jperl": [2.620,2.537,2.607], "perl": [1.456,1.490,1.511] },
"benchmark_refcount_anon": { "unit": "s", "jperl": [1.792,1.807,1.776], "perl": [0.455,0.447,0.443] },
"benchmark_refcount_bless": { "unit": "s", "jperl": [1.293,1.305,1.311], "perl": [0.197,0.198,0.197] },
"benchmark_regex": { "unit": "s", "jperl": [2.732,2.719,2.701], "perl": [1.974,2.005,2.006] },
"benchmark_string": { "unit": "s", "jperl": [4.131,4.025,4.066], "perl": [6.887,6.867,6.977] },
"life_bitpacked": { "unit": "Mcells/s", "jperl": [8.21,8.12,8.28], "perl": [20.99,20.58,20.70] }
}
}
23 changes: 23 additions & 0 deletions dev/bench/results/baseline-078e0b3d7.md
# Benchmark baseline — 078e0b3d7

**Date:** 2026-04-21T21:17:48Z
**Runs per benchmark:** 3
**jperl:** `/Users/fglock/projects/PerlOnJava3/jperl`
**perl:** `perl` (5.042000)

For "time" benches lower = faster; ratio is `jperl / perl`.
For "Mcells/s" (life_bitpacked) higher = faster; ratio is `perl / jperl`.
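
For concreteness, each table row is derived from the raw runs in the JSON baseline. This is a minimal sketch assuming the aggregation is a plain mean of the 3 runs (which matches the published figures); `RatioSketch` is an illustrative name, not part of the bench tooling:

```java
// Reproduces one row of the table below from baseline-078e0b3d7.json.
final class RatioSketch {
    static double mean(double... xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    public static void main(String[] args) {
        // benchmark_anon_simple runs from the JSON baseline
        double jperl = mean(7.149, 7.020, 7.213);   // 7.127
        double perl  = mean(1.435, 1.454, 1.427);   // 1.439
        System.out.printf("%.2f×%n", jperl / perl); // prints 4.95×
    }
}
```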

| Benchmark | unit | jperl | perl | ratio | parity? |
|---|---|---:|---:|---:|:---:|
| `benchmark_anon_simple` | s | 7.127 | 1.439 | **4.95×** | ❌ |
| `benchmark_closure` | s | 9.445 | 7.982 | **1.18×** | ≈ |
| `benchmark_eval_string` | s | 14.636 | 3.192 | **4.59×** | ❌ |
| `benchmark_global` | s | 14.636 | 10.485 | **1.40×** | ❌ |
| `benchmark_lexical` | s | 4.019 | 10.537 | **0.38×** | ✅ |
| `benchmark_method` | s | 2.588 | 1.486 | **1.74×** | ❌ |
| `benchmark_refcount_anon` | s | 1.792 | 0.448 | **4.00×** | ❌ |
| `benchmark_refcount_bless` | s | 1.303 | 0.197 | **6.61×** | ❌ |
| `benchmark_regex` | s | 2.717 | 1.995 | **1.36×** | ❌ |
| `benchmark_string` | s | 4.074 | 6.910 | **0.59×** | ✅ |
| `life_bitpacked` | Mcells/s | 8.203 | 20.757 | **2.53×** | ❌ |
102 changes: 102 additions & 0 deletions dev/design/classic_experiment_finding.md
# JPERL_CLASSIC experiment — cumulative-tax hypothesis confirmed

**Branch:** `perf/perl-parity-phase1` @ 3c2ca4b6a + CLASSIC gate patches (4 files)
**Date:** 2026-04-18
**Hypothesis:** The master→branch regression (1.67× on life_bitpacked) is NOT attributable to any single hot method. It is the cumulative cost of many small taxes added by the refcount/walker/weaken/DESTROY machinery, each individually invisible in a profile.

## Test

Added `JPERL_CLASSIC` env var (read once at class-init into a `static final boolean`). When set, short-circuits the branch's added machinery to near-master behavior:

| Site | CLASSIC behavior |
|---|---|
| `MortalList.active` | `false` — every `deferDecrement*` / `scopeExitCleanup{Hash,Array}` / `mortalizeForVoidDiscard` early-returns |
| `EmitStatement.emitScopeExitNullStores` Phase 1 (`scopeExitCleanup` per scalar) | Not emitted |
| `EmitStatement.emitScopeExitNullStores` Phase 1b (cleanupHash/Array) | Not emitted |
| `EmitStatement.emitScopeExitNullStores` Phase E (`MyVarCleanupStack.unregister`) | Not emitted |
| `EmitStatement.emitScopeExitNullStores` Phase 3 (`MortalList.flush`) | Not emitted |
| `EmitVariable` `MyVarCleanupStack.register` on every `my` | Not emitted |
| `MyVarCleanupStack.register` / `unregister` | Early-return |
| `RuntimeScalar.scopeExitCleanup` | Early-return |
| `RuntimeScalar.setLargeRefCounted` | Direct field assignment, skipping refcount/WeakRefRegistry/MortalList work |

Correctness: CLASSIC breaks DESTROY, weaken, walker semantics — only useful for measurement, not shipping.
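The gate pattern is simple enough to sketch. This is an illustrative reconstruction, not the actual PerlOnJava source: the class shape, field names, and `deferDecrement` body are assumptions; only the read-once `static final` idiom and the early-return are what the experiment describes.

```java
import java.util.ArrayList;

// Illustrative sketch of the JPERL_CLASSIC gate (not the real PerlOnJava code).
// The env var is read exactly once into a static final, so the JIT can treat
// every `if (CLASSIC)` branch as a constant and fold the dead path away.
final class MortalListSketch {
    static final boolean CLASSIC = "1".equals(System.getenv("JPERL_CLASSIC"));

    private final ArrayList<Object> pending = new ArrayList<>();

    // Branch machinery: defer a refcount decrement until scope-exit flush.
    void deferDecrement(Object referent) {
        if (CLASSIC) return;   // experiment mode: near-master behavior, no tracking
        pending.add(referent);
    }

    int pendingCount() {
        return pending.size();
    }
}
```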

## Result — life_bitpacked

`./jperl examples/life_bitpacked.pl -r none -g 500`, 5 runs each, median:

| Mode | Runs (Mcells/s) | Median |
|---|---|---:|
| Baseline (branch machinery on) | 8.58 / 8.51 / 8.49 / 8.51 / 8.45 | **8.51** |
| `JPERL_CLASSIC=1` | 14.18 / 14.60 / 14.14 / 13.32 / 13.77 | **14.18** |
| System perl (reference) | — | 20.8 – 21.5 |
| Master @ pre-merge (reference) | — | 14.0 |

**Speedup: 14.18 / 8.51 = 1.666×**, essentially recovering master's pre-merge number.

## Result — benchmark_lexical (simple, no refs)

`./jperl dev/bench/benchmark_lexical.pl`, 3 runs each:

| Mode | Runs (iters/s) | Median |
|---|---|---:|
| Baseline | 313484 / 329270 / 314172 | **314172** |
| `JPERL_CLASSIC=1` | 357144 / 347743 / 359080 | **357144** |

**Speedup: 1.14×**

Even on a workload with no references and no blesses, the `my`-variable register/unregister emissions and scope-exit cleanup emissions cost ~14%.

## Interpretation

The hypothesis is definitively confirmed:

1. **The master→branch perf gap is recoverable in full** (1.67× on the most ref-heavy workload) by gating the added machinery.
2. **No single site is the bottleneck.** Phase 1 (MortalList.flush) alone was worth 0.7%. Phase 2's pristine-args stub alone was worth 0%. The 1.67× comes from ~a dozen sites each contributing 2–10%.
3. **The taxes are broadly distributed across the scope-exit / variable-declaration / reference-assignment paths.** Even workloads that never exercise DESTROY/weaken pay them.

## Implication for the plan

The piecewise Phase 2'/3'/4' approach was the wrong framing. The right structural fix:

**Make the machinery per-object-opt-in, not always-on.** Perl 5's design: `SvREFCNT_inc` is free for most SVs because the type tag gates the work. Only objects that need refcount tracking pay the cost.

Concrete proposal (call it Phase R — "refcount by need"):

1. Add a single `needsCleanup` bit to `RuntimeBase`, default `false`.
2. Set it to `true` only when:
- The object is blessed into a class that has `DESTROY`, OR
- The object is targeted by `Scalar::Util::weaken`, OR
- The object is captured by a CODE ref whose refCount we need to track for cycle break.
3. Every CURRENT-BRANCH fast-path site becomes `if (!needsCleanup) return <classic behavior>;`:
- `setLargeRefCounted` → direct assignment if neither side needs cleanup
- `scopeExitCleanup` → no-op if scalar's value doesn't need cleanup
- `MyVarCleanupStack.register` → skip if the var's referent doesn't need cleanup
- `MortalList.deferDecrement*` → skip if referent doesn't need cleanup
- `scopeExitCleanupHash/Array` → skip if container has no needsCleanup descendants

With per-object gating, life_bitpacked (zero blessed objects, zero weaken) pays zero tax and runs at ~14 Mc/s. DBIx::Class / txn_scope_guard / destroy_eval_die (objects that DO need cleanup) still work correctly.
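The per-site gate could take roughly this shape. A sketch under stated assumptions: `RuntimeBase` and `setLargeRefCounted` are names from this plan, but the bodies and class scaffolding here are hypothetical, not the actual implementation:

```java
// Hypothetical sketch of the Phase R "refcount by need" gate.
abstract class RuntimeBaseSketch {
    // Default false: most values never pay for refcount/DESTROY/weaken tracking.
    boolean needsCleanup = false;
}

final class RuntimeScalarSketch extends RuntimeBaseSketch {
    Object value;

    void setLargeRefCounted(RuntimeBaseSketch newValue) {
        if (newValue == null || !newValue.needsCleanup) {
            value = newValue;   // classic fast path: plain field store
            return;
        }
        // Slow path, only for blessed-with-DESTROY / weakened / tracked values:
        // refcount bookkeeping, WeakRefRegistry, MortalList arming would go here.
        value = newValue;
    }
}
```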

This is a **significant refactor** — every site listed above needs a cheap gate check. But:

- The CLASSIC experiment has already implemented those gate checks (just globally rather than per-object). Most of the code is the early-return condition.
- The JIT will fold the `needsCleanup == false` check away to almost nothing once it sees a type-stable call site.
- Correctness is easier to reason about than the current "always-tracked" design, because the gate explicitly matches the semantic condition that requires tracking.

## Files touched in this experiment

```
src/main/java/org/perlonjava/runtime/runtimetypes/MortalList.java (+CLASSIC flag, active init)
src/main/java/org/perlonjava/runtime/runtimetypes/MyVarCleanupStack.java (register/unregister early-return)
src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java (setLargeRefCounted + scopeExitCleanup early-return)
src/main/java/org/perlonjava/backend/jvm/EmitStatement.java (4 emission sites gated)
src/main/java/org/perlonjava/backend/jvm/EmitVariable.java (register emission gated)
```

## Next step

Either:
1. **Commit the CLASSIC gate** as a measurement tool on `perf/perl-parity-phase1` (doesn't ship to users; helps future perf work A/B the full-feature cost).
2. **Move directly to Phase R** (per-object `needsCleanup` bit) based on this evidence, using the CLASSIC gate sites as the map of what needs per-object gating.
3. **Revert** the CLASSIC gate and keep this document as the finding.
211 changes: 211 additions & 0 deletions dev/design/hash_wrap_triage_plan.md
# Hash::Wrap `t/as_return.t` — GC-thrash / infinite-loop triage plan

**Status**: Investigation in progress. PR #536 blocked until this class of failure is resolved.

## Scope

Hash::Wrap's `t/as_return.t` (45 lines) and DBIx::Class exhibit the same class of failure: extremely high CPU + memory, no apparent forward progress, wallclock >> real-Perl expectation. User-visible symptom is "stuck" or "timeout".

This plan picks Hash::Wrap as the minimal reproducer (tight CPAN test, independent of DBIC fixtures).

## Observations (2026-04-23)

### Reproducer captured
```
/Users/fglock/projects/PerlOnJava3/dev/bench/hash_wrap_repro/
t/as_return.t # 45 lines, copied from Hash-Wrap-1.09
lib/Hash/Wrap.pm # upstream pure-Perl
```

Invoke:
```bash
cd dev/bench/hash_wrap_repro
timeout 30 ../../../jperl -Ilib t/as_return.t
```

Baseline: at 15 s of wallclock the main thread has used 13 s of CPU (~89 % of one core — **not** GC-thrash on my machine). On the user's original machine the same code saturates 11+ cores; the GC amplification differs with machine and load, but the correctness-level reproducer is identical.

### First bug localised: `B::NULL::next` self-loop

`jstack` on the stuck process shows the inner loop is:

```
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:952)
NameNormalizer.normalizeVariableName(NameNormalizer.java:144)
InheritanceResolver.findMethodInHierarchy(InheritanceResolver.java:310)
Universal.can(Universal.java:175)
RuntimeCode.callCached(RuntimeCode.java:1780)
anon1485.apply(Test2/Util/Sub.pm:577) <-- $op->can('line') / $op->can('next')
```

Tracing upward: `Test2::Util::Sub::sub_info` walks the OP tree:

```perl
my $op = $cobj->START;
while ($op) {
push @all_lines => $op->line if $op->can('line');
last unless $op->can('next'); # <- termination check
$op = $op->next;
}
```

PerlOnJava's `src/main/perl/lib/B.pm` has:

```perl
package B::NULL {
our @ISA = ('B::OP');
sub new { bless {}, shift }
sub next {
# NULL is terminal -- return self to prevent infinite loops
return $_[0];
}
}
```

**The comment is inverted.** Returning `$_[0]` keeps `$op` as the same B::NULL forever:

* `$op->can('line')` → true (inherited from B::OP)
* `$op->can('next')` → true (inherited from B::OP)
* `$op = $op->next` → same B::NULL
* Loop never exits, `@all_lines` grows unboundedly → GC pressure once array outgrows young gen → user sees the 13 GC threads + 25 % useful CPU.

Hash::Wrap trips this because Test2's structural compare (`meta { prop ... object { call ... } }`) calls `sub_info` on every comparison callback — one infinite loop per check.

DBIx::Class likely trips the same path (its test suite also uses Test2 deep compare, and DBIC itself uses Sub::Defer / B introspection heavily).
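The termination difference is easy to see in a tiny Java analogue of the walker loop above (names and the safety limit are illustrative, not PerlOnJava code): a terminal node whose `next()` returns itself spins forever, while returning null — the analogue of the proposed undef fix — lets `while (op != null)` exit.

```java
final class OpWalkSketch {
    interface Op { Op next(); }

    // Walks the chain like Test2::Util::Sub::sub_info does; `limit` stands in
    // for the unbounded @all_lines growth that becomes GC pressure in practice.
    static int countOps(Op start, int limit) {
        int n = 0;
        Op op = start;
        while (op != null && n < limit) { n++; op = op.next(); }
        return n;
    }

    public static void main(String[] args) {
        Op selfLoop = new Op() { public Op next() { return this; } }; // buggy B::NULL
        Op terminal = () -> null;                                     // fixed B::NULL
        System.out.println(countOps(selfLoop, 1000)); // hits the safety limit: 1000
        System.out.println(countOps(terminal, 1000)); // terminates after 1 node: 1
    }
}
```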

### Fix for the immediate infinite loop

Replace `B::NULL::next` with a sentinel that actually terminates the common walker patterns:

```perl
package B::NULL {
our @ISA = ('B::OP');
sub new { bless {}, shift }

# Every method call on B::NULL returns undef (matches real Perl XS).
# Crucially, `$null->next` returning undef terminates while($op) loops.
sub next { return; }
sub line { return; }
# `can('next')` still returns true via B::OP inheritance; the
# caller's `$op = $op->next` sets $op to undef and while($op) exits.
}
```

Before landing: audit other B.pm sentinel methods (`sibling`, `targ`, `sibparent`, `first`, `last`, etc.) for the same mistake.

## Why this is sufficient for Hash::Wrap but not the full class of problem

The B::NULL fix makes `sub_info` terminate on first invocation. Once it's terminating:

1. The test proceeds into the actual structural compare.
2. Every `is($obj, meta { ... })` still allocates deep `Test2::Compare::Delta` trees.
3. Each Delta node is a blessed hashref → traverses `RuntimeScalar.setLargeRefCounted`, `MortalList.deferDecrement*`, walker arming etc.
4. This is the *real* distributed-tax problem we already confirmed in Phase R.

With just the B::NULL fix, Hash::Wrap completes but still runs an order of magnitude slower than real Perl. That may be acceptable for the test-to-pass gate; it is not acceptable for "perf parity". The full plan below addresses both.

## Plan

Four phases. Each phase has an explicit measurement gate before moving to the next.

### Phase 0 — Unblock the test (same-day)

1. **Fix `B::NULL::next`** and audit other B.pm sentinels (see above).
2. Run Hash::Wrap `t/as_return.t` and `DBIx-Class-0.082844-68/t/storage/base.t` to completion. Record wallclock, CPU ratio, allocation rate via JFR.
3. Acceptance: both complete in finite time, produce TAP with actual pass/fail rather than timeouts. (Pass/fail counts themselves can still regress — that's Phase 1-3 territory.)
4. Commit the fix on `perf/phase-r-needs-cleanup`.

**Risk**: very low. Change is localised to the B.pm shim. Regression surface: code that relied on `$null->next == $null` for some iteration invariant. No known such code.

### Phase 1 — Establish allocation baseline

Goal: turn "slow under GC" from hand-wave into numbers.

1. JFR run on Hash::Wrap `t/as_return.t`:
```
JPERL_OPTS="-XX:+FlightRecorder -XX:StartFlightRecording=\
filename=dev/bench/results/jfr/hash_wrap.jfr,\
settings=profile,duration=60s" \
./jperl -Ilib t/as_return.t
```
Capture `jdk.ObjectAllocationSample` + `jdk.ObjectAllocationInNewTLAB` + `jdk.GCHeapSummary`.

2. Same run with `JPERL_CLASSIC=1` for the upper bound.

3. Top allocators (top 10 by bytes): expected candidates are `RuntimeScalar`, `RuntimeHash`, `RuntimeArray`, `MortalList$Entry`, Test2 Delta/Check/Meta classes (pure Perl packages compiled to our anon classes). Record exact numbers in `dev/design/hash_wrap_alloc_profile.md`.

4. GC metric deltas: young-gen pause %, old-gen promotions/sec, total GC time as % of wallclock. If CLASSIC drops GC time from e.g. 60 % to 10 %, we know our machinery is the allocation driver; if GC stays high under CLASSIC, the allocation source is non-PerlOnJava (upstream Test2 / Hash::Wrap pattern itself).

**Acceptance gate**: an allocation profile committed under `dev/bench/results/` that clearly identifies the top 3 allocation sites contributing >60 % of bytes.

### Phase 2 — Reduce allocation at the top-3 sites

This is concrete engineering work whose scope depends on Phase 1's findings. Candidate targets based on prior profiling work:

| Candidate | Already known from | Expected impact |
|---|---|---|
| `RuntimeList.add` → `ArrayList.grow` from initial capacity 10 | `life_bitpacked_jfr_profile.md` | 5–14 % on life_bitpacked |
| `MortalList.pending` growth (same `ArrayList.grow` pattern) | `classic_experiment_finding.md` (implicit) | varies with callsite density |
| Per-`my` `MyVarCleanupStack.register` list add | Phase R measured | already captured in `1.49×` |
| Intermediate `RuntimeScalar(integer)` boxing in comparison callbacks | `life_bitpacked_jfr_profile.md` (via `RuntimeScalarCache.getScalarInt`) | unknown for Test2 workload |

For each chosen target:

1. Minimal hack that short-circuits the allocation (even if broken) — upper-bound measurement.
2. If upper bound ≥ 5 % wallclock improvement, implement cleanly.
3. If < 5 %, document and move on (Phase 1 Lessons Learned rule).

**Acceptance gate**: Hash::Wrap wallclock within 5 × real Perl and no test failures beyond pre-existing.

### Phase 3 — Conditional machinery (the real Phase R)

`JPERL_CLASSIC=1` proved that removing the machinery globally restores master-era performance. Making the machinery *conditional on need* gives us that speedup without sacrificing DESTROY/weaken correctness.

Proposal restated here for a fresh reader:

* One `public boolean needsCleanup` on `RuntimeBase`, default `false`.
* Set to `true` on: `bless` into a class with `DESTROY`, `Scalar::Util::weaken`, closure-capture of a blessed referent (later — first cut only covers the first two).
* Every CLASSIC-gated site becomes `if (!base.needsCleanup) return <classic fast path>;`:
- `RuntimeScalar.setLargeRefCounted`
- `RuntimeScalar.scopeExitCleanup`
- `MortalList.deferDecrementIfTracked` etc.
- `MortalList.scopeExitCleanupHash` / `scopeExitCleanupArray`
  - `EmitVariable`: MyVarCleanupStack.register emission (still compile-time gated via `CleanupNeededVisitor`; that gating stays)

Test2's `Compare::Delta` nodes are blessed but *don't* have DESTROY — so they land on the fast path. Hash::Wrap's `A1`/`A2` wrappers are blessed but don't have DESTROY — fast path. DBIC's `ResultSet`/`ResultSource` *do* have DESTROY (via `next::can` dispatch under the hood) — slow path, correct.

**Scope**: ~30 gate sites mapped by the CLASSIC patch. Each call site gets a one-line guard. Core invariant change is on `RuntimeBase` — one new bit.

**Acceptance gate** (the PR merge gate):

| Measurement | Gate |
|---|---|
| Hash::Wrap `t/as_return.t` | passes in < 2 × real-Perl wallclock |
| DBIC full suite `./jcpan -t DBIx::Class` | zero timeouts; same pass count as commit `99509c6a0` (13 804 / 13 804) |
| `make test-bundled-modules` | still 176 / 176 |
| `make` unit tests | no new regressions beyond pre-existing `destroy_eval_die.t#4` |
| `life_bitpacked` | Phase R speedup preserved (≥ 1.3 × vs pre-merge baseline) |
| `destroy_eval_die.t` | same pass count (9 / 10 on current branch) |
| DBIx::Class `t/storage/txn_scope_guard.t` | 18 / 18 |

**Risk**: Medium. Per-object bit is simple in principle; the hard part is ensuring every *entry* into the tracked-object set correctly flips the bit. Fortunately the CLASSIC patch already identifies the gates, so we have a map.

### Phase 4 — Validation & documentation

1. Run Phase 3 acceptance gate on a clean machine. Document wallclock/CPU/GC numbers for each benchmark in `dev/bench/results/`.
2. Update `dev/design/perl_parity_plan.md` to reflect Phase R → Phase R+(refcount-by-need) progression.
3. Merge PR #536 once all gates are green.
4. File follow-up tickets for remaining ≤ 5 % per-site optimisations (none are in scope for the merge).

## Sequence / dependencies

```
Phase 0 (immediate fix) ──┐
├─▶ Phase 1 (profile) ──▶ Phase 2 (alloc reductions) ──▶ Phase 3 (conditional machinery) ──▶ Phase 4 (validate + merge)
```

Phase 0 is the sole prerequisite to unblock `./jcpan -t DBIx::Class` from getting stuck in the infinite loop. Phases 2 and 3 are independent of each other — if Phase 2 alone gets us to the merge gate, Phase 3 can slip to a follow-up PR.

## Immediate next step

Apply the B::NULL fix, verify Hash::Wrap completes (doesn't need to *pass*, just complete), commit, rerun `./jcpan -t DBIx::Class` to see whether any tests that were previously timing out now progress to a proper result.