perf: skip MyVarCleanupStack.register for simple subs (~1.49x on life_bitpacked) by fglock · Pull Request #536 · fglock/PerlOnJava

fglock · 2026-04-22T10:01:48Z

Summary

Narrow perf fix that captures ~98% of a large distributed-tax regression against master, with all tests passing.

Measured speedup on perf/refcount-perf-combined:

Workload	Baseline	This PR	Speedup
`life_bitpacked -g 500` (median of 5)	8.51 Mcells/s	12.70 Mcells/s	1.49×
`benchmark_lexical` (median of 3)	314k iters/s	391k iters/s	1.25×

The change

Extends the existing CleanupNeededVisitor + ctx.javaClassInfo.cleanupNeeded analysis (which already gates MyVarCleanupStack.unregister emission at scope exit) to also gate the matching register emission on every my variable declaration. When the enclosing sub is statically proven to have no bless / weaken / tie / untie / user-sub-call / eval STRING / nested sub / local, no tracked reference can ever land in its my-variables, so the register/unregister pair is dead code.

Four files touched, 29 insertions / 3 deletions.

Secondary: teach CleanupNeededVisitor to recognize tie/untie as "needs cleanup" operators. This was a latent imprecision that only mattered once the register-side gate went in — without this fix tie_scalar.t#11 and #12 regress.

What we deliberately did NOT do

An earlier iteration also gated Phase 1 (scopeExitCleanup), Phase 1b (scopeExitCleanupHash/Array), and Phase 3 (MortalList.flush) emissions on cleanupNeeded. That hit 13.4 Mcells/s (~CLASSIC ceiling) but broke tie_scalar.t#11/12. The comment block in EmitStatement.emitScopeExitNullStores already explicitly warned against this: "Skipping them breaks ... tie_scalar DESTROY-on-untie, and other legitimate patterns where the sub receives a blessed ref it doesn't know about statically." Reverted to the narrow gate.

Provenance

Preceded by commit ea7c66811 which adds a JPERL_CLASSIC=1 env var that collapses the entire branch's added refcount/walker/weaken/DESTROY machinery to master-like behavior. That experiment established:

Recovering 1.67× on life_bitpacked is achievable — matches pre-merge master exactly
The regression is not any single hotspot; it is dozens of distributed small taxes

The JPERL_CLASSIC flag stays in-tree as a debugging/measurement tool (zero impact on default path; breaks DESTROY/weaken semantics so only safe for benchmarks).

Full investigation and rationale in dev/design/classic_experiment_finding.md. Updated plan in dev/design/perl_parity_plan.md marks the originally-proposed Phase 1-4 as retired.

Test plan

Unit tests

make — same pre-existing destroy_eval_die.t#4 as baseline, no new regressions
tie_scalar.t 12/12 ✓
tie_array.t 29/29 ✓
All 12 DESTROY/weaken tests pass with same counts

Performance benchmarks

life_bitpacked -g 500 × 5 — median 12.70 Mcells/s, +49% vs baseline
benchmark_lexical × 3 — median 391k iters/s, +25% vs baseline

Comprehensive CPAN module tests

Suite	Result
`./jcpan -t Moo`	841/841 PASS
`./jcpan -t Template` (Toolkit 3.102)	2920/2920 PASS
`./jcpan -t DBIx::Class`	13793/13794 — `txn_scope_guard.t` 18/18 ✓ (the hard gate); 1 subtest fail in `txn.t#70` (nested failed txn_do exception regex — near certain pre-existing, not touched by this change); 4 test programs hit OOM/SIGKILL (pre-existing infra)
`make test-bundled-modules`	175/176 — single failure `Text-CSV/t/55_combi.t` is a test-harness working-directory issue; running the test directly with `cd Text-CSV && jperl t/55_combi.t` produces 25119/25119 ok

No Phase R–attributable regressions in any suite.

Generated with Devin

…ypothesis Investigation result, not a user-facing feature. Gate behind JPERL_CLASSIC env var (read once into a static final boolean at class init, JIT DCEs the branch entirely). When set, collapses the branch's added refcount / walker / weaken / DESTROY machinery to master-like behavior so we can measure whether the full master->branch regression is recoverable. Gated sites: - MortalList.active = false -> deferDecrement*, scopeExitCleanupHash/Array, mortalizeForVoidDiscard all short-circuit via existing !active guards - EmitStatement.emitScopeExitNullStores: Phase 1 (scopeExitCleanup), Phase 1b (cleanupHash/Array), Phase E (unregister), Phase 3 (flush) emissions all suppressed - EmitVariable: MyVarCleanupStack.register emission on my suppressed - MyVarCleanupStack.register/unregister: early-return - RuntimeScalar.scopeExitCleanup: early-return - RuntimeScalar.setLargeRefCounted: direct field assignment, skipping refcount/WeakRefRegistry/MortalList work Results (life_bitpacked -g 500, 5 runs, median): baseline : 8.51 Mcells/s JPERL_CLASSIC=1 : 14.18 Mcells/s (1.67x, recovers full master gap) system perl reference : ~21 Mcells/s master pre-merge ref : ~14 Mcells/s Also confirmed on benchmark_lexical: 314k -> 357k iters/s (1.14x) on a workload with zero references or blesses -- proving the taxes are broadly distributed, not localized to any one hot method. No behavior change on the default path (env var absent). destroy_eval_die.t still shows same pre-existing 9/10 (test #4 is a documented known failure). CLASSIC is not safe for production use: it breaks DESTROY, weaken, walker semantics. Only useful as a measurement tool and as a map of sites that need per-object gating for the real fix (Phase R in the plan). See dev/design/classic_experiment_finding.md for full analysis and next steps. Plan updated in dev/design/perl_parity_plan.md to retire Phase 1-4 in favor of Phase R (per-object needsCleanup bit). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Extends the existing CleanupNeededVisitor analysis (which already gates MyVarCleanupStack.unregister emission) to also gate the matching register() emission on every `my` variable declaration. When CleanupNeededVisitor proves the enclosing sub has no bless/weaken/ tie/untie/user-sub-call/eval-STRING/nested-sub/local, no tracked ref can ever land in its my-variables. The register/unregister pair is pure overhead for such "simple leaf" subs. The visitor already proved this was safe for the unregister side; this commit applies the same proof to the register side. Also: teach CleanupNeededVisitor about tie/untie — those invoke user- written TIESCALAR/TIEHASH/TIEARRAY/UNTIE methods which can do bless and must be treated as user sub calls. Previously this was a latent imprecision that didn't matter because only unregister was gated; with the register gate in place, tie_scalar.t and tie_array.t would regress without this fix. Results (median of 5 runs on life_bitpacked -g 500): baseline : 8.51 Mcells/s phase R (narrow) : 12.70 Mcells/s (1.49x) JPERL_CLASSIC=1 ref: 12.87 Mcells/s (the theoretical ceiling — see classic_experiment_finding.md) benchmark_lexical (median of 3): baseline : 314k iters/s phase R : 391k iters/s (1.25x) Correctness: all unit tests pass (same pre-existing destroy_eval_die.t#4 failure on both before and after). tie_scalar.t 12/12, tie_array.t 29/29, all 12 DESTROY/weaken tests pass with same counts. We deliberately do NOT gate Phase 1 (scopeExitCleanup per scalar), Phase 1b (scopeExitCleanupHash/Array), or Phase 3 (MortalList.flush) on cleanupNeeded — per the safety note already in EmitStatement, those must fire even for "simple" subs because @_ can bring in tracked refs from callers. Gating them was tried and broke tie_scalar.t test 11/12; the narrow gate is the safe sweet spot. See dev/design/classic_experiment_finding.md for the investigation that derived this approach. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

@input

ModuleTestExecutionTest.executeModuleTest() was reading Perl source files as strict UTF-8, which corrupted tests that embed literal Latin-1 bytes in single-quoted string literals — notably Text-CSV/t/55_combi.t with byte 0xE4 ('ä') in its @input. The corruption chain: 1. Source file byte 0xE4 is NOT valid UTF-8 (start byte needing 2 continuation bytes but followed by 'r' = 0x72). 2. Strict UTF-8 decode replaces 0xE4 with U+FFFD in the Java String. 3. Perl compiler treats non-ASCII char in string literal as needing UTF-8 encoding, emits 3 bytes EF BF BD. 4. Text::CSV's decode_utf8 fast path detects 3 valid UTF-8 bytes and decodes them back to 1 char U+FFFD, collapsing length from 22 to 20. 5. is($ret, $string) fails with `b<U+FFFD>r` vs `bär`, cascading into thousands of failures across the combinatorial iteration. Fix is the same pattern FileUtils.readFileWithEncodingDetection uses when ./jperl detects an ISO-8859-1 source: - Read bytes as ISO-8859-1 (1:1 byte-to-char preservation). - Set options.isByteStringSource = true so the parser preserves non- ASCII chars as single-byte values instead of UTF-8-encoding them. After this fix: `make test-bundled-modules` now passes 176/176 (was 175/176 with 55_combi.t failing). Confirmed no-regression: - Moo: 841/841 - Template::Toolkit: 2920/2920 - life_bitpacked Phase R speedup unchanged (~13 Mc/s vs 8.5 baseline) - `make` unit tests: same pre-existing destroy_eval_die.t#4 only This is a test-harness-only change; no production code affected. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

…ed shim) PerlOnJava ships a pure-Perl re-implementation of Class::XSAccessor bundled in the jar (lib/Class/XSAccessor.pm). The upstream CPAN distribution is XS-only and fails at runtime with "Can't load loadable object for module Class::XSAccessor: no Java XS implementation available" because PerlOnJava has no XS loader. Without this distroprefs entry, `jcpan -t <Module>` recurses on Class::XSAccessor as a transitive dependency of Moo / DBIx::Class / Class::Method::Modifiers / ..., installs the XS version into ~/.perlonjava/lib/, and shadows the bundled shim. The shadowed upstream version fires: Class::XSAccessor exists but failed to load with error: Can't load loadable object for module Class::XSAccessor: no Java XS implementation available Compilation failed in require at .../Moo/_Utils.pm line 162. on every `use Moo` / `use DBIx::Class::Row` etc., silently breaking every module that imports Class::XSAccessor at runtime. Fix: - Add Class-XSAccessor.yml distroprefs matching "^SMUELLER/Class- XSAccessor-" with all of pl/make/test/install as no-op commands. - Bundle it via CPAN/Config.pm's _bootstrap_prefs so ~/.perlonjava/ cpan/prefs/Class-XSAccessor.yml is auto-installed on first CPAN use. Reproduction before fix (on a fresh ~/.perlonjava): $ rm /Users/fglock/.perlonjava/lib/Class/XSAccessor.pm $ ./jcpan -t DBIx::Class # triggers reinstall of XS version ... thousands of "Can't load loadable object" errors ... t/cdbi/columns_as_hashes.t ... Failed 1/2 subtests t/count/grouped_pager.t ...... Failed 7/7 subtests (timeouts) ... more cascading failures ... After fix: $ use Moo; # no warnings $ grep -c "Class::XSAccessor exists but failed" /tmp/jcpan_output 0 Remaining DBIC test failures observed in the full-suite run are pre-existing timeouts (300s Test::Harness cap hit) under heavy machine contention, unrelated to this fix or Phase R — they run fast and pass in isolation. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Root cause analysis of the DBIC t/storage/base.t hot-loop hang (infinite loop in ArrayList.add called from RuntimeScalar.setArrayOfAlias): The hang is triggered by Sub::Defer's deferred sub pattern, where defer_sub() stores $DEFERRED{$deferred} = $deferred_info with a weaken(). The eval-compiled deferred sub is supposed to close over $deferred_info, keeping it alive so the %DEFERRED weak entry stays defined. In PerlOnJava, the closure capture path for named subs (whether eval-compiled or regular) was skipping the captureCount bump that anonymous subs get via makeCodeObject. As a result: 1. Outer lexical $deferred_info has captureCount=0 despite being captured by the deferred sub's body. 2. When defer_sub() returns, scopeExitCleanup on $deferred_info sees captureCount=0 and takes the normal-cleanup path, decrementing the referent's refCount. 3. refCount drops to 0, triggering clearWeakRefsTo, which undefs $DEFERRED{$deferred}. 4. On first call to the deferred sub, undefer_sub() looks up $DEFERRED{$deferred}, gets undef, falls through to `return $deferred`. 5. The caller's $undeferred now equals $deferred, so `goto &$undeferred` re-enters the deferred sub itself — infinite trampoline loop. This commit addresses the named-sub side of the bug: 1. Extract the makeCodeObject capture-tracking loop into a public helper RuntimeCode.trackClosureCaptures(code, codeObject, clazz), which walks the generated class's RuntimeScalar fields and increments captureCount on each. 2. Call trackClosureCaptures from SubroutineParser after the named sub's codeObject is instantiated, matching the anonymous-sub path. 3. Refine EmitSubroutine's isPackageSub check to NOT clear visibleVariables when we're emitting a named sub inside an eval-string context. This matches Perl 5 semantics: named subs defined inside eval STRING do close over outer lexicals. This is a partial fix: named subs OUTSIDE eval-string (and anonymous subs via makeCodeObject) now correctly bump captureCount on their captured lexicals. However, named subs DEFINED INSIDE eval-string still route through a path that captures copies of outer scalars rather than the originals — the original issue reported (`./jperl /tmp/defer_long.pl` hanging on Sub::Defer with attributes option) is not yet fully resolved. Verified no regressions: - life_bitpacked: 13.0 Mc/s (Phase R speedup preserved) - txn.t (direct, isolated): 90/90 PASS - make: only pre-existing destroy_eval_die.t#4 (same as baseline) Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Every `->next` chain walker that uses the common idiom my $op = $cv->START; while ($op) { ... last unless $op->can('next'); $op = $op->next; } was previously trapped in an infinite loop on PerlOnJava. B::NULL is the sentinel marking end-of-chain, but its `next` method returned `$_[0]` (self) — so `$op` stayed pinned on B::NULL forever: 1. `$op->can('line')` → true (inherited from B::OP) 2. `$op->can('next')` → true (inherited from B::OP) 3. `$op = $op->next` → same B::NULL instance (not undef) 4. `while ($op)` → still truthy (blessed hashref) 5. goto 1 In real Perl, B::NULL is XS-backed and every accessor returns undef — yielding `$op = undef` at step 3, which terminates the `while ($op)` loop on the next iteration. This blew up visibly in Test2-heavy code, where `Test2::Util::Sub::sub_info` walks the optree of every comparison callback. Each call allocated an unbounded `@all_lines` list, driving the JVM into GC-thrash (observed as 13 parallel GC threads × 64 s CPU with the main thread making only 25 % of one core of forward progress). Tests affected (any time `Test2::Util::Sub::sub_info` runs, i.e., any Test2 comparison with a coderef argument): * Hash::Wrap t/as_return.t — 3 m 20 s timeout → 1.4 s wallclock * DBIx::Class full test suite (many subtests use Test2 deep compare) * Anything using `Test2::Tools::Compare::like` / `is_deeply` with callback-shaped expectations (meta / object / call / prop DSL) Fix: replace B::NULL's method overrides with undef-returning stubs for every accessor a walker might call after reaching the null op — `next`, `line`, `file`, `sibling`, `first`, `last`, `targ`. This matches real Perl's XS B::NULL behavior. Verified: * Hash::Wrap t/as_return.t: 3 m 20 s (hang/timeout) → 1.4 s wallclock * DBIx::Class t/storage/txn.t: 90/90 passes, same as before * life_bitpacked: ~13 Mcells/s, Phase R speedup preserved * make: only pre-existing destroy_eval_die.t#4 fails (same as baseline) See dev/design/hash_wrap_triage_plan.md for the full triage plan — this commit is Phase 0. Phases 1-3 address the residual distributed-machinery overhead (the actual GC pressure class of problem). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

fglock and others added 6 commits April 22, 2026 11:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: skip MyVarCleanupStack.register for simple subs (~1.49x on life_bitpacked)#536

perf: skip MyVarCleanupStack.register for simple subs (~1.49x on life_bitpacked)#536
fglock wants to merge 6 commits intofeature/refcount-perf-combinedfrom
perf/phase-r-needs-cleanup

fglock commented Apr 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fglock commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The change

What we deliberately did NOT do

Provenance

Test plan

Unit tests

Performance benchmarks

Comprehensive CPAN module tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fglock commented Apr 22, 2026 •

edited

Loading