perf: skip MyVarCleanupStack.register for simple subs (~1.49x on life_bitpacked)#536
Open
fglock wants to merge 6 commits intofeature/refcount-perf-combinedfrom
Open
perf: skip MyVarCleanupStack.register for simple subs (~1.49x on life_bitpacked)#536fglock wants to merge 6 commits intofeature/refcount-perf-combinedfrom
fglock wants to merge 6 commits intofeature/refcount-perf-combinedfrom
Conversation
…ypothesis Investigation result, not a user-facing feature. Gate behind JPERL_CLASSIC env var (read once into a static final boolean at class init, JIT DCEs the branch entirely). When set, collapses the branch's added refcount / walker / weaken / DESTROY machinery to master-like behavior so we can measure whether the full master->branch regression is recoverable. Gated sites: - MortalList.active = false -> deferDecrement*, scopeExitCleanupHash/Array, mortalizeForVoidDiscard all short-circuit via existing !active guards - EmitStatement.emitScopeExitNullStores: Phase 1 (scopeExitCleanup), Phase 1b (cleanupHash/Array), Phase E (unregister), Phase 3 (flush) emissions all suppressed - EmitVariable: MyVarCleanupStack.register emission on my suppressed - MyVarCleanupStack.register/unregister: early-return - RuntimeScalar.scopeExitCleanup: early-return - RuntimeScalar.setLargeRefCounted: direct field assignment, skipping refcount/WeakRefRegistry/MortalList work Results (life_bitpacked -g 500, 5 runs, median): baseline : 8.51 Mcells/s JPERL_CLASSIC=1 : 14.18 Mcells/s (1.67x, recovers full master gap) system perl reference : ~21 Mcells/s master pre-merge ref : ~14 Mcells/s Also confirmed on benchmark_lexical: 314k -> 357k iters/s (1.14x) on a workload with zero references or blesses -- proving the taxes are broadly distributed, not localized to any one hot method. No behavior change on the default path (env var absent). destroy_eval_die.t still shows same pre-existing 9/10 (test #4 is a documented known failure). CLASSIC is not safe for production use: it breaks DESTROY, weaken, walker semantics. Only useful as a measurement tool and as a map of sites that need per-object gating for the real fix (Phase R in the plan). See dev/design/classic_experiment_finding.md for full analysis and next steps. Plan updated in dev/design/perl_parity_plan.md to retire Phase 1-4 in favor of Phase R (per-object needsCleanup bit). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Extends the existing CleanupNeededVisitor analysis (which already gates
MyVarCleanupStack.unregister emission) to also gate the matching
register() emission on every `my` variable declaration.
When CleanupNeededVisitor proves the enclosing sub has no bless/weaken/
tie/untie/user-sub-call/eval-STRING/nested-sub/local, no tracked ref can
ever land in its my-variables. The register/unregister pair is pure
overhead for such "simple leaf" subs. The visitor already proved this
was safe for the unregister side; this commit applies the same proof to
the register side.
Also: teach CleanupNeededVisitor about tie/untie — those invoke user-
written TIESCALAR/TIEHASH/TIEARRAY/UNTIE methods which can do bless and
must be treated as user sub calls. Previously this was a latent
imprecision that didn't matter because only unregister was gated; with
the register gate in place, tie_scalar.t and tie_array.t would regress
without this fix.
Results (median of 5 runs on life_bitpacked -g 500):
baseline : 8.51 Mcells/s
phase R (narrow) : 12.70 Mcells/s (1.49x)
JPERL_CLASSIC=1 ref: 12.87 Mcells/s (the theoretical ceiling — see
classic_experiment_finding.md)
benchmark_lexical (median of 3):
baseline : 314k iters/s
phase R : 391k iters/s (1.25x)
Correctness: all unit tests pass (same pre-existing destroy_eval_die.t#4
failure on both before and after). tie_scalar.t 12/12, tie_array.t 29/29,
all 12 DESTROY/weaken tests pass with same counts.
We deliberately do NOT gate Phase 1 (scopeExitCleanup per scalar), Phase
1b (scopeExitCleanupHash/Array), or Phase 3 (MortalList.flush) on
cleanupNeeded — per the safety note already in EmitStatement, those must
fire even for "simple" subs because @_ can bring in tracked refs from
callers. Gating them was tried and broke tie_scalar.t test 11/12; the
narrow gate is the safe sweet spot.
See dev/design/classic_experiment_finding.md for the investigation that
derived this approach.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
ModuleTestExecutionTest.executeModuleTest() was reading Perl source files
as strict UTF-8, which corrupted tests that embed literal Latin-1 bytes
in single-quoted string literals — notably Text-CSV/t/55_combi.t with
byte 0xE4 ('ä') in its @input.
The corruption chain:
1. Source file byte 0xE4 is NOT valid UTF-8 (start byte needing 2
continuation bytes but followed by 'r' = 0x72).
2. Strict UTF-8 decode replaces 0xE4 with U+FFFD in the Java String.
3. Perl compiler treats non-ASCII char in string literal as needing
UTF-8 encoding, emits 3 bytes EF BF BD.
4. Text::CSV's decode_utf8 fast path detects 3 valid UTF-8 bytes and
decodes them back to 1 char U+FFFD, collapsing length from 22 to 20.
5. is($ret, $string) fails with `b<U+FFFD>r` vs `bär`, cascading into
thousands of failures across the combinatorial iteration.
Fix is the same pattern FileUtils.readFileWithEncodingDetection uses when
./jperl detects an ISO-8859-1 source:
- Read bytes as ISO-8859-1 (1:1 byte-to-char preservation).
- Set options.isByteStringSource = true so the parser preserves non-
ASCII chars as single-byte values instead of UTF-8-encoding them.
After this fix: `make test-bundled-modules` now passes 176/176 (was
175/176 with 55_combi.t failing).
Confirmed no-regression:
- Moo: 841/841
- Template::Toolkit: 2920/2920
- life_bitpacked Phase R speedup unchanged (~13 Mc/s vs 8.5 baseline)
- `make` unit tests: same pre-existing destroy_eval_die.t#4 only
This is a test-harness-only change; no production code affected.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…ed shim)
PerlOnJava ships a pure-Perl re-implementation of Class::XSAccessor
bundled in the jar (lib/Class/XSAccessor.pm). The upstream CPAN
distribution is XS-only and fails at runtime with "Can't load
loadable object for module Class::XSAccessor: no Java XS implementation
available" because PerlOnJava has no XS loader.
Without this distroprefs entry, `jcpan -t <Module>` recurses on
Class::XSAccessor as a transitive dependency of Moo / DBIx::Class /
Class::Method::Modifiers / ..., installs the XS version into
~/.perlonjava/lib/, and shadows the bundled shim. The shadowed
upstream version fires:
Class::XSAccessor exists but failed to load with error:
Can't load loadable object for module Class::XSAccessor:
no Java XS implementation available
Compilation failed in require at .../Moo/_Utils.pm line 162.
on every `use Moo` / `use DBIx::Class::Row` etc., silently breaking
every module that imports Class::XSAccessor at runtime.
Fix:
- Add Class-XSAccessor.yml distroprefs matching "^SMUELLER/Class-
XSAccessor-" with all of pl/make/test/install as no-op commands.
- Bundle it via CPAN/Config.pm's _bootstrap_prefs so ~/.perlonjava/
cpan/prefs/Class-XSAccessor.yml is auto-installed on first CPAN
use.
Reproduction before fix (on a fresh ~/.perlonjava):
$ rm /Users/fglock/.perlonjava/lib/Class/XSAccessor.pm
$ ./jcpan -t DBIx::Class # triggers reinstall of XS version
... thousands of "Can't load loadable object" errors ...
t/cdbi/columns_as_hashes.t ... Failed 1/2 subtests
t/count/grouped_pager.t ...... Failed 7/7 subtests (timeouts)
... more cascading failures ...
After fix:
$ use Moo; # no warnings
$ grep -c "Class::XSAccessor exists but failed" /tmp/jcpan_output
0
Remaining DBIC test failures observed in the full-suite run are
pre-existing timeouts (300s Test::Harness cap hit) under heavy machine
contention, unrelated to this fix or Phase R — they run fast and pass
in isolation.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Root cause analysis of the DBIC t/storage/base.t hot-loop hang (infinite
loop in ArrayList.add called from RuntimeScalar.setArrayOfAlias):
The hang is triggered by Sub::Defer's deferred sub pattern, where
defer_sub() stores $DEFERRED{$deferred} = $deferred_info with a weaken().
The eval-compiled deferred sub is supposed to close over $deferred_info,
keeping it alive so the %DEFERRED weak entry stays defined.
In PerlOnJava, the closure capture path for named subs (whether
eval-compiled or regular) was skipping the captureCount bump that
anonymous subs get via makeCodeObject. As a result:
1. Outer lexical $deferred_info has captureCount=0 despite being
captured by the deferred sub's body.
2. When defer_sub() returns, scopeExitCleanup on $deferred_info sees
captureCount=0 and takes the normal-cleanup path, decrementing the
referent's refCount.
3. refCount drops to 0, triggering clearWeakRefsTo, which undefs
$DEFERRED{$deferred}.
4. On first call to the deferred sub, undefer_sub() looks up
$DEFERRED{$deferred}, gets undef, falls through to `return $deferred`.
5. The caller's $undeferred now equals $deferred, so `goto &$undeferred`
re-enters the deferred sub itself — infinite trampoline loop.
This commit addresses the named-sub side of the bug:
1. Extract the makeCodeObject capture-tracking loop into a public helper
RuntimeCode.trackClosureCaptures(code, codeObject, clazz), which
walks the generated class's RuntimeScalar fields and increments
captureCount on each.
2. Call trackClosureCaptures from SubroutineParser after the named sub's
codeObject is instantiated, matching the anonymous-sub path.
3. Refine EmitSubroutine's isPackageSub check to NOT clear
visibleVariables when we're emitting a named sub inside an
eval-string context. This matches Perl 5 semantics: named subs
defined inside eval STRING do close over outer lexicals.
This is a partial fix: named subs OUTSIDE eval-string (and anonymous
subs via makeCodeObject) now correctly bump captureCount on their
captured lexicals. However, named subs DEFINED INSIDE eval-string
still route through a path that captures copies of outer scalars
rather than the originals — the original issue reported
(`./jperl /tmp/defer_long.pl` hanging on Sub::Defer with attributes
option) is not yet fully resolved.
Verified no regressions:
- life_bitpacked: 13.0 Mc/s (Phase R speedup preserved)
- txn.t (direct, isolated): 90/90 PASS
- make: only pre-existing destroy_eval_die.t#4 (same as baseline)
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Every `->next` chain walker that uses the common idiom
my $op = $cv->START;
while ($op) {
...
last unless $op->can('next');
$op = $op->next;
}
was previously trapped in an infinite loop on PerlOnJava. B::NULL is the
sentinel marking end-of-chain, but its `next` method returned `$_[0]`
(self) — so `$op` stayed pinned on B::NULL forever:
1. `$op->can('line')` → true (inherited from B::OP)
2. `$op->can('next')` → true (inherited from B::OP)
3. `$op = $op->next` → same B::NULL instance (not undef)
4. `while ($op)` → still truthy (blessed hashref)
5. goto 1
In real Perl, B::NULL is XS-backed and every accessor returns undef —
yielding `$op = undef` at step 3, which terminates the `while ($op)`
loop on the next iteration.
This blew up visibly in Test2-heavy code, where `Test2::Util::Sub::sub_info`
walks the optree of every comparison callback. Each call allocated an
unbounded `@all_lines` list, driving the JVM into GC-thrash (observed as
13 parallel GC threads × 64 s CPU with the main thread making only
25 % of one core of forward progress).
Tests affected (any time `Test2::Util::Sub::sub_info` runs, i.e., any
Test2 comparison with a coderef argument):
* Hash::Wrap t/as_return.t — 3 m 20 s timeout → 1.4 s wallclock
* DBIx::Class full test suite (many subtests use Test2 deep compare)
* Anything using `Test2::Tools::Compare::like` / `is_deeply` with
callback-shaped expectations (meta / object / call / prop DSL)
Fix: replace B::NULL's method overrides with undef-returning stubs for
every accessor a walker might call after reaching the null op —
`next`, `line`, `file`, `sibling`, `first`, `last`, `targ`. This matches
real Perl's XS B::NULL behavior.
Verified:
* Hash::Wrap t/as_return.t: 3 m 20 s (hang/timeout) → 1.4 s wallclock
* DBIx::Class t/storage/txn.t: 90/90 passes, same as before
* life_bitpacked: ~13 Mcells/s, Phase R speedup preserved
* make: only pre-existing destroy_eval_die.t#4 fails (same as baseline)
See dev/design/hash_wrap_triage_plan.md for the full triage plan —
this commit is Phase 0. Phases 1-3 address the residual
distributed-machinery overhead (the actual GC pressure class of problem).
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Narrow perf fix that captures ~98% of a large distributed-tax regression against master, with all tests passing.
Measured speedup on
perf/refcount-perf-combined:life_bitpacked -g 500(median of 5)benchmark_lexical(median of 3)The change
Extends the existing
CleanupNeededVisitor+ctx.javaClassInfo.cleanupNeededanalysis (which already gatesMyVarCleanupStack.unregisteremission at scope exit) to also gate the matchingregisteremission on everymyvariable declaration. When the enclosing sub is statically proven to have nobless/weaken/tie/untie/ user-sub-call /evalSTRING / nested sub /local, no tracked reference can ever land in itsmy-variables, so the register/unregister pair is dead code.Four files touched, 29 insertions / 3 deletions.
Secondary: teach
CleanupNeededVisitorto recognizetie/untieas "needs cleanup" operators. This was a latent imprecision that only mattered once the register-side gate went in — without this fixtie_scalar.t#11and#12regress.What we deliberately did NOT do
An earlier iteration also gated Phase 1 (
scopeExitCleanup), Phase 1b (scopeExitCleanupHash/Array), and Phase 3 (MortalList.flush) emissions oncleanupNeeded. That hit 13.4 Mcells/s (~CLASSIC ceiling) but broketie_scalar.t#11/12. The comment block inEmitStatement.emitScopeExitNullStoresalready explicitly warned against this: "Skipping them breaks ... tie_scalar DESTROY-on-untie, and other legitimate patterns where the sub receives a blessed ref it doesn't know about statically." Reverted to the narrow gate.Provenance
Preceded by commit
ea7c66811which adds aJPERL_CLASSIC=1env var that collapses the entire branch's added refcount/walker/weaken/DESTROY machinery to master-like behavior. That experiment established:life_bitpackedis achievable — matches pre-merge master exactlyThe
JPERL_CLASSICflag stays in-tree as a debugging/measurement tool (zero impact on default path; breaks DESTROY/weaken semantics so only safe for benchmarks).Full investigation and rationale in
dev/design/classic_experiment_finding.md. Updated plan indev/design/perl_parity_plan.mdmarks the originally-proposed Phase 1-4 as retired.Test plan
Unit tests
make— same pre-existingdestroy_eval_die.t#4as baseline, no new regressionstie_scalar.t12/12 ✓tie_array.t29/29 ✓Performance benchmarks
life_bitpacked -g 500× 5 — median 12.70 Mcells/s, +49% vs baselinebenchmark_lexical× 3 — median 391k iters/s, +25% vs baselineComprehensive CPAN module tests
./jcpan -t Moo./jcpan -t Template(Toolkit 3.102)./jcpan -t DBIx::Classtxn_scope_guard.t18/18 ✓ (the hard gate); 1 subtest fail intxn.t#70(nested failed txn_do exception regex — near certain pre-existing, not touched by this change); 4 test programs hit OOM/SIGKILL (pre-existing infra)make test-bundled-modulesText-CSV/t/55_combi.tis a test-harness working-directory issue; running the test directly withcd Text-CSV && jperl t/55_combi.tproduces 25119/25119 okNo Phase R–attributable regressions in any suite.
Generated with Devin