
perf: refcount + Phase-J + walker hardening combined (rebased on master)#526

Open
fglock wants to merge 166 commits into master from
feature/refcount-perf-combined

Conversation


@fglock fglock commented Apr 21, 2026

Summary

Rebases feature/walker-hardening-j3 (superseding PRs #508, #518, #523) onto current master (143 commits), producing a single up-to-date PR with all three stacks of work combined.

Perf wins (from #508/#518/#523)

  • Walker gating — skip ScalarRefRegistry.registerRef() and MyVarCleanupStack.liveCounts when no weak refs exist. 7.74 -> 12.50 Mcells/s on life_bitpacked (1.61x across both gates).
  • J2: suppressFlush skip for simple anon-literal returns that cannot allocate blessed temps. +12.7% arr_int, +4.7% arr_str.
  • J4: ByteCodeSourceMapper fileName cache — minor but measurable.
  • RuntimeScalar(long) no-autobox + int fast paths in BitwiseOperators.
  • Warning-bits cache per RuntimeCode + shared EMPTY_ARGS_SNAPSHOT in pushArgs (avoids a per-call list allocation in the zero-arg case).
  • System.getenv() cached as static final in 7 hot classes.
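The EMPTY_ARGS_SNAPSHOT win above can be sketched in a few lines. This is an illustrative reduction, not the actual PerlOnJava code (the real snapshot holds `List<RuntimeScalar>` for `@DB::args`; the names here are stand-ins):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the zero-arg fast path: reuse one shared immutable empty
// snapshot instead of allocating a fresh list on every call.
final class ArgsSnapshotSketch {
    static final List<String> EMPTY_ARGS_SNAPSHOT = Collections.emptyList();

    static List<String> snapshot(List<String> args) {
        if (args.isEmpty()) {
            return EMPTY_ARGS_SNAPSHOT; // no allocation for zero-arg calls
        }
        return new ArrayList<>(args);   // defensive copy only when needed
    }
}
```

The shared instance is safe precisely because it is immutable; any caller that needs to mutate the snapshot falls into the non-empty branch and gets its own copy.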

Merge integration notes

The merge with origin/master preserved our HEAD features while adopting master's newer refactors:

  • Iterative goto &func trampoline kept (master converted to recursive; we re-integrated master's inTailCallTrampoline counter into the iterative path via a per-iteration tailCallReentry flag so Deep-recursion warnings correctly skip tail-call hops).
  • @DB::args via setFromListAliased kept (shared slots for DBIC txn_scope_guard) while switching source to master's pristineArgsStack.
  • VOID-context mortalizeForVoidDiscard + MortalList.flush() kept.
  • EMPTY_ARGS_SNAPSHOT re-expressed over master's List<RuntimeScalar> snapshot type.

Dropped from HEAD in favor of master's refactors:

  • throwIfFallbackDenied (subsumed by master's OverloadContext.tryTwoArgumentNomethod which handles fallback=0 itself).
  • HEAD's InterpreterState.setCurrentPackage (dead code; only master's setCurrentPackageStatic has callers).
  • HEAD's ad-hoc main:: strip in RuntimeGlob.HASH (master's GlobalVariable.getGlobalHash now strips it at the lookup layer).

Test plan

  • make — all unit tests pass except pre-existing destroy_eval_die.t test 4 (verified on pre-merge HEAD 660aa9e via worktree).
  • DBIx::Class bundled module tests
  • Template-Toolkit tests
  • Moo tests
  • Modules referenced in merged commits: Memoize, Date::Calc, Math::Base::Convert, List::MoreUtils, AnyEvent::Util, Storable

Supersedes

Generated with Devin

fglock and others added 30 commits April 10, 2026 22:35
Two regressions from the DESTROY/weaken merge (PR #464):

1. BytecodeInterpreter: SCOPE_EXIT_CLEANUP_ARRAY/HASH/scalar opcodes
   crash with ClassCastException when the interpreter fallback path
   reuses registers with unexpected types. Add instanceof guards
   before casting. Fixes Sub::Exporter::Progressive (used by
   Devel::GlobalDestruction, needed by DBIx::Class).

2. GlobalDestruction: runGlobalDestruction() iterates global variable
   HashMaps while DESTROY callbacks can modify them, causing
   ConcurrentModificationException. Snapshot collections with
   toArray() before iterating. Fixes DBIx::Class Makefile.PL.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…weaken

- Updated branch/PR references for feature/dbix-class-destroy-weaken
- Added Phase 9 section documenting post-DESTROY/weaken assessment
- Documented 645 ok / 183 not ok across 92 test files
- Identified premature DESTROY blocker (20 tests) and GC leak blocker
- Catalogued improvements from DESTROY/weaken merge (PR #464)
- Updated Next Steps with new priorities (P0-P2)
- Marked obsoleted items (Phase 7, old GC/DESTROY sections)

…ings

Add localBindingExists flag to RuntimeBase that tracks when a named
hash/array (my %hash, my @array) has had a reference created via the \
operator. This flag indicates a JVM local variable slot holds a strong
reference not counted in refCount.

When refCount reaches 0, the flag prevents premature callDestroy since
the local variable may still be alive. The flag is cleared at scope exit
(scopeExitCleanupHash/Array), allowing subsequent refCount==0 to
correctly trigger callDestroy.

This fixes the DBIx::Class bug where \%reg stored via accessor caused
premature DESTROY of Schema objects when %reg went out of scope, even
though the hash was still alive through the stored reference.

The localBindingExists check is applied consistently across all
refCount decrement paths: setLargeRefCounted, undefine, weaken,
MortalList.flush, and MortalList.popAndFlush.
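The decrement-path guard described above can be sketched as follows. All names here (`Container`, `decRef`, `scopeExit`) are illustrative, not the real PerlOnJava API; the point is the interaction between the flag and the two refCount==0 checks:

```java
// Illustrative sketch: localBindingExists defers callDestroy until the
// JVM local slot holding the named hash/array is actually dead.
class Container {
    int refCount = 1;               // one counted reference (e.g. \%hash)
    boolean localBindingExists;     // set when \ creates a ref to a my-var
    boolean destroyed;

    void decRef() {
        if (--refCount == 0 && !localBindingExists) {
            callDestroy();          // safe: no live local still holds this
        }
    }

    void scopeExit() {              // scopeExitCleanupHash/Array analogue
        localBindingExists = false; // local variable slot is dead now
        if (refCount == 0) {
            callDestroy();          // deferred destruction fires here
        }
    }

    void callDestroy() { destroyed = true; }
}
```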

Sub::Exporter::Progressive's import() relies on caller() to determine
the target package. When Sub::Uplevel overrides CORE::GLOBAL::caller
(used by Test::Exception via Test::Builder), PerlOnJava's caller()
returns wrong frames during `use` processing, causing SEP to install
exports into the wrong package. This prevented in_global_destruction
from being exported to Devel::GlobalDestruction, which namespace::clean
then removed, causing a bareword error in DESTROY methods.

Fix by bundling a simplified Devel::GlobalDestruction that uses plain
Exporter instead of Sub::Exporter::Progressive. Since PerlOnJava always
has ${^GLOBAL_PHASE} (Perl 5.14+ feature), the implementation is a
simple one-liner.

Add the complete DBI::Const module hierarchy needed by DBIx::Class:
- DBI::Const::GetInfo::ANSI - ANSI SQL/CLI constants (from DBI 1.643)
- DBI::Const::GetInfo::ODBC - ODBC constants (from DBI 1.643)
- DBI::Const::GetInfoType - merged name-to-number mapping
- DBI::Const::GetInfoReturn - upgraded from stub to real implementation

These are pure Perl constant-data modules from the DBI distribution.
DBIx::Class uses them to translate info type names (e.g. SQL_DBMS_NAME)
to numeric codes for $dbh->get_info(), which our JDBC-based DBI already
implements with matching numeric codes.

314-test analysis: 155 blocked by "detached result source" (weak ref
cleared during clone -> _copy_state_from), ~10 GC-only, ~25 real+GC,
~6 errors. Root cause traced to Schema->connect's shift->clone->connection
chain where the clone temporary's refCount drops to 0 mid-operation.

Added reference to dev/architecture/weaken-destroy.md for refCount
internals needed for debugging.

Corrected categorization: 27 GC-only (was ~10), only 4 real functional
failures across all 40 non-detached test files. Added DESTROY trace
confirming Schema::DESTROY fires during _copy_state_from in clone().

Prevents premature DESTROY of return values from chained method calls
like shift->clone->connection(@_). During list assignment materialization
(my ($self, @info) = @_), setLargeRefCounted calls MortalList.flush()
which processes pending decrements from inner scope exits (e.g., clone's
$self). This can drop the return value's refCount to 0 before the
caller's LHS variables capture it, triggering DESTROY and clearing weak
refs to still-live objects.

The fix:
- Adds MortalList.suppressFlush() to temporarily block flush() execution
- RuntimeList.setFromList() suppresses flushing around the materialization
  and LHS assignment phase, then restores the previous state
- Also adds reentrancy guard to flush() itself (try/finally with flushing
  flag) to prevent cascading DESTROY from re-entering flush()

This fixes the DBIx::Class "detached result source" error where
Schema->connect() returned an undefined value because the Schema clone
was destroyed mid-construction.
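The two guards in this fix compose naturally; a minimal sketch (names approximate the commit's description, and the deferred decrements are reduced to `Runnable`s):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: suppressFlush() blocks flush() during list-assignment
// materialization, and the `flushing` flag stops cascading DESTROY
// from re-entering flush().
class MortalListSketch {
    private final Deque<Runnable> pending = new ArrayDeque<>();
    private boolean suppressed;
    private boolean flushing;

    void defer(Runnable decrement) { pending.push(decrement); }

    boolean suppressFlush(boolean on) {
        boolean prev = suppressed;
        suppressed = on;
        return prev;                  // caller restores the previous state
    }

    void flush() {
        if (suppressed || flushing) return;   // both guards
        flushing = true;
        try {
            while (!pending.isEmpty()) pending.pop().run();
        } finally {
            flushing = false;         // reentrancy flag always cleared
        }
    }
}
```

Returning the previous suppression state from `suppressFlush()` is what lets `setFromList()` nest safely: it restores whatever state its caller had rather than unconditionally clearing the flag.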

Condensed P0 done section, corrected P1 hypothesis (callDestroy sets
MIN_VALUE permanently, not a transient underflow), replaced speculative
fix approaches with concrete debugging plan.

…ised approach

Step 11.2 (popMark + flush in setLargeRefCounted) was implemented and
failed: mark-aware flush prevents DESTROY from firing for delete/untie/undef
because subroutine calls push marks that hide earlier entries. 4 unit test
regressions. Changes reverted.

Added Step 11.3 with root cause analysis, comparison to Perl 5 FREETMPS
model, and 4 possible approaches. Recommends Approach D (targeted GC leak
fix) since P0 premature DESTROY is already solved by suppressFlush.

Blessed objects whose class has no DESTROY method (e.g., Moo objects
like DBIx::Class::Storage::BlockRunner) were set to refCount=-1
(untracked) at bless time, so when they went out of scope their
hash/array elements' refcounts were never decremented.

Changes:
- ReferenceOperators.bless(): always track all blessed objects
  regardless of whether DESTROY exists in the class hierarchy.
  Previously, classes without DESTROY got refCount=-1 (untracked).
- DestroyDispatch.doCallDestroy(): when no DESTROY method is found,
  still cascade into hash/array elements via scopeExitCleanupHash/
  scopeExitCleanupArray + flush() to decrement contained references.

Test: dev/sandbox/destroy_weaken/destroy_no_destroy_method.t (13/13)
All unit tests pass (make).

…tails

- Step 11.4 fix committed and verified (all unit tests pass, 13/13 sandbox)
- GC-only failures explained: Sub::Quote closure walk differences, not refcount bugs
- Documented B::svref_2object->REFCNT method chain leak (separate bug)
- Updated Next Steps and Open Questions with investigation results

Complete handoff plan with 13 work items covering:
- GC object liveness at END (146 files, 658 assertions)
- DBI shim fixes (statement handles, transactions, numeric formatting,
  DBI_DRIVER, stringification, table locking, error handler)
- Transaction/savepoint depth tracking
- Detached ResultSource weak ref cleanup
- B::svref_2object method chain refcount leak
- UTF-8 byte-level string handling
- Bless/overload performance

Full suite baseline: 27 pass, 146 GC-only, 25 real fail, 43 skipped.
11,646 ok / 746 not-ok assertions.

Work Item 4: Added toJdbcValue() helper in DBI.java to convert
whole-number Doubles to Long before JDBC setObject(), fixing
10.0 vs 10 issue. Also handles overloaded object stringification.

Work Item 5: Fixed DBI.pm connect() to support empty driver in DSN,
$ENV{DBI_DRIVER} fallback, $ENV{DBI_DSN} fallback, proper error
messages, and require DBD::$driver for missing driver errors.

Work Item 6: Overloaded object stringification fixed by toJdbcValue().

Work Item 8: Added HandleError callback support in DBI.pm execute
wrapper, enabling DBIx::Class custom error handler.

Updated design doc with investigation findings for Work Item 2
(DBI statement handle finalization via cascading DESTROY).

Avoid fork exhaustion by limiting parallel processes.

do { BLOCK } return values were being prematurely destroyed when the
block contained lexical variables. The scope-exit flush processed
deferred decrements from inner subroutine returns before the caller
could capture the do-block result via assignment.

This fixes 11 of 12 "Unreachable cached statement still active" failures
in DBIx::Class t/60core.t. The Cursor DESTROY now fires at the correct
time, calling finish() on cached statement handles.

Root cause: do-blocks were treated as regular bare blocks (flush=true),
but like subroutine bodies, their return value is on the JVM operand
stack and must not be destroyed before the caller captures it.

Fix: Annotate do-blocks with blockIsDoBlock and skip mortal flush at
scope exit, matching the existing blockIsSubroutine behavior. Both
JVM backend (EmitBlock) and bytecode interpreter (BytecodeCompiler)
are updated.

When die throws inside eval{}, lexical variables between the die point
and the eval boundary go out of scope. Previously, their DESTROY methods
were never called because the SCOPE_EXIT_CLEANUP opcodes were skipped
by Java exception handling.

This fix adds scope-exit cleanup in both backends:

Bytecode interpreter:
- EVAL_TRY now records the first body register index
- The exception catch handler iterates registers from that index,
  calling scopeExitCleanup for each RuntimeScalar/Hash/Array,
  then flushes the mortal list to trigger DESTROY

JVM backend:
- During eval body compilation, all my-variable local indices are
  recorded via emitScopeExitNullStores into evalCleanupLocals
- The catch handler emits MortalList.evalExceptionScopeCleanup()
  for each recorded local, then flushes

New runtime helper: MortalList.evalExceptionScopeCleanup(Object)
dispatches to the appropriate cleanup method based on runtime type.

This is critical for DBIx::Class TxnScopeGuard, which relies on
DESTROY firing during eval exception unwinding to rollback transactions.

Implement runtime cleanup stack (MyVarCleanupStack) to ensure DESTROY
fires for blessed objects in my-variables when die propagates through
regular subroutines without an enclosing eval.

Approach: register my-variables at creation time on a runtime stack,
and unwind (running scopeExitCleanup) when exceptions propagate through
RuntimeCode.apply(). Normal scope exit uses existing bytecodes and
discards registrations via popMark.

Key changes:
- New MyVarCleanupStack class with pushMark/register/unwindTo/popMark
- EmitVariable.java: emit register() after my-variable ASTORE
- RuntimeCode.java: wrap 3 static apply() overloads with cleanup
- BytecodeInterpreter.java: propagatingException for interpreter backend

This replaces the failed emitter try/catch approach which caused
try_catch.t failures due to JVM exception table ordering.
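The pushMark/register/unwindTo/popMark shape can be sketched with a single deque and a mark sentinel. This is an illustrative reduction (cleanups shrunk to `Runnable`s; the real stack registers typed my-variables):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative mark-based cleanup stack. Exception propagation runs
// the cleanups above the mark; normal scope exit just discards them
// (the emitted scope-exit bytecodes already did that work).
class CleanupStackSketch {
    private static final Object MARK = new Object();
    private final Deque<Object> stack = new ArrayDeque<>();

    void pushMark() { stack.push(MARK); }

    void register(Runnable cleanup) { stack.push(cleanup); }

    // Exception path: run cleanups down to (and including) the mark.
    void unwindTo() {
        while (!stack.isEmpty()) {
            Object top = stack.pop();
            if (top == MARK) return;
            ((Runnable) top).run();
        }
    }

    // Normal exit path: drop registrations without running them.
    void popMark() {
        while (!stack.isEmpty() && stack.pop() != MARK) { /* discard */ }
    }
}
```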

Sub bodies use flush=false in emitScopeExitNullStores to protect
return values on the JVM operand stack. This caused DESTROY to fire
outside the caller's dynamic scope -- e.g., after local $SIG{__WARN__}
unwinds, causing Test::Warn to miss warnings from DESTROY.

In void context there is no return value to protect, so we can safely
flush deferred decrements immediately after the sub returns. Added
MortalList.flush() after mortalizeForVoidDiscard in all three static
apply() overloads.

Fixes: destroy.t (0/14 -> 14/14), weaken.t (3/4 -> 4/4),
       txn_scope_guard.t test 8

Three fixes for DBI handle garbage collection and resource management:

1. Use createReferenceWithTrackedElements() for Java-created DBI handles
   instead of createReference(). The latter incorrectly sets
   localBindingExists=true (designed for Perl lexical `my %hash`),
   which prevents DESTROY from firing in MortalList.flush(). This
   affected all 43+ DBIx::Class test files with GC-only failures.

2. Add Java-side DBI::finish() that closes the JDBC PreparedStatement,
   releasing database locks (e.g., SQLite table locks). Also add
   $sth->finish() call in DBI::do() for temporary statement handles.
   Fixes: t/storage/on_connect_do.t test 8 (table locking).

3. Break circular reference between dbh.sth (stores full sth ref) and
   sth.Database (stores dbh ref). Now dbh.sth stores only the raw JDBC
   Statement object needed for the last_insert_id() fallback.

…poraries

- Add pushMark/popMark/flushAboveMark to MortalList for scoped mortal
  boundaries (analogous to Perl 5's SAVETMPS/FREETMPS)
- Emit flushAboveMark at statement boundaries in EmitBlock to process
  deferred DESTROY within the current function scope only
- Fix bless() to mortalize newly blessed refs (refCount=1 + deferDecrement)
  so method chain temporaries like Foo->new()->method() get properly
  destroyed at the caller's statement boundary
- Fix hash/array setFromList() materialization to avoid spurious refCount
  increments from push() — use direct list.add() instead
- Fix RuntimeArray push/pop/shift to properly track refCount for
  container store/remove operations
- RuntimeCode.apply/callCached push/pop marks around function execution
  to isolate mortal scopes between caller and callee
- EmitStatement scope-exit always flushes pending entries even when no
  my-variables exist, to handle inner sub scope exit temporaries
- Remove debug tracing (JPERL_DEBUG_MORTAL env var checks)

These changes fix:
- Hash clear not triggering DESTROY (materialization refCount leak)
- Test2::API::Context premature DESTROY breaking subtests
- Method chain temporaries leaking (never reaching refCount=0)

grep/map/sort/all/any block closures are compiled as anonymous subs that
capture lexical variables, incrementing captureCount. Unlike eval blocks,
these closures were never having releaseCaptures() called after execution.
This caused captureCount to stay elevated, preventing scopeExitCleanup
from decrementing blessed ref refCounts — objects could never reach
refCount 0 and DESTROY would never fire.

The fix adds eager releaseCaptures() calls in ListOperators after each
operation completes. Only closures flagged as isMapGrepBlock (temporary
block closures) are affected — named subs and user closures are not
touched. Sort blocks now also receive the isMapGrepBlock annotation.

Also includes MortalList.flush() before END blocks (from prior session)
to ensure file-scoped lexical cleanup fires before END block dispatch.

Impact: DBIx::Class Storage object refCount dropped from 237 to 1.
Moo-generated constructors with grep-based required attribute validation
no longer leak refCounts.

When `local @array` or `local %hash` scope exits, dynamicRestoreState()
replaces the current elements with the saved ones. Previously, the
current (local scope's) elements were simply discarded by JVM GC,
without decrementing refCounts for any tracked blessed references
they owned.

This caused refCount leaks when Moo-generated writers used
`local @_ = ($self, $value)` for inlined qsub triggers — each call
leaked +1 on tracked objects stored in the same Moo object's hash.
In DBIx::Class, this manifested as Storage refCount climbing by +1
per dbh_do() call (e.g., 108 after init_schema instead of 2).

The fix calls MortalList.deferDestroyForContainerClear() on the
outgoing elements before replacing them, matching the cleanup done
by scopeExitCleanupArray/Hash for my-variable scope exit.

Impact: Storage refCount stays at 2 after 100 dbh_do() calls.
dbi_env.t failures reduced from 18 to 11 (remaining failures are
DBI::db objects held by a different retention path).

The callCached() method (used for all method dispatch via $obj->method)
was missing MyVarCleanupStack management. When a called method died,
my-variables registered inside the method's bytecode were never cleaned
up via unwindTo(), causing their refCount decrements to be lost. This
meant blessed objects held in those my-variables would leak (DESTROY
never fires).

Root cause: Regular function calls go through the static apply() which
wraps execution with MyVarCleanupStack.pushMark()/unwindTo()/popMark().
Method calls via callCached() bypassed this wrapper, calling either the
raw PerlSubroutine.apply() (cache hit) or the instance RuntimeCode.apply()
(cache miss) - neither of which manages MyVarCleanupStack.

Fix: Add MyVarCleanupStack.pushMark()/unwindTo()/popMark() to callCached()
by extracting the body into callCachedInner() and wrapping it with the
cleanup try-catch-finally.

Also fixes DBI.pm circular reference: weaken $sth->{Database} back-link
to $dbh, matching Perl 5's XS-based DBI which uses weak child→parent refs.

Together these fix all 4 remaining refCount leaks in DBIx::Class dbi_env.t
(tests 28-31), bringing it to 27/27 pass + 38 leak checks clean.

Anonymous arrays created by [...] were not birth-tracked (refCount
stayed at -1/untracked), unlike anonymous hashes which properly set
refCount=0 in createReferenceWithTrackedElements(). This caused
blessed object references stored inside anonymous arrays to leak:
the containerStore INC was never matched by a DEC when the array
went out of scope, so the blessed object's refCount never reached 0
and DESTROY was never called.

This was the root cause of DBIx::Class leak detection failures,
where connect_info(\@info) wraps args in an arrayref.

The fix adds the same birth-tracking logic to RuntimeArray that
RuntimeHash already had: set refCount=0 for anonymous arrays so
setLargeRefCounted can properly count references and callDestroy
can cascade element cleanup when the array is no longer referenced.

21 tests covering blessed objects inside anonymous arrayrefs and hashrefs:
- Basic scope exit, function argument passing, weak ref clearing
- Multiple objects, nested containers, return values
- DBIx::Class connect_info pattern (object in anon arrayref arg)
- Reassignment releasing previous contents

These prevent regression of the birth-tracking fix in
RuntimeArray.createReferenceWithTrackedElements().

splice() called deferDecrementIfTracked() on removed elements without
checking runtimeArray.elementsOwned. For @_ arrays (populated via
setArrayOfAlias), the elements are aliases to the caller's variables,
not owned copies. The alias shares the same RuntimeScalar as the
caller's variable, so refCountOwned reflects the caller's ownership,
not @_'s. This caused splice to decrement refCounts that @_ never
incremented, triggering premature DESTROY while the object was still
in scope.

The fix adds the elementsOwned guard to splice's removal loop,
matching the pattern already used by shift() and pop(). For @_ arrays
where elementsOwned is false, the DEC is skipped.

This is the exact pattern used by Class::Accessor::Grouped::get_inherited
in DBIx::Class: splice @_, 0, 1, ref($_[0])

Add stashRefCount to RuntimeCode to track glob/stash references that
are invisible to the cooperative refCount mechanism. Stash assignments
(*Foo::bar = $coderef) bypass setLarge(), so the stash reference was
not counted, causing false refCount==0 and premature releaseCaptures.

Key changes:
- RuntimeCode.stashRefCount tracks stash references (inc/dec in glob
  set, dynamicSaveState/dynamicRestoreState, stash delete paths)
- DestroyDispatch skips releaseCaptures when stashRefCount > 0
- releaseCaptures only cascades deferDecrementIfTracked for blessed
  referents, not unblessed containers (whose cooperative refCount can
  falsely reach 0 because closure captures hold JVM references not
  counted in refCount)

This fixes infinite recursion in DBIx::Class/Moo/Sub::Quote where
premature releaseCaptures cascaded to clear weak references in
Sub::Defer's %DEFERRED hash, triggering re-vivification loops.

The per-object memory footprint increased with RuntimeBase refcounting
and DESTROY support, causing code_too_large.t (10K lines, ~5K tests)
to OOM in the test runner. Setting maxHeapSize = '1g' for test tasks.

fglock and others added 17 commits April 21, 2026 12:37
System.getenv() is a native call (JNI → TreeMap lookup, ~200ns each).
Several debug flags were being evaluated via System.getenv() on every
call rather than once at class init. These repeated reads add up in hot
compile and runtime paths.

Caches:
  * MortalList.GC_DEBUG — was in maybeAutoSweep (fires after flush)
  * RuntimeScalar.PHASE_D_DBG — was in undefine()
  * RuntimeIO.IO_DEBUG — 2 hot spots in getRuntimeIO/close
  * IOOperator.IO_DEBUG — open/close hot paths
  * EmitterMethodCreator.ASM_DEBUG, ASM_DEBUG_CLASS_FILTER,
    BYTECODE_SIZE_DEBUG, SPILL_SLOT_COUNT — compile hot path,
    called ~once per compiled method (was 4-5 getenv per compile)
  * SubroutineParser.SHOW_FALLBACK — parser hot path
  * PerlLanguageProvider.SHOW_FALLBACK — compile fallback hot path

Semantically identical — these are all at-startup-determined debug
flags whose values never change during execution. Pattern already
used elsewhere in the codebase (e.g., ScalarRefRegistry, RuntimeRegex).
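The pattern is the standard static-final flag cache; a minimal sketch (the flag name `JPERL_GC_DEBUG` is a stand-in for the real variables listed above):

```java
// One System.getenv() per flag at class-load time instead of per call.
// The JIT can fold the final boolean, so the guarded branch is free.
final class DebugFlags {
    static final boolean GC_DEBUG = System.getenv("JPERL_GC_DEBUG") != null;

    static void maybeTrace(String msg) {
        if (GC_DEBUG) {               // plain field read, no native call
            System.err.println("[gc] " + msg);
        }
    }
}
```

This is semantically safe exactly because, as the commit notes, these flags are determined at startup and never change during execution.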

Regression gates:
  * DBIx::Class: Files=314, Tests=13804 — PASS (1152s, noise vs 1107)
  * Template-Toolkit: Files=106, Tests=2920 — PASS (133-137s)
  * Moo: Files=71, Tests=841 — PASS (91s)
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
ScalarRefRegistry.registerRef() was doing a
synchronized(WeakHashMap).put() on every setLargeRefCounted call
(every ref assignment). The registry is consumed ONLY by
ReachabilityWalker.sweepWeakRefs(), which is only invoked when
WeakRefRegistry.weakRefsExist is true (i.e., when weaken() has
been called at least once). For scripts that never weaken(),
every registerRef call was pure overhead.

JFR profile of life_bitpacked.pl (which never weakens anything)
showed WeakHashMap.put / Collections$SynchronizedMap.put /
expungeStaleEntries dominating post-compile CPU.

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before:  7.74–8.06 Mcells/s
  after:   9.79–9.93 Mcells/s   (+22% median)

Larger grid (200x200, 5000 gens): 10.01 → 11.05 Mcells/s (+10%).

Trade-off: scripts that hold many scalars-with-refs PRIOR to the
first weaken() call won't be in the registry when the walker first
runs. However, any subsequent setLarge on those scalars will
register them, and the walker's primary seeds (globals, code refs,
DESTROY rescued set) still find reachable structures via the normal
BFS. No DBIC 52leaks.t assertions regressed.

Escape hatch: JPERL_UNGATED_SCALAR_REGISTRY=1 restores the old
unconditional behavior.
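The gate itself is a one-line early-out before the synchronized put; a sketch, with names approximating the commit's description (`weakRefsExist`, `registerRef`):

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of the gate: skip the synchronized WeakHashMap.put entirely
// unless weaken() has ever been called.
final class ScalarRefRegistrySketch {
    static volatile boolean weakRefsExist;   // flipped once by weaken()
    private static final Map<Object, Boolean> registry =
        Collections.synchronizedMap(new WeakHashMap<>());

    static void registerRef(Object scalar) {
        if (!weakRefsExist) return;  // hot-path early out: no weak refs yet
        registry.put(scalar, Boolean.TRUE);
    }

    static int size() { return registry.size(); }
}
```

A `volatile` read is cheap and uncontended here, whereas the old path paid a monitor acquire plus `WeakHashMap.expungeStaleEntries` on every ref assignment.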

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS (136-138s)
  * Moo: Files=71, Tests=841 — PASS (95s; within noise of 91s baseline)
  * make unit tests — PASS

Same pattern as the prior ScalarRefRegistry.registerRef fix.
MyVarCleanupStack.register() is called for every `my` variable
declaration and was doing an unconditional
`liveCounts.merge(var, 1, Integer::sum)` — an IdentityHashMap
merge with a boxed Integer lambda. JFR of life_bitpacked.pl
surfaced this right after the scalar-registry gate landed.

The `liveCounts` map exists solely for
ReachabilityWalker.sweepWeakRefs' `isLive()` check, which only runs
when WeakRefRegistry.weakRefsExist is true. Scripts that never
weaken() pay the full IdentityHashMap.merge cost for nothing.

The walker's pre-weaken fallback (sc.scopeExited + sc.refCountOwned
checks) still correctly classifies live vs dead lexicals, so
semantically this is indistinguishable.

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before this patch:  9.79 Mcells/s   (after ScalarRefRegistry gate)
  after this patch:  12.50 Mcells/s   (+28%)

Combined vs pre-gate baseline on this branch:
  7.74 → 12.50 Mcells/s, a 1.61× speedup.

Larger grid (200x200, 5000 gens): 13.71 Mcells/s
(baseline was 10.01, a 1.37× speedup).

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS (143s)
  * Moo: Files=71, Tests=841 — PASS (97s)
  * make unit tests — PASS

Two targeted hot-path optimizations found via JFR on life_bitpacked.pl
after the ScalarRefRegistry / MyVarCleanupStack gates landed.

1. Cache getWarningBitsForCode() per RuntimeCode.

   The JVM-compiled branch looked up
   `code.methodHandle.type().parameterType(0).getName()` + a
   `WarningBitsRegistry.get(className)` HashMap probe on every sub
   invocation. The declaring class of a compiled code's MethodHandle
   is stable for the lifetime of the RuntimeCode, so cache the
   resolved warning bits string in a `cachedWarningBits` field.
   Sentinel pattern (`WARNING_BITS_NOT_COMPUTED = "<uninit>"`) keeps
   a legitimately-null result distinguishable from not-yet-computed.

2. pushArgs: shared empty snapshot for zero-arg calls.

   Every sub call was allocating a fresh `RuntimeArray` wrapper +
   `ArrayList<>` to snapshot @_ for `@DB::args` support, even when
   the callee was called with zero arguments. Share a single
   `EMPTY_ARGS_SNAPSHOT` for those; real allocation only happens when
   args is non-empty.
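The zero-arg fast path from item 2 reduces to this shape (a sketch — the real pushArgs works on RuntimeScalar lists; plain Strings stand in here):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Shared immutable snapshot for the zero-arg case; only a non-empty @_
// pays for a defensive copy.
public class ArgsSnapshot {
    static final List<String> EMPTY_ARGS_SNAPSHOT = Collections.emptyList();

    static List<String> snapshot(List<String> args) {
        return args.isEmpty() ? EMPTY_ARGS_SNAPSHOT : new ArrayList<>(args);
    }
}
```

Sharing is safe because the empty snapshot is immutable and indistinguishable from a freshly-allocated empty list to any reader.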

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before this patch:  11.68–12.50 Mcells/s
  after this patch:   12.19–13.07 Mcells/s

The cumulative speedup since the start of this PR (post-Phase-J
baseline on this branch, 7.74 Mcells/s) is **1.65×**.

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS
  * Moo: Files=71, Tests=841 — PASS
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two small hot-path cleanups found via JFR allocation profile on
life_bitpacked.pl:

1. RuntimeScalar.initializeWithLong took `Long` (boxed) rather than
   `long` (primitive). The `RuntimeScalar(long)` constructor
   autoboxed long → Long on every call, then the method unboxed it
   again. Changed the signature to primitive `long` and adapted the
   `RuntimeScalar(Long)` constructor to call `.longValue()` explicitly.

2. BitwiseOperators.bitwiseAnd / bitwiseOr / bitwiseXor fast paths
   computed result as `long` and called `new RuntimeScalar(long)`,
   forcing the initializeWithLong branch cascade. Since int ^ int,
   int & int, int | int all produce int by definition, compute as
   int and call `new RuntimeScalar(int)` directly — skips the
   range-check branches.

Bitwise shift ops are unchanged (still use long) because left-shift
may grow beyond 32 bits and must preserve semantics.
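The int-vs-long distinction in item 2 boils down to this (sketch only; the real fast paths live in BitwiseOperators):

```java
// int op int cannot leave the int range for &, |, ^ — so compute in int and
// skip the long widening + range-check cascade. Shifts stay in long because
// a left shift can exceed 32 bits and must not wrap.
public class BitwiseFastPath {
    static int xorInt(int a, int b) {
        return a ^ b;                 // result is an int by definition
    }

    static long shiftLeft(int a, int n) {
        return (long) a << n;         // widen first: 1 << 40 must survive
    }
}
```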

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * make unit tests — PASS

Measured on examples/life_bitpacked.pl default args: ~12.3–12.6
Mcells/s (unchanged at the mean — these changes eliminate boxing
overhead per op but the JIT already optimized it well; the signature
fix prevents an autobox that was showing up in JFR's alloc sample
but not in the JIT-compiled hot loop).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Brings 143 commits from master since feature/walker-hardening-j3 was
split off, including:
- Deep-recursion warning + inTailCallTrampoline (10631d7)
- pristineArgsStack for @DB::args (4ddfe34)
- hasArgsStack for caller()[4] (d1b5c6b)
- Bitwise feature opcodes (b75d5e0)
- Statement-modifier my/our/state hoisting (c8d065d, 3bfaffd)
- require FILE caller-package (40b535a)
- Overload autogen refactor: direct->cmp->nomethod (4ddfe34)
- Stash hash deref fix (9f38ef5)

Conflict resolution preserved our HEAD features:
- Iterative goto &func trampoline (no Java-stack growth for long chains)
- VOID-context mortalizeForVoidDiscard + MortalList.flush()
- @DB::args setFromListAliased (shared slots, DBIC txn_scope_guard)
- EMPTY_ARGS_SNAPSHOT fast path in pushArgs (life_bitpacked perf)

Plus integrated master's inTailCallTrampoline counter into the
iterative path via a per-iteration tailCallReentry flag so
Deep-recursion warnings don't count goto &func hops.

Pre-existing test failure (destroy_eval_die.t test 4 - LIFO DESTROY
during eval/die) reproduces on pre-merge HEAD 660aa9e and is
therefore NOT a merge regression.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Master's 4ddfe34 refactored overload autogen (direct->cmp->nomethod)
but its tryTwoArgumentNomethod only handles the explicit `fallback=>0`
case — it returns null when `fallback=>undef`, silently falling through
to default stringification.

Perl 5 actually treats `fallback=>undef` (or no fallback attr) as "no
method found" when none of the direct op, cmp autogeneration, or
nomethod resolve. Compare:

    package BrokenOverload; use overload '""' => sub { $_[0] };
    ...
    bless({}, 'BrokenOverload') eq "str"
    # real perl: dies "Operation 'eq': no method found"
    # post-4ddfe3434: silently stringifies

This broke DBIC t/storage/txn.t test 90 ("One matching warning only"),
which relies on catching exactly that exception from a
BrokenOverload-stringify to emit a DBIC-level warning.

Fix: keep master's newer autogen order but re-add HEAD's
throwIfFallbackDenied after tryTwoArgumentNomethod returns null for
eq/ne, covering the fallback=undef case. Verified against perl 5:

    bless({}, 'BrokenOverload') eq "str"
    # perl + jperl: "Operation 'eq': no method found" (match)
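The resolution order this fix restores can be sketched as a decision table. This is illustrative only — `Boolean` models Perl's `fallback => 1 / 0 / undef` (null), and the strings stand in for real overload handlers:

```java
// Sketch of 'eq' resolution: nomethod wins if it resolves; fallback => 1
// silently autogenerates; fallback => 0 AND fallback => undef both die,
// matching perl 5's "no method found" behaviour.
public class EqResolution {
    static String resolveEq(String nomethodResult, Boolean fallback) {
        if (nomethodResult != null) {
            return nomethodResult;        // nomethod handler resolved it
        }
        if (fallback != null && fallback) {
            return "default-stringify";   // fallback => 1: silent autogeneration
        }
        // fallback => 0 or fallback => undef: die like real perl
        throw new IllegalStateException("Operation 'eq': no method found");
    }
}
```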

Regressions fixed:
- DBIx::Class t/storage/txn.t test 90 now passes (90/90 total).

No regressions introduced: Date::Calc (3005/3005), Math::Base::Convert
(5350/5350), Moo (841/841), Template (2920/2920), all unit tests pass
except pre-existing destroy_eval_die.t test 4.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Records test results for the feature/refcount-perf-combined merge
(bundled modules, DBIC, TT, Moo) plus results for each module
referenced in master commits since 660aa9e (Memoize, Date::Calc,
Math::Base::Convert, List::MoreUtils, AnyEvent).

Documents the one regression found (DBIC txn.t#90) and its fix
(1869bad), plus the conflict-resolution decisions made during the
merge.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The upstream Class::XSAccessor is an XS-only accessor generator that
PerlOnJava can't load. Without this module bundled, every dependent
(Moo, DBIC, Class::Method::Modifiers, …) emits a noisy load warning
on every import:

    Class::XSAccessor exists but failed to load with error:
      Can't load loadable object for module Class::XSAccessor:
      no Java XS implementation available
    Compilation failed in require at …/Moo/_Utils.pm line 162.

The tests all pass (Moo's tests fall back to pure-Perl), but the
log noise drowns out real output and blocks reviewer sign-off.

Ship a PP port under src/main/perl/lib/Class/XSAccessor.pm,
Class/XSAccessor/Heavy.pm, and Class/XSAccessor/Array.pm. Public
API matches upstream exactly:

- accessors, getters, setters, lvalue_accessors
- predicates, defined_predicates, exists_predicates
- constructor, constructors
- true, false
- chained, replace, class options
- hash-based (Class::XSAccessor) and array-based
  (Class::XSAccessor::Array) object layouts

Key semantic fidelity: getters croak on extra args ("Usage: ...") —
Moo relies on this to implement read-only ("ro", "reader") attributes
so `$obj->reader(42)` on a ro attr throws.

Verified:
- Moo 841/841 PASS, ZERO XSAccessor load warnings
- Pre-existing Class::Method::Modifiers 140-lvalue.t (1/131) failure
  reproduces on pre-merge HEAD 660aa9e — unrelated to this bundle
- Standalone smoke test exercises getters, setters, accessors,
  predicates, exists_predicates, constructor, true, false

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
jcpan -t Class::C3 (DBIx::Class transitive dependency) shows 4 test
failures:

- t/24_more_overload.t (2 fails) — numeric-overload 'no method found'
  path missing; only eq/ne got the fix in 1869bad.
- t/36_next_goto.t (1 fail) — proxy next::can via goto returns undef.
- t/37_mro_warn.t (1 fail) — MRO::Compat emits redefine warnings on
  load.

All 4 reproduce on pre-merge HEAD 660aa9e so they are NOT regressions
from this merge, but they're worth triaging since Class::C3 underpins
DBIx::Class (which currently passes its own test suite because it
doesn't exercise these Class::C3 code paths).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Move the finished / superseded design docs under dev/design/archive/:
- phase_j_performance_plan.md (J-tier landed / documented blocks)
- refcount_alignment_plan.md (phases 1-7 complete)
- refcount_alignment_52leaks_plan.md (DBIC 52leaks passing)
- refcount_alignment_progress.md (sandbox tracker)
- pr526_merge_validation.md (this PR's validation log)

Add dev/design/next_steps.md that enumerates the concrete follow-ups
surfaced by the PR #526 work, sized and prioritized:

Correctness
  1.1 numeric-overload fallback=undef throw (extends 1869bad pattern)
  1.2 goto &func through next::can
  1.3 destroy_eval_die.t#4 LIFO order
  1.4 MRO::Compat load-time redefine warnings

Packaging
  2.1 complete Class::XSAccessor PP (__tests__, benchmark)
  2.2 Text-CSV bundled-runner flake

Performance
  3.x registry gating extensions, method-cache stats

Each item has a repro, fix sketch, and size estimate, plus the
downstream CPAN impact (especially Class::C3 which underpins DBIC).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
PR #526 shipped with a significant performance regression that was
accepted only because correctness blocks merge:

  ./jperl examples/life_bitpacked.pl -r none -g 500
    Before merge (660aa9e): ~12.5 Mcells/s
    After  merge (PR #526):   ~6.6  Mcells/s
    => running at ~53% of prior perf (~47% slower)

This workload is the benchmark that drove the walker-hardening wins
in PR #523 — losing it silently would undo that work.

Add §0 to dev/design/next_steps.md:
- numbers + measurement command
- ranked hypotheses (pristineArgsStack clone, hasArgsStack TL
  churn, inTailCallTrampoline/tailCallReentry, deep-recursion
  counter, WarningBits / HintHash push/pop)
- action plan: bisect commits, profile, apply weakRefsExist-style
  gating to each new stack if hot
- explicit acceptance criterion: >= 12 Mcells/s with tests green

Also:
- add header warning banner so reviewers notice before scrolling
- mark §3 (new perf work) as blocked on §0
- update progress tracker + next-action pointer

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The life_bitpacked regression is the cumulative cost of ~143 master
commits since the walker-hardening branch was forked; no single
PR-#526 commit added enough to attribute it via bisect.

Rewrite §0 action plan around the techniques that will actually
produce signal:

1. Profile both baselines and diff flame graphs
   (async-profiler or JFR at pre-merge 660aa9e vs PR-#526 tip)
2. Static diff of the RuntimeCode.apply() prologue — count
   ThreadLocal.get() sites, allocations, static dispatches
3. Count-based analysis: env-gated counters at suspected hot
   sites, run once, dump totals
4. Per-hypothesis gating fixes (weakRefsExist-style pattern)
5. Iterate per-commit; re-measure each change

Also update "Next action" in the progress tracker to reflect this.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Reframe the perf top-priority as "every dev/bench/*.pl ≤ 1.0× perl
wallclock" rather than just recovering life_bitpacked. Record the
current ratios (2026-04-21, perl 5.42.0, PR #526 tip):

  lexical         0.41×  ✅ 2.4× faster than perl
  string          0.61×  ✅ 1.6× faster
  closure         1.08×  ≈ parity
  method          1.73×  ❌ 1.7× slower
  refcount_anon   3.94×  ❌ worst — 4× slower
  life_bitpacked  3.15×  ❌ 3× slower (regressed from 1.66× pre-merge)

Split §0 into sub-sections:
  0.1 fix the COMPARE=perl mode in dev/bench/run_baseline.sh so
      every baseline JSON records both jperl and perl times
  0.2 life_bitpacked regression (the one we just found)
  0.3 refcount_anon 4× gap (worst absolute gap)
  0.4 method 1.7× gap
  0.5 hypotheses (pristineArgsStack, hasArgsStack, deep-recursion
      counter, warning-bits push, tail-call trampoline)
  0.6 action plan (profiling + static diff + counters, not bisect)
  0.7 acceptance criteria

Update the banner + progress tracker + Next action accordingly.
§0.1 is the prerequisite — without side-by-side numbers in the
bench output, every subsequent perf change is hard to evaluate.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
User running the full PR #526 test matrix reported two DBIC failures:

- t/discard_changes_in_DESTROY.t  -> 300s timeout in parallel harness
- t/zzzzzzz_perl_perf_bug.t       -> exit 11

Both trace back to §0 (perf parity), not new correctness breaks:

1. zzzzzzz_perl_perf_bug.t is DBIC's "your perl install is slow"
   diagnostic — it fails when blessed-ref ops are >= 3x plain
   array-ref ops. PerlOnJava currently hits 4x (same gap as
   benchmark_refcount_anon). Resolving §0.3 will fix this test.

2. discard_changes_in_DESTROY.t — standalone runs cleanly in ~14s
   (confirmed locally, 5 consecutive runs, exit 0). The test
   deliberately installs a DESTROY that calls discard_changes and
   expects no infinite recursion; that semantic works. Under
   parallel jcpan -t (HARNESS_OPTIONS=j8) it times out because
   the per-sub-call overhead x 8 parallel jobs pushes wallclock
   past the 300s harness timeout. Resolving §0.2/§0.3 will fix
   this too.

Document both under §0.3 so the reader immediately sees the
connection between the bench-level gap and the visible DBIC
user-facing failures.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Earlier investigation used system `prove`, which ran tests under
system perl and failed for unrelated reasons (wrong interpreter).
Re-run with `jprove -j4` on the DBIC tree: the batch including
t/discard_changes_in_DESTROY.t hangs past 400s, matching the user's
300s-timeout report.

Standalone (single jperl) the same test still passes cleanly in
~14s over 5 consecutive runs. So the failure mode is:

  - Semantic is fine (DESTROY fires exactly once, no recursion)
  - Under parallel jprove, N JVMs starting concurrently plus
    per-sub-call machinery cost compounds into either the test
    itself exceeding the 300s harness timeout or a sibling test
    on the same slot doing so.

Update §0.3 with the confirmed reproduction technique.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Prerequisite for the §0 parity work (see dev/design/next_steps.md).
Without side-by-side `jperl` vs `perl` numbers, perf changes are
hard to evaluate and regressions (like the one currently shipped
in PR #526) slip in unnoticed.

Changes to dev/bench/run_baseline.sh:

- COMPARE=perl: when set, each bench runs under both jperl and
  system perl (configurable via PERL=/path/to/perl); ratio +
  parity marker are emitted in the markdown summary.
- life_bitpacked: if examples/life_bitpacked.pl exists, run it
  with `-r none -g 500` and parse "Cell updates per second" from
  stdout. Tracked as Mcells/s rather than wallclock seconds, so
  higher = faster (ratio is inverted: perl / jperl).
- Dual output: baseline-<sha>.json (machine-readable) AND
  baseline-<sha>.md (human-readable table).
- SKIP_LIFE=1 to opt out of the life bench (e.g. in CI where
  startup cost dominates short runs).
- bash 3.2 compatible (macOS default) — no associative arrays.

Add dev/bench/README.md documenting the harness, workloads, and
link to the parity plan.

Current baseline (PR #526 tip, BENCH_RUNS=1 sanity run, macOS
M-series, perl 5.42):

  lexical         0.40×  ✅
  string          0.62×  ✅
  closure         1.39×  ❌
  regex           1.36×  ❌
  method          1.78×  ❌
  global          1.92×  ❌
  refcount_anon   3.82×  ❌
  eval_string     4.68×  ❌
  anon_simple     5.43×  ❌
  refcount_bless  6.67×  ❌
  life_bitpacked  ~3.17× ❌ (6.47 Mcells/s vs 20.52)

3 runs per bench (default) amortizes JVM startup better than the
single-run shown above.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
fglock and others added 9 commits April 21, 2026 19:38
bench: COMPARE=perl + life_bitpacked in run_baseline.sh
Captured concrete data from a first pass at §0 optimization work:

1. Ran the full 11-benchmark + life_bitpacked matrix with COMPARE=perl
   from the newly-landed §0.1 harness. Worst gaps:
     refcount_bless  7.75×   life_bitpacked 2.57×
     anon_simple     4.73×   method         2.01×
     eval_string     4.71×   global         1.92×
     refcount_anon   4.55×
   Lexical and string are FASTER than perl. 2-4 benches at parity.

2. Tested a "lazy @DB::args snapshot" gate (callerFromDBObserved
   flag). Recovered ~13-16% on refcount_bless / refcount_anon but
   BROKE DBIC txn_scope_guard.t ("Preventing MULTIPLE DESTROY()
   invocations" detection fails). The snapshot is needed
   retroactively — DBIC's DESTROY walks caller frames and reads
   @DB::args at scope exit, when the relevant snapshots were taken
   long before any caller-from-DB could have been observed.

3. Math calibration: refcount_bless's ~3µs/call gap is far larger
   than the ~110ns/call TL traffic we were initially focused on.
   The dominant cost is elsewhere — likely bless() + MortalList
   registration + DESTROY dispatch, or MethodHandle dispatch
   overhead. Profiling is required, speculation is not.

Update §0.5 with the failed-experiment note, the correct math,
and two new hypotheses (bless/MortalList cost, MethodHandle
dispatch cost).

No code change in this commit — the attempted gate was reverted.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
JFR allocation profiling on benchmark_refcount_bless (which never
touches a regex) surfaced RegexState.save() as one of the top sources
of per-sub-call allocation — ~250 sampled allocations in a 13-second
run, which JFR's sampling rate scales up to millions of actual
allocations.

RegexState.save() is called on every subroutine entry to preserve
Perl's regex match variables ($1, $&, etc.) across the callee's
regexes. The snapshot captures 13 fields of the RuntimeRegex
static globals. Before this commit, a fresh 13-field object was
allocated every time, even when every single captured field was
still at its default (null / -1 / false).

Introduce a shared EMPTY singleton and a hasDefaultState() check:
when all the RuntimeRegex statics are still at their defaults (no
regex has matched yet in this process), push the singleton instead
of allocating. Once any regex matches, the defaults flip and
subsequent save() calls fall back to the per-call allocation path.
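The EMPTY-singleton pattern reduces to this sketch (the real RegexState snapshots 13 RuntimeRegex statics; trimmed to two stand-in fields here):

```java
// Singleton-when-default sketch: save() allocates nothing until the first
// regex match flips the process-wide defaults.
public class RegexStateSketch {
    static String lastMatch = null;   // stand-ins for the RuntimeRegex statics
    static int lastStart = -1;
    static int allocations = 0;       // declared before EMPTY so its count is visible

    final String match;
    final int start;

    RegexStateSketch(String m, int s) { match = m; start = s; allocations++; }

    static final RegexStateSketch EMPTY = new RegexStateSketch(null, -1);

    static boolean hasDefaultState() {
        return lastMatch == null && lastStart == -1;
    }

    static RegexStateSketch save() {
        // No regex has matched yet: push the shared singleton, allocate nothing.
        return hasDefaultState() ? EMPTY : new RegexStateSketch(lastMatch, lastStart);
    }
}
```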

Correctness: verified with an explicit $1 preservation test across
sub calls (the common-case idiom). All unit tests still pass
(except the pre-existing destroy_eval_die.t #4).

Measured effect: modest alone (1-3% on most benches, ~14-17% on
refcount_bless / refcount_anon, within run-to-run noise on others).
Worth shipping on its own because it's correct, isolated, and zero
risk. The bigger wins for refcount_bless / refcount_anon are still
to come — ongoing §0.2/§0.3 investigation.

Baseline now in dev/bench/results/baseline-d3ee4e200.md.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
JFR allocation profiling on a tight-loop bless+DESTROY benchmark
surfaced NameNormalizer$CacheKey as one of the top per-sub-call
allocators (~4 MB weight, hundreds of samples over a 13s run).

The prior design used a Map<CacheKey, String> with
  CacheKey cacheKey = new CacheKey(defaultPackage, variable);
  String cached = nameCache.get(cacheKey);
allocating a fresh CacheKey record on EVERY lookup just to probe
the cache. The original comment said "composite key avoids ~12 ns
string concatenation per lookup" — but the CacheKey allocation +
its hashCode/equals cost on every call more than ate that saving
once the method call path became a hot path.

Replace with a two-level ConcurrentHashMap:
  Map<String /* defaultPackage */, Map<String /* variable */, String>>

Lookups are two hash probes, zero allocation on the hot path. The
outer map has one entry per active package (few distinct values),
the inner map has entries per variable name. ConcurrentHashMap
handles the population race without extra synchronization.
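The two-level shape looks roughly like this (identifiers are illustrative; `normalize()` stands in for the real name normalization):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Two-level cache sketch: outer key = package, inner key = variable.
// The hit path is two plain get() probes with zero allocation; only a miss
// takes the computeIfAbsent cold path, where ConcurrentHashMap absorbs the
// population race.
public class TwoLevelNameCache {
    static final Map<String, Map<String, String>> cache = new ConcurrentHashMap<>();
    static int normalizeCalls = 0;

    static String normalize(String pkg, String var) {
        normalizeCalls++;
        return pkg + "::" + var;
    }

    static String lookup(String pkg, String var) {
        Map<String, String> perPkg = cache.get(pkg);
        if (perPkg != null) {
            String hit = perPkg.get(var);
            if (hit != null) {
                return hit;   // hot path: two hash probes, no CacheKey allocation
            }
        }
        return cache.computeIfAbsent(pkg, p -> new ConcurrentHashMap<>())
                    .computeIfAbsent(var, v -> normalize(pkg, var));
    }
}
```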

No behaviour change — same cache semantics, same cached values,
same size eventually. Only the probing path is cheaper.

Correctness: `make` still passes except the pre-existing
destroy_eval_die.t #4. Regex + $1 preservation across sub calls
still works (smoke test).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Record the results of this session's JFR-allocation-profiler pass
on the bless/DESTROY hot path:

1. Identified top per-sub-call allocators via JFR
2. Fixed RegexState (fa8df8a) — singleton for no-match-yet state
3. Fixed NameNormalizer$CacheKey (1400475) — two-level cache
4. Documented remaining top allocators (Object[], RuntimeScalar,
   ArrayList, RuntimeList, RuntimeHash, etc.)

Cumulative measured effect from baseline d071692:
  benchmark_refcount_bless  7.75× -> 6.63×  (-14%)
  benchmark_refcount_anon   4.55× -> 4.07×  (-11%)
  benchmark_global          1.92× -> 1.33×
  benchmark_method          2.01× -> 1.72×
  life_bitpacked            2.57× -> 2.60×  (no measured change)

Important caveat recorded: Benchmark.pm itself inflates the
apparent gap by ~3× compared to a bare `for` loop. Without
Benchmark.pm the real gap is ~2.3×, not ~7.75×. Benchmark.pm
optimization is an independent future target.

Key insight for future work: most remaining allocations look
inherent to per-call semantics (@_, return RuntimeList, etc.).
Recovering the 2-3× gap likely needs structural changes (inline
small arg passing, "simple-sub" emit path skipping the RuntimeArray
build for callees that don't reference @_) rather than more
allocation micro-optimization.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…ked regression

Record the outcome of the apply()/applySlow split experiment on a
perf/apply-fast-path branch (since deleted).

Using javap to measure actual bytecode sizes, we found:
  apply(RuntimeScalar, RuntimeArray, int)       1150 bytes -> 28
  apply(RuntimeScalar, String, RuntimeBase[], int) 982 bytes -> 99
  apply(RuntimeArray, int) instance              619 bytes -> 105

All three dropped well under HotSpot's FreqInlineSize=325, and
`-XX:+PrintInlining` confirmed the fast paths inlined at hot call
sites (including `(hot)` markers on pushArgs/popArgs/enterCall).

BUT: benchmark_closure improved (1.21x -> 0.84x, now FASTER than
perl) while life_bitpacked REGRESSED badly (2.57x -> 5.38x, a 2x
slowdown). Different callers responded differently to the new
inlineability — HotSpot's inline-budget is per-caller, and shrinking
apply() let other things inline into callers that then crossed an
internal boundary.

Per user rule "merge back only when perf gain is confirmed -
otherwise revert", the experiment was reverted. Branch deleted.

Key takeaway: DO NOT ship apply() inlining changes without
explicitly measuring life_bitpacked before/after. The "make apply
inlinable" intervention is a zero-sum trade with current code
shape, and needs pairing with compile-time analysis of which
callers benefit.

Also documented the Benchmark.pm 3x overhead caveat and the
remaining top allocators from profiling.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
dev/bench/results/ snapshots of the benchmark harness output at
each tagged commit on the §0 perf-parity track:

  baseline-d071692a3.*  PR #526 merge tip (before any perf fix)
  baseline-b7d05b77e.*  after fa8df8a (RegexState) + 1400475 (NameNormalizer)
  baseline-d3ee4e200.*  intermediate snapshot
  baseline-1400475d3.*  after NameNormalizer fix

These are the canonical reference points for §0.8 numbers so
regressions can be diffed against them mechanically rather than
by memory.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Three-way measurement + profiling establishes that PR #526's
8.3 Mcells/s on life_bitpacked is NOT a regression from its
parent (walker-hardening-j3 tip 660aa9e also does 8.4 Mcells/s).

The gap is vs MASTER, not vs our own branch. Master runs
life_bitpacked at 14.0 Mcells/s — 1.67x faster than our branch —
but master fails SIX destroy_eval_die.t tests that our branch
passes. The speed difference is paying for DESTROY correctness.

Concrete cost breakdown (from javap + JFR):

- Our branch has 3x more DESTROY/weaken runtime code (2130 lines
  vs master's 733, including ReachabilityWalker and
  ScalarRefRegistry which master does not have at all).
- Our EmitStatement emits 2 extra INVOKESTATIC instructions per
  `my` variable per scope exit (MyVarCleanupStack.unregister plus
  the existing scopeExitCleanup), pushing compiled Perl method
  sizes up ~6% and doubling RuntimeScalar allocations (89 vs 43
  sampled in a 60s run).
- ASM frame-compute is measurable at startup (26 samples in
  jdk.ExecutionSample) because our generated methods are larger.

This reframes §0 work: closing the master/PR gap requires a
per-sub "cleanup-needed" static analysis that skips the extra
scope-exit emission when the sub's lexicals are never used with
weaken/DESTROY/closure capture. Mechanical perf hunts (the kind
we did in fa8df8a / 1400475) won't close this gap.

Closing the REMAINING gap to perl (1.52x on master already) is a
separate problem — PerlOnJava fundamentals (boxing, dispatch),
not refcounting.

Full analysis: dev/design/life_bitpacked_regression_analysis.md

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
A new CleanupNeededVisitor (following the established visitor pattern,
like TempLocalCountVisitor) analyses each sub's body and decides
whether the per-scope-exit MyVarCleanupStack.unregister loop can be
skipped at emit time.

The visitor flags the sub as "needs cleanup" when its body contains
any of:
  - bless / weaken / isweak
  - local
  - defer blocks
  - user sub calls (BinaryOperatorNode "(") — callee may return
    blessed refs
  - method calls (BinaryOperatorNode "->" with IdentifierNode or "("
    on the RHS) — array/hash derefs are recognised and NOT flagged
  - nested SubroutineNode (closure captures)

Builtins (print, push, chr, length, etc.) are parsed as OperatorNode
and don't trigger the sub-call branch, so they're correctly classified
as safe. Overridden builtins imported via `use subs` resolve to user
sub calls at parse time and are flagged automatically.
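The flag logic alone reduces to a small recursive walk. This sketch uses a toy Node record — the real CleanupNeededVisitor walks PerlOnJava's actual AST node classes and the operator names below are illustrative:

```java
import java.util.List;

// Conservative "needs cleanup" walk: any node that might create a blessed,
// weakened, localized, or deferred value in scope flags the whole sub.
public class CleanupFlag {
    record Node(String op, List<Node> kids) {}

    static boolean needsCleanup(Node n) {
        switch (n.op()) {
            case "bless", "weaken", "isweak", "local", "defer",
                 "user-call", "method-call", "nested-sub":
                return true;
        }
        for (Node kid : n.kids()) {
            if (needsCleanup(kid)) return true;
        }
        return false;
    }
}
```

Because the answer only gates an optimization, false positives merely keep the slow path; only a false negative would be a correctness bug, which is why the trigger list is deliberately broad.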

When the visitor says no-cleanup-needed, EmitStatement skips ONLY the
Phase-E MyVarCleanupStack.unregister loop. Phases 1, 1b (scopeExitCleanup
on blessed scalars/containers) and Phase 3 (MortalList.flush) are
always emitted — skipping those would break blessed-parameter DESTROY,
tie_scalar untie, DBIC txn_scope_guard, and the destroy_eval_die
tests. The correctness-to-perf tradeoff for these is NOT worth taking.

Perf impact:
  life_bitpacked  A/B controlled: 8.25 (force-cleanup) -> 8.46 (+2.5%)
                  5 consecutive runs each, same process, env-toggled
  Other benches:  within ±2% of baseline noise

Correctness:
  - `make` passes except pre-existing destroy_eval_die.t#4
  - tie_scalar.t 12/12 ok
  - DBIC t/storage/txn_scope_guard.t 18/18 ok

Env-var escape hatch: JPERL_FORCE_CLEANUP=1 disables the analysis
globally, forcing the old (slow) emission path. Use when investigating
suspected correctness regressions attributable to this optimization.

This is a modest but real win and follows a clean architectural
pattern (visitor + flag + emit branch). Future work could extend the
visitor to also detect when Phase 1 / 1b can be skipped, but that
requires static type analysis of scalar contents — deferred to a
separate effort.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
fglock and others added 2 commits April 21, 2026 23:19
perf: gate MyVarCleanupStack emission on per-sub AST analysis
Analysis of @perl5/ source to understand what structurally makes
system perl's per-sub-call path ~1.5x faster than master's PerlOnJava
(separate from our walker-hardening overhead vs master).

Key hot-path primitives measured:

1. Refcount: SvREFCNT_inc/dec is 4 inlined machine instructions
   with cold-path-only dispatch. Ours goes through setLarge/
   scopeExitCleanup method calls that often fail to inline.

2. Value stack: PUSHMARK is `*++PL_markstack_ptr = offset` - 3
   instructions. Ours is ThreadLocal.get() + ArrayDeque.push() -
   substantially more overhead per push.

3. Context stack: cxstack is ONE array of PERL_CONTEXT structs;
   cx_pushsub fills one struct per sub call. Ours has 7 separate
   ThreadLocal<Deque<X>> stacks (argsStack, pristineArgsStack,
   hasArgsStack, currentBitsStack, callerBitsStack, callerHintsStack,
   callerSnapshotIdStack) each pushed individually per call.

4. FREETMPS: `if (PL_tmps_ix > PL_tmps_floor) free_tmps()` - one
   compare, branch-not-taken when empty. Ours unconditionally
   INVOKESTATICs MortalList.flush().

5. Padlist: already matches Perl (Java locals).

Writeup ranks interventions by impact/effort:

  #1 FREETMPS-style compare-not-branch for MortalList.flush
     - low effort, 3-5% expected on life_bitpacked

  #2 Consolidate 7 ThreadLocal stacks into one PerlContext struct
     - medium effort, 5-10% expected, directly addresses JFR's
       "extra RuntimeList / ArrayList allocations" finding

  #3 Inline refcount helpers (static method emission path)
     - medium effort, 5-10% on refcount-heavy workloads

Combined estimate: closes ~half the master-to-perl gap (from 2.55x
to ~1.7-1.8x perl on life_bitpacked). Full parity is structural.
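Intervention #1's shape is simple enough to sketch. Identifiers are illustrative, not the actual MortalList API:

```java
import java.util.ArrayDeque;

// FREETMPS-style guard: one compare at every scope exit, with the real flush
// reached only when something was actually registered — mirroring perl's
// `if (PL_tmps_ix > PL_tmps_floor) free_tmps()`.
public class MortalListGuard {
    static final ArrayDeque<Object> tmps = new ArrayDeque<>();
    static int flushCalls = 0;

    static void flushIfNeeded() {
        if (!tmps.isEmpty()) {   // branch-not-taken in the common empty case
            flush();
        }
    }

    private static void flush() {
        flushCalls++;
        tmps.clear();
    }
}
```

Emitting the compare inline (rather than an unconditional INVOKESTATIC to flush) is what lets the JIT treat the empty case as a near-free branch.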

References to exact perl5 source lines included for each item.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>