
perf: refcount + Phase-J + walker hardening combined (rebased on master)#526

Open
fglock wants to merge 166 commits into master from
feature/refcount-perf-combined

Conversation


@fglock fglock commented Apr 21, 2026

Summary

Rebases feature/walker-hardening-j3 (superseding PRs #508, #518, #523) onto current master (143 commits), producing a single up-to-date PR with all three stacks of work combined.

Perf wins (from #508/#518/#523)

  • Walker gating — skip ScalarRefRegistry.registerRef() and MyVarCleanupStack.liveCounts when no weak refs exist. 7.74 -> 12.50 Mcells/s on life_bitpacked (1.61x across both gates).
  • J2: suppressFlush skip for simple anon-literal returns that cannot allocate blessed temps. +12.7% arr_int, +4.7% arr_str.
  • J4: ByteCodeSourceMapper fileName cache — minor but measurable.
  • RuntimeScalar(long) no-autobox + int fast paths in BitwiseOperators.
  • Warning-bits cache per RuntimeCode + shared EMPTY_ARGS_SNAPSHOT in pushArgs (avoids a per-call list allocation in the zero-arg case).
  • System.getenv() cached as static final in 7 hot classes.
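The EMPTY_ARGS_SNAPSHOT win above can be sketched in a few lines. This is an illustrative reduction, not the actual PerlOnJava code (the real snapshot holds `List<RuntimeScalar>` for `@DB::args`; the names here are stand-ins):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the zero-arg fast path: reuse one shared immutable empty
// snapshot instead of allocating a fresh list on every call.
final class ArgsSnapshotSketch {
    static final List<String> EMPTY_ARGS_SNAPSHOT = Collections.emptyList();

    static List<String> snapshot(List<String> args) {
        if (args.isEmpty()) {
            return EMPTY_ARGS_SNAPSHOT; // no allocation for zero-arg calls
        }
        return new ArrayList<>(args);   // defensive copy only when needed
    }
}
```

The shared instance is safe precisely because it is immutable; any caller that needs to mutate the snapshot falls into the non-empty branch and gets its own copy.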

Merge integration notes

The merge with origin/master preserved our HEAD features while adopting master's newer refactors:

  • Iterative goto &func trampoline kept (master converted to recursive; we re-integrated master's inTailCallTrampoline counter into the iterative path via a per-iteration tailCallReentry flag so Deep-recursion warnings correctly skip tail-call hops).
  • @DB::args via setFromListAliased kept (shared slots for DBIC txn_scope_guard) while switching source to master's pristineArgsStack.
  • VOID-context mortalizeForVoidDiscard + MortalList.flush() kept.
  • EMPTY_ARGS_SNAPSHOT re-expressed over master's List<RuntimeScalar> snapshot type.

Dropped from HEAD in favor of master's refactors:

  • throwIfFallbackDenied (subsumed by master's OverloadContext.tryTwoArgumentNomethod which handles fallback=0 itself).
  • HEAD's InterpreterState.setCurrentPackage (dead code; only master's setCurrentPackageStatic has callers).
  • HEAD's ad-hoc main:: strip in RuntimeGlob.HASH (master's GlobalVariable.getGlobalHash now strips it at the lookup layer).

Test plan

  • make — all unit tests pass except pre-existing destroy_eval_die.t test 4 (verified on pre-merge HEAD 660aa9e via worktree).
  • DBIx::Class bundled module tests
  • Template-Toolkit tests
  • Moo tests
  • Modules referenced in merged commits: Memoize, Date::Calc, Math::Base::Convert, List::MoreUtils, AnyEvent::Util, Storable

Supersedes

Generated with Devin

fglock and others added 30 commits April 10, 2026 22:35
Two regressions from the DESTROY/weaken merge (PR #464):

1. BytecodeInterpreter: SCOPE_EXIT_CLEANUP_ARRAY/HASH/scalar opcodes
   crash with ClassCastException when the interpreter fallback path
   reuses registers with unexpected types. Add instanceof guards
   before casting. Fixes Sub::Exporter::Progressive (used by
   Devel::GlobalDestruction, needed by DBIx::Class).

2. GlobalDestruction: runGlobalDestruction() iterates global variable
   HashMaps while DESTROY callbacks can modify them, causing
   ConcurrentModificationException. Snapshot collections with
   toArray() before iterating. Fixes DBIx::Class Makefile.PL.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…weaken

- Updated branch/PR references for feature/dbix-class-destroy-weaken
- Added Phase 9 section documenting post-DESTROY/weaken assessment
- Documented 645 ok / 183 not ok across 92 test files
- Identified premature DESTROY blocker (20 tests) and GC leak blocker
- Catalogued improvements from DESTROY/weaken merge (PR #464)
- Updated Next Steps with new priorities (P0-P2)
- Marked obsoleted items (Phase 7, old GC/DESTROY sections)

…ings

Add localBindingExists flag to RuntimeBase that tracks when a named
hash/array (my %hash, my @array) has had a reference created via the \
operator. This flag indicates a JVM local variable slot holds a strong
reference not counted in refCount.

When refCount reaches 0, the flag prevents premature callDestroy since
the local variable may still be alive. The flag is cleared at scope exit
(scopeExitCleanupHash/Array), allowing subsequent refCount==0 to
correctly trigger callDestroy.

This fixes the DBIx::Class bug where \%reg stored via accessor caused
premature DESTROY of Schema objects when %reg went out of scope, even
though the hash was still alive through the stored reference.

The localBindingExists check is applied consistently across all
refCount decrement paths: setLargeRefCounted, undefine, weaken,
MortalList.flush, and MortalList.popAndFlush.
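The decrement-path guard described above can be sketched as follows. All names here (`Container`, `decRef`, `scopeExit`) are illustrative, not the real PerlOnJava API; the point is the interaction between the flag and the two refCount==0 checks:

```java
// Illustrative sketch: localBindingExists defers callDestroy until the
// JVM local slot holding the named hash/array is actually dead.
class Container {
    int refCount = 1;               // one counted reference (e.g. \%hash)
    boolean localBindingExists;     // set when \ creates a ref to a my-var
    boolean destroyed;

    void decRef() {
        if (--refCount == 0 && !localBindingExists) {
            callDestroy();          // safe: no live local still holds this
        }
    }

    void scopeExit() {              // scopeExitCleanupHash/Array analogue
        localBindingExists = false; // local variable slot is dead now
        if (refCount == 0) {
            callDestroy();          // deferred destruction fires here
        }
    }

    void callDestroy() { destroyed = true; }
}
```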

Sub::Exporter::Progressive's import() relies on caller() to determine
the target package. When Sub::Uplevel overrides CORE::GLOBAL::caller
(used by Test::Exception via Test::Builder), PerlOnJava's caller()
returns wrong frames during `use` processing, causing SEP to install
exports into the wrong package. This prevented in_global_destruction
from being exported to Devel::GlobalDestruction, which namespace::clean
then removed, causing a bareword error in DESTROY methods.

Fix by bundling a simplified Devel::GlobalDestruction that uses plain
Exporter instead of Sub::Exporter::Progressive. Since PerlOnJava always
has ${^GLOBAL_PHASE} (Perl 5.14+ feature), the implementation is a
simple one-liner.

Add the complete DBI::Const module hierarchy needed by DBIx::Class:
- DBI::Const::GetInfo::ANSI - ANSI SQL/CLI constants (from DBI 1.643)
- DBI::Const::GetInfo::ODBC - ODBC constants (from DBI 1.643)
- DBI::Const::GetInfoType - merged name-to-number mapping
- DBI::Const::GetInfoReturn - upgraded from stub to real implementation

These are pure Perl constant-data modules from the DBI distribution.
DBIx::Class uses them to translate info type names (e.g. SQL_DBMS_NAME)
to numeric codes for $dbh->get_info(), which our JDBC-based DBI already
implements with matching numeric codes.

314-test analysis: 155 blocked by "detached result source" (weak ref
cleared during clone -> _copy_state_from), ~10 GC-only, ~25 real+GC,
~6 errors. Root cause traced to Schema->connect's shift->clone->connection
chain where the clone temporary's refCount drops to 0 mid-operation.

Added reference to dev/architecture/weaken-destroy.md for refCount
internals needed for debugging.

Corrected categorization: 27 GC-only (was ~10), only 4 real functional
failures across all 40 non-detached test files. Added DESTROY trace
confirming Schema::DESTROY fires during _copy_state_from in clone().

Prevents premature DESTROY of return values from chained method calls
like shift->clone->connection(@_). During list assignment materialization
(my ($self, @info) = @_), setLargeRefCounted calls MortalList.flush()
which processes pending decrements from inner scope exits (e.g., clone's
$self). This can drop the return value's refCount to 0 before the
caller's LHS variables capture it, triggering DESTROY and clearing weak
refs to still-live objects.

The fix:
- Adds MortalList.suppressFlush() to temporarily block flush() execution
- RuntimeList.setFromList() suppresses flushing around the materialization
  and LHS assignment phase, then restores the previous state
- Also adds reentrancy guard to flush() itself (try/finally with flushing
  flag) to prevent cascading DESTROY from re-entering flush()

This fixes the DBIx::Class "detached result source" error where
Schema->connect() returned an undefined value because the Schema clone
was destroyed mid-construction.
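The two guards in this fix compose naturally; a minimal sketch (names approximate the commit's description, and the deferred decrements are reduced to `Runnable`s):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: suppressFlush() blocks flush() during list-assignment
// materialization, and the `flushing` flag stops cascading DESTROY
// from re-entering flush().
class MortalListSketch {
    private final Deque<Runnable> pending = new ArrayDeque<>();
    private boolean suppressed;
    private boolean flushing;

    void defer(Runnable decrement) { pending.push(decrement); }

    boolean suppressFlush(boolean on) {
        boolean prev = suppressed;
        suppressed = on;
        return prev;                  // caller restores the previous state
    }

    void flush() {
        if (suppressed || flushing) return;   // both guards
        flushing = true;
        try {
            while (!pending.isEmpty()) pending.pop().run();
        } finally {
            flushing = false;         // reentrancy flag always cleared
        }
    }
}
```

Returning the previous suppression state from `suppressFlush()` is what lets `setFromList()` nest safely: it restores whatever state its caller had rather than unconditionally clearing the flag.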

Condensed P0 done section, corrected P1 hypothesis (callDestroy sets
MIN_VALUE permanently, not a transient underflow), replaced speculative
fix approaches with concrete debugging plan.

…ised approach

Step 11.2 (popMark + flush in setLargeRefCounted) was implemented and
failed: mark-aware flush prevents DESTROY from firing for delete/untie/undef
because subroutine calls push marks that hide earlier entries. 4 unit test
regressions. Changes reverted.

Added Step 11.3 with root cause analysis, comparison to Perl 5 FREETMPS
model, and 4 possible approaches. Recommends Approach D (targeted GC leak
fix) since P0 premature DESTROY is already solved by suppressFlush.

Blessed objects whose class has no DESTROY method (e.g., Moo objects
like DBIx::Class::Storage::BlockRunner) were set to refCount=-1
(untracked) at bless time, so when they went out of scope their
hash/array elements' refcounts were never decremented.

Changes:
- ReferenceOperators.bless(): always track all blessed objects
  regardless of whether DESTROY exists in the class hierarchy.
  Previously, classes without DESTROY got refCount=-1 (untracked).
- DestroyDispatch.doCallDestroy(): when no DESTROY method is found,
  still cascade into hash/array elements via scopeExitCleanupHash/
  scopeExitCleanupArray + flush() to decrement contained references.

Test: dev/sandbox/destroy_weaken/destroy_no_destroy_method.t (13/13)
All unit tests pass (make).

…tails

- Step 11.4 fix committed and verified (all unit tests pass, 13/13 sandbox)
- GC-only failures explained: Sub::Quote closure walk differences, not refcount bugs
- Documented B::svref_2object->REFCNT method chain leak (separate bug)
- Updated Next Steps and Open Questions with investigation results

Complete handoff plan with 13 work items covering:
- GC object liveness at END (146 files, 658 assertions)
- DBI shim fixes (statement handles, transactions, numeric formatting,
  DBI_DRIVER, stringification, table locking, error handler)
- Transaction/savepoint depth tracking
- Detached ResultSource weak ref cleanup
- B::svref_2object method chain refcount leak
- UTF-8 byte-level string handling
- Bless/overload performance

Full suite baseline: 27 pass, 146 GC-only, 25 real fail, 43 skipped.
11,646 ok / 746 not-ok assertions.

Work Item 4: Added toJdbcValue() helper in DBI.java to convert
whole-number Doubles to Long before JDBC setObject(), fixing
10.0 vs 10 issue. Also handles overloaded object stringification.

Work Item 5: Fixed DBI.pm connect() to support empty driver in DSN,
$ENV{DBI_DRIVER} fallback, $ENV{DBI_DSN} fallback, proper error
messages, and require DBD::$driver for missing driver errors.

Work Item 6: Overloaded object stringification fixed by toJdbcValue().

Work Item 8: Added HandleError callback support in DBI.pm execute
wrapper, enabling DBIx::Class custom error handler.

Updated design doc with investigation findings for Work Item 2
(DBI statement handle finalization via cascading DESTROY).

Avoid fork exhaustion by limiting parallel processes.

do { BLOCK } return values were being prematurely destroyed when the
block contained lexical variables. The scope-exit flush processed
deferred decrements from inner subroutine returns before the caller
could capture the do-block result via assignment.

This fixes 11 of 12 "Unreachable cached statement still active" failures
in DBIx::Class t/60core.t. The Cursor DESTROY now fires at the correct
time, calling finish() on cached statement handles.

Root cause: do-blocks were treated as regular bare blocks (flush=true),
but like subroutine bodies, their return value is on the JVM operand
stack and must not be destroyed before the caller captures it.

Fix: Annotate do-blocks with blockIsDoBlock and skip mortal flush at
scope exit, matching the existing blockIsSubroutine behavior. Both
JVM backend (EmitBlock) and bytecode interpreter (BytecodeCompiler)
are updated.

When die throws inside eval{}, lexical variables between the die point
and the eval boundary go out of scope. Previously, their DESTROY methods
were never called because the SCOPE_EXIT_CLEANUP opcodes were skipped
by Java exception handling.

This fix adds scope-exit cleanup in both backends:

Bytecode interpreter:
- EVAL_TRY now records the first body register index
- The exception catch handler iterates registers from that index,
  calling scopeExitCleanup for each RuntimeScalar/Hash/Array,
  then flushes the mortal list to trigger DESTROY

JVM backend:
- During eval body compilation, all my-variable local indices are
  recorded via emitScopeExitNullStores into evalCleanupLocals
- The catch handler emits MortalList.evalExceptionScopeCleanup()
  for each recorded local, then flushes

New runtime helper: MortalList.evalExceptionScopeCleanup(Object)
dispatches to the appropriate cleanup method based on runtime type.

This is critical for DBIx::Class TxnScopeGuard, which relies on
DESTROY firing during eval exception unwinding to rollback transactions.

Implement runtime cleanup stack (MyVarCleanupStack) to ensure DESTROY
fires for blessed objects in my-variables when die propagates through
regular subroutines without an enclosing eval.

Approach: register my-variables at creation time on a runtime stack,
and unwind (running scopeExitCleanup) when exceptions propagate through
RuntimeCode.apply(). Normal scope exit uses existing bytecodes and
discards registrations via popMark.

Key changes:
- New MyVarCleanupStack class with pushMark/register/unwindTo/popMark
- EmitVariable.java: emit register() after my-variable ASTORE
- RuntimeCode.java: wrap 3 static apply() overloads with cleanup
- BytecodeInterpreter.java: propagatingException for interpreter backend

This replaces the failed emitter try/catch approach which caused
try_catch.t failures due to JVM exception table ordering.
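The pushMark/register/unwindTo/popMark shape can be sketched with a single deque and a mark sentinel. This is an illustrative reduction (cleanups shrunk to `Runnable`s; the real stack registers typed my-variables):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative mark-based cleanup stack. Exception propagation runs
// the cleanups above the mark; normal scope exit just discards them
// (the emitted scope-exit bytecodes already did that work).
class CleanupStackSketch {
    private static final Object MARK = new Object();
    private final Deque<Object> stack = new ArrayDeque<>();

    void pushMark() { stack.push(MARK); }

    void register(Runnable cleanup) { stack.push(cleanup); }

    // Exception path: run cleanups down to (and including) the mark.
    void unwindTo() {
        while (!stack.isEmpty()) {
            Object top = stack.pop();
            if (top == MARK) return;
            ((Runnable) top).run();
        }
    }

    // Normal exit path: drop registrations without running them.
    void popMark() {
        while (!stack.isEmpty() && stack.pop() != MARK) { /* discard */ }
    }
}
```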

Sub bodies use flush=false in emitScopeExitNullStores to protect
return values on the JVM operand stack. This caused DESTROY to fire
outside the caller's dynamic scope -- e.g., after local $SIG{__WARN__}
unwinds, causing Test::Warn to miss warnings from DESTROY.

In void context there is no return value to protect, so we can safely
flush deferred decrements immediately after the sub returns. Added
MortalList.flush() after mortalizeForVoidDiscard in all three static
apply() overloads.

Fixes: destroy.t (0/14 -> 14/14), weaken.t (3/4 -> 4/4),
       txn_scope_guard.t test 8

Three fixes for DBI handle garbage collection and resource management:

1. Use createReferenceWithTrackedElements() for Java-created DBI handles
   instead of createReference(). The latter incorrectly sets
   localBindingExists=true (designed for Perl lexical `my %hash`),
   which prevents DESTROY from firing in MortalList.flush(). This
   affected all 43+ DBIx::Class test files with GC-only failures.

2. Add Java-side DBI::finish() that closes the JDBC PreparedStatement,
   releasing database locks (e.g., SQLite table locks). Also add
   $sth->finish() call in DBI::do() for temporary statement handles.
   Fixes: t/storage/on_connect_do.t test 8 (table locking).

3. Break circular reference between dbh.sth (stores full sth ref) and
   sth.Database (stores dbh ref). Now dbh.sth stores only the raw JDBC
   Statement object needed for the last_insert_id() fallback.

…poraries

- Add pushMark/popMark/flushAboveMark to MortalList for scoped mortal
  boundaries (analogous to Perl 5's SAVETMPS/FREETMPS)
- Emit flushAboveMark at statement boundaries in EmitBlock to process
  deferred DESTROY within the current function scope only
- Fix bless() to mortalize newly blessed refs (refCount=1 + deferDecrement)
  so method chain temporaries like Foo->new()->method() get properly
  destroyed at the caller's statement boundary
- Fix hash/array setFromList() materialization to avoid spurious refCount
  increments from push() — use direct list.add() instead
- Fix RuntimeArray push/pop/shift to properly track refCount for
  container store/remove operations
- RuntimeCode.apply/callCached push/pop marks around function execution
  to isolate mortal scopes between caller and callee
- EmitStatement scope-exit always flushes pending entries even when no
  my-variables exist, to handle inner sub scope exit temporaries
- Remove debug tracing (JPERL_DEBUG_MORTAL env var checks)

These changes fix:
- Hash clear not triggering DESTROY (materialization refCount leak)
- Test2::API::Context premature DESTROY breaking subtests
- Method chain temporaries leaking (never reaching refCount=0)

grep/map/sort/all/any block closures are compiled as anonymous subs that
capture lexical variables, incrementing captureCount. Unlike eval blocks,
these closures were never having releaseCaptures() called after execution.
This caused captureCount to stay elevated, preventing scopeExitCleanup
from decrementing blessed ref refCounts — objects could never reach
refCount 0 and DESTROY would never fire.

The fix adds eager releaseCaptures() calls in ListOperators after each
operation completes. Only closures flagged as isMapGrepBlock (temporary
block closures) are affected — named subs and user closures are not
touched. Sort blocks now also receive the isMapGrepBlock annotation.

Also includes MortalList.flush() before END blocks (from prior session)
to ensure file-scoped lexical cleanup fires before END block dispatch.

Impact: DBIx::Class Storage object refCount dropped from 237 to 1.
Moo-generated constructors with grep-based required attribute validation
no longer leak refCounts.

When `local @array` or `local %hash` scope exits, dynamicRestoreState()
replaces the current elements with the saved ones. Previously, the
current (local scope's) elements were simply discarded by JVM GC,
without decrementing refCounts for any tracked blessed references
they owned.

This caused refCount leaks when Moo-generated writers used
`local @_ = ($self, $value)` for inlined qsub triggers — each call
leaked +1 on tracked objects stored in the same Moo object's hash.
In DBIx::Class, this manifested as Storage refCount climbing by +1
per dbh_do() call (e.g., 108 after init_schema instead of 2).

The fix calls MortalList.deferDestroyForContainerClear() on the
outgoing elements before replacing them, matching the cleanup done
by scopeExitCleanupArray/Hash for my-variable scope exit.

Impact: Storage refCount stays at 2 after 100 dbh_do() calls.
dbi_env.t failures reduced from 18 to 11 (remaining failures are
DBI::db objects held by a different retention path).

The callCached() method (used for all method dispatch via $obj->method)
was missing MyVarCleanupStack management. When a called method died,
my-variables registered inside the method's bytecode were never cleaned
up via unwindTo(), causing their refCount decrements to be lost. This
meant blessed objects held in those my-variables would leak (DESTROY
never fires).

Root cause: Regular function calls go through the static apply() which
wraps execution with MyVarCleanupStack.pushMark()/unwindTo()/popMark().
Method calls via callCached() bypassed this wrapper, calling either the
raw PerlSubroutine.apply() (cache hit) or the instance RuntimeCode.apply()
(cache miss) - neither of which manages MyVarCleanupStack.

Fix: Add MyVarCleanupStack.pushMark()/unwindTo()/popMark() to callCached()
by extracting the body into callCachedInner() and wrapping it with the
cleanup try-catch-finally.

Also fixes DBI.pm circular reference: weaken $sth->{Database} back-link
to $dbh, matching Perl 5's XS-based DBI which uses weak child→parent refs.

Together these fix all 4 remaining refCount leaks in DBIx::Class dbi_env.t
(tests 28-31), bringing it to 27/27 pass + 38 leak checks clean.

Anonymous arrays created by [...] were not birth-tracked (refCount
stayed at -1/untracked), unlike anonymous hashes which properly set
refCount=0 in createReferenceWithTrackedElements(). This caused
blessed object references stored inside anonymous arrays to leak:
the containerStore INC was never matched by a DEC when the array
went out of scope, so the blessed object's refCount never reached 0
and DESTROY was never called.

This was the root cause of DBIx::Class leak detection failures,
where connect_info(\@info) wraps args in an arrayref.

The fix adds the same birth-tracking logic to RuntimeArray that
RuntimeHash already had: set refCount=0 for anonymous arrays so
setLargeRefCounted can properly count references and callDestroy
can cascade element cleanup when the array is no longer referenced.

21 tests covering blessed objects inside anonymous arrayrefs and hashrefs:
- Basic scope exit, function argument passing, weak ref clearing
- Multiple objects, nested containers, return values
- DBIx::Class connect_info pattern (object in anon arrayref arg)
- Reassignment releasing previous contents

These prevent regression of the birth-tracking fix in
RuntimeArray.createReferenceWithTrackedElements().

splice() called deferDecrementIfTracked() on removed elements without
checking runtimeArray.elementsOwned. For @_ arrays (populated via
setArrayOfAlias), the elements are aliases to the caller's variables,
not owned copies. The alias shares the same RuntimeScalar as the
caller's variable, so refCountOwned reflects the caller's ownership,
not @_'s. This caused splice to decrement refCounts that @_ never
incremented, triggering premature DESTROY while the object was still
in scope.

The fix adds the elementsOwned guard to splice's removal loop,
matching the pattern already used by shift() and pop(). For @_ arrays
where elementsOwned is false, the DEC is skipped.

This is the exact pattern used by Class::Accessor::Grouped::get_inherited
in DBIx::Class: splice @_, 0, 1, ref($_[0])

Add stashRefCount to RuntimeCode to track glob/stash references that
are invisible to the cooperative refCount mechanism. Stash assignments
(*Foo::bar = $coderef) bypass setLarge(), so the stash reference was
not counted, causing false refCount==0 and premature releaseCaptures.

Key changes:
- RuntimeCode.stashRefCount tracks stash references (inc/dec in glob
  set, dynamicSaveState/dynamicRestoreState, stash delete paths)
- DestroyDispatch skips releaseCaptures when stashRefCount > 0
- releaseCaptures only cascades deferDecrementIfTracked for blessed
  referents, not unblessed containers (whose cooperative refCount can
  falsely reach 0 because closure captures hold JVM references not
  counted in refCount)

This fixes infinite recursion in DBIx::Class/Moo/Sub::Quote where
premature releaseCaptures cascaded to clear weak references in
Sub::Defer's %DEFERRED hash, triggering re-vivification loops.

The per-object memory footprint increased with RuntimeBase refcounting
and DESTROY support, causing code_too_large.t (10K lines, ~5K tests)
to OOM in the test runner. Setting maxHeapSize = '1g' for test tasks.

fglock and others added 17 commits April 21, 2026 12:37
System.getenv() is a native call (JNI → TreeMap lookup, ~200ns each).
Several debug flags were being evaluated via System.getenv() on every
call rather than once at class init. These repeated reads add up in hot
compile and runtime paths.

Caches:
  * MortalList.GC_DEBUG — was in maybeAutoSweep (fires after flush)
  * RuntimeScalar.PHASE_D_DBG — was in undefine()
  * RuntimeIO.IO_DEBUG — 2 hot spots in getRuntimeIO/close
  * IOOperator.IO_DEBUG — open/close hot paths
  * EmitterMethodCreator.ASM_DEBUG, ASM_DEBUG_CLASS_FILTER,
    BYTECODE_SIZE_DEBUG, SPILL_SLOT_COUNT — compile hot path,
    called ~once per compiled method (was 4-5 getenv per compile)
  * SubroutineParser.SHOW_FALLBACK — parser hot path
  * PerlLanguageProvider.SHOW_FALLBACK — compile fallback hot path

Semantically identical — these are all at-startup-determined debug
flags whose values never change during execution. Pattern already
used elsewhere in the codebase (e.g., ScalarRefRegistry, RuntimeRegex).
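The pattern is the standard static-final flag cache; a minimal sketch (the flag name `JPERL_GC_DEBUG` is a stand-in for the real variables listed above):

```java
// One System.getenv() per flag at class-load time instead of per call.
// The JIT can fold the final boolean, so the guarded branch is free.
final class DebugFlags {
    static final boolean GC_DEBUG = System.getenv("JPERL_GC_DEBUG") != null;

    static void maybeTrace(String msg) {
        if (GC_DEBUG) {               // plain field read, no native call
            System.err.println("[gc] " + msg);
        }
    }
}
```

This is semantically safe exactly because, as the commit notes, these flags are determined at startup and never change during execution.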

Regression gates:
  * DBIx::Class: Files=314, Tests=13804 — PASS (1152s, noise vs 1107)
  * Template-Toolkit: Files=106, Tests=2920 — PASS (133-137s)
  * Moo: Files=71, Tests=841 — PASS (91s)
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
ScalarRefRegistry.registerRef() was doing a
synchronized(WeakHashMap).put() on every setLargeRefCounted call
(every ref assignment). The registry is consumed ONLY by
ReachabilityWalker.sweepWeakRefs(), which is only invoked when
WeakRefRegistry.weakRefsExist is true (i.e., when weaken() has
been called at least once). For scripts that never weaken(),
every registerRef call was pure overhead.

JFR profile of life_bitpacked.pl (which never weakens anything)
showed WeakHashMap.put / Collections$SynchronizedMap.put /
expungeStaleEntries dominating post-compile CPU.

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before:  7.74–8.06 Mcells/s
  after:   9.79–9.93 Mcells/s   (+22% median)

Larger grid (200x200, 5000 gens): 10.01 → 11.05 Mcells/s (+10%).

Trade-off: scripts that hold many scalars-with-refs PRIOR to the
first weaken() call won't be in the registry when the walker first
runs. However, any subsequent setLarge on those scalars will
register them, and the walker's primary seeds (globals, code refs,
DESTROY rescued set) still find reachable structures via the normal
BFS. No DBIC 52leaks.t assertions regressed.

Escape hatch: JPERL_UNGATED_SCALAR_REGISTRY=1 restores the old
unconditional behavior.
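The gate itself is a one-line early-out before the synchronized put; a sketch, with names approximating the commit's description (`weakRefsExist`, `registerRef`):

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of the gate: skip the synchronized WeakHashMap.put entirely
// unless weaken() has ever been called.
final class ScalarRefRegistrySketch {
    static volatile boolean weakRefsExist;   // flipped once by weaken()
    private static final Map<Object, Boolean> registry =
        Collections.synchronizedMap(new WeakHashMap<>());

    static void registerRef(Object scalar) {
        if (!weakRefsExist) return;  // hot-path early out: no weak refs yet
        registry.put(scalar, Boolean.TRUE);
    }

    static int size() { return registry.size(); }
}
```

A `volatile` read is cheap and uncontended here, whereas the old path paid a monitor acquire plus `WeakHashMap.expungeStaleEntries` on every ref assignment.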

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS (136-138s)
  * Moo: Files=71, Tests=841 — PASS (95s; within noise of 91s baseline)
  * make unit tests — PASS

Same pattern as the prior ScalarRefRegistry.registerRef fix.
MyVarCleanupStack.register() is called for every `my` variable
declaration and was doing an unconditional
`liveCounts.merge(var, 1, Integer::sum)` — an IdentityHashMap
merge with a boxed Integer lambda. JFR of life_bitpacked.pl
surfaced this right after the scalar-registry gate landed.

The `liveCounts` map exists solely for
ReachabilityWalker.sweepWeakRefs' `isLive()` check, which only runs
when WeakRefRegistry.weakRefsExist is true. Scripts that never
weaken() pay the full IdentityHashMap.merge cost for nothing.

The walker's pre-weaken fallback (sc.scopeExited + sc.refCountOwned
checks) still correctly classifies live vs dead lexicals, so
semantically this is indistinguishable.

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before this patch:  9.79 Mcells/s   (after ScalarRefRegistry gate)
  after this patch:  12.50 Mcells/s   (+28%)

Combined vs pre-gate baseline on this branch:
  7.74 → 12.50 Mcells/s, a 1.61× speedup.

Larger grid (200x200, 5000 gens): 13.71 Mcells/s
(baseline was 10.01, a 1.37× speedup).

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS (143s)
  * Moo: Files=71, Tests=841 — PASS (97s)
  * make unit tests — PASS

Two targeted hot-path optimizations found via JFR on life_bitpacked.pl
after the ScalarRefRegistry / MyVarCleanupStack gates landed.

1. Cache getWarningBitsForCode() per RuntimeCode.

   The JVM-compiled branch looked up
   `code.methodHandle.type().parameterType(0).getName()` + a
   `WarningBitsRegistry.get(className)` HashMap probe on every sub
   invocation. The declaring class of a compiled code's MethodHandle
   is stable for the lifetime of the RuntimeCode, so cache the
   resolved warning bits string in a `cachedWarningBits` field.
   Sentinel pattern (`WARNING_BITS_NOT_COMPUTED = "<uninit>"`) keeps
   a legitimately-null result distinguishable from not-yet-computed.

2. pushArgs: shared empty snapshot for zero-arg calls.

   Every sub call was allocating a fresh `RuntimeArray` wrapper +
   `ArrayList<>` to snapshot @_ for `@DB::args` support, even when
   the callee was called with zero arguments. Share a single
   `EMPTY_ARGS_SNAPSHOT` for those; real allocation only happens when
   args is non-empty.
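The zero-arg fast path from item 2 reduces to this shape (a sketch — the real pushArgs works on RuntimeScalar lists; plain Strings stand in here):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Shared immutable snapshot for the zero-arg case; only a non-empty @_
// pays for a defensive copy.
public class ArgsSnapshot {
    static final List<String> EMPTY_ARGS_SNAPSHOT = Collections.emptyList();

    static List<String> snapshot(List<String> args) {
        return args.isEmpty() ? EMPTY_ARGS_SNAPSHOT : new ArrayList<>(args);
    }
}
```

Sharing is safe because the empty snapshot is immutable and indistinguishable from a freshly-allocated empty list to any reader.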

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before this patch:  11.68–12.50 Mcells/s
  after this patch:   12.19–13.07 Mcells/s

The cumulative speedup since the start of this PR (post-Phase-J
baseline on this branch, 7.74 Mcells/s) is **1.65×**.

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS
  * Moo: Files=71, Tests=841 — PASS
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two small hot-path cleanups found via JFR allocation profile on
life_bitpacked.pl:

1. RuntimeScalar.initializeWithLong took `Long` (boxed) rather than
   `long` (primitive). The `RuntimeScalar(long)` constructor
   autoboxed long → Long on every call, then the method unboxed it
   again. Changed the signature to primitive `long` and adapted the
   `RuntimeScalar(Long)` constructor to call `.longValue()` explicitly.

2. BitwiseOperators.bitwiseAnd / bitwiseOr / bitwiseXor fast paths
   computed result as `long` and called `new RuntimeScalar(long)`,
   forcing the initializeWithLong branch cascade. Since int ^ int,
   int & int, int | int all produce int by definition, compute as
   int and call `new RuntimeScalar(int)` directly — skips the
   range-check branches.

Bitwise shift ops are unchanged (still use long) because left-shift
may grow beyond 32 bits and must preserve semantics.
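The int-vs-long distinction in item 2 boils down to this (sketch only; the real fast paths live in BitwiseOperators):

```java
// int op int cannot leave the int range for &, |, ^ — so compute in int and
// skip the long widening + range-check cascade. Shifts stay in long because
// a left shift can exceed 32 bits and must not wrap.
public class BitwiseFastPath {
    static int xorInt(int a, int b) {
        return a ^ b;                 // result is an int by definition
    }

    static long shiftLeft(int a, int n) {
        return (long) a << n;         // widen first: 1 << 40 must survive
    }
}
```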

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * make unit tests — PASS

Measured on examples/life_bitpacked.pl default args: ~12.3–12.6
Mcells/s (unchanged at the mean — these changes eliminate boxing
overhead per op but the JIT already optimized it well; the signature
fix prevents an autobox that was showing up in JFR's alloc sample
but not in the JIT-compiled hot loop).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Brings 143 commits from master since feature/walker-hardening-j3 was
split off, including:
- Deep-recursion warning + inTailCallTrampoline (10631d7)
- pristineArgsStack for @DB::args (4ddfe34)
- hasArgsStack for caller()[4] (d1b5c6b)
- Bitwise feature opcodes (b75d5e0)
- Statement-modifier my/our/state hoisting (c8d065d, 3bfaffd)
- require FILE caller-package (40b535a)
- Overload autogen refactor: direct->cmp->nomethod (4ddfe34)
- Stash hash deref fix (9f38ef5)

Conflict resolution preserved our HEAD features:
- Iterative goto &func trampoline (no Java-stack growth for long chains)
- VOID-context mortalizeForVoidDiscard + MortalList.flush()
- @DB::args setFromListAliased (shared slots, DBIC txn_scope_guard)
- EMPTY_ARGS_SNAPSHOT fast path in pushArgs (life_bitpacked perf)

Plus integrated master's inTailCallTrampoline counter into the
iterative path via a per-iteration tailCallReentry flag so
Deep-recursion warnings don't count goto &func hops.

Pre-existing test failure (destroy_eval_die.t test 4 - LIFO DESTROY
during eval/die) reproduces on pre-merge HEAD 660aa9e and is
therefore NOT a merge regression.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Master's 4ddfe34 refactored overload autogen (direct->cmp->nomethod)
but its tryTwoArgumentNomethod only handles the explicit `fallback=>0`
case — it returns null when `fallback=>undef`, silently falling through
to default stringification.

Perl 5 actually treats `fallback=>undef` (or no fallback attr) as "no
method found" when none of the direct op, cmp autogeneration, or
nomethod resolve. Compare:

    package BrokenOverload; use overload '""' => sub { $_[0] };
    ...
    bless({}, 'BrokenOverload') eq "str"
    # real perl: dies "Operation 'eq': no method found"
    # post-4ddfe3434: silently stringifies

This broke DBIC t/storage/txn.t test 90 ("One matching warning only"),
which relies on catching exactly that exception from a
BrokenOverload-stringify to emit a DBIC-level warning.

Fix: keep master's newer autogen order but re-add HEAD's
throwIfFallbackDenied after tryTwoArgumentNomethod returns null for
eq/ne, covering the fallback=undef case. Verified against perl 5:

    bless({}, 'BrokenOverload') eq "str"
    # perl + jperl: "Operation 'eq': no method found" (match)
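The resolution order this fix restores can be sketched as a decision table. This is illustrative only — `Boolean` models Perl's `fallback => 1 / 0 / undef` (null), and the strings stand in for real overload handlers:

```java
// Sketch of 'eq' resolution: nomethod wins if it resolves; fallback => 1
// silently autogenerates; fallback => 0 AND fallback => undef both die,
// matching perl 5's "no method found" behaviour.
public class EqResolution {
    static String resolveEq(String nomethodResult, Boolean fallback) {
        if (nomethodResult != null) {
            return nomethodResult;        // nomethod handler resolved it
        }
        if (fallback != null && fallback) {
            return "default-stringify";   // fallback => 1: silent autogeneration
        }
        // fallback => 0 or fallback => undef: die like real perl
        throw new IllegalStateException("Operation 'eq': no method found");
    }
}
```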

Regressions fixed:
- DBIx::Class t/storage/txn.t test 90 now passes (90/90 total).

No regressions introduced: Date::Calc (3005/3005), Math::Base::Convert
(5350/5350), Moo (841/841), Template (2920/2920), all unit tests pass
except pre-existing destroy_eval_die.t test 4.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Records test results for the feature/refcount-perf-combined merge
(bundled modules, DBIC, TT, Moo) plus results for each module
referenced in master commits since 660aa9e (Memoize, Date::Calc,
Math::Base::Convert, List::MoreUtils, AnyEvent).

Documents the one regression found (DBIC txn.t#90) and its fix
(1869bad), plus the conflict-resolution decisions made during the
merge.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The upstream Class::XSAccessor is an XS-only accessor generator that
PerlOnJava can't load. Without this module bundled, every dependent
(Moo, DBIC, Class::Method::Modifiers, …) emits a noisy load warning
on every import:

    Class::XSAccessor exists but failed to load with error:
      Can't load loadable object for module Class::XSAccessor:
      no Java XS implementation available
    Compilation failed in require at …/Moo/_Utils.pm line 162.

The tests all pass (Moo's tests fall back to pure-Perl), but the
log noise drowns out real output and blocks reviewer sign-off.

Ship a PP port under src/main/perl/lib/Class/XSAccessor.pm,
Class/XSAccessor/Heavy.pm, and Class/XSAccessor/Array.pm. Public
API matches upstream exactly:

- accessors, getters, setters, lvalue_accessors
- predicates, defined_predicates, exists_predicates
- constructor, constructors
- true, false
- chained, replace, class options
- hash-based (Class::XSAccessor) and array-based
  (Class::XSAccessor::Array) object layouts

Key semantic fidelity: getters croak on extra args ("Usage: ...") —
Moo relies on this to implement read-only ("ro", "reader") attributes
so `$obj->reader(42)` on a ro attr throws.

Verified:
- Moo 841/841 PASS, ZERO XSAccessor load warnings
- Pre-existing Class::Method::Modifiers 140-lvalue.t (1/131) failure
  reproduces on pre-merge HEAD 660aa9e — unrelated to this bundle
- Standalone smoke test exercises getters, setters, accessors,
  predicates, exists_predicates, constructor, true, false

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
jcpan -t Class::C3 (DBIx::Class transitive dependency) shows 4 test
failures:

- t/24_more_overload.t (2 fails) — numeric-overload 'no method found'
  path missing; only eq/ne got the fix in 1869bad.
- t/36_next_goto.t (1 fail) — proxy next::can via goto returns undef.
- t/37_mro_warn.t (1 fail) — MRO::Compat emits redefine warnings on
  load.

All 4 reproduce on pre-merge HEAD 660aa9e so they are NOT regressions
from this merge, but they're worth triaging since Class::C3 underpins
DBIx::Class (which currently passes its own test suite because it
doesn't exercise these Class::C3 code paths).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Move the finished / superseded design docs under dev/design/archive/:
- phase_j_performance_plan.md (J-tier landed / documented blocks)
- refcount_alignment_plan.md (phases 1-7 complete)
- refcount_alignment_52leaks_plan.md (DBIC 52leaks passing)
- refcount_alignment_progress.md (sandbox tracker)
- pr526_merge_validation.md (this PR's validation log)

Add dev/design/next_steps.md that enumerates the concrete follow-ups
surfaced by the PR #526 work, sized and prioritized:

Correctness
  1.1 numeric-overload fallback=undef throw (extends 1869bad pattern)
  1.2 goto &func through next::can
  1.3 destroy_eval_die.t#4 LIFO order
  1.4 MRO::Compat load-time redefine warnings

Packaging
  2.1 complete Class::XSAccessor PP (__tests__, benchmark)
  2.2 Text-CSV bundled-runner flake

Performance
  3.x registry gating extensions, method-cache stats

Each item has a repro, fix sketch, and size estimate, plus the
downstream CPAN impact (especially Class::C3 which underpins DBIC).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
PR #526 shipped with a significant performance regression that was
accepted only because correctness blocks merge:

  ./jperl examples/life_bitpacked.pl -r none -g 500
    Before merge (660aa9e): ~12.5 Mcells/s
    After  merge (PR #526):   ~6.6  Mcells/s
    => running at ~53% of prior perf (~47% slower)

This workload is the benchmark that drove the walker-hardening wins
in PR #523 — losing it silently would undo that work.

Add §0 to dev/design/next_steps.md:
- numbers + measurement command
- ranked hypotheses (pristineArgsStack clone, hasArgsStack TL
  churn, inTailCallTrampoline/tailCallReentry, deep-recursion
  counter, WarningBits / HintHash push/pop)
- action plan: bisect commits, profile, apply weakRefsExist-style
  gating to each new stack if hot
- explicit acceptance criterion: >= 12 Mcells/s with tests green

Also:
- add header warning banner so reviewers notice before scrolling
- mark §3 (new perf work) as blocked on §0
- update progress tracker + next-action pointer

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The life_bitpacked regression is the cumulative cost of ~143 master
commits since the walker-hardening branch was forked; no single
PR-#526 commit added enough to attribute it via bisect.

Rewrite §0 action plan around the techniques that will actually
produce signal:

1. Profile both baselines and diff flame graphs
   (async-profiler or JFR at pre-merge 660aa9e vs PR-#526 tip)
2. Static diff of the RuntimeCode.apply() prologue — count
   ThreadLocal.get() sites, allocations, static dispatches
3. Count-based analysis: env-gated counters at suspected hot
   sites, run once, dump totals
4. Per-hypothesis gating fixes (weakRefsExist-style pattern)
5. Iterate per-commit; re-measure each change

Also update "Next action" in the progress tracker to reflect this.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Reframe the perf top-priority as "every dev/bench/*.pl ≤ 1.0× perl
wallclock" rather than just recovering life_bitpacked. Record the
current ratios (2026-04-21, perl 5.42.0, PR #526 tip):

  lexical         0.41×  ✅ 2.4× faster than perl
  string          0.61×  ✅ 1.6× faster
  closure         1.08×  ≈ parity
  method          1.73×  ❌ 1.7× slower
  refcount_anon   3.94×  ❌ worst — 4× slower
  life_bitpacked  3.15×  ❌ 3× slower (regressed from 1.66× pre-merge)

Split §0 into sub-sections:
  0.1 fix the COMPARE=perl mode in dev/bench/run_baseline.sh so
      every baseline JSON records both jperl and perl times
  0.2 life_bitpacked regression (the one we just found)
  0.3 refcount_anon 4× gap (worst absolute gap)
  0.4 method 1.7× gap
  0.5 hypotheses (pristineArgsStack, hasArgsStack, deep-recursion
      counter, warning-bits push, tail-call trampoline)
  0.6 action plan (profiling + static diff + counters, not bisect)
  0.7 acceptance criteria

Update the banner + progress tracker + Next action accordingly.
§0.1 is the prerequisite — without side-by-side numbers in the
bench output, every subsequent perf change is hard to evaluate.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
User running the full PR #526 test matrix reported two DBIC failures:

- t/discard_changes_in_DESTROY.t  -> 300s timeout in parallel harness
- t/zzzzzzz_perl_perf_bug.t       -> exit 11

Both trace back to §0 (perf parity), not new correctness breaks:

1. zzzzzzz_perl_perf_bug.t is DBIC's "your perl install is slow"
   diagnostic — it fails when blessed-ref ops are >= 3x plain
   array-ref ops. PerlOnJava currently hits 4x (same gap as
   benchmark_refcount_anon). Resolving §0.3 will fix this test.

2. discard_changes_in_DESTROY.t — standalone runs cleanly in ~14s
   (confirmed locally, 5 consecutive runs, exit 0). The test
   deliberately installs a DESTROY that calls discard_changes and
   expects no infinite recursion; that semantic works. Under
   parallel jcpan -t (HARNESS_OPTIONS=j8) it times out because
   the per-sub-call overhead x 8 parallel jobs pushes wallclock
   past the 300s harness timeout. Resolving §0.2/§0.3 will fix
   this too.

Document both under §0.3 so the reader immediately sees the
connection between the bench-level gap and the visible DBIC
user-facing failures.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Earlier investigation used system `prove`, which ran tests under
system perl and failed for unrelated reasons (wrong interpreter).
Re-run with `jprove -j4` on the DBIC tree: the batch including
t/discard_changes_in_DESTROY.t hangs past 400s, matching the user's
300s-timeout report.

Standalone (single jperl) the same test still passes cleanly in
~14s over 5 consecutive runs. So the failure mode is:

  - Semantic is fine (DESTROY fires exactly once, no recursion)
  - Under parallel jprove, N JVMs starting concurrently plus
    per-sub-call machinery cost compounds into either the test
    itself exceeding the 300s harness timeout or a sibling test
    on the same slot doing so.

Update §0.3 with the confirmed reproduction technique.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Prerequisite for the §0 parity work (see dev/design/next_steps.md).
Without side-by-side `jperl` vs `perl` numbers, perf changes are
hard to evaluate and regressions (like the one currently shipped
in PR #526) slip in unnoticed.

Changes to dev/bench/run_baseline.sh:

- COMPARE=perl: when set, each bench runs under both jperl and
  system perl (configurable via PERL=/path/to/perl); ratio +
  parity marker are emitted in the markdown summary.
- life_bitpacked: if examples/life_bitpacked.pl exists, run it
  with `-r none -g 500` and parse "Cell updates per second" from
  stdout. Tracked as Mcells/s rather than wallclock seconds, so
  higher = faster (ratio is inverted: perl / jperl).
- Dual output: baseline-<sha>.json (machine-readable) AND
  baseline-<sha>.md (human-readable table).
- SKIP_LIFE=1 to opt out of the life bench (e.g. in CI where
  startup cost dominates short runs).
- bash 3.2 compatible (macOS default) — no associative arrays.

Add dev/bench/README.md documenting the harness, workloads, and
link to the parity plan.

Current baseline (PR #526 tip, BENCH_RUNS=1 sanity run, macOS
M-series, perl 5.42):

  lexical         0.40×  ✅
  string          0.62×  ✅
  closure         1.39×  ❌
  regex           1.36×  ❌
  method          1.78×  ❌
  global          1.92×  ❌
  refcount_anon   3.82×  ❌
  eval_string     4.68×  ❌
  anon_simple     5.43×  ❌
  refcount_bless  6.67×  ❌
  life_bitpacked  ~3.17× ❌ (6.47 Mcells/s vs 20.52)

3 runs per bench (default) amortizes JVM startup better than the
single-run shown above.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
fglock and others added 9 commits April 21, 2026 19:38
bench: COMPARE=perl + life_bitpacked in run_baseline.sh
Captured concrete data from a first pass at §0 optimization work:

1. Ran the full 11-benchmark + life_bitpacked matrix with COMPARE=perl
   from the newly-landed §0.1 harness. Worst gaps:
     refcount_bless  7.75×   life_bitpacked 2.57×
     anon_simple     4.73×   method         2.01×
     eval_string     4.71×   global         1.92×
     refcount_anon   4.55×
   Lexical and string are FASTER than perl. 2-4 benches at parity.

2. Tested a "lazy @DB::args snapshot" gate (callerFromDBObserved
   flag). Recovered ~13-16% on refcount_bless / refcount_anon but
   BROKE DBIC txn_scope_guard.t ("Preventing MULTIPLE DESTROY()
   invocations" detection fails). The snapshot is needed
   retroactively — DBIC's DESTROY walks caller frames and reads
   @DB::args at scope exit, when the relevant snapshots were taken
   long before any caller-from-DB could have been observed.

3. Math calibration: refcount_bless's ~3µs/call gap is far larger
   than the ~110ns/call TL traffic we were initially focused on.
   The dominant cost is elsewhere — likely bless() + MortalList
   registration + DESTROY dispatch, or MethodHandle dispatch
   overhead. Profiling is required, speculation is not.

Update §0.5 with the failed-experiment note, the correct math,
and two new hypotheses (bless/MortalList cost, MethodHandle
dispatch cost).

No code change in this commit — the attempted gate was reverted.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
JFR allocation profiling on benchmark_refcount_bless (which never
touches a regex) surfaced RegexState.save() as one of the top sources
of per-sub-call allocation — ~250 sampled allocations in a 13-second
run, which JFR's sampling rate scales up to millions of actual
allocations.

RegexState.save() is called on every subroutine entry to preserve
Perl's regex match variables ($1, $&, etc.) across the callee's
regexes. The snapshot captures 13 fields of the RuntimeRegex
static globals. Before this commit, a fresh 13-field object was
allocated every time, even when every single captured field was
still at its default (null / -1 / false).

Introduce a shared EMPTY singleton and a hasDefaultState() check:
when all the RuntimeRegex statics are still at their defaults (no
regex has matched yet in this process), push the singleton instead
of allocating. Once any regex matches, the defaults flip and
subsequent save() calls fall back to the per-call allocation path.
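The EMPTY-singleton pattern reduces to this sketch (the real RegexState snapshots 13 RuntimeRegex statics; trimmed to two stand-in fields here):

```java
// Singleton-when-default sketch: save() allocates nothing until the first
// regex match flips the process-wide defaults.
public class RegexStateSketch {
    static String lastMatch = null;   // stand-ins for the RuntimeRegex statics
    static int lastStart = -1;
    static int allocations = 0;       // declared before EMPTY so its count is visible

    final String match;
    final int start;

    RegexStateSketch(String m, int s) { match = m; start = s; allocations++; }

    static final RegexStateSketch EMPTY = new RegexStateSketch(null, -1);

    static boolean hasDefaultState() {
        return lastMatch == null && lastStart == -1;
    }

    static RegexStateSketch save() {
        // No regex has matched yet: push the shared singleton, allocate nothing.
        return hasDefaultState() ? EMPTY : new RegexStateSketch(lastMatch, lastStart);
    }
}
```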

Correctness: verified with an explicit $1 preservation test across
sub calls (the common-case idiom). All unit tests still pass
(except the pre-existing destroy_eval_die.t #4).

Measured effect: modest alone (1-3% on most benches, ~14-17% on
refcount_bless / refcount_anon, within run-to-run noise on others).
Worth shipping on its own because it's correct, isolated, and zero
risk. The bigger wins for refcount_bless / refcount_anon are still
to come — ongoing §0.2/§0.3 investigation.

Baseline now in dev/bench/results/baseline-d3ee4e200.md.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
JFR allocation profiling on a tight-loop bless+DESTROY benchmark
surfaced NameNormalizer$CacheKey as one of the top per-sub-call
allocators (~4 MB weight, hundreds of samples over a 13s run).

The prior design used a Map<CacheKey, String> with
  CacheKey cacheKey = new CacheKey(defaultPackage, variable);
  String cached = nameCache.get(cacheKey);
allocating a fresh CacheKey record on EVERY lookup just to probe
the cache. The original comment said "composite key avoids ~12 ns
string concatenation per lookup" — but the CacheKey allocation +
its hashCode/equals cost on every call more than ate that saving
once the method call path became a hot path.

Replace with a two-level ConcurrentHashMap:
  Map<String /* defaultPackage */, Map<String /* variable */, String>>

Lookups are two hash probes, zero allocation on the hot path. The
outer map has one entry per active package (few distinct values),
the inner map has entries per variable name. ConcurrentHashMap
handles the population race without extra synchronization.
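The two-level shape looks roughly like this (identifiers are illustrative; `normalize()` stands in for the real name normalization):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Two-level cache sketch: outer key = package, inner key = variable.
// The hit path is two plain get() probes with zero allocation; only a miss
// takes the computeIfAbsent cold path, where ConcurrentHashMap absorbs the
// population race.
public class TwoLevelNameCache {
    static final Map<String, Map<String, String>> cache = new ConcurrentHashMap<>();
    static int normalizeCalls = 0;

    static String normalize(String pkg, String var) {
        normalizeCalls++;
        return pkg + "::" + var;
    }

    static String lookup(String pkg, String var) {
        Map<String, String> perPkg = cache.get(pkg);
        if (perPkg != null) {
            String hit = perPkg.get(var);
            if (hit != null) {
                return hit;   // hot path: two hash probes, no CacheKey allocation
            }
        }
        return cache.computeIfAbsent(pkg, p -> new ConcurrentHashMap<>())
                    .computeIfAbsent(var, v -> normalize(pkg, var));
    }
}
```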

No behaviour change — same cache semantics, same cached values,
same size eventually. Only the probing path is cheaper.

Correctness: `make` still passes except the pre-existing
destroy_eval_die.t #4. Regex + $1 preservation across sub calls
still works (smoke test).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Record the results of this session's JFR-allocation-profiler pass
on the bless/DESTROY hot path:

1. Identified top per-sub-call allocators via JFR
2. Fixed RegexState (fa8df8a) — singleton for no-match-yet state
3. Fixed NameNormalizer$CacheKey (1400475) — two-level cache
4. Documented remaining top allocators (Object[], RuntimeScalar,
   ArrayList, RuntimeList, RuntimeHash, etc.)

Cumulative measured effect from baseline d071692:
  benchmark_refcount_bless  7.75× -> 6.63×  (-14%)
  benchmark_refcount_anon   4.55× -> 4.07×  (-11%)
  benchmark_global          1.92× -> 1.33×
  benchmark_method          2.01× -> 1.72×
  life_bitpacked            2.57× -> 2.60×  (no measured change)

Important caveat recorded: Benchmark.pm itself inflates the
apparent gap by ~3× compared to a bare `for` loop. Without
Benchmark.pm the real gap is ~2.3×, not ~7.75×. Benchmark.pm
optimization is an independent future target.

Key insight for future work: most remaining allocations look
inherent to per-call semantics (@_, return RuntimeList, etc.).
Recovering the 2-3× gap likely needs structural changes (inline
small arg passing, "simple-sub" emit path skipping the RuntimeArray
build for callees that don't reference @_) rather than more
allocation micro-optimization.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…ked regression

Record the outcome of the apply()/applySlow split experiment on a
perf/apply-fast-path branch (since deleted).

Using javap to measure actual bytecode sizes, we found:
  apply(RuntimeScalar, RuntimeArray, int)       1150 bytes -> 28
  apply(RuntimeScalar, String, RuntimeBase[], int) 982 bytes -> 99
  apply(RuntimeArray, int) instance              619 bytes -> 105

All three dropped well under HotSpot's FreqInlineSize=325, and
`-XX:+PrintInlining` confirmed the fast paths inlined at hot call
sites (including `(hot)` markers on pushArgs/popArgs/enterCall).

BUT: benchmark_closure improved (1.21x -> 0.84x, now FASTER than
perl) while life_bitpacked REGRESSED badly (2.57x -> 5.38x, a 2x
slowdown). Different callers responded differently to the new
inlineability — HotSpot's inline-budget is per-caller, and shrinking
apply() let other things inline into callers that then crossed an
internal boundary.

Per user rule "merge back only when perf gain is confirmed -
otherwise revert", the experiment was reverted. Branch deleted.

Key takeaway: DO NOT ship apply() inlining changes without
explicitly measuring life_bitpacked before/after. The "make apply
inlinable" intervention is a zero-sum trade with current code
shape, and needs pairing with compile-time analysis of which
callers benefit.

Also documented the Benchmark.pm 3x overhead caveat and the
remaining top allocators from profiling.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
dev/bench/results/ snapshots of the benchmark harness output at
each tagged commit on the §0 perf-parity track:

  baseline-d071692a3.*  PR #526 merge tip (before any perf fix)
  baseline-b7d05b77e.*  after fa8df8a (RegexState) + 1400475 (NameNormalizer)
  baseline-d3ee4e200.*  intermediate snapshot
  baseline-1400475d3.*  after NameNormalizer fix

These are the canonical reference points for §0.8 numbers so
regressions can be diffed against them mechanically rather than
by memory.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Three-way measurement + profiling establishes that PR #526's
8.3 Mcells/s on life_bitpacked is NOT a regression from its
parent (walker-hardening-j3 tip 660aa9e also does 8.4 Mcells/s).

The gap is vs MASTER, not vs our own branch. Master runs
life_bitpacked at 14.0 Mcells/s — 1.67x faster than our branch —
but master fails SIX destroy_eval_die.t tests that our branch
passes. The speed difference is paying for DESTROY correctness.

Concrete cost breakdown (from javap + JFR):

- Our branch has 3x more DESTROY/weaken runtime code (2130 lines
  vs master's 733, including ReachabilityWalker and
  ScalarRefRegistry which master does not have at all).
- Our EmitStatement emits 2 extra INVOKESTATIC instructions per
  `my` variable per scope exit (MyVarCleanupStack.unregister plus
  the existing scopeExitCleanup), pushing compiled Perl method
  sizes up ~6% and doubling RuntimeScalar allocations (89 vs 43
  sampled in a 60s run).
- ASM frame-compute is measurable at startup (26 samples in
  jdk.ExecutionSample) because our generated methods are larger.

This reframes §0 work: closing the master/PR gap requires a
per-sub "cleanup-needed" static analysis that skips the extra
scope-exit emission when the sub's lexicals are never used with
weaken/DESTROY/closure capture. Mechanical perf hunts (the kind
we did in fa8df8a / 1400475) won't close this gap.

Closing the REMAINING gap to perl (1.52x on master already) is a
separate problem — PerlOnJava fundamentals (boxing, dispatch),
not refcounting.

Full analysis: dev/design/life_bitpacked_regression_analysis.md

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
A new CleanupNeededVisitor (following the established visitor pattern,
like TempLocalCountVisitor) analyses each sub's body and decides
whether the per-scope-exit MyVarCleanupStack.unregister loop can be
skipped at emit time.

The visitor flags the sub as "needs cleanup" when its body contains
any of:
  - bless / weaken / isweak
  - local
  - defer blocks
  - user sub calls (BinaryOperatorNode "(") — callee may return
    blessed refs
  - method calls (BinaryOperatorNode "->" with IdentifierNode or "("
    on the RHS) — array/hash derefs are recognised and NOT flagged
  - nested SubroutineNode (closure captures)

Builtins (print, push, chr, length, etc.) are parsed as OperatorNode
and don't trigger the sub-call branch, so they're correctly classified
as safe. Overridden builtins imported via `use subs` resolve to user
sub calls at parse time and are flagged automatically.
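The flag logic alone reduces to a small recursive walk. This sketch uses a toy Node record — the real CleanupNeededVisitor walks PerlOnJava's actual AST node classes and the operator names below are illustrative:

```java
import java.util.List;

// Conservative "needs cleanup" walk: any node that might create a blessed,
// weakened, localized, or deferred value in scope flags the whole sub.
public class CleanupFlag {
    record Node(String op, List<Node> kids) {}

    static boolean needsCleanup(Node n) {
        switch (n.op()) {
            case "bless", "weaken", "isweak", "local", "defer",
                 "user-call", "method-call", "nested-sub":
                return true;
        }
        for (Node kid : n.kids()) {
            if (needsCleanup(kid)) return true;
        }
        return false;
    }
}
```

Because the answer only gates an optimization, false positives merely keep the slow path; only a false negative would be a correctness bug, which is why the trigger list is deliberately broad.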

When the visitor says no-cleanup-needed, EmitStatement skips ONLY the
Phase-E MyVarCleanupStack.unregister loop. Phases 1, 1b (scopeExitCleanup
on blessed scalars/containers) and Phase 3 (MortalList.flush) are
always emitted — skipping those would break blessed-parameter DESTROY,
tie_scalar untie, DBIC txn_scope_guard, and the destroy_eval_die
tests. The correctness-to-perf tradeoff for these is NOT worth taking.

Perf impact:
  life_bitpacked  A/B controlled: 8.25 (force-cleanup) -> 8.46 (+2.5%)
                  5 consecutive runs each, same process, env-toggled
  Other benches:  within ±2% of baseline noise

Correctness:
  - `make` passes except pre-existing destroy_eval_die.t#4
  - tie_scalar.t 12/12 ok
  - DBIC t/storage/txn_scope_guard.t 18/18 ok

Env-var escape hatch: JPERL_FORCE_CLEANUP=1 disables the analysis
globally, forcing the old (slow) emission path. Use when investigating
suspected correctness regressions attributable to this optimization.

This is a modest but real win and follows a clean architectural
pattern (visitor + flag + emit branch). Future work could extend the
visitor to also detect when Phase 1 / 1b can be skipped, but that
requires static type analysis of scalar contents — deferred to a
separate effort.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
fglock and others added 2 commits April 21, 2026 23:19
perf: gate MyVarCleanupStack emission on per-sub AST analysis
Analysis of @perl5/ source to understand what structurally makes
system perl's per-sub-call path ~1.5x faster than master's PerlOnJava
(separate from our walker-hardening overhead vs master).

Key hot-path primitives measured:

1. Refcount: SvREFCNT_inc/dec is 4 inlined machine instructions
   with cold-path-only dispatch. Ours goes through setLarge/
   scopeExitCleanup method calls that often fail to inline.

2. Value stack: PUSHMARK is `*++PL_markstack_ptr = offset` - 3
   instructions. Ours is ThreadLocal.get() + ArrayDeque.push() -
   substantially more overhead per push.

3. Context stack: cxstack is ONE array of PERL_CONTEXT structs;
   cx_pushsub fills one struct per sub call. Ours has 7 separate
   ThreadLocal<Deque<X>> stacks (argsStack, pristineArgsStack,
   hasArgsStack, currentBitsStack, callerBitsStack, callerHintsStack,
   callerSnapshotIdStack) each pushed individually per call.

4. FREETMPS: `if (PL_tmps_ix > PL_tmps_floor) free_tmps()` - one
   compare, branch-not-taken when empty. Ours unconditionally
   INVOKESTATICs MortalList.flush().

5. Padlist: already matches Perl (Java locals).

Writeup ranks interventions by impact/effort:

  #1 FREETMPS-style compare-not-branch for MortalList.flush
     - low effort, 3-5% expected on life_bitpacked

  #2 Consolidate 7 ThreadLocal stacks into one PerlContext struct
     - medium effort, 5-10% expected, directly addresses JFR's
       "extra RuntimeList / ArrayList allocations" finding

  #3 Inline refcount helpers (static method emission path)
     - medium effort, 5-10% on refcount-heavy workloads

Combined estimate: closes ~half the master-to-perl gap (from 2.55x
to ~1.7-1.8x perl on life_bitpacked). Full parity is structural.
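Intervention #1's shape is simple enough to sketch. Identifiers are illustrative, not the actual MortalList API:

```java
import java.util.ArrayDeque;

// FREETMPS-style guard: one compare at every scope exit, with the real flush
// reached only when something was actually registered — mirroring perl's
// `if (PL_tmps_ix > PL_tmps_floor) free_tmps()`.
public class MortalListGuard {
    static final ArrayDeque<Object> tmps = new ArrayDeque<>();
    static int flushCalls = 0;

    static void flushIfNeeded() {
        if (!tmps.isEmpty()) {   // branch-not-taken in the common empty case
            flush();
        }
    }

    private static void flush() {
        flushCalls++;
        tmps.clear();
    }
}
```

Emitting the compare inline (rather than an unconditional INVOKESTATIC to flush) is what lets the JIT treat the empty case as a near-free branch.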

References to exact perl5 source lines included for each item.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>