
gh-146073: Add fitness/exit quality mechanism for JIT trace frontend #148089

Merged
markshannon merged 38 commits into python:main from cocolato:jit-tracer-fitness
Apr 24, 2026

Conversation

@cocolato
Member

@cocolato cocolato commented Apr 4, 2026


@cocolato
Member Author

cocolato commented Apr 6, 2026

It appears that the current parameters do not yet guarantee runtime safety; I will continue to work on fixes and optimizations.

@markshannon
Member

I've commented on the issue #146073 (comment)

@cocolato
Member Author

I ran some tests on macOS, and performance on the fitness branch appears to have dropped significantly.

Using fastbench: `PYTHONHASHSEED=0 ./python.exe ~/src/fastmark/fastmark.py --scale 1000 richards richards_super raytrace go telco --json fitness.json`

Machine:

OS: macOS 26.3.1 (arm64)
SoC/CPU: Apple M4
RAM: 24 GB
Kernel: Darwin 25.3.0

main branch:

Python 3.15.0a8+ (heads/main:9d38143088, Apr 16 2026, 15:19:11) [Clang 17.0.0 (clang-1700.6.3.2)]
Benchmark                     Time      Useful Work
richards                     1057.0 ms      ( 96%)
richards_super               1049.0 ms      (100%)
raytrace                     4005.7 ms      (100%)
go                           1935.1 ms      (100%)
telco                        4061.8 ms      (100%)

fitness branch:

Python 3.15.0a8+ (heads/jit-tracer-fitness:9c75bb67dd, Apr 16 2026, 15:23:16) [Clang 17.0.0 (clang-1700.6.3.2)]
Benchmark                     Time      Useful Work
richards                     1106.0 ms      ( 97%)
richards_super               1083.1 ms      (100%)
raytrace                     4190.3 ms      (100%)
go                           1978.6 ms      (100%)
telco                        4125.5 ms      (100%)

@markshannon
Member

We seem to be going around in circles a bit here.

@cocolato can you try out this script https://github.com/python/cpython/pull/148840/changes#diff-7d8d989c9e02ccababda3709e44e2465010f9aa25843f4764e4e742adcfaf39b to see if it offers any insight?

I don't know if you can extract some of the key features of the slower benchmarks, to find out why?

@markshannon
Member

Regarding performance. We also need to consider the interplay between trace fitness/length and warmup.
If warmup is too high, and the benchmarks short, overly long traces are going to appear better than they really are.

Ideally we want to cover the hot part of the program fairly quickly, not trace any cold parts and not cover the same piece of code with multiple traces unless there is genuine polymorphism. Easier said than done though.

I would prefer good traces, even if that appears a little slower on one or two benchmarks; the performance is more likely to be consistent.

@cocolato
Member Author

@markshannon I ran the new tests; these are the results:

| workload | executors | uops | guards | calls | exits | loops |
|---|---|---|---|---|---|---|
| richards.gv | 19 -> 8 | 3249 -> 1439 | 442 -> 208 | 68 -> 30 | 17 -> 7 | 2 -> 1 |
| gen_in_loop.gv | 2 -> 1 | 51 -> 42 | 7 -> 5 | 0 -> 0 | 2 -> 0 | 0 -> 1 |
| long_loop.gv | 2 -> 1 | 723 -> 483 | 3 -> 2 | 0 -> 0 | 2 -> 1 | 0 -> 0 |
| long_loop_with_calls.gv | 3 -> 2 | 1714 -> 589 | 9 -> 7 | 67 -> 23 | 2 -> 1 | 1 -> 1 |
| long_loop_with_side_exits.gv | 2 -> 1 | 1287 -> 458 | 100 -> 36 | 0 -> 0 | 2 -> 1 | 0 -> 0 |
| mid_loop.gv | 1 -> 1 | 155 -> 155 | 2 -> 2 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| mid_loop_with_calls.gv | 2 -> 2 | 551 -> 551 | 7 -> 7 | 21 -> 21 | 0 -> 0 | 2 -> 2 |
| mid_loop_with_side_exits.gv | 1 -> 1 | 275 -> 275 | 22 -> 22 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| short_branchy_loop.gv | 1 -> 1 | 50 -> 50 | 5 -> 5 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| short_loop.gv | 1 -> 1 | 50 -> 50 | 2 -> 2 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| short_loop_with_calls.gv | 2 -> 2 | 176 -> 176 | 7 -> 7 | 6 -> 6 | 0 -> 0 | 2 -> 2 |
| short_loop_with_side_exits.gv | 1 -> 1 | 80 -> 80 | 7 -> 7 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
  • The current fitness mechanism has significantly reduced the overall size of the traces.
  • It has indeed reduced fragmentation in several heavy workloads.
  • However, it has not increased the total number of loop closures at all.

So I think we should reduce EXIT_QUALITY_CLOSE_LOOP to close the loop in the long loop trace.

@markshannon
Member

Can you tell why richards is so different?

I don't see how reducing EXIT_QUALITY_CLOSE_LOOP would help. When we reach the end of the loop, we want to close it. To me it looks like the fitness is dropping too fast for some reason and the end of the loop isn't reached.

Also, instead of reducing the fitness for every uop, how about only starting to decrease it once the trace is getting long, but decreasing it more rapidly in that case?

We could add the fitness to the dumps, for more information.
Maybe add `uint32_t fitness` here and record the fitness when tracing. You'll also need to display it as well.
Then you might be able to see where the fitness gets too low.

Once again, thanks for doing this.

@cocolato
Member Author

Can you tell why richards is so different?

richards relies on object property access, linked list nodes, context switching, and small function calls, so it generates a large number of short but highly branched hot paths. The JIT frontend does not see a single, stable, long trace, but rather many short traces centered around the scheduler, each containing numerous guards and side exits.

Main branch:
[graphviz trace image]


@cocolato
Member Author

By setting LLTRACE=3, I found that the main reasons for the current drop in fitness are:

  1. The branch penalty is too high, consuming a significant amount of fitness in a single step.
    The branch penalty often exceeds 100:
0x7a8050a9a050 45: POP_JUMP_IF_FALSE(6) 0
Fitness check: POP_JUMP_IF_FALSE(6) fitness=127, exit_quality=12, depth=1
  387 ADD_TO_TRACE: _GUARD_IS_FALSE_POP (0, target=48, operand0=0, operand1=0)
  branch penalty: -117 (history=0xe001, taken=1) -> fitness=10
  per-insn cost: -5 (fwd=3, rev=2) -> fitness=5
Trace continuing (fitness=5)
0x7a8050a9a050 53: SWAP(2) 0
Fitness check: SWAP(2) fitness=5, exit_quality=25, depth=1
  388 ADD_TO_TRACE: _EXIT_TRACE (0, target=53, operand0=0, operand1=0)
Fitness terminated: SWAP(2) fitness=5 < exit_quality=25

Here, the trace hasn’t reached the end of the loop yet, but due to a -117 branch penalty, it drops directly to
fitness=5, and is terminated on the next instruction.

  2. Penalty for return underflow at depth=0.
    The following log shows that the trace is already at depth=0, but it continues along the guarded return into
    the caller, and then incurs another penalty on the next return:
0x600898867560 97: RETURN_VALUE(0) 1
Fitness check: RETURN_VALUE(0) fitness=848, exit_quality=25, depth=0
 103 ADD_TO_TRACE: _MAKE_HEAP_SAFE (0, target=97, operand0=0, operand1=0)
  _RETURN_VALUE: underflow penalty=-67 -> fitness=781
 106 ADD_TO_TRACE: _GUARD_IP_RETURN_VALUE (0, target=0, operand0=0x7a8050a0ff0a, operand1=0)
  per-insn cost: -10 (fwd=7, rev=3) -> fitness=771
Trace continuing (fitness=771)
...
0x7a8050a0fe00 31: RETURN_VALUE(0) 1
Fitness check: RETURN_VALUE(0) fitness=761, exit_quality=25, depth=0
 116 ADD_TO_TRACE: _MAKE_HEAP_SAFE (0, target=31, operand0=0, operand1=0)
  _RETURN_VALUE: underflow penalty=-67 -> fitness=694

@cocolato
Member Author

I tried reducing their penalty and lowered EXIT_QUALITY_DEFAULT. It now looks like we can close long loops, and overall performance has improved as well:

| workload | executors | uops | guards | calls | exits | loops |
|---|---|---|---|---|---|---|
| richards.gv | 19 -> 7 | 3249 -> 1479 | 442 -> 208 | 58 -> 26 | 17 -> 6 | 2 -> 1 |
| gen_in_loop.gv | 1 -> 1 | 16 -> 42 | 2 -> 5 | 1 -> 1 | 1 -> 0 | 0 -> 1 |
| long_loop.gv | 1 -> 1 | 277 -> 277 | 2 -> 2 | 0 -> 0 | 1 -> 1 | 0 -> 0 |
| long_loop_with_calls.gv | 2 -> 2 | 376 -> 376 | 7 -> 7 | 14 -> 14 | 1 -> 1 | 1 -> 1 |
| long_loop_with_side_exits.gv | 1 -> 1 | 264 -> 264 | 21 -> 21 | 0 -> 0 | 1 -> 1 | 0 -> 0 |
| mid_loop.gv | 1 -> 1 | 155 -> 155 | 2 -> 2 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| mid_loop_with_calls.gv | 2 -> 2 | 376 -> 376 | 7 -> 7 | 14 -> 14 | 1 -> 1 | 1 -> 1 |
| mid_loop_with_side_exits.gv | 1 -> 1 | 264 -> 264 | 21 -> 21 | 0 -> 0 | 1 -> 1 | 0 -> 0 |
| short_branchy_loop.gv | 0 -> 0 | 0 -> 0 | 0 -> 0 | 0 -> 0 | 0 -> 0 | 0 -> 0 |
| short_loop.gv | 1 -> 1 | 50 -> 50 | 2 -> 2 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| short_loop_with_calls.gv | 2 -> 2 | 175 -> 175 | 7 -> 7 | 6 -> 6 | 0 -> 0 | 2 -> 2 |
| short_loop_with_side_exits.gv | 1 -> 1 | 80 -> 80 | 7 -> 7 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
+----------------------+----------+-----------------------+
| Benchmark            | baseline | fitness5              |
+======================+==========+=======================+
| chaos                | 44.1 ms  | 42.8 ms: 1.03x faster |
+----------------------+----------+-----------------------+
| nbody                | 51.9 ms  | 50.4 ms: 1.03x faster |
+----------------------+----------+-----------------------+
| unpickle_pure_python | 146 us   | 144 us: 1.02x faster  |
+----------------------+----------+-----------------------+
| pickle_pure_python   | 239 us   | 235 us: 1.01x faster  |
+----------------------+----------+-----------------------+
| json_dumps           | 7.75 ms  | 7.64 ms: 1.01x faster |
+----------------------+----------+-----------------------+
| pyflate              | 275 ms   | 272 ms: 1.01x faster  |
+----------------------+----------+-----------------------+
| xml_etree_generate   | 75.4 ms  | 74.6 ms: 1.01x faster |
+----------------------+----------+-----------------------+
| telco                | 6.01 ms  | 5.95 ms: 1.01x faster |
+----------------------+----------+-----------------------+
| raytrace             | 216 ms   | 214 ms: 1.01x faster  |
+----------------------+----------+-----------------------+
| go                   | 84.5 ms  | 85.4 ms: 1.01x slower |
+----------------------+----------+-----------------------+
| richards             | 16.1 ms  | 16.3 ms: 1.01x slower |
+----------------------+----------+-----------------------+
| regex_compile        | 96.6 ms  | 98.1 ms: 1.02x slower |
+----------------------+----------+-----------------------+
| json_loads           | 18.8 us  | 19.5 us: 1.04x slower |
+----------------------+----------+-----------------------+
| deltablue            | 2.12 ms  | 2.21 ms: 1.05x slower |
+----------------------+----------+-----------------------+
| Geometric mean       | (ref)    | 1.00x faster          |
+----------------------+----------+-----------------------+

@cocolato
Member Author

@markshannon gentle ping

@markshannon
Member

What are all the ovals with "executor_0x..." in the image for richards on this PR?
Why aren't those executors being rendered?
Two of them might be the cold and cold-dynamic exits, but I count at least 3 different names.

I tried reducing their penalty and lowered EXIT_QUALITY_DEFAULT. It now looks like we can close long loops, and overall performance has improved as well:

Why does lowering EXIT_QUALITY_DEFAULT work? It just prevents loops being closed.
I understand why lowering the penalties would help.
Which penalty are you lowering? The one for back edges, branches or both?

@markshannon
Member

I think we should merge this.

There are a number of flaws in the implementation, largely down to the lack of visibility of fitness, but there are also many flaws in the current ad-hoc approach on main.

Once this is merged, we can more easily investigate issues and improve the fitness, quality and penalty numbers.
@Fidget-Spinner what do you think?

The two things that we need to fix in order to fine tune the numbers are these:

  • In debug mode (which we need to visualize the stats) the fitness starts at 1000, not 2500
  • We can't see the fitness in the trace images

With that in mind, we should:

  1. Add fitness to the trace visualizations
  2. Reduce (ideally to near zero) the number of asserts inlined into the jitted code
  3. Once that is done, use the same starting fitness for debug and normal builds
  4. Reassess the fitness, penalties and exit quality numbers.

All these can be done in different PRs.

@cocolato
Member Author

cocolato commented Apr 23, 2026

What are all the ovals with "executor_0x..." in the image for richards on this PR?

Sorry, they are side-exit executors; I skipped rendering them for debugging: https://github.com/cocolato/cpython/blob/8ce4b53fc649dfe0e92d5ca0dc71db40ca1feacb/Python/optimizer.c#L2208-L2210

The fitness branch should look like this:
[graphviz trace image]

@cocolato
Member Author

Why does lowering EXIT_QUALITY_DEFAULT work? It just prevents loops being closed.
I understand why lowering the penalties would help.
Which penalty are you lowering? The one for back edges, branches or both?

I lowered both the branch penalty and the frame-underflow penalty. I lowered EXIT_QUALITY_DEFAULT because, when the penalty values are high, it helps us close the loop.

@markshannon
Member

markshannon commented Apr 23, 2026

I really don't understand how lowering EXIT_QUALITY_DEFAULT would help us close the loop.
We stop tracing, and thus close the loop, if fitness < exit_quality. Lowering the quality makes it less likely that fitness < exit_quality.

Never mind, I'm being dumb. By lowering EXIT_QUALITY_DEFAULT you are increasing EXIT_QUALITY_CLOSE_LOOP relative to the default.

So it is sort of equivalent to raising the initial fitness.

@Fidget-Spinner
Member

Assuming the perf isn't too bad, I'm happy with the PR and I think we can merge it.

@markshannon markshannon self-requested a review April 24, 2026 09:27
Member

@markshannon markshannon left a comment


This looks good now and a clear improvement in terms of maintainability and understanding over the ad-hoc approach we had before.
This not only cleans up the code, but gives a way to get better traces while keeping the code maintainable.

Now that performance is roughly on par, we can merge this.

@markshannon markshannon merged commit 618b726 into python:main Apr 24, 2026
77 of 78 checks passed
ljfp pushed a commit to ljfp/cpython that referenced this pull request Apr 25, 2026
…ntend (pythonGH-148089)

* Replaces ad-hoc logic for ending traces with a simple inequality: `fitness < exit_quality`
* Fitness starts high and is reduced for branches, backward edges, calls and trace length
* Exit quality reflects how good a spot that instruction is to end a trace. Closing a loop is very high, specializable instructions are very low, and the others are in between.
