
gh-146073: Add fitness/exit quality mechanism for JIT trace frontend #148089

Merged
markshannon merged 38 commits into python:main from cocolato:jit-tracer-fitness
Apr 24, 2026

Conversation

@cocolato
Member

@cocolato cocolato commented Apr 4, 2026


@cocolato
Member Author

cocolato commented Apr 6, 2026

It appears that the current parameters do not yet guarantee runtime safety; I will continue to work on fixes and optimizations.

@markshannon
Member

I've commented on the issue #146073 (comment)

@cocolato
Member Author

I ran some tests on macOS, and performance on the fitness branch appears to have dropped significantly.

Using fastbench: `PYTHONHASHSEED=0 ./python.exe ~/src/fastmark/fastmark.py --scale 1000 richards richards_super raytrace go telco --json fitness.json`

Machine:

OS: macOS 26.3.1 (arm64)
SoC/CPU: Apple M4
RAM: 24 GB
Kernel: Darwin 25.3.0

main branch:

Python 3.15.0a8+ (heads/main:9d38143088, Apr 16 2026, 15:19:11) [Clang 17.0.0 (clang-1700.6.3.2)]
Benchmark                     Time      Useful Work
richards                     1057.0 ms      ( 96%)
richards_super               1049.0 ms      (100%)
raytrace                     4005.7 ms      (100%)
go                           1935.1 ms      (100%)
telco                        4061.8 ms      (100%)

fitness branch:

Python 3.15.0a8+ (heads/jit-tracer-fitness:9c75bb67dd, Apr 16 2026, 15:23:16) [Clang 17.0.0 (clang-1700.6.3.2)]
Benchmark                     Time      Useful Work
richards                     1106.0 ms      ( 97%)
richards_super               1083.1 ms      (100%)
raytrace                     4190.3 ms      (100%)
go                           1978.6 ms      (100%)
telco                        4125.5 ms      (100%)

@markshannon
Member

We seem to be going around in circles a bit here.

@cocolato can you try out this script https://github.com/python/cpython/pull/148840/changes#diff-7d8d989c9e02ccababda3709e44e2465010f9aa25843f4764e4e742adcfaf39b to see if it offers any insight?

I don't know if you can extract some of the key features of the slower benchmarks, to find out why?

@markshannon
Member

Regarding performance. We also need to consider the interplay between trace fitness/length and warmup.
If warmup is too high, and the benchmarks short, overly long traces are going to appear better than they really are.

Ideally we want to cover the hot part of the program fairly quickly, not trace any cold parts and not cover the same piece of code with multiple traces unless there is genuine polymorphism. Easier said than done though.

I would prefer good traces, even if that appears a little slower on one or two benchmarks; the performance is more likely to be consistent.

@cocolato
Member Author

@markshannon I ran the new tests; these are the results:

| workload | executors | uops | guards | calls | exits | loops |
|---|---|---|---|---|---|---|
| richards.gv | 19 -> 8 | 3249 -> 1439 | 442 -> 208 | 68 -> 30 | 17 -> 7 | 2 -> 1 |
| gen_in_loop.gv | 2 -> 1 | 51 -> 42 | 7 -> 5 | 0 -> 0 | 2 -> 0 | 0 -> 1 |
| long_loop.gv | 2 -> 1 | 723 -> 483 | 3 -> 2 | 0 -> 0 | 2 -> 1 | 0 -> 0 |
| long_loop_with_calls.gv | 3 -> 2 | 1714 -> 589 | 9 -> 7 | 67 -> 23 | 2 -> 1 | 1 -> 1 |
| long_loop_with_side_exits.gv | 2 -> 1 | 1287 -> 458 | 100 -> 36 | 0 -> 0 | 2 -> 1 | 0 -> 0 |
| mid_loop.gv | 1 -> 1 | 155 -> 155 | 2 -> 2 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| mid_loop_with_calls.gv | 2 -> 2 | 551 -> 551 | 7 -> 7 | 21 -> 21 | 0 -> 0 | 2 -> 2 |
| mid_loop_with_side_exits.gv | 1 -> 1 | 275 -> 275 | 22 -> 22 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| short_branchy_loop.gv | 1 -> 1 | 50 -> 50 | 5 -> 5 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| short_loop.gv | 1 -> 1 | 50 -> 50 | 2 -> 2 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| short_loop_with_calls.gv | 2 -> 2 | 176 -> 176 | 7 -> 7 | 6 -> 6 | 0 -> 0 | 2 -> 2 |
| short_loop_with_side_exits.gv | 1 -> 1 | 80 -> 80 | 7 -> 7 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
  • The current fitness mechanism has significantly reduced the overall size of the traces.
  • It has indeed reduced fragmentation in several heavy workloads.
  • However, it has not increased the total number of loop closures at all.

So I think we should reduce EXIT_QUALITY_CLOSE_LOOP to close the loop in the long loop trace.

@markshannon
Member

Can you tell why richards is so different?

I don't see how reducing EXIT_QUALITY_CLOSE_LOOP would help. When we reach the end of the loop, we want to close it. To me it looks like the fitness is dropping too fast for some reason and the end of the loop isn't reached.

Also, instead of reducing the fitness for every uop, how about only starting to decrease it once the trace is getting long, but decreasing it more rapidly in that case?

We could add the fitness to the dumps, for more information.
Maybe add `uint32_t fitness` here and record the fitness when tracing. You'll also need to display it as well.
Then you might be able to see where the fitness gets too low.

Once again, thanks for doing this.

@cocolato
Member Author

Can you tell why richards is so different?

richards relies on object property access, linked list nodes, context switching, and small function calls, so it generates a large number of short but highly branched hot paths. The JIT frontend does not see a single, stable, long trace, but rather many short traces centered around the scheduler, each containing numerous guards and side exits.

Main branch:
[graphviz trace image]


@cocolato
Member Author

By setting LLTRACE=3, I found that the main reasons for the current drop in fitness are:

  1. The branch penalty is too high, consuming a significant amount of fitness in a single step.
    The branch penalty often exceeds 100:
0x7a8050a9a050 45: POP_JUMP_IF_FALSE(6) 0
Fitness check: POP_JUMP_IF_FALSE(6) fitness=127, exit_quality=12, depth=1
  387 ADD_TO_TRACE: _GUARD_IS_FALSE_POP (0, target=48, operand0=0, operand1=0)
  branch penalty: -117 (history=0xe001, taken=1) -> fitness=10
  per-insn cost: -5 (fwd=3, rev=2) -> fitness=5
Trace continuing (fitness=5)
0x7a8050a9a050 53: SWAP(2) 0
Fitness check: SWAP(2) fitness=5, exit_quality=25, depth=1
  388 ADD_TO_TRACE: _EXIT_TRACE (0, target=53, operand0=0, operand1=0)
Fitness terminated: SWAP(2) fitness=5 < exit_quality=25

Here, the trace hasn’t reached the end of the loop yet, but due to a -117 branch penalty, it drops directly to
fitness=5, and is terminated on the next instruction.

  2. Penalty for return underflow at depth=0.
    The following log shows that the trace is already at depth=0, but it continues along the guarded return into
    the caller, and then incurs another penalty on the next return:
0x600898867560 97: RETURN_VALUE(0) 1
Fitness check: RETURN_VALUE(0) fitness=848, exit_quality=25, depth=0
 103 ADD_TO_TRACE: _MAKE_HEAP_SAFE (0, target=97, operand0=0, operand1=0)
  _RETURN_VALUE: underflow penalty=-67 -> fitness=781
 106 ADD_TO_TRACE: _GUARD_IP_RETURN_VALUE (0, target=0, operand0=0x7a8050a0ff0a, operand1=0)
  per-insn cost: -10 (fwd=7, rev=3) -> fitness=771
Trace continuing (fitness=771)
...
0x7a8050a0fe00 31: RETURN_VALUE(0) 1
Fitness check: RETURN_VALUE(0) fitness=761, exit_quality=25, depth=0
 116 ADD_TO_TRACE: _MAKE_HEAP_SAFE (0, target=31, operand0=0, operand1=0)
  _RETURN_VALUE: underflow penalty=-67 -> fitness=694

@cocolato
Member Author

I tried reducing their penalty and lowered EXIT_QUALITY_DEFAULT. It now looks like we can close long loops, and overall performance has improved as well:

| workload | executors | uops | guards | calls | exits | loops |
|---|---|---|---|---|---|---|
| richards.gv | 19 -> 7 | 3249 -> 1479 | 442 -> 208 | 58 -> 26 | 17 -> 6 | 2 -> 1 |
| gen_in_loop.gv | 1 -> 1 | 16 -> 42 | 2 -> 5 | 1 -> 1 | 1 -> 0 | 0 -> 1 |
| long_loop.gv | 1 -> 1 | 277 -> 277 | 2 -> 2 | 0 -> 0 | 1 -> 1 | 0 -> 0 |
| long_loop_with_calls.gv | 2 -> 2 | 376 -> 376 | 7 -> 7 | 14 -> 14 | 1 -> 1 | 1 -> 1 |
| long_loop_with_side_exits.gv | 1 -> 1 | 264 -> 264 | 21 -> 21 | 0 -> 0 | 1 -> 1 | 0 -> 0 |
| mid_loop.gv | 1 -> 1 | 155 -> 155 | 2 -> 2 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| mid_loop_with_calls.gv | 2 -> 2 | 376 -> 376 | 7 -> 7 | 14 -> 14 | 1 -> 1 | 1 -> 1 |
| mid_loop_with_side_exits.gv | 1 -> 1 | 264 -> 264 | 21 -> 21 | 0 -> 0 | 1 -> 1 | 0 -> 0 |
| short_branchy_loop.gv | 0 -> 0 | 0 -> 0 | 0 -> 0 | 0 -> 0 | 0 -> 0 | 0 -> 0 |
| short_loop.gv | 1 -> 1 | 50 -> 50 | 2 -> 2 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
| short_loop_with_calls.gv | 2 -> 2 | 175 -> 175 | 7 -> 7 | 6 -> 6 | 0 -> 0 | 2 -> 2 |
| short_loop_with_side_exits.gv | 1 -> 1 | 80 -> 80 | 7 -> 7 | 0 -> 0 | 0 -> 0 | 1 -> 1 |
+----------------------+----------+-----------------------+
| Benchmark            | baseline | fitness5              |
+======================+==========+=======================+
| chaos                | 44.1 ms  | 42.8 ms: 1.03x faster |
+----------------------+----------+-----------------------+
| nbody                | 51.9 ms  | 50.4 ms: 1.03x faster |
+----------------------+----------+-----------------------+
| unpickle_pure_python | 146 us   | 144 us: 1.02x faster  |
+----------------------+----------+-----------------------+
| pickle_pure_python   | 239 us   | 235 us: 1.01x faster  |
+----------------------+----------+-----------------------+
| json_dumps           | 7.75 ms  | 7.64 ms: 1.01x faster |
+----------------------+----------+-----------------------+
| pyflate              | 275 ms   | 272 ms: 1.01x faster  |
+----------------------+----------+-----------------------+
| xml_etree_generate   | 75.4 ms  | 74.6 ms: 1.01x faster |
+----------------------+----------+-----------------------+
| telco                | 6.01 ms  | 5.95 ms: 1.01x faster |
+----------------------+----------+-----------------------+
| raytrace             | 216 ms   | 214 ms: 1.01x faster  |
+----------------------+----------+-----------------------+
| go                   | 84.5 ms  | 85.4 ms: 1.01x slower |
+----------------------+----------+-----------------------+
| richards             | 16.1 ms  | 16.3 ms: 1.01x slower |
+----------------------+----------+-----------------------+
| regex_compile        | 96.6 ms  | 98.1 ms: 1.02x slower |
+----------------------+----------+-----------------------+
| json_loads           | 18.8 us  | 19.5 us: 1.04x slower |
+----------------------+----------+-----------------------+
| deltablue            | 2.12 ms  | 2.21 ms: 1.05x slower |
+----------------------+----------+-----------------------+
| Geometric mean       | (ref)    | 1.00x faster          |
+----------------------+----------+-----------------------+

@cocolato
Member Author

@markshannon gentle ping

@markshannon
Member

What are all the ovals with "executor_0x..." in the image for richards on this PR?
Why aren't those executors being rendered?
Two of them might be the cold and cold-dynamic exits, but I count at least 3 different names.

I tried reducing their penalty and lowered EXIT_QUALITY_DEFAULT. It now looks like we can close long loops, and overall performance has improved as well:

Why does lowering EXIT_QUALITY_DEFAULT work? It just prevents loops being closed.
I understand why lowering the penalties would help.
Which penalty are you lowering? The one for back edges, branches or both?

@markshannon
Member

I think we should merge this.

There are a number of flaws in the implementation, largely down to the lack of visibility of fitness, but there are also many flaws in the current ad-hoc approach on main.

Once this is merged, we can more easily investigate issues and improve the fitness, quality and penalty numbers.
@Fidget-Spinner what do you think?

The two things that we need to fix in order to fine tune the numbers are these:

  • In debug mode (which we need to visualize the stats) the fitness starts at 1000, not 2500
  • We can't see the fitness in the trace images

With that in mind, we should:

  1. Add fitness to the trace visualizations
  2. Reduce (ideally to near zero) the number of asserts inlined into the jitted code
  3. Once that is done, use the same starting fitness for debug and normal builds
  4. Reassess the fitness, penalties and exit quality numbers.

All these can be done in different PRs.

@cocolato
Member Author

cocolato commented Apr 23, 2026

What are all the ovals with "executor_0x..." in the image for richards on this PR?

Sorry, they are side-exit executors; I skipped rendering them for debugging: https://github.com/cocolato/cpython/blob/8ce4b53fc649dfe0e92d5ca0dc71db40ca1feacb/Python/optimizer.c#L2208-L2210

The fitness branch should look like this:
[graphviz trace image]

@cocolato
Member Author

Why does lowering EXIT_QUALITY_DEFAULT work? It just prevents loops being closed.
I understand why lowering the penalties would help.
Which penalty are you lowering? The one for back edges, branches or both?

I lowered both the branch penalty and the frame-underflow penalty. I lowered EXIT_QUALITY_DEFAULT because, when the penalty values are high, it helps us close the loop.

@markshannon
Member

markshannon commented Apr 23, 2026

I really don't understand how lowering EXIT_QUALITY_DEFAULT would help us close the loop.
We stop tracing, and thus close the loop, if fitness < exit_quality. Lowering the quality makes it less likely that fitness < exit_quality.

Never mind, I'm being dumb. By lowering EXIT_QUALITY_DEFAULT you are increasing EXIT_QUALITY_CLOSE_LOOP relative to the default.

So it is sort of equivalent to raising the initial fitness.

@Fidget-Spinner
Member

Assuming the perf isn't too bad, I'm happy with the PR and I think we can merge it.

@markshannon markshannon self-requested a review April 24, 2026 09:27
Member

@markshannon markshannon left a comment


This looks good now and a clear improvement in terms of maintainability and understanding over the ad-hoc approach we had before.
This not only cleans up the code, but gives a way to get better traces while keeping the code maintainable.

Now that performance is roughly on par, we can merge this.

@markshannon markshannon merged commit 618b726 into python:main Apr 24, 2026
77 of 78 checks passed
ljfp pushed a commit to ljfp/cpython that referenced this pull request Apr 25, 2026
…ntend (pythonGH-148089)

* Replaces ad-hoc logic for ending traces with a simple inequality: `fitness < exit_quality`
* Fitness starts high and is reduced for branches, backward edges, calls and trace length
* Exit quality reflects how good a spot that instruction is to end a trace. Closing a loop is very high, specializable instructions are very low, and the others are in between.
