Inference | KV prefix caching.#3063

Merged
lmcafee-nvidia merged 64 commits into NVIDIA:main from lmcafee-nvidia:prefix-caching
Mar 2, 2026

Conversation

@lmcafee-nvidia
Contributor

@lmcafee-nvidia lmcafee-nvidia commented Jan 23, 2026

What does this PR do ?

Implement KV cache prefix caching for dynamic batching inference, enabling requests with identical prompt prefixes to share computed KV cache blocks instead of recomputing them.

Key Features

  • Block-level prefix sharing: Requests with matching prompt prefixes share KV cache blocks, reducing redundant computation
  • Content-based hashing: Blocks are identified by hash of their token content for efficient prefix matching
  • Request coordination: Multi-rank coordination prevents redundant KV computation when multiple requests match the same prefix blocks
  • LRU eviction: Cached blocks use LRU eviction based on timestamps when memory is constrained
  • CLI control: New --no-inference-dynamic-batching-enable-prefix-caching flag to disable when needed
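The content-based hashing idea behind these features can be sketched as follows. This is a minimal illustration of chained block hashing, not the PR's actual implementation; the function names and the tiny `BLOCK_SIZE` are hypothetical:

```python
import hashlib
from typing import Optional

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real block sizes are larger)

def block_hash(tokens: list, parent_hash: Optional[str] = None) -> str:
    """Hash a block's token content, chained to the parent block's hash so a
    block's identity depends on its full prefix, not just its own tokens."""
    payload = (parent_hash or "") + ",".join(map(str, tokens))
    return hashlib.sha256(payload.encode()).hexdigest()

def prompt_block_hashes(prompt: list) -> list:
    """Compute chained hashes for each complete block of the prompt."""
    hashes = []
    parent = None
    for start in range(0, len(prompt) - len(prompt) % BLOCK_SIZE, BLOCK_SIZE):
        parent = block_hash(prompt[start:start + BLOCK_SIZE], parent)
        hashes.append(parent)
    return hashes
```

Two prompts that share a leading block produce the same leading hash, so the cached KV for that block can be reused; the chaining ensures a hash match implies the entire prefix matches.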

Changes

Block Allocator (megatron/core/inference/contexts/dynamic_block_allocator.py):

  • Add block hash tracking and content-based matching
  • Implement hash-to-block-id mapping for prefix lookup
  • Add reference counting for shared blocks
  • LRU eviction for cached blocks
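The allocator changes above (hash-to-block mapping, reference counting, LRU eviction) can be illustrated with a toy sketch; class and method names here are hypothetical, not the PR's API:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CachedBlock:
    block_id: int
    ref_count: int = 0
    last_used: float = field(default_factory=time.monotonic)

class PrefixBlockAllocator:
    """Toy allocator: hash -> block mapping with ref counts and LRU eviction."""

    def __init__(self):
        self.hash_to_block = {}  # block content hash -> CachedBlock
        self.next_id = 0

    def get_or_allocate(self, h: str):
        """Return (block_id, was_cached); bump ref count and LRU timestamp."""
        blk = self.hash_to_block.get(h)
        cached = blk is not None
        if blk is None:
            blk = CachedBlock(block_id=self.next_id)
            self.next_id += 1
            self.hash_to_block[h] = blk
        blk.ref_count += 1
        blk.last_used = time.monotonic()
        return blk.block_id, cached

    def release(self, h: str) -> None:
        """Drop one reference; a block with zero refs becomes evictable."""
        self.hash_to_block[h].ref_count -= 1

    def evict_one(self) -> Optional[int]:
        """Evict the least-recently-used block with no active references."""
        candidates = [b for b in self.hash_to_block.values() if b.ref_count == 0]
        if not candidates:
            return None
        victim = min(candidates, key=lambda b: b.last_used)
        for h, b in list(self.hash_to_block.items()):
            if b is victim:
                del self.hash_to_block[h]
        return victim.block_id
```

The key invariant is that a block shared by live requests (ref_count > 0) is never an eviction candidate; only idle cached blocks compete on their LRU timestamps.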

Dynamic Context (megatron/core/inference/contexts/dynamic_context.py):

  • _find_matching_prefix_blocks() - Find cached blocks matching request prefix
  • _compute_block_hashes() - Compute content hashes for blocks
  • Block sharing logic in add_request()

Dynamic Engine (megatron/core/inference/engines/dynamic_engine.py):

  • Request coordination to avoid redundant computation
  • Pending block tracking during prefill
  • Integration with chunked prefill scheduling
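The coordination idea can be sketched as a pending-block tracker: the first request to claim a block computes its KV, and later requests matching the same hash wait for the result instead of recomputing. This is a simplified single-process sketch with hypothetical names, not the engine's actual multi-rank protocol:

```python
class PendingBlockTracker:
    """Track blocks whose KV is still being computed so concurrent requests
    that match the same prefix wait for the result instead of recomputing."""

    def __init__(self):
        self.pending = {}  # block hash -> list of waiting request ids

    def claim(self, h: str, request_id: int) -> bool:
        """Return True if this request should compute the block (first claimant);
        otherwise register the request as a waiter and return False."""
        if h in self.pending:
            self.pending[h].append(request_id)
            return False
        self.pending[h] = []
        return True

    def complete(self, h: str):
        """Mark the block's KV as computed; return the requests that waited."""
        return self.pending.pop(h, [])
```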

Inference Request (megatron/core/inference/inference_request.py):

  • Add enable_prefix_caching flag
  • Track matched prefix blocks per request

Arguments (megatron/training/arguments.py):

  • Add --no-inference-dynamic-batching-enable-prefix-caching flag
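A `--no-*` flag that disables a default-on feature is commonly expressed with argparse's `store_false` pattern; the sketch below is illustrative and not the actual code in arguments.py:

```python
import argparse

parser = argparse.ArgumentParser()
# Default-on feature with an explicit disable flag: store_false makes the
# destination default to True, and passing the flag flips it to False.
parser.add_argument(
    '--no-inference-dynamic-batching-enable-prefix-caching',
    action='store_false',
    dest='inference_dynamic_batching_enable_prefix_caching',
    help='Disable KV prefix caching for dynamic batching inference.',
)
```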

How It Works

  1. When a request arrives, compute hashes for each block of prompt tokens
  2. Look up existing cached blocks with matching hashes
  3. Reuse matched blocks (increment ref count) instead of allocating new ones
  4. Only compute KV for unmatched (new) tokens
  5. After prefill, cache newly computed blocks for future requests
  6. Coordinate across DP ranks to prevent redundant computation
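Steps 2 and 3 above reduce to a longest-block-aligned-prefix lookup. A minimal sketch (hypothetical names; the cache maps block hashes to block IDs):

```python
def match_prefix(request_hashes: list, cache: dict) -> list:
    """Return the IDs of cached blocks matching the longest block-aligned
    prefix of the request. Matching stops at the first miss: with chained
    hashes, a block's hash depends on its whole prefix, so nothing beyond
    a missing block can match."""
    matched = []
    for h in request_hashes:
        if h not in cache:
            break
        matched.append(cache[h])
    return matched
```

Every matched block has its reference count incremented and skips KV computation; only the tail of unmatched tokens goes through prefill.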

Test plan

  • Unit tests for block hash computation and matching (57 tests in test_dynamic_prefix_caching.py)
  • Unit tests for request coordination
  • Tests for LRU eviction
  • Tests for prefix caching enable/disable
  • Tests for edge cases (empty prompts, single tokens, exact block boundaries)
  • End-to-end inference test with prefix sharing
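The block-boundary edge cases in the test plan come down to one rule: only complete blocks are cacheable, and a partial trailing block is not. A sketch with a hypothetical helper and a small illustrative block size:

```python
BLOCK = 4  # tokens per block (illustrative)

def num_cacheable_blocks(prompt_len: int) -> int:
    """Only complete blocks can be hashed and cached; a partial trailing
    block (and thus empty or single-token prompts) contributes nothing."""
    return prompt_len // BLOCK
```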

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@copy-pr-bot

copy-pr-bot bot commented Jan 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lmcafee-nvidia lmcafee-nvidia changed the title from "Add block hash tracking for prefix caching" to "Inference | Prefix caching for KV attention layers." Jan 26, 2026
@lmcafee-nvidia lmcafee-nvidia self-assigned this Jan 26, 2026
@lmcafee-nvidia lmcafee-nvidia added the Expert Review label Jan 26, 2026
@lmcafee-nvidia lmcafee-nvidia added this to the Core 0.15 milestone Jan 26, 2026
@lmcafee-nvidia lmcafee-nvidia marked this pull request as ready for review February 3, 2026 15:00
@lmcafee-nvidia lmcafee-nvidia requested review from a team as code owners February 3, 2026 15:00
@ko3n1g ko3n1g requested a review from a team February 3, 2026 15:00
@lmcafee-nvidia lmcafee-nvidia changed the title from "Inference | Prefix caching for KV attention layers." to "Inference | KV prefix caching." Feb 3, 2026
lmcafee-nvidia and others added 3 commits February 4, 2026 11:22
Replace the single ADD event with three separate events to enable
precise time-to-first-token measurement:

- ADD_ENGINE: When request is added to engine via _add_request()
- ADD_CONTEXT: When request is scheduled for prefill
- FIRST_TOKEN: When first output token is about to be generated

TTFT is now calculated as FIRST_TOKEN - ADD_ENGINE and included
in the JSON output from gpt_dynamic_inference.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
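The TTFT computation described in this commit can be sketched as follows; the enum values mirror the three events named above, but the code itself is an illustration, not the engine's implementation:

```python
from enum import Enum

class EventType(str, Enum):
    ADD_ENGINE = "add_engine"    # request added to engine via _add_request()
    ADD_CONTEXT = "add_context"  # request scheduled for prefill
    FIRST_TOKEN = "first_token"  # first output token about to be generated

def ttft(events: dict) -> float:
    """Time to first token: FIRST_TOKEN timestamp minus ADD_ENGINE timestamp."""
    return events[EventType.FIRST_TOKEN] - events[EventType.ADD_ENGINE]
```

Splitting the single ADD event into three lets TTFT include any time the request spends queued in the engine before prefill is scheduled.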
- Add tests/unit_tests/inference/test_utils.py with TestPriority enum
  for selective test execution based on priority levels
- Add tests/unit_tests/inference/engines/test_dynamic_events.py with
  comprehensive tests for DynamicInferenceEvent and event lifecycle
- Use consistent convention: CRITICAL=1, LOW=4 with skipif pattern
  TEST_PRIORITY < TestPriority.LEVEL

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
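The priority convention described above (CRITICAL=1, LOW=4, skip when `TEST_PRIORITY < TestPriority.LEVEL`) can be sketched like this; the intermediate levels and the `should_skip` helper are illustrative assumptions:

```python
from enum import IntEnum

class TestPriority(IntEnum):
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4

TEST_PRIORITY = TestPriority.MEDIUM  # threshold configured for this run

def should_skip(level: "TestPriority") -> bool:
    """Skip a test whose priority level exceeds the configured threshold,
    mirroring the skipif condition TEST_PRIORITY < TestPriority.LEVEL."""
    return TEST_PRIORITY < level
```

With the threshold at MEDIUM, CRITICAL through MEDIUM tests run and LOW tests are skipped; lowering the threshold to CRITICAL keeps only the most important tests.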
- Update test_events to use new event types (ADD_ENGINE, ADD_CONTEXT,
  FIRST_TOKEN) instead of old ADD type
- Add test_event_timestamps integration test that verifies:
  - Completed requests have expected event sequence
  - Event timestamps are monotonically increasing
  - TTFT (FIRST_TOKEN - ADD_ENGINE) is positive
  - Total request time >= TTFT

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@lmcafee-nvidia
Contributor Author

/ok to test f53d674

More readable and extensible representation of eviction behavior,
following the existing KVCacheManagementMode(str, Enum) pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lmcafee-nvidia
Contributor Author

/ok to test 5758bc3

@lmcafee-nvidia lmcafee-nvidia added this pull request to the merge queue Mar 2, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22584143291

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 2, 2026
The output JSON now includes lifetime_prefill_token_count as a
top-level system metric. The functional test treats all top-level
keys not in _NON_REQUEST_TOP_LEVEL_KEYS as request IDs, causing
an assertion failure in the hybrid inference CI jobs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lmcafee-nvidia lmcafee-nvidia requested a review from a team as a code owner March 2, 2026 20:21
@lmcafee-nvidia
Contributor Author

/ok to test 5a692b3

@lmcafee-nvidia
Contributor Author

/ok to test dc8fb69

@lmcafee-nvidia lmcafee-nvidia enabled auto-merge March 2, 2026 20:33
@lmcafee-nvidia lmcafee-nvidia added this pull request to the merge queue Mar 2, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22598390284

Merged via the queue into NVIDIA:main with commit c9312e6 Mar 2, 2026
50 of 53 checks passed
ilml added a commit to ilml/Megatron-LM that referenced this pull request Mar 20, 2026
New files:
  - tests/unit_tests/inference/contexts/test_dynamic_prefix_caching.py
