Inference | KV prefix caching.#3063

Merged
lmcafee-nvidia merged 64 commits into NVIDIA:main from lmcafee-nvidia:prefix-caching
Mar 2, 2026

Conversation

@lmcafee-nvidia
Contributor

@lmcafee-nvidia lmcafee-nvidia commented Jan 23, 2026

What does this PR do ?

Implement KV cache prefix caching for dynamic batching inference, enabling requests with identical prompt prefixes to share computed KV cache blocks instead of recomputing them.

Key Features

  • Block-level prefix sharing: Requests with matching prompt prefixes share KV cache blocks, reducing redundant computation
  • Content-based hashing: Blocks are identified by hash of their token content for efficient prefix matching
  • Request coordination: Multi-rank coordination prevents redundant KV computation when multiple requests match the same prefix blocks
  • LRU eviction: Cached blocks use LRU eviction based on timestamps when memory is constrained
  • CLI control: New --no-inference-dynamic-batching-enable-prefix-caching flag to disable when needed
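The content-based hashing idea behind these features can be sketched as follows. This is a minimal illustration of chained block hashing, not the PR's actual implementation; the function names and the tiny `BLOCK_SIZE` are hypothetical:

```python
import hashlib
from typing import Optional

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real block sizes are larger)

def block_hash(tokens: list, parent_hash: Optional[str] = None) -> str:
    """Hash a block's token content, chained to the parent block's hash so a
    block's identity depends on its full prefix, not just its own tokens."""
    payload = (parent_hash or "") + ",".join(map(str, tokens))
    return hashlib.sha256(payload.encode()).hexdigest()

def prompt_block_hashes(prompt: list) -> list:
    """Compute chained hashes for each complete block of the prompt."""
    hashes = []
    parent = None
    for start in range(0, len(prompt) - len(prompt) % BLOCK_SIZE, BLOCK_SIZE):
        parent = block_hash(prompt[start:start + BLOCK_SIZE], parent)
        hashes.append(parent)
    return hashes
```

Two prompts that share a leading block produce the same leading hash, so the cached KV for that block can be reused; the chaining ensures a hash match implies the entire prefix matches.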

Changes

Block Allocator (megatron/core/inference/contexts/dynamic_block_allocator.py):

  • Add block hash tracking and content-based matching
  • Implement hash-to-block-id mapping for prefix lookup
  • Add reference counting for shared blocks
  • LRU eviction for cached blocks
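The allocator changes above (hash-to-block mapping, reference counting, LRU eviction) can be illustrated with a toy sketch; class and method names here are hypothetical, not the PR's API:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CachedBlock:
    block_id: int
    ref_count: int = 0
    last_used: float = field(default_factory=time.monotonic)

class PrefixBlockAllocator:
    """Toy allocator: hash -> block mapping with ref counts and LRU eviction."""

    def __init__(self):
        self.hash_to_block = {}  # block content hash -> CachedBlock
        self.next_id = 0

    def get_or_allocate(self, h: str):
        """Return (block_id, was_cached); bump ref count and LRU timestamp."""
        blk = self.hash_to_block.get(h)
        cached = blk is not None
        if blk is None:
            blk = CachedBlock(block_id=self.next_id)
            self.next_id += 1
            self.hash_to_block[h] = blk
        blk.ref_count += 1
        blk.last_used = time.monotonic()
        return blk.block_id, cached

    def release(self, h: str) -> None:
        """Drop one reference; a block with zero refs becomes evictable."""
        self.hash_to_block[h].ref_count -= 1

    def evict_one(self) -> Optional[int]:
        """Evict the least-recently-used block with no active references."""
        candidates = [b for b in self.hash_to_block.values() if b.ref_count == 0]
        if not candidates:
            return None
        victim = min(candidates, key=lambda b: b.last_used)
        for h, b in list(self.hash_to_block.items()):
            if b is victim:
                del self.hash_to_block[h]
        return victim.block_id
```

The key invariant is that a block shared by live requests (ref_count > 0) is never an eviction candidate; only idle cached blocks compete on their LRU timestamps.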

Dynamic Context (megatron/core/inference/contexts/dynamic_context.py):

  • _find_matching_prefix_blocks() - Find cached blocks matching request prefix
  • _compute_block_hashes() - Compute content hashes for blocks
  • Block sharing logic in add_request()

Dynamic Engine (megatron/core/inference/engines/dynamic_engine.py):

  • Request coordination to avoid redundant computation
  • Pending block tracking during prefill
  • Integration with chunked prefill scheduling
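The coordination idea can be sketched as a pending-block tracker: the first request to claim a block computes its KV, and later requests matching the same hash wait for the result instead of recomputing. This is a simplified single-process sketch with hypothetical names, not the engine's actual multi-rank protocol:

```python
class PendingBlockTracker:
    """Track blocks whose KV is still being computed so concurrent requests
    that match the same prefix wait for the result instead of recomputing."""

    def __init__(self):
        self.pending = {}  # block hash -> list of waiting request ids

    def claim(self, h: str, request_id: int) -> bool:
        """Return True if this request should compute the block (first claimant);
        otherwise register the request as a waiter and return False."""
        if h in self.pending:
            self.pending[h].append(request_id)
            return False
        self.pending[h] = []
        return True

    def complete(self, h: str):
        """Mark the block's KV as computed; return the requests that waited."""
        return self.pending.pop(h, [])
```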

Inference Request (megatron/core/inference/inference_request.py):

  • Add enable_prefix_caching flag
  • Track matched prefix blocks per request

Arguments (megatron/training/arguments.py):

  • Add --no-inference-dynamic-batching-enable-prefix-caching flag
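A `--no-*` flag that disables a default-on feature is commonly expressed with argparse's `store_false` pattern; the sketch below is illustrative and not the actual code in arguments.py:

```python
import argparse

parser = argparse.ArgumentParser()
# Default-on feature with an explicit disable flag: store_false makes the
# destination default to True, and passing the flag flips it to False.
parser.add_argument(
    '--no-inference-dynamic-batching-enable-prefix-caching',
    action='store_false',
    dest='inference_dynamic_batching_enable_prefix_caching',
    help='Disable KV prefix caching for dynamic batching inference.',
)
```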

How It Works

  1. When a request arrives, compute hashes for each block of prompt tokens
  2. Look up existing cached blocks with matching hashes
  3. Reuse matched blocks (increment ref count) instead of allocating new ones
  4. Only compute KV for unmatched (new) tokens
  5. After prefill, cache newly computed blocks for future requests
  6. Coordinate across DP ranks to prevent redundant computation
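Steps 2 and 3 above reduce to a longest-block-aligned-prefix lookup. A minimal sketch (hypothetical names; the cache maps block hashes to block IDs):

```python
def match_prefix(request_hashes: list, cache: dict) -> list:
    """Return the IDs of cached blocks matching the longest block-aligned
    prefix of the request. Matching stops at the first miss: with chained
    hashes, a block's hash depends on its whole prefix, so nothing beyond
    a missing block can match."""
    matched = []
    for h in request_hashes:
        if h not in cache:
            break
        matched.append(cache[h])
    return matched
```

Every matched block has its reference count incremented and skips KV computation; only the tail of unmatched tokens goes through prefill.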

Test plan

  • Unit tests for block hash computation and matching (57 tests in test_dynamic_prefix_caching.py)
  • Unit tests for request coordination
  • Tests for LRU eviction
  • Tests for prefix caching enable/disable
  • Tests for edge cases (empty prompts, single tokens, exact block boundaries)
  • End-to-end inference test with prefix sharing
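The block-boundary edge cases in the test plan come down to one rule: only complete blocks are cacheable, and a partial trailing block is not. A sketch with a hypothetical helper and a small illustrative block size:

```python
BLOCK = 4  # tokens per block (illustrative)

def num_cacheable_blocks(prompt_len: int) -> int:
    """Only complete blocks can be hashed and cached; a partial trailing
    block (and thus empty or single-token prompts) contributes nothing."""
    return prompt_len // BLOCK
```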

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@copy-pr-bot

copy-pr-bot bot commented Jan 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lmcafee-nvidia lmcafee-nvidia changed the title from "Add block hash tracking for prefix caching" to "Inference | Prefix caching for KV attention layers." Jan 26, 2026
@lmcafee-nvidia lmcafee-nvidia self-assigned this Jan 26, 2026
@lmcafee-nvidia lmcafee-nvidia added the Expert Review label Jan 26, 2026
@lmcafee-nvidia lmcafee-nvidia added this to the Core 0.15 milestone Jan 26, 2026
@lmcafee-nvidia lmcafee-nvidia marked this pull request as ready for review February 3, 2026 15:00
@lmcafee-nvidia lmcafee-nvidia requested review from a team as code owners February 3, 2026 15:00
@ko3n1g ko3n1g requested a review from a team February 3, 2026 15:00
@lmcafee-nvidia lmcafee-nvidia changed the title from "Inference | Prefix caching for KV attention layers." to "Inference | KV prefix caching." Feb 3, 2026
lmcafee-nvidia and others added 3 commits February 4, 2026 11:22
Replace the single ADD event with three separate events to enable
precise time-to-first-token measurement:

- ADD_ENGINE: When request is added to engine via _add_request()
- ADD_CONTEXT: When request is scheduled for prefill
- FIRST_TOKEN: When first output token is about to be generated

TTFT is now calculated as FIRST_TOKEN - ADD_ENGINE and included
in the JSON output from gpt_dynamic_inference.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
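The TTFT computation described in this commit can be sketched as follows; the enum values mirror the three events named above, but the code itself is an illustration, not the engine's implementation:

```python
from enum import Enum

class EventType(str, Enum):
    ADD_ENGINE = "add_engine"    # request added to engine via _add_request()
    ADD_CONTEXT = "add_context"  # request scheduled for prefill
    FIRST_TOKEN = "first_token"  # first output token about to be generated

def ttft(events: dict) -> float:
    """Time to first token: FIRST_TOKEN timestamp minus ADD_ENGINE timestamp."""
    return events[EventType.FIRST_TOKEN] - events[EventType.ADD_ENGINE]
```

Splitting the single ADD event into three lets TTFT include any time the request spends queued in the engine before prefill is scheduled.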
- Add tests/unit_tests/inference/test_utils.py with TestPriority enum
  for selective test execution based on priority levels
- Add tests/unit_tests/inference/engines/test_dynamic_events.py with
  comprehensive tests for DynamicInferenceEvent and event lifecycle
- Use consistent convention: CRITICAL=1, LOW=4 with skipif pattern
  TEST_PRIORITY < TestPriority.LEVEL

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
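The priority convention described above (CRITICAL=1, LOW=4, skip when `TEST_PRIORITY < TestPriority.LEVEL`) can be sketched like this; the intermediate levels and the `should_skip` helper are illustrative assumptions:

```python
from enum import IntEnum

class TestPriority(IntEnum):
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4

TEST_PRIORITY = TestPriority.MEDIUM  # threshold configured for this run

def should_skip(level: "TestPriority") -> bool:
    """Skip a test whose priority level exceeds the configured threshold,
    mirroring the skipif condition TEST_PRIORITY < TestPriority.LEVEL."""
    return TEST_PRIORITY < level
```

With the threshold at MEDIUM, CRITICAL through MEDIUM tests run and LOW tests are skipped; lowering the threshold to CRITICAL keeps only the most important tests.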
- Update test_events to use new event types (ADD_ENGINE, ADD_CONTEXT,
  FIRST_TOKEN) instead of old ADD type
- Add test_event_timestamps integration test that verifies:
  - Completed requests have expected event sequence
  - Event timestamps are monotonically increasing
  - TTFT (FIRST_TOKEN - ADD_ENGINE) is positive
  - Total request time >= TTFT

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@lmcafee-nvidia
Contributor Author

/ok to test f53d674

More readable and extensible representation of eviction behavior,
following the existing KVCacheManagementMode(str, Enum) pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lmcafee-nvidia
Contributor Author

/ok to test 5758bc3

@lmcafee-nvidia lmcafee-nvidia added this pull request to the merge queue Mar 2, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22584143291

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 2, 2026
The output JSON now includes lifetime_prefill_token_count as a
top-level system metric. The functional test treats all top-level
keys not in _NON_REQUEST_TOP_LEVEL_KEYS as request IDs, causing
an assertion failure in the hybrid inference CI jobs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lmcafee-nvidia lmcafee-nvidia requested a review from a team as a code owner March 2, 2026 20:21
@lmcafee-nvidia
Contributor Author

/ok to test 5a692b3

@lmcafee-nvidia
Contributor Author

/ok to test dc8fb69

@lmcafee-nvidia lmcafee-nvidia enabled auto-merge March 2, 2026 20:33
@lmcafee-nvidia lmcafee-nvidia added this pull request to the merge queue Mar 2, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22598390284

Merged via the queue into NVIDIA:main with commit c9312e6 Mar 2, 2026
50 of 53 checks passed
ilml added a commit to ilml/Megatron-LM that referenced this pull request Mar 20, 2026
New files:
  - tests/unit_tests/inference/contexts/test_dynamic_prefix_caching.py
