[Main] Partial CUDA Graph support for EP Overlap #2184

Merged
ericharper merged 20 commits into NVIDIA:main from Wohox:pingtian/support_cuda_graph_for_ep_overlap_main on Jan 16, 2026

Wohox (Contributor) commented Nov 10, 2025

Based on #1920

What does this PR do?

EP Overlap adds extra CPU overhead and may cause GPU bubbles during execution; partial CUDA graphs help relieve CPU pressure within the selected scope. This PR adds partial CUDA graph support for EP Overlap. The supported scopes are attn, moe_router, and moe_preprocess (moe and mlp are not supported).

Usage
To enable this feature, refer to the following example:

--overlap-moe-expert-parallel-comm \
--delay-wgrad-compute \
--cuda-graph-scope attn moe_router moe_preprocess \

(--delay-wgrad-compute is optional)
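
As a rough illustration of the scope restriction described above, the following Python sketch parses the three flags from the usage example and reports any requested scope outside attn, moe_router, and moe_preprocess when EP overlap is enabled. The parser, the `nargs` choice, and the helper name `unsupported_scopes` are hypothetical and written only for this example; this is not the validation code from the PR.

```python
from __future__ import annotations

import argparse

# Scopes the PR description lists as supported together with EP overlap.
SUPPORTED_WITH_EP_OVERLAP = {"attn", "moe_router", "moe_preprocess"}


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors only the flags from the usage example."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--overlap-moe-expert-parallel-comm", action="store_true")
    parser.add_argument("--delay-wgrad-compute", action="store_true")
    # Assumption: the scope flag takes a space-separated list of scope names.
    parser.add_argument("--cuda-graph-scope", nargs="*", default=[])
    return parser


def unsupported_scopes(args: argparse.Namespace) -> list[str]:
    """Return requested scopes outside the supported set when EP overlap is on."""
    if not args.overlap_moe_expert_parallel_comm:
        return []
    return sorted(set(args.cuda_graph_scope) - SUPPORTED_WITH_EP_OVERLAP)
```

For example, parsing `--overlap-moe-expert-parallel-comm --cuda-graph-scope attn moe` with this sketch and passing the result to `unsupported_scopes` would flag `moe`, while the combination shown in the usage example would produce an empty list.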

Correctness
The loss values are bitwise aligned between runs with partial CUDA graph enabled and disabled.
Screenshot 2025-11-07 at 11 55 01
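
One way to check a bitwise-alignment claim like the one above is to compare the raw bit patterns of the per-step loss values from two runs rather than using a tolerance. The sketch below assumes the losses have already been collected as Python floats (for instance, parsed from training logs); the function name is hypothetical and this is not the procedure used in the PR.

```python
from __future__ import annotations

import struct
from typing import Optional, Sequence


def first_bitwise_mismatch(
    losses_a: Sequence[float], losses_b: Sequence[float]
) -> Optional[int]:
    """Return the first step whose loss differs in any bit, or None if all match."""
    if len(losses_a) != len(losses_b):
        raise ValueError("both runs must log the same number of steps")
    for step, (a, b) in enumerate(zip(losses_a, losses_b)):
        # Compare the IEEE-754 double encodings so that even a one-bit
        # difference (or 0.0 versus -0.0) counts as a mismatch.
        if struct.pack("<d", a) != struct.pack("<d", b):
            return step
    return None
```

A return value of `None` means the two loss curves are identical bit for bit; any integer result points at the first step where the runs diverge.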

This PR targets the main branch; the corresponding PR for the dev branch is PR 2168.

Dependencies
Enabling CUDA graph with delay-wgrad-compute relies on a TE PR:

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@Wohox Wohox requested review from a team as code owners November 10, 2025 08:57

copy-pr-bot bot commented Nov 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Wohox Wohox changed the title from "Pingtian/support cuda graph for ep overlap main" to "[Draft][Main] Partial CUDA Graph support for EP Overlap" Nov 10, 2025
@Wohox Wohox requested review from a team as code owners November 13, 2025 09:07
@Wohox Wohox changed the title from "[Draft][Main] Partial CUDA Graph support for EP Overlap" to "[Main] Partial CUDA Graph support for EP Overlap" Nov 13, 2025
kvareddy (Contributor) commented

@fanshiqing @jiemingz can you please take a look at this MR?

@jiemingz jiemingz self-assigned this Nov 13, 2025
jiemingz (Contributor) commented Nov 13, 2025

It looks like this includes changes from #1920. Is that supposed to be merged first, with this PR rebased afterwards?

@Wohox Wohox added the module: moe and Expert Review [deprecated] labels Nov 17, 2025
Wohox (Contributor, Author) commented Nov 17, 2025

> It looks like this includes changes from #1920. Is that supposed to be merged first, with this PR rebased afterwards?

@jiemingz Yes, but rebase can happen later, since this MR requires #1920.

@Wohox Wohox force-pushed the pingtian/support_cuda_graph_for_ep_overlap_main branch from b106510 to bb04701 December 2, 2025 06:04
Wohox (Contributor, Author) commented Dec 2, 2025

/ok to test 125fa43

lhb8125 (Contributor) left a comment

LGTM, thanks!

@Wohox Wohox force-pushed the pingtian/support_cuda_graph_for_ep_overlap_main branch from 4279c7e to 677bd59 January 14, 2026 02:38
jiemingz (Contributor) left a comment

LGTM!

Wohox (Contributor, Author) commented Jan 15, 2026

/ok to test 9eababb

Wohox (Contributor, Author) commented Jan 15, 2026

@ericharper Could you help review this PR? It still needs approval from the NeMo and GPT groups. Thanks!

@ericharper ericharper enabled auto-merge January 15, 2026 18:22
Phlip79 (Member) commented Jan 15, 2026

/ok to test ecbaff4

Phlip79 (Member) commented Jan 15, 2026

/ok to test c3219e6

Phlip79 (Member) commented Jan 15, 2026

/ok to test e01fcab

Wohox (Contributor, Author) commented Jan 16, 2026

/ok to test e180d4d

Labels

  • complexity: high
  • dev2main: mbridge (dev to main: this PR is needed in main for mbridge)
  • Expert Review [deprecated] (Apply this label to indicate that your PR is ready for expert review.)
  • Final Review (PR is in the "final review" stage)
  • module: moe

9 participants