Draft: Megatron-FSDP and PP compatible support #2302
shjwudp wants to merge 21 commits into NVIDIA:main from
Conversation
1. Upgrade DeviceMesh initialization for M-Core to support heterogeneous parallelism (see the sketch after this list).
2. Fix an issue where parameters remain as dist-param during forward execution in specific cases.
3. Fix a grad-reduce hang issue.
4. Hide the pipeline schedule's deallocate_output_tensor activation reference check for Megatron-FSDP compatibility. Deallocation is usually harmless for activations with views.
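For context, below is a minimal sketch of how a named DeviceMesh can express several parallel dimensions at once; the mesh shape, dimension names, and group lookups are illustrative assumptions, not the PR's actual implementation.

```python
# Illustrative only: a 2 (dp) x 2 (cp) x 2 (tp) mesh over 8 GPUs.
# Run under torchrun with 8 processes.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
dp_group = mesh["dp"].get_group()  # ProcessGroup along the data-parallel dim
tp_group = mesh["tp"].get_group()  # ProcessGroup along the tensor-parallel dim
```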
```python
else:
    dp_size = dist.get_world_size(dp_cp_group)
    dp_cp_tp_ranks = [None for _ in range(dp_size)]
    dist.all_gather_object(dp_cp_tp_ranks, tp_ranks, group=dp_cp_group)
```
Hmm, because we use all_gather_object, we cannot make these two calls async... :(
Do you mean it can be optimized into an all_gather operation?
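If that is the direction, here is a minimal sketch (an assumption about the surrounding code, not the PR's implementation) of replacing the object gather with a tensor all_gather that can run asynchronously, assuming tp_ranks is a fixed-length list of ints on every rank:

```python
import torch
import torch.distributed as dist

def gather_tp_ranks_async(tp_ranks, dp_cp_group):
    # tp_ranks must have the same length on every rank for a plain all_gather.
    dp_size = dist.get_world_size(dp_cp_group)
    local = torch.tensor(tp_ranks, dtype=torch.int64, device="cuda")
    gathered = [torch.empty_like(local) for _ in range(dp_size)]
    # async_op=True returns a Work handle, so the gather can overlap other work.
    work = dist.all_gather(gathered, local, group=dp_cp_group, async_op=True)
    return gathered, work  # call work.wait() before reading `gathered`
```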
megatron/training/arguments.py
```python
                       help='If set, enable full sharding in megatron-fsdp Hybrid Sharded Data Parallel (HSDP) mode.')
    group.add_argument('--num-distributed-optimizer-instances', type=int, default=1,
                       help='Number of Distributed Optimizer copies across Data Parallel domain.')
    group.add_argument('--no-mfsdp-comm', action='store_true',
```
Consider using argparse.BooleanOptionalAction so a user can explicitly opt in too.
argparse.BooleanOptionalAction requires Python 3.9, so I think it's best to use it with caution.
@shjwudp What's the minimum supported Python version for Megatron as of now? 3.9 is way past EOL?
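For reference, a minimal sketch of the suggestion; the positive flag name and help text are illustrative assumptions, not the PR's actual arguments:

```python
import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction (Python 3.9+) generates both --mfsdp-comm and
# --no-mfsdp-comm, so users can explicitly opt in or out of the default.
parser.add_argument('--mfsdp-comm', action=argparse.BooleanOptionalAction,
                    default=True, help='Illustrative flag only.')

args = parser.parse_args(['--no-mfsdp-comm'])
print(args.mfsdp_comm)  # False
```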
Will this support VPP and help make FSDP more viable for training MoE models?
… TP-duplicated mesh 2. Minor code polish 3. Code formatting
@Skylion007 There are performance concerns with combining VPP and FSDP (VPP makes FSDP prefetching difficult), so I am not sure this will be beneficial for MoE training. However, I will try to make this PR support VPP as well so that we have more options.
Ah, what I really want is to support A2A overlap with FSDP, which requires VPP.
What does this PR do?
Contribution process
```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks
Code review
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch
(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the `Expert Review` label when your PR is ready for review. Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add the `Final Review` label.

(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.

For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.