# [Megatron-FSDP] Add dtype customization to Megatron-FSDP. #3067
cspades merged 19 commits into `NVIDIA:main`.
Conversation
**shjwudp** left a comment:
Thanks for helping clarify the necessity of implementing reduce-scatter based on A2A and the trade-offs with NCCL’s native reduce-scatter.
I think there’s still room to simplify this MR. If we keep the main goal in focus, perhaps the changes related to the gradient reduce pipeline and the bucket fetch/free operations could be reverted. A well-scoped PR focusing on a few clear objectives will make it easier to review, trace, and maintain.
**youngeunkwon0405** left a comment:
Thanks for the contribution. The user-side interface you suggested seems great!
I checked the `param_and_grad_buffer` part. At first glance, it seems fine to me. I wish we could have actual test results for the following cases with NCCL UB + manual registration:
- FSDP-only
  - AG/RS should be symmetric-kernel
- HSDP (within a single rack of GB200 or GB300)
  - AG/RS/AR should be symmetric-kernel
- HSDP (multi-rack of GB200 or GB300; FSDP within 64 GPUs and outer-dp for inter-rack)
  - AG/RS should be symmetric-kernel
- HSDP + TP
  - AG/RS should be symmetric-kernel
There are two ways to check whether the symmetric kernel is called:

- See directly from the nsys-rep.
- Set `NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=TUNING` and search for `[Symmetric]` in the log. In the case of manual registration, you are expected to see the `[Symmetric]` kernels after the first iteration (since we are registering the buffer after the first iteration).
For me, the second way was more convenient to check, but it's up to you.
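To illustrate the log-based check, here is a minimal sketch that scans NCCL TUNING-subsystem output for `[Symmetric]` kernel lines. The helper name and the log contents are illustrative; real logs come from launching training with `NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=TUNING` set.

```python
def symmetric_kernel_lines(log_lines):
    """Return (line_number, line) pairs that mention a [Symmetric] kernel."""
    return [(i, line) for i, line in enumerate(log_lines, start=1)
            if "[Symmetric]" in line]

# Fabricated log for illustration only: with manual registration, the
# [Symmetric] kernels should appear only after the first iteration,
# since the buffer is registered after the first iteration.
fake_log = [
    "iter 1: AllGather using ring algorithm",
    "iter 2: AllGather [Symmetric] kernel selected",
]
assert symmetric_kernel_lines(fake_log) == [(2, fake_log[1])]
```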
Aligned with @shjwudp on the final design: we'll use separate (dummy / no local data) … Should have no functional difference to the current PR state, just a refactor underneath for improved maintenance and memory robustness. The idea is independence of de-allocation in the DP-Shard gradient buffer, and allocation in the DP-Outer gradient communications.
Commits (19 total, all Signed-off-by: Cory Ye <cye@nvidia.com>; subjects truncated as shown):

- …tron_fsdp to remove unnecessary attributes.
- …heck.
- …m_dtype by deactivating SymMem for gradients.
- …t freed, both used to setup NCCL UB communication buckets.
- …sharded buffers.
- …precision.
    """
    mp_policy_reset = MixedPrecisionPolicy(
        # Preserve the original main parameter + gradient data-type.
        main_params_dtype=self.mp_policy.main_params_dtype,
Just a note: dynamically changing `main_params_dtype` and `main_grads_dtype` could be achieved by rebuilding the `DataParallelBuffer`, but this is not supported for now.
Yeah... though it would also require calling `fsdp_manual_registration` again for NCCL UBR; it's complex indeed.
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22600987668
## What does this PR do?
- Adds `main_params_dtype`, `main_grads_dtype`, and `grad_comm_dtype` to Megatron-FSDP.
- `grad_accum_dtype` (high-precision gradient reduce-scatter / all-reduce) will be handled on NVLink by NCCL UBR / SymMem for `v2.27+`. IB domain reduce-scatter requires NCCL `2.29U1+`, while all-reduce is not currently supported for IB.

All planned performance benchmarks completed with no bugs. Ready for expert & final review!
## Details
### Mixed-Precision Support (`megatron_fsdp.MixedPrecisionPolicy`)

- `main_params_dtype` / `--megatron-fsdp-main-params-dtype` (🍀 NEW! 🍀) and `main_grads_dtype` / `--megatron-fsdp-main-grads-dtype` (🍀 NEW! 🍀) are simple generalizations of `preserve_fp32_weights` (⛔ DEPRECATED ⛔) and `grad_reduce_in_fp32`.
- `grad_comm_dtype` / `--megatron-fsdp-grad-comm-dtype` (🍀 NEW! 🍀) controls the data-type used for gradient communication (all-reduce & reduce-scatter).
  - If `main_grads_dtype` is not equivalent to `grad_comm_dtype`, a communication bucket with the communication data-type will be allocated. Otherwise, and if not specified, the `main_grads_dtype` will be the communication data-type.

### Megatron-FSDP Gradient Lifecycle
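The dtype-resolution rule above can be sketched as a small helper. The function name and the plain string dtype tags are illustrative stand-ins, not the actual Megatron-FSDP implementation:

```python
from typing import Optional, Tuple

def resolve_grad_comm_dtype(
    main_grads_dtype: str,
    grad_comm_dtype: Optional[str] = None,
) -> Tuple[str, bool]:
    """Illustrative sketch: return the dtype used for gradient communication
    and whether a separate communication bucket (distinct from the main
    gradient buffer) must be allocated."""
    if grad_comm_dtype is None or grad_comm_dtype == main_grads_dtype:
        # Unspecified or identical: reduce directly in the main gradient
        # buffer; no extra communication bucket is needed.
        return main_grads_dtype, False
    # Differing dtypes: allocate a communication bucket in the (typically
    # lower-precision) communication dtype.
    return grad_comm_dtype, True

# FP32 main gradients with BF16 communication allocate a comm bucket:
assert resolve_grad_comm_dtype("fp32", "bf16") == ("bf16", True)
# Unspecified comm dtype falls back to the main gradient dtype:
assert resolve_grad_comm_dtype("fp32") == ("fp32", False)
```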
To summarize the gradient pipeline of Megatron-FSDP for the uninitiated:
- On `main`, this communication bucket matches the main gradient buffer data-type, so we cannot have low-precision communication buckets with high-precision main gradients.
- This PR allocates the communication bucket in `grad_comm_dtype`, to support low-precision communication and high-precision reduction with NCCL (v2.27+).
- Gradients are accumulated in `main_grads_dtype`, typically FP32.
- For `no_shard` and `optim`, this is a local all-reduce or reduce-scatter that can only be called once per optimization cycle to avoid corrupt gradients.
  - The `no_shard` and `optim` sharding strategies definitively do not permit a second un-sharded memory allocation in order to maintain both communication and accumulation buffers for the gradient (one for BF16 communication, another for FP32 accumulation) until we finally perform the only DP-reduction right before the optimization step. Thus, we temporarily allocate / deallocate a BF16 communication buffer right before gradient reduction, while persistently allocating an FP32 main gradient bucket.
- For `optim_grads` and `optim_grads_params`, this is a reduce-scatter into the allocated communication bucket, and shards of the result are accumulated into the main gradient buffer. Because we reduce every layer of every step, we only persistently hold onto a reduced and accumulated shard of the gradient.

### 🚨 Bug Fixes 🚨
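A toy reduce-scatter over plain Python lists illustrates why copying versus accumulating the reduced shard matters for correctness. The ranks, values, and helper function below are made up for illustration and are not the Megatron-FSDP code paths:

```python
def reduce_scatter(per_rank_grads, rank):
    """Toy reduce-scatter: sum all ranks' gradients, return this rank's shard."""
    world = len(per_rank_grads)
    length = len(per_rank_grads[0])
    shard = length // world
    summed = [sum(g[i] for g in per_rank_grads) for i in range(length)]
    return summed[rank * shard : (rank + 1) * shard]

# Two ranks, each holding a full (unsharded) local gradient.
g0, g1 = [1.0, 2.0], [3.0, 4.0]
reduced0 = reduce_scatter([g0, g1], rank=0)  # rank 0's shard of g0 + g1

# Correct: COPY the reduced shard into the main gradient buffer.
main_grad_shard = list(reduced0)
assert main_grad_shard == [4.0]  # 1.0 + 3.0

# Buggy: ACCUMULATE into a shard that already holds the local (unreduced)
# contribution, which double-counts rank 0's gradient.
buggy_shard = g0[:1]  # rank 0's slice of its own unreduced gradient
buggy_shard[0] += reduced0[0]
assert buggy_shard == [5.0]  # 1.0 + (1.0 + 3.0), not the expected 4.0
```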
- `optim` had corrupted gradients, where the main gradient would be reduce-scattered into a temporary shard, but the reduced shard would be accumulated back into the source main gradient shard (without zero'ing the buffer), leading to duplicate gradients.
  - Fixed by adding `copy` and `+=` cases to the DP-Shard gradient reduction.
  - With `(...)` representing the reduced gradient and `gN` representing the pre-reduce accumulated gradient, the bug produced `g1 + (g1 + g2)` instead of the expected `(g1 + g2)`.
  - Without the `torch.empty_like` temporary shard, the bug would have doubled the gradient when using `optim`, i.e. `(g1 + g2) += (g1 + g2)`!
  - Now, for `optim` we copy the reduced gradient shard into the main gradient buffer if a communication buffer was allocated; otherwise the reduce-scatter directly updates the shard of the main gradient buffer. (Same for `no_shard` as well, but using all-reduce and copying the reduced un-sharded gradient.)

### Minor Edits
- Modified `free_bucket_storage()` to remove the criteria that only deallocates buckets for sharded buffers, and factored out the `param.main_grad` reset to `reset_param_main_grad()`.
  - `fetch_bucket()` will only allocate temporary buckets if the data-type is different, or if the buffer is sharded. So there is a loophole where a custom data-type allocation will not be deallocated if the buffer is sharded. Likewise for `AllGatherPipeline.release_bucket()`.
  - `reset_param_main_grad()` only needs to be called when the FSDP gradient buffer on DP-Shard has completed its collectives and installed the reduced gradient in local data. `param.main_grad` will first point to the unreduced gradient bucket, and then point to the DP-Shard reduced main gradient buffer data (or a custom data-type variant of the aforementioned values).
- Added `check_for_nan_in_grad` for Megatron-LM (called in `start_grad_sync`) and `report_nan_in_param_grad` for `fully_shard`, which both default to `False` in `MegatronFSDP`. `report_nan_in_param_grad` in particular is an expensive operation that can degrade performance by around 5%, but can be extremely useful for quickly debugging the source of NaNs, whether they come from Megatron-FSDP or user models.
- Support `no_sync` in Megatron-LM and an even simpler `sync()` / `MegatronFSDP.set_model_auto_sync()` for Megatron-external use (the opposite of `no_sync` that basically calls all the necessary functions to make Megatron-FSDP low-code in a vanilla training loop).

## Tests
All performance tests below use the following configuration (unless otherwise specified):
- `optim_grads_params`
- `--outer-dp-sharding-strategy optim` and `--num-distributed-optimizer-instances 2`.
- `--use-nccl-ub`, `--fsdp-manual-registration`, and `--fsdp-double-buffer` for NCCL UB perf experiments.

### Performance & Accuracy Parity with FP32 Gradient Communication + Accumulation (Reduce-Scatter)
- … the `main` branch.

### Mixed-Precision BF16 Gradient Communication + FP32 Gradient Reduction / Accumulation
- `--megatron-fsdp-grad-comm-dtype bf16` enables BF16 communication and FP32 reduction / accumulation if NCCL `2.27+` is used with NCCL UBR for pure FSDP.
- … `main` branch!

### HFSDP Performance & Accuracy Tests (BF16 Gradient Communication + FP32 Gradient Reduction / Accumulation)
- `--num-distributed-optimizer-instances 2` and `--outer-dp-sharding-strategy optim` has parity on loss after 100 steps, and is just shy of 4x faster (3.62x) per global batch from 4 Nodes on Llama 8B, compared with FSDP.

### Extra Tests
- With the `optim` gradient fix, and GBS 128 / MBS 1, we have improved loss (`5.48` vs. `5.56`) and reduced gradient norm (`19.143` vs. `22.110`), as we are no longer duplicating the gradient on the local rank, i.e. `grad_i + sum(grad_i)` instead of the expected `sum(grad_i)`.
- … `grad_comm_dtype=torch.bfloat16` due to the lack of symmetric RS kernels, which is difficult to reproduce on H100.
- BF16 training (`--bf16` argument) works without any issues with NCCL UBR.
- … `torch.empty_like` output buffer for HFSDP, i.e. DP-Outer reduce-scatter is in-place as in DP-Shard on 1 Node / 8 GPUs.
- Reporting `NaN` for all weight gradients with `fully_shard` (`report_nan_in_param_grad=True`) costs a slight performance regression of +5% global step time. Should only be turned on for debugging!

## Future Work
- `param_comm_dtype` doesn't have that much use right now outside of the already supported TransformerEngine FP8 AG, so we will defer this to the future when we have plans for quantized AG for non-FP8 parameters, which in itself requires some research into the effect of extra quantization operations on sharded parameters vs. un-sharded parameters in model training.
- `MegatronFSDP.__init__(debug=False)` argument for improved unit tests.
- Dynamic `MixedPrecisionPolicy` modification support, currently only easy to use for training steps, or if the user adds hooks or code to modify the gradient communication data-type before the post-backward reduction.

## Appendix
### Type-Promotion Examples