Summary
Request support for NCCL GIN (GPU-Initiated Networking) in Megatron Core to improve MoE training performance on AWS EFA infrastructure.
Motivation
Training large-scale MoE models on AWS EFA often suffers from suboptimal performance due to:
• Network contention between different collective operations (EP all-to-all, DP all-reduce, etc.)
• Lack of network isolation for different process groups
• Inconsistent performance on EFA due to shared network resources
NCCL GIN (introduced with the NCCL 2.28 device API) provides GPU-initiated, per-communicator networking, enabling specific network interfaces to be assigned to specific process groups. This is particularly beneficial on shared cloud interconnects such as AWS EFA.
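To make the isolation idea concrete, here is a minimal, purely illustrative sketch of assigning each process group its own network device. The group names are Megatron's usual abbreviations; the device names and the helper itself are hypothetical, and in a real integration the chosen device would be written into each communicator's NCCL configuration rather than returned as a dict.

```python
from itertools import cycle


def plan_network_isolation(group_names, net_devices):
    """Assign each process group a dedicated network device, round-robin
    when groups outnumber devices.

    Illustrative only: in a real integration the returned device name
    would be propagated into the communicator's NCCL config (e.g. the
    netName field of ncclConfig_t) instead of a plain dict.
    """
    if not net_devices:
        raise ValueError("need at least one network device")
    devices = cycle(net_devices)
    return {group: next(devices) for group in group_names}


# Example: four EFA interfaces on one node (interface names hypothetical).
plan = plan_network_isolation(
    ["ep", "dp", "tp", "pp"],
    ["rdmap0", "rdmap1", "rdmap2", "rdmap3"],
)
```

With one interface per group, EP all-to-all traffic no longer contends with DP all-reduce traffic on the same NIC, which is the isolation property this request is after.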
Current State
Megatron Core has excellent DeepEP/HybridEP integration for MoE token dispatching, but the existing NCCL configuration in parallel_state.py lacks GIN-related options needed for EFA optimization.
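For context, the YAML consumed via Megatron's nccl_communicator_config_path currently carries per-group CTA tuning (cga_cluster_size, min_ctas, max_ctas). A GIN-aware extension could add a per-group network field; the net_name key and its value below are a hypothetical sketch, not an existing Megatron option:

```yaml
# Existing per-group tuning fields plus a hypothetical net_name field
# for per-communicator network selection on EFA.
ep:
  cga_cluster_size: 2
  min_ctas: 1
  max_ctas: 32
  net_name: "AWS Libfabric"   # hypothetical: pin EP traffic to the EFA plugin
dp:
  cga_cluster_size: 4
  min_ctas: 1
  max_ctas: 32
  net_name: "AWS Libfabric"
```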
Ask
Add NCCL GIN configuration support to enable network isolation for different process groups (EP, DP, TP, etc.) in AWS EFA environments.
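One possible shape for the change, sketched below: extend the per-group option lookup in parallel_state.py with a network field. PyTorch's ProcessGroupNCCL.Options does expose CTA-related config knobs, but the net_name handling, the default values, and the dict-based return here are assumptions; the selection logic is kept as plain Python so it runs without a GPU build.

```python
def get_nccl_options(pg_name, nccl_comm_cfgs):
    """Return per-group NCCL option values from a parsed config dict.

    In Megatron this logic would populate a
    torch.distributed.ProcessGroupNCCL.Options() object; it is kept as a
    plain dict here so the sketch is runnable anywhere. Defaults are
    illustrative, not Megatron's authoritative values.
    """
    cfg = nccl_comm_cfgs.get(pg_name, {})
    opts = {
        "cga_cluster_size": cfg.get("cga_cluster_size", 4),
        "min_ctas": cfg.get("min_ctas", 1),
        "max_ctas": cfg.get("max_ctas", 32),
    }
    # Proposed addition: per-group network selection for isolation on EFA.
    if "net_name" in cfg:
        opts["net_name"] = cfg["net_name"]
    return opts


# Usage: only the EP group is pinned to the EFA plugin in this config.
nccl_comm_cfgs = {"ep": {"max_ctas": 16, "net_name": "AWS Libfabric"}}
ep_opts = get_nccl_options("ep", nccl_comm_cfgs)
tp_opts = get_nccl_options("tp", nccl_comm_cfgs)
```

Keeping the new field optional means existing configs without net_name continue to behave exactly as today.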
References
• Efficient Cross-Node MoE Communication (arXiv:2511.15076) https://arxiv.org/pdf/2511.15076
• pplx-garden MoE Kernels https://github.com/perplexityai/pplx-garden