Feature Request: NCCL GIN (GPU-Initiated Networking) Support for AWS EFA #2647

@sbhavani

Description

Summary

Request support for NCCL GIN (GPU-Initiated Networking) in Megatron Core to improve MoE training performance on AWS EFA infrastructure.

Motivation

Training large-scale MoE models on AWS EFA can suffer from suboptimal performance due to:

• Network contention between different collective operations (EP all-to-all, DP all-reduce, etc.)
• Lack of network isolation for different process groups
• Inconsistent performance on EFA due to shared network resources

NCCL GIN (introduced with the NCCL device API, in recent NCCL releases) provides network-level isolation for NCCL communicators, enabling assignment of specific network interfaces to specific process groups. This is particularly beneficial on cloud interconnects such as AWS EFA.
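As a rough sketch of what per-process-group network isolation could look like from the framework side, the helper below round-robins a node's EFA devices across process groups. The group names, device names, and `GroupNetConfig` type are all illustrative assumptions, not a real NCCL or Megatron Core API.

```python
# Illustrative sketch only: maps Megatron-style process groups (ep, dp, tp, ...)
# to dedicated EFA network devices so each communicator could get isolated
# network resources. GroupNetConfig and the assignment policy are hypothetical.

from dataclasses import dataclass, field


@dataclass
class GroupNetConfig:
    """Per-process-group network settings (illustrative, not an NCCL type)."""
    net_name: str                                   # e.g. "AWS Libfabric" for EFA
    ifnames: list[str] = field(default_factory=list)  # netdevs pinned to this group


def assign_efa_interfaces(groups: list[str], efa_devices: list[str]) -> dict[str, GroupNetConfig]:
    """Partition available EFA devices across process groups, round-robin style.

    groups:      ordered process-group names, e.g. ["ep", "dp", "tp"]
    efa_devices: available netdev names, e.g. ["rdmap0", "rdmap1", ...]
    Groups that cannot get a dedicated slice fall back to sharing all devices.
    """
    configs: dict[str, GroupNetConfig] = {}
    per_group = max(1, len(efa_devices) // len(groups))
    for i, group in enumerate(groups):
        devs = efa_devices[i * per_group:(i + 1) * per_group] or efa_devices
        configs[group] = GroupNetConfig(net_name="AWS Libfabric", ifnames=devs)
    return configs
```

A framework integration would then translate such a mapping into per-communicator NCCL configuration when each process group is created, rather than relying on one global setting shared by all communicators.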

Current State

Megatron Core has excellent DeepEP/HybridEP integration for MoE token dispatching, but the existing NCCL configuration in parallel_state.py lacks GIN-related options needed for EFA optimization.

Ask

Add NCCL GIN configuration support to enable network isolation for different process groups (EP, DP, TP, etc.) on AWS EFA environments.
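One possible shape for this, assuming Megatron's existing per-process-group NCCL communicator config YAML (passed via `--nccl-communicator-config-path`) were extended. The `net_name` and `gin_enable` keys below are hypothetical illustrations of the ask, not current Megatron Core options:

```yaml
# Hypothetical extension of the per-process-group NCCL communicator config.
# min_ctas/max_ctas follow the existing config's style; net_name and
# gin_enable are illustrative additions for EFA network isolation.
ep:
  min_ctas: 1
  max_ctas: 32
  net_name: "AWS Libfabric"   # hypothetical: pin EP all-to-all traffic to EFA via GIN
  gin_enable: true            # hypothetical GIN toggle for this communicator
dp:
  min_ctas: 1
  max_ctas: 32
  gin_enable: false           # hypothetical: leave DP all-reduce on the default path
```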

References

• Efficient Cross-Node MoE Communication (arXiv:2511.15076) https://arxiv.org/pdf/2511.15076
• pplx-garden MoE Kernels https://github.com/perplexityai/pplx-garden
