[ROADMAP] Megatron Core Roadmap #4003

@sbhavani

This roadmap outlines the key features, enhancements, and improvements planned for Megatron Core. It is tentative and subject to change based on community feedback and priorities.

For detailed information on past releases, see the Changelog. For the MoE-specific roadmap, see MoE Roadmap #1729.


Future Releases

Parallelism

Performance

Model Support

Inference

Ease of Use

Precision

Multimodal

  • Heterogeneous Parallelism for MIMO [QUESTION] Support for Heterogeneous Parallelism in Multimodal Training #1375
    • MIMO is the canonical early-fusion multimodal architecture in Megatron Core, enabling modular vision, audio, and video encoders to plug into a shared LLM with out-of-the-box distributed training support
    • Enable independent nD parallelism (TP/DP/CP/EP/PP) for each module (encoders and LLM)
      • Colocated training (encoder and LLM share GPUs)
      • Non-colocated training (encoders and LLM on disjoint GPU sets)
    • FSDP support for whole-model sharding
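To make the heterogeneous-parallelism idea concrete, here is a minimal sketch of what a per-module parallelism plan could look like. The names (`ModuleParallelConfig`, `MimoParallelPlan`) and the GPU-accounting logic are illustrative assumptions, not Megatron Core APIs; the point is only that each module (encoder, LLM) carries its own TP/DP/CP/PP factors, and that colocated vs. non-colocated placement changes how the GPU pool is sized.

```python
from dataclasses import dataclass

# Hypothetical sketch: per-module parallelism settings for a MIMO-style
# model. These class names are illustrative, not Megatron Core APIs.
@dataclass
class ModuleParallelConfig:
    tensor_parallel: int = 1
    data_parallel: int = 1
    context_parallel: int = 1
    pipeline_parallel: int = 1

    def world_size(self) -> int:
        # GPUs consumed by this module's full parallel layout
        return (self.tensor_parallel * self.data_parallel
                * self.context_parallel * self.pipeline_parallel)

@dataclass
class MimoParallelPlan:
    vision_encoder: ModuleParallelConfig
    llm: ModuleParallelConfig
    colocated: bool = True  # True: encoder and LLM share GPUs

    def total_gpus(self) -> int:
        if self.colocated:
            # Shared pool: sized by the larger module's layout
            return max(self.vision_encoder.world_size(), self.llm.world_size())
        # Disjoint GPU sets: encoder GPUs + LLM GPUs
        return self.vision_encoder.world_size() + self.llm.world_size()

plan = MimoParallelPlan(
    vision_encoder=ModuleParallelConfig(tensor_parallel=1, data_parallel=8),
    llm=ModuleParallelConfig(tensor_parallel=4, data_parallel=2),
    colocated=True,
)
print(plan.total_gpus())  # 8 (shared pool sized by the larger module)
```

The same plan with `colocated=False` would need 8 + 8 = 16 GPUs, which is the trade-off the non-colocated mode accepts in exchange for encoders and LLM scaling independently.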

Infrastructure & Ecosystem

Finetuning


v0.16 Highlights (Released February 2026)

Parallelism

Performance & Memory

Inference

Model Support

Megatron FSDP

Ease of Use

Precision

  • FP4 training recipes - Low-precision training configurations and best practices

RL


v0.15 Highlights (Released December 2025)

Parallelism

  • HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism (45400df)
  • Advanced communication group management - Flexible creation and management of communication groups
  • Megatron FSDP - NVIDIA-optimized FSDP implementation (GitHub)
  • Configurable Megatron FSDP communication double buffering - Improved FSDP communication efficiency and throughput with persistent param/grad collective buffers

Performance

  • Fused QKV preprocessing with precomputed RoPE caches (3x preprocessing speedup, 10-14% end-to-end speedup)
  • Adam and AdamW optimizers - Configurable decoupled weight decay and precision-aware settings
  • CPU activation offloading via TransformerEngine
  • Spike No More embedding optimizations - Enhanced embedding initialization strategies
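The precomputed-RoPE-cache item above can be illustrated with a small NumPy sketch. This is not the Megatron Core kernel, just the underlying idea: the cos/sin rotation tables depend only on position and head dimension, so they can be built once per max sequence length and reused on every step, removing trigonometry from the per-step QKV path.

```python
import numpy as np

# Illustrative sketch of precomputed RoPE caches (not the actual
# Megatron Core implementation).

def build_rope_cache(max_seq_len: int, head_dim: int, base: float = 10000.0):
    # One inverse frequency per even/odd dimension pair
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(np.arange(max_seq_len), inv_freq)  # [seq, head_dim/2]
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin):
    # x: [seq, head_dim]; rotate each even/odd dimension pair by the
    # position-dependent angle from the cache
    seq = x.shape[0]
    x1, x2 = x[:, 0::2], x[:, 1::2]
    c, s = cos[:seq], sin[:seq]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * c - x2 * s
    out[:, 1::2] = x1 * s + x2 * c
    return out

cos, sin = build_rope_cache(max_seq_len=2048, head_dim=64)
q = np.random.randn(16, 64)
q_rot = apply_rope(q, cos, sin)
# RoPE is a pure rotation, so per-position vector norms are unchanged
print(np.allclose(np.linalg.norm(q_rot, axis=-1), np.linalg.norm(q, axis=-1)))
```

The cache costs O(max_seq_len * head_dim / 2) memory per table, which is why fusing the lookup into QKV preprocessing (rather than recomputing angles) is an easy per-step win.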

Model Support & Training

Inference

  • Speculative decoding implementation
  • Asynchronous inference support
  • CUDA Graph runner lookup table cache (up to 2x E2E speedup)
  • FP8 inference - Full FP8 inference pipeline support
  • Dynamic audio shapes with variable sequence lengths (2.5x throughput improvement)

RL

  • Importance sampling and partial rollouts - Advanced RL capabilities
  • Sequence packing for RL - Improved RL training efficiency
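The sequence-packing item can be sketched as follows. The greedy first-fit strategy here is an assumption for illustration, not Megatron Core's exact algorithm; the `cu_seqlens` (cumulative sequence length) layout is the convention varlen attention kernels typically consume, so packed sequences stay separable inside one bin.

```python
# Illustrative sketch of sequence packing for RL rollouts: variable-length
# sequences are packed into fixed-size bins to raise token utilization.

def pack_sequences(lengths, bin_size):
    """Greedy first-fit: place each sequence in the first bin with room.
    Returns a list of bins, each a list of sequence indices."""
    bins, free = [], []
    for idx, n in enumerate(lengths):
        for b, cap in enumerate(free):
            if n <= cap:
                bins[b].append(idx)
                free[b] -= n
                break
        else:  # no existing bin fits: open a new one
            bins.append([idx])
            free.append(bin_size - n)
    return bins

def cu_seqlens(lengths, bin_indices):
    """Cumulative sequence-length offsets for one packed bin."""
    offsets = [0]
    for i in bin_indices:
        offsets.append(offsets[-1] + lengths[i])
    return offsets

lens = [700, 300, 900, 100, 500]
bins = pack_sequences(lens, bin_size=1024)
print(bins)                        # [[0, 1], [2, 3], [4]]
print(cu_seqlens(lens, bins[0]))   # [0, 700, 1000]
```

Without packing, those five rollouts would occupy five 1024-token slots at ~49% utilization; first-fit packing brings the same tokens into three slots.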

Ease of Use

  • Megatron Bridge - Bidirectional Hugging Face (HF) checkpoint converter
  • uv support - Enhanced dependency management
  • ModelOpt pruning example - Practical model optimization examples
  • Documentation overhaul - New developer guide with quickstart and tutorials (Docs)

How to Provide Feedback

We welcome community input on prioritization! Please:

  1. React to items - Use 👍 on issues/PRs you'd like prioritized
  2. Comment on this issue - Share your use cases and requirements
  3. Open feature requests - Create issues with the enhancement label
  4. Contribute - PRs are welcome for any roadmap item!

Credits

This roadmap reflects the collective efforts of NVIDIA and our collaborators.


Last updated: March 2026
