[ROADMAP] Megatron Core Roadmap #4003

@sbhavani

This roadmap outlines the key features, enhancements, and improvements planned for Megatron Core. It is tentative and subject to change based on community feedback and priorities.

For detailed information on past releases, see the Changelog. For the MoE-specific roadmap, see MoE Roadmap #1729.


Future Releases

Parallelism

Performance

Model Support

Inference

Ease of Use

Precision

Multimodal

  • Heterogeneous Parallelism for MIMO [QUESTION] Support for Heterogeneous Parallelism in Multimodal Training #1375
    • MIMO is the canonical early-fusion multimodal architecture in Megatron Core, enabling modular vision, audio, and video encoders to plug into a shared LLM with out-of-the-box distributed training support
    • Enable independent nD parallelism (TP/DP/CP/EP/PP) for each module (encoders and LLM)
      • Colocated training (encoder and LLM share GPUs)
      • Non-colocated training (encoders and LLM on disjoint GPU sets)
    • FSDP support for whole-model sharding
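To make the heterogeneous-parallelism idea concrete, here is a minimal sketch of what a per-module parallelism plan could look like. The names (`ModuleParallelConfig`, `MimoParallelPlan`) and the GPU-accounting logic are illustrative assumptions, not Megatron Core APIs; the point is only that each module (encoder, LLM) carries its own TP/DP/CP/PP factors, and that colocated vs. non-colocated placement changes how the GPU pool is sized.

```python
from dataclasses import dataclass

# Hypothetical sketch: per-module parallelism settings for a MIMO-style
# model. These class names are illustrative, not Megatron Core APIs.
@dataclass
class ModuleParallelConfig:
    tensor_parallel: int = 1
    data_parallel: int = 1
    context_parallel: int = 1
    pipeline_parallel: int = 1

    def world_size(self) -> int:
        # GPUs consumed by this module's full parallel layout
        return (self.tensor_parallel * self.data_parallel
                * self.context_parallel * self.pipeline_parallel)

@dataclass
class MimoParallelPlan:
    vision_encoder: ModuleParallelConfig
    llm: ModuleParallelConfig
    colocated: bool = True  # True: encoder and LLM share GPUs

    def total_gpus(self) -> int:
        if self.colocated:
            # Shared pool: sized by the larger module's layout
            return max(self.vision_encoder.world_size(), self.llm.world_size())
        # Disjoint GPU sets: encoder GPUs + LLM GPUs
        return self.vision_encoder.world_size() + self.llm.world_size()

plan = MimoParallelPlan(
    vision_encoder=ModuleParallelConfig(tensor_parallel=1, data_parallel=8),
    llm=ModuleParallelConfig(tensor_parallel=4, data_parallel=2),
    colocated=True,
)
print(plan.total_gpus())  # 8 (shared pool sized by the larger module)
```

The same plan with `colocated=False` would need 8 + 8 = 16 GPUs, which is the trade-off the non-colocated mode accepts in exchange for encoders and LLM scaling independently.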

Infrastructure & Ecosystem

Finetuning


v0.16 Highlights (Released February 2026)

Parallelism

Performance & Memory

Inference

Model Support

Megatron FSDP

Ease of Use

Precision

  • FP4 training recipes - Low-precision training configurations and best practices

RL


v0.15 Highlights (Released December 2025)

Parallelism

  • HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism (45400df)
  • Advanced communication group management - Flexible creation and management of communication groups
  • Megatron FSDP - NVIDIA-optimized FSDP implementation (GitHub)
  • Configurable Megatron FSDP communication double buffering - Improved FSDP communication efficiency and throughput with persistent param/grad collective buffers

Performance

  • Fused QKV preprocessing with precomputed RoPE caches (3x preprocessing speedup, 10-14% end-to-end speedup)
  • Adam and AdamW optimizers - Configurable decoupled weight decay and precision-aware settings
  • CPU activation offloading via TransformerEngine
  • Spike No More embedding optimizations - Enhanced embedding initialization strategies
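The precomputed-RoPE-cache item above can be illustrated with a small NumPy sketch. This is not the Megatron Core kernel, just the underlying idea: the cos/sin rotation tables depend only on position and head dimension, so they can be built once per max sequence length and reused on every step, removing trigonometry from the per-step QKV path.

```python
import numpy as np

# Illustrative sketch of precomputed RoPE caches (not the actual
# Megatron Core implementation).

def build_rope_cache(max_seq_len: int, head_dim: int, base: float = 10000.0):
    # One inverse frequency per even/odd dimension pair
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(np.arange(max_seq_len), inv_freq)  # [seq, head_dim/2]
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin):
    # x: [seq, head_dim]; rotate each even/odd dimension pair by the
    # position-dependent angle from the cache
    seq = x.shape[0]
    x1, x2 = x[:, 0::2], x[:, 1::2]
    c, s = cos[:seq], sin[:seq]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * c - x2 * s
    out[:, 1::2] = x1 * s + x2 * c
    return out

cos, sin = build_rope_cache(max_seq_len=2048, head_dim=64)
q = np.random.randn(16, 64)
q_rot = apply_rope(q, cos, sin)
# RoPE is a pure rotation, so per-position vector norms are unchanged
print(np.allclose(np.linalg.norm(q_rot, axis=-1), np.linalg.norm(q, axis=-1)))
```

The cache costs O(max_seq_len * head_dim / 2) memory per table, which is why fusing the lookup into QKV preprocessing (rather than recomputing angles) is an easy per-step win.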

Model Support & Training

Inference

  • Speculative decoding implementation
  • Asynchronous inference support
  • CUDA Graph runner lookup table cache (up to 2x E2E speedup)
  • FP8 inference - Full FP8 inference pipeline support
  • Dynamic audio shapes with variable sequence lengths (2.5x throughput improvement)

RL

  • Importance sampling and partial rollouts - Advanced RL capabilities
  • Sequence packing for RL - Improved RL training efficiency
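The sequence-packing item can be sketched as follows. The greedy first-fit strategy here is an assumption for illustration, not Megatron Core's exact algorithm; the `cu_seqlens` (cumulative sequence length) layout is the convention varlen attention kernels typically consume, so packed sequences stay separable inside one bin.

```python
# Illustrative sketch of sequence packing for RL rollouts: variable-length
# sequences are packed into fixed-size bins to raise token utilization.

def pack_sequences(lengths, bin_size):
    """Greedy first-fit: place each sequence in the first bin with room.
    Returns a list of bins, each a list of sequence indices."""
    bins, free = [], []
    for idx, n in enumerate(lengths):
        for b, cap in enumerate(free):
            if n <= cap:
                bins[b].append(idx)
                free[b] -= n
                break
        else:  # no existing bin fits: open a new one
            bins.append([idx])
            free.append(bin_size - n)
    return bins

def cu_seqlens(lengths, bin_indices):
    """Cumulative sequence-length offsets for one packed bin."""
    offsets = [0]
    for i in bin_indices:
        offsets.append(offsets[-1] + lengths[i])
    return offsets

lens = [700, 300, 900, 100, 500]
bins = pack_sequences(lens, bin_size=1024)
print(bins)                        # [[0, 1], [2, 3], [4]]
print(cu_seqlens(lens, bins[0]))   # [0, 700, 1000]
```

Without packing, those five rollouts would occupy five 1024-token slots at ~49% utilization; first-fit packing brings the same tokens into three slots.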

Ease of Use

  • Megatron Bridge - Bidirectional Hugging Face (HF) checkpoint converter
  • uv support - Enhanced dependency management
  • ModelOpt pruning example - Practical model optimization examples
  • Documentation overhaul - New developer guide with quickstart and tutorials (Docs)

How to Provide Feedback

We welcome community input on prioritization! Please:

  1. React to items - Use 👍 on issues/PRs you'd like prioritized
  2. Comment on this issue - Share your use cases and requirements
  3. Open feature requests - Create issues with the enhancement label
  4. Contribute - PRs are welcome for any roadmap item!

Credits

This roadmap reflects the collective efforts of NVIDIA and our collaborators.


Last updated: March 2026
