
Refactor Wan Model Training & Add Wan-VACE Training Support#352

Open
ninatu wants to merge 1 commit into main from ninatu/wan_training

Conversation

@ninatu
Collaborator

@ninatu ninatu commented Mar 11, 2026

This PR introduces several improvements and fixes to Wan model training, and adds support for training Wan-VACE models.

Key changes include:

  1. Bug fixes:

    • Resolved a training-mode bug when dropout > 0 (e.g., ensured the rngs parameter is passed to layer_forward for gradient checkpointing with dropout).
    • Fixed prepare_sample_fn usage for the 'tfrecord' dataset type.
    • Addressed checkpoint-loading issues with larger TPU slices and different topologies for Wan 2.1.
    • Corrected timestep sampling for the continuous sampling mode.
  2. Config updates:

    • Ensured adam_weight_decay is a float.
    • Added tensorboard_dir parameter for logging.
    • Now uses config.learning_rate instead of a hardcoded value.
    • Set default dropout to 0.0 in WAN configs (instead of 0.1).
  3. Wan-VACE Support:

    • Refactoring: Common training components (initialization, scheduler, TFLOPs calculation, training/eval loops) have been abstracted into a new BaseWanTrainer ABC to improve code structure and reusability.
    • Added new scripts (train_wan_vace.py), trainer (wan_vace_trainer.py), and checkpointing logic (wan_vace_checkpointing_2_1.py) to enable training of WAN-VACE models.
  4. New Features:

    • Introduced config.disable_training_weights to optionally disable mid-point loss weighting.
    • Added logging for max_grad_norm and max_abs_grad.
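The dropout fix in item 1 can be sketched as follows. This is a hypothetical minimal example, not the PR's actual layer_forward: the point is that an explicit RNG must be threaded through the checkpointed (remat'd) function so the dropout mask can be recomputed consistently on the backward pass.

```python
import jax
import jax.numpy as jnp

# Hypothetical layer: matmul followed by dropout. The real layer_forward in
# this PR is different; this only illustrates passing rngs through remat.
def layer_forward(params, x, rng, deterministic):
    y = x @ params["w"]
    if not deterministic:
        keep = jax.random.bernoulli(rng, 0.9, y.shape)  # dropout rate 0.1
        y = jnp.where(keep, y / 0.9, 0.0)
    return y

# Threading rng through the checkpointed function lets the backward pass
# recompute the same dropout mask that was used on the forward pass.
layer_forward_ckpt = jax.checkpoint(layer_forward, static_argnums=(3,))

params = {"w": jnp.ones((4, 4))}
x = jnp.ones((2, 4))
rng = jax.random.PRNGKey(0)

loss, grads = jax.value_and_grad(
    lambda p: layer_forward_ckpt(p, x, rng, False).sum()
)(params)
```

If the RNG is captured implicitly instead of passed as an argument, gradient checkpointing with dropout > 0 can silently produce inconsistent masks between forward and recomputation, which matches the failure mode this PR fixes.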

@ninatu ninatu requested a review from entrpn as a code owner March 11, 2026 14:35
@entrpn
Collaborator

entrpn commented Mar 12, 2026

As this is a fairly large refactor:

  • @prishajain1 can you do a review of the checkpointing changes?

  • @susanbao can you take a quick look at the training changes?

```python
)

max_logging.log("Restoring WAN checkpoint")
restored_checkpoint = self.checkpoint_manager.restore(
```
Collaborator

Is this replicating the sharding across devices? If so, would this be able to load on a Trillium TPU with 32 GB of HBM?

Collaborator Author

The mesh is created using CPU devices, not TPU devices, so the model is loaded into host RAM rather than HBM.
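The pattern being described can be sketched like this (names are illustrative, not the PR's actual code): build the restore mesh over CPU devices so the checkpoint is materialized in host RAM instead of TPU HBM.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Build a mesh from CPU devices; arrays placed through this mesh live in
# host RAM, so restoring a large checkpoint does not touch accelerator HBM.
cpu_devices = np.array(jax.devices("cpu"))
cpu_mesh = Mesh(cpu_devices, axis_names=("restore",))

# Fully replicated spec: every device in the CPU mesh holds the whole array.
replicated = NamedSharding(cpu_mesh, PartitionSpec())
weights = jax.device_put(jnp.ones((8, 8)), replicated)
```

Because the restore target is host memory, HBM capacity (e.g., 32 GB on Trillium) does not bound the checkpoint size at load time; the weights can be re-sharded onto TPUs afterwards.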

```python
try:
    state[path].value = device_put_replicated(val, sharding)
except Exception as e:
    max_logging.log(f"Failed to device_put_replicated {path}: {e}")
```
Collaborator

Under what conditions is the exception code executed?

Collaborator Author

This code path is executed when weights are not fully available on all hosts, which occurs when a checkpoint is loaded in a multi-host training or inference environment. Without process_allgather in this case, the code raises: `ValueError: When the second argument to device_put is a Device, the first argument must be a fully addressable array or a non-addressable array with a single device sharding.`
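The multi-host guard being described can be sketched as follows (a simplified single-array example, assuming `jax.experimental.multihost_utils.process_allgather`; on a single host the gather is effectively a copy):

```python
import jax
import jax.numpy as jnp
from jax.experimental.multihost_utils import process_allgather

val = jnp.arange(4.0)

# On multi-host restores each process may hold only its local shard.
# Gathering first makes the array fully addressable on every host, so
# device_put can replicate it without the ValueError described above.
full = process_allgather(val, tiled=True)  # concatenates shards along axis 0
replicated = jax.device_put_replicated(full, jax.local_devices())
```

`device_put_replicated` then places one full copy on each local device, which is the behavior the try/except in the snippet falls back to logging about when it fails.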

@entrpn
Collaborator

entrpn commented Apr 1, 2026

@ninatu overall looks good; can you squash your commits and run the linter, and this should be good to go.

entrpn previously approved these changes Apr 1, 2026

Co-authored-by: martinarroyo <martinarroyo@google.com>
@ninatu
Collaborator Author

ninatu commented Apr 1, 2026

@entrpn, thanks, I squashed! It requires approval again.

