
[TTS] Refactor Magpie to support codec conversion and bandwidth extension#15191

Merged
blisc merged 13 commits into NVIDIA-NeMo:main from rlangman:magpietts_eos_fix
Jan 7, 2026

Conversation

@rlangman
Collaborator

This PR fixes a bug in MagpieTTS training in which self._codec_converter.convert_original_to_new() is being called on audio codec tokens after the Lhotse data loader has already pre-populated them with BOS and EOS tokens.

The addition of BOS and EOS tokens is now always done inside the model class, after convert_original_to_new() is called. These tokens are also removed before convert_new_to_original() is called, so that they are not mistakenly passed into the codec model.
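The intended ordering can be sketched as follows (a minimal illustration; `BOS_ID`, `EOS_ID`, and the helper names are hypothetical stand-ins for the actual model and codec-converter interfaces):

```python
import torch

BOS_ID, EOS_ID = 1000, 1001  # hypothetical special-token IDs


def add_bos_eos(codes: torch.Tensor) -> torch.Tensor:
    """Append BOS/EOS along the time axis. codes: (B, C, T)."""
    b, c, _ = codes.shape
    bos = torch.full((b, c, 1), BOS_ID, dtype=codes.dtype)
    eos = torch.full((b, c, 1), EOS_ID, dtype=codes.dtype)
    return torch.cat([bos, codes, eos], dim=-1)


def strip_bos_eos(codes: torch.Tensor) -> torch.Tensor:
    """Remove BOS/EOS frames before handing codes back to the codec."""
    return codes[:, :, 1:-1]


# Training path: convert the raw codec tokens first, then add BOS/EOS
# inside the model (placeholder stands in for convert_original_to_new()).
raw_codes = torch.randint(0, 1000, (2, 4, 10))
converted = raw_codes  # placeholder for the codec conversion step
model_input = add_bos_eos(converted)

# Decoding path: strip BOS/EOS before convert_new_to_original() / decode,
# so the special tokens never reach the codec model.
codec_input = strip_bos_eos(model_input)
assert torch.equal(codec_input, converted)
```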

Additionally, this includes changes to the magpie code and codec interfaces to support doing bandwidth extension within the codec, where the sample rate of the input and output audio is different.

@blisc blisc requested a review from rfejgin December 15, 2025 20:35
# codes: (B, C, T')
# codes_len: (B,)
codes = codes[:, :, num_tokens:]
codes_len = codes_len - num_tokens
Collaborator


It would be good to add a debug check confirming that, after this operation, no BOS codes remain in the tensor - to catch cases like having frame_stacking_factor's worth of BOS tokens and forgetting to remove them all.

It has to be done under some kind of debug flag though since this would cause a host-gpu synchronization (this only matters if this is something that happens in a "hot" path like every training step or every inference step).
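A minimal sketch of such a gated check (the `BOS_ID` value and function shape are hypothetical; only the slicing mirrors the snippet above):

```python
import torch

BOS_ID = 1000  # hypothetical BOS token ID


def strip_bos(codes, codes_len, num_tokens, debug=False):
    """Drop num_tokens leading frames (e.g. frame_stacking_factor BOS frames).

    codes: (B, C, T'), codes_len: (B,)
    """
    codes = codes[:, :, num_tokens:]
    codes_len = codes_len - num_tokens
    if debug:
        # .item() forces a host-GPU sync, so keep this behind a flag
        # and out of hot training/inference loops.
        assert not (codes == BOS_ID).any().item(), "BOS tokens remain after stripping"
    return codes, codes_len
```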

Collaborator Author


Given that we are supposed to avoid if/else in our training and inference code due to overhead, debug flags do not sound feasible. Unit tests would be more appropriate.
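Such a unit test could look like this (a pytest-style sketch; `BOS_ID` and the helper are hypothetical stand-ins for the model's actual BOS-stripping code):

```python
import torch

BOS_ID = 1000  # hypothetical BOS token ID


def remove_bos_frames(codes, codes_len, frame_stacking_factor):
    """Drop the leading frame_stacking_factor BOS frames. codes: (B, C, T)."""
    codes = codes[:, :, frame_stacking_factor:]
    return codes, codes_len - frame_stacking_factor


def test_no_bos_remains_after_removal():
    # 2 batch items, 4 codebooks, 12 frames; first 4 frames are BOS (4x stacking).
    factor = 4
    codes = torch.randint(0, 999, (2, 4, 12))
    codes[:, :, :factor] = BOS_ID
    codes_len = torch.tensor([12, 12])
    out, out_len = remove_bos_frames(codes, codes_len, factor)
    assert not (out == BOS_ID).any()
    assert (out_len == 8).all()
```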

) # (B, T, E) # Computing this to be used in the alignment encoder
audio_codes_embedded = audio_codes_embedded_all[
:, :-1, :
] # (B, T', E) Input to the decoder; this is already in the frame-stacked domain, hence the -1 (not `frame_stacking_factor`)
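A small shape illustration of why the slice drops one position rather than `frame_stacking_factor` (tensor names and sizes are hypothetical; the point is that the embeddings are already in the frame-stacked domain, where one position covers `frame_stacking_factor` raw frames):

```python
import torch

frame_stacking_factor = 4
B, T_raw, E = 2, 16, 8                      # raw frame count divisible by the factor
T_stacked = T_raw // frame_stacking_factor  # 4 stacked positions

# Hypothetical stacked embeddings with one extra trailing position.
audio_codes_embedded_all = torch.randn(B, T_stacked + 1, E)

# Already frame-stacked, so drop a single position, not `frame_stacking_factor`.
audio_codes_embedded = audio_codes_embedded_all[:, :-1, :]
assert audio_codes_embedded.shape == (B, T_stacked, E)
```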
Collaborator

@rfejgin rfejgin Dec 15, 2025


This code (pre-change) is a bit delicate, so I'm nervous about changing it; it took a while to get this right as there are a few corner cases.

I'll review in greater detail, but I suggest training with 4x frame stacking (not just 2x), as I've observed 4x to uncover some corner cases that don't come up in 2x frame stacking. It would also be a good idea to collect some inference metrics; when I got some details here wrong, it showed up as slightly elevated WERs (by ~1 percentage point) in the 4x-stacked case. The baseline could be an issue; you might have to train with the pre-refactored code too.

That said, the refactored code does look cleaner!

Collaborator Author


So for testing I need to validate that the code runs when frame stacking is set to 4.

It sounds like we also want to train 2 models with 2x frame stacking before and after this change and validate that they have the same evaluation performance. Would we block the merging of this PR on that multi-week test?

Collaborator Author


I have training runs with 4x frame stacking ongoing, using code from before and after this commit. Training and validation losses are identical. I can run evaluation when they are finished.

Collaborator

@rfejgin rfejgin Dec 19, 2025


Thanks @rlangman for going through the extra validation. Please run evaluation once training is done or has run long enough (say, 300k steps). If the metrics are similar, I'm good with merging. For reference, with a 4x-stacked model with 5-second context, trained on English, these are the metrics I get (though you could also use your pre-change model as the baseline):

Unseen
SSIM=0.70
WER=1.2%
UTMOSv2=3.55
FCD=0.681

Seen
SSIM=0.77
WER=1.05%
UTMOSv2=3.53
FCD=0.594

(if you used a longer context of 10 seconds the metrics will be better)

Collaborator


@rlangman One additional thing worth checking with regard to frame stacking:
Listen to a few output samples from the 4x-stacked model. Look for cutoff in the last word of the utterance - there shouldn't be any (or at least no more than with a non-stacked model). Mid-word cutoff of the last word is one issue we had when EOS-related corner cases were not handled correctly (earlier on), and the refactored code touched those code areas. Maybe also look for first-word cutoff in case something similar happens with BOS.

Collaborator Author

@rlangman rlangman Dec 24, 2025


This is the evaluation on LibriTTS unseen speakers.

Before code change:

SSIM: 0.702
CER: 1.25%
WER: 2.16%
UTMOSv2: 3.23

After code change:

SSIM: 0.696
CER: 0.96%
WER: 1.84%
UTMOSv2: 3.24

Compared to equivalent non-frame-stacked versions, there is a drop in speaker similarity (0.8 -> 0.7). There is also a higher CER (0.3% -> 1%) due to the frame-stacked models being prone to early termination. However, the results between the two frame-stacked models are similar.

Collaborator

@rfejgin rfejgin Dec 24, 2025


Okay, great! Thanks for doing the extra work to verify this.

The lower SSIM for the 4x-frame-stacked model compared to non-frame-stacked is not surprising, especially since this is a small local transformer and we have not done the "repeat context" trick. The fact that the two frame-stacked models have similar stats is what we were looking for and found here.

As discussed offline, I am also training a pair of 4x-stacked, large-LT, long-context models which we can use for extra verification later on, in an even more controlled setting (without termination issues). But no need to wait for that; these results are already quite convincing.

Good to merge from my end.

Collaborator

@rfejgin rfejgin left a comment


Nice work. Please see comments inline.

codes = codes * mask.unsqueeze(1)
return codes, codes_len

def remove_embedded_eos_token(self, embedded, embedded_len):
Collaborator

@rfejgin rfejgin Dec 19, 2025


Regarding remove_embedded_eos_token(): it might make sense to add a comment saying that when using frame stacking, this embedding may be formed from a mix of EOS and non-EOS tokens (but that is still correct, as discussed). Maybe a better name would be simply "remove_final_embedding".
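A sketch of the suggested rename with that comment in place (the signature and length handling are assumptions about the surrounding code, not the actual implementation):

```python
import torch


def remove_final_embedding(embedded, embedded_len):
    """Drop the last valid embedding per sequence.

    With frame stacking, the final embedding may be formed from a mix of
    EOS and non-EOS tokens; that is still correct, since only the trailing
    stacked position is removed and masking handles shorter sequences.

    embedded: (B, T, E); embedded_len: (B,)
    """
    new_len = embedded_len - 1
    max_len = int(new_len.max())
    # Truncate to the longest remaining length; shorter sequences rely on
    # new_len (and downstream masking) to exclude their final position.
    return embedded[:, :max_len, :], new_len
```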

@blisc blisc merged commit be2fac6 into NVIDIA-NeMo:main Jan 7, 2026
139 checks passed
AkCodes23 pushed a commit to AkCodes23/NeMo that referenced this pull request Jan 28, 2026
…sion (NVIDIA-NeMo#15191)

* [TTS] Refactor Magpie to support codec conversion and bandwidth extension

Signed-off-by: Ryan <rlangman@nvidia.com>

* Fix frame stacking logic

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

* Modify data loader arguments in inference script

Signed-off-by: Ryan <rlangman@nvidia.com>

* Change lhotse audio codes to unsigned int

Signed-off-by: Ryan <rlangman@nvidia.com>

* Update new inference script

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

* Convert codes to int32 in lhotse

Signed-off-by: Ryan <rlangman@nvidia.com>

* Fix CI test errors

Signed-off-by: Ryan <rlangman@nvidia.com>

* Fix predicted length

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

* Fix BOS and EOS handling in PO

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

---------

Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: rlangman <rlangman@users.noreply.github.com>
Co-authored-by: rlangman <rlangman@users.noreply.github.com>
Signed-off-by: Akhil Varanasi <akhilvaranasi23@gmail.com>
nune-tadevosyan pushed a commit to nune-tadevosyan/NeMo that referenced this pull request Mar 13, 2026
…sion (NVIDIA-NeMo#15191)



5 participants