
[TTS] Refactor Magpie to support codec conversion and bandwidth extension#15191

Merged
blisc merged 13 commits into NVIDIA-NeMo:main from rlangman:magpietts_eos_fix
Jan 7, 2026

Conversation

@rlangman
Collaborator

This PR fixes a bug in MagpieTTS training in which self._codec_converter.convert_original_to_new() is being called on audio codec tokens after the Lhotse data loader has already pre-populated them with BOS and EOS tokens.

The addition of BOS and EOS tokens is now always done inside the model class, after convert_original_to_new() is called. These tokens are also removed before convert_new_to_original() is called, so that they are not mistakenly passed into the codec model.
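The intended ordering can be sketched as follows (a minimal illustration; `BOS_ID`, `EOS_ID`, and the helper names are hypothetical stand-ins for the actual model and codec-converter interfaces):

```python
import torch

BOS_ID, EOS_ID = 1000, 1001  # hypothetical special-token IDs


def add_bos_eos(codes: torch.Tensor) -> torch.Tensor:
    """Append BOS/EOS along the time axis. codes: (B, C, T)."""
    b, c, _ = codes.shape
    bos = torch.full((b, c, 1), BOS_ID, dtype=codes.dtype)
    eos = torch.full((b, c, 1), EOS_ID, dtype=codes.dtype)
    return torch.cat([bos, codes, eos], dim=-1)


def strip_bos_eos(codes: torch.Tensor) -> torch.Tensor:
    """Remove BOS/EOS frames before handing codes back to the codec."""
    return codes[:, :, 1:-1]


# Training path: convert the raw codec tokens first, then add BOS/EOS
# inside the model (placeholder stands in for convert_original_to_new()).
raw_codes = torch.randint(0, 1000, (2, 4, 10))
converted = raw_codes  # placeholder for the codec conversion step
model_input = add_bos_eos(converted)

# Decoding path: strip BOS/EOS before convert_new_to_original() / decode,
# so the special tokens never reach the codec model.
codec_input = strip_bos_eos(model_input)
assert torch.equal(codec_input, converted)
```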

Additionally, this includes changes to the magpie code and codec interfaces to support doing bandwidth extension within the codec, where the sample rate of the input and output audio is different.

@blisc blisc requested a review from rfejgin December 15, 2025 20:35
# codes: (B, C, T')
# codes_len: (B,)
codes = codes[:, :, num_tokens:]
codes_len = codes_len - num_tokens
Collaborator


It would be good to add a debug check confirming that, after this operation, no BOS codes remain in the tensor - to catch cases like having frame_stacking_factor's worth of BOS tokens and forgetting to remove them all.

It has to be done under some kind of debug flag though since this would cause a host-gpu synchronization (this only matters if this is something that happens in a "hot" path like every training step or every inference step).
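A minimal sketch of such a gated check (the `BOS_ID` value and function shape are hypothetical; only the slicing mirrors the snippet above):

```python
import torch

BOS_ID = 1000  # hypothetical BOS token ID


def strip_bos(codes, codes_len, num_tokens, debug=False):
    """Drop num_tokens leading frames (e.g. frame_stacking_factor BOS frames).

    codes: (B, C, T'), codes_len: (B,)
    """
    codes = codes[:, :, num_tokens:]
    codes_len = codes_len - num_tokens
    if debug:
        # .item() forces a host-GPU sync, so keep this behind a flag
        # and out of hot training/inference loops.
        assert not (codes == BOS_ID).any().item(), "BOS tokens remain after stripping"
    return codes, codes_len
```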

Collaborator Author


Given that we are supposed to avoid if/else in our training and inference code due to overhead, debug flags do not sound feasible. Unit tests would be more appropriate.
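Such a unit test could look like this (a pytest-style sketch; `BOS_ID` and the helper are hypothetical stand-ins for the model's actual BOS-stripping code):

```python
import torch

BOS_ID = 1000  # hypothetical BOS token ID


def remove_bos_frames(codes, codes_len, frame_stacking_factor):
    """Drop the leading frame_stacking_factor BOS frames. codes: (B, C, T)."""
    codes = codes[:, :, frame_stacking_factor:]
    return codes, codes_len - frame_stacking_factor


def test_no_bos_remains_after_removal():
    # 2 batch items, 4 codebooks, 12 frames; first 4 frames are BOS (4x stacking).
    factor = 4
    codes = torch.randint(0, 999, (2, 4, 12))
    codes[:, :, :factor] = BOS_ID
    codes_len = torch.tensor([12, 12])
    out, out_len = remove_bos_frames(codes, codes_len, factor)
    assert not (out == BOS_ID).any()
    assert (out_len == 8).all()
```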

) # (B, T, E) # Computing this to be used in the alignment encoder
audio_codes_embedded = audio_codes_embedded_all[
:, :-1, :
] # (B, T', E) Input to the decoder; this is already in the frame-stacked domain, hence the -1 (not `frame_stacking_factor`)
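A small shape illustration of why the slice drops one position rather than `frame_stacking_factor` (tensor names and sizes are hypothetical; the point is that the embeddings are already in the frame-stacked domain, where one position covers `frame_stacking_factor` raw frames):

```python
import torch

frame_stacking_factor = 4
B, T_raw, E = 2, 16, 8                      # raw frame count divisible by the factor
T_stacked = T_raw // frame_stacking_factor  # 4 stacked positions

# Hypothetical stacked embeddings with one extra trailing position.
audio_codes_embedded_all = torch.randn(B, T_stacked + 1, E)

# Already frame-stacked, so drop a single position, not `frame_stacking_factor`.
audio_codes_embedded = audio_codes_embedded_all[:, :-1, :]
assert audio_codes_embedded.shape == (B, T_stacked, E)
```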
Collaborator

@rfejgin rfejgin Dec 15, 2025


This code (pre-change) is a bit delicate, so I'm nervous about changing it; it took a while to get this right as there are a few corner cases.

I'll review in greater detail, but I suggest training with 4x frame stacking (not just 2x), as I've observed 4x to uncover some corner cases that don't come up in 2x frame stacking. It would also be a good idea to collect some inference metrics; when I got some details here wrong, it showed up as slightly elevated WERs (by ~1 percentage point) in the 4x-stacked case. The baseline could be an issue; you might have to train with the pre-refactored code too.

That said, the refactored code does look cleaner!

Collaborator Author


So for testing I need to validate that the code runs when frame stacking is set to 4.

It sounds like we also want to train 2 models with 2x frame stacking before and after this change and validate that they have the same evaluation performance. Would we block the merging of this PR on that multi-week test?

Collaborator Author


I have training runs with 4x frame stacking ongoing, using code from before and after this commit. Training and validation losses are identical. I can run evaluation when they are finished.

Collaborator

@rfejgin rfejgin Dec 19, 2025


Thanks @rlangman for going through the extra validation. Please run evaluation once training is done or has run long enough (say, 300k steps). If the metrics are similar, I'm good with merging. For reference, with a 4x-stacked model with 5-second context, trained on English, these are the metrics I get (though you could also use your pre-change model as the baseline):

Unseen
SSIM=0.70
WER=1.2%
UTMOSv2=3.55
FCD=0.681

Seen
SSIM=0.77
WER=1.05%
UTMOSv2=3.53
FCD=0.594

(if you used a longer context of 10 seconds the metrics will be better)

Collaborator


@rlangman One additional thing worth checking with regard to frame stacking:
Listen to a few output samples from the 4x-stacked model. Look for cutoff in the last word of the utterance - there shouldn't be any (or at least no more than with a non-stacked model). Mid-word cutoff of the last word is one issue we had when EOS-related corner cases were not handled correctly (earlier on), and the refactored code touched those code areas. Maybe also look for first-word cutoff in case something similar happens with BOS.

Collaborator Author

@rlangman rlangman Dec 24, 2025


This is the evaluation on LibriTTS unseen speakers.

Before code change:

SSIM: 0.702
CER: 1.25%
WER: 2.16%
UTMOSv2: 3.23

After code change:

SSIM: 0.696
CER: 0.96%
WER: 1.84%
UTMOSv2: 3.24

Compared to equivalent non-frame-stacked versions, there is a drop in speaker similarity (0.8 -> 0.7). There is also a higher CER (0.3% -> 1%) due to the frame-stacked models being prone to early termination. However, the results between the two frame-stacked models are similar.

Collaborator

@rfejgin rfejgin Dec 24, 2025


Okay, great! Thanks for doing the extra work to verify this.

The lower SSIM for the 4x-frame-stacked model compared to non-frame-stacked is not surprising, especially since this is a small local transformer and we have not done the "repeat context" trick. The fact that the two frame-stacked models have similar stats is what we were looking for and found here.

As discussed offline, I am also training a pair of 4x-stacked, large-LT, long-context models which we can use for extra verification later on, in an even more controlled setting (without termination issues). But no need to wait for that; these results are already quite convincing.

Good to merge from my end.

Collaborator

@rfejgin rfejgin left a comment


Nice work. Please see comments inline.

codes = codes * mask.unsqueeze(1)
return codes, codes_len

def remove_embedded_eos_token(self, embedded, embedded_len):
Collaborator

@rfejgin rfejgin Dec 19, 2025


Regarding remove_embedded_eos_token(): it might make sense to add a comment saying that when using frame stacking, this embedding may be formed from a mix of EOS and non-EOS tokens (but that is still correct, as discussed). Maybe a better name would be simply "remove_final_embedding".
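A sketch of the suggested rename with that comment in place (the signature and length handling are assumptions about the surrounding code, not the actual implementation):

```python
import torch


def remove_final_embedding(embedded, embedded_len):
    """Drop the last valid embedding per sequence.

    With frame stacking, the final embedding may be formed from a mix of
    EOS and non-EOS tokens; that is still correct, since only the trailing
    stacked position is removed and masking handles shorter sequences.

    embedded: (B, T, E); embedded_len: (B,)
    """
    new_len = embedded_len - 1
    max_len = int(new_len.max())
    # Truncate to the longest remaining length; shorter sequences rely on
    # new_len (and downstream masking) to exclude their final position.
    return embedded[:, :max_len, :], new_len
```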

@blisc blisc merged commit be2fac6 into NVIDIA-NeMo:main Jan 7, 2026
139 checks passed
AkCodes23 pushed a commit to AkCodes23/NeMo that referenced this pull request Jan 28, 2026
…sion (NVIDIA-NeMo#15191)

* [TTS] Refactor Magpie to support codec conversion and bandwidth extension

Signed-off-by: Ryan <rlangman@nvidia.com>

* Fix frame stacking logic

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

* Modify data loader arguments in inference script

Signed-off-by: Ryan <rlangman@nvidia.com>

* Change lhotse audio codes to unsigned int

Signed-off-by: Ryan <rlangman@nvidia.com>

* Update new inference script

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

* Convert codes to int32 in lhotse

Signed-off-by: Ryan <rlangman@nvidia.com>

* Fix CI test errors

Signed-off-by: Ryan <rlangman@nvidia.com>

* Fix predicted length

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

* Fix BOS and EOS handling in PO

Signed-off-by: Ryan <rlangman@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: rlangman <rlangman@users.noreply.github.com>

---------

Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: rlangman <rlangman@users.noreply.github.com>
Co-authored-by: rlangman <rlangman@users.noreply.github.com>
Signed-off-by: Akhil Varanasi <akhilvaranasi23@gmail.com>
nune-tadevosyan pushed a commit to nune-tadevosyan/NeMo that referenced this pull request Mar 13, 2026
…sion (NVIDIA-NeMo#15191)



5 participants