[TTS] Refactor Magpie to support codec conversion and bandwidth extension#15191
blisc merged 13 commits into NVIDIA-NeMo:main
Conversation
# codes: (B, C, T')
# codes_len: (B,)
codes = codes[:, :, num_tokens:]
codes_len = codes_len - num_tokens
It would be good to add a debug check confirming that after this operation no BOS codes remain in the tensor, to catch cases like having frame_stacking_factor's worth of BOS tokens and forgetting to remove them all.
It would have to be done under some kind of debug flag, though, since the check causes a host-GPU synchronization (this only matters if it happens on a "hot" path, like every training or inference step).
Given that we are supposed to avoid if/else branches in our training and inference code due to overhead, debug flags do not sound feasible. Unit tests would be more appropriate.
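A unit-test-style sketch of the suggested check (BOS_ID, the shapes, and the helper name are made up for illustration; the slicing mirrors the quoted snippet):

```python
import numpy as np

BOS_ID = 1024            # hypothetical BOS code id; the real id lives in the model config
FRAME_STACKING_FACTOR = 4

def strip_bos_frames(codes, codes_len, num_tokens):
    # codes: (B, C, T'), codes_len: (B,) -- mirrors the quoted snippet
    return codes[:, :, num_tokens:], codes_len - num_tokens

# Build a batch with frame_stacking_factor's worth of leading BOS frames,
# as the data pipeline would.
rng = np.random.default_rng(0)
codes = rng.integers(0, BOS_ID, size=(2, 8, 16))
bos = np.full((2, 8, FRAME_STACKING_FACTOR), BOS_ID)
stacked = np.concatenate([bos, codes], axis=2)
lens = np.full((2,), stacked.shape[2])

stripped, stripped_len = strip_bos_frames(stacked, lens, FRAME_STACKING_FACTOR)

# The debug check under discussion: no BOS codes may remain after stripping.
assert not (stripped == BOS_ID).any()
assert (stripped_len == 16).all()
```

Run as a unit test, this avoids any branch or host-GPU sync on the hot path.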
) # (B, T, E) # Computing this to be used in the alignment encoder
audio_codes_embedded = audio_codes_embedded_all[
    :, :-1, :
] # (B, T', E) Input to the decoder; this is already in the frame-stacked domain, hence the -1 (not `frame_stacking_factor`)
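A toy illustration of why the shift is -1 here (all shapes are made up): in the frame-stacked domain each embedding already aggregates frame_stacking_factor codec frames, so the teacher-forcing shift drops a single stacked position rather than frame_stacking_factor raw frames.

```python
import numpy as np

FRAME_STACKING_FACTOR = 4
EMB_DIM = 8
raw_frames = 16                                   # codec frames before stacking
stacked_T = raw_frames // FRAME_STACKING_FACTOR   # T' = 4 stacked positions

# One embedding per stacked group of frames: (B, T', E)
audio_codes_embedded_all = np.zeros((2, stacked_T, EMB_DIM))

# Teacher-forcing input to the decoder: shift by ONE stacked position,
# not by frame_stacking_factor, because stacking has already happened.
decoder_input = audio_codes_embedded_all[:, :-1, :]
assert decoder_input.shape == (2, stacked_T - 1, EMB_DIM)
```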
This code (pre-change) is a bit delicate, so I'm nervous about changing it; it took a while to get right because there are a few corner cases.
I'll review in greater detail, but I suggest training with 4x frame stacking (not just 2x), as I've observed that 4x uncovers corner cases that don't come up with 2x frame stacking. It would also be a good idea to collect some inference metrics; when I got details here wrong, it showed up as slightly elevated WERs (by about 1 percentage point) in the 4x-stacked case. Establishing a baseline would be the issue; you might have to train with the pre-refactor code too.
That said, the refactored code does look cleaner!
So for testing I need to validate that the code runs when frame stacking is set to 4.
It sounds like we also want to train two models with 2x frame stacking, before and after this change, and validate that they have the same evaluation performance. Would we block merging this PR on that multi-week test?
I have training runs with 4x frame stacking ongoing, using the code before and after this commit. Training and validation losses are identical. I can run evaluation when they finish.
Thanks @rlangman for going through the extra validation. Please run evaluation once training is done, or has run long enough (say 300k steps). If the metrics are similar, I'm good with merging. For reference, with a 4x-stacked model with a 5-second context, trained on English, these are the metrics I get (though you could also use your pre-change model as the baseline):
Unseen
SSIM=0.70
WER=1.2%
UTMOSv2=3.55
FCD=0.681
Seen
SSIM=0.77
WER=1.05%
UTMOSv2=3.53
FCD=0.594
(If you used a longer context of 10 seconds, the metrics would be better.)
@rlangman One additional thing worth checking regarding frame stacking:
Listen to a few output samples from the 4x-stacked model. Look for cutoff of the last word of the utterance; there shouldn't be any (or at least no more than in a non-stacked model). Mid-last-word cutoff is an issue we hit earlier on when EOS-related corner cases were not handled correctly, and the refactored code touches those areas. Maybe also look for first-word cutoff, in case something similar happens with BOS.
This is the evaluation on LibriTTS unseen speakers.
Before code change:
SSIM: 0.702
CER: 1.25%
WER: 2.16%
UTMOSv2: 3.23
After code change:
SSIM: 0.696
CER: 0.96%
WER: 1.84%
UTMOSv2: 3.24
Compared to the equivalent non-frame-stacked versions, there is a drop in speaker similarity (0.8 -> 0.7). There is also a higher CER (0.3% -> 1%) because the frame-stacked models are prone to terminating early. However, the results for the two frame-stacked models are similar.
Okay, great! Thanks for doing the extra work to verify this.
The lower SSIM for the 4x-frame-stacked model compared to non-frame-stacked is not surprising, especially since this is a small local transformer and we have not applied the "repeat context" trick. The fact that the two frame-stacked models have similar stats is what we were looking for, and that is what we found here.
As discussed offline, I am also training a pair of 4x-stacked, large-LT, long-context models which we can use for extra verification later, in an even more controlled setting (without termination issues). But no need to wait for that; these results are already quite convincing.
Good to merge from my end.
rfejgin left a comment
Nice work. Please see comments inline.
Force-pushed from 2d55705 to 1d5c5b5
codes = codes * mask.unsqueeze(1)
return codes, codes_len

def remove_embedded_eos_token(self, embedded, embedded_len):
Regarding remove_embedded_eos_token(): it might make sense to add a comment noting that, when frame stacking is used, this embedding may be formed from a mix of EOS and non-EOS tokens (which is still correct, as discussed). A better name might simply be "remove_final_embedding".
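A sketch of what a renamed remove_final_embedding could look like (the length handling and shapes here are assumptions for illustration, not the actual NeMo implementation):

```python
import numpy as np

def remove_final_embedding(embedded, embedded_len):
    """Drop the last valid embedding of each sequence.

    With frame stacking, that final embedding can be formed from a mix of
    EOS and non-EOS tokens, so "final embedding" describes it better than
    "EOS token". embedded: (B, T, E); embedded_len: (B,).
    """
    new_len = embedded_len - 1
    max_len = int(new_len.max())
    # Positions beyond each sequence's new length are padding and are
    # expected to be masked out downstream.
    return embedded[:, :max_len, :], new_len

emb = np.arange(2 * 5 * 3, dtype=float).reshape(2, 5, 3)
lens = np.array([5, 4])
out, out_len = remove_final_embedding(emb, lens)
assert out.shape == (2, 4, 3)
assert (out_len == np.array([4, 3])).all()
```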
Force-pushed from 1d5c5b5 to a51e83e
[TTS] Refactor Magpie to support codec conversion and bandwidth extension (NVIDIA-NeMo#15191)

* [TTS] Refactor Magpie to support codec conversion and bandwidth extension
* Fix frame stacking logic
* Apply isort and black reformatting
* Modify data loader arguments in inference script
* Change lhotse audio codes to unsigned int
* Update new inference script
* Apply isort and black reformatting
* Convert codes to int32 in lhotse
* Fix CI test errors
* Fix predicted length
* Apply isort and black reformatting
* Fix BOS and EOS handling in PO
* Apply isort and black reformatting

Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: rlangman <rlangman@users.noreply.github.com>
Co-authored-by: rlangman <rlangman@users.noreply.github.com>
Signed-off-by: Akhil Varanasi <akhilvaranasi23@gmail.com>
This PR fixes a bug in MagpieTTS training in which self._codec_converter.convert_original_to_new() was called on audio codec tokens after the Lhotse data loader had already pre-populated them with BOS and EOS tokens.
BOS and EOS tokens are now always added inside the model class, after convert_original_to_new() is called. They are also removed before convert_new_to_original() is called, so they are never mistakenly passed into the codec model.
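A minimal sketch of the ordering this fix enforces. convert_original_to_new / convert_new_to_original are the method names from this PR, but the converter body, token ids, and helper functions below are hypothetical:

```python
import numpy as np

BOS_ID, EOS_ID = 1024, 1025   # hypothetical special-token ids

class CodecConverter:
    """Stand-in converter: it must only ever see real codec tokens."""
    def convert_original_to_new(self, codes):
        assert not np.isin(codes, [BOS_ID, EOS_ID]).any(), "special tokens leaked in"
        return codes  # identity stand-in for the real code mapping
    def convert_new_to_original(self, codes):
        assert not np.isin(codes, [BOS_ID, EOS_ID]).any(), "special tokens leaked in"
        return codes

def prepare_for_model(raw_codes, converter):
    # The fix: convert FIRST, then add BOS/EOS inside the model class.
    converted = converter.convert_original_to_new(raw_codes)
    return np.concatenate([[BOS_ID], converted, [EOS_ID]])

def prepare_for_codec(model_codes, converter):
    # Strip BOS/EOS BEFORE converting back, so they never reach the codec.
    inner = model_codes[1:-1]
    return converter.convert_new_to_original(inner)

conv = CodecConverter()
tokens = prepare_for_model(np.array([3, 7, 11]), conv)
restored = prepare_for_codec(tokens, conv)
assert restored.tolist() == [3, 7, 11]
```

The assertions inside the converter are exactly the invariant the bug violated: special tokens must never cross the conversion boundary.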
Additionally, this PR changes the Magpie code and the codec interfaces to support bandwidth extension within the codec, where the sample rates of the input and output audio differ.
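To illustrate the interface change in spirit (all class and attribute names, sample rates, and the frame-rate arithmetic below are hypothetical, not the actual NeMo codec API): a bandwidth-extension codec encodes lower-rate audio and decodes at a higher output rate, so the two sample rates must be tracked separately.

```python
class BandwidthExtensionCodec:
    """Toy codec whose encode and decode sample rates differ."""

    def __init__(self, input_sample_rate=22050, output_sample_rate=44100):
        self.input_sample_rate = input_sample_rate
        self.output_sample_rate = output_sample_rate
        self.frame_rate = 86  # codec frames per second, shared by both sides

    def num_frames(self, num_input_samples):
        # Frame count is defined by the input-side sample rate.
        return num_input_samples * self.frame_rate // self.input_sample_rate

    def num_output_samples(self, num_frames):
        # Decoding the same frames yields audio at the higher output rate.
        return num_frames * self.output_sample_rate // self.frame_rate

codec = BandwidthExtensionCodec()
frames = codec.num_frames(22050)          # one second of input audio
assert frames == 86
assert codec.num_output_samples(frames) == 44100
```

The point is that any length arithmetic (padding, masking, duration prediction) must pick the correct sample rate for the side of the codec it refers to.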