Reconstruction:
RVQ-1:8 or VQ+RVQ-1:7
Original Speech, Original Text (Normalized): please call stella
| Tokenizer | All 1:8 | Transcription (Normalized) | WER↓ | CER↓ | SIM↑ (WavLM-TDNN) | SIM↑ (ERes2Net) | ViSQOL↑ |
|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | (audio) | please call stella | 0 | 0 | 0.99 | 0.78 | 4.04 |
| Mimi (Official ckpt) | (audio) | please call stella | 0 | 0 | 0.96 | 0.64 | 3.72 |
| FACodec (Official ckpt) | (audio) | please call stella | 0 | 0 | 0.99 | 0.76 | 4.00 |
RVQ-1 or VQ
Original Speech, Original Text (Normalized): please call stella
| Tokenizer | VQ1 | Transcription (Normalized) | WER↓ | CER↓ | SIM↑ (WavLM-TDNN) | SIM↑ (ERes2Net) | ViSQOL↑ |
|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | (audio) | please call stella | 0 | 0 | 0.88 | 0.34 | 1.96 |
| Mimi (Official ckpt) | (audio) | please hold your hand | 100 | 61.11 | 0.42 | 0.20 | 1.0 |
RVQ-1+Random or VQ+Random
Original Speech, Original Text (Normalized): please call stella
| Tokenizer | VQ+Random | Transcription (Normalized) | WER↓ | CER↓ | SIM↑ (WavLM-TDNN) | SIM↑ (ERes2Net) | ViSQOL↑ |
|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | (audio) | please subscribe | 66.67 | 55.56 | 0.54 | -0.04 | 1.0 |
| Mimi (Official ckpt) | (audio) | i am going to go to the bathroom | 100 | 100 | 0.54 | 0.21 | 1.0 |
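The page does not state how WER and CER are computed. The reported values appear consistent with a word-level and character-level edit distance divided by the reference length, expressed as a percentage and clipped at 100. That reading is an inference from the numbers in these tables, not a documented method, and the function names below are illustrative. A minimal sketch under those assumptions:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units, cap=100.0):
    """Edit distance over reference length, in percent.

    The cap at 100 is an assumption inferred from the tables.
    """
    if not ref_units:
        raise ValueError("reference must not be empty")
    return min(cap, 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units))


def wer(ref, hyp):
    """Word error rate on whitespace-split, already-normalized text."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate; spaces count as characters (assumed)."""
    return error_rate(list(ref), list(hyp))


ref = "please call stella"
print(round(wer(ref, "please subscribe"), 2), round(cer(ref, "please subscribe"), 2))
# 66.67 55.56
print(round(wer(ref, "please cool stuff"), 2), round(cer(ref, "please cool stuff"), 2))
# 66.67 33.33
```

Without the cap, a hypothesis much longer than the reference (such as "i am going to go to the bathroom") would score above 100, which is why the clipping assumption matters when comparing rows.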
Random+RVQ-2:8 or Random+RVQ1:7
Original Speech, Original Text (Normalized): please call stella
| Tokenizer | Random+RVQ1:7 | Transcription (Normalized) | WER↓ | CER↓ | SIM↑ (WavLM-TDNN) | SIM↑ (ERes2Net) | ViSQOL↑ |
|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | (audio) | блин аи черт ну блядь | 100 | 100 | 0.71 | 0.43 | 2.01 |
| Mimi (Official ckpt) | (audio) | please cool stuff | 66.67 | 33.33 | 0.90 | 0.62 | 3.70 |
One-shot Voice Conversion:
Same text, different speakers:
Source Speech, Original Text (Normalized): please call stella
Target Speech, Original Text (Normalized): please call stella
SIM (WavLM-TDNN): 0.37
SIM (ERes2Net): 0.33
| Tokenizer | Recon Speech | Transcription (Normalized) | WER↓ | CER↓ | SIM1↑ (WavLM-TDNN) | SIM2↑ (WavLM-TDNN) | SIM1↑ (ERes2Net) | SIM2↑ (ERes2Net) | ViSQOL↑ |
|---|---|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | (audio) | please cool stuff | 66.67 | 33.33 | 0.72 | 0.68 | 0.34 | 0.36 | 1.56 |
| SpeechTokenizer (RVQ-2:4) | (audio) | please call stella | 0 | 0 | 0.78 | 0.67 | 0.31 | 0.28 | 1.80 |
| Mimi (Official ckpt) | (audio) | please close the . | 100 | 50 | 0.46 | 0.89 | 0.27 | 0.58 | 1.47 |
| FACodec (Official_Timbre) | (audio) | please call stella | 0 | 0 | 0.99 | 0.46 | 0.71 | 0.41 | 3.79 |
| FACodec (Content) | (audio) | please call stella | 0 | 0 | 0.66 | 0.82 | 0.30 | 0.57 | 2.15 |
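The page does not define SIM, SIM1, or SIM2. Speaker similarity is commonly reported as the cosine similarity between speaker embeddings, and the negative ERes2Net value in an earlier table is consistent with that. The sketch below assumes cosine similarity, and further assumes that SIM1 compares the converted speech with the source speaker and SIM2 with the target speaker. The toy vectors are made up for illustration; real embeddings would come from the WavLM-TDNN or ERes2Net speaker models, which are not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical 2-D "embeddings" standing in for real speaker vectors.
source_emb = [1.0, 0.0]
target_emb = [0.0, 1.0]
converted_emb = [0.6, 0.8]

sim1 = cosine_similarity(converted_emb, source_emb)  # assumed: vs. source speaker
sim2 = cosine_similarity(converted_emb, target_emb)  # assumed: vs. target speaker
print(f"SIM1={sim1:.2f} SIM2={sim2:.2f}")
```

Under this reading, a successful conversion would show SIM2 higher than SIM1, since the output should sound like the target speaker rather than the source.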
Different text, different speakers (1):
Source Speech, Original Text: Please call Stella.
Target Speech, Original Text: Ask her to bring these things with her from the store.
SIM (WavLM-TDNN): 0.55
SIM (ERes2Net): 0.35
| Tokenizer | Recon Speech | Transcription (Normalized) | WER↓ | CER↓ | SIM1↑ (WavLM-TDNN) | SIM2↑ (WavLM-TDNN) | SIM1↑ (ERes2Net) | SIM2↑ (ERes2Net) | ViSQOL↑ |
|---|---|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | (audio) | please cool stuff | 66.67 | 33.33 | 0.59 | 0.75 | 0.32 | 0.52 | 1.20 |
| SpeechTokenizer (RVQ-2:4) | (audio) | please call stella | 0 | 0 | 0.84 | 0.63 | 0.24 | 0.39 | 1.54 |
| Mimi (Official ckpt) | (audio) | i will start by bringing these things | 100 | 100 | 0.40 | 0.85 | 0.26 | 0.63 | 1.0 |
| FACodec (Official_Timbre) | (audio) | please call stella | 0 | 0 | 0.99 | 0.62 | 0.69 | 0.45 | 3.71 |
| FACodec (Content) | (audio) | please call stella | 0 | 0 | 0.57 | 0.84 | 0.38 | 0.55 | 2.30 |
Different text, different speakers (2):
Source Speech, Original Text: Ask her to bring these things with her from the store.
Target Speech, Original Text: We also need a small plastic snake and a big toy frog for the kids.
SIM (WavLM-TDNN): 0.63
SIM (ERes2Net): 0.24
| Tokenizer | Recon Speech | Transcription (Normalized) | WER↓ | CER↓ | SIM1↑ (WavLM-TDNN) | SIM2↑ (WavLM-TDNN) | SIM1↑ (ERes2Net) | SIM2↑ (ERes2Net) | ViSQOL↑ |
|---|---|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | (audio) | ask her to bring these things from the store | 18.18 | 16.98 | 0.44 | 0.70 | 0.34 | 0.56 | 1.70 |
| SpeechTokenizer (RVQ-2:4) | (audio) | we asked her to bring these things from the store | 36.36 | 26.42 | 0.56 | 0.78 | 0.39 | 0.60 | 1.76 |
| Mimi (Official ckpt) | (audio) | we also need a small plastic snake and a big toy | 100 | 77.36 | 0.60 | 0.97 | 0.26 | 0.77 | 1.26 |
| FACodec (Official_Timbre) | (audio) | ask her to bring these things with her from the store | 0 | 0 | 0.84 | 0.79 | 0.49 | 0.41 | 3.38 |
| FACodec (Content) | (audio) | ask her to bring these things with her from the store | 0 | 0 | 0.61 | 0.63 | 0.40 | 0.58 | 1.93 |
Different text, different speakers (3):
Source Speech, Original Text: The rainbow is a division of white light into many beautiful colors.
Target Speech, Original Text: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.
SIM (WavLM-TDNN): 0.82
SIM (ERes2Net): 0.49
| Tokenizer | Recon Speech | Transcription (Normalized) | WER↓ | CER↓ | SIM1↑ (WavLM-TDNN) | SIM2↑ (WavLM-TDNN) | SIM1↑ (ERes2Net) | SIM2↑ (ERes2Net) | ViSQOL↑ |
|---|---|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | (audio) | the rain pours into a shelf while laying to many beautiful colors | 66.67 | 37.31 | 0.83 | 0.91 | 0.38 | 0.64 | 2.26 |
| SpeechTokenizer (RVQ-2:4) | (audio) | the rainbows into a sniff void light into many beautiful colors | 41.67 | 23.88 | 0.83 | 0.86 | 0.42 | 0.58 | 2.41 |
| Mimi (Official ckpt) | (audio) | 6 spoons of fresh snow peas 5 thick subs and | 100 | 80.60 | 0.88 | 0.97 | 0.52 | 0.76 | 1.20 |
| FACodec (Official_Timbre) | (audio) | the rainbow is a division of white light into many beautiful colors | 0 | 0 | 0.97 | 0.83 | 0.83 | 0.62 | 4.40 |
| FACodec (Content) | (audio) | the rainbow is a division of white light into many beautiful colors | 0 | 0 | 0.90 | 0.85 | 0.60 | 0.77 | 3.02 |