Reconstruction:

RVQ-1:8 or VQ+RVQ-1:7

Original Speech Original Text (Normalized): please call stella

| Tokenizer | All 1:8 Transcription (Normalized) | WER↓ | CER↓ | SIM↑ (WavLM-TDNN) | SIM↑ (ERes2Net) | VISQOL↑ |
|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | please call stella | 0 | 0 | 0.99 | 0.78 | 4.04 |
| Mimi (Official ckpt) | please call stella | 0 | 0 | 0.96 | 0.64 | 3.72 |
| FACodec (Official ckpt) | please call stella | 0 | 0 | 0.99 | 0.76 | 4.00 |

RVQ-1 or VQ

Original Speech Original Text (Normalized): please call stella

| Tokenizer | VQ1 Transcription (Normalized) | WER↓ | CER↓ | SIM↑ (WavLM-TDNN) | SIM↑ (ERes2Net) | VISQOL↑ |
|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | please call stella | 0 | 0 | 0.88 | 0.34 | 1.96 |
| Mimi (Official ckpt) | please hold your hand | 100 | 61.11 | 0.42 | 0.20 | 1.0 |

RVQ-1+Random or VQ+Random

Original Speech Original Text (Normalized): please call stella

| Tokenizer | VQ+Random Transcription (Normalized) | WER↓ | CER↓ | SIM↑ (WavLM-TDNN) | SIM↑ (ERes2Net) | VISQOL↑ |
|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | please subscribe | 66.67 | 55.56 | 0.54 | -0.04 | 1.0 |
| Mimi (Official ckpt) | i am going to go to the bathroom | 100 | 100 | 0.54 | 0.21 | 1.0 |

Random+RVQ-2:8 or Random+RVQ1:7

Original Speech Original Text (Normalized): please call stella

| Tokenizer | Random+RVQ1:7 Transcription (Normalized) | WER↓ | CER↓ | SIM↑ (WavLM-TDNN) | SIM↑ (ERes2Net) | VISQOL↑ |
|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | блин аи черт ну блядь | 100 | 100 | 0.71 | 0.43 | 2.01 |
| Mimi (Official ckpt) | please cool stuff | 66.67 | 33.33 | 0.90 | 0.62 | 3.70 |
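
The page does not say which tool produced the WER and CER columns. As a minimal sketch, assuming both are plain Levenshtein edit distances divided by the reference length (words for WER, characters including spaces for CER), the Mimi row of the Random+RVQ1:7 table can be reproduced as follows. The function names are illustrative, not taken from the original evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference word count."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate in percent; spaces count as characters (assumption)."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


reference = "please call stella"
hypothesis = "please cool stuff"
print(round(wer(reference, hypothesis), 2))  # 66.67 (2 of 3 words differ)
print(round(cer(reference, hypothesis), 2))  # 33.33 (6 edits over 18 characters)
```

Counting spaces in the CER denominator is an assumption. It is consistent with the reported values, which are multiples of 1/18 for the 18-character reference "please call stella".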

One-shot Voice Conversion:

Same text, different speaker:

Source Speech Original Text (Normalized): please call stella

Target Speech Original Text (Normalized): please call stella

SIM (WavLM-TDNN): 0.37

SIM (ERes2Net): 0.33

| Tokenizer | Recon Speech Transcription (Normalized) | WER↓ | CER↓ | SIM1↑ (WavLM-TDNN) | SIM2↑ (WavLM-TDNN) | SIM1↑ (ERes2Net) | SIM2↑ (ERes2Net) | VISQOL↑ |
|---|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | please cool stuff | 66.67 | 33.33 | 0.72 | 0.68 | 0.34 | 0.36 | 1.56 |
| SpeechTokenizer (RVQ-2:4) | please call stella | 0 | 0 | 0.78 | 0.67 | 0.31 | 0.28 | 1.80 |
| Mimi (Official ckpt) | please close the . | 100 | 50 | 0.46 | 0.89 | 0.27 | 0.58 | 1.47 |
| FACodec (Official_Timbre) | please call stella | 0 | 0 | 0.99 | 0.46 | 0.71 | 0.41 | 3.79 |
| FACodec (Content) | please call stella | 0 | 0 | 0.66 | 0.82 | 0.30 | 0.57 | 2.15 |

Different text, different speaker (1):

Source Speech Original Text: Please call Stella.

Target Speech Original Text: Ask her to bring these things with her from the store.

SIM (WavLM-TDNN): 0.55

SIM (ERes2Net): 0.35

| Tokenizer | Recon Speech Transcription (Normalized) | WER↓ | CER↓ | SIM1↑ (WavLM-TDNN) | SIM2↑ (WavLM-TDNN) | SIM1↑ (ERes2Net) | SIM2↑ (ERes2Net) | VISQOL↑ |
|---|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | please cool stuff | 66.67 | 33.33 | 0.59 | 0.75 | 0.32 | 0.52 | 1.20 |
| SpeechTokenizer (RVQ-2:4) | please call stella | 0 | 0 | 0.84 | 0.63 | 0.24 | 0.39 | 1.54 |
| Mimi (Official ckpt) | i will start by bringing these things | 100 | 100 | 0.40 | 0.85 | 0.26 | 0.63 | 1.0 |
| FACodec (Official_Timbre) | please call stella | 0 | 0 | 0.99 | 0.62 | 0.69 | 0.45 | 3.71 |
| FACodec (Content) | please call stella | 0 | 0 | 0.57 | 0.84 | 0.38 | 0.55 | 2.30 |

Different text, different speaker (2):

Source Speech Original Text: Ask her to bring these things with her from the store.

Target Speech Original Text: We also need a small plastic snake and a big toy frog for the kids.

SIM (WavLM-TDNN): 0.63

SIM (ERes2Net): 0.24

| Tokenizer | Recon Speech Transcription (Normalized) | WER↓ | CER↓ | SIM1↑ (WavLM-TDNN) | SIM2↑ (WavLM-TDNN) | SIM1↑ (ERes2Net) | SIM2↑ (ERes2Net) | VISQOL↑ |
|---|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | ask her to bring these things from the store | 18.18 | 16.98 | 0.44 | 0.70 | 0.34 | 0.56 | 1.70 |
| SpeechTokenizer (RVQ-2:4) | we asked her to bring these things from the store | 36.36 | 26.42 | 0.56 | 0.78 | 0.39 | 0.60 | 1.76 |
| Mimi (Official ckpt) | we also need a small plastic snake and a big toy | 100 | 77.36 | 0.60 | 0.97 | 0.26 | 0.77 | 1.26 |
| FACodec (Official_Timbre) | ask her to bring these things with her from the store | 0 | 0 | 0.84 | 0.79 | 0.49 | 0.41 | 3.38 |
| FACodec (Content) | ask her to bring these things with her from the store | 0 | 0 | 0.61 | 0.63 | 0.40 | 0.58 | 1.93 |

Different text, different speaker (3):

Source Speech Original Text: The rainbow is a division of white light into many beautiful colors.

Target Speech Original Text: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

SIM (WavLM-TDNN): 0.82

SIM (ERes2Net): 0.49

| Tokenizer | Recon Speech Transcription (Normalized) | WER↓ | CER↓ | SIM1↑ (WavLM-TDNN) | SIM2↑ (WavLM-TDNN) | SIM1↑ (ERes2Net) | SIM2↑ (ERes2Net) | VISQOL↑ |
|---|---|---|---|---|---|---|---|---|
| SpeechTokenizer (Official ckpt) | the rain pours into a shelf while laying to many beautiful colors | 66.67 | 37.31 | 0.83 | 0.91 | 0.38 | 0.64 | 2.26 |
| SpeechTokenizer (RVQ-2:4) | the rainbows into a sniff void light into many beautiful colors | 41.67 | 23.88 | 0.83 | 0.86 | 0.42 | 0.58 | 2.41 |
| Mimi (Official ckpt) | 6 spoons of fresh snow peas 5 thick subs and | 100 | 80.60 | 0.88 | 0.97 | 0.52 | 0.76 | 1.20 |
| FACodec (Official_Timbre) | the rainbow is a division of white light into many beautiful colors | 0 | 0 | 0.97 | 0.83 | 0.83 | 0.62 | 4.40 |
| FACodec (Content) | the rainbow is a division of white light into many beautiful colors | 0 | 0 | 0.90 | 0.85 | 0.60 | 0.77 | 3.02 |
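
The page does not state how the SIM columns are computed. A common choice for speaker-verification embeddings, and the assumption made in this sketch, is the cosine similarity between the embedding of the evaluated speech and that of a reference utterance. The vectors below are toy values standing in for WavLM-TDNN or ERes2Net embeddings, and extracting real embeddings is outside the scope of the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional embeddings; real speaker embeddings
# typically have a few hundred dimensions.
emb_reference = [0.2, 0.1, -0.4, 0.9]
emb_converted = [0.1, 0.3, -0.5, 0.7]

print(round(cosine_similarity(emb_reference, emb_converted), 2))
```

Cosine similarity ranges from -1 to 1, which is consistent with the slightly negative ERes2Net value (-0.04) reported in the VQ+Random table.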