Speech-to-Speech Translation (S2ST) converts speech in one language into semantically equivalent speech in another language, facilitating communication between speakers of different languages. Speech-to-Discrete Unit Translation (S2UT), a mainstream approach to end-to-end S2ST, addresses the error propagation across modules and the slow inference often encountered in traditional cascade systems. However, because discrete units primarily capture content information, conventional S2UT methods fail to retain speaker-specific characteristics from the source. Our previous work, SC-S2UT, introduced a speaker adapter and a unit-to-mel structure, enabling the preservation of speaker information and non-autoregressive speech generation. Building on this foundation, this study proposes a self-supervised pretraining method to enrich the information extracted by both the speaker adapter and the unit-to-mel structure. Additionally, we investigate different feature fusion strategies to further improve the integration of speaker and content features. Experiments on the CVSS-T dataset for the ES-EN and FR-EN tasks demonstrate that the proposed method improves BLEU by 1.14 over SC-S2UT, along with significant improvements in MOS and speaker similarity. Furthermore, our approach achieves translation quality comparable to conventional S2UT while maintaining high speaker similarity, at a cost of only 0.04 s of additional inference time per utterance. These results validate the effectiveness of the proposed method.
Figure 1: Speaker Retention Unit-to-Mel based Speaker Consistency S2UT System Workflow Overview
Figure 2: Workflow of the self-supervised pretraining and fine-tuning stages
| Sample | Ground Truth | | Translation | | | | |
|---|---|---|---|---|---|---|---|
| | Source (French) | Target (English) | S2ST | SC_S2ST | Pretrain_SC_S2ST | ASR+MT+TTS | S2UT+FreeVC |
| Sample1 | | | | | | | |
| Reference | Cette maladie est assez fréquente pour se voir régulièrement en médecine générale | this sickness is frequent enough to be seen regularly in general medicine | | | | | |
| ASR | | this sickness is frequent enough to be seen regularly in general medicine | his disease is quite frequent to resume a general medicine | this disease is quite frequent to resume a general medicine | this disease is quite frequent to resume a general medicine | his disease is common enough to be relegated to general medicine | this disease is quite frequent to resume a general medicine |
| Sample2 | | | | | | | |
| Reference | Comme ces structures, les avasa étaient établis en dehors de la ville | like these structures the avasa were settled outside the city | | | | | |
| ASR | | like these structures the avaso were settled outside the city | like his lectures a basa was established in the city | like his structures a buzz that was established in this city | like this structures a basa was established in the city | as these ruptures the abysses were established outside a city | like his lectures a bass on was established in the city |
| Sample3 | | | | | | | |
| Reference | À cette époque, seules les parties ouest et nord faisaient partie du comté | at that time only the western and northern parts were included in the county | | | | | |
| ASR | | at that time only the western and northern parts were included in the county | at that time only the west and northern parts were part of the county | at that time only the western northern parts were part of the county | at that time only the western northern parts were part of the county | at that time only the western and northern parts were part of the county | at that time only the western northern parts were part of the county |
| Sample4 | | | | | | | |
| Reference | Il est rapatrié le et reprend son métier d'instituteur | he is repatriated and resumes his profession as a schoolteacher | | | | | |
| ASR | | he is repaciated and resumes his profession as a school teacher | he was repaidriate and takes his institutor | he was repatriot and takes his instituta | he was repayed and takes his instituta | he is repetriate and takes up his job as a teacher | he was repaidriate and takes his institutor |
| Sample5 | | | | | | | |
| Reference | Ce sentiment règne en moi, malgré moi. | this feeling reigns in me despite myself | | | | | |
| ASR | | this feeling reigns in me despite myself | this feeling reigned in me despite to me | this feeling reigned in me despite to me | this feeling reigned in me despite to me | feeling rains me despite me | this feeling reigned in me despite to me |

| Sample | Ground Truth | | Translation | | | | |
|---|---|---|---|---|---|---|---|
| | Source (Spanish) | Target (English) | S2ST | SC_S2ST | Pretrain_SC_S2ST | ASR+MT+TTS | S2UT+FreeVC |
| Sample1 | | | | | | | |
| Reference | El estilo original de la catedral se conserva sobre todo en su fachada principal | the original style of the cathedral is preserved above all in its main facade | | | | | |
| ASR | | the original style of the cathedral is preserved above all in its main foresawed | the original style of the cathedral kept on his main facade | the original style of the cathedral kept on his main for sud | the original style of the cathedral kept on his main facade | the original style of the cathedral is preserved above all in its main for sale | the original style of the cathedral kept on his main facade |
| Sample2 | | | | | | | |
| Reference | No posee un olor intenso y su textura es húmeda | it does not possess an intense smell and its texture is humid | | | | | |
| ASR | | it does not possess an intense smell and its texture is humid | it does not have an intense olar and its texture is human | it does not have an intense solar and its texture is human | it does not have an intense solar and its texture is human | does not have an intense smell and its texture is moist | it does not have an intense olor in its texture as human |
| Sample3 | | | | | | | |
| Reference | Buena parte de sus miembros son católicos tradicionalistas | a big part of its members are traditionalist catholics | | | | | |
| ASR | | a big part of its members are traditionalist catholics | a good part of his members or traditionalist catholics | a good part of his members are traditionalist catholics | a good part of his members are traditionalist catholics | many of its members are traditionalist catholics | a good part of his members or traditionalist catholics |
| Sample4 | | | | | | | |
| Reference | El campeonato está dividido en dos, torneos Apertura y Clausura | the championship is divided in two opening and closing tournaments | | | | | |
| ASR | | the championship is divided in two opening and closing tournaments | the championship is divided into two tournaments opening in closure | the championship is divided into two termediments opening inclosure | the championship is divided into two terminments opening inclosure | the championship is divided into two opening and closing tournaments | the championship is divided into two tournaments opening inclosure |
| Sample5 | | | | | | | |
| Reference | Pertenecían a la comunidad cristiana de Sevilla, liderada por el obispo Sabino. | they belonged to the christian community of seville lead by the bishop sabino | | | | | |
| ASR | | they belong to the christian community of savil lead by the bishap savino | they belong to the christian community of savil leadring by the bishop sabino | they belong to the christian community of savil lead by the bishop savino | they belong to the christian community of savil lead by the bishop savino | belonging to the christian settled a community liberated by the sabbin of bishop | they belong to the christian community of savil lead by the biship savino |
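
The ASR rows in the tables above appear to be transcripts of the generated speech, so each system can be compared with the reference text at the word level. As a rough illustration only, the sketch below computes a word-level edit distance and the resulting word error rate for two transcripts from French Sample3. This is a simpler stand-in and not the BLEU metric reported in the abstract; the function names `word_edit_distance` and `word_error_rate` are hypothetical and do not come from the paper.

```python
def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance counted over whitespace-separated words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(reference, hypothesis) / n_ref


# Transcripts copied from French Sample3 in the table above.
reference = ("at that time only the western and northern parts "
             "were included in the county")
cascade = ("at that time only the western and northern parts "
           "were part of the county")          # ASR+MT+TTS column
pretrain_sc = ("at that time only the western northern parts "
               "were part of the county")      # Pretrain_SC_S2ST column

for name, hyp in [("ASR+MT+TTS", cascade), ("Pretrain_SC_S2ST", pretrain_sc)]:
    print(name, word_edit_distance(reference, hyp),
          round(word_error_rate(reference, hyp), 3))
```

On this sample the cascade transcript differs from the reference by two substitutions, while the Pretrain_SC_S2ST transcript additionally drops "and". A single sentence like this says nothing about corpus-level quality, which is what the reported BLEU scores measure.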