Zero-shot voice conversion based on feature disentanglement

doi:10.1016/j.specom.2024.103143

Authors	Na Guo 1; Jianguo Wei 1; Yongwei Li 2; Wenhuan Lu 1; Jianhua Tao 3
First author	Na Guo
Corresponding author	Li, Yongwei ([email protected])
Corresponding author email	[email protected] (Y. Li)
Abstract	Voice conversion (VC) aims to convert the voice of a source speaker to that of a target speaker without modifying the linguistic content. Zero-shot voice conversion has attracted significant attention because it can convert speech for speakers who did not appear during the training stage. Despite the significant progress made by previous zero-shot VC methods, there is still room for improvement in separating speaker information from content information. In this paper, we propose a zero-shot VC method based on feature disentanglement. The proposed model uses a speaker encoder to extract speaker embeddings, introduces mixed speaker layer normalization to eliminate residual speaker information in the content encoding, and employs adaptive attention weight normalization for conversion. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring only a small number of parameters. The experiments demonstrate that the performance of the proposed model is superior to that of several state-of-the-art models, achieving both high similarity with the target speaker and high intelligibility. In addition, the decoding speed of our model is much higher than that of the existing state-of-the-art models.
Keywords	Zero-shot voice conversion; Mixed speaker layer normalization; Adaptive attention weight normalization; Dynamic convolution
Publication year	2024
Language	English
DOI	10.1016/j.specom.2024.103143
Journal	Speech Communication
ISSN	0167-6393
Volume	165 Pages: 10
Journal article type	Review
Indexed by	EI
Funding project	National Key R&D Program of China [2023YFB2603902]; Tianjin Science and Technology Program [21JCZXJC00190]; National Natural Science Foundation of China [62201571]
Publisher	ELSEVIER
WOS keywords	SPARSE REPRESENTATION; ADAPTATION; SPEAKER
WOS research area	Acoustics; Computer Science
WOS category	Acoustics; Computer Science, Interdisciplinary Applications
WOS accession number	WOS:001340314300001
Funding organization	National Key R&D Program of China; Tianjin Science and Technology Program; National Natural Science Foundation of China
Document type	Journal article
Item identifier	http://ir.psych.ac.cn/handle/311026/48789
Collection	CAS Key Laboratory of Behavioral Science
Author affiliations	1. College of Intelligence and Computing, Tianjin University, Tianjin, China; 2. CAS Key Laboratory of Behavioral Science, Institute of Psychology, Chinese Academy of Sciences, Beijing, China; 3. Department of Automation, Tsinghua University, Beijing, China
Recommended citation (GB/T 7714)	Na Guo, Jianguo Wei, Yongwei Li, et al. Zero-shot voice conversion based on feature disentanglement[J]. Speech Communication, 2024, 165: 10.
APA	Na Guo, Jianguo Wei, Yongwei Li, Wenhuan Lu, & Jianhua Tao. (2024). Zero-shot voice conversion based on feature disentanglement. Speech Communication, 165, 10.
MLA	Na Guo, et al. "Zero-shot voice conversion based on feature disentanglement". Speech Communication 165 (2024): 10.

Files in this item
File name/size	Document type	Version type	Access type	License
Zero-shot voice conv (1881KB)	Journal article	Published version	Restricted access	CC BY-NC-SA