Institutional Repository of Key Laboratory of Behavioral Science, CAS
Zero-shot voice conversion based on feature disentanglement | |
Na Guo1; Jianguo Wei1; Yongwei Li2; Wenhuan Lu1; Jianhua Tao3 | |
First author | Na Guo |
Corresponding author | Li, Yongwei ([email protected]) |
Corresponding author email | [email protected] (Y. Li) |
Abstract | Voice conversion (VC) aims to convert the voice of a source speaker to that of a target speaker without modifying the linguistic content. Zero-shot voice conversion has attracted significant attention because it can perform conversion for speakers who did not appear during the training stage. Despite the significant progress made by previous zero-shot VC methods, there is still room for improvement in separating speaker information from content information. In this paper, we propose a zero-shot VC method based on feature disentanglement. The proposed model uses a speaker encoder to extract speaker embeddings, introduces mixed speaker layer normalization to eliminate residual speaker information in the content encoding, and employs adaptive attention weight normalization for conversion. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring only a small number of parameters. Experiments demonstrate that the proposed model outperforms several state-of-the-art models, achieving both high similarity to the target speaker and high intelligibility. In addition, the decoding speed of our model is much higher than that of existing state-of-the-art models. |
Keywords | Zero-shot voice conversion; Mixed speaker layer normalization; Adaptive attention weight normalization; Dynamic convolution |
Publication year | 2024 |
Language | English |
DOI | 10.1016/j.specom.2024.103143 |
Journal | Speech Communication |
ISSN | 0167-6393 |
Volume | 165; Pages: 10 |
Journal article type | Review |
Indexed by | EI |
Funding projects | National Key R&D Program of China[2023YFB2603902] ; Tianjin Science and Technology Program[21JCZXJC00190] ; National Natural Science Foundation of China[62201571] |
Publisher | ELSEVIER |
WOS keywords | SPARSE REPRESENTATION ; ADAPTATION ; SPEAKER |
WOS research areas | Acoustics ; Computer Science |
WOS categories | Acoustics ; Computer Science, Interdisciplinary Applications |
WOS accession number | WOS:001340314300001 |
Funding organizations | National Key R&D Program of China ; Tianjin Science and Technology Program ; National Natural Science Foundation of China |
Document type | Journal article |
Item identifier | http://ir.psych.ac.cn/handle/311026/48789 |
Collection | Key Laboratory of Behavioral Science, Chinese Academy of Sciences |
Author affiliations | 1. College of Intelligence and Computing, Tianjin University, Tianjin, China; 2. CAS Key Laboratory of Behavioral Science, Institute of Psychology, Chinese Academy of Sciences, Beijing, China; 3. Department of Automation, Tsinghua University, Beijing, China |
Recommended citation (GB/T 7714) | Na Guo,Jianguo Wei,Yongwei Li,et al. Zero-shot voice conversion based on feature disentanglement[J]. Speech Communication,2024,165:10. |
APA | Na Guo,Jianguo Wei,Yongwei Li,Wenhuan Lu,&Jianhua Tao.(2024).Zero-shot voice conversion based on feature disentanglement.Speech Communication,165,10. |
MLA | Na Guo,et al."Zero-shot voice conversion based on feature disentanglement".Speech Communication 165(2024):10. |
Files in this item |
File name/size | Document type | Version | Access | License |
Zero-shot voice conv(1881KB) | Journal article | Published version | Restricted access | CC BY-NC-SA |