2/20/2024  Towards audio language modeling - an overview
Authors: Wu et al.
Citations:
Link: https://arxiv.org/pdf/2402.13236.pdf

1/31/2024  EVA-GAN: Enhanced Various Audio Generation via Scalable GANs
Authors: Nvidia
Citations:
Link: https://arxiv.org/pdf/2402.00892.pdf
Link 2: https://double-blind-eva-gan.cc/

1/30/2024  MusicGen: Simple and Controllable Music Generation
Authors: Meta: Copet et al.
Citations:
Link: https://arxiv.org/pdf/2306.05284.pdf

1/5/2024  M2UGen: Multi-modal Music Understanding and Generation with the Power of LLMs
Authors: Hussain et al.
Citations:
Link: https://arxiv.org/pdf/2311.11255.pdf

1/5/2024  Pheme: Efficient and Conversational Speech Generation
Authors: Poly AI: Paweł Budzianowski, Taras Sereda, Tomasz Cichy, Ivan Vulic
Citations:
Link: https://arxiv.org/pdf/2401.02839.pdf

1/5/2024  Towards ASR Robust Spoken Language Understanding Through In-Context Learning with Word Confusion Networks
Authors: Amazon: Kevin Everson et al.
Citations:
Link: https://arxiv.org/abs/2401.02921

11/26/2023  WavJourney: Compositional Audio Creation with LLMs
Authors: Liu et al.
Citations:
Link:

10/23/2023  Mousai: Efficient Text-to-Music Diffusion Models
Authors: Schneider et al.
Citations:
Link: https://arxiv.org/pdf/2301.11757.pdf

10/12/2023  PromptTTS 2: Describing and Generating Voices with Text Prompt
Authors: Microsoft: Leng et al.
Citations:
Link: https://arxiv.org/pdf/2309.02285.pdf

9/21/2023  ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models
Authors: Baidu: Zhu et al.
Citations:
Link: https://arxiv.org/pdf/2302.04456.pdf

9/9/2023  AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
Authors: Liu et al.
Citations:
Link: https://arxiv.org/pdf/2301.12503.pdf

8/14/2023  SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
Authors: Microsoft: Wang et al.
Citations:
Link: https://arxiv.org/pdf/2308.06873.pdf

7/31/2023  VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design
Authors: SK Telecom: Jungil Kong, Jihoon Park, Beomjeong Kim, Jeongmin Kim, Dohee Kong, Sangjin Kim
Citations:
Link: https://arxiv.org/abs/2307.16430

7/8/2023  Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
Authors: Su et al.
Citations:
Link: https://arxiv.org/pdf/2303.16897.pdf

6/25/2023  InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt
Authors: Yang et al.
Citations:
Link: https://arxiv.org/pdf/2301.13662.pdf

5/30/2023  NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
Authors: Microsoft: Shen et al.
Citations:
Link: https://arxiv.org/pdf/2304.09116.pdf

5/29/2023  Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
Authors: Huang et al.
Citations:
Link: https://arxiv.org/pdf/2305.18474.pdf

5/25/2023  Efficient Neural Music Generation
Authors: ByteDance: Lam et al.
Citations:
Link: https://arxiv.org/pdf/2305.15719.pdf

5/23/2023  Better speech synthesis through scaling
Authors: James Betker
Citations:
Link: https://arxiv.org/abs/2305.07243

5/3/2023  Diverse and Vivid Sound Generation from Text Descriptions
Authors: Li et al.
Citations:
Link: https://arxiv.org/pdf/2305.01980.pdf

4/24/2023  TANGO: Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
Authors: Ghosal et al.
Citations:
Link: https://arxiv.org/abs/2304.13731

4/5/2023  AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models
Authors: Microsoft: Wang et al.
Citations:
Link: https://arxiv.org/pdf/2304.00830.pdf

3/8/2023  FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model
Authors: Xue et al.
Citations:
Link: https://arxiv.org/pdf/2303.02939v3.pdf

3/7/2023  Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (VALL-E X)
Authors: Microsoft: Zhang et al.
Citations:
Link: https://arxiv.org/pdf/2303.03926.pdf

3/6/2023  Noise2Music: Text-conditioned Music Generation with Diffusion Models
Authors: Google: Huang et al.
Citations:
Link: https://arxiv.org/pdf/2302.03917.pdf

2/7/2023  Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
Authors: Google: Kharitonov et al.
Citations:
Link: https://arxiv.org/abs/2302.03540

1/30/2023  SingSong: Generating musical accompaniments from singing
Authors: Google: Donahue et al.
Citations:
Link: https://arxiv.org/pdf/2301.12662.pdf

1/30/2023  Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
Authors: Huang et al.
Citations:
Link: https://arxiv.org/abs/2301.12661

1/30/2023  ArchiSound: Audio Generation with Diffusion
Authors: Flavio Schneider
Citations:
Link: https://arxiv.org/abs/2301.13267

1/26/2023  MusicLM: Generating Music From Text
Authors: Google: Agostinelli et al.
Citations:
Link: https://arxiv.org/pdf/2301.11325.pdf

1/5/2023  Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (Vall-E)
Authors: Microsoft: Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei
Citations:
Link: https://arxiv.org/abs/2301.02111

11/22/2022  PromptTTS: Controllable TTS with Text Descriptions
Authors: Guo et al.
Citations:
Link: https://arxiv.org/pdf/2211.12171.pdf

7/20/2022  Diffsound: Discrete Diffusion Model for Text-to-sound Generation
Authors: Yang et al.
Citations:
Link: https://arxiv.org/pdf/2207.09983v1.pdf

4/20/2022  ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers
Authors: Qian et al.
Citations:
Link: https://arxiv.org/abs/2204.09224

3/30/2022  Generative Spoken Dialogue Language Modeling
Authors: Meta: Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux
Citations:
Link: https://arxiv.org/abs/2203.16502

11/3/2021  A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
Authors: Ubisoft: van Niekerk et al.
Citations:
Link: https://arxiv.org/abs/2111.02392

5/13/2021  Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
Authors: Huawei: Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov
Citations:
Link: https://arxiv.org/abs/2105.06337

3/4/2021  Perceiver: General Perception with Iterative Attention
Authors: Google: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
Citations:
Link: https://arxiv.org/abs/2103.03206

10/12/2020  HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
Authors: Kakao: Kong et al.
Citations: 1234
Link: https://arxiv.org/abs/2010.05646

6/8/2020  FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
Authors: Microsoft: Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
Citations:
Link: https://arxiv.org/abs/2006.04558

5/22/2020  Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Authors: Kakao: Jaehyeon Kim, Sungwon Kim, Jungil Kong, Sungroh Yoon
Citations:
Link: https://arxiv.org/abs/2005.11129

5/12/2020  Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
Authors: Nvidia: Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro
Citations:
Link: https://arxiv.org/abs/2005.05957

5/12/2020  AdaDurIAN: Few-shot Adaptation for Neural Text-to-Speech with DurIAN
Authors: Tencent: Zewang Zhang et al.
Citations:
Link: https://arxiv.org/abs/2005.05642

2/4/2020  Boffin TTS: Few-Shot Speaker Adaptation by Bayesian Optimization
Authors: Amazon: Henry Moss et al.
Citations:
Link: https://arxiv.org/abs/2002.01953

2020  Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
Authors: Yamamoto et al.
Citations:
Link:

10/23/2019  Zero-Shot Multi-Speaker Text-to-Speech with State-of-the-art Neural Speaker Embeddings
Authors: Erica Cooper et al.
Citations:
Link: https://arxiv.org/abs/1910.10838

10/8/2019  MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
Authors: Mila, Lyrebird: Kumar et al.
Citations: 881
Link: https://arxiv.org/abs/1910.06711

5/22/2019  FastSpeech: Fast, Robust and Controllable Text to Speech
Authors: Microsoft: Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
Citations:
Link: https://arxiv.org/abs/1905.09263

5/2/2019  High quality, lightweight and adaptable TTS using LPCNet
Authors: IBM: Zvi Kons et al.
Citations:
Link: https://arxiv.org/abs/1905.00590

11/24/2018  Representation Mixing for TTS Synthesis
Authors: Mila: Kyle Kastner et al.
Citations:
Link: https://arxiv.org/abs/1811.07240

10/31/2018  WaveGlow: A Flow-based Generative Network for Speech Synthesis
Authors: Nvidia: Ryan Prenger, Rafael Valle, Bryan Catanzaro
Citations:
Link: https://arxiv.org/abs/1811.00002

10/12/2018  Neural Voice Cloning with a Few Samples
Authors: Baidu: Sercan Arik et al.
Citations:
Link: https://arxiv.org/abs/1802.06006

9/27/2018  Sample Efficient Adaptive Text-to-Speech
Authors: Google: Chen et al.
Citations:
Link: https://arxiv.org/abs/1809.10460

9/19/2018  Neural Speech Synthesis with Transformer Network
Authors: Microsoft: Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou
Citations:
Link: https://arxiv.org/abs/1809.08895

6/12/2018  Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Authors: Google: Ye Jia et al.
Citations:
Link: https://arxiv.org/abs/1806.04558

3/24/2018  Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Authors: Google: RJ Skerry-Ryan et al.
Citations:
Link: https://arxiv.org/abs/1803.09047

3/23/2018  Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Authors: Google: Yuxuan Wang et al.
Citations:
Link: https://arxiv.org/abs/1803.09017

2/23/2018  Efficient Neural Audio Synthesis
Authors: Google: Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, Koray Kavukcuoglu
Citations: 908
Link: https://arxiv.org/abs/1802.08435

12/16/2017  Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (Tacotron 2)
Authors: Google: Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu
Citations: 2831
Link: https://arxiv.org/abs/1712.05884

10/28/2017  Generalized End-to-End Loss for Speaker Verification
Authors: Google: Li Wan et al.
Citations:
Link: https://arxiv.org/abs/1710.10467

10/28/2017  Speaker diarization with LSTM
Authors: Google: Quan Wang et al.
Citations:
Link: https://arxiv.org/abs/1710.10468

5/5/2017  Deep Speaker: an End-to-End Neural Speaker Embedding System
Authors: Baidu: Chao Li et al.
Citations:
Link: https://arxiv.org/abs/1705.02304

2017  Char2Wav: End-To-End Speech Synthesis
Authors: Mila: Jose Sotelo et al.
Citations:
Link: https://mila.quebec/wp-content/uploads/2017/02/end-end-speech.pdf

2017  Deep Neural Network Embeddings for Text-Independent Speaker Verification
Authors: Snyder et al.
Citations:
Link: https://www.danielpovey.com/files/2017_interspeech_embeddings.pdf

2017  Non-parallel voice conversion using i-vector PLDA: towards unifying speaker verification and transformation
Authors: Kinnunen et al.
Citations: 28
Link: https://ieeexplore.ieee.org/document/7953215

9/12/2016  WaveNet: A Generative Model for Raw Audio
Authors: Google: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu
Citations:
Link: https://arxiv.org/abs/1609.03499

12/17/2014  Deep Speech: Scaling up end-to-end speech recognition
Authors: Baidu: Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng
Citations:
Link: https://arxiv.org/abs/1412.5567

6/3/2014  Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Authors: Umontreal: Cho et al.
Citations:
Link: https://arxiv.org/abs/1406.1078

2010  Front-End Factor Analysis for Speaker Verification (IEEE Transactions on Audio, Speech, and Language Processing)
Authors: Dehak et al.
Citations: 2152
Link: https://ieeexplore.ieee.org/document/5545402

1996  Unit selection in a concatenative speech synthesis system using a large speech database
Authors: AJ Hunt, AW Black

ICASSP - IEEE International Conference on Acoustics, Speech and Signal Processing