|
Main | | | |
| 2/20/2024 | Towards audio language modeling - an overview | |
| | Authors | Wu et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2402.13236.pdf |
| | | |
| 1/31/2024 | EVA-GAN: Enhanced Various Audio Generation via Scalable GANs | |
| | Authors | Nvidia |
| | Citations | |
| | Link | https://arxiv.org/pdf/2402.00892.pdf |
| | Link2 | https://double-blind-eva-gan.cc/ |
| | | |
| 1/30/2024 | MusicGen: Simple and Controllable Music Generation | |
| | Authors | Meta: Copet et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2306.05284.pdf |
| | | |
| 1/5/2024 | M2UGen: Multi-modal Music Understanding and Generation with the Power of LLMs | |
| | Authors | Hussain et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2311.11255.pdf |
| | | |
| 1/5/2024 | Pheme: Efficient and Conversational Speech Generation | |
| | Authors | Poly AI: Paweł Budzianowski, Taras Sereda, Tomasz Cichy, Ivan Vulic |
| | Citations | |
| | Link | https://arxiv.org/pdf/2401.02839.pdf |
| | | |
| 1/5/2024 | Towards ASR Robust Spoken Language Understanding Through In-Context Learning with Word Confusion Networks | |
| | Authors | Amazon: Kevin Everson et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/2401.02921 |
| | | |
| 11/26/2023 | WavJourney: Compositional Audio Creation with LLMs | |
| | Authors | Liu et al. |
| | Citations | |
| | Link | |
| | | |
| 10/23/2023 | Mousai: Efficient Text-to-Music Diffusion Models | |
| | Authors | Schneider et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2301.11757.pdf |
| | | |
| 10/12/2023 | PromptTTS 2: Describing and Generating Voices with Text Prompt | |
| | Authors | Microsoft: Leng et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2309.02285.pdf |
| | | |
| 9/21/2023 | ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models | |
| | Authors | Baidu: Zhu et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2302.04456.pdf |
| | | |
| 9/9/2023 | AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | |
| | Authors | Liu et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2301.12503.pdf |
| | | |
| 8/14/2023 | SpeechX: Neural Codec Language Model as a Versatile Speech Transformer | |
| | Authors | Microsoft: Wang et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2308.06873.pdf |
| | | |
| 7/31/2023 | VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design | |
| | Authors | SK Telecom: Jungil Kong, Jihoon Park, Beomjeong Kim, Jeongmin Kim, Dohee Kong, Sangjin Kim |
| | Citations | |
| | Link | https://arxiv.org/abs/2307.16430 |
| | | |
| 7/8/2023 | Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos | |
| | Authors | Su et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2303.16897.pdf |
| | | |
| 6/25/2023 | InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt | |
| | Authors | Yang et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2301.13662.pdf |
| | | |
| 5/30/2023 | NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | |
| | Authors | Microsoft: Shen et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2304.09116.pdf |
| | | |
| 5/29/2023 | Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | |
| | Authors | Huang et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2305.18474.pdf |
| | | |
| 5/25/2023 | Efficient Neural Music Generation | |
| | Authors | ByteDance: Lam et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2305.15719.pdf |
| | | |
| 5/23/2023 | Better speech synthesis through scaling | |
| | Authors | James Betker |
| | Citations | |
| | Link | https://arxiv.org/abs/2305.07243 |
| | | |
| 5/3/2023 | Diverse and Vivid Sound Generation from Text Descriptions | |
| | Authors | Li et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2305.01980.pdf |
| | | |
| 4/24/2023 | TANGO: Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model | |
| | Authors | Ghosal et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/2304.13731 |
| | | |
| 4/5/2023 | AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models | |
| | Authors | Microsoft: Wang et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2304.00830.pdf |
| | | |
| 3/8/2023 | FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model | |
| | Authors | Xue et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2303.02939v3.pdf |
| | | |
| 3/7/2023 | Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (VALL-E X) | |
| | Authors | Microsoft: Zhang et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2303.03926.pdf |
| | | |
| 3/6/2023 | Noise2Music: Text-conditioned Music Generation with Diffusion Models | |
| | Authors | Google: Huang et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2302.03917.pdf |
| | | |
| 2/7/2023 | Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision | |
| | Authors | Google: Kharitonov et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/2302.03540 |
| | | |
| 1/30/2023 | SingSong: Generating musical accompaniments from singing | |
| | Authors | Google: Donahue et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2301.12662.pdf |
| | | |
| 1/30/2023 | Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | |
| | Authors | Huang et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/2301.12661 |
| | | |
| 1/30/2023 | ArchiSound: Audio Generation with Diffusion | |
| | Authors | Flavio Schneider |
| | Citations | |
| | Link | https://arxiv.org/abs/2301.13267 |
| | | |
| 1/26/2023 | MusicLM: Generating Music From Text | |
| | Authors | Google: Agostinelli et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2301.11325.pdf |
| | | |
| 1/5/2023 | Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E) | |
| | Authors | Microsoft: Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei |
| | Citations | |
| | Link | https://arxiv.org/abs/2301.02111 |
| | | |
| 11/22/2022 | PromptTTS: Controllable TTS with Text Descriptions | |
| | Authors | Guo et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2211.12171.pdf |
| | | |
| 7/20/2022 | Diffsound: Discrete Diffusion Model for Text-to-sound Generation | |
| | Authors | Yang et al. |
| | Citations | |
| | Link | https://arxiv.org/pdf/2207.09983v1.pdf |
| | | |
| 4/20/2022 | ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers | |
| | Authors | Qian et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/2204.09224 |
| | | |
| 3/30/2022 | Generative Spoken Dialogue Language Modeling | |
| | Authors | Meta: Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux |
| | Citations | |
| | Link | https://arxiv.org/abs/2203.16502 |
| | | |
| 11/3/2021 | A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion | |
| | Authors | Ubisoft: van Niekerk et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/2111.02392 |
| | | |
| 5/13/2021 | Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech | |
| | Authors | Huawei: Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov |
| | Citations | |
| | Link | https://arxiv.org/abs/2105.06337 |
| | | |
| 3/4/2021 | Perceiver: General Perception with Iterative Attention | |
| | Authors | Google: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira |
| | Citations | |
| | Link | https://arxiv.org/abs/2103.03206 |
| | | |
| 10/12/2020 | HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | |
| | Authors | Kakao: Kong et al. |
| | Citations | 1234 |
| | Link | https://arxiv.org/abs/2010.05646 |
| | | |
| 6/8/2020 | FastSpeech 2: Fast and High-Quality End-to-End Text to Speech | |
| | Authors | Microsoft: Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu |
| | Citations | |
| | Link | https://arxiv.org/abs/2006.04558 |
| | | |
| 5/22/2020 | Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search | |
| | Authors | Kakao: Jaehyeon Kim, Sungwon Kim, Jungil Kong, Sungroh Yoon |
| | Citations | |
| | Link | https://arxiv.org/abs/2005.11129 |
| | | |
| 5/12/2020 | Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis | |
| | Authors | Nvidia: Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro |
| | Citations | |
| | Link | https://arxiv.org/abs/2005.05957 |
| | | |
| 5/12/2020 | AdaDurIAN: Few-shot Adaptation for Neural Text-to-Speech with DurIAN | |
| | Authors | Tencent: Zewang Zhang et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/2005.05642 |
| | | |
| 2/4/2020 | Boffin TTS: Few-Shot Speaker Adaptation by Bayesian Optimization | |
| | Authors | Amazon: Henry Moss et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/2002.01953 |
| | | |
| 2020 | Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram | |
| | Authors | Yamamoto et al. |
| | Citations | |
| | Link | |
| | | |
| 10/23/2019 | Zero-Shot Multi-Speaker Text-to-Speech with State-of-the-art Neural Speaker Embeddings | |
| | Authors | Erica Cooper et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1910.10838 |
| | | |
| 10/8/2019 | MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis | |
| | Authors | Mila, Lyrebird: Kumar et al. |
| | Citations | 881 |
| | Link | https://arxiv.org/abs/1910.06711 |
| | | |
| 5/22/2019 | FastSpeech: Fast, Robust and Controllable Text to Speech | |
| | Authors | Microsoft: Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu |
| | Citations | |
| | Link | https://arxiv.org/abs/1905.09263 |
| | | |
| 5/2/2019 | High quality, lightweight and adaptable TTS using LPCNet | |
| | Authors | IBM: Zvi Kons et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1905.00590 |
| | | |
| 1/2/2019 | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | |
| | Authors | Google: Ye Jia et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1806.04558 |
| | | |
| 11/24/2018 | Representation Mixing for TTS Synthesis | |
| | Authors | Mila: Kyle Kastner et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1811.07240 |
| | | |
| 10/31/2018 | WaveGlow: A Flow-based Generative Network for Speech Synthesis | |
| | Authors | Ryan Prenger, Rafael Valle, Bryan Catanzaro |
| | Citations | |
| | Link | https://arxiv.org/abs/1811.00002 |
| | | |
| 10/12/2018 | Neural Voice Cloning with a Few Samples | |
| | Authors | Baidu: Sercan Arik et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1802.06006 |
| | | |
| 9/27/2018 | Sample Efficient Adaptive Text-to-Speech | |
| | Authors | Google: Chen et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1809.10460 |
| | | |
| 9/19/2018 | Neural Speech Synthesis with Transformer Network | |
| | Authors | Microsoft: Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou |
| | Citations | |
| | Link | https://arxiv.org/abs/1809.08895 |
| | | |
| 6/12/2018 | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | |
| | Authors | Google: Ye Jia et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1806.04558 |
| | | |
| 3/24/2018 | Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron | |
| | Authors | Google: RJ Skerry-Ryan et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1803.09047 |
| | | |
| 3/23/2018 | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | |
| | Authors | Google: Yuxuan Wang et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1803.09017 |
| | | |
| 2/23/2018 | Efficient Neural Audio Synthesis | |
| | Authors | Google: Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, Koray Kavukcuoglu |
| | Citations | 908 |
| | Link | https://arxiv.org/abs/1802.08435 |
| | | |
| 12/16/2017 | Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (Tacotron 2) | |
| | Authors | Google: Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu |
| | Citations | 2831 |
| | Link | https://arxiv.org/abs/1712.05884 |
| | | |
| 10/28/2017 | Generalized End-to-End Loss for Speaker Verification | |
| | Authors | Google: Li Wan et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1710.10467 |
| | | |
| 10/28/2017 | Speaker diarization with LSTM | |
| | Authors | Google: Quan Wang et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1710.10468 |
| | | |
| 5/5/2017 | Deep Speaker: an End-to-End Neural Speaker Embedding System | |
| | Authors | Baidu: Chao Li et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1705.02304 |
| | | |
| 2017 | Char2Wav: End-To-End Speech Synthesis | |
| | Authors | Mila: Jose Sotelo et al. |
| | Citations | |
| | Link | https://mila.quebec/wp-content/uploads/2017/02/end-end-speech.pdf |
| | | |
| 2017 | Deep Neural Network Embeddings for Text-Independent Speaker Verification | |
| | Authors | Snyder et al. |
| | Citations | |
| | Link | https://www.danielpovey.com/files/2017_interspeech_embeddings.pdf |
| | | |
| 2017 | Non-parallel voice conversion using i-vector PLDA: towards unifying speaker verification and transformation | |
| | Authors | Kinnunen et al. |
| | Citations | 28 |
| | Link | https://ieeexplore.ieee.org/document/7953215 |
| | | |
| 9/12/2016 | WaveNet: A Generative Model for Raw Audio | |
| | Authors | Google: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu |
| | Citations | |
| | Link | https://arxiv.org/abs/1609.03499 |
| | | |
| 12/17/2014 | Deep Speech: Scaling up end-to-end speech recognition | |
| | Authors | Baidu: Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng |
| | Citations | |
| | Link | https://arxiv.org/abs/1412.5567 |
| | | |
| 6/3/2014 | Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation | |
| | Authors | Umontreal: Cho et al. |
| | Citations | |
| | Link | https://arxiv.org/abs/1406.1078 |
| | | |
| 2010 | Front-end factor analysis for speaker verification (IEEE Transactions on Audio, Speech, and Language Processing) | |
| | Authors | Dehak et al. |
| | Citations | 2152 |
| | Link | https://ieeexplore.ieee.org/document/5545402 |
| | | |
| 1996 | Unit selection in a concatenative speech synthesis system using a large speech database | |
| | Authors | AJ Hunt, AW Black |
| | | |
| | ICASSP - IEEE International Conference on Acoustics, Speech, and Signal Processing | |