The Herald of the Siberian State University of Telecommunications and Information Science

Emotional Speech Synthesis with Emotion Embeddings

https://doi.org/10.55648/1998-6920-2021-15-4-23-31

Abstract

Several neural network architectures provide high-quality speech synthesis. This article investigates emotional speech synthesis with global style tokens and describes a novel method of emotional speech synthesis based on emotional text embeddings.

About the Author

V. S. Boldakov
Siberian State University of Telecommunications and Information Science (SibSUTIS)
Russian Federation


References

1. Shen J., Pang R., Weiss R. et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions // 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, April 15 - April 20, 2018. P. 4779-4783.

2. Ren Y., Hu C., Tan X. et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. [Electronic resource]. URL: https://arxiv.org/abs/2006.04558 (accessed: 09.09.2021).

3. Felbo B., Mislove A., Søgaard A. et al. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm // Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, September, 2017. P. 1615-1625.

4. Wang Y., Stanton D., Zhang Y. et al. Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. [Electronic resource]. URL: https://arxiv.org/abs/1803.09017 (accessed: 19.09.2021).

5. McAuliffe M., Socolof M., Mihuc S. et al. Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi // INTERSPEECH 2017: Conference of the International Speech Communication Association, Stockholm, Sweden, August 20 - August 24, 2017. P. 498-502.

6. Zhu X., Zhang Y., Yang S. et al. Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis // IEEE Access. 2019. V. 7. P. 65955-65964.

7. Boldakov V. S. Examples of emotional speech synthesis based on Tacotron 2. [Electronic resource]. URL: https://bit.ly/3nOPHRN (accessed: 09.09.2021).

8. Boldakov V. S. Examples of emotional speech synthesis based on FastSpeech 2. [Electronic resource]. URL: https://bit.ly/39i0T15 (accessed: 09.09.2021).

9. Ito K., Johnson L. The LJ Speech Dataset. [Electronic resource]. URL: https://keithito.com/LJ-Speech-Dataset/ (accessed: 09.09.2021).

10. Luo L., Wang Y. et al. EmotionX-HSU: Adopting Pre-trained BERT for Emotion Classification. [Electronic resource]. URL: https://arxiv.org/pdf/1907.09669.pdf (accessed: 19.09.2021).


For citations:


Boldakov V. S. Emotional Speech Synthesis with Emotion Embeddings. The Herald of the Siberian State University of Telecommunications and Information Science. 2021;(4):23-31. (In Russ.) https://doi.org/10.55648/1998-6920-2021-15-4-23-31

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1998-6920 (Print)