This is read-speech audio data recorded by the author (a man in his 20s). Freely usable datasets are scarce and I struggled to find one, so I recorded my own voice and am making it freely available. The reading scripts are identical to those in the Voice Statistics Corpus. There are no particular usage restrictions, so please feel free to use it.
I’m Hiroshiba from Dwango Media Village. I have been researching voice conversion using Deep Learning, and after achieving some results, I presented a poster at the Music Symposium 2018. I decided to organize the knowledge I have gained about voice conversion and write this article.
First, I will explain the mechanism of speech and the acoustic features used in voice conversion. Then I will cover standard statistical voice conversion, and explain recent voice conversion methods with a focus on how they differ from it.
Speech has acoustic features, such as pitch and timbre, and linguistic features, such as phoneme sequences and words. Voice conversion changes only the acoustic features while leaving the linguistic features untouched, so the acoustic features first have to be extracted from the speech.
Pitch corresponds to the vibration frequency of the vocal cords in the throat, and timbre is modeled by the shape of the vocal tract. For more details on this, Professor Saruwatari’s signal processing lecture materials from the University of Tokyo are very helpful.
It is necessary to quantify abstract features such as pitch and timbre. One model treats speech as the result of filtering a periodic impulse train (the source-filter model). In this view, pitch corresponds to the spacing of the impulse train, and timbre corresponds to the filter characteristics. Speech can be synthesized by convolving the impulse train with the filter's impulse response, whose frequency-domain representation is the spectral envelope. The inverse of the impulse spacing is called the fundamental frequency, or f0.
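As a rough illustration of this source-filter picture, here is a minimal numpy sketch; the sampling rate, the f0 value, and the toy exponential-decay filter are assumptions chosen only for illustration, not part of any real vocoder:

```python
import numpy as np

fs = 16000          # sampling frequency [Hz]
f0 = 150.0          # fundamental frequency [Hz]
duration = 0.5      # seconds

# Source: a periodic impulse train; the spacing fs / f0 is the inverse of f0 in samples.
n_samples = int(fs * duration)
period = int(fs / f0)
source = np.zeros(n_samples)
source[::period] = 1.0

# Filter: a toy decaying impulse response standing in for the vocal-tract response
# that corresponds to a spectral envelope.
t = np.arange(256) / fs
impulse_response = np.exp(-t * 800.0)

# Speech is (roughly) the convolution of the source with the filter.
speech = np.convolve(source, impulse_response)[:n_samples]
```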
The spectral envelope represents timbre well, but its high dimensionality makes it difficult to handle. Because the spectral envelope is highly redundant along the frequency axis, taking an inverse Fourier transform in the frequency direction and keeping only the low-order coefficients gives a low-dimensional feature without significantly losing representational ability. This is called the cepstrum ('cepstrum' is an anagram of 'spectrum').
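A minimal numpy sketch of that computation (the frame below is a dummy placeholder so the snippet runs; variable names are my own):

```python
import numpy as np

# spectral_envelope: power spectral envelope of one frame, e.g. 513 bins
# (filled with a dummy value here just so the snippet runs).
spectral_envelope = np.ones(513)

# Inverse Fourier transform of the log spectrum along the frequency axis.
log_spectrum = np.log(spectral_envelope)
cepstrum = np.fft.irfft(log_spectrum)

# Keeping only the low-order (low-quefrency) coefficients gives a compact
# representation of the envelope; the rest can be discarded with little loss.
low_order_cepstrum = cepstrum[:60]
```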
Taking human hearing into account gives an acoustic feature called the mel-cepstrum, which retains more representational ability than the plain cepstrum at the same dimensionality. This feature is often used in statistical voice conversion. Depending on the sampling frequency, the spectral envelope has around 512 or 1024 dimensions, while the mel-cepstrum can be reduced to around 60 dimensions without a significant loss in quality. (Note that mel-frequency cepstral coefficients (MFCCs) are a different concept from the mel-cepstrum.)
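In practice the mel-cepstrum is usually computed with an existing library. The sketch below assumes the pysptk package, whose sp2mc and mc2sp functions convert between a spectral envelope and a mel-cepstrum; the order and warping parameter alpha shown here are typical choices for 16 kHz audio, not fixed requirements:

```python
import numpy as np
import pysptk  # assumed dependency: https://github.com/r9y9/pysptk

# Power spectral envelope of one frame (e.g. from WORLD), filled with a
# dummy value here just so the snippet runs.
fft_size = 1024
spectral_envelope = np.ones(fft_size // 2 + 1)

order = 59    # keep 60 coefficients (0th..59th)
alpha = 0.42  # frequency-warping parameter, a typical value for 16 kHz audio

# Spectral envelope -> mel-cepstrum (low-dimensional), and back again.
mel_cepstrum = pysptk.sp2mc(spectral_envelope, order=order, alpha=alpha)
reconstructed = pysptk.mc2sp(mel_cepstrum, alpha=alpha, fftlen=fft_size)
```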
To extract acoustic features from speech or synthesize speech from acoustic features, a Vocoder is used. Vocoders include STRAIGHT and WORLD. WORLD is convenient for commercial use as it is distributed under the modified BSD license.
The fundamental frequency and spectral envelope alone are not enough to reconstruct clean speech. Depending on the vocoder, additional features are needed; WORLD, for example, also requires an aperiodicity feature for reconstruction.
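As a minimal sketch of analysis and resynthesis with WORLD, assuming the pyworld Python bindings and the soundfile package (file names are placeholders):

```python
import numpy as np
import soundfile as sf  # assumed dependency for reading/writing audio
import pyworld          # Python bindings for the WORLD vocoder

# Load speech as float64 mono (the path is a placeholder).
x, fs = sf.read("input.wav")
x = x.astype(np.float64)

# Analysis: fundamental frequency, spectral envelope, aperiodicity.
f0, time_axis = pyworld.dio(x, fs)             # rough f0 estimation
f0 = pyworld.stonemask(x, f0, time_axis, fs)   # f0 refinement
sp = pyworld.cheaptrick(x, f0, time_axis, fs)  # spectral envelope
ap = pyworld.d4c(x, f0, time_axis, fs)         # aperiodicity

# Synthesis: all three features are needed to reconstruct clean speech.
y = pyworld.synthesize(f0, sp, ap, fs)
sf.write("resynthesized.wav", y, fs)
```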
Note that the spectrum obtained by Fourier transforming speech is not the same as the spectral envelope described above. That spectrum does not take the periodicity of the impulse train into account, so it also varies with the position of the analysis window.
Voice conversion can be achieved by estimating a transformation function for acoustic features and applying it to the input speech. Such a voice conversion method is called statistical voice conversion.
In the early days of statistical voice conversion, the mainstream approach was to estimate the transformation function by training a Gaussian Mixture Model (GMM) on paired data in which the input speaker and the target speaker read the same sentences [0]. Such paired data are called parallel data; everything else is called non-parallel data.
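To make the idea concrete, here is a minimal sketch of a joint-density GMM mapping using scikit-learn. It is the simple frame-wise minimum mean-square error version, not the trajectory-based method of [0], and it assumes the source and target frames are already time-aligned (alignment is discussed below); all function and variable names are my own:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_jdgmm(src_mcep, tgt_mcep, n_mix=32):
    """Fit a GMM on time-aligned, stacked [source; target] mel-cepstrum frames."""
    joint = np.hstack([src_mcep, tgt_mcep])  # shape: (frames, 2 * dim)
    gmm = GaussianMixture(n_components=n_mix, covariance_type="full", max_iter=200)
    gmm.fit(joint)
    return gmm

def convert(gmm, src_mcep):
    """Frame-wise minimum mean-square error mapping E[y | x] under the joint GMM."""
    dim = src_mcep.shape[1]
    mu_x, mu_y = gmm.means_[:, :dim], gmm.means_[:, dim:]
    cov_xx = gmm.covariances_[:, :dim, :dim]
    cov_yx = gmm.covariances_[:, dim:, :dim]

    # Posterior responsibility of each mixture component given the source frame.
    likelihoods = np.stack(
        [gmm.weights_[m] * multivariate_normal.pdf(src_mcep, mu_x[m], cov_xx[m])
         for m in range(gmm.n_components)], axis=1)          # (frames, n_mix)
    posteriors = likelihoods / likelihoods.sum(axis=1, keepdims=True)

    # Posterior-weighted conditional means: mu_y + Sigma_yx Sigma_xx^-1 (x - mu_x).
    converted = np.zeros_like(src_mcep)
    for m in range(gmm.n_components):
        mapping = np.linalg.solve(cov_xx[m], cov_yx[m].T)    # Sigma_xx^-1 Sigma_yx^T
        cond_mean = mu_y[m] + (src_mcep - mu_x[m]) @ mapping
        converted += posteriors[:, m:m + 1] * cond_mean
    return converted
```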
Although there are several acoustic features (fundamental frequency, spectral envelope, aperiodicity), the transformation function estimated by machine learning usually covers only the spectral envelope (mel-cepstrum). The aperiodicity is often simply copied from the source speaker, since it does not differ much between speakers. The fundamental frequency is usually converted by a linear transformation based on the means and variances of the input and target speakers.
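A minimal sketch of that fundamental-frequency conversion, done in the log-F0 domain as is common practice (function and variable names are my own):

```python
import numpy as np

def convert_f0(f0, src_stats, tgt_stats):
    """Linearly transform log F0 using the mean/std of the source and target speakers.

    src_stats and tgt_stats are (mean, std) of log F0 computed over voiced frames.
    Unvoiced frames (f0 == 0) are left untouched.
    """
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    converted = np.zeros_like(f0)
    voiced = f0 > 0
    converted[voiced] = np.exp(
        (np.log(f0[voiced]) - src_mean) / src_std * tgt_std + tgt_mean
    )
    return converted
```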
Even when the same sentences are read, timing differences arise from pauses and differences in speaking rate. When estimating the transformation function from parallel data, these timing differences have to be corrected; this correction is called alignment. Alignment is often performed with Dynamic Time Warping (DTW) on the mel-cepstrum excluding the 0th coefficient (power).
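A minimal sketch of the alignment step, assuming librosa's DTW implementation; the 0th coefficient (power) is excluded from the distance computation, and variable names are my own:

```python
import numpy as np
import librosa  # assumed dependency; librosa.sequence.dtw computes the DTW path

def align(src_mcep, tgt_mcep):
    """Align two mel-cepstrum sequences of shape (frames, dim) with DTW.

    The 0th coefficient (power) is excluded when computing the path,
    but the full frames are returned, time-aligned to each other.
    """
    # librosa expects features with shape (dim, frames).
    _, path = librosa.sequence.dtw(src_mcep[:, 1:].T, tgt_mcep[:, 1:].T,
                                   metric="euclidean")
    path = path[::-1]  # the returned path is ordered from the end to the start
    src_idx, tgt_idx = path[:, 0], path[:, 1]
    return src_mcep[src_idx], tgt_mcep[tgt_idx]
```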
Recent voice conversion methods bring various improvements: generating waveforms without a vocoder, using Deep Learning, using GANs, taking whole sequences as input, using non-parallel data, and so on. Here are some representative methods.
Errors occur when extracting acoustic features from speech. To avoid this, a method was proposed that estimates the spectral difference between the source and target and applies it as a filter to the input waveform, so that the converted speech is obtained without resynthesizing it from acoustic features [1]. Another proposal changes the pitch by modifying the frequency of the input speech beforehand. These methods are implemented in sprocket, which is distributed under the MIT license and lets you run the whole pipeline from training to testing.
The GMM has been replaced by Deep Learning [2]. Many recent papers propose new methods built on top of this basic approach.
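As a minimal PyTorch sketch of this basic formulation, a frame-wise multilayer perceptron can be trained on aligned parallel frames; the architecture and hyperparameters below are assumptions for illustration, not those of any cited paper:

```python
import torch
from torch import nn

dim = 60  # mel-cepstrum dimensionality

# Frame-wise transformation function: source mel-cepstrum -> target mel-cepstrum.
model = nn.Sequential(
    nn.Linear(dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, dim),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(src_frames, tgt_frames):
    """One gradient step on a batch of time-aligned frames, shape (batch, dim)."""
    pred = model(src_frames)
    loss = nn.functional.mse_loss(pred, tgt_frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```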
This method estimates the transformation function with a multi-layer neural network, trains it on parallel data, and applies a Generative Adversarial Network (GAN) to suppress over-smoothing of the converted features [3].
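Loosely following that idea, the sketch below adds an adversarial term on top of the frame-wise regression above; the discriminator architecture and the loss weight are my own assumptions:

```python
import torch
from torch import nn

dim = 60
generator = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, dim))
discriminator = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(src_frames, tgt_frames, adv_weight=0.1):
    # Discriminator: distinguish natural target frames from converted frames.
    converted = generator(src_frames)
    d_loss = (bce(discriminator(tgt_frames), torch.ones(len(tgt_frames), 1))
              + bce(discriminator(converted.detach()), torch.zeros(len(src_frames), 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: regression loss plus an adversarial term that pushes the
    # converted frames toward the distribution of natural speech parameters.
    converted = generator(src_frames)
    g_loss = (nn.functional.mse_loss(converted, tgt_frames)
              + adv_weight * bce(discriminator(converted), torch.ones(len(src_frames), 1)))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```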
These methods estimate functions that convert between non-parallel speech data in both directions [4][5][6]. They were originally developed in the image domain. When I tried CycleGAN myself, vowels were sometimes swapped after conversion, so tricks such as training the lower and higher orders of the mel-cepstrum separately are necessary [4].
Acoustic features have temporal dependencies; for example, the pronunciation of the same phoneme changes depending on the surrounding phoneme sequence. Methods have been proposed that take these temporal dependencies into account by using LSTMs or CNNs for voice conversion [7][8]. Our recently published method [link] also falls into this category, and its source code for training and testing is available under the MIT license, so it can easily be tried here.
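As a minimal PyTorch sketch of a sequence-level converter that looks at neighboring frames through 1-D convolutions (the kernel sizes and channel counts are assumptions, not the configuration of any of the cited methods):

```python
import torch
from torch import nn

class SequenceConverter(nn.Module):
    """Converts a whole mel-cepstrum sequence, letting each output frame
    depend on its neighboring input frames via 1-D convolutions."""

    def __init__(self, dim=60, channels=128, kernel_size=5):
        super().__init__()
        padding = kernel_size // 2
        self.net = nn.Sequential(
            nn.Conv1d(dim, channels, kernel_size, padding=padding), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=padding), nn.ReLU(),
            nn.Conv1d(channels, dim, kernel_size, padding=padding),
        )

    def forward(self, x):
        # x: (batch, frames, dim); Conv1d expects (batch, dim, frames).
        return self.net(x.transpose(1, 2)).transpose(1, 2)

model = SequenceConverter()
dummy = torch.randn(1, 200, 60)  # one utterance: 200 frames of 60-dim mel-cepstrum
converted = model(dummy)         # same shape as the input
```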
This method compresses speech into a quantized latent representation and reconstructs it with WaveNet, an autoregressive model, which makes it possible to learn voice conversion from non-parallel data [9]. The quality of the published samples is very high and attracted a lot of attention. Since it relies on WaveNet, a large amount of speech data is probably required for training, and waveform generation at conversion time is very slow. Using Parallel WaveNet or WaveRNN might speed up generation without sacrificing quality.
A dataset is needed for machine learning. Non-parallel speech data is relatively easy to collect, but preparing parallel data is hard. A good dataset may be found in the Speech Corpus List of the National Institute of Informatics. Alternatively, the Voice Statistics Corpus, read by professional voice actors, may be more approachable. In addition, I am distributing audio data read by the author using the same scripts as the Voice Statistics Corpus; there are no particular usage restrictions, so please feel free to use it.
I originally did research on image conversion with Deep Learning. At some point I figured that if VR took off there would also be demand for virtualizing voices, so I started experimenting with voice conversion. However, most voice conversion work still relies on conventional acoustic features, and it took some effort to catch up with the field. By the time my research was organized and ready for publication, the Vtuber boom had arrived. I suspect many people are now considering tackling voice conversion with Deep Learning, so I wrote this article as an introduction for those who know Deep Learning but are unfamiliar with voice conversion. I hope voice conversion spreads quickly and accelerates the virtualization of voices.
[0] Toda, T., Black, A. W. and Tokuda, K.: Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory, IEEE T. ASLP, 2007.
[1] Kobayashi, K. and Toda, T.: sprocket: Open-Source Voice Conversion Software, Proc. Odyssey, 2018.
[2] Ling, Z. H., Kang, S. Y., Zen, H., Senior, A., Schuster, M., Qian, X. J., Meng, H. M. and Deng, L.: Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends, IEEE SPM, 2015.
[3] Saito, Y., Takamichi, S. and Saruwatari, H.: Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks, IEEE T. ASLP, 2018.
[4] Fang, F., Yamagishi, J., Echizen, I. and Lorenzo-Trueba, J.: High-Quality Nonparallel Voice Conversion Based on Cycle-Consistent Adversarial Network, ICASSP, 2018.
[5] Kaneko, T. and Kameoka, H.: Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks, arXiv, 2017.
[6] Hsu, C. C., Hwang, H. T., Wu, Y. C., Tsao, Y., and Wang, H. M.: Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks, INTERSPEECH, 2017.
[7] Kaneko, T., Kameoka, H., Hiramatsu, K. and Kashino, K.: Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks, INTERSPEECH, 2017.
[8] 廣芝 和之, 能勢 隆, 宮本 颯, 伊藤 彰則, 小田桐 優理: Voice Conversion Using Convolutional-Neural-Network-Based Acoustic Feature Conversion and Spectrogram Refinement (in Japanese), Music Symposium, 2018. ../../casestudy/2stack_voice_conversion/
[9] van den Oord, A., Vinyals, O. and Kavukcuoglu, K.: Neural Discrete Representation Learning, NIPS, 2017.