This is Shun Ueda from Dwango Media Village. Previously, we developed a Japanese phoneme alignment tool called pydomino (https://github.com/DwangoMediaVillage/pydomino) and wrote an article about it. That article focused on distinctive features, which further decompose phonemes (such as the vowel a or the consonant m) into finer elements. However, this method required “hard alignment” data with detailed labeling of when each phoneme was spoken. Creating such data required manual labeling or automatic labeling using existing tools, which limited the amount of available data.
To address this issue, we developed a new method that detects only phoneme transition events (the moments when phonemes change) in speech data. We achieved this by training a neural network model optimized using Connectionist Temporal Classification (CTC) Loss [1]. This approach allows training without hard alignment data, enabling the use of a larger amount of speech data. Experiments confirmed that this new method achieves higher accuracy than the previous method based on distinctive features.
The updated pydomino incorporating this new approach can be easily installed from the following GitHub repository and used without a GPU. This article provides a detailed explanation of how the model is trained and how alignment is estimated.
git clone --recursive https://github.com/DwangoMediaVillage/pydomino
cd pydomino
pip install .
In our previous article, we introduced a method that treats Japanese phonemes as sets of binary labels representing distinctive features and infers phoneme alignment from the prediction results. However, training a neural network to predict distinctive features required hard-aligned label data. Since manually annotated hard alignment data is limited, we used hard alignment labels generated automatically by the Julius phoneme alignment toolkit. Either way, preparing hard alignment data increased the workload and limited the amount of usable training data.
This article proposes a method to solve this problem. Our approach optimizes a model using CTC Loss to predict the occurrence of phoneme transition events at the time frame level and applies the Viterbi algorithm [2] for alignment inference. In this optimization, the blank token represents “no phoneme transition event occurring.” This framework enables training without hard alignment data, allowing the use of large-scale Japanese speech datasets as they are.
The phoneme set \( \Omega \) targeted by this alignment tool consists of the same 39 phonemes used for the distinctive features. For more details, please refer to the article on the alignment tool based on distinctive features. The input and output during inference remain as follows:
Input | 16 kHz single-channel audio waveform | \( \boldsymbol{x} \in [0, 1]^{T} \) |
| Read-aloud phoneme sequence | \( \boldsymbol{l} \in \Omega^{M} \) |
Output | The time intervals during which each phoneme was spoken | \( \boldsymbol{Z} \in \mathbb{R}_{+}^{M \times 2} \) |
Here, the read-aloud phoneme sequence \( \boldsymbol{l} = (l_1, l_2, \cdots, l_M) \) includes pau phonemes at both ends (\( l_1 = l_M = \mathrm{pau} \)). Each phoneme’s spoken time interval \(\boldsymbol{Z} = [\boldsymbol{z}_1, \boldsymbol{z}_2, \cdots, \boldsymbol{z}_M]^{\top}\) is represented as \(\boldsymbol{z}_m = [z_{m1}, z_{m2}]\), indicating that phoneme \( l_m \) is spoken between \( z_{m1} \) seconds and \( z_{m2} \) seconds.
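Concretely, \( \boldsymbol{Z} \) pairs each phoneme in \( \boldsymbol{l} \) with a start and end time in seconds. As a purely hypothetical illustration of the output format (the time values below are made up):

```python
# Hypothetical alignment for a short utterance "あい" (a i); the values are illustrative only.
phonemes = ["pau", "a", "i", "pau"]                           # l, with pau at both ends
Z = [(0.00, 0.12), (0.12, 0.31), (0.31, 0.52), (0.52, 0.70)]  # [z_m1, z_m2] in seconds

for phoneme, (start, end) in zip(phonemes, Z):
    print(f"{phoneme}: {start:.2f}s - {end:.2f}s")
```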
Unlike models using distinctive features, this method converts phoneme labels into phoneme transition sequences without palatalization. For example, for speech data where only “意識” (i sh I k I) is spoken, the phoneme transition sequence becomes [pau→i, i→sh, sh→I, I→k, k→I, I→pau], where x→y represents a transition from phoneme x to phoneme y. As a preprocessing step, if the input phoneme sequence lacks pau tokens at the beginning and end, they are inserted to ensure proper representation of speech onset and termination.
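As a concrete illustration, the following sketch shows one way this conversion could be written (illustrative only, not the pydomino implementation; the function name and the ASCII `x->y` token spelling are assumptions):

```python
def to_transition_sequence(phonemes: list[str]) -> list[str]:
    """Convert a read-aloud phoneme sequence into phoneme transition tokens.

    Pads the sequence with "pau" at both ends if missing, then emits one
    "x->y" token per adjacent phoneme pair.
    """
    if not phonemes or phonemes[0] != "pau":
        phonemes = ["pau"] + phonemes
    if phonemes[-1] != "pau":
        phonemes = phonemes + ["pau"]
    return [f"{a}->{b}" for a, b in zip(phonemes, phonemes[1:])]


# "意識" read as i sh I k I
print(to_transition_sequence(["i", "sh", "I", "k", "I"]))
# ['pau->i', 'i->sh', 'sh->I', 'I->k', 'k->I', 'I->pau']
```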
The complete set of phoneme transition tokens used in this study excludes transitions that do not occur in Japanese pronunciation. For example, the transition from k to t is impossible and is therefore excluded from the network predictions. In this article, we consider only the phoneme transition tokens marked with ✓ below. The total number of phoneme transition tokens is 556.
Previous \ Next | pau | Consonant | Voiced Vowel | Unvoiced Vowel | N | cl |
---|---|---|---|---|---|---|
pau | ✓ | ✓ | ✓ | ✓ | ||
Consonant | ✓ | ✓ | ||||
Voiced Vowel | ✓ | ✓ | ✓ | ✓ | ✓ | |
Unvoiced Vowel | ✓ | ✓ | ||||
N | ✓ | ✓ | ✓ | ✓ | ||
cl | ✓ | ✓ | ✓ | ✓ |
The preprocessing for speech audio follows the same approach as in the article on the alignment tool using distinctive features, utilizing a log Mel spectrogram with 100 frames per second.
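For reference, a log Mel spectrogram at 100 frames per second can be obtained from 16 kHz audio with a 10 ms hop, for example with torchaudio (the window length, Mel bin count, and flooring constant below are illustrative assumptions, not necessarily the settings used by pydomino):

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,       # 25 ms analysis window (assumption)
    hop_length=160,  # 10 ms hop -> 100 frames per second
    n_mels=80,       # number of Mel bins (assumption)
)

waveform = torch.randn(1, 16000)           # one second of dummy audio
log_mel = torch.log(mel(waveform) + 1e-6)  # shape: (1, 80, ~101 frames)
```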
The prediction of phoneme transition events utilizes the Encoder part of the Transformer [3] model.
$$ \begin{aligned} \boldsymbol{S} &= \mathrm{LogMelSpectrogram}(\boldsymbol{x})\\ \boldsymbol{\pi}, \boldsymbol{\phi} &= \mathrm{TransformerEncoder}(\boldsymbol{S})\end{aligned} $$

Here, \( \boldsymbol{\pi} \in \mathbb{R}^{T' \times |\Delta|} \) and \( \boldsymbol{\phi} \in \mathbb{R}^{T'} \) represent the following:

- \( \boldsymbol{\pi} \): the predicted probability of each phoneme transition token at each time frame
- \( \boldsymbol{\phi} \): the predicted probability of the blank token (no phoneme transition event) at each time frame

\(T'\) is the number of time frames, \( \Delta \) is the set of all possible phoneme transition patterns, and \( |\Delta| = 556 \) is the number of elements in \( \Delta \). The loss during training is computed using CTC Loss.
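A minimal PyTorch sketch of this training objective, assuming the transition tokens in \( \Delta \) are mapped to integer ids \(1, \dots, 556\) and id \(0\) is reserved for the blank token (all shapes and values below are dummies):

```python
import torch
import torch.nn as nn

num_transition_tokens = 556     # |Δ|
ctc_loss = nn.CTCLoss(blank=0)  # blank = "no phoneme transition event"

# Per-frame log probabilities over {blank} ∪ Δ as produced by the encoder;
# the blank column plays the role of φ, the remaining columns the role of π.
# Shape: (T' frames, batch, 1 + |Δ|).
log_probs = torch.randn(300, 2, 1 + num_transition_tokens).log_softmax(dim=-1)

# Target transition-token id sequences, e.g. [pau→i, i→sh, ...] as integers.
targets = torch.randint(1, 1 + num_transition_tokens, (2, 20))
input_lengths = torch.full((2,), 300)
target_lengths = torch.full((2,), 20)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```

Because the blank is just another output class during training, no frame-level labels are needed: CTC marginalizes over all possible placements of the transition events along the time axis.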
The phoneme alignment is generated using the Viterbi algorithm [2], described in Algorithms 1–4, which takes as input the frame-wise phoneme transition probabilities \( \boldsymbol{\pi} \), the frame-wise blank probabilities \( \boldsymbol{\phi} \), the phoneme transition token sequence \( \boldsymbol{w} \in \Delta^{M+1} \) obtained from the read-aloud phoneme sequence, and the minimum number of time frames \( N \) assigned to each phoneme. In this context, the blank token prediction probability is interpreted as the “probability that no phoneme transition event occurs.”
As in the article on distinctive features, we introduce the minimum number of time frames \( N \in \mathbb{N} \) assigned to each phoneme. When \(N=1\), the method is equivalent to the standard Viterbi algorithm.
Additionally, we define the following conventions for the summation:

$$ \begin{aligned} \sum_{s=t}^{t} \phi_s &= \phi_t\\ \sum_{s=t'}^{t} \phi_s &= 0 &(t' > t) \end{aligned} $$

This ensures that the summation operates correctly within the constraints of the time frame indices.
\begin{algorithm}
\caption{Viterbi Algorithm}
\begin{algorithmic}
\INPUT $\boldsymbol{\pi} \in \mathbb{R}^{T' \times |\Delta|}, \boldsymbol{\phi} \in \mathbb{R}^{T'}, \boldsymbol{w} \in \Delta^{M + 1}, N \in \mathbb{Z}_{+}$
\OUTPUT $\boldsymbol{Z} \in \mathbb{R}_{\ge 0}^{M \times 2}$
\STATE $\boldsymbol{A} \in \mathbb{R}^{T' \times (M + 1)} = $ \CALL{initialize}{$\boldsymbol{\pi}, \boldsymbol{w}$}
\STATE $\boldsymbol{\beta} \in \mathbb{B}^{T' \times (M + 1)} = $ \CALL{forward}{$\boldsymbol{A}, \boldsymbol{\phi}, N$}
\STATE $\boldsymbol{Z} \in \mathbb{R}_{\ge 0}^{M \times 2} = $ \CALL{backtrace}{$\boldsymbol{\beta}, N$}
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Initialize}
\begin{algorithmic}
\INPUT $\boldsymbol{\pi} \in \mathbb{R}^{T' \times |\Delta|}, \boldsymbol{w} \in \Delta^{M + 1}$
\OUTPUT $\boldsymbol{A} \in \mathbb{R}^{T' \times (M + 1)}$
\FOR{$t = 1$ \TO $T'$}
\FOR{$m = 1$ \TO $M + 1$}
\STATE $a_{tm} = \pi_{t, w_m}$
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Forwarding}
\begin{algorithmic}
\INPUT $\boldsymbol{A} \in \mathbb{R}^{T' \times (M+1)}, \boldsymbol{\phi} \in \mathbb{R}^{T'}, N \in \mathbb{Z}_{+}$
\OUTPUT $\boldsymbol{\beta} \in \mathbb{B}^{T' \times (M+1)}$
\STATE $\boldsymbol{\alpha} \in \mathbb{R}^{T' \times (2M+3)} = \{ - \infty \}^{T' \times (2M+3)}$
\STATE $\boldsymbol{\beta} \in \mathbb{B}^{T' \times (M+1)} = \{ \mathrm{false} \}^{T' \times (M+1)}$
\FOR{$m = 1$ \TO $2M+3$}
\IF{$m = 1$}
\STATE $\alpha_{1, 1} = \phi_{1}$
\FOR{$t = 2$ \TO $T'$}
\STATE $\alpha_{t, m} = \alpha_{t-1, m} + \phi_{t}$
\ENDFOR
\ELSEIF{$m = 2$}
\STATE $\alpha_{1, 2} = a_{1, 1}$
\FOR{$t = 2$ \TO $T'$}
\STATE $\alpha_{t, m} = \alpha_{t-1, m-1} + a_{t, 1}$
\ENDFOR
\ELSEIF{$m = 4, 6, 8, \cdots, 2M + 2$}
\FOR{$t = m / 2 * N$ \TO $T'$}
\STATE $\alpha_{t, m} = \alpha_{t-1, m-1} + a_{t, m/2}$
\ENDFOR
\ELSEIF{$m = 3, 5, 7, 9, \cdots, 2M + 1$}
\FOR{$t = (m-1) / 2 * N$ \TO $T'$}
\STATE $x^{(\mathrm{transition\_before\_N\_frames\_ago})} = \alpha_{t-1, m} + \phi_{t}$
\IF{$t - N + 2 \le 0$}
\STATE $x^{(\mathrm{transition\_N\_frames\_ago})} = - \infty$
\ELSE
\STATE $x^{(\mathrm{transition\_N\_frames\_ago})} = \alpha_{t-N+1, m-1} + \sum_{t' = t - N + 2}^{t}\phi_{t'}$
\ENDIF
\IF{$x^{(\mathrm{transition\_N\_frames\_ago})} > x^{(\mathrm{transition\_before\_N\_frames\_ago})}$}
\STATE $\beta_{t-N, (m-1)/2} = $ \TRUE
\ENDIF
\STATE $\alpha_{t, m} = \max(x^{(\mathrm{transition\_N\_frames\_ago})}, x^{(\mathrm{transition\_before\_N\_frames\_ago})})$
\ENDFOR
\ELSE
\FOR{$t = m / 2 * N$ \TO $T'$}
\STATE $x^{(\mathrm{transition\_before\_1\_frame\_ago})} = \alpha_{t-1, m} + \phi_{t}$
\STATE $x^{(\mathrm{transition\_1\_frame\_ago})} = \alpha_{t-1, m-1} + \phi_{t}$
\IF{$x^{(\mathrm{transition\_1\_frame\_ago})} > x^{(\mathrm{transition\_before\_1\_frame\_ago})}$}
\STATE $\beta_{t-1, (m-1)/2} = $ \TRUE
\ENDIF
\STATE $\alpha_{t, m} = \max(x^{(\mathrm{transition\_1\_frame\_ago})}, x^{(\mathrm{transition\_before\_1\_frame\_ago})})$
\ENDFOR
\ENDIF
\ENDFOR
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Backtracing}
\begin{algorithmic}
\INPUT $\boldsymbol{\beta} \in \mathbb{B}^{T' \times (M + 1)}, N \in \mathbb{Z}_{+}$
\OUTPUT $\boldsymbol{Z} \in \mathbb{R}_{\ge 0}^{M \times 2}$
\STATE $t = T'$
\STATE $m = M$
\STATE $z_{M, 2} = t / 100$
\WHILE{$t > 0$}
\IF{$\beta_{tm}$}
\STATE $t = t - N$
\STATE $z_{m, 1} = z_{m-1, 2} = t / 100$
\STATE $m = m - 1$
\ELSE
\STATE $t = t - 1$
\ENDIF
\ENDWHILE
\STATE $z_{1, 1} = 0$
\end{algorithmic}
\end{algorithm}
From here, we will explain the method for calculating the forward log probabilities.
The forward log probability \( \alpha_{t, i} \) for the initial non-blank transition token can be calculated as:
$$ \alpha_{t, i} = \sum_{t'=1}^{t-1} \log \phi_{t'} + \log \pi_{t, i} $$

Next, we consider the computation of the forward log probability when transitioning from one phoneme transition token to another. The desired forward log probability is given by:
$$ \begin{aligned} \alpha_{t, i} &= \max_{1 \le s \le t-N}\left\{ \alpha_{s, i-2} + \log \pi_{t, i} + \sum_{t' = s+1}^{t-1} \log \phi_{t'} \right\}\\ &= \max_{1 \le s \le t-N}\left\{ \alpha_{s, i-2} + \sum_{t' = s+1}^{t-1} \log \phi_{t'} \right\} + \log \pi_{t, i} \end{aligned} $$

However, this calculation has a computational complexity of \(\mathcal{O}((T')^3 (M + 1)) \). To reduce this computational cost, a dynamic programming approach is employed.
For this first term, at the blank token positions \(i\) other than the initial and final ones, fixing \(i\) and sequentially computing for \(t = 1, 2, \cdots, T'\)

$$ \alpha_{t, i} = \max\left\{ \alpha_{t-1, i} + \log \phi_{t}, \alpha_{t -N + 1, i-1} + \sum_{t' = t - N + 2}^{t} \log \phi_{t'} \right\} $$

and, for the non-blank tokens except the first one, computing

$$ \alpha_{t, i} = \alpha_{t-1, i-1} + \log \pi_{t, i} $$

is equivalent to the direct computation above. This reduces the computational cost to \(\mathcal{O}(N T' (M + 1)) \).
In the calculation of the forward log probability for the final blank token, there is no alignment constraint for the intervals of silence before and after the speech, so it can be computed as:
$$ \alpha_{t, i} = \max\left( \alpha_{t-1, i-1}, \alpha_{t-1, i} \right) + \log \phi_t $$

This allows alignment to be predicted while maintaining the constraint that at least \( N \) frames of time are allocated to a single phoneme.
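To make these recurrences concrete, here is a small NumPy reference sketch of the same alignment search (an illustrative implementation, not the code shipped with pydomino: it uses the direct max-over-\(s\) form of the recurrence rather than the \(\mathcal{O}(N T' (M + 1))\) dynamic program, treats indices as 0-based, and its function and variable names are assumptions):

```python
import numpy as np

def viterbi_transition_alignment(log_pi, log_phi, token_ids, N=1, fps=100):
    """Illustrative forced alignment over phoneme transition events.

    log_pi    : (T, D) array, log probability of each transition token per frame
    log_phi   : (T,)   array, log probability of the blank ("no transition") per frame
    token_ids : the K transition-token ids of the utterance, in order
    N         : minimum number of frames between consecutive transitions
    Returns the frame index of each transition and the same times in seconds.
    """
    T, K = log_phi.shape[0], len(token_ids)
    # cum[t] = sum of log_phi over frames 0 .. t-1 (blank emitted on those frames)
    cum = np.concatenate([[0.0], np.cumsum(log_phi)])

    alpha = np.full((T, K), -np.inf)    # alpha[t, k]: k-th transition exactly at frame t
    back = np.zeros((T, K), dtype=int)  # frame of the (k-1)-th transition on the best path

    # First transition: blanks on frames 0 .. t-1, then token_ids[0] at frame t.
    alpha[:, 0] = cum[:T] + log_pi[:, token_ids[0]]

    for k in range(1, K):
        for t in range(k * N, T):
            # Previous transition at frame s (s <= t - N), blanks on frames s+1 .. t-1.
            s = np.arange(0, t - N + 1)
            scores = alpha[s, k - 1] + (cum[t] - cum[s + 1])
            best = int(np.argmax(scores))
            alpha[t, k] = scores[best] + log_pi[t, token_ids[k]]
            back[t, k] = s[best]

    # Blanks after the last transition, then backtrack the transition frames.
    total = alpha[:, K - 1] + (cum[T] - cum[np.arange(T) + 1])
    frames = [int(np.argmax(total))]
    for k in range(K - 1, 0, -1):
        frames.append(int(back[frames[-1], k]))
    frames.reverse()
    return frames, [f / fps for f in frames]
```

Here, the returned frames are the times at which each transition token fires; in the article's notation, phoneme \( l_m \) then spans the interval between consecutive transition frames, divided by 100 to convert to seconds.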
In this experiment, we conduct a comparative study with the distinctive feature prediction LSTM model introduced in the previous article. The training and evaluation datasets are the same as those in the distinctive feature article, specifically the CSJ dataset [4] and the ITA corpus multimodal data [5]. The evaluation metrics are also identical to those in the previous article, so please refer to that for details. The distinctive feature prediction LSTM model is implemented using our publicly available pydomino tool.
For phoneme transition prediction, we used a Transformer Encoder composed of \(4\) layers of attention. Each attention layer consists of a non-causal self-attention layer and a feed-forward layer. The self-attention layer has \(4\) heads, with an attention dimension of \(256\). The intermediate dimension of the feed-forward layer is \(2048\).
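In PyTorch terms, an encoder with these hyperparameters could be assembled roughly as follows (a sketch under the assumption of an 80-dimensional log Mel input; the input projection and output head are illustrative additions, not part of the description above):

```python
import torch
import torch.nn as nn

input_proj = nn.Linear(80, 256)   # project log-Mel frames to the attention dimension
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=256,            # attention dimension
        nhead=4,                # self-attention heads (non-causal by default)
        dim_feedforward=2048,   # feed-forward intermediate dimension
        batch_first=True,
    ),
    num_layers=4,
)
head = nn.Linear(256, 1 + 556)    # blank + 556 phoneme transition tokens

frames = torch.randn(1, 300, 80)            # (batch, T', Mel bins)
logits = head(encoder(input_proj(frames)))  # (batch, T', 1 + |Δ|)
```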
The performance comparison results, using the ITA multimodal dataset and alignment error rate as the evaluation metric, are as follows. Note that the optimal minimum assigned time frame per phoneme \( N \) differs between the distinctive feature prediction LSTM model and the phoneme transition prediction Transformer model, as each model achieves the best alignment error rate at different values of \( N \). Specifically, the optimal \( N \) value is \( N = 5 \) for the distinctive feature prediction LSTM model and \( N = 2 \) for the phoneme transition prediction Transformer model.
Method | Alignment Error Rate (%) |
---|---|
Distinctive Feature Prediction LSTM | 10.410 |
Phoneme Transition Prediction Transformer | 8.576 ± 0.226 |
The comparison results show that the phoneme transition prediction Transformer achieves better Japanese phoneme alignment accuracy than the distinctive feature prediction LSTM introduced in the previous article.
Furthermore, the ITA multimodal dataset contains relatively long silence segments at the beginning and end of each audio signal. We investigated whether these silence segments at both ends influence the alignment error rate comparison.
When the silent segments at both ends of the ITA corpus multimodal dataset’s audio data were excluded from the input, the alignment error rates were as follows. Similar to the previous comparison experiment using the entire audio data, we adopted the optimal minimum assigned time frame per phoneme \( N \) for each model to achieve the best alignment error rate. The optimal values were \( N = 5\) for the distinctive feature prediction LSTM and \(N = 3\) for the phoneme transition prediction Transformer.
Method | Alignment Error Rate (%) |
---|---|
Distinctive Feature Prediction LSTM | 16.787 |
Phoneme Transition Prediction Transformer | 14.049 ± 0.417 |
This confirms that the introduction of the Transformer based on phoneme transition events improves alignment performance, regardless of whether the entire audio data is used as input or only the speech segments are included.
In the previous comparison experiment, we confirmed that the proposed method outperforms the current pydomino model. However, this alone does not rule out the possibility that factors other than replacing the neural network architecture from LSTM to Transformer contributed to the performance improvement.
To investigate this, we prepared a distinctive feature prediction Transformer and a phoneme transition prediction LSTM in addition to the existing models. By comparing the alignment performance of these four different Japanese phoneme alignment models, we aim to isolate the impact of the neural network architecture change.
The results are as follows.
Minimum Assigned Time Frame \(N\) | Distinctive Feature Prediction LSTM | Distinctive Feature Prediction Transformer | Phoneme Transition Prediction LSTM | Phoneme Transition Prediction Transformer |
---|---|---|---|---|
1 | 11.640 | 14.498±0.091 | 49.027±16.393 | 8.707±0.264 |
2 | 11.386 | 14.186±0.066 | 58.734±22.439 | 8.576±0.226 |
3 | 11.174 | 13.857±0.085 | 60.695±22.814 | 8.628±0.205 |
4 | 10.785 | 13.463±0.101 | 62.862±23.107 | 9.571±0.367 |
5 | 10.410 | 13.029±0.118 | 64.080±22.672 | 12.166±0.807 |
6 | 11.081 | 13.366±0.129 | 65.915±19.498 | 18.111±1.518 |
7 | 17.445 | 18.557±0.115 | 71.297±12.895 | 32.963±1.875 |
Since the phoneme transition prediction Transformer achieves better alignment accuracy than the distinctive feature prediction Transformer, we can conclude that the improvement in performance is not only due to the model itself but also to the use of phoneme transitions as the target. Additionally, since using phoneme transitions with LSTM results in worse performance compared to using distinctive features, this suggests that the combination of phoneme transition prediction and Transformer is crucial for achieving better performance.
Additionally, we also measured the phoneme alignment error rate when the silent segments at both ends of the audio data were excluded from the input.
Minimum Assigned Time Frame \(N\) | Distinctive Feature Prediction LSTM | Distinctive Feature Prediction Transformer | Phoneme Transition Prediction LSTM | Phoneme Transition Prediction Transformer |
---|---|---|---|---|
1 | 19.225 | 23.869±0.188 | 50.062±14.365 | 14.652±0.622 |
2 | 18.654 | 23.280±0.161 | 52.802±16.598 | 14.300±0.416 |
3 | 18.073 | 22.712±0.125 | 54.360±17.095 | 14.049±0.417 |
4 | 17.410 | 22.043±0.096 | 55.397±17.144 | 14.287±0.429 |
5 | 16.787 | 21.343±0.113 | 55.791±16.946 | 15.899±0.452 |
6 | 18.045 | 21.862±0.127 | 55.483±15.051 | 21.090±0.475 |
7 | 28.100 | 30.148±0.135 | 58.239±10.517 | 34.465±0.509 |
The same findings as in the previous results were also confirmed for the data containing only the speech segments.
In this article, we introduced a phoneme alignment training approach that does not require hard alignment data, by optimizing a model to predict phoneme transition events using CTC Loss. Evaluation experiments using the ITA corpus multimodal data showed that this method achieves better alignment accuracy than the previously introduced phoneme alignment based on distinctive features.
The framework of estimating alignment by optimizing a model to predict state transition moments with CTC Loss is not limited to phoneme alignment. It is expected to be applicable to other time-series problems involving events that occur at discrete points in time.
[1] GRAVES, Alex, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on Machine learning. 2006. p. 369-376. https://www.cs.toronto.edu/~graves/icml_2006.pdf
[2] VITERBI, Andrew. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory, 1967, 13.2: 260-269. https://ieeexplore.ieee.org/abstract/document/1054010/
[3] VASWANI, Ashish, et al. Attention is All you Need. Advances in Neural Information Processing Systems, 2017. https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[4] MAEKAWA, Kikuo. Corpus of Spontaneous Japanese: Its design and evaluation. In: ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition. 2003. https://www2.ninjal.ac.jp/kikuo/SSPR03.pdf
[5] ITA Corpus Multimodal Database (ITAコーパスマルチモーダルデータベース). https://zunko.jp/multimodal_dev/login.php