This is Shun Ueda from Dwango Media Village. Previously, we developed a Japanese phoneme alignment tool called pydomino (https://github.com/DwangoMediaVillage/pydomino) and wrote an article about it. That article focused on distinctive features, which further decompose phonemes (such as the vowel a or the consonant m) into finer elements. However, this method required “hard alignment” data with detailed labeling of when each phoneme was spoken. Creating such data required manual labeling or automatic labeling using existing tools, which limited the amount of available data.
To address this issue, we developed a new method that detects only phoneme transition events (the moments when phonemes change) in speech data. We achieved this by training a neural network model optimized using Connectionist Temporal Classification (CTC) Loss [1]. This approach allows training without hard alignment data, enabling the use of a larger amount of speech data. Experiments confirmed that this new method achieves higher accuracy than the previous method based on distinctive features.
The updated pydomino incorporating this new approach can be easily installed from the following GitHub repository and used without a GPU. This article provides a detailed explanation of how the model is trained and how alignment is estimated.
git clone --recursive https://github.com/DwangoMediaVillage/pydomino
cd pydomino
pip install .
In our previous article, we introduced a method that treats Japanese phonemes as sets of binary labels representing distinctive features and infers phoneme alignment from the prediction results. However, training a neural network to predict distinctive features required hard-aligned label data. Since manually annotated hard alignment data is limited, we used hard alignment labels generated automatically by the Julius phoneme alignment toolkit. Either way, preparing hard alignment data increased the workload and limited the amount of usable training data.
This article proposes a method to solve this problem. Our approach optimizes a model using CTC Loss to predict the occurrence of phoneme transition events at the time frame level and applies the Viterbi algorithm [2] for alignment inference. In this optimization, the blank token represents “no phoneme transition event occurring.” This framework enables training without hard alignment data, allowing the use of large-scale Japanese speech datasets as they are.
The phoneme set \( \Omega \) targeted by this alignment tool consists of the same 39 phonemes used for the distinctive features. For more details, please refer to the article on the alignment tool based on distinctive features. The input and output during inference remain as follows:
Input | 16 kHz single-channel audio waveform | \( \boldsymbol{x} \in [0, 1]^{T} \) |
| Read-aloud phoneme sequence | \( \boldsymbol{l} \in \Omega^{M} \) |
Output | The time intervals during which each phoneme was spoken | \( \boldsymbol{Z} \in \mathbb{R}_{+}^{M \times 2} \) |
Here, the read-aloud phoneme sequence \( \boldsymbol{l} = (l_1, l_2, \cdots, l_M) \) includes pau phonemes at both ends (\( l_1 = l_M = \mathrm{pau} \)). Each phoneme’s spoken time interval \(\boldsymbol{Z} = [\boldsymbol{z}_1, \boldsymbol{z}_2, \cdots, \boldsymbol{z}_M]^{\top}\) is represented as \(\boldsymbol{z}_m = [z_{m1}, z_{m2}]\), indicating that phoneme \( l_m \) is spoken between \( z_{m1} \) seconds and \( z_{m2} \) seconds.
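Concretely, \( \boldsymbol{Z} \) pairs each phoneme in \( \boldsymbol{l} \) with a start and end time in seconds. As a purely hypothetical illustration of the output format (the time values below are made up):

```python
# Hypothetical alignment for a short utterance "あい" (a i); the values are illustrative only.
phonemes = ["pau", "a", "i", "pau"]                           # l, with pau at both ends
Z = [(0.00, 0.12), (0.12, 0.31), (0.31, 0.52), (0.52, 0.70)]  # [z_m1, z_m2] in seconds

for phoneme, (start, end) in zip(phonemes, Z):
    print(f"{phoneme}: {start:.2f}s - {end:.2f}s")
```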
Unlike models using distinctive features, this method converts phoneme labels into phoneme transition sequences without palatalization. For example, for speech data where only “意識” (i sh I k I) is spoken, the phoneme transition sequence becomes [pau→i, i→sh, sh→I, I→k, k→I, I→pau], where x→y represents a transition from phoneme x to phoneme y. As a preprocessing step, if the input phoneme sequence lacks pau tokens at the beginning and end, they are inserted to ensure proper representation of speech onset and termination.
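As a concrete illustration, the following sketch shows one way this conversion could be written (illustrative only, not the pydomino implementation; the function name and the ASCII `x->y` token spelling are assumptions):

```python
def to_transition_sequence(phonemes: list[str]) -> list[str]:
    """Convert a read-aloud phoneme sequence into phoneme transition tokens.

    Pads the sequence with "pau" at both ends if missing, then emits one
    "x->y" token per adjacent phoneme pair.
    """
    if not phonemes or phonemes[0] != "pau":
        phonemes = ["pau"] + phonemes
    if phonemes[-1] != "pau":
        phonemes = phonemes + ["pau"]
    return [f"{a}->{b}" for a, b in zip(phonemes, phonemes[1:])]


# "意識" read as i sh I k I
print(to_transition_sequence(["i", "sh", "I", "k", "I"]))
# ['pau->i', 'i->sh', 'sh->I', 'I->k', 'k->I', 'I->pau']
```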
The complete set of phoneme transition tokens used in this study excludes transitions that do not occur in Japanese pronunciation. For example, the transition from k to t is impossible and is therefore excluded from the network predictions. In this article, we consider only the phoneme transition tokens marked with ✓ below. The total number of phoneme transition tokens is 556.
Previous \ Next | pau | Consonant | Voiced Vowel | Unvoiced Vowel | N | cl |
---|---|---|---|---|---|---|
pau | ✓ | ✓ | ✓ | ✓ | ||
Consonant | ✓ | ✓ | ||||
Voiced Vowel | ✓ | ✓ | ✓ | ✓ | ✓ | |
Unvoiced Vowel | ✓ | ✓ | ||||
N | ✓ | ✓ | ✓ | ✓ | ||
cl | ✓ | ✓ | ✓ | ✓ |
The preprocessing for speech audio follows the same approach as in the article on the alignment tool using distinctive features, utilizing a log Mel spectrogram with 100 frames per second.
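For reference, a log Mel spectrogram at 100 frames per second can be obtained from 16 kHz audio with a 10 ms hop, for example with torchaudio (the window length, Mel bin count, and flooring constant below are illustrative assumptions, not necessarily the settings used by pydomino):

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,       # 25 ms analysis window (assumption)
    hop_length=160,  # 10 ms hop -> 100 frames per second
    n_mels=80,       # number of Mel bins (assumption)
)

waveform = torch.randn(1, 16000)           # one second of dummy audio
log_mel = torch.log(mel(waveform) + 1e-6)  # shape: (1, 80, ~101 frames)
```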
The prediction of phoneme transition events utilizes the Encoder part of the Transformer [3] model.
$$ \begin{aligned} \boldsymbol{S} &= \mathrm{LogMelSpectrogram}(\boldsymbol{x})\\ \boldsymbol{\pi}, \boldsymbol{\phi} &= \mathrm{TransformerEncoder}(\boldsymbol{S})\end{aligned} $$

Here, \( \boldsymbol{\pi} \in \mathbb{R}^{T' \times |\Delta|} \) and \( \boldsymbol{\phi} \in \mathbb{R}^{T'} \) represent the following:

- \( \boldsymbol{\pi} \): the predicted probability of each phoneme transition token at each time frame
- \( \boldsymbol{\phi} \): the predicted probability of the blank token (no phoneme transition event) at each time frame

\(T'\) is the number of time frames, \( \Delta \) is the set of all possible phoneme transition patterns, and \( |\Delta| = 556 \) is the number of elements in \( \Delta \). The loss during training is computed using CTC Loss.
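A minimal PyTorch sketch of this training objective, assuming the transition tokens in \( \Delta \) are mapped to integer ids \(1, \dots, 556\) and id \(0\) is reserved for the blank token (all shapes and values below are dummies):

```python
import torch
import torch.nn as nn

num_transition_tokens = 556     # |Δ|
ctc_loss = nn.CTCLoss(blank=0)  # blank = "no phoneme transition event"

# Per-frame log probabilities over {blank} ∪ Δ as produced by the encoder;
# the blank column plays the role of φ, the remaining columns the role of π.
# Shape: (T' frames, batch, 1 + |Δ|).
log_probs = torch.randn(300, 2, 1 + num_transition_tokens).log_softmax(dim=-1)

# Target transition-token id sequences, e.g. [pau→i, i→sh, ...] as integers.
targets = torch.randint(1, 1 + num_transition_tokens, (2, 20))
input_lengths = torch.full((2,), 300)
target_lengths = torch.full((2,), 20)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```

Because the blank is just another output class during training, no frame-level labels are needed: CTC marginalizes over all possible placements of the transition events along the time axis.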
The phoneme alignment is generated using the Viterbi algorithm [2], described in Algorithms 1–4, which takes as input the frame-wise phoneme transition probabilities \( \boldsymbol{\pi} \), the frame-wise blank probabilities \( \boldsymbol{\phi} \), the phoneme transition token sequence \( \boldsymbol{w} \in \Delta^{M+1} \) obtained from the read-aloud phoneme sequence, and the minimum number of time frames \( N \) assigned to each phoneme. In this context, the blank token prediction probability is interpreted as the “probability that no phoneme transition event occurs.”
As in the article on distinctive features, we introduce the minimum number of time frames \( N \in \mathbb{N} \) assigned to each phoneme. When \(N=1\), the method is equivalent to the standard Viterbi algorithm.
Additionally, we define the following conventions for the summation:

$$ \begin{aligned} \sum_{s=t}^{t} \phi_s &= \phi_t\\ \sum_{s=t'}^{t} \phi_s &= 0 &(t' > t) \end{aligned} $$

This ensures that the summation operates correctly within the constraints of the time frame indices.
\begin{algorithm}
\caption{Viterbi Algorithm}
\begin{algorithmic}
\INPUT $\boldsymbol{\pi} \in \mathbb{R}^{T' \times |\Delta|}, \boldsymbol{\phi} \in \mathbb{R}^{T'}, \boldsymbol{w} \in \Delta^{M + 1}, N \in \mathbb{Z}_{+}$
\OUTPUT $\boldsymbol{Z} \in \mathbb{R}_{\ge 0}^{M \times 2}$
\STATE $\boldsymbol{A} \in \mathbb{R}^{T' \times (M + 1)} = $ \CALL{initialize}{$\boldsymbol{\pi}, \boldsymbol{w}$}
\STATE $\boldsymbol{\beta} \in \mathbb{B}^{T' \times (M + 1)} = $ \CALL{forward}{$\boldsymbol{A}, \boldsymbol{\phi}, N$}
\STATE $\boldsymbol{Z} \in \mathbb{R}_{\ge 0}^{M \times 2} = $ \CALL{backtrace}{$\boldsymbol{\beta}, N$}
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Initialize}
\begin{algorithmic}
\INPUT $\boldsymbol{\pi} \in \mathbb{R}^{T' \times |\Delta|}, \boldsymbol{w} \in \Delta^{M + 1}$
\OUTPUT $\boldsymbol{A} \in \mathbb{R}^{T' \times (M + 1)}$
\FOR{$t = 1$ \TO $T'$}
\FOR{$m = 1$ \TO $M + 1$}
\STATE $a_{tm} = \pi_{t, w_m}$
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Forwarding}
\begin{algorithmic}
\INPUT $\boldsymbol{A} \in \mathbb{R}^{T' \times (M+1)}, \boldsymbol{\phi} \in \mathbb{R}^{T'}, N \in \mathbb{Z}_{+}$
\OUTPUT $\boldsymbol{\beta} \in \mathbb{B}^{T' \times (M+1)}$
\STATE $\boldsymbol{\alpha} \in \mathbb{R}^{T' \times (2M+3)} = \{ - \infty \}^{T' \times (2M+3)}$
\STATE $\boldsymbol{\beta} \in \mathbb{B}^{T' \times (M+1)} = \{ \mathrm{false} \}^{T' \times (M+1)}$
\FOR{$m = 1$ \TO $2M+3$}
\IF{$m = 1$}
\STATE $\alpha_{1, 1} = \phi_{1}$
\FOR{$t = 2$ \TO $T'$}
\STATE $\alpha_{t, m} = \alpha_{t-1, m} + \phi_{t}$
\ENDFOR
\ELSEIF{$m = 2$}
\STATE $\alpha_{1, 2} = a_{1, 1}$
\FOR{$t = 2$ \TO $T'$}
\STATE $\alpha_{t, m} = \alpha_{t-1, m-1} + a_{t, 1}$
\ENDFOR
\ELSEIF{$m = 4, 6, 8, \cdots, 2M + 2$}
\FOR{$t = m / 2 * N$ \TO $T'$}
\STATE $\alpha_{t, m} = \alpha_{t-1, m-1} + a_{t, m/2}$
\ENDFOR
\ELSEIF{$m = 3, 5, 7, 9, \cdots, 2M + 1$}
\FOR{$t = (m-1) / 2 * N$ \TO $T'$}
\STATE $x^{(\mathrm{transition\_before\_N\_frames\_ago})} = \alpha_{t-1, m} + \phi_{t}$
\IF{$t - N + 2 \le 0$}
\STATE $x^{(\mathrm{transition\_N\_frames\_ago})} = - \infty$
\ELSE
\STATE $x^{(\mathrm{transition\_N\_frames\_ago})} = \alpha_{t-N+1, m-1} + \sum_{t' = t - N + 2}^{t}\phi_{t'}$
\ENDIF
\IF{$x^{(\mathrm{transition\_N\_frames\_ago})} > x^{(\mathrm{transition\_before\_N\_frames\_ago})}$}
\STATE $\beta_{t-N, (m-1)/2} = $ \TRUE
\ENDIF
\STATE $\alpha_{t, m} = \max(x^{(\mathrm{transition\_N\_frames\_ago})}, x^{(\mathrm{transition\_before\_N\_frames\_ago})})$
\ENDFOR
\ELSE
\FOR{$t = m / 2 * N$ \TO $T'$}
\STATE $x^{(\mathrm{transition\_before\_1\_frame\_ago})} = \alpha_{t-1, m} + \phi_{t}$
\STATE $x^{(\mathrm{transition\_1\_frame\_ago})} = \alpha_{t-1, m-1} + \phi_{t}$
\IF{$x^{(\mathrm{transition\_1\_frame\_ago})} > x^{(\mathrm{transition\_before\_1\_frame\_ago})}$}
\STATE $\beta_{t-1, (m-1)/2} = $ \TRUE
\ENDIF
\STATE $\alpha_{t, m} = \max(x^{(\mathrm{transition\_1\_frame\_ago})}, x^{(\mathrm{transition\_before\_1\_frame\_ago})})$
\ENDFOR
\ENDIF
\ENDFOR
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Backtracing}
\begin{algorithmic}
\INPUT $\boldsymbol{\beta} \in \mathbb{B}^{T' \times (M + 1)}, N \in \mathbb{Z}_{+}$
\OUTPUT $\boldsymbol{Z} \in \mathbb{R}_{\ge 0}^{M \times 2}$
\STATE $t = T'$
\STATE $m = M$
\STATE $z_{M, 2} = t / 100$
\WHILE{$t > 0$}
\IF{$\beta_{tm}$}
\STATE $t = t - N$
\STATE $z_{m, 1} = z_{m-1, 2} = t / 100$
\STATE $m = m - 1$
\ELSE
\STATE $t = t - 1$
\ENDIF
\ENDWHILE
\STATE $z_{1, 1} = 0$
\end{algorithmic}
\end{algorithm}
From here, we will explain the method for calculating the forward log probabilities.
The forward log probability \( \alpha_{t, i} \) for the initial non-blank transition token can be calculated as:
$$ \alpha_{t, i} = \sum_{t'=1}^{t-1} \log \phi_{t'} + \log \pi_{t, i} $$

Next, we consider the computation of the forward log probability when transitioning from one phoneme transition token to another. The desired forward log probability is given by:
$$ \begin{aligned} \alpha_{t, i} &= \max_{1 \le s \le t-N}\left\{ \alpha_{s, i-2} + \log \pi_{t, i} + \sum_{t' = s+1}^{t-1} \log \phi_{t'} \right\}\\ &= \max_{1 \le s \le t-N}\left\{ \alpha_{s, i-2} + \sum_{t' = s+1}^{t-1} \log \phi_{t'} \right\} + \log \pi_{t, i} \end{aligned} $$

However, this calculation has a computational complexity of \(\mathcal{O}((T')^3 (M + 1)) \). To reduce this computational cost, a dynamic programming approach is employed.
For this first term, at the blank token positions \(i\) other than the initial and final ones, fixing \(i\) and sequentially computing for \(t = 1, 2, \cdots, T'\)

$$ \alpha_{t, i} = \max\left\{ \alpha_{t-1, i} + \log \phi_{t}, \alpha_{t -N + 1, i-1} + \sum_{t' = t - N + 2}^{t} \log \phi_{t'} \right\} $$

and, for the non-blank tokens except the first one, computing

$$ \alpha_{t, i} = \alpha_{t-1, i-1} + \log \pi_{t, i} $$

is equivalent to the direct computation above. This reduces the computational cost to \(\mathcal{O}(N T' (M + 1)) \).
In the calculation of the forward log probability for the final blank token, there is no alignment constraint for the intervals of silence before and after the speech, so it can be computed as:
$$ \alpha_{t, i} = \max\left( \alpha_{t-1, i-1}, \alpha_{t-1, i} \right) + \log \phi_t $$

This allows alignment to be predicted while maintaining the constraint that at least \( N \) frames of time are allocated to a single phoneme.
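To make these recurrences concrete, here is a small NumPy reference sketch of the same alignment search (an illustrative implementation, not the code shipped with pydomino: it uses the direct max-over-\(s\) form of the recurrence rather than the \(\mathcal{O}(N T' (M + 1))\) dynamic program, treats indices as 0-based, and its function and variable names are assumptions):

```python
import numpy as np

def viterbi_transition_alignment(log_pi, log_phi, token_ids, N=1, fps=100):
    """Illustrative forced alignment over phoneme transition events.

    log_pi    : (T, D) array, log probability of each transition token per frame
    log_phi   : (T,)   array, log probability of the blank ("no transition") per frame
    token_ids : the K transition-token ids of the utterance, in order
    N         : minimum number of frames between consecutive transitions
    Returns the frame index of each transition and the same times in seconds.
    """
    T, K = log_phi.shape[0], len(token_ids)
    # cum[t] = sum of log_phi over frames 0 .. t-1 (blank emitted on those frames)
    cum = np.concatenate([[0.0], np.cumsum(log_phi)])

    alpha = np.full((T, K), -np.inf)    # alpha[t, k]: k-th transition exactly at frame t
    back = np.zeros((T, K), dtype=int)  # frame of the (k-1)-th transition on the best path

    # First transition: blanks on frames 0 .. t-1, then token_ids[0] at frame t.
    alpha[:, 0] = cum[:T] + log_pi[:, token_ids[0]]

    for k in range(1, K):
        for t in range(k * N, T):
            # Previous transition at frame s (s <= t - N), blanks on frames s+1 .. t-1.
            s = np.arange(0, t - N + 1)
            scores = alpha[s, k - 1] + (cum[t] - cum[s + 1])
            best = int(np.argmax(scores))
            alpha[t, k] = scores[best] + log_pi[t, token_ids[k]]
            back[t, k] = s[best]

    # Blanks after the last transition, then backtrack the transition frames.
    total = alpha[:, K - 1] + (cum[T] - cum[np.arange(T) + 1])
    frames = [int(np.argmax(total))]
    for k in range(K - 1, 0, -1):
        frames.append(int(back[frames[-1], k]))
    frames.reverse()
    return frames, [f / fps for f in frames]
```

Here, the returned frames are the times at which each transition token fires; in the article's notation, phoneme \( l_m \) then spans the interval between consecutive transition frames, divided by 100 to convert to seconds.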
In this experiment, we conduct a comparative study with the distinctive feature prediction LSTM model introduced in the previous article. The training and evaluation datasets are the same as those in the distinctive feature article, specifically the CSJ dataset [4] and the ITA corpus multimodal data [5]. The evaluation metrics are also identical to those in the previous article, so please refer to that for details. The distinctive feature prediction LSTM model is implemented using our publicly available pydomino tool.
For phoneme transition prediction, we used a Transformer Encoder composed of \(4\) layers of attention. Each attention layer consists of a non-causal self-attention layer and a feed-forward layer. The self-attention layer has \(4\) heads, with an attention dimension of \(256\). The intermediate dimension of the feed-forward layer is \(2048\).
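In PyTorch terms, an encoder with these hyperparameters could be assembled roughly as follows (a sketch under the assumption of an 80-dimensional log Mel input; the input projection and output head are illustrative additions, not part of the description above):

```python
import torch
import torch.nn as nn

input_proj = nn.Linear(80, 256)   # project log-Mel frames to the attention dimension
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=256,            # attention dimension
        nhead=4,                # self-attention heads (non-causal by default)
        dim_feedforward=2048,   # feed-forward intermediate dimension
        batch_first=True,
    ),
    num_layers=4,
)
head = nn.Linear(256, 1 + 556)    # blank + 556 phoneme transition tokens

frames = torch.randn(1, 300, 80)            # (batch, T', Mel bins)
logits = head(encoder(input_proj(frames)))  # (batch, T', 1 + |Δ|)
```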
The performance comparison results, using the ITA multimodal dataset and alignment error rate as the evaluation metric, are as follows. Note that the optimal minimum assigned time frame per phoneme \( N \) differs between the distinctive feature prediction LSTM model and the phoneme transition prediction Transformer model, as each model achieves the best alignment error rate at different values of \( N \). Specifically, the optimal \( N \) value is \( N = 5 \) for the distinctive feature prediction LSTM model and \( N = 2 \) for the phoneme transition prediction Transformer model.
Method | Alignment Error Rate (%) |
---|---|
Distinctive Feature Prediction LSTM | 10.410 |
Phoneme Transition Prediction Transformer | 8.576 ± 0.226 |
The comparison results show that the phoneme transition prediction Transformer achieves better Japanese phoneme alignment accuracy than the distinctive feature prediction LSTM introduced in the previous article.
Furthermore, the ITA multimodal dataset contains relatively long silence segments at the beginning and end of each audio signal. We investigated whether these silence segments at both ends influence the alignment error rate comparison.
When the silent segments at both ends of the ITA corpus multimodal dataset’s audio data were excluded from the input, the alignment error rates were as follows. Similar to the previous comparison experiment using the entire audio data, we adopted the optimal minimum assigned time frame per phoneme \( N \) for each model to achieve the best alignment error rate. The optimal values were \( N = 5\) for the distinctive feature prediction LSTM and \(N = 3\) for the phoneme transition prediction Transformer.
Method | Alignment Error Rate (%) |
---|---|
Distinctive Feature Prediction LSTM | 16.787 |
Phoneme Transition Prediction Transformer | 14.049 ± 0.417 |
This confirms that the introduction of the Transformer based on phoneme transition events improves alignment performance, regardless of whether the entire audio data is used as input or only the speech segments are included.
In the previous comparison experiment, we confirmed that the proposed method outperforms the current pydomino model. However, this alone does not rule out the possibility that factors other than replacing the neural network architecture from LSTM to Transformer contributed to the performance improvement.
To investigate this, we prepared a distinctive feature prediction Transformer and a phoneme transition prediction LSTM in addition to the existing models. By comparing the alignment performance of these four different Japanese phoneme alignment models, we aim to isolate the impact of the neural network architecture change.
The results are as follows.
Minimum Assigned Time Frame \(N\) | Distinctive Feature Prediction LSTM | Distinctive Feature Prediction Transformer | Phoneme Transition Prediction LSTM | Phoneme Transition Prediction Transformer |
---|---|---|---|---|
1 | 11.640 | 14.498±0.091 | 49.027±16.393 | 8.707±0.264 |
2 | 11.386 | 14.186±0.066 | 58.734±22.439 | 8.576±0.226 |
3 | 11.174 | 13.857±0.085 | 60.695±22.814 | 8.628±0.205 |
4 | 10.785 | 13.463±0.101 | 62.862±23.107 | 9.571±0.367 |
5 | 10.410 | 13.029±0.118 | 64.080±22.672 | 12.166±0.807 |
6 | 11.081 | 13.366±0.129 | 65.915±19.498 | 18.111±1.518 |
7 | 17.445 | 18.557±0.115 | 71.297±12.895 | 32.963±1.875 |
Since the phoneme transition prediction Transformer achieves better alignment accuracy than the distinctive feature prediction Transformer, we can conclude that the improvement in performance is not only due to the model itself but also to the use of phoneme transitions as the target. Additionally, since using phoneme transitions with LSTM results in worse performance compared to using distinctive features, this suggests that the combination of phoneme transition prediction and Transformer is crucial for achieving better performance.
Additionally, we also measured the phoneme alignment error rate when the silent segments at both ends of the audio data were excluded from the input.
Minimum Assigned Time Frame \(N\) | Distinctive Feature Prediction LSTM | Distinctive Feature Prediction Transformer | Phoneme Transition Prediction LSTM | Phoneme Transition Prediction Transformer |
---|---|---|---|---|
1 | 19.225 | 23.869±0.188 | 50.062±14.365 | 14.652±0.622 |
2 | 18.654 | 23.280±0.161 | 52.802±16.598 | 14.300±0.416 |
3 | 18.073 | 22.712±0.125 | 54.360±17.095 | 14.049±0.417 |
4 | 17.410 | 22.043±0.096 | 55.397±17.144 | 14.287±0.429 |
5 | 16.787 | 21.343±0.113 | 55.791±16.946 | 15.899±0.452 |
6 | 18.045 | 21.862±0.127 | 55.483±15.051 | 21.090±0.475 |
7 | 28.100 | 30.148±0.135 | 58.239±10.517 | 34.465±0.509 |
The same findings as in the previous results were also confirmed for the data containing only the speech segments.
In this article, we introduced a phoneme alignment training approach that does not require hard alignment data, by optimizing a model to predict phoneme transition events using CTC Loss. Evaluation experiments using the ITA corpus multimodal data showed that this method achieves better alignment accuracy than the previously introduced phoneme alignment based on distinctive features.
The framework of estimating alignment by optimizing a model to predict state transition moments with CTC Loss is not limited to phoneme alignment. It is expected to be applicable to other time-series problems involving events that occur at discrete points in time.
[1] GRAVES, Alex, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on Machine learning. 2006. p. 369-376. https://www.cs.toronto.edu/~graves/icml_2006.pdf
[2] VITERBI, Andrew. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory, 1967, 13.2: 260-269. https://ieeexplore.ieee.org/abstract/document/1054010/
[3] VASWANI, Ashish, et al. Attention is All you Need. Advances in Neural Information Processing Systems, 2017. https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[4] MAEKAWA, Kikuo. Corpus of Spontaneous Japanese: Its design and evaluation. In: ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition. 2003. https://www2.ninjal.ac.jp/kikuo/SSPR03.pdf
[5] ITA Corpus Multimodal Database (ITAコーパスマルチモーダルデータベース). https://zunko.jp/multimodal_dev/login.php