Reliability Modeling of Speech Recognition Tasks
Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing, 100083, China
Speech recognition is becoming a key technology of man-machine interfaces in information technology, and the application of voice technology has become a competitive new high-tech industry. However, due to the large vocabulary, continuous voice, and personalized accents, it is hard to make speech recognition completely accurate. In this paper, a reliability model is proposed to measure the performance of speech recognition. In particular, two types of task failures are suggested and an iterative approach is adopted. Numerical examples are presented for illustrative purposes.
Cite this article: Hui Qiu, Xiangbin Yan, Rui Peng, Kaiye Gao, Langtao Wu.
1. Introduction
Speech recognition technology has applications in various systems, such as automatic translation telephones, question-and-answer machines, and intelligent decision support systems [1-4]. The mechanism of speech recognition lies in the separation of the words and the matching of patterns between the words in the speech and the words in a dictionary [5-6]. Although many measures have been put forward to improve speech recognition accuracy, research on quantitatively evaluating the performance and reliability of speech recognition tasks remains relatively limited [7].
With the advancement of artificial intelligence, the accuracy of speech recognition has been greatly improved, and speech recognition is already applied in everyday life. For example, the WeChat app can transform voice messages into text messages with acceptable accuracy. When one receives a voice message on WeChat and is not willing to listen to it, such as in a noisy environment, one can choose to transform it into a text message in less than five seconds. Although such progress has been made, achieving high accuracy in speech recognition remains difficult for three reasons [8-10].
The first reason that constrains the accuracy of speech recognition is the extensive vocabulary. No matter what language the speech is in, the language contains at least tens of thousands of words, of which at least a few thousand are frequently used. The more words a language has, and the more words with similar pronunciations, the easier it is for one word to be wrongly recognized as another. For instance, it may be hard to distinguish “what” from “water”, “constellation” from “consternation”, “bed” from “bird”, etc. The second reason is that the pronunciations of different words may be hard to separate, especially when the speech speed is very fast. The third reason, which is even harder to cope with, is the personalized accents of different people. It is hard to find any two people who have exactly the same pronunciation for all words. Even the same person may pronounce words differently at different times, such as pronouncing words more nasally after catching a cold.
In this paper, we assume that word separation is perfect, but each word has some probability of being recognized wrongly. A success-judging criterion is suggested for a speech recognition task. In particular, two failure modes are considered for judging whether a speech recognition task is successful: 1) if too many consecutive words are recognized wrongly, the speech recognition task is regarded as failed; indeed, too many wrongly recognized words may make it difficult to grasp the accurate meaning of a speech. 2) If the first failure mode does not happen, a total score is calculated for the errors that have occurred in the speech recognition. In particular, the more consecutive words are recognized wrongly at a place, the bigger the increment added to the score. If the total score of the speech recognition is bigger than a threshold, the speech recognition task is regarded as unsuccessful.
This paper is organized as follows: in Section 2, a reliability model is proposed considering the two types of failures. In Section 3, specific numerical examples are given and a comparative analysis is carried out. In Section 4, a summary of this article is made.
2. The Modeling Framework
Consider a speech consisting of N words, where each word has a probability ${{p}_{i}}$, $i\in \{1,2,\cdots ,N\}$, $(0\le {{p}_{i}}\le 1)$ of being recognized wrongly. To start, let all ${{p}_{i}}$'s be equal and let the recognition of different words be mutually independent, so that all ${{p}_{i}}$'s are replaced by a common value $p$. In the future, this assumption can be readily relaxed to adapt to specific situations.
In the case where at least $K$ consecutive words are recognized wrongly, the speech recognition task is regarded as failed; this corresponds to the first failure mode.
In the case where the longest run of consecutive wrongly recognized words is smaller than $K$, the recognized content of a speech may still be hard to understand correctly if too many places are recognized wrongly. The more places that are recognized wrongly, and the more consecutive words that are recognized wrongly at each place, the harder it is to understand the speech correctly. In order to capture this feature, we suggest a scoring mechanism for the speech recognition task. The initial score is set to zero. For each place recognized wrongly, an increment is added to the score depending on how many consecutive words are recognized wrongly at this place.
The reliability of a speech recognition task is expressed as $R\left( N;p;s\left( 1 \right),\cdots ,s\left( N \right);S \right)$, where $s\left( i \right)$ denotes the score increment added when $i$ consecutive words are recognized wrongly at a place, and $S$ is the failure threshold for the total score.
In order to introduce the general model form of $R\left( N;p;s\left( 1 \right),\cdots ,s\left( N \right);S \right)$, let $M$ denote the position of the first correctly recognized word, i.e., the first $M-1$ words are recognized wrongly and the $M$-th word is recognized correctly.
If S
If S
If S
In particular, when $M=N+1$, i.e., when all $N$ words are recognized wrongly, $\text{pr}\left( M=N+1 \right)={{p}^{N}}.$
In the case where M=i, the conditional reliability of the speech recognition task is $R\left( N-i;p;s\left( 1 \right),\cdots ,s\left( N-i \right);S-s\left( i-1 \right) \right)$, because the run of $i-1$ wrongly recognized words adds $s\left( i-1 \right)$ to the score (with $s\left( 0 \right)=0$) and the remaining $N-i$ words constitute a smaller task with the reduced threshold.
Because $s\left( N \right)\ge S$ here, the remaining threshold $S-s\left( N \right)$ is non-positive, and thus
$R\left( 0;p;s\left( 1 \right),\cdots ,s\left( N \right);S-s\left( N \right) \right)=0$
In the case where $K=\min \left\{ i\left| s\left( i \right)\ge S,\text{ }i=1,2,\cdots ,N \right. \right\}$, the speech recognition task fails whenever at least $K$ consecutive words are recognized wrongly, because the corresponding score increment alone already reaches the threshold $S$.
Thus, $R\left( N;p;s\left( 1 \right),\cdots ,s\left( N \right);S \right)$ can be degenerated into $R\left( N;p;K;s\left( 1 \right),\cdots ,s\left( K-1 \right);S \right)$.
In short, when $s\left( K \right)\ge S$, the values $s\left( K \right),s\left( K+1 \right),\cdots ,s\left( N \right)$ no longer need to be distinguished, because any run of at least $K$ wrongly recognized words makes the task fail directly.
It is easy to see that when $i-1\ge K$, $S-s\left( i-1 \right)\le 0$ and $R\left( N-i;p;s\left( 1 \right),\cdots ,s\left( N-i \right);S-s\left( i-1 \right) \right)=0$.
As a result, we can obtain that $M$ must be $1,2,\cdots ,K$ for the corresponding term to contribute to the reliability.
Through the above analysis, Equation (2) can be simplified to Equation (3).
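Judging from the decomposition pattern used in the numerical examples of Section 3, the simplified recursion (Equation (3)) presumably takes the following form, with $s\left( 0 \right)=0$, together with the boundary condition that $R\left( 0;p;\cdot ;S \right)$ equals 1 for $S>0$ and 0 otherwise:

$R\left( N;p;s\left( 1 \right),\cdots ,s\left( N \right);S \right)=\sum\limits_{i=1}^{K}{{{p}^{i-1}}\left( 1-p \right)R\left( N-i;p;s\left( 1 \right),\cdots ,s\left( N-i \right);S-s\left( i-1 \right) \right)}$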
In Equation (3), we finally see that $R\left( N;p;s\left( 1 \right),\cdots ,s\left( N \right);S \right)$ has been decomposed into a form consisting of terms $R\left( N-i;p;s\left( 1 \right),\cdots ,s\left( N-i \right);S-s\left( i-1 \right) \right)$.
In particular, $R\left( N;p;K;s\left( 1 \right),\cdots ,s\left( K-1 \right);S \right)$ can be obtained by iteratively decomposing it. In order to illustrate the procedures, numerical examples are presented in the next section.
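To make the iterative decomposition concrete, the following is a minimal Python sketch of the recursion, based on the reconstructed form of Equation (3) above and the boundary conditions discussed in this section; the function name `reliability`, the memoisation choice, and the explicit handling of the all-words-wrong case are illustrative assumptions rather than part of the original model description.

```python
from functools import lru_cache

def reliability(N, p, s, S):
    """Reliability of a speech recognition task with N words.

    p : probability that a single word is recognized wrongly
    s : sequence where s[i-1] is the score increment for a run of i
        consecutive wrongly recognized words
    S : failure threshold for the total score
    Follows the recursive decomposition sketched in Section 2.
    """
    # K: minimal run length that causes direct failure (first failure mode)
    K = next((i for i in range(1, len(s) + 1) if s[i - 1] >= S), len(s) + 1)

    @lru_cache(maxsize=None)
    def R(n, thr):
        if thr <= 0:        # score threshold already reached: task failed
            return 0.0
        if n == 0:          # no words left and threshold not reached: success
            return 1.0
        total = 0.0
        # M = i: the first i-1 remaining words are wrong, the i-th is correct
        for i in range(1, min(K, n) + 1):
            inc = 0 if i == 1 else s[i - 2]   # score added by the run of i-1 wrong words
            total += p ** (i - 1) * (1 - p) * R(n - i, thr - inc)
        # M = n+1: all n remaining words are wrong (possible only when n < K)
        if n < K:
            total += p ** n * (1.0 if thr - s[n - 1] > 0 else 0.0)
        return total

    return R(N, S)
```

With the score arrays and thresholds of Section 3, this sketch can be used to tabulate the reliability curves discussed there.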
3. Numerical Examples
In order to get a thorough understanding of the given reliability model, we present numerical examples following two strategies. In the first strategy, we analyse the same $s(i)$ score array under different thresholds. In the second strategy, we analyse different $s(i)$ score arrays under the same threshold.
3.1 A Numerical Example Analysis of the First Strategy
Given a speech example of N=8 words, assume that $\{s(0),s(1),\cdots ,s(8)\}=\{0,10,30,60,90,120,150,180,210\}$ and that the threshold is S=100; that is, the task fails if the total score reaches S=100. Since $s(5)=120\ge 100$, the task fails if at least K=5 consecutive words are recognized wrongly.
$R\left( 8;p;10,30,60,90,120,150,180,210;100 \right)$ can be degenerated into $R\left( 8;p;5;10,30,60,90;100 \right)$. The reliability of the speech recognition task can be denoted as $R\left( 8;p;5;10,30,60,90;100 \right)$, and it can be further decomposed into Equation (4).
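Reconstructing Equation (4) from the degenerations listed in the next paragraph and the coefficients in the expressions that follow, it presumably reads:

$\begin{matrix} & R\left( 8;p;5;10,30,60,90;100 \right) \\ & =\left( 1-p \right)R\left( 7;p;10,30,60,90;100 \right)+p\left( 1-p \right)R\left( 6;p;10,30,60,90;90 \right) \\ & +{{p}^{2}}\left( 1-p \right)R\left( 5;p;10,30,60,90;70 \right)+{{p}^{3}}\left( 1-p \right)R\left( 4;p;10,30,60,90;40 \right)+{{p}^{4}}\left( 1-p \right)R\left( 3;p;10,30,60,90;10 \right) \\ \end{matrix}$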
Where $R\left( 7;p;10,30,60,90;100 \right)$ can be degenerated into $R\left( 7;p;5;10,30,60,90;100 \right)$, $R\left( 6;p;10,30,60,90;90 \right)$ can be degenerated into $R\left( 6;p;4;10,30,60;90 \right)$, $R\left( 5;p;10,30,60,90;70 \right)$ can be degenerated into $R\left( 5;p;4;10,30,60;70 \right)$, $R\left( 4;p;10,30,60,90;40 \right)$ can be degenerated into $R\left( 4;p;3;10,30;40 \right)$, and $R\left( 3;p;10,30,60,90;10 \right)$ can be degenerated into $R\left( 3;p;1;0;10 \right)$.
Substituting these degenerate forms, Equation (4) is converted as follows:
Note that $R\left( 3;p;1;0;10 \right)$ corresponds to the probability that no more words are recognized wrongly starting from the 6th word, and it equals ${{\left( 1-p \right)}^{3}}$. Thus, we have
$\begin{matrix} & R\left( 8;p;5;10,30,60,90;100 \right) \\ & =\left( 1-p \right)R\left( 7;p;5;10,30,60,90;100 \right)+p\left( 1-p \right)R\left( 6;p;4;10,30,60;90 \right) \\ & +{{p}^{2}}\left( 1-p \right)R\left( 5;p;4;10,30,60;70 \right)+{{p}^{3}}\left( 1-p \right)R\left( 4;p;3;10,30;40 \right)+{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{ } \\ \end{matrix}$
Furthermore, we have
$\begin{matrix} & R\left( 4;p;3;10,30;40 \right) \\ & =\left( 1-p \right)R\left( 3;p;3;10,30;40 \right)+p\left( 1-p \right)R\left( 2;p;2;10;30 \right)+{{p}^{2}}\left( 1-p \right)R\left( 1;p;1;0;10 \right) \\ & =\left( 1-p \right)\left( 1-{{p}^{3}} \right)+p\left( 1-p \right)\left( 1-{{p}^{2}} \right)+{{p}^{2}}{{\left( 1-p \right)}^{2}}\text{ } \\ \end{matrix}$
$\begin{matrix} & R\left( 5;p;4;10,30,60;70 \right) \\ & =\left( 1-p \right)R\left( 4;p;4;10,30,60;70 \right)+p\left( 1-p \right)R\left( 3;p;3;10,30;60 \right)+{{p}^{2}}\left( 1-p \right)R\left( 2;p;3;10,30;40 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 1;p;1;0;10 \right) \\ & =\left( 1-p \right)\left( 1-{{p}^{4}} \right)\text{+}p\left( 1-p \right)\left( 1-{{p}^{3}} \right)\text{+}{{p}^{2}}\left( 1-p \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\text{ } \\ \end{matrix}$
$\begin{matrix} & R\left( 6;p;4;10,30,60;90 \right) \\ & =\left( 1-p \right)R\left( 5;p;4;10,30,60;90 \right)+p\left( 1-p \right)R\left( 4;p;4;10,30,60;80 \right)+{{p}^{2}}\left( 1-p \right)R\left( 3;p;3;10,30;60 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 2;p;2;10;30 \right) \\ & =\left( 1-p \right)\left( 1-{{p}^{5}}-2\left( 1-p \right){{p}^{4}} \right)\text{+}p\left( 1-p \right)\left( 1-{{p}^{4}} \right)+{{p}^{2}}\left( 1-p \right)\left( 1-{{p}^{3}} \right)+{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{2}} \right)\text{ } \\ \end{matrix}$
$\begin{matrix} & R\left( 7;p;5;10,30,60,90;100 \right) \\ & =\left( 1-p \right)R\left( 6;p;5;10,30,60,90;100 \right)+p\left( 1-p \right)R\left( 5;p;4;10,30,60;90 \right)+{{p}^{2}}\left( 1-p \right)R\left( 4;p;4;10,30,60;70 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 3;p;3;10,30;40 \right)+{{p}^{4}}\left( 1-p \right)R\left( 2;p;1;0;10 \right)\text{ } \\ & \text{=}{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}+{{p}^{4}}-3{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-{{p}^{4}} \right) \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-3{{p}^{4}} \right)\text{+}{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{3}} \right)\text{+}{{p}^{4}}{{\left( 1-p \right)}^{3}}\text{ } \\ \end{matrix}$
Lastly, the reliability of the speech recognition task can be expressed as follows:
$\begin{matrix} & R\left( 8;p;5;10,30,60,90;100 \right) \\ & ={{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}+{{p}^{3}}+{{p}^{4}}-3{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-3{{p}^{4}} \right) \\ & \text{+}{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{3}} \right)\text{+}{{p}^{4}}{{\left( 1-p \right)}^{3}}+p\left( 1-p \right)\left( 1-p \right)\left( 1-{{p}^{5}}-2\left( 1-p \right){{p}^{4}} \right)\text{+}p\left( 1-p \right)\left( 1-{{p}^{4}} \right)+{{p}^{2}}\left( 1-p \right)\left( 1-{{p}^{3}} \right) \\ & +{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{2}} \right)+{{p}^{2}}\left( 1-p \right)\left( 1-p \right)\left( 1-{{p}^{4}} \right)\text{+}p\left( 1-p \right)\left( 1-{{p}^{3}} \right)\text{+}{{p}^{2}}\left( 1-p \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}} \\ & +{{p}^{3}}\left( 1-p \right)\left( 1-p \right)\left( 1-{{p}^{3}} \right)+p\left( 1-p \right)\left( 1-{{p}^{2}} \right)+{{p}^{2}}{{\left( 1-p \right)}^{2}}+{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{ } \\ \end{matrix}$
Figure 1 shows the reliability of a speech recognition task with N=8 words and threshold S=100 as a function of p, the probability of wrongly recognizing a word. It can be seen that the smaller the probability for each word to be recognized wrongly, the greater the reliability. This is consistent with intuition.
Figure 1. Reliability of the speech recognition task (N=8, S=100)
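As an illustration of how the curve in Figure 1 can be tabulated, the following snippet evaluates the `reliability` sketch assumed at the end of Section 2 on the first example; the sampled p values are arbitrary.

```python
# First numerical example: s(1)..s(8) = 10, 30, ..., 210 and threshold S = 100
s = (10, 30, 60, 90, 120, 150, 180, 210)
for p in (0.05, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"p = {p:.2f}   R = {reliability(8, p, s, 100):.4f}")
```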
In order to further analyse the properties of the reliability, we choose different thresholds for comparison. For the above numerical example, we keep the values of N and {s(1), ..., s(8)} unchanged and vary only the threshold S.
Based on the above example, we assume that the other conditions are constant, and the threshold value becomes S=90.
That is to say, the speech recognition task fails if the total score reaches S=90. Thus, the speech recognition task fails if at least K=4 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;90 \right)$ can be degenerated into $R\left( 8;p;4;10,30,60;90 \right)$. We can obtain the reliability formula when the threshold value S=90.
$\begin{matrix} & R\left( 8;p;4;10,30,60;90 \right) \\ & =\left( 1-p \right)R\left( 7;p;4;10,30,60;90 \right)+p\left( 1-p \right)R\left( 6;p;4;10,30,60;80 \right)+{{p}^{2}}\left( 1-p \right)R\left( 5;p;3;10,30;60 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 4;p;2;10;30 \right)\text{ } \\ & \text{=}{{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-2{{p}^{4}}-2{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}-{{p}^{3}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1+p-{{p}^{2}} \right)\text{+}p{{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{4}} \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{3}} \right) \\ & \text{+}{{p}^{4}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{2}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1+p+{{p}^{2}}-{{p}^{3}} \right)+{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{3}} \right)+{{p}^{4}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{2}} \right) \\ & +{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1+p-{{p}^{2}} \right)+{{p}^{4}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{2}} \right)\text{ } \\ \end{matrix}$
Similarly, assume the threshold value becomes S=80, that is, the speech recognition task fails if the total score reaches S=80. Thus, the speech recognition task fails if at least K=4 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;80 \right)$ can be degenerated into $R\left( 8;p;4;10,30,60;80 \right)$. We can get the following formula when the threshold value S=80:
$\begin{matrix} & R\left( 8;p;4;10,30,60;80 \right) \\ & =\left( 1-p \right)R\left( 7;p;4;10,30,60;80 \right)+p\left( 1-p \right)R\left( 6;p;4;10,30,60;70 \right)+{{p}^{2}}\left( 1-p \right)R\left( 5;p;3;10,30;50 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 4;p;2;10;20 \right)\text{ } \\ & \text{=}{{\left( 1-p \right)}^{2}}\left( 1+p+{{p}^{2}}+{{p}^{3}}-3{{p}^{4}}-8{{p}^{5}}+7{{p}^{6}} \right)\text{+}p\left( 1-p \right)\left( 1-9{{p}^{4}}+12{{p}^{5}}-4{{p}^{6}} \right)\text{+}{{p}^{2}}\left( 1-p \right)\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right) \\ & \text{+}{{p}^{3}}\left( 1-p \right)\left( 1-6{{p}^{2}}+8{{p}^{3}}-3{{p}^{4}} \right) \\ \end{matrix}$
Suppose the threshold value S=70 and the other conditions do not change. The speech recognition task fails if the total score reaches S=70. Thus, the speech recognition task fails if at least K=4 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;70 \right)$ can be degenerated into $R\left( 8;p;4;10,30,60;70 \right)$. We can get the following formula when the threshold value S=70:
$\begin{matrix} & R\left( 8;p;4;10,30,60;70 \right) \\ & =\left( 1-p \right)R\left( 7;p;4;10,30,60;70 \right)+p\left( 1-p \right)R\left( 6;p;3;10,30;60 \right)+{{p}^{2}}\left( 1-p \right)R\left( 5;p;3;10,30;40 \right) \\ & +{{p}^{3}}\left( 1-p \right)R\left( 4;p;1;0;10 \right)\text{ } \\ & \text{=}{{\left( 1-p \right)}^{2}}\left( 1-9{{p}^{4}}+12{{p}^{5}}-4{{p}^{6}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{5}}\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-2{{p}^{3}}+{{p}^{4}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)+{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{+}{{p}^{3}}{{\left( 1-p \right)}^{5}} \\ \end{matrix}$
Suppose the threshold value S=60 and the other conditions do not change. The speech recognition task fails if the total score reaches S=60. Thus, the speech recognition task fails if at least K=3 consecutive words are recognized wrongly.
$R\left( 8;p;10,30,60,90,120,150,180,210;60 \right)$ can be degenerated into $R\left( 8;p;3;10,30;60 \right)$. We can get the following formula when the threshold value S=60:
$\begin{matrix} & R\left( 8;p;3;10,30;60 \right) \\ & \text{=}{{\left( 1-p \right)}^{3}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{3}}\left( 1-2{{p}^{3}}+{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right) \\ & \text{+}p{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right) \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)+{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right) \\ \end{matrix}$
Suppose the threshold value S=50 and the other conditions do not change. The speech recognition task fails if the total score reaches S=50. Thus, the speech recognition task fails if at least K=3 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;50 \right)$ can be degenerated into $R\left( 8;p;3;10,30;50 \right)$. We can get the following formula when the threshold value S=50:
$\begin{matrix} & R\left( 8;p;3;10,30;50 \right) \\ & \text{=}{{\left( 1-p \right)}^{3}}\left( 1-3{{p}^{3}}+{{p}^{4}}+{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{3}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right) \\ & \text{+}p{{\left( 1-p \right)}^{3}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)+{{p}^{3}}{{\left( 1-p \right)}^{5}}\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-6{{p}^{2}}+8{{p}^{3}}-3{{p}^{4}} \right) \\ & \text{+}p{{\left( 1-p \right)}^{3}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)+{{p}^{3}}{{\left( 1-p \right)}^{5}} \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)+2{{p}^{3}}{{\left( 1-p \right)}^{5}} \\ \end{matrix}$
Suppose the threshold value S=40 and the other conditions do not change. The speech recognition task fails if the total score reaches S=40. Thus, the speech recognition task fails if at least K=3 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;40 \right)$ can be degenerated into $R\left( 8;p;3;10,30;40 \right)$. We can get the following formula when the threshold value S=40:
$\begin{matrix} & R\left( 8;p;3;10,30;40 \right) \\ & \text{=}{{\left( 1-p \right)}^{4}}\left( 1-4{{p}^{3}}+3{{p}^{4}} \right)\text{+}p{{\left( 1-p \right)}^{4}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{6}} \\ & \text{+}p{{\left( 1-p \right)}^{4}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{4}}\left( 1-{{p}^{2}} \right)+{{p}^{3}}{{\left( 1-p \right)}^{6}} \\ & \text{+}p{{\left( 1-p \right)}^{4}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+2}{{p}^{2}}{{\left( 1-p \right)}^{4}}\left( 1-{{p}^{2}} \right)+{{p}^{3}}{{\left( 1-p \right)}^{5}} \\ & +{{p}^{2}}{{\left( 1-p \right)}^{5}}\text{+}p{{\left( 1-p \right)}^{4}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+3}{{p}^{2}}{{\left( 1-p \right)}^{4}}\left( 1-{{p}^{2}} \right) \\ & +2{{p}^{3}}{{\left( 1-p \right)}^{5}}+{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{2}} \right)+{{p}^{2}}{{\left( 1-p \right)}^{6}} \\ \end{matrix}$
Suppose the threshold value S=30 and the other conditions do not change. The speech recognition task fails if the total score reaches S=30. Thus, the speech recognition task fails if at least K=2 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;30 \right)$ can be degenerated into $R\left( 8;p;2;10;30 \right)$. Therefore, we get the reliability formula as follows when the threshold value S=30:
$\begin{matrix} & R\left( 8;p;2;10;30 \right) \\ & \text{=}{{\left( 1-p \right)}^{5}}\left( 1-2{{p}^{2}}+{{p}^{3}} \right)\text{+4}p{{\left( 1-p \right)}^{5}}\left( 1-{{p}^{2}} \right)+3{{p}^{2}}{{\left( 1-p \right)}^{6}}\text{+2}{{p}^{2}}{{\left( 1-p \right)}^{4}}\left( 1-{{p}^{2}} \right) \\ & +p{{\left( 1-p \right)}^{5}}\left( 1-{{p}^{2}} \right)+3{{p}^{2}}{{\left( 1-p \right)}^{6}}+{{p}^{2}}{{\left( 1-p \right)}^{4}}\left( 1-{{p}^{2}} \right) \\ \end{matrix}$
Suppose the threshold value S=20 and the other conditions do not change. The speech recognition task fails if the total score reaches S=20. Thus, the speech recognition task fails if at least K=2 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;20 \right)$ can be degenerated into $R\left( 8;p;2;10;20 \right)$. Thus, we get the reliability formula as follows when the threshold value S=20:
$\begin{matrix} & R\left( 8;p;2;10;20 \right) \\ & \text{=}{{\left( 1-p \right)}^{5}}\left( 1-3{{p}^{2}}+2{{p}^{3}} \right)+5p{{\left( 1-p \right)}^{7}} \\ \end{matrix}$
Suppose the threshold value S=10 and the other conditions do not change. The speech recognition task fails if the total score reaches S=10. Thus, the speech recognition task fails if at least K=1 consecutive words are recognized wrongly. $R\left( 8;p;10,30,60,90,120,150,180,210;10 \right)$ can be degenerated into $R\left( 8;p;1;0;10 \right)$. Therefore, we get the reliability formula as follows when the threshold value S=10:
$R\left( 8;p;1;0;10 \right)\text{=}{{\left( 1-p \right)}^{8}}$
Figure 2 compares the reliability of a speech recognition task consisting of N=8 words under 10 different thresholds, where S={10,20,30,40,50,60,70,80,90,100}. We can see that as the threshold decreases, the reliability of the speech recognition task decreases. When S=10, the reliability declines fastest as p increases. When the threshold is very small, only a few consecutive wrong words are allowed, so the reliability is smaller. By comparing these 10 curves, it can be seen that the reliability under the threshold S=100 is the largest.
Figure 2. Comparison of the reliability of speech recognition tasks under 10 different thresholds
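The ten curves compared in Figure 2 can be tabulated in the same way by varying the threshold; a minimal sketch, again relying on the `reliability` helper assumed in Section 2 and arbitrary sample values of p:

```python
s = (10, 30, 60, 90, 120, 150, 180, 210)
for S in range(10, 101, 10):          # thresholds S = 10, 20, ..., 100
    values = [reliability(8, p, s, S) for p in (0.1, 0.2, 0.3, 0.4, 0.5)]
    print(S, [round(v, 4) for v in values])
```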
Above is a comparative analysis of the same $s(i)$ score array under different thresholds, where $\{s(i)\}=\{s(0),s(1),\cdots ,s(8)\}=\{0,10,30,60,90,120,150,180,210\}$.
In order to further study the reliability of speech recognition tasks, we try to analyse the conditions under the same threshold and different $s(i)$ score arrays.
3.2 A Numerical Example Analysis of the Second Strategy
Below, we analyse the conditions under the same threshold and different $s(i)$ score arrays. The above score array {$s(i)$}={s(0), s(1), ..., s(8)}={0, 10, 30, 60, 90, 120, 150, 180, 210} is regarded as the first type of score function.
Suppose the first numerical example is modified so that $s(i)$ is a linear function, where $s(i)=22.5i$; then {s(1), s(2), s(3), s(4), s(5)} = {22.5, 45, 67.5, 90, 112.5}, the threshold remains S=100, and this is taken as the second type.
$R\left( 8;p;22.5,45,67.5,90;100 \right)$ can be degenerated into $R\left( 8;p;5;22.5,45,67.5,90;100 \right)$. The reliability of the speech recognition task can be expressed as follows:
$\begin{matrix} & R\left( 8;p;5;22.5,45,67.5,90;100 \right) \\ & \text{=}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{3}} \right)\left( 1+{{p}^{2}} \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{5}}+2{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{+2}p{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{4}} \right) \\ & +4{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{2}} \right)+2{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{3}} \right)+4{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{+}p{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{4}} \right) \\ & +2{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{3}} \right)+3{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{2}} \right)+4{{p}^{4}}{{\left( 1-p \right)}^{4}}+{{p}^{2}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{3}} \right) \\ & +2{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{2}} \right)+3{{p}^{4}}{{\left( 1-p \right)}^{4}}+{{p}^{3}}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{2}} \right)+3{{p}^{4}}{{\left( 1-p \right)}^{4}} \\ \end{matrix}$
Next, we analyse the third type. Suppose that $s(i)$ is a quadratic function, where $s(i)=5.625{{(i)}^{2}}$; in that way, {s(1), s(2), s(3), s(4), s(5)} = {5.625, 22.5, 50.625, 90, 140.625}, and the threshold remains S=100.
$R\left( 8;p;5.625,22.5,50.625,90,140.625;100 \right)$ can be degenerated into $R\left( 8;p;5;5.625,22.5,50.625,90;100 \right)$. Therefore, we get the reliability formula as follows:
$\begin{matrix} & R\left( 8;p;5;5.625,22.5,50.625,90;100 \right) \\ & \text{=}{{\left( 1-p \right)}^{3}}\left( 1-{{p}^{5}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{4}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{5}} \right) \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{4}} \right)\text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{3}} \right)\text{+}{{p}^{4}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{2}} \right)\text{+}p{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{5}} \right) \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{4}} \right)\text{+}{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{3}} \right)\text{+}{{p}^{2}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{4}} \right)\text{+}{{p}^{3}}\left( 1-p \right)\left( 1-{{p}^{3}} \right) \\ & \text{+}{{p}^{3}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{3}} \right)\text{+}{{p}^{4}}\left( 1-p \right)\left( 1-{{p}^{2}} \right)\text{+}{{p}^{4}}{{\left( 1-p \right)}^{2}}\left( 1-{{p}^{2}} \right)\text{+}{{p}^{5}}{{\left( 1-p \right)}^{3}} \\ \end{matrix}$
Finally, we analyse the fourth type. Assume that $s(i)$ is a square root function, where $s(i)=45{{(i)}^{\frac{1}{2}}}$, so that {s(1), s(2), s(3), s(4), s(5)} = {45, 63.6396, 77.9423, 90, 100.6231}, and the threshold remains S=100.
$R\left( 8;p;45,63.6396,77.9423,90,100.6231;100 \right)$ can be degenerated into $R\left( 8;p;5;45,63.6396,77.9423,90;100 \right)$. The reliability of the speech recognition task can be denoted as:
$\begin{matrix} & R\left( 8;p;5;45,63.6396,77.9423,90;100 \right) \\ & \text{=}{{\left( 1-p \right)}^{4}}\text{+3}p{{\left( 1-p \right)}^{5}}\left( 1-{{p}^{2}} \right)\text{+9}{{p}^{2}}{{\left( 1-p \right)}^{6}}\text{+3}{{p}^{3}}{{\left( 1-p \right)}^{5}} \\ & \text{+3}{{p}^{4}}{{\left( 1-p \right)}^{4}}\text{+}p{{\left( 1-p \right)}^{5}}\left( 1-{{p}^{2}} \right)\text{+4}{{p}^{2}}{{\left( 1-p \right)}^{6}} \\ & \text{+}{{p}^{2}}{{\left( 1-p \right)}^{6}}\text{+}{{p}^{3}}{{\left( 1-p \right)}^{5}}\text{+}{{p}^{4}}{{\left( 1-p \right)}^{4}} \\ \end{matrix}$
Figure 3 shows the curves for the four types of score functions. The first curve corresponds to the $s(i)$ values of the first numerical example, where $\{s\left( i \right)\}=\{s(0),s(1),s(2),s(3),s(4),s(5),s(6),s(7),s(8)\}=\{0,10,30,60,90,120,150,180,210\}$. These four curves represent different situations: although all of them are increasing, some increase faster and faster, some increase slower and slower, and some increase at a constant rate.
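A short sketch of how the four score arrays plotted in Figure 3 can be generated, using the coefficients stated above; the labels and the evaluation at p = 0.2 are illustrative assumptions only.

```python
score_arrays = {
    "type 1 (first example)": (10, 30, 60, 90, 120, 150, 180, 210),
    "type 2 (linear)":        tuple(22.5 * i for i in range(1, 9)),
    "type 3 (quadratic)":     tuple(5.625 * i ** 2 for i in range(1, 9)),
    "type 4 (square root)":   tuple(45 * i ** 0.5 for i in range(1, 9)),
}
for name, s in score_arrays.items():
    # Reliability for N = 8 words and threshold S = 100, via the Section 2 sketch
    print(name, round(reliability(8, 0.2, s, 100), 4))
```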
Figure 3. Four types of score functions
Figure 4 shows the reliability curves for the four types of score functions given the same threshold. It can be seen from the graph that the fourth type of speech recognition task has the minimum reliability. That is because the values of the fourth type of function $s(i)=45{{(i)}^{\frac{1}{2}}}$ are greater than those of the other types from $s(1)$ to $s(3)$. Actually, a run of consecutively wrongly recognized words is more likely to contain only a few words, so the values of $s(1)$ to $s(3)$ may have a dominant effect on the task reliability. The reliability of the third type of speech recognition task is the largest. That is because the values of the third type of function $s(i)=5.625{{(i)}^{2}}$ are smaller than those of the other types from $s(1)$ to $s(3)$. It can be seen that the reliability of the speech recognition task is related to both the threshold value and the values of $s(i)$.
Figure 4. Reliability curves for the four types of score functions
4. Conclusions
The reliability of speech recognition tasks is discussed in this paper, and a reliability model for speech recognition task failure is obtained by analysis. In particular, a success criterion is proposed based on two types of failure modes. If too many consecutive words are recognized wrongly in the speech recognition task, the task is deemed failed. Otherwise, a total score is calculated for the speech recognition task: for every group of consecutive words recognized wrongly, an increment is added to the score, and the more consecutive words are recognized wrongly, the bigger the increment. The task is regarded as failed if the score is bigger than a threshold. An iterative approach is proposed to evaluate the reliability of the speech recognition task, and numerical examples are presented to illustrate its application.
In the future, the error of word separation can be taken into account. The words can be classified into different types based on the difficulty of recognition and the influence on speech understanding. Also, the reliability of different speech recognition technologies can be investigated. Finally, historical training data of realistic speech recognition tasks can be analysed to assess the probability of wrongly recognizing different types of words and their influence on speech understanding. This information can be further utilized to predict the reliability of new speech recognition tasks.
Acknowledgements
This research was supported by the National Natural Science Foundation of China (No.71671016) and the Fundamental Research Funds for the Central Universities (No. FRF-GF-17-B14). The suggestions from reviewers are very much appreciated.
References
General Hybrid Framework for Uncertainty-Decoding-based Automatic Speech Recognition Systems
Importance Measures for Optimal Structure in Linear Consecutive-K-out-of-N Systems
A New Framework for Robust Speech Recognition in Complex Channel Environments
Monaural Multi-Talker Speech Recognition Using Factorial Speech Processing Models
Recognition of High Frequency Words from Speech as a Predictor of L2 Listening Comprehension
Effects of Cognitive Load on Speech Recognition
Efficient Multicut Enumeration of K-out-of-N:F and Consecutive K-out-of-N:F Systems
Joint Evaluation of Multiple Speech Patterns for Speech Recognition and Training
Joint Estimation of Confidence and Error Causes in Speech Recognition
Error Detection and Accuracy Estimation in Automatic Speech Recognition Using Deep Bidirectional Recurrent Neural Networks
Evaluation of an Automated Speech-Controlled Listening Test with Spontaneous and Read Responses
Element Maintenance and Allocation for Linear Consecutively Connected Systems
Robust Speech Recognition by Integrating Speech Separation and Hypothesis Testing
Optimal Element Loading for Linear Sliding Window Systems
Combined m-Consecutive and K-out-of-N Sliding Window Systems