  1. (International Business College, Qingdao Huanghai University, Qingdao, 266427, China)

Recurrent neural network, Posterior probability, English interpretation, Automatic scoring, Deep neural network

1. Introduction

In the automatic scoring of English interpretation, scores are based on the recognized content and on pronunciation quality. The logarithmic posterior probability has become the standard measure in pronunciation quality tests. The traditional posterior probability calculation takes all pronunciation factors into consideration, so the score is not targeted. In addition, the emphasis of the traditional posterior probability differs from that of manual scoring, and the correlation between the two scoring methods is low [1]. Therefore, the main work of this study is to apply methods that measure pronunciation quality and speech recognition to the automatic scoring of English interpretation. A deep neural network (DNN) is combined with the traditional hidden Markov model (HMM) framework for speech recognition [2], and the logarithmic posterior probability is extracted under the DNN framework as a feature for scoring English interpretation. The extraction of posterior probability features depends strongly on recognition performance. If the recognition result is wrong, it is impossible to judge whether the error comes from the interpreter’s pronunciation or from the model, so the posterior probability features lose their significance [3]. Therefore, in order to improve recognition accuracy, a recurrent neural network (RNN) language model is used to re-estimate the scores of the candidate recognition results, and the sentence with the largest re-estimated score is selected as the result. The innovation of this study is to consider the impact of recognition results on the feature calculation: the RNN language model is used in re-estimation to select more suitable recognition results and to improve the correlation between posterior probability features and manual scores. The purpose is to make the automatic scoring of English interpretation more accurate and better grounded.

2. Related Work

Automatic scoring of English interpretation uses machine speech recognition technology to score the interpreter’s content, vocabulary, grammar, and fluency. With the development of deep learning technology, speech recognition is now widely used. Inspired by DNN technology, Sun et al. proposed a DNN-decision-tree support vector machine (SVM). Their model constructs a decision-tree SVM structure by calculating the degree of confusion between emotions in speech signals, and uses DNNs to extract features for the different emotions and train the classifiers in the decision tree to obtain a speech emotion classification [4]. Experimental results showed that the emotion recognition accuracy of the method improved by 6.25% and 2.91% compared with traditional SVM and DNN-SVM classification methods, respectively, indicating that the method can effectively extract features from easily confused emotions. Jiang et al. proposed mask estimation based on a DNN. In this method, the DNN is used to estimate the time-frequency mask, and the covariance matrices of the target speech and the noise are calculated [5]. Finally, the beamformer coefficients are obtained by generalized eigenvector decomposition. Speech recognition experiments were conducted using the CHiME4 data set, and the results showed that their method achieved good results in speech recognition error rate and speech quality. Seki et al. proposed an acoustic model combining a DNN and a filterbank layer, then used hierarchical feature extraction to train the model [6]. In an experiment with limited adaptation data, the trained model could effectively recognize the speaker’s content. Compared with other models, the false recognition rate of the model after 10 utterances was reduced by 5.8%.
Sone and Nakashika proposed a DNN-based statistical parametric speech synthesis system that describes the joint distribution of acoustic and linguistic features using a deep relational model; because the two variables in the model can be bidirectionally dependent, the deep architecture of the model is optimized [7]. Experiments showed that their method performs better even when the training data are limited, and that it provides higher accuracy for speech synthesis when the experimental parameters are the same. Praseetha and Vadivel compared the performance of feed-forward deep neural networks and recurrent neural networks in speech emotion recognition. The experimental results showed the DNN model had 89.96% accuracy in speech emotion recognition, while the RNN model had 95.82% accuracy [8]. Therefore, the RNN model had better accuracy and robustness in speech recognition.

Prafianto et al. proposed a scoring method that uses parametric speech synthesis, in which unevaluated features are replaced by the characteristics of the speaker, and the speaker’s characteristics are synthesized from parameters [9]. The proposed method was analyzed with an automatic pronunciation evaluation system, and experimental results showed that it improved scoring reliability: the predicted pronunciation score matched the manual score, and the human-machine correlation coefficient reached 0.87, compared with 0.74 for the traditional scoring method. Liu et al. combined a machine learning algorithm with image recognition technology to propose a candidate-region character extraction model based on MSER for intelligent scoring of English compositions [10]. To verify the feasibility of the model, the researchers input the basic conditions of composition scoring into the model as constraints, and experimental results showed that the method is effective in practice. Pribadi et al. proposed a maximum marginal correlation method and a sentence similarity method to improve an automatic short answer scoring system [11]. Experimental results showed the accuracy of the proposed method in generating reference answers reached 91.95%, and the root mean square error of similarity between reference answers and students’ answers was 0.884, indicating good performance. Gaillat et al. implemented a supervised learning approach that serves as a frame of reference for English writers in an automated essay scoring system. The researchers used the system to find related concepts in a language system [12]. Results on an internal data set showed that the system helps classify writing levels, and classifies effectively on sentence, vocabulary, and other features.
Zhang analyzed a method of feature selection in an automatic scoring system, and used a multiple regression method to evaluate the final score [13]. Controlled experiments showed that the method is effective, and a computer scoring system for college English translation was built.

Based on the above analyses, speech recognition technology can extract sentence emotion in the development of DNN models and has been widely used, but its recognition accuracy still has room to improve. While automatic marking technology is often used for English writing, it is rarely used for oral English or interpretations. Therefore, two DNN algorithms are constructed to identify and score English interpreters in order to further improve the performance of automatic scoring.

3. Automatic Scoring for English Interpreting

3.1 Speech Recognition based on an HMM Model

The automatic scoring model for English interpretation built in this study is essentially a process of converting speech into text and scoring the content. The specific process is shown in Fig. 1.

In Fig. 1, speech recognition technology decodes speech signals to generate text. This process can be transformed into a mathematical problem by using the Bayesian statistical modeling framework, and the expression of this mathematical problem is shown in Formula (1) [14-16]:

$ \hat{W}=\underset{W}{argmax}p\left(W\left| O\right.\right)=\underset{W}{argmax}\frac{p\left(O\left| W\right.\right)p\left(W\right)}{p\left(O\right)} $
Fig. 1. Structure of the Automatic Scoring Model for English Interpretations.

In Formula (1), $O$ represents the acoustic feature vector sequence of the input speech; $W$ represents the word sequence corresponding to the acoustic features; $p(W\left| O\right.)$ represents the posterior probability, defined as the probability of a word sequence given a particular acoustic feature sequence; and $p(O\left| W\right.)$ represents the matching degree between the acoustic features and the word sequence, which is called the acoustic model (AM). $p(W)$ represents the prior probability of the word sequence in the text (the language model), and $p(O)$ is a constant term independent of the word sequence. At present, the AM is constructed using the HMM, whose states cannot be observed directly and can only be inferred from the observed vectors [17-19]. Each observed vector in the HMM is governed by a certain probability density distribution. The DNN model has an advantage in obtaining the HMM probability density distribution by using the Bayesian formula. The HMM parameters are denoted $\phi $, the observed vectors are denoted $O$, and the observation sequence is $O=\left\{o_{1},o_{2},\cdot \cdot \cdot ,o_{T}\right\}$. Solving the model is then transformed into computing the likelihood of the observations, expressed as $p(O\left| \phi \right.)$. The likelihood can be computed recursively by the forward algorithm, and the Viterbi algorithm, which keeps only the most likely state sequence, can further simplify the calculation of $p(O\left| \phi \right.)$. The solution equation for the HMM state sequence is Formula (2):

$ \hat{S}=\underset{S}{argmax}p\left(O,S\left| \phi \right.\right) $

In Formula (2), $\hat{S}$ represents the state sequence. The calculation of the AM is Formula (3):

$ p\left(O\left| W\right.\right)=\sum _{S}a_{s\left(0\right)s\left(1\right)}\prod _{t=1}^{T}b_{s\left(t\right)}\left(o_{t}\right)a_{s\left(t\right)s\left(t+1\right)} $

In Formula (3), $b_{s(t)}(o_{t})$ represents the output probability density of state $s(t)$ for observation $o_{t}$; $s(t)$ indicates the state at time $t$; and $a_{s(t)s(t+1)}$ indicates the probability of transition from state $s(t)$ to state $s(t+1)$. If $\phi _{j}(t)$ denotes the maximum likelihood score of the observed vectors up to time $t$ for paths ending in state $j$, Formula (4) can be obtained according to the recursion formula:

$ \phi _{j}\left(t\right)=\max _{i}\left\{\phi _{i}\left(t-1\right)a_{ij}\right\}b_{j}\left(o_{t}\right) $

In Formula (4), $\phi _{j}(t)$ is the maximum likelihood score of being in state $j$ at time $t$. Underflow of the likelihood value can be prevented by taking the logarithm of the likelihood score. By solving the constructed AM for the maximum likelihood score, speech recognition can be achieved.
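
The recursion in Formula (4), taken in the log domain to avoid underflow, can be sketched as follows. This is a minimal illustration, not the decoder used in the study: the function name `viterbi_log`, the initial-state log probabilities `log_init`, and the toy two-state HMM are assumptions introduced here for the example.

```python
import math


def viterbi_log(log_init, log_trans, log_emit):
    """Return (best_log_score, best_state_path) for one observation sequence.

    log_init[j]     : log probability of starting in state j (assumed input)
    log_trans[i][j] : log a_ij
    log_emit[t][j]  : log b_j(o_t)
    """
    n_states = len(log_init)
    # phi[j] is the log of the best partial-path score ending in state j.
    phi = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_phi, ptr = [], []
        for j in range(n_states):
            # Formula (4) in logs: max_i {phi_i(t-1) + log a_ij} + log b_j(o_t)
            best_i = max(range(n_states), key=lambda i: phi[i] + log_trans[i][j])
            new_phi.append(phi[best_i] + log_trans[best_i][j] + log_emit[t][j])
            ptr.append(best_i)
        phi = new_phi
        backptr.append(ptr)
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda j: phi[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return phi[last], path


def _log_matrix(rows):
    return [[math.log(p) for p in row] for row in rows]


# Hypothetical two-state HMM and three observations, for illustration only.
toy_init = [math.log(0.6), math.log(0.4)]
toy_trans = _log_matrix([[0.7, 0.3], [0.4, 0.6]])
toy_emit = _log_matrix([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])
toy_score, toy_path = viterbi_log(toy_init, toy_trans, toy_emit)
```

Working in logs turns the products of Formula (4) into sums, so long utterances do not drive the score to zero in floating point.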

3.2 Construction of the DNN Model

The basic speech recognition method is obtained by constructing the AM, but the pronunciation state is not reflected in the speech recognition task, so the DNN model is introduced into the HMM. Suppose the constructed DNN model has $L$ hidden layers (HLs) and one output layer (OL). For an input $o$, the DNN outputs $p(s\left| o\right.)$. The DNN model’s process is shown in Formula (5):

$ \left\{\begin{array}{l} b^{0}=o\\ a^{l}=W^{l}b^{l-1}+bias^{l}\\ b^{l}=\sigma \left(a^{l}\right)\\ b_{s}^{L+1}=p\left(s\left| o\right.\right)=softmax\left(a^{L+1}\right)=e^{a_{s}^{L+1}}/\sum _{s'}e^{a_{s'}^{L+1}} \end{array}\right. $

In Formula (5), $W^{l}$ represents the weight matrix of layer $l$; $bias^{l}$ indicates the bias of layer $l$; and $\sigma $ represents the sigmoid function used in the HLs, while the OL uses the softmax function. This study uses the cross-entropy criterion to update the DNN model. Stochastic gradient descent (SGD) is used, with each update computed from a subset of the training data. Due to the strong randomness of SGD, the gradient updates fluctuate considerably. Therefore, a momentum factor can be introduced to reduce the fluctuation caused by randomness, and a weight decay factor can be added at the same time to avoid overfitting by penalizing weights that are too large. In the DNN model, the matrix multiplications of the SGD process take a lot of time; the parallel computing capability of a graphics processor can be used to reduce it. In the HMM, the probability density distribution is expressed in Formula (6):

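$ b_{i}\left(o\right)=p\left(o\left| s_{i}\right.\right)=\sum _{k=1}^{K}\pi _{ki}N\left(o\left| \mu _{ki},\Sigma _{ki}\right.\right) $

In Formula (6), $\pi _{ki}$ represents the weight of the $k$-th Gaussian component in state $s_{i}$; $N(o\left| \mu _{ki},\Sigma _{ki}\right.)$ represents a Gaussian density with mean vector $\mu _{ki}$ and covariance matrix $\Sigma _{ki}$; and $p(o\left| s_{i}\right.)$ is the output probability density of state $s_{i}$, which is the quantity supplied by the DNN in the combined model. The Bayesian formula is used to transform the DNN output so that Formula (7) is obtained: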

$ p\left(o\left| s_{i}\right.\right)=p\left(s_{i}\left| o\right.\right)p\left(o\right)/p\left(s_{i}\right) $

In Formula (7), $p(s_{i}\left| o\right.)$ is the posterior probability of state $s_{i}$ output by the DNN for the input vector $o$; $p(s_{i})$ represents the prior probability of the HMM state, which can be obtained from the training set; and $p(o)$ is a constant. The DNN constructed with Formula (7) and the DNN architecture applied to the HMM are shown in Fig. 2.
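
As a rough illustration of Formulas (5) and (7), the sketch below runs a forward pass through sigmoid hidden layers and a softmax output layer, then converts the state posteriors into scaled log-likelihoods by subtracting the log priors (dropping the constant $p(o)$). The function names, the list-of-rows weight layout, and the tiny network `toy_layers` are assumptions for the example; the network in the study is far larger.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def affine(weights, bias, inputs):
    # a^l = W^l b^{l-1} + bias^l, with W^l stored as a list of rows.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def softmax(values):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def state_posteriors(layers, o):
    """layers is a list of (W, bias) pairs; the last pair is the output layer."""
    b = o
    for weights, bias in layers[:-1]:
        b = [sigmoid(a) for a in affine(weights, bias, b)]
    out_weights, out_bias = layers[-1]
    return softmax(affine(out_weights, out_bias, b))


def scaled_log_likelihoods(posteriors, priors):
    # Formula (7) without the constant p(o): ln p(s_i|o) - ln p(s_i).
    return [math.log(p) - math.log(q) for p, q in zip(posteriors, priors)]


# Hypothetical network: 2 inputs -> 2 hidden units -> 3 HMM states.
toy_layers = [
    ([[0.2, -0.4], [0.7, 0.1]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8]], [0.0, 0.0, 0.0]),
]
toy_post = state_posteriors(toy_layers, [0.5, -1.0])
toy_loglik = scaled_log_likelihoods(toy_post, [0.5, 0.3, 0.2])
```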

Fig. 2 shows the framework of the DNN-HMM model. Under this framework, the performance of interpreting English pronunciation can be reflected by calculating the logarithmic posterior probability. The result of decoding by the speech recognizer is used as reference data. When phoneme z is decoded, assuming that the observed vector decoded by Viterbi is $O=\left\{o_{1},o_{2},\cdot \cdot \cdot ,o_{N}\right\}$, then the posterior probability corresponding to phoneme z is $pp(z\left| O\right.)$. The posterior probability expression is shown in Formula (8):

$ pp\left(z\left| O\right.\right)=\frac{1}{N}\ln \frac{p\left(O\left| z\right.\right)p\left(z\right)}{\sum _{q\in {Q_{z}}}p\left(O\left| q\right.\right)p\left(q\right)}\approx \frac{1}{N}\ln \frac{p\left(O\left| z\right.\right)}{\sum _{q\in {Q_{z}}}p\left(O\left| q\right.\right)}\approx \frac{1}{N}\ln \frac{p\left(O\left| z\right.\right)}{\max _{q\in {Q_{z}}}p\left(O\left| q\right.\right)} $
Fig. 2. The DNN-HMM acoustic model frame structure.

There are two approximations in Formula (8). The first ignores the prior probabilities of all phonemes, and the second retains only the largest term in the denominator to simplify the calculation. $Q_{z}$ represents the denominator calculation space, which consists of the possible mispronunciations of phoneme z, making the posterior probability of phoneme z more targeted. The logarithmic likelihood required by Formula (8) is accumulated over the frames, as shown in Formula (9):

$ \ln p\left(O\left| t\right.\right)=\sum _{i=1}^{N}\left\{\ln p\left(o_{i}\left| s_{i}\right.\right)+\ln a_{{s_{i-1}}{s_{i}}}\right\}\approx \sum _{i=1}^{N}\left\{\ln p\left(o_{i}\left| s_{i}\right.\right)\right\} $

Formula (9) ignores the transition probability of the HMM; this study assumes that the transition probability does not need to be included in the likelihood score. Substituting Formula (7) into Formula (9), Formula (10) is obtained:

$ \ln p\left(O\left| t\right.\right)=\sum _{i=1}^{N}\left\{\ln \frac{p\left(s_{i}\left| o_{i}\right.\right)p\left(o_{i}\right)}{p\left(s_{i}\right)}\right\}\approx \sum _{i=1}^{N}\left\{\ln \frac{p\left(s_{i}\left| o_{i}\right.\right)}{p\left(s_{i}\right)}\right\} $

In Formula (10), the factor $p(o_{i})$, which has no influence on the comparison of likelihoods, can be ignored; $p(s_{i}\left| o_{i}\right.)$ can be obtained directly from the DNN output; and $p(s_{i})$ can be obtained from the training set. Formula (10) is therefore the final calculation of the likelihood in the DNN framework.
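
To make the chain from Formula (10) back to Formula (8) concrete, the following sketch accumulates the frame-level scaled log-likelihoods for an aligned segment and then applies the last (max-based) approximation of Formula (8). The function names and the toy numbers are assumptions for illustration; in particular, how the competitor set $Q_{z}$ is built is not specified here and is simply passed in as a list of log-likelihoods.

```python
import math


def segment_log_likelihood(frame_posteriors, frame_states, state_priors):
    """Formula (10): sum over frames of ln p(s_i|o_i) - ln p(s_i).

    frame_posteriors[i][s] : DNN posterior of state s for frame i
    frame_states[i]        : aligned state index s_i for frame i
    state_priors[s]        : prior p(s) estimated from the training set
    """
    return sum(
        math.log(post[s]) - math.log(state_priors[s])
        for post, s in zip(frame_posteriors, frame_states)
    )


def phoneme_log_posterior(loglik_target, loglik_competitors, n_frames):
    """Last form of Formula (8): (1/N) * (ln p(O|z) - max_q ln p(O|q))."""
    return (loglik_target - max(loglik_competitors)) / n_frames


# Hypothetical two-frame segment aligned to state 0 in both frames.
toy_loglik = segment_log_likelihood(
    frame_posteriors=[[0.8, 0.2], [0.6, 0.4]],
    frame_states=[0, 0],
    state_priors=[0.5, 0.5],
)
# Hypothetical log-likelihoods for the target phoneme and two competitors.
toy_pp = phoneme_log_posterior(-10.0, [-12.0, -11.0], n_frames=4)
```

Dividing by the number of frames keeps the feature comparable across phonemes of different durations.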

Suppose there are $N_{k}$ sentences in a segment $k$ of an English interpretation, and the $i$-th sentence is decoded into $n_{i}$ phonemes. Then, the final posterior probability feature of this speech segment is calculated with Formula (11):

$ WPP\left(k\right)=\frac{1}{N_{k}}\sum _{i=1}^{N_{k}}\left\{\frac{\sum _{j=1}^{n_{i}}pp\left(t_{j}\left| O\right.\right)}{n_{i}}\right\} $

In Formula (11), $WPP(k)$ represents the posterior probability feature: the posterior probabilities of all phonemes in each sentence of the speech segment are averaged, and the sentence averages are then averaged to obtain the estimated posterior probability feature of the segment. The scoring model designed in this study also adds or deducts points according to the interpreter’s state. In addition, when the interpreter conveys little content but pronounces it well, the posterior probability feature is still high, whereas a manual rater would reduce the score. Therefore, $S$ is used in this study to weight $WPP(k)$ and reduce the mismatch between machine and manual scoring. The weighting method is shown in Formula (12):

$ eWPP=\exp \left(WPP\right)\cdot \left(1-S\right) $

After weighting, the posterior probability feature matches the manual score more closely, and the correlation between machine and manual scores is stronger.
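
Formulas (11) and (12) amount to a mean of per-sentence means followed by an exponential weighting, which can be sketched as below. The function names and the sample values are hypothetical, and treating $S$ as a fraction between 0 and 1 is an assumption made for this example only.

```python
import math


def wpp(sentences):
    """Formula (11): average over sentences of the mean phoneme log posterior.

    sentences[i] is the list of pp(t_j|O) values for the n_i phonemes of sentence i.
    """
    per_sentence = [sum(pp) / len(pp) for pp in sentences]
    return sum(per_sentence) / len(per_sentence)


def ewpp(wpp_value, s):
    """Formula (12): exp(WPP) * (1 - S)."""
    return math.exp(wpp_value) * (1.0 - s)


# Hypothetical segment with two sentences of two and three phonemes.
toy_sentences = [[-0.2, -0.4], [-0.1, -0.3, -0.5]]
toy_wpp = wpp(toy_sentences)
toy_ewpp = ewpp(toy_wpp, s=0.2)
```

Because the phoneme posteriors are logarithms, `exp(WPP)` maps the feature back to a positive scale before the factor `(1 - S)` scales it down.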

3.3 English Interpretation Scoring via RNN

The posterior probability feature constructed in this study depends heavily on the recognition performance of the model. If a recognition error occurs, the posterior probability calculated afterwards cannot provide useful information for interpretation scoring. Moreover, when the model recognizes interpreted sentences, the score of a word is determined only by the preceding two or three words, which is too limited a context. If the language model can see more of the history, recognition improves, the posterior probability calculation becomes more accurate, and the interpretation scoring model becomes more reliable. Therefore, researchers have used the recurrent neural network (RNN) to construct language models [20-22]. The RNN structure has different data transmission modes. An RNN splices the HL output at the current moment with the vector describing the word at the next moment to form the new input. Because each step carries the HL output forward, historical information is retained, and the model uses more of the data during training. The network structure of the RNN language model is shown in Fig. 3.

Fig. 3. Basic Structure of the RNN Language Model.

In Fig. 3, $w(t)$ represents the vector form of the current input word; $s(t)$ represents the output of the HL at time $t$; $s(t-1)$ represents the HL output at the previous moment, which is fed back as input at the present moment; $y(t)$ represents the output vector of the model; and $c(t)$ represents the word classes, which are mainly used to accelerate the training of the model. The output $y(t)$ is passed through the softmax function to ensure that the predicted word probabilities lie in the range 0 to 1, which avoids the complicated backoff smoothing operation in the model. Assuming the dimension of $c(t)$ is $M$, the words are divided in advance into $M$ classes such that the sum of word frequencies in each class is roughly the same. While training the model, it is necessary to update the weights of $c(t)$ and of the part of $y(t)$ in the same class.
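
One simple way to obtain $M$ classes with roughly equal total frequency is to sort the vocabulary by frequency and cut the cumulative frequency mass into $M$ equal bins. The sketch below illustrates that idea; it is an assumption about how such classes could be formed, not a description of the procedure used in the study, and `assign_classes` and the toy counts are hypothetical.

```python
def assign_classes(word_counts, n_classes):
    """Split words, sorted by descending frequency, into bins of similar total frequency."""
    total = sum(word_counts.values())
    # Sort by descending count; ties are broken alphabetically for determinism.
    ordered = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    classes, cumulative = {}, 0
    for word, count in ordered:
        # The class index is the bin that the running frequency mass falls into.
        classes[word] = min(cumulative * n_classes // total, n_classes - 1)
        cumulative += count
    return classes


# Hypothetical word counts, for illustration only.
toy_counts = {"the": 4, "a": 2, "of": 2, "cat": 1, "dog": 1, "sat": 1, "ran": 1}
toy_classes = assign_classes(toy_counts, n_classes=3)
```

With frequency-balanced classes, frequent words end up in small classes and rare words in large ones, so the expected number of output units evaluated per word stays low.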

The RNN language model has problems with decoding efficiency and cannot be used directly in one-pass decoding. Therefore, in this study, the RNN language model is used to re-estimate the scores of the candidates produced by decoding, and the top-scoring sentence after re-estimation is taken as the new recognition result. Experiments have shown that an RNN language model performs better when interpolated with an n-gram language model, so the language model score used in re-estimation is also obtained by interpolation. The re-estimation score of candidate sentences is calculated by Formula (13):

$ Score_{k}=AcScore_{k}+W_{k}\cdot C+\left[\lambda \cdot lm_{ngram}^{k}+\left(1-\lambda \right)\cdot lm_{RNN}^{k}\right]\cdot lmScale $

In Formula (13), $Score_{k}$ represents the re-estimated score of the $k$-th candidate; $AcScore_{k}$ represents the acoustic model score; $W_{k}$ represents the number of words in the sentence; $C$ is the word penalty; $\lambda $ denotes the interpolation coefficient; $lm_{ngram}^{k}$ and $lm_{RNN}^{k}$ indicate the scores of the respective language models; and $lmScale$ is the scaling factor of the language model score during decoding. Formula (13) is used to find the sentence with the largest re-estimated score, and that sentence is taken as the new reference data for recalculating the posterior probability. The flow chart is shown in Fig. 4.

Fig. 4. Flow Chart of RNN Re-estimation Identification Results.

In Fig. 4, for the candidates produced by one-pass decoding, the acoustic model score of each candidate is kept unchanged, and the original language model score is replaced by the interpolated score to calculate a new score for the sentence. The candidate with the highest new score is selected as the new recognition result.
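
The re-estimation step of Formula (13) can be sketched as rescoring a list of candidates and keeping the best one. The `Candidate` class, the function names, the candidate sentences, and all score values below are hypothetical; only the interpolation coefficient of 0.5 matches the setting reported later in Section 4.2.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    ac_score: float   # AcScore_k, kept unchanged during rescoring
    lm_ngram: float   # n-gram language model log score
    lm_rnn: float     # RNN language model log score


def rescore(cand, lam, word_penalty, lm_scale):
    """Formula (13) for a single candidate."""
    n_words = len(cand.text.split())  # W_k, counted by whitespace here
    lm = lam * cand.lm_ngram + (1.0 - lam) * cand.lm_rnn
    return cand.ac_score + n_words * word_penalty + lm * lm_scale


def best_candidate(cands, lam=0.5, word_penalty=0.0, lm_scale=1.0):
    """Return the candidate with the largest re-estimated score."""
    return max(cands, key=lambda c: rescore(c, lam, word_penalty, lm_scale))


# Hypothetical two-entry candidate list, for illustration only.
toy_cands = [
    Candidate("he goes to school", ac_score=-100.0, lm_ngram=-20.0, lm_rnn=-18.0),
    Candidate("he go to school", ac_score=-99.0, lm_ngram=-21.0, lm_rnn=-25.0),
]
toy_best = best_candidate(toy_cands, lam=0.5, word_penalty=0.0, lm_scale=10.0)
```

In this toy case the second candidate has the slightly better acoustic score, but the interpolated language model score outweighs it, so the first candidate is selected.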

4. Performance Analysis of the English Interpretation Scoring Model

4.1 Recognition under the DNN-HMM Identifier

In this paper, an automatic scoring model for English interpretation is constructed and its performance analyzed through experiments. The experiments mainly tested three aspects: correlation of man-machine scores, speech recognition accuracy, and error in posterior probability. In the DNN-HMM model, the sample data set used for AM training was composed of speech data with scores of no less than 80 points extracted from English interpretation tests, totaling about 700 hours. The parameters of the DNN model were set as follows. There were five HLs with 2048 nodes in each layer. The HL activation function was a sigmoid function, and the output layer used the softmax function. Training ran for 1000 iterations; the learning rate of the first 300 iterations was fixed at 0.2, and the rate for the last 700 iterations was cut in half. The test data set consisted of 4000 selected pieces of data from an English interpretation exam. All data were independently scored by two raters, with score differences within three points and score correlation above 0.8. After the training and test data sets were determined, model performance was tested and analyzed.

Fig. 5 shows the correlation between different posterior probability features and manual scoring, and the correlation between pronunciation errors and manual scoring. The degree of correlation reflects how much weight a factor carries in manual scoring; the closer a curve is to the manual scoring curve, the higher the correlation. In Fig. 5(a), the correlation between the pure posterior probability feature WPP and the manual score is 0.646, indicating that the posterior probability feature of the DNN model reflects the pronunciation quality of English interpreters to some extent. The correlation between the missing-ratio score and manual scoring is 0.626, indicating that interpreters frequently stalled in this setting, which manual raters also take into account. For eWPP, which combines the two, the correlation with manual scoring reached 0.725, indicating that the combined scoring method is closer to the manual method. In Fig. 5(b), over the 4000 pieces of data, the correlation between pronunciation errors and manual scoring is 0.336, indicating that pronunciation errors carry relatively low weight in the scoring of English interpretation. For data with a missing ratio of 0.1 or less, however, the correlation between automatically detected pronunciation errors and manual scoring was 0.501. This shows that when interpretation lags, less content is effectively expressed and fewer pronunciation errors are observed; when enough content is expressed, pronunciation errors do correlate with the final score.
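
The text does not state which correlation coefficient was used. Assuming it is the Pearson correlation between the automatic feature and the manual score, a minimal sketch of the computation looks like this; the function name and the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical feature values and manual scores, for illustration only.
toy_feature = [0.52, 0.61, 0.47, 0.70, 0.66]
toy_manual = [78.0, 84.0, 75.0, 90.0, 85.0]
toy_r = pearson(toy_feature, toy_manual)
```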

Fig. 6 shows the speech recognition results of the DNN model, and reflects the accuracy of the model through training and testing sets. In Fig. 6, training speech recognition accuracy of the DNN model correlated positively with the number of iterations. When the iterations reached about 250, the accuracy curve of the model in the training set began to converge. When iterations reached 1000, the accuracy from model speech recognition was 0.854. The test speech recognition accuracy of the DNN model also correlated positively with the number of iterations. When the iterations reached about 300, the accuracy of the model with the test set began to converge. When the iterations stopped, the recognition rate of the model with the test set was 0.842.

Fig. 5. Analysis of the results of man-machine scoring correlation.
Fig. 6. DNN Model Speech Recognition Performance.
Fig. 7. Error Results for DNN-HMM Posterior Probability.

Fig. 7 shows the posterior probability error results of the DNN-HMM model. The smaller the posterior probability error, the better the interpretation accuracy. In Fig. 7, error results for the DNN model in the training set were inversely correlated with the samples. When the number of samples was 200, the model error curve had converged, and the error value was finally stable at about 0.037. The error results of the DNN model with the test set also correlated inversely with the samples. When the number of samples was about 260, the model error curve began to converge, and the error value in the test set was stable at about 0.041.

4.2 Performance Analysis based on the RNN

Since the structure of the RNN model differs from that of the DNN model, its parameters were tuned separately for the experiment. The RNN model also had five HLs, and the number of nodes was 500. The number of output classes was 100. Generally speaking, the larger the number of output classes, the greater the training acceleration, but performance is also affected. Therefore, 100 is a suitable value for the number of output classes. Model iterations were set to 1000, the learning rate for the first 700 was fixed at 0.1, and the learning rate of the last 300 was cut in half. For re-estimation with the RNN model, 50 candidates were retained from one-pass decoding, the interpolation coefficient was set to 0.5, and the training and test sets remained unchanged.

Fig. 8 shows the correlation between the posterior probability features and the manual score before and after re-estimation by the RNN model. In Fig. 8, the correlation between the pure posterior probability feature WPP and manual scoring is 0.646. The correlation between the posterior probability feature of the RNN model and manual scoring is 0.781, so the scoring correlation increased by 20.9% after re-estimation. The experiment shows that the RNN model sees more of the sentence history in scoring, so more logical sentences get better scores. The candidate with the largest re-estimated score is therefore likely to be the most logical one. From the perspective of recognition, such candidates are more likely to be correct recognition results, which improves the recognition ability of the model.
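
The 20.9% figure is the relative increase in correlation from 0.646 to 0.781, as this short check shows:

$ \frac{0.781-0.646}{0.646}=\frac{0.135}{0.646}\approx 0.209 $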

Fig. 9 shows the speech recognition performance of the RNN model. In Fig. 9, the training speech recognition accuracy of the RNN model correlates positively with the number of iterations. When iterations reached about 200, the accuracy curve of the model on the training set began to converge. When iterations reached 1000, the accuracy of speech recognition was 0.923. The test speech recognition accuracy of the RNN model also correlates positively with the number of iterations. When iterations reached about 220, the accuracy curve of the model on the test set began to converge. When iterations stopped, the recognition rate on the test set was 0.915.

Fig. 10 shows the posterior probability error from the RNN model. In Fig. 10, the error results of the RNN model in the training set correlate inversely with the samples. When the number of samples was 150, the error curve of the training set converged, and the error value was finally stable at about 0.028. The error results for the RNN model on the test set also correlate inversely with the samples. When the number of samples was about 180, the error curve began to converge, and the error value for the test set was stable at about 0.033. Comparing the performance of the DNN-HMM model and the RNN model, the RNN model had higher correlation with human scoring, and its speech recognition accuracy and posterior probability error were better. Therefore, from among the models constructed in this study, the RNN is more suitable for automatic scoring of English interpretations.

Fig. 8. The correlation between posterior probability and manual scoring in the RNN model.
Fig. 9. RNN Model Speech Recognition Performance.
Fig. 10. Error Results for the RNN’s Posterior Probability.

5. Conclusion

Automatic scoring of English interpretation is a useful task that can reduce both examination time and labor costs. In this study, a DNN-HMM model was first established, and the recognized text was obtained through speech recognition. Then, the pronunciation likelihood score and the phoneme likelihood score relative to that text were calculated, and the posterior probability of the pronunciation vector was obtained from the relationship between the two. Because the posterior probability depends strongly on recognition performance, an RNN language model was used to re-estimate the recognition results and improve the model. In the experiments, the correlation between automatic and manual scoring was 0.725 for the DNN-HMM model and 0.781 for the RNN model. The recognition accuracy was 0.854 for the DNN-HMM model and 0.923 for the RNN model, and the posterior probability error was 0.041 for the DNN-HMM model and 0.033 for the RNN model. The results show that the scores of the RNN model correlate more strongly with human scoring of English interpretation when recognition accuracy, description content, and pronunciation errors are considered. The RNN model also takes more comprehensive data into account, gives more logical sentences higher scores, and makes the scoring more scientific. The main contribution of this research is to use speech recognition technology to recognize spoken English under the DNN-HMM framework, and to improve the calculation of posterior probability so that a more accurate posterior probability is obtained, thereby improving the accuracy of English interpretation recognition. The research also found that performance improves significantly when the model is given more of the historical information in the data.
However, this study has shortcomings. The proposed RNN structure has room for optimization, and the influence of the interpreter’s speaking speed and description fluency on the score was not considered. Subsequent studies will address these deficiencies so that the automatic scoring model for English interpretation provides a more reasonable scoring system.
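The two steps summarized above, obtaining a log posterior from likelihood scores and selecting the recognition result with the largest re-estimated score, can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names, the uniform phoneme prior, the `lm_weight` value, and the example scores are all assumptions made for illustration.

```python
import math


def log_posterior(log_likelihoods, target):
    """Log posterior of `target`, assuming a uniform prior over candidates.

    log_likelihoods maps each candidate phoneme to log p(observation | phoneme).
    """
    # Log-sum-exp over all candidates, shifted by the maximum for stability.
    peak = max(log_likelihoods.values())
    log_evidence = peak + math.log(
        sum(math.exp(v - peak) for v in log_likelihoods.values())
    )
    # Bayes' rule in the log domain (the uniform prior cancels out).
    return log_likelihoods[target] - log_evidence


def rescore_nbest(hypotheses, lm_weight=0.5):
    """Return the hypothesis text with the largest combined score.

    Each hypothesis is (text, acoustic_log_score, lm_log_score); lm_weight is
    a hypothetical interpolation weight, not a value from the paper.
    """
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])
    return best[0]


# Hypothetical N-best list: the language-model score overturns the
# slightly better acoustic score of the first hypothesis.
nbest = [
    ("i see the see", -10.0, -8.0),
    ("i see the sea", -10.5, -4.0),
]
chosen = rescore_nbest(nbest)
score = log_posterior({"ae": -1.0, "eh": -1.0}, "ae")
```

In this sketch the language-model score stands in for the output of an RNN language model; any model that returns a sentence-level log score could be used in its place.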


P. Avent, C. Hughes, H. Garvin, “Applying posterior probability informed thresholds to traditional cranial trait sex estimation methods,” Journal of Forensic Sciences, vol. 67(2), pp. 440-449, 2021.
L. Chen, et al., “An innovative deep neural network–based approach for internal cavity detection of timber columns using percussion sound,” Structural Health Monitoring, vol. 21(3), pp. 1251-1265, 2022.
M. S. Johnson, S. Sinharay, “The Reliability of the Posterior Probability of Skill Attainment in Diagnostic Classification Models,” Journal of Educational and Behavioral Statistics, vol. 45(1), pp. 5-31, 2020.
L. Sun, B. Zou, S. Fu, et al., “Speech emotion recognition based on DNN-decision tree SVM model,” Speech Communication, vol. 115, pp. 29-37, 2019.
W. Jiang, F. Wen, P. Liu, “Robust Beamforming for Speech Recognition Using DNN-based Time-Frequency Masks Estimation,” IEEE Access, vol. 6, pp. 52385-52392, 2018.
H. Seki, K. Yamamoto, T. Akiba, S. Nakagawa, “Discriminative Learning of Filterbank Layer within Deep Neural Network Based Speech Recognition for Speaker Adaptation,” IEICE Transactions on Information and Systems, vol. (2), pp. 364-374, 2019.
K. Sone, T. Nakashika, “Pre-Training of DNN-Based Speech Synthesis Based on Bidirectional Conversion between Text and Speech,” IEICE Transactions on Information and Systems, vol. E102.D(8), pp. 1546-1553, 2019.
V. M. Praseetha, S. Vadivel, “Deep Learning Models for Speech Emotion Recognition,” Journal of Computer Science, vol. 14(11), pp. 1577-1587, 2018.
H. Prafianto, T. Nose, Y. Chiba, A. Ito, “Improving Human Scoring of Prosody Using Parametric Speech Synthesis,” Speech Communication, vol. 111, pp. 14-21, 2019.
J. Liu, L. Lin, X. Liang, “Intelligent system of English composition scoring model based on improved machine learning algorithm,” Journal of Intelligent and Fuzzy Systems, vol. 40(2), pp. 2397-2407, 2021.
F. S. Pribadi, A. E. Permanasari, T. B. Adji, “Short answer scoring system using automatic reference answer generation and geometric average normalized-longest common subsequence (GAN-LCS),” Education & Information Technologies, vol. 23(6), pp. 2855-2866, 2018.
T. Gaillat, et al., “Predicting CEFR levels in learners of English: The use of microsystem criteria features in a machine learning approach,” ReCALL, vol. 34(May), pp. 130-146, 2022.
Y. Zhang, “Interactive intelligent teaching and automatic composition scoring system based on linear regression machine learning algorithm,” Journal of Intelligent and Fuzzy Systems, vol. 40(2), pp. 2069-2081, 2021.
W. Westera, “Comparing Bayesian Statistics and Frequentist Statistics in Serious Games Research,” International Journal of Serious Games, vol. 8(1), pp. 27-44, 2021.
O. M. Crook, C. W. Chung, C. M. Deane, “Challenges and Opportunities for Bayesian Statistics in Proteomics,” Journal of Proteome Research, vol. 21(4), pp. 849-864, 2022.
J. M. Luningham, J. Chen, S. Tang, P. L. De Jager, D. A. Bennett, A. S. Buchman, J. Yang, “Bayesian Genome-wide TWAS Method to Leverage both cis- and trans-eQTL Information through Summary Statistics,” The American Journal of Human Genetics, vol. 107(4), pp. 714-726, 2020.
T. Caelli, J. Mukerjee, A. Mccabe, D. Kirszenblat, “The Situation Awareness Window: a Hidden Markov Model for analyzing Maritime Surveillance missions,” The Journal of Defense Modeling and Simulation, vol. 18(3), pp. 207-215, 2021.
X. Liu, K. Shi, Z. Wang, J. Chen, “Exploit Camera Raw Data for Video Super-Resolution via Hidden Markov Model Inference,” IEEE Transactions on Image Processing, vol. 30, pp. 2127-2140, 2021.
Y. Li, E. Zio, E. Pan, “An MEWMA-based segmental multivariate hidden Markov model for degradation assessment and prediction,” Journal of Risk and Reliability, vol. 235(5), pp. 831-844, 2021.
P. L. Prasanna, “Forecasting Inflation Rate (WPI & CPI) in India with Time Series Model – A Statistical Recurrent Neural Network Approach,” IARJSET, vol. 8(6), pp. 87-90, 2021.
X. Dai, J. Liu, Y. Li, “A recurrent neural network using historical data to predict time series indoor PM2.5 concentrations for residential buildings,” Indoor Air, vol. 31(4), pp. 1228-1237, 2021.
X. Zhang, Y. Huang, Y. Rong, G. Li, H. Wang, C. Liu, “Recurrent neural network based optimal integral sliding mode tracking control for four-wheel independently driven robots,” IET Control Theory & Applications, vol. 15(10), pp. 1346-1363, 2021.


Haiyang Cao

Haiyang Cao obtained her Master’s Degree in English Language and Literature (2012) from Harbin Normal University, Harbin. She is currently a Professor in the International Business College, Qingdao Huanghai University, Qingdao. She has been invited as a lecturer to deliver talks on English teaching and education. She has published articles in more than 15 reputed national and international peer-reviewed journals. As a program leader, she has applied for and completed five research programs at the provincial and municipal levels. Her areas of interest include English teaching, English education, and English interpretation research.