3.1 Integrating the Vuforia SDK
Unity3D is a real-time 3D engine developed by Unity Technologies, aimed mainly at interactive graphics, and is widely used for augmented reality development. Unity3D can create and render scene models, import the Vuforia SDK extension toolkit, and implement tracking and detection through the corresponding interfaces to obtain AR applications with human-computer interaction and a virtual-real overlay [15]. Unity3D supports 3D models in OBJ or FBX format. After these models are imported into various environments and scenes, they can be augmented with environmental sound effects and physical material effects such as wind, sky, and fog. Unity3D also supports editing, testing, and instant previewing of 3D application scenes, and the finished product can be deployed directly to multiple platforms as desired [16]. The Vuforia Augmented Reality SDK is aimed mainly at augmented reality applications on mobile devices. It uses computer vision technology to recognize and track planar images and simple three-dimensional objects in real time, and supports developers in placing and adjusting virtual objects relative to the real scene. Its data flow modules are shown in Fig. 1.
The data flow of the Vuforia SDK has four modules: input conversion, the database module, tracking detection, and the rendering input module. The input conversion module obtains a new image format through an image converter after the camera captures a scene. The database module stores the data, both in the cloud and on the local device. The tracking detection module tracks targets, including user-defined targets. The rendering input module contains application coding and video background rendering. The four modules pass data to each other and feed problems back, so Unity3D is easily integrated with the Vuforia SDK. Good adaptation and powerful engine functions enable developers to obtain augmented reality interactive applications with excellent effects from a simple design. Therefore, AR markers based on Unity3D integrated with the Vuforia SDK not only recognize three-dimensional models but also provide real-time tracking for oral English teaching, as shown in Fig. 2.
In the AR oral English teaching mode of Unity3D integrated with the Vuforia SDK, students hold synchronous dialogues in a simulated real scene. The teacher can switch between different roles, and multiple students can cooperate with each other to communicate orally in English. At the same time, teachers use computers to construct various dialogue situations in which students can participate and communicate as if with foreigners. In situations such as shopping and travel in particular, AR technology reproduces the scene with high fidelity, and students' adaptability in English can be greatly improved [17]. With the help of AR, students' listening and reading processes, voices, and videos can be recorded, and video playback is supported, realizing teaching across time and space. In addition, students' autonomous learning abilities will make great progress because AR technology itself is attractive. It can create a relaxed and harmonious oral English learning environment, give students the experience of a real atmosphere, and stimulate their enthusiasm and initiative for oral learning.
Fig. 1. Diagram of the Vuforia SDK Data Flow Module.
Fig. 2. AR Oral English Teaching Mode Based on Unity3D Integrated with the Vuforia SDK.
3.2 English Speech Recognition Based on a CNN
In an English speech recognition system, the acoustic model is an integral part. The CNN, a prominent deep learning algorithm, owes its efficiency to its convolution-pooling structure, which significantly reduces the number of parameters. In addition, the convolution process eliminates the impact of changes in signal amplitude, and the model is highly adaptable. CNNs have already been applied successfully in speech recognition [18], and applying them to English speech recognition can greatly improve the performance of the acoustic model. The basic structure of a CNN is shown in Fig. 3.
The convolutional layer of a CNN has multiple feature maps, and each feature map contains several neurons. The input of a feature map is obtained by locally filtering the input features with a convolution kernel, which is fundamentally a weight matrix [19]. The convolutional layers of a CNN first extract coarse information and then extract increasingly discriminative features until the key distinguishable features are obtained. Therefore, the fundamental role of the convolutional layer is to extract the deep information contained in the input speech signal and pass it to the pooling layer. The local connections of the convolutional layers are shown in Fig. 4.
In Fig. 4, the input is layer $L-1$. Its neurons are connected to the adjacent neurons in layer $L$ through local connections, and the weights are shared at the same time. The weights of the neurons on the same feature plane are shared, as shown in formula (1):
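Reading $i$ and $j$ as neuron positions and $a$, $b$ as feature-plane indices (a notational assumption), the sharing can be stated in the conventional way: the kernel weight connecting feature plane $a$ of layer $L-1$ to feature plane $b$ of layer $L$ does not depend on the neuron position, i.e.,
$$w_{i}^{a,b}=w_{j}^{a,b} . \qquad (1)$$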
In formula (1), $i$ and $j$ are neuron indices, $a$ and $b$ are feature-plane indices, and $w$ represents the weight. Through weight sharing, the CNN reduces the complexity of the model, thereby reducing the number of parameters to be learned and making the model easier to train [20]. Each feature plane of a convolutional layer in the CNN corresponds uniquely to an input feature plane of the pooling layer. The pooling layer further condenses the information from the convolutional layer; max pooling is used to deal with the deviation of the estimated mean caused by errors in the convolutional-layer parameters, as shown in formula (2):
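With $a_{ij}$ read as the value at each point $(i,j)$ of the pooling neighborhood $N_{m}$, the usual max-pooling form is
$$h_{m}=\max_{(i,j)\in N_{m}} a_{ij} . \qquad (2)$$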
In formula (2), $N_{m}$ is the pooling neighborhood (of a given size), $h_{m}$ represents the output value for this neighborhood, and $a_{ij}$ is the value at each point contained in the neighborhood, of which the maximum is taken. To mitigate the error whereby the limited neighborhood size increases the variance of the estimate, mean pooling is also applied, which reduces this variance, as shown in formula (3):
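Under the same reading, the mean-pooling operation averages the values over the neighborhood:
$$h_{m}=\frac{1}{\left|N_{m}\right|}\sum_{(i,j)\in N_{m}} a_{ij} . \qquad (3)$$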
In formula (3), $i$ and $j$ index the points in the neighborhood. The pooling layer preserves the features extracted by the convolutional layer to the greatest extent; it further reduces the amount of computation and prevents overfitting [21]. At the same time, when the pooling layer compresses the features, it does not damage the speech features but maintains their invariance to a certain extent. Therefore, in the design of the acoustic model, the mean shift is reduced by the pooling layer, and a 3${\times}$3 pooling kernel is selected to obtain higher-precision features. After multiple convolutional and pooling layers, the speech feature information is passed to the fully connected layer. The fully connected layer receives all the local information contained in the previous layer, and its calculation is shown in formula (4):
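In its usual form, with $x_{i}$ denoting the inputs received from the previous layer (a symbol introduced here for clarity), each fully connected neuron computes
$$y=f\left(\sum_{i=1}^{N} w_{i}x_{i}+b\right) . \qquad (4)$$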
In formula (4), $f$ represents the activation function, $b$ is the bias of the neuron, $N$ is the number of neurons, and $w$ is the weight. The fully connected layer integrates the feature maps obtained by the convolution and pooling operations and finally outputs a vector or probability value; that is, it acts as the classifier of the network, mapping the preceding feature representation to the label space. In the CNN design, selecting an appropriate activation function retains better speech features, and introducing a nonlinear function improves the network's nonlinear representation ability [22]. Commonly used nonlinear functions include the tanh, ReLU, sigmoid, and maxout functions. The sigmoid function is shown in formula (5):
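Written out, this is the standard sigmoid:
$$\mathrm{sigmoid}(x)=\frac{1}{1+e^{-x}} . \qquad (5)$$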
In formula (5), $e$ is the natural constant. The sigmoid function is easy to differentiate, but its output range is small and its convergence is slow. The tanh function is shown in formula (6):
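The standard tanh form is
$$\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} . \qquad (6)$$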
The value range of the tanh function is [-1, 1], but the vanishing gradient problem still occurs. The ReLU function is shown in formula (7):
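The standard ReLU form is
$$\mathrm{ReLU}(x)=\max(0,x) . \qquad (7)$$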
The ReLU function does not saturate easily, which helps prevent the vanishing gradient problem, but some neurons may become difficult to activate, so that their parameters are no longer updated. The maxout function is shown in formula (8):
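In the standard maxout formulation, the output of unit $i$ in layer $l$ is the maximum over its $k$ candidate activations:
$$h_{l}^{i}=\max_{j\in [1,k]} z_{l}^{ij} . \qquad (8)$$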
In formula (8), $k$ is the number of candidate neurons over which the maximum is taken, $l$ is the index of the network layer, $h_{l}^{i}$ represents the output, and $z_{l}^{ij}$ is the activation. For layer $l$, the activation is given in formula (9):
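Following the usual maxout definition, each candidate activation is an affine function of the layer input:
$$z_{l}^{ij}=x^{T}W_{\cdot ij}+b_{ij} . \qquad (9)$$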
In formula (9), $b$ represents the bias, $x^{T}$ represents the transposed feature vector, and $W$ is a three-dimensional matrix related to the input and output nodes. The maxout function has a strong fitting ability and can give the network a constant gradient, thereby effectively alleviating the vanishing gradient phenomenon. Therefore, this function is selected to optimize the acoustic model. In English speech recognition, the traditional approach requires forced alignment of the training speech, which increases complexity and training difficulty. Therefore, an end-to-end structure is added, and connectionist temporal classification (CTC) is combined with the CNN in this research. CTC handles the temporal classification task by predicting the output of each frame in order to recognize the speech signal. CTC is an objective function based on softmax. A blank node is introduced in CTC, which automatically optimizes the output sequence and maps multiple paths to the same label sequence [23]. The probability of a path over the given number of speech frames under CTC is shown in formula (10):
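Under the standard CTC independence assumption, the probability of a path $\pi$ over $T$ frames is the product of the per-frame output probabilities:
$$p(\pi \mid x)=\prod_{t=1}^{T} y_{\pi_{t}}^{t} . \qquad (10)$$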
In formula (10), $T$ represents the number of speech frames, and $\pi $ represents the corresponding path. The forward-backward algorithm is then introduced, as shown in formula (11):
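Assuming the standard definition of the CTC forward variable, with $\mathcal{B}$ denoting the many-to-one mapping from paths to label sequences and $l$ the blank-augmented label sequence,
$$\alpha(t,d)=\sum_{\pi :\, \mathcal{B}(\pi_{1:t})=l_{1:d}} \ \prod_{t'=1}^{t} y_{\pi_{t'}}^{t'} . \qquad (11)$$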
In formula (11), $\alpha(t,d)$ represents the forward probability value, and $y_{l}^{t}$ represents the probability of outputting label $l$ at time $t$. Therefore, the forward probability at a given time is calculated as shown in formula (12):
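In the standard CTC form over the blank-augmented label sequence $l$, the forward recursion is
$$\alpha(t,d)=\begin{cases}\left(\alpha(t-1,d)+\alpha(t-1,d-1)\right)y_{l_{d}}^{t}, & l_{d}=blank\ \text{or}\ l_{d}=l_{d-2}\\ \left(\alpha(t-1,d)+\alpha(t-1,d-1)+\alpha(t-1,d-2)\right)y_{l_{d}}^{t}, & \text{otherwise.}\end{cases} \qquad (12)$$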
In formula (12), $d$ represents a node, and $blank$ denotes the blank (space) symbol. The idea of the backward algorithm is the same as that of the forward algorithm, and its formula is shown in formula (13):
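One common form, mirroring the forward recursion, is
$$\beta(t,d)=\begin{cases}\left(\beta(t+1,d)+\beta(t+1,d+1)\right)y_{l_{d}}^{t}, & l_{d}=blank\ \text{or}\ l_{d}=l_{d+2}\\ \left(\beta(t+1,d)+\beta(t+1,d+1)+\beta(t+1,d+2)\right)y_{l_{d}}^{t}, & \text{otherwise.}\end{cases} \qquad (13)$$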
In formula (13), $\beta (t,d)$ represents the backward probability value. Through the maximum likelihood criterion, the CTC loss function is obtained as shown in formula (14):
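In the usual maximum-likelihood form, summing over all paths that map to the label sequence $z$,
$$L(x,z)=-\ln p(z\mid x)=-\ln \sum_{\pi \in \mathcal{B}^{-1}(z)} p(\pi \mid x) . \qquad (14)$$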
In formula (14), $x$ is the input, and $z$ is the output label sequence. Therefore, the CTC loss function over the whole training set is given by formula (15):
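Summed over the training set $S$, this gives
$$L(S)=-\sum_{(x,z)\in S}\ln p(z\mid x) . \qquad (15)$$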
In formula (15), $S$ represents the training set, and $p$ represents the probability. Convolutional and pooling layers help the model identify slightly displaced and deformed input features accurately, while the end-to-end architecture optimizes the output sequence. Therefore, the two are combined into a CTC-CNN (maxout) acoustic model with the parameters shown in Table 1.
The processing flow of the CTC-CNN (maxout) acoustic model is as follows. English speech is first input, and feature vectors are obtained through feature extraction. The feature vectors are then fed into the proposed CTC-CNN model and enter the first convolutional layer, which extracts relatively coarse acoustic features. The nonlinear activation function and the convolution operation of the second convolutional layer then yield relatively fine features. The convolutional features reach the pooling layer, where max pooling further reduces the mean shift and yields more accurate features. After the feature maps obtained in the previous steps reach the fully connected layers, the posterior probability is obtained through the mapping of the fully connected weights and the activation function and is used as the output. Finally, the CTC (maxout) output layer classifies and optimizes the recognition of the speech features, and the recognized speech is output after decoding.
Fig. 3. Diagram of the CNN Structure.
Fig. 4. Diagram of the Convolution Layer’s Local Connection Mode.
Table 1. CTC-CNN (maxout) Acoustic Model Parameter Table.
Network layer | Parameter
Input | 39-dimensional MFCC features
Convolution layer 1 | Kernel: 9×9, number of kernels: 128, stride: 1×1, activation function: sigmoid
Convolution layer 2 | Kernel: 4×3, number of kernels: 256, stride: 1×1, activation function: sigmoid
Pooling layer | Max pooling: 3×3
Fully connected layer 1 | Activation function: ReLU, neuron nodes: 1024
Fully connected layer 2 | Activation function: ReLU, neuron nodes: 1024
Maxout layer | CTC
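To make the structure in Table 1 concrete, a minimal PyTorch sketch of such a CTC-CNN (maxout) acoustic model is given below. It is an illustration only, not the authors' implementation: the maxout piece count k, the label vocabulary size (num_labels = 29), the padding, the pooling stride, and the treatment of the 39-dimensional MFCC frames as a one-channel time-frequency map are assumptions introduced here.

# Minimal sketch of a CTC-CNN (maxout) acoustic model following Table 1.
# num_labels, k, padding, and pooling stride are illustrative assumptions.
import torch
import torch.nn as nn


class Maxout(nn.Module):
    # Maxout unit (formulas (8)-(9)): each output is the max over k affine pieces.
    def __init__(self, in_features, out_features, k=2):
        super().__init__()
        self.k = k
        self.out_features = out_features
        self.linear = nn.Linear(in_features, out_features * k)

    def forward(self, x):
        z = self.linear(x)                                    # (..., out_features * k)
        z = z.view(*z.shape[:-1], self.out_features, self.k)  # (..., out_features, k)
        return z.max(dim=-1).values                           # max over the k pieces


class CTCCNNMaxout(nn.Module):
    def __init__(self, num_labels=29, k=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 128, kernel_size=9, stride=1, padding=4),              # conv layer 1: 9x9, 128 kernels
            nn.Sigmoid(),
            nn.Conv2d(128, 256, kernel_size=(4, 3), stride=1, padding=(2, 1)),  # conv layer 2: 4x3, 256 kernels
            nn.Sigmoid(),
            nn.MaxPool2d(kernel_size=3, stride=3),                              # 3x3 max pooling
        )
        self.fc = nn.Sequential(
            nn.Linear(256 * 13, 1024), nn.ReLU(),  # fully connected layer 1 (13 = 39 MFCC dims / 3 pooling)
            nn.Linear(1024, 1024), nn.ReLU(),      # fully connected layer 2
            Maxout(1024, num_labels, k=k),         # maxout output layer over the label set
        )
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # CTC objective, blank index 0

    def forward(self, mfcc):
        # mfcc: (batch, frames, 39), treated as a one-channel time-frequency map.
        x = self.conv(mfcc.unsqueeze(1))        # (batch, 256, frames', 13)
        x = x.permute(0, 2, 1, 3).flatten(2)    # (batch, frames', 256 * 13)
        return self.fc(x).log_softmax(dim=-1)   # per-frame log-probabilities for CTC


# Example with a dummy batch of two utterances of 300 MFCC frames each.
model = CTCCNNMaxout()
log_probs = model(torch.randn(2, 300, 39)).permute(1, 0, 2)  # (frames', batch, labels) for nn.CTCLoss
targets = torch.randint(1, 29, (2, 20))                      # dummy label sequences (blank = 0 excluded)
input_lengths = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)
loss = model.ctc_loss(log_probs, targets, input_lengths, target_lengths)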