3.1 Integrating the Vuforia SDK
Unity3D is a real-time 3D engine developed by Unity Technologies, used mainly for interactive graphics. Unity3D can create and render scene models, import the Vuforia SDK extension toolkit, and implement tracking and detection through the corresponding interface to obtain AR applications with human-computer interaction and a virtual-real overlay [15]. Unity3D supports 3D models in OBJ or FBX formats. Once imported into various environments and scenes, these models can be augmented with environmental sound effects and physical material effects such as wind, sky, and fog. Unity3D also supports editing 3D application scenes, testing them, and browsing them instantly, and it facilitates direct transfer of the product to other platforms for cross-platform support [16]. The Vuforia Augmented Reality SDK is aimed mainly at mobile-device augmented reality applications. It uses computer vision technology to recognize and capture simple three-dimensional objects or flat images in real time, and supports developers in placing and adjusting virtual objects in the captured scene. Its data flow module is shown in Fig. 1.
                  
                  The data flow of the Vuforia SDK has four modules: input conversion, the database
                     module, tracking detection, and the rendering input module. The input conversion module
                     obtains a new image format through an image converter after the camera captures a
                     scene. The database module is storage for the data, including cloud storage and local
                     device storage. The tracking detection module is used to track a target, including
                     user-defined targets. The rendering input module contains application coding and video
background rendering. The four modules transmit data to each other and provide feedback on problems, so Unity3D is easily integrated with the Vuforia SDK. Good adaptation and powerful engine functions enable developers to obtain augmented reality interactive applications with excellent effects from a simple design. Therefore, AR markers in Unity3D integrated with the Vuforia SDK not only allow three-dimensional models to be recognized but also support real-time tracking and oral English teaching, as shown in Fig. 2.
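To make the module interaction concrete, the following minimal Python sketch models the four-stage data flow described above. It is purely conceptual: the real Vuforia SDK is exposed to Unity3D through a C# interface, and every function, class, and field name here is a hypothetical illustration of the module responsibilities rather than part of the actual API.

```python
# Conceptual sketch of the four-module Vuforia data flow (Fig. 1).
# NOT the real Vuforia API (which Unity3D consumes through C#); all names
# below are hypothetical illustrations of the module responsibilities.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    pixels: bytes  # raw camera capture


def convert_input(frame: Frame) -> Frame:
    """Input conversion: an image converter yields a new image format."""
    return Frame(pixels=frame.pixels)  # placeholder format conversion


def query_database(frame: Frame, local_db: dict, cloud_db: dict) -> Optional[str]:
    """Database module: look up the captured scene in local or cloud storage."""
    return local_db.get("target") or cloud_db.get("target")


def track_target(frame: Frame, target: Optional[str]) -> dict:
    """Tracking detection: track the (possibly user-defined) target."""
    return {"target": target, "pose": (0.0, 0.0, 0.0)}


def render(frame: Frame, tracking: dict) -> str:
    """Rendering input: application code plus video background rendering."""
    return f"virtual model anchored to {tracking['target']} over the video background"


# One pass through the pipeline for a single captured frame.
frame = Frame(pixels=b"\x00" * 10)
converted = convert_input(frame)
target = query_database(converted, {"target": "marker"}, {})
print(render(converted, track_target(converted, target)))
```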
                  
In the AR oral English teaching mode of Unity3D integrated with the Vuforia SDK, students hold synchronous dialogues in a simulated real scene. The teacher can switch between different roles, and multiple students can cooperate with each other to communicate orally in English. At the same time, teachers use computers to construct various dialogue situations so students can participate and communicate with foreign speakers. Especially in shopping, travel, and similar situations, AR technology reproduces the scene with a high degree of fidelity, and students' English adaptability can be greatly improved [17]. With the help of AR, students' listening and reading processes, voices, and videos can be recorded, and video playback is supported, realizing teaching across time and space. In addition, students' autonomous learning abilities will make great progress because AR technology itself is attractive. It can create a relaxed and harmonious oral English learning environment, give students the experience of a realistic atmosphere, and stimulate their enthusiasm and initiative in oral learning.
                  
                  
                        Fig. 1. Diagram of the Vuforia SDK Data Flow Module.
 
                  
Fig. 2. AR Oral English Teaching Mode Based on Unity3D Integrated with the Vuforia SDK.
 
                
               
3.2 English Speech Recognition Based on CNN
                  In an English speech recognition system, the acoustic model is an integral part. The
                     CNN, a prominent algorithm in the realm of deep learning, showcases its efficiency
                     through its convolution pooling structure, which significantly reduces the number
                     of parameters. Additionally, it eliminates the impact of signal amplitude changes
during the convolution process and exhibits robust adaptability. CNNs have already been successfully applied in the field of speech recognition [18], and applying them to English speech recognition can greatly improve the performance of the acoustic model. The basic structure of the CNN is shown in Fig. 3.
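As a concrete illustration of the front end of such a system, the following sketch computes the 39-dimensional MFCC features that later appear as the input row of Table 1 (13 static coefficients plus first- and second-order differences). It assumes the librosa library; the file name, sampling rate, and window/hop lengths are illustrative assumptions, not values taken from this work.

```python
# Minimal sketch: 39-dimensional MFCC features (13 static + delta + delta-delta),
# matching the input row of Table 1. The file path, sample rate, window and hop
# lengths are illustrative assumptions, not values specified by the paper.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)             # mono English speech (placeholder path)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)       # 25 ms window, 10 ms hop
delta1 = librosa.feature.delta(mfcc)                         # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)                # second-order differences
features = np.concatenate([mfcc, delta1, delta2], axis=0)    # shape: (39, n_frames)
print(features.shape)
```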
                  
                  The convolutional layer of the CNN has multiple feature maps, and in each feature
                     map, there are several neurons. The input of the feature map is obtained under the
                     local filtering effect of the convolution kernel on the input features, and the convolution
kernel is fundamentally a weight matrix [19]. The convolutional layer of a CNN first extracts coarse information and then progressively extracts discriminative features until the key distinguishable features are obtained. Therefore, the fundamental function of the convolutional layer is to extract the deep information contained in the input speech signal and transmit it to the pooling layer. The local connections of the convolutional layers are shown in Fig. 4.
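The following NumPy sketch illustrates the local connection and weight sharing described here and in Fig. 4: a single convolution kernel (one weight matrix) is reused at every position of an input feature map, and the result is then reduced by a 3${\times}$3 max pooling step of the kind discussed below. All shapes and values are arbitrary examples rather than the configuration used in this work.

```python
# Illustrative NumPy sketch of local filtering and weight sharing: one shared
# convolution kernel is applied at every position of the input feature map,
# followed by non-overlapping 3x3 max pooling. Shapes are arbitrary examples.
import numpy as np

def conv2d_single_kernel(x: np.ndarray, w: np.ndarray, b: float = 0.0) -> np.ndarray:
    """Valid 2-D convolution of one feature map with one shared kernel."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Every output neuron uses the SAME weights w (weight sharing)
            # and only sees a local kh x kw patch (local connection).
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return out

def max_pool(x: np.ndarray, size: int = 3) -> np.ndarray:
    """Non-overlapping max pooling over size x size neighborhoods."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    x = x[:h, :w].reshape(h // size, size, w // size, size)
    return x.max(axis=(1, 3))

feature_map = np.random.randn(12, 12)   # e.g. a patch of spectral features
kernel = np.random.randn(3, 3)          # the shared weight matrix
conv_out = conv2d_single_kernel(feature_map, kernel)
print(conv_out.shape, max_pool(conv_out, 3).shape)
```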
                  
In Fig. 4, the input is layer $L-1$. Its neurons are connected to the adjacent neurons in layer $L$ through local connections, and the weights are shared at the same time. The neuron weights within the same feature surface are shared, as shown in formula (1):
                  
                  
In formula (1), $i$ and $j$ are neuron indices, $a$ and $b$ are feature-plane indices, and $w$ represents the weight. The CNN can reduce the complexity of the model through weight sharing, thereby reducing the number of parameters to be learned and making the model easier to train [20]. Each feature surface of a convolutional layer in the CNN uniquely corresponds to an input feature surface of the pooling layer. The pooling layer further extracts information from the convolutional layer and uses the maximum pooling method to deal with the estimation-value deviation caused by the convolutional-layer parameters, as shown in formula (2):

$h_{m}=\max_{(i,j)\in N_{m}}a_{ij}$ (2)
In formula (2), $N_{m}$ denotes the pooling neighborhood, $h_{m}$ represents the output value of this neighborhood, and $a_{ij}$ is the value of each point contained in the neighborhood, over which the maximum is taken. To mitigate the increase in estimation variance caused by the limited neighborhood size, the mean pooling operation in formula (3) is applied:

$h_{m}=\frac{1}{\left|N_{m}\right|}\sum_{(i,j)\in N_{m}}a_{ij}$ (3)
                  In formula (3), $i$ and $j$ are points in the neighborhood. The pooling layer can preserve the features
                     extracted by the convolutional layer to the greatest extent; it can further reduce
                     the amount of computation and prevent overfitting [21]. At the same time, when the pooling layer performs feature compression, it will not
                     damage the speech features, but maintains the invariance of the features to a certain
                     extent. Therefore, in the design of the acoustic model, the mean shift is reduced
                     by the pooling layer, and the 3${\times}$3 pooling kernel size is selected to obtain
                     higher-precision features. After multiple convolutional layers and pooling layers,
                     the speech information features are passed to the fully connected layer. The fully
connected layer can receive all the local information contained in the previous layer, and its calculation is shown in formula (4):

$y=f\left(\sum_{i=1}^{N}w_{i}x_{i}+b\right)$ (4)
                  In formula (4), $f$ represents the activation function, $b$ is the bias of the neuron, $N$ is the
                     number of neurons, and $w$ is the weight. The fully connected layer can integrate
                     the feature map obtained by the convolution pooling operation, and finally output
                     a vector or probability value; that is, it can become the classifier of the network,
                     mapping the previous feature representation to the label space. In the CNN design,
                     selecting an appropriate activation function can retain better speech features, and
                     introducing a nonlinear function can improve its nonlinear representation ability
                     [22]. Nonlinear functions often include the tanh function, the ReLU function, the sigmoid
function, and the maxout function. The sigmoid function is shown in formula (5):

$f(x)=\frac{1}{1+e^{-x}}$ (5)
In formula (5), $e$ is the natural constant. The sigmoid function is easy to differentiate, but its output range is small and its convergence speed is slow. The tanh function is shown in formula (6):

$f(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$ (6)
The value range of the tanh function is [-1, 1], but the vanishing gradient problem still occurs. The ReLU function is shown in formula (7):

$f(x)=\max (0,x)$ (7)
The ReLU function is difficult to saturate, which helps prevent the vanishing gradient problem, but it may leave some parameters inactive so that the corresponding neurons effectively die. The maxout function is shown in formula (8):

$h_{l}^{i}=\max_{j\in [1,k]}z_{l}^{ij}$ (8)
In formula (8), $k$ is the number of neurons over which the maximum is taken, $l$ is the index of the neural layer, $h_{l}^{i}$ represents the output, and $z_{l}^{ij}$ is the activation amount. Then, in layer $l$, the activation amount is shown in formula (9):

$z_{l}^{ij}=x^{T}W_{\cdot ij}+b_{ij}$ (9)
In formula (9), $b$ represents the offset, $x^{T}$ represents the eigenvector, and $W$ is a three-dimensional matrix related to the input and output nodes. The maxout function has a strong fitting ability and can give the network a constant gradient, thereby effectively alleviating the vanishing gradient phenomenon. Therefore, this function is selected to optimize the acoustic model. In English speech recognition, the traditional approach needs to perform forced alignment on the training speech, which increases complexity and training difficulty. Therefore, an end-to-end structure is added, and connectionist temporal classification (CTC) and the CNN are combined for this research. CTC handles the temporal classification task by predicting the output of each frame to recognize the speech signal. CTC is an objective function based on softmax. A blank node is introduced in CTC, which can automatically optimize the output sequence and realize the mapping of multiple paths to the same label sequence [23]. For a speech input of frame length $T$, the probability of the corresponding path under CTC is shown in formula (10):

$p(\pi |x)=\prod_{t=1}^{T}y_{\pi _{t}}^{t}$ (10)
In formula (10), $T$ represents the length of the speech frame, and $\pi $ represents the corresponding path. Then, the forward-backward algorithm is introduced, and the result is shown in formula (11):
                  
                  
In formula (11), $\alpha (t,d)$ represents the forward probability value of the forward vector, and $y_{l}^{t}$ represents the probability of outputting label $l$ at time $t$. Therefore, the forward probability at a given moment is calculated as shown in formula (12):
                  
                  
In formula (12), $d$ represents a node, and $blank$ denotes the blank label. The idea of the backward algorithm is the same as that of the forward algorithm, and its formula is shown in (13):
                  
                  
In formula (13), $\beta (t,d)$ represents the backward probability value. Based on the maximum likelihood function, the CTC loss function is shown in formula (14):

$L(x,z)=-\ln p(z|x)$ (14)
In formula (14), $x$ is the input, and $z$ is the output sequence. Therefore, the CTC loss function over the entire training set is obtained as formula (15):

$L(S)=-\sum_{(x,z)\in S}\ln p(z|x)$ (15)
In formula (15), $S$ represents the training set, and $p$ represents the probability. Convolutional and pooling layers can help accurately identify slightly displaced and deformed input features, while the end-to-end architecture optimizes the output sequence. Therefore, the two are combined into a CTC-CNN (maxout) acoustic model with the parameters shown in Table 1.
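For illustration, the following PyTorch sketch assembles a CTC-CNN (maxout) acoustic model following one possible reading of Table 1. Because PyTorch provides no built-in maxout layer, a simple maxout unit (formulas (8) and (9)) is implemented as the maximum over $k$ linear pieces; the padding, the pooled feature layout, and the label-set size are assumptions made only for the sketch, not values given in this work.

```python
# Hedged PyTorch sketch of a CTC-CNN (maxout) acoustic model following one
# reading of Table 1. Values marked "assumed" are illustrative choices
# (the paper does not specify padding, frame counts, or vocabulary size).
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: maximum over k linear pieces (cf. formulas (8)-(9))."""
    def __init__(self, in_features: int, out_features: int, k: int = 2):
        super().__init__()
        self.k, self.out_features = k, out_features
        self.linear = nn.Linear(in_features, out_features * k)

    def forward(self, x):
        z = self.linear(x).view(*x.shape[:-1], self.out_features, self.k)
        return z.max(dim=-1).values

class CTCCNN(nn.Module):
    def __init__(self, n_classes: int = 29):           # assumed label-set size
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 128, kernel_size=9, stride=1, padding=4), nn.Sigmoid(),
            nn.Conv2d(128, 256, kernel_size=(4, 3), stride=1, padding=(2, 1)), nn.Sigmoid(),
            nn.MaxPool2d(kernel_size=3),                # 3x3 maximum pooling
        )
        feat_dim = 256 * (39 // 3)                      # channels * pooled feature bins (assumed layout)
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.out = Maxout(1024, n_classes)              # maxout layer feeding CTC

    def forward(self, x):                               # x: (batch, 1, frames, 39 MFCC)
        h = self.conv(x)                                # (batch, 256, frames', bins')
        h = h.permute(0, 2, 1, 3).flatten(2)            # (batch, frames', 256 * bins')
        return self.out(self.fc(h)).log_softmax(-1)     # per-frame log-probabilities for CTC
```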
                  
Therefore, the processing flow of the CTC-CNN (maxout) acoustic model is as follows. English speech is first input, and feature vectors are obtained through feature extraction. Each feature vector is then fed into the proposed CTC-CNN model and enters the first convolutional layer, which at this stage extracts relatively coarse acoustic features. Next, the nonlinear activation function and the convolution operation of the second convolutional layer are used to obtain relatively fine features. The convolutional features then reach the pooling layer, where the maximum pooling process further reduces the mean shift so that more accurate features are obtained. After the feature maps from the preceding steps reach the fully connected layer, the posterior probability is obtained through the mapping of the convolutional features and the activation function, and is used as the output. Finally, CTC (maxout) classifies and optimizes the recognition of the speech features and outputs the recognized speech after decoding.
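The end-to-end training objective of formulas (14) and (15) can then be attached to the sketch above (the CTCCNN class is assumed to be in scope): torch.nn.CTCLoss evaluates $-\ln p(z|x)$ internally through the forward-backward recursions corresponding to formulas (11)-(13). The batch size, sequence lengths, and label values below are illustrative only.

```python
# Hedged usage sketch: training the CTC-CNN sketch above with the CTC objective
# of formulas (14)-(15). torch.nn.CTCLoss computes -ln p(z|x) per utterance via
# the forward-backward recursions; all sizes and label values are illustrative.
import torch
import torch.nn as nn

model = CTCCNN(n_classes=29)                        # 28 labels + blank (assumed)
ctc = nn.CTCLoss(blank=0, reduction="sum", zero_infinity=True)   # blank index 0 (assumed)

x = torch.randn(4, 1, 120, 39)                      # 4 utterances, 120 frames, 39-dim MFCC
log_probs = model(x)                                # (batch, frames', classes)
log_probs = log_probs.permute(1, 0, 2)              # CTCLoss expects (frames, batch, classes)

targets = torch.randint(1, 29, (4, 20))             # dummy label sequences (non-blank)
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((4,), 20, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)   # summed -ln p(z|x), cf. formula (15)
loss.backward()                                     # gradients for end-to-end training
print(float(loss))
```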
                  
                  
                        Fig. 3. Diagram of the CNN Structure.
 
                  
                        Fig. 4. Diagram of the Convolution Layer’s Local Connection Mode.
 
                  
Table 1. CTC-CNN (maxout) Acoustic Model Parameter Table.

| Network layer | Parameter |
|---|---|
| Input | 39-dimensional MFCC features |
| Convolution layer 1 | Convolution kernel: 9×9, number of kernels: 128, stride: 1×1, activation function: sigmoid |
| Convolution layer 2 | Convolution kernel: 4×3, number of kernels: 256, stride: 1×1, activation function: sigmoid |
| Pooling layer | Maximum pooling: 3×3 |
| Fully connected layer 1 | Activation function: ReLU, number of neuron nodes: 1024 |
| Fully connected layer 2 | Activation function: ReLU, number of neuron nodes: 1024 |
| Maxout layer | CTC |