3.1 Integrating the Vuforia SDK
Unity3D is a real-time 3D engine developed by Unity Technologies, aimed mainly at interactive graphics, and is widely used for augmented reality development. Unity3D can create and render scene models, import the Vuforia SDK extension toolkit, and implement tracking and detection through the corresponding interfaces to obtain AR applications with human-computer interaction and a virtual-real overlay [15]. Unity3D supports 3D models in OBJ or FBX format. After these models are imported into various environments and scenes, they can be augmented with environmental sound effects and physical material effects such as wind, sky, and fog. Unity3D also supports editing, testing, and instant previewing of 3D application scenes, and the finished product can be deployed directly to multiple platforms as desired [16]. The Vuforia Augmented Reality SDK is aimed mainly at augmented reality applications on mobile devices. It uses computer vision technology to recognize and track planar images and simple three-dimensional objects in real time, and supports developers in placing and adjusting virtual objects relative to the real scene. Its data flow modules are shown in Fig. 1.
The data flow of the Vuforia SDK has four modules: input conversion, the database module, tracking detection, and the rendering input module. The input conversion module obtains a new image format through an image converter after the camera captures a scene. The database module stores the data, both in the cloud and on the local device. The tracking detection module tracks targets, including user-defined targets. The rendering input module contains application coding and video background rendering. The four modules pass data to each other and feed problems back, so Unity3D is easily integrated with the Vuforia SDK. Good adaptation and powerful engine functions enable developers to obtain augmented reality interactive applications with excellent effects from a simple design. Therefore, AR markers based on Unity3D integrated with the Vuforia SDK not only recognize three-dimensional models but also provide real-time tracking for oral English teaching, as shown in Fig. 2.
In the AR oral English teaching mode of Unity3D integrated with the Vuforia SDK, students hold synchronous dialogues in a simulated real scene. The teacher can switch between different roles, and multiple students can cooperate with each other to communicate orally in English. At the same time, teachers use computers to construct various dialogue situations in which students can participate and communicate as if with foreigners. In situations such as shopping and travel in particular, AR technology reproduces the scene with high fidelity, and students' adaptability in English can be greatly improved [17]. With the help of AR, students' listening and reading processes, voices, and videos can be recorded, and video playback is supported, realizing teaching across time and space. In addition, students' autonomous learning abilities will make great progress because AR technology itself is attractive. It can create a relaxed and harmonious oral English learning environment, give students the experience of a real atmosphere, and stimulate their enthusiasm and initiative for oral learning.
Fig. 1. Diagram of the Vuforia SDK Data Flow Module.
Fig. 2. AR Oral English Teaching Mode Based on Unity3D Integrated with the Vuforia SDK.
3.2 English Speech Recognition Based on a CNN
In an English speech recognition system, the acoustic model is an integral part. The CNN, a prominent deep learning algorithm, owes its efficiency to its convolution-pooling structure, which significantly reduces the number of parameters. In addition, the convolution process eliminates the impact of changes in signal amplitude, and the model is highly adaptable. CNNs have already been applied successfully in speech recognition [18], and applying them to English speech recognition can greatly improve the performance of the acoustic model. The basic structure of a CNN is shown in Fig. 3.
The convolutional layer of a CNN has multiple feature maps, and each feature map contains several neurons. The input of a feature map is obtained by locally filtering the input features with a convolution kernel, which is fundamentally a weight matrix [19]. The convolutional layers of a CNN first extract coarse information and then extract increasingly discriminative features until the key distinguishable features are obtained. Therefore, the fundamental role of the convolutional layer is to extract the deep information contained in the input speech signal and pass it to the pooling layer. The local connections of the convolutional layers are shown in Fig. 4.
In Fig. 4, the input is layer $L-1$. Its neurons are connected to the adjacent neurons in layer $L$ through local connections, and the weights are shared at the same time. The weights of the neurons on the same feature plane are shared, as shown in formula (1):
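Reading $i$ and $j$ as neuron positions and $a$, $b$ as feature-plane indices (a notational assumption), the sharing can be stated in the conventional way: the kernel weight connecting feature plane $a$ of layer $L-1$ to feature plane $b$ of layer $L$ does not depend on the neuron position, i.e.,
$$w_{i}^{a,b}=w_{j}^{a,b} . \qquad (1)$$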
In formula (1), $i$ and $j$ are neuron indices, $a$ and $b$ are feature-plane indices, and $w$ represents the weight. Through weight sharing, the CNN reduces the complexity of the model, thereby reducing the number of parameters to be learned and making the model easier to train [20]. Each feature plane of a convolutional layer in the CNN corresponds uniquely to an input feature plane of the pooling layer. The pooling layer further condenses the information from the convolutional layer; max pooling is used to deal with the deviation of the estimated mean caused by errors in the convolutional-layer parameters, as shown in formula (2):
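With $a_{ij}$ read as the value at each point $(i,j)$ of the pooling neighborhood $N_{m}$, the usual max-pooling form is
$$h_{m}=\max_{(i,j)\in N_{m}} a_{ij} . \qquad (2)$$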
In formula (2), $N_{m}$ is the pooling neighborhood (of a given size), $h_{m}$ represents the output value for this neighborhood, and $a_{ij}$ is the value at each point contained in the neighborhood, of which the maximum is taken. To mitigate the error whereby the limited neighborhood size increases the variance of the estimate, mean pooling is also applied, which reduces this variance, as shown in formula (3):
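Under the same reading, the mean-pooling operation averages the values over the neighborhood:
$$h_{m}=\frac{1}{\left|N_{m}\right|}\sum_{(i,j)\in N_{m}} a_{ij} . \qquad (3)$$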
In formula (3), $i$ and $j$ index the points in the neighborhood. The pooling layer preserves the features extracted by the convolutional layer to the greatest extent; it further reduces the amount of computation and prevents overfitting [21]. At the same time, when the pooling layer compresses the features, it does not damage the speech features but maintains their invariance to a certain extent. Therefore, in the design of the acoustic model, the mean shift is reduced by the pooling layer, and a 3${\times}$3 pooling kernel is selected to obtain higher-precision features. After multiple convolutional and pooling layers, the speech feature information is passed to the fully connected layer. The fully connected layer receives all the local information contained in the previous layer, and its calculation is shown in formula (4):
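In its usual form, with $x_{i}$ denoting the inputs received from the previous layer (a symbol introduced here for clarity), each fully connected neuron computes
$$y=f\left(\sum_{i=1}^{N} w_{i}x_{i}+b\right) . \qquad (4)$$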
In formula (4), $f$ represents the activation function, $b$ is the bias of the neuron, $N$ is the number of neurons, and $w$ is the weight. The fully connected layer integrates the feature maps obtained by the convolution and pooling operations and finally outputs a vector or probability value; that is, it acts as the classifier of the network, mapping the preceding feature representation to the label space. In the CNN design, selecting an appropriate activation function retains better speech features, and introducing a nonlinear function improves the network's nonlinear representation ability [22]. Commonly used nonlinear functions include the tanh, ReLU, sigmoid, and maxout functions. The sigmoid function is shown in formula (5):
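Written out, this is the standard sigmoid:
$$\mathrm{sigmoid}(x)=\frac{1}{1+e^{-x}} . \qquad (5)$$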
In formula (5), $e$ is the natural constant. The sigmoid function is easy to differentiate, but its output range is small and its convergence is slow. The tanh function is shown in formula (6):
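The standard tanh form is
$$\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} . \qquad (6)$$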
The value range of the tanh function is [-1, 1], but the vanishing gradient problem still occurs. The ReLU function is shown in formula (7):
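The standard ReLU form is
$$\mathrm{ReLU}(x)=\max(0,x) . \qquad (7)$$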
The ReLU function does not saturate easily, which helps prevent the vanishing gradient problem, but some neurons may become difficult to activate, so that their parameters are no longer updated. The maxout function is shown in formula (8):
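In the standard maxout formulation, the output of unit $i$ in layer $l$ is the maximum over its $k$ candidate activations:
$$h_{l}^{i}=\max_{j\in [1,k]} z_{l}^{ij} . \qquad (8)$$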
In formula (8), $k$ is the number of candidate neurons over which the maximum is taken, $l$ is the index of the network layer, $h_{l}^{i}$ represents the output, and $z_{l}^{ij}$ is the activation. For layer $l$, the activation is given in formula (9):
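Following the usual maxout definition, each candidate activation is an affine function of the layer input:
$$z_{l}^{ij}=x^{T}W_{\cdot ij}+b_{ij} . \qquad (9)$$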
In formula (9), $b$ represents the bias, $x^{T}$ represents the transposed feature vector, and $W$ is a three-dimensional matrix related to the input and output nodes. The maxout function has a strong fitting ability and can give the network a constant gradient, thereby effectively alleviating the vanishing gradient phenomenon. Therefore, this function is selected to optimize the acoustic model. In English speech recognition, the traditional approach requires forced alignment of the training speech, which increases complexity and training difficulty. Therefore, an end-to-end structure is added, and connectionist temporal classification (CTC) is combined with the CNN in this research. CTC handles the temporal classification task by predicting the output of each frame in order to recognize the speech signal. CTC is an objective function based on softmax. A blank node is introduced in CTC, which automatically optimizes the output sequence and maps multiple paths to the same label sequence [23]. The probability of a path over the given number of speech frames under CTC is shown in formula (10):
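Under the standard CTC independence assumption, the probability of a path $\pi$ over $T$ frames is the product of the per-frame output probabilities:
$$p(\pi \mid x)=\prod_{t=1}^{T} y_{\pi_{t}}^{t} . \qquad (10)$$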
In formula (10), $T$ represents the number of speech frames, and $\pi $ represents the corresponding path. The forward-backward algorithm is then introduced, as shown in formula (11):
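Assuming the standard definition of the CTC forward variable, with $\mathcal{B}$ denoting the many-to-one mapping from paths to label sequences and $l$ the blank-augmented label sequence,
$$\alpha(t,d)=\sum_{\pi :\, \mathcal{B}(\pi_{1:t})=l_{1:d}} \ \prod_{t'=1}^{t} y_{\pi_{t'}}^{t'} . \qquad (11)$$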
In formula (11), $\alpha(t,d)$ represents the forward probability value, and $y_{l}^{t}$ represents the probability of outputting label $l$ at time $t$. Therefore, the forward probability at a given time is calculated as shown in formula (12):
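In the standard CTC form over the blank-augmented label sequence $l$, the forward recursion is
$$\alpha(t,d)=\begin{cases}\left(\alpha(t-1,d)+\alpha(t-1,d-1)\right)y_{l_{d}}^{t}, & l_{d}=blank\ \text{or}\ l_{d}=l_{d-2}\\ \left(\alpha(t-1,d)+\alpha(t-1,d-1)+\alpha(t-1,d-2)\right)y_{l_{d}}^{t}, & \text{otherwise.}\end{cases} \qquad (12)$$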
In formula (12), $d$ represents a node, and $blank$ denotes the blank (space) symbol. The idea of the backward algorithm is the same as that of the forward algorithm, and its formula is shown in formula (13):
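One common form, mirroring the forward recursion, is
$$\beta(t,d)=\begin{cases}\left(\beta(t+1,d)+\beta(t+1,d+1)\right)y_{l_{d}}^{t}, & l_{d}=blank\ \text{or}\ l_{d}=l_{d+2}\\ \left(\beta(t+1,d)+\beta(t+1,d+1)+\beta(t+1,d+2)\right)y_{l_{d}}^{t}, & \text{otherwise.}\end{cases} \qquad (13)$$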
In formula (13), $\beta (t,d)$ represents the backward probability value. Through the maximum likelihood criterion, the CTC loss function is obtained as shown in formula (14):
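In the usual maximum-likelihood form, summing over all paths that map to the label sequence $z$,
$$L(x,z)=-\ln p(z\mid x)=-\ln \sum_{\pi \in \mathcal{B}^{-1}(z)} p(\pi \mid x) . \qquad (14)$$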
In formula (14), $x$ is the input, and $z$ is the output label sequence. Therefore, the CTC loss function over the whole training set is given by formula (15):
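Summed over the training set $S$, this gives
$$L(S)=-\sum_{(x,z)\in S}\ln p(z\mid x) . \qquad (15)$$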
In formula (15), $S$ represents the training set, and $p$ represents the probability. Convolutional and pooling layers help the model identify slightly displaced and deformed input features accurately, while the end-to-end architecture optimizes the output sequence. Therefore, the two are combined into a CTC-CNN (maxout) acoustic model with the parameters shown in Table 1.
The processing flow of the CTC-CNN (maxout) acoustic model is as follows. English speech is first input, and feature vectors are obtained through feature extraction. The feature vectors are then fed into the proposed CTC-CNN model and enter the first convolutional layer, which extracts relatively coarse acoustic features. The nonlinear activation function and the convolution operation of the second convolutional layer then yield relatively fine features. The convolutional features reach the pooling layer, where max pooling further reduces the mean shift and yields more accurate features. After the feature maps obtained in the previous steps reach the fully connected layers, the posterior probability is obtained through the mapping of the fully connected weights and the activation function and is used as the output. Finally, the CTC (maxout) output layer classifies and optimizes the recognition of the speech features, and the recognized speech is output after decoding.
Fig. 3. Diagram of the CNN Structure.
Fig. 4. Diagram of the Convolution Layer’s Local Connection Mode.
Table 1. CTC-CNN (maxout) Acoustic Model Parameter Table.
Network layer | Parameter
Input | 39-dimensional MFCC features
Convolution layer 1 | Kernel: 9×9, number of kernels: 128, stride: 1×1, activation function: sigmoid
Convolution layer 2 | Kernel: 4×3, number of kernels: 256, stride: 1×1, activation function: sigmoid
Pooling layer | Max pooling: 3×3
Fully connected layer 1 | Activation function: ReLU, neuron nodes: 1024
Fully connected layer 2 | Activation function: ReLU, neuron nodes: 1024
Maxout layer | CTC
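To make the structure in Table 1 concrete, a minimal PyTorch sketch of such a CTC-CNN (maxout) acoustic model is given below. It is an illustration only, not the authors' implementation: the maxout piece count k, the label vocabulary size (num_labels = 29), the padding, the pooling stride, and the treatment of the 39-dimensional MFCC frames as a one-channel time-frequency map are assumptions introduced here.

# Minimal sketch of a CTC-CNN (maxout) acoustic model following Table 1.
# num_labels, k, padding, and pooling stride are illustrative assumptions.
import torch
import torch.nn as nn


class Maxout(nn.Module):
    # Maxout unit (formulas (8)-(9)): each output is the max over k affine pieces.
    def __init__(self, in_features, out_features, k=2):
        super().__init__()
        self.k = k
        self.out_features = out_features
        self.linear = nn.Linear(in_features, out_features * k)

    def forward(self, x):
        z = self.linear(x)                                    # (..., out_features * k)
        z = z.view(*z.shape[:-1], self.out_features, self.k)  # (..., out_features, k)
        return z.max(dim=-1).values                           # max over the k pieces


class CTCCNNMaxout(nn.Module):
    def __init__(self, num_labels=29, k=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 128, kernel_size=9, stride=1, padding=4),              # conv layer 1: 9x9, 128 kernels
            nn.Sigmoid(),
            nn.Conv2d(128, 256, kernel_size=(4, 3), stride=1, padding=(2, 1)),  # conv layer 2: 4x3, 256 kernels
            nn.Sigmoid(),
            nn.MaxPool2d(kernel_size=3, stride=3),                              # 3x3 max pooling
        )
        self.fc = nn.Sequential(
            nn.Linear(256 * 13, 1024), nn.ReLU(),  # fully connected layer 1 (13 = 39 MFCC dims / 3 pooling)
            nn.Linear(1024, 1024), nn.ReLU(),      # fully connected layer 2
            Maxout(1024, num_labels, k=k),         # maxout output layer over the label set
        )
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # CTC objective, blank index 0

    def forward(self, mfcc):
        # mfcc: (batch, frames, 39), treated as a one-channel time-frequency map.
        x = self.conv(mfcc.unsqueeze(1))        # (batch, 256, frames', 13)
        x = x.permute(0, 2, 1, 3).flatten(2)    # (batch, frames', 256 * 13)
        return self.fc(x).log_softmax(dim=-1)   # per-frame log-probabilities for CTC


# Example with a dummy batch of two utterances of 300 MFCC frames each.
model = CTCCNNMaxout()
log_probs = model(torch.randn(2, 300, 39)).permute(1, 0, 2)  # (frames', batch, labels) for nn.CTCLoss
targets = torch.randint(1, 29, (2, 20))                      # dummy label sequences (blank = 0 excluded)
input_lengths = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)
loss = model.ctc_loss(log_probs, targets, input_lengths, target_lengths)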