
  1. School of Applied Foreign Languages, Henan Industry and Trade Vocational College, Zhengzhou, 451191, China (Fan136997936@163.com)



Keywords: Virtual teaching, Emotional interaction, Facial expression recognition, Self-cure network, Incentive factors

1. Introduction

Remote teaching provides students with greater flexibility and convenience, but it also brings new challenges, one of the most significant being the lack of emotional interaction [1]. In remote virtual teaching, physical distance and technological limitations weaken the emotional interaction between teachers and students, which creates difficulties for teaching. Facial expressions, as one of the most important non-verbal communication channels, play an irreplaceable role in conveying emotional information and establishing emotional connections [2]. Because teachers and students are separated by screens, they cannot directly perceive each other's facial expressions, which limits emotional interaction in teaching. Students find it difficult to accurately understand the teacher's emotional state, and teachers cannot obtain emotional feedback from students in a timely manner [3]. This emotional deficiency may reduce teaching effectiveness, weaken students' sense of engagement in learning, and prevent teachers from adjusting teaching strategies in time to meet personalized needs [4]. With the development of deep learning, the Convolutional Neural Network (CNN) has been widely used in image processing and recognition, and can perform feature extraction and classification in facial expression recognition tasks. The Self-Cure Network (SCN) can repair its own damage or errors through dynamic learning and adaptive adjustment, which improves the robustness and performance of the model. A Correction Strategy (CS) can be used to optimize the network thresholds and further improve the accuracy and stability of the model [5]. In view of this, the study introduces deep learning networks to achieve expression recognition and emotional interaction by constructing algorithmic models. The innovation of the research lies in analyzing the CNN framework of the introduced deep learning network, constructing a new facial expression recognition model by combining the SCN with a calibration strategy, and proposing a targeted emotional interaction model to address the lack of emotional interaction in remote virtual teaching. The research aims to promote the development of remote virtual teaching and provide more effective theoretical methods for it. The study is divided into four parts. The first part analyzes and summarizes existing research. The second part introduces how the facial expression recognition model and the emotional interaction model are built. The third part tests the performance of these two models. The last part summarizes the research.

2. Related works

In remote virtual teaching environments, emotional interaction between students and teachers is particularly important. However, due to the lack of face-to-face contact, it is difficult to accurately perceive the teacher's emotional expression, which hinders teaching communication and affects learning outcomes. To solve this problem, many scholars at home and abroad have explored facial expression recognition technology. N. Hajarolasvadi et al. found that existing video facial recognition techniques suffered from low frame rates and low recognition efficiency when processing individual videos with specific emotions, and therefore proposed a novel method for capturing representative frame sets of intrinsic spatial-domain videos by combining principal component analysis. The experimental results on the RML database were nearly 8% better than state-of-the-art facial expression recognition techniques, reducing data redundancy and shortening computation time [6]. Although computer vision technology has made significant progress, facial expression recognition still faces many unresolved issues, such as variations in perspective and posture. Therefore, J. Han et al. proposed a novel facial recognition model that combined harmonious learning algorithms. The experimental results showed that the new model performed better than similar facial expression recognition models on three different face data sets, demonstrating certain advantages [7]. D. Gera et al. found that facial expression recognition methods using deep learning commonly suffered from high network complexity and large numbers of parameters when capturing facial images, and therefore proposed a compact, lightweight expression recognition network. Compared with other lightweight networks, this network exhibited better robustness and superiority under occlusion and pose changes while greatly reducing computation time [8]. Y. Cai et al. found that traditional facial expression recognition techniques generally suffered from poor robustness and parameter complexity in complex data environments, and therefore introduced an adaptive loss function to optimize the neural network classification module, yielding a new facial expression recognition model. According to the experimental results, under the same conditions the new method effectively improved the efficiency of expression recognition, and the robustness and recall of the model were significantly improved [9].

Traditional facial expression recognition algorithms often require abundant annotated data and computing resources, and are also sensitive to factors such as lighting and posture, so they still face challenges in complex scenes. Therefore, domestic and foreign scholars have improved deep learning network algorithms and proposed some new methods. W. Hu et al. found that traditional CNNs have many limitations when processing large amounts of facial data, such as slow computation and poor fitting, and therefore proposed a novel model for learning discriminative facial expression features by combining label smoothing regularization. The experimental results showed that the improved recognition model enhanced the discriminative ability of deep facial expression features and was more competitive [10]. To improve the performance of existing facial expression recognition methods in computer vision, J. Li et al. proposed an end-to-end facial expression recognition method. The experimental results showed that the method achieved generally high accuracy on four self-made data sets and had strong robustness and feasibility. Recognizing different facial expressions is also of great significance for achieving emotional interaction in remote virtual teaching and improving the quality of teaching guidance and feedback [11]. Y. Zhang et al. proposed a public emotion network propagation interaction model with intelligent human-computer interaction by combining human-computer expression and emotion recognition technology. The experimental results indicated that the model could accurately reflect the emotional state of the public in the network, providing a practical basis for applying artificial intelligence technology to online public opinion judgment [12]. To improve the performance of existing emotional interaction models, Y. Ye et al. proposed a novel emotional interaction model based on grounded theory after analyzing the driving factors and influencing environment of the model. The experimental results indicated that the new model could provide timely feedback on emotional changes among a population and had a positive impact on factory management through technical guidance [13].

In summary, many scholars at home and abroad have studied facial expression recognition and emotional interaction in remote virtual teaching, and have proposed practical methods that provide a degree of technical support for existing online teaching. However, with the continuous expansion of network resources and the deepening of teaching tasks, existing facial expression recognition methods and interaction models in remote virtual teaching can no longer meet the requirements. In view of this, the study attempts to improve deep learning algorithms and emotional interaction, aiming to promote the development of online teaching and provide reliable technical support.

3. Construction of An Emotional Interaction Model for Facial Expression Recognition in Remote Virtual Teaching with Emotional Deficiency

Firstly, the study discusses facial expression recognition and analyzes the traditional CNN. On this basis, a self-cure neural network is introduced; after describing its modules, a CS is further introduced for adaptive threshold adjustment, and a new facial expression recognition model is proposed. In addition, the study constructs an emotional space and emotional transfer paths, discusses emotional motivation and fading compensation separately, and proposes an emotional interaction algorithm. Finally, a new emotional interaction model is proposed by combining it with the facial expression recognition model.

3.1 Facial Expression Recognition Algorithm Based on Improved Convolutional Neural Network

Due to the complexity and diversity of facial expressions, facial expression recognition places high demands on the accuracy and robustness of recognition algorithms. As a classic foundational network, the CNN has given rise to various new recognition algorithms based on its excellent convolutional characteristics. A typical CNN consists of multiple modules, namely the convolutional layer, the pooling layer, and the fully connected layer [14]. The convolution operation process is shown in Fig. 1.

Fig. 1. Convolutional operation process.

../../Resources/ieie/IEIESPC.2025.14.1.68/image1.png

From Fig. 1, the convolution kernel slides over the input data and performs element-wise multiplication with the local region at each position; summing the products at each position produces the output values that together form a two-dimensional tensor, namely the feature map [15]. The convolutional operation of convolutional layers can extract different kinds of feature information, such as texture, color, and edges. The pooling layer compresses the feature map spatially, reducing its dimension while preserving feature information. The pooling operation is shown in Fig. 2.

Fig. 2. Pooling operation.

../../Resources/ieie/IEIESPC.2025.14.1.68/image2.png

In Fig. 2, by selecting the maximum, minimum, or average value of the data within a specific area, pooling simplifies the computational complexity and the number of network parameters. However, the traditional CNN has some limitations in facial expression recognition tasks. Firstly, different facial regions and the dynamic changes of expressions play a crucial role in expression, and traditional CNNs do not model this information well. Secondly, due to the uneven distribution of training data sets and the similarity between categories, traditional CNNs are easily affected by sample bias in facial expression recognition tasks, resulting in decreased recognition accuracy [16]. In view of this, the SCN, which evolves from the CNN, is introduced. Through dynamic learning and adaptive adjustment, it repairs its own damage or errors while maintaining the normal operation of the network [17]. Dynamic learning means that when the network detects damage or errors, the SCN uses this information to update its own configuration to fix the problem. Adaptive adjustment means that the SCN can adjust its behavior based on new inputs and feedback, such as reallocating resources, reacquiring routing information, or adjusting parameter configurations. When the network detects damage or errors, it can self-repair using the dynamic learning and adaptive adjustment mechanisms; by re-configuring its own connections and weights, the network returns to its normal state. The network structure of the SCN is shown in Fig. 3.

Fig. 3. SCN architecture diagram.

../../Resources/ieie/IEIESPC.2025.14.1.68/image3.png

In Fig. 3, the entire SCN structure consists of three main modules: the self-attention module, the regularization module, and the noise labeling module. Firstly, features are extracted from the image. Secondly, the importance of each part is weighted by the self-attention weighting module. Then, the weighted features are ranked by the regularization module, and the average weight is used as the regularization threshold in each ranking [18]. Finally, the noise labeling module labels the highly important features for subsequent screening and training. The importance weighting process of the self-attention module is shown in equation (1).

(1)
$ \alpha _{i} =\sigma (W_{\alpha }^{T} x_{i} ) . $

In equation (1), $\sigma $ represents the Sigmoid function. $W_{\alpha }^{T} $ represents the transposed parameter vector of the attention layer. $\alpha _{i} $ represents the importance of the $i$-th sample. $x_{i} $ denotes the $i$-th element of the input sequence. The self-attention weighting module calculates importance scores for each part of the image, such as pixels or regions. The Sigmoid function maps these scores to the interval between 0 and 1, which assigns weights to different parts of the input image. The main processing method of the regularization module is weight ranking, which divides the samples into high-weight and low-weight feature groups to ensure a clear difference between the two types of data [19]. The process is shown in equation (2).

(2)
$ L=\max (0,\delta _{1} -(\alpha _{H} -\alpha _{L} )) . $

In equation (2), $L$ represents the regularization function. $\alpha _{H} $ and $\alpha _{L} $ represent the average weights of the high-weight and low-weight groups, respectively. $\delta _{1} $ represents a fixed hyper-parameter. Noise labeling determines the threshold of each grouped data in the form of labels; the threshold, in turn, is determined by a combination of performance metrics such as accuracy and recall after model training. If the predicted probability exceeds the threshold, the data is adjusted to a high-weight feature sample. The process is shown in equation (3).

(3)
$ y=\left\{\begin{aligned} l_{\max } ,\;\text{if}~P_{\max } -P_{gtInd} >\delta _{2},\\ l_{org} ,\;\text{otherwise}. \end{aligned}\right. $

In equation (3), $\delta _{2} $ represents a predetermined threshold. $l_{\max } $ and $l_{org} $ represent the label with the maximum predicted probability and the original label, respectively. $P_{\max } $ and $P_{gtInd} $ represent the maximum prediction probability and the prediction probability of the originally given label, respectively. Although the SCN can effectively mitigate sample uncertainty, repeatedly using a fixed predetermined threshold limits the handling of noisy labels and thereby reduces the robustness of the network. In view of this, the CS is introduced to reduce the impact of uncertain samples on model training [20]. Compared with other methods, the advantage of the CS is that it can improve the performance and robustness of the model in tasks such as facial expression recognition by calculating sample calibration weights and resetting the loss function, as well as by applying the calibration weights to the attention weighting process. The strategy can be roughly divided into two directions. The first is to calculate the sample calibration weight and reset the loss function based on this weight. The second is to assign the calibration weights to the attention weighting process. The thresholds are adjusted in a timely manner by monitoring the performance of the model on the validation or test set. This process is shown in equation (4).

(4)
$ cw_{i} =1-\sigma (W_{\alpha }^{T} x_{i} ) . $

In equation (4), $cw_{i} $ represents the calibration weight of the $i$-th sample after optimizing its importance weighting, while the remaining symbols are the same as before. The regularization process tends to produce small losses when images with higher weights are predicted, whereas images with smaller weights produce significant losses when predicted incorrectly [21]. To address this issue, the CS is used to optimize the regularization process, and the CS function replaces the regularization function, as shown in equation (5).

(5)
$ L_{cs} =\sum _{i=0}^{M}cw_{i} . $

In equation (5), $L_{cs} $ represents the CS function. $M$ represents the number of incorrectly predicted samples. For the noise labeling process, there may also be differences in weight values within the groups of lower importance, so it is unreasonable to reuse the fixed threshold for judgment. The CS can dynamically adjust the threshold according to the actual situation, so it better adapts to different data distributions and model performance and improves the accuracy and reliability of noise labeling. Therefore, the threshold for noise labeling is also replaced, as shown in equation (6) [22].

(6)
$ y'=\left\{\begin{aligned} l_{\max } ,\;\text{if}~P_{\max } -P_{gtInd} >\delta _{2} \cdot cw, \\ l_{org} ,\;\text{otherwise}. \end{aligned}\right. $

In equation (6), $\delta _{2} \cdot cw$ represents the calibrated sample weight threshold, and the remaining symbols are the same as before. On the basis of optimizing and improving the above modules, the final SCN-CS facial expression recognition model is proposed, as shown in Fig. 4.

In Fig. 4, the overall framework of the recognition model is still based on the SCN, with updates made to each part. Firstly, the student facial images are input and subjected to initial convolution operations by a CNN, which decomposes them into multiple feature layers. Secondly, the SCN performs self-attention importance weighting, regularization ranking, and noise labeling on these features. After that, the CS replaces the thresholds used in the above three steps, the results are recalculated, and the final output is produced. The model effectively improves the robustness of a single SCN. By applying the calibration strategy and dynamically adjusting the thresholds, the model can better adapt to changes in the data; it automatically adjusts to different data distributions and noise conditions during training, which improves its robustness and performance and makes it more stable in facial expression recognition under data fluctuations.

Fig. 4. SCN-CS model structure.

../../Resources/ieie/IEIESPC.2025.14.1.68/image4.png
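To make the workflow of equations (1)-(6) and Fig. 4 concrete, the following is a minimal PyTorch-style sketch of an SCN-CS head: self-attention importance weighting, rank regularization over high- and low-weight groups, calibration weights, and calibration-weighted relabeling. The class and variable names, the simplified linear layers, and the reading of equation (5) as a sum over relabeled samples are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SCNCSHead(nn.Module):
    """Illustrative SCN-CS head: attention weighting (Eq. 1), rank
    regularization (Eq. 2), calibration weights (Eq. 4), calibrated
    relabeling (Eqs. 3 and 6), and the CS loss (Eq. 5)."""

    def __init__(self, feat_dim: int, num_classes: int,
                 delta1: float = 0.15, delta2: float = 0.20):
        super().__init__()
        self.attention = nn.Linear(feat_dim, 1)       # W_alpha in Eq. (1)
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.delta1 = delta1                          # margin in Eq. (2)
        self.delta2 = delta2                          # base threshold in Eqs. (3)/(6)

    def forward(self, features: torch.Tensor, labels: torch.Tensor):
        # Eq. (1): importance weight alpha_i = sigmoid(W_alpha^T x_i)
        alpha = torch.sigmoid(self.attention(features)).squeeze(1)
        # Eq. (4): calibration weight cw_i = 1 - alpha_i
        cw = 1.0 - alpha

        # Attention-weighted logits for classification
        logits = self.classifier(features) * alpha.unsqueeze(1)

        # Eq. (2): rank regularization over high- and low-weight halves
        # (assumes a batch size of at least two)
        sorted_alpha, _ = torch.sort(alpha, descending=True)
        half = alpha.numel() // 2
        alpha_h, alpha_l = sorted_alpha[:half].mean(), sorted_alpha[half:].mean()
        rank_loss = F.relu(self.delta1 - (alpha_h - alpha_l))

        # Eqs. (3)/(6): relabel a sample when its top prediction exceeds the
        # prediction of the given label by the calibrated margin delta2 * cw
        probs = F.softmax(logits, dim=1)
        p_max, pred = probs.max(dim=1)
        p_given = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
        new_labels = torch.where(p_max - p_given > self.delta2 * cw, pred, labels)

        # Eq. (5): CS loss, read here as the sum of cw over relabeled samples
        cs_loss = cw[new_labels != labels].sum()

        ce_loss = F.cross_entropy(logits, new_labels)
        return logits, ce_loss + rank_loss + cs_loss, new_labels


# Toy usage: 16 feature vectors of dimension 512 with 7 expression classes
head = SCNCSHead(feat_dim=512, num_classes=7)
feats = torch.randn(16, 512)
labels = torch.randint(0, 7, (16,))
logits, loss, new_labels = head(feats, labels)
loss.backward()
```

In practice a CNN feature extractor such as the backbone in Fig. 4 would feed this head, and the loss terms are simply summed here on the assumption of the 1:1 loss ratio listed later in Table 1.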

3.2 Construction of An Interactive Model Based on Emotional Compensation and Motivation

Even after constructing a facial expression recognition model, it is still difficult to provide targeted assistance for virtual online classroom teaching from the recognition results alone [23]. Human emotions are rich and varied; according to dimensional analysis, they can be divided into six basic categories: happiness, sadness, surprise, disgust, anger, and fear. Based on these six basic emotions, a four-direction emotional space is constructed, as shown in Fig. 5 [24].

Fig. 5. Four-direction emotion 3D model.

../../Resources/ieie/IEIESPC.2025.14.1.68/image5.png

In Fig. 5, in the three-dimensional space, emotions are divided into 8 parts along the three coordinates. Excitement, joy, happiness, and relaxation are positive emotions, while calmness, depression, tension, and anger are negative emotions. Positive emotions motivate students' learning emotions, while negative emotions restrain them. The emotional space in this state is expressed in equation (7).

(7)
$ \left(\begin{array}{c} {S} \\ {A} \end{array}\right)=\left[\begin{array}{c} {s_{1} ,~s_{2} ,~\cdots,~ s_{N} } \\ {A_{1} ,~A_{2} ,~\cdots,~ A_{N} } \end{array}\right]. $

In equation (7), $S$ represents the set of emotions, and $s_{1} $, $s_{2} $ represent basic emotions. $A$ represents the probability set of the emotional states, and $A_{1} $, $A_{2} $ represent the probabilities of individual emotional states. The probabilities satisfy the normalization constraint in equation (8).

(8)
$ \sum _{i=1}^{N}A_{i} =1,~0\le A_{i} \le 1~(i=1,~2,~\cdots ,~N). $

In equation (8), $A_{i} $ denotes the probability of the $i$-th emotional state and $N$ denotes the number of basic emotional states. Although emotions cannot be directly observed, preliminary judgments can be made by processing expression features with special means, such as the Hidden Markov Model (HMM). Compared with other models of the same type, the HMM can calculate emotion transition probabilities by building emotion sequences, thus effectively capturing the dynamic changes in sequence data, and it is suitable for describing the evolution of emotions over time. Therefore, the study uses the HMM to construct the relational expressions of the above eight emotions, as shown in Fig. 6.

Fig. 6. HMM of 8 emotional states.

../../Resources/ieie/IEIESPC.2025.14.1.68/image6.png

According to Fig. 6 combined with equation (8), the transition probability between emotions of the same category is higher, while the transition probability between emotions of different categories is lower. Therefore, to optimize the transmission mode between emotional states, the study uses the HMM to convert the conditions into a probability evaluation problem [25]. The forward-backward algorithm is used to solve the probability evaluation problem [26]. The probability calculation of the forward and backward variables is shown in equation (9).

(9)
$ \left\{\begin{aligned} P(O\left|\lambda \right. )=\sum _{i=1}^{N}\alpha _{T} (i) ,\\ P(O\left|\lambda \right. )=\sum _{i=1}^{N}\beta _{T} (i) . \end{aligned}\right. $

In equation (9), $\lambda $ represents the parameters of the HMM emotion model. $O$ represents the observation sequence. $\alpha _{T} $ and $\beta _{T} $ represent the forward and backward variables at time $T$. After continuous iteration, the parameters of the model converge. The optimal observation sequence judgment is shown in equation (10).

(10)
$ \left|\log P(O\left|\lambda \right. )-\log P(O\left|\lambda _{0} \right. )\right|<\varepsilon . $

In equation (10), $\varepsilon $ represents the threshold and $\lambda _{0} $ represents the model parameters after multiple iterations. If the probability result satisfies the above equation, the emotional transition probability is output. Conscious stimulation is used to regulate emotions and stabilize the learning state; the study divides such stimulation into motivation and dilution compensation [27]. Incentives act on students' emotional states through motivational factors, thereby achieving a positive guiding effect. Incentive factors can be further divided into four categories, namely reducing learning difficulty, removing difficult problems, enhancing classroom fun, and improving teaching quality [28]. For the convenience of subsequent calculations, empirical probabilities are used to express the relationship between these four incentive factors and the transition state probabilities, as shown in equation (11).

(11)
$ A_{k} (i,j)=\tau \left(I_{k1} ,~I_{k2} ,~I_{k3} ,~I_{k4} \right) . $

In equation (11), $\tau $ represents the empirical constant. $A_{k} (i,j)$ represents the probability of transitioning from state $i$ to state $j$. $I_{k1} ,I_{k2} ,I_{k3} ,I_{k4} $ represent the four types of incentive factors, respectively. In addition, influenced by external stimuli, students' emotions may also experience a fading process, as shown in equation (12).

(12)
$ \frac{dE(t)}{dt} =\psi \left[E-E(t)\right] . $

In equation (12), $E$ represents the emotion in the ideal state and $E(t)$ represents the emotional state at moment $t$, so $dE(t)/dt$ is the rate of emotional change. $\psi $ represents the emotional dilution factor, which directly reflects the rate at which students' emotions fade. Based on the above description of emotional states, an Emotional Compensation and Encouragement Algorithm (ECEA) is proposed on top of the facial expression recognition model [29]. This algorithm projects student emotions onto an emotion axis through dimensionality reduction for judgment. The emotion that falls within the optimal emotion region on this axis is defined as the target emotion region, as shown in equation (13).

(13)
$ TEA(\pi _{i} )=\pi \cdot C . $

In equation (13), $TEA$ represents the target emotional region. $\pi $ represents the emotional state model. $C$ represents the dimension reduction coefficient of emotions. The condition for adopting emotional motivation strategies is shown in equation (14).

(14)
$ P_{TEA} =\left[O_{TEA} -\varphi ,~O_{TEA} +\varphi \right],~0<\varphi <1 . $

In equation (14), $P_{TEA} $ represents the optimal emotional region, $O_{TEA} $ represents the current emotional region, and $\varphi $ represents a constant offset on the emotional axis. The interaction model actively adopts emotional motivation strategies only when the student's current emotional state is lower than the optimal emotional state. In summary, the final Emotional Compensation and Encouragement Model (ECEM) is proposed. The interaction process of the model is shown in Fig. 7 [30].

Fig. 7. ECEM model interaction flow.

../../Resources/ieie/IEIESPC.2025.14.1.68/image7.png

In Fig. 7, the entire process is roughly divided into four parts. Firstly, the image is recognized and classified using the facial expression recognition model. The recognition result is input into the interaction model, and the initialization parameters are set. Equation (14) then determines whether the current emotion lies in the target emotion region. If it does, no motivational factor is needed (i.e., the motivational factor is 0) and the current emotion is output. If the target emotion region is not met, the current emotion is re-initialized and the probability of the next possible emotion is re-estimated through the calculation formulas. Then, based on the optimal emotional region on the emotional axis, if the estimate meets the optimal emotional region, the incentive factor is output; otherwise, the process returns, the parameters are re-initialized, and the optimal emotional region is recalculated until the condition is satisfied. In summary, this model can monitor and interact with students' emotions during remote virtual teaching, improving teaching quality.
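As a complement to Fig. 7, the following NumPy sketch walks through one ECEM interaction step under the definitions of equations (9) and (11)-(14): a forward-variable estimate of the likely next emotional state, projection onto the emotion axis, the target-region check, a simplified incentive adjustment, and one Euler step of the fading equation. The transition and observation matrices, the dilution rate, and the incentive value are illustrative assumptions, not values reported in the paper.

```python
import numpy as np


def hmm_forward(pi, A, B, obs):
    """Forward variables and P(O|lambda) of an HMM (Eq. 9), vectorised."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha, float(alpha.sum())        # P(O|lambda) = sum_i alpha_T(i)


def ecem_step(emotion_probs, A, C, target_center, phi,
              incentive=0.1, psi=0.2, ideal=1.0, dt=1.0):
    """One ECEM interaction step following Fig. 7 (illustrative sketch)."""
    # Estimate the next emotional-state distribution from the transition matrix
    next_probs = emotion_probs @ A
    next_probs = next_probs / next_probs.sum()           # keep Eq. (8) normalisation

    # Eq. (13): project the state onto the emotion axis, TEA = pi . C
    axis_value = float(next_probs @ C)

    # Eq. (14): check whether the emotion lies in the optimal region
    in_region = (target_center - phi) <= axis_value <= (target_center + phi)
    applied_incentive = 0.0 if in_region else incentive  # simplified incentive, Eq. (11)
    axis_value += applied_incentive

    # Eq. (12): emotional fading dE/dt = psi * (E - E(t)), one Euler step
    axis_value += psi * (ideal - axis_value) * dt
    return next_probs, axis_value, applied_incentive


# Minimal usage with an assumed 8-state emotion space (Fig. 5)
rng = np.random.default_rng(0)
A = rng.random((8, 8)); A /= A.sum(axis=1, keepdims=True)   # row-stochastic transitions
B = rng.random((8, 5)); B /= B.sum(axis=1, keepdims=True)   # 5 assumed observation symbols
pi = np.full(8, 1 / 8)                                      # uniform initial emotions
C = np.linspace(-1.0, 1.0, 8)                               # assumed axis coefficients

_, likelihood = hmm_forward(pi, A, B, obs=[0, 3, 1])        # Eq. (9) on a toy sequence
probs, value, used = ecem_step(pi, A, C, target_center=0.6, phi=0.2)
```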

4. Performance Testing of Emotional Interaction Models for Facial Expression Recognition

To verify the performance of the proposed models, comparative and simulation tests are first conducted on the facial expression recognition model, SCN-CS, against models of the same type. Secondly, parameter tests and multi-scenario tests are conducted on the emotional interaction model, ECEM, to verify its simulation performance.

4.1 Recognition Model Performance Testing

Three classic expression recognition databases are used, namely the Cohn-Kanade (CK+) data set, the Multi-media Multi-modal Interaction (MMI) data set, and the Facial Expression Images (FEI) data set. CK+ is a widely used facial expression database that includes a series of static facial images and video sequences. MMI is a multi-modal facial expression database that combines facial expressions with sound and body movements. FEI is a database containing nearly 200 facial images for research on expression recognition and facial analysis. The specific equipment and parameters for the experiment are shown in Table 1.

Table 1. Experimental environment and parameters.

Item | Parameter
Programming language | Python 3.7.13
CPU | Intel Core i7, 3.6 GHz
GPU | Nvidia GeForce GTX TITAN X
Memory | 64 GB
Backbone | CNN
Deep learning framework | PyTorch 1.13.1
Iterations | 72
$\delta _{1} $ | 0.15
$\delta _{2} $ | 0.20
Loss function ratio | 1:1
Initial learning rate | 0
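For clarity, the settings in Table 1 can be gathered into a small configuration object, as in the sketch below; δ1, δ2, the iteration count, and the 1:1 loss ratio follow Table 1, while the optimizer, batch size, learning-rate value, and placeholder backbone are assumptions introduced only for illustration.

```python
from dataclasses import dataclass

import torch


@dataclass
class TrainConfig:
    """Settings mirroring Table 1; optimizer, batch size, and the exact
    learning-rate value are assumptions rather than reported values."""
    delta1: float = 0.15       # rank-regularization margin, Eq. (2)
    delta2: float = 0.20       # relabeling threshold, Eqs. (3)/(6)
    iterations: int = 72       # training iterations listed in Table 1
    loss_ratio: float = 1.0    # 1:1 ratio between the main and auxiliary losses
    batch_size: int = 64       # assumed: not specified in Table 1
    lr: float = 1e-3           # assumed value; not taken from Table 1


cfg = TrainConfig()
device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = torch.nn.Sequential(                       # placeholder CNN backbone
    torch.nn.Conv2d(1, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
).to(device)
optimizer = torch.optim.SGD(backbone.parameters(), lr=cfg.lr, momentum=0.9)  # assumed optimizer
```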

After setting all testing parameters, the SCN-CS model is first subjected to an ablation test to explore the impact of each module on the overall performance, covering CNN, SCN, CS, and SCN-CS. The expression recognition accuracy of the four configurations on the CK+, MMI, and FEI data sets with varying sample sizes is shown in Fig. 8.

Fig. 8. Test results of each module on different data sets.

../../Resources/ieie/IEIESPC.2025.14.1.68/image8.png

Fig. 8 (a) shows the recognition rate test results of the different modules on the CK+ data set, Fig. 8 (b) on the MMI data set, and Fig. 8 (c) on the FEI data set. In Fig. 8, the recognition results of the SCN module increased gradually. With the parameter threshold optimization of the CS module, the performance of the final SCN-CS configuration was much higher than that of the other three. The highest recognition accuracy was 95.8% on the MMI data set, exceeding the CNN module by nearly 34.6%. In addition, to verify the degree to which the CS optimizes the SCN model and to explore the superiority and robustness of the SCN-CS model, the MMI data set, which showed the best data performance, is taken as an example, and several mature facial expression recognition models of the same type are introduced, namely Visual Geometry Group-Face (VGG-Face), DeepFace, and OpenFace. The number of iterations, the label noise, and the random occlusion are used as variables, and the mis-identification rate is used as the reference indicator. The test results are shown in Fig. 9.

Fig. 9. Performance test results of different facial expression recognition models.

../../Resources/ieie/IEIESPC.2025.14.1.68/image9.png

Fig. 9 (a) shows the results of the iteration test for the different recognition models. Fig. 9 (b) shows the error rate of the different models as the label noise changes. Fig. 9 (c) shows the error rate as the occlusion changes. In Fig. 9, the mis-identification rate of the four models gradually decreased as the number of iterations increased, while the error rate increased significantly with increasing label noise and occlusion ratio. The SCN-CS model had the smallest slope of increase in error rate. It performed best at a label noise ratio of 70%, with an error rate of 33%, and at an occlusion ratio of 67%, with the lowest mis-identification rate of 34%. In summary, after a horizontal comparison of multiple models, the proposed model had the best overall performance, with better computational speed and a lower recognition error rate. To compare the performance of the proposed model more intuitively, seven facial expressions are studied, namely anger, disgust, fear, happiness, surprise, sadness, and contempt. The confusion matrices of DeepFace, OpenFace, and SCN-CS on the MMI data set are plotted. The test results are shown in Fig. 10.

Fig. 10 (a) shows the confusion matrix of the DeepFace model on the MMI data set, Fig. 10 (b) that of the OpenFace model, and Fig. 10 (c) that of the SCN-CS model. In Fig. 10, among the three recognition models, the OpenFace model had the lowest recognition accuracy, with 5 recognition results above 80 points. Next was the DeepFace model, which had 6 recognition results with scores above 80. The proposed SCN-CS model still performed the best, accurately recognizing the corresponding facial expressions with recognition rates generally above 90 points. In summary, the test results once again validated the efficiency and feasibility of the proposed model.

Fig. 10. Test results of confusion matrix for three facial expression recognition models.

../../Resources/ieie/IEIESPC.2025.14.1.68/image10.png
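The per-class comparison in Fig. 10 relies on confusion matrices; the short sketch below shows one way such a matrix and the per-expression accuracies could be computed from predicted labels, assuming scikit-learn is available and using the seven expression names listed above. The synthetic labels are placeholders for the MMI test split.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

EXPRESSIONS = ["anger", "disgust", "fear", "happiness",
               "surprise", "sadness", "contempt"]


def per_class_accuracy(y_true, y_pred):
    """Confusion matrix and per-expression accuracy (row-normalised diagonal)."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(EXPRESSIONS))))
    acc = cm.diagonal() / np.maximum(cm.sum(axis=1), 1)    # avoid division by zero
    return cm, dict(zip(EXPRESSIONS, np.round(acc, 3)))


# Toy usage with synthetic predictions (real labels would come from the MMI test split)
rng = np.random.default_rng(1)
y_true = rng.integers(0, 7, size=200)
y_pred = np.where(rng.random(200) < 0.9, y_true, rng.integers(0, 7, size=200))
cm, accuracies = per_class_accuracy(y_true, y_pred)
```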

4.2 Interaction Model Performance Testing

The IMF-DB facial emotion data set, the RAF-DB data set, and the AffectNet data set are used as the test data. These three data sets each contain nearly 1000 facial expression images, which are roughly divided into 7 categories: anger, disgust, fear, happiness, sadness, surprise, and neutrality. The study begins with an ablation test of the ECEM model to investigate the effect of each module on the overall performance, covering HMM, HMM-FBA, HMM-FBA-incentive factors, and the full ECEM algorithm. Taking the emotion recognition accuracy and the number of recognized emotions as the test indicators, the test results are shown in Table 2.

Table 2. ECEM ablation test results.

Data set | Module | Accuracy (%) | Number of recognized emotions
IMF-DB | HMM | 80.4 | 257
IMF-DB | HMM-FBA | 82.6 | 368
IMF-DB | HMM-FBA-Incentive factors | 89.7 | 446
IMF-DB | ECEM | 92.4 | 689
RAF-DB | HMM | 80.2 | 287
RAF-DB | HMM-FBA | 83.3 | 397
RAF-DB | HMM-FBA-Incentive factors | 86.9 | 512
RAF-DB | ECEM | 92.1 | 651
AffectNet | HMM | 81.5 | 274
AffectNet | HMM-FBA | 85.6 | 327
AffectNet | HMM-FBA-Incentive factors | 89.3 | 446
AffectNet | ECEM | 93.4 | 692

From Table 2, the ablation test of the ECEM modules revealed that the HMM alone achieved a maximum emotion recognition accuracy of 81.5% and at most 287 recognized emotions across the three data sets. The accuracy of the HMM optimized by the FBA was improved by up to about 4%. After introducing the incentive factors, the recognition accuracy improved by up to 7% compared with HMM-FBA. ECEM had the highest emotion recognition accuracy of 93.4% and recognized 692 emotions, an effective improvement over the other three sub-configurations. Therefore, each sub-module of the ECEM model made a significant contribution to the overall performance. The four emotional motivation factors and emotional dilution in the ECEM model are then tested to explore the patterns of emotional change under a positive single factor, positive multiple factors, a negative single factor, and negative multiple factors. Emotional entropy is used as the testing indicator; a higher value indicates a more chaotic emotional state of the test subject. The test results are shown in Fig. 11.

Fig. 11 (a) shows the test results of the incentive factors on the IMF-DB data set, Fig. 11 (b) on the RAF-DB data set, and Fig. 11 (c) on the AffectNet data set. From Fig. 11, under both positive and negative factors, the emotional entropy of students gradually decreased and tended to stabilize. The test data of multiple factors were significantly better than those of single factors. The fastest compensation time for positive multiple factors was 14 minutes, and the fastest compensation time for negative multiple factors was 12 minutes, which were 4 minutes and 8 minutes shorter than single-factor compensation, respectively. In addition, popular interaction models are introduced for comparison, namely the K-Nearest Neighbor (KNN) model, the Naive Bayes (NB) model, and the Decision Tree (DT) model. Taking emotional intensity as the indicator, 14 emotions, including warmth and cheerfulness, are tested. The test results are shown in Table 3.

Fig. 11. Test results of different incentive factors on three data sets.

../../Resources/ieie/IEIESPC.2025.14.1.68/image11.png

Table 3. Emotional intensity testing of different emotional interaction models.

Type | KNN | NB | DT | ECEM
Warm | 0.092 | 0.053 | 0.052 | 0.091
Cheerfulness | 0.018 | 0.068 | 0.059 | 0.072
Lively | 0.053 | 0.071 | 0.067 | 0.081
Funny | 0.064 | 0.073 | 0.064 | 0.062
Exaggerate | 0.031 | 0.068 | 0.084 | 0.053
Humorous | 0.082 | 0.014 | 0.081 | 0.081
Interesting | 0.076 | 0.034 | 0.082 | 0.061
Dreary | 0.054 | 0.038 | 0.089 | 0.078
Dull | 0.051 | 0.039 | 0.059 | 0.075
Dreary | 0.069 | 0.056 | 0.088 | 0.079
Tangled | 0.091 | 0.082 | 0.041 | 0.082
Illusory | 0.093 | 0.086 | 0.048 | 0.089
Thrilling | 0.086 | 0.079 | 0.065 | 0.058
Terror | 0.024 | 0.062 | 0.081 | 0.067
Average value | 0.063 | 0.059 | 0.069 | 0.074

Table 4. Three interaction test results for different interaction models.

Model | Index | Cognitive (Low-level) | Cognitive (High-level) | Instructional (Promoting communication) | Instructional (Promoting reflection) | Emotional (Positive) | Emotional (Negative)
KNN | P | 0.076 | 0.052 | 0.083 | 0.081 | 0.089 | 0.064
KNN | R | 0.076 | 0.083 | 0.087 | 0.089 | 0.091 | 0.076
KNN | F1 | 0.075 | 0.073 | 0.072 | 0.053 | 0.077 | 0.071
NB | P | 0.081 | 0.082 | 0.067 | 0.072 | 0.078 | 0.082
NB | R | 0.083 | 0.084 | 0.089 | 0.091 | 0.082 | 0.070
NB | F1 | 0.082 | 0.081 | 0.079 | 0.085 | 0.079 | 0.075
DT | P | 0.067 | 0.061 | 0.067 | 0.064 | 0.068 | 0.082
DT | R | 0.074 | 0.079 | 0.075 | 0.068 | 0.069 | 0.073
DT | F1 | 0.082 | 0.083 | 0.084 | 0.089 | 0.088 | 0.074
ECEM | P | 0.089 | 0.088 | 0.078 | 0.085 | 0.083 | 0.092

From Table 3, among the 14 virtual teaching emotions tested, the NB model had the lowest average emotional intensity, followed by the KNN and DT models. The average emotional intensity of the ECEM model was 0.074, an increase of 0.025 over the NB model. These data reflect the fact that the ECEM model could process and respond more accurately to diverse emotions, such as warmth and cheerfulness. It helps the virtual teaching system adjust teaching strategies in real time to match students' emotional states and enhances interactivity and the personalized experience. Compared with the NB, KNN, and DT models, the higher emotional intensity of ECEM meant that it was more effective in processing complex emotions and in creating a more empathetic learning environment for students, which is crucial for improving engagement and learning outcomes in virtual teaching. To compare the effectiveness of the various interaction models in more depth, the four models are tested on three analysis indicators, namely cognitive interaction, instructional interaction, and emotional interaction. Cognitive interaction is divided into low-level and high-level, instructional interaction into promoting communication and promoting reflection, and emotional interaction into positive and negative. The precision, recall, and F1 values of each model on the different indicators are shown in Table 4.

In Table 4, the ECEM model had a maximum P value of 0.092, obtained for the negative dimension of emotional interaction. Its maximum R value was 0.094, obtained for promoting reflection in instructional interaction, and its maximum F1 value was 0.091, obtained for low-level cognitive interaction. These data indicate that the model performed well in recognizing and processing cognitive activities. In contrast, the other three models performed more evenly, with no prominent strengths, and differed significantly from the ECEM model. The reason is that ECEM has a more elaborate algorithmic structure that better captures and understands the multi-modal information in interaction data; at the same time, the model effectively integrates the cognitive, instructional, and affective aspects of interaction, achieving excellent performance on all metrics. Finally, the scoring method is used to compare the two better-performing models, the NB model and the ECEM model. The test results are shown in Fig. 12.

Fig. 12 (a) shows the scores of the various interaction indicators under the NB model, and Fig. 12 (b) shows those under the ECEM model. In Fig. 12, the highest score of the NB model was 82, obtained for promoting communication in instructional interaction and for low-level cognitive interaction. The scores of the various indicators for the ECEM model were all around 90 points, and its highest score was 93 points, for the negative dimension of emotional interaction. This result was basically consistent with the pattern of the data in Table 3 and demonstrated the clear advantage of ECEM in cognitive interaction, instructional interaction, and emotional interaction. It also confirmed that the strength of the NB model lay in instructional interaction and its weakness in emotional interaction. From the above results, recognition of and favorability towards the ECEM model are high, which further indicates the superiority of its interaction performance.

Fig. 12. Comparison of ratings between two models.

../../Resources/ieie/IEIESPC.2025.14.1.68/image12.png
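Table 4 reports precision (P), recall (R), and F1 for each interaction dimension; the sketch below shows one plausible way these indicators could be computed per dimension with scikit-learn, assuming each dimension is scored as a binary hit-or-miss label, which is an illustrative encoding rather than the paper's protocol.

```python
from sklearn.metrics import precision_recall_fscore_support


def interaction_scores(y_true, y_pred):
    """Precision, recall, and F1 for one interaction dimension (binary labels)."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    return {"P": round(p, 3), "R": round(r, 3), "F1": round(f1, 3)}


# Toy usage: 1 = the expected interaction behaviour was produced, 0 = it was not
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]
print(interaction_scores(y_true, y_pred))
```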

5. Conclusion

In modern education, remote virtual teaching has become an important teaching method. However, due to the lack of face-to-face communication, this teaching method often suffers from a loss of emotional interaction. In view of this, a novel facial expression recognition model was proposed by introducing the CS for threshold adjustment on top of the SCN. Secondly, based on the constructed emotional space, a new emotional interaction model was proposed. The experimental results showed that the highest recognition accuracy of the SCN-CS model was 95.8% on the MMI data set, exceeding the CNN module by nearly 34.6%. The SCN-CS model performed best at a label noise ratio of 70%, with an error rate of 33%, and at an occlusion ratio of 67%, with the lowest mis-identification rate of 34%. Compared with the other three models, the SCN-CS model accurately recognized the corresponding facial expressions, with recognition rates generally above 90 points. In addition, in the performance test of the ECEM model, the fastest compensation time for positive multiple factors was 14 minutes, while that for negative multiple factors was 12 minutes, which were 4 minutes and 8 minutes shorter than single-factor compensation, respectively. The average emotional intensity of the ECEM model was 0.074, an increase of 0.025 over the NB model. Its maximum P value was 0.092, its maximum R value 0.094, and its maximum F1 value 0.091. In summary, the SCN-CS and ECEM models proposed in the study have good overall performance and can achieve high-quality facial expression recognition and emotional interaction for online teaching. However, the study only used existing data sets in the tests and did not build a real classroom data set. Subsequent studies can therefore continue to improve the accuracy of facial expression recognition, enhance the adaptability of the emotional interaction model, validate the model with real-time data sets, and optimize the model for resource-limited environments, thereby providing better emotional interaction support for remote virtual teaching.

Funding

The research is supported by: Henan Province 2023 Science and Technology Development Plan, Technology Tackling Project, Critical Technology Research on a Business English Multiplayer Online Negotiation Platform based on Virtual Simulation Technology (Project ID: 232102210191).

REFERENCES

[1] D. Young, F. J. Real, R. D. Sahay, M. D. Ms, and M. Zackoff, "Remote virtual reality teaching: Closing an educational gap during a global pandemic," Hosp. Pediatr., vol. 11, no. 10, pp. 258-262, 2021.
[2] Y. Cai and T. Zhao, "Performance analysis of distance teaching classroom based on machine learning and virtual reality," J. Intell. Fuzzy Syst., vol. 40, no. 2, pp. 2157-2167, 2021.
[3] N. Baruch, S. Behrman, P. Wilkinson, T. Bajorek, E. Murphy, and M. Browning, "Negative bias in interpretation and facial expression recognition in late life depression: A case control study," Int. J. Geriatr. Psychiatry, vol. 36, no. 9, pp. 1450-1459, 2021.
[4] L. Gurukumar, R. G. Harinatha, and P. M. N. Giri, "Optimized scale-invariant feature transform with local tri-directional patterns for facial expression recognition with deep learning model," Comput. J., vol. 65, no. 9, pp. 2509-2527, 2021.
[5] D. Schwartz and M. Demasi, "The importance of teaching virtual rapport-building skills in telehealth curricula," Acad. Med., vol. 96, no. 9, pp. 1231-1232, 2021.
[6] N. Hajarolasvadi and H. Demirel, "Deep facial emotion recognition in video using eigenframes," IET Image Process., vol. 14, no. 14, pp. 3536-3546, 2020.
[7] J. Han, L. Du, X. Ye, L. Zhang, and J. Feng, "The devil is in the face: Exploiting harmonious representations for facial expression recognition," Neurocomputing, vol. 486, no. 14, pp. 104-113, 2022.
[8] D. Gera, S. Balasubramanian, and A. Jami, "CERN: Compact facial expression recognition net," Pattern Recognit. Lett., vol. 155, no. 3, pp. 9-18, 2022.
[9] Y. Cai, J. Gao, G. Zhang, and Y. Liu, "Efficient facial expression recognition based on convolutional neural network," Intell. Data Anal., vol. 25, no. 1, pp. 139-154, 2021.
[10] W. Hu, Y. Huang, F. Zhang, R. Li, and H. Li, "SeqFace: Learning discriminative features by using face sequences," IET Image Process., vol. 15, no. 11, pp. 2548-2558, 2021, doi: 10.1049/ipr2.12243.
[11] J. Li, K. Jin, D. Zhou, N. Kubota, and Z. Ju, "Attention mechanism-based CNN for facial expression recognition," Neurocomputing, vol. 411, no. 10, pp. 340-350, 2020.
[12] Y. Zhang, B. Dai, and Y. Zhong, "The establishment and optimization of public emotion network communication model using deep learning," Int. J. Humanoid Rob., vol. 19, no. 3, pp. 39-52, 2022.
[13] Y. Ye, R. Omar, B. Ning, and H. Ting, "Exploring the interactions of factory workers in China: A model development using the grounded theory approach," Sustainability, vol. 12, no. 17, pp. 6750-6752, 2020.
[14] X. Sun, P. Xia, and F. Ren, "Multi-attention based deep neural network with hybrid features for dynamic sequential facial expression recognition," Neurocomputing, vol. 139, no. 9, pp. 157-165, 2020.
[15] S. Zhao, H. Tao, Y. Zhang, T. Xu, K. Zhang, Z. Hao, and E. Chen, "A two-stage 3D CNN based learning method for spontaneous micro-expression recognition," Neurocomputing, vol. 448, no. 8, pp. 276-289, 2021.
[16] B. L. Liu and A. B. Djamel, "Effective image super resolution via hierarchical convolutional neural network," Neurocomputing, vol. 274, no. 1, pp. 109-116, 2020.
[17] Y. Liu, X. Zhang, J. Zhou, and L. Fu, "SG-DSN: A semantic graph-based dual-stream network for facial expression recognition," Neurocomputing, vol. 462, no. 28, pp. 320-330, 2021.
[18] Q. Zhu, L. Gao, H. Song, and Q. Mao, "Learning to disentangle emotion factors for facial expression recognition in the wild," Int. J. Intell. Syst., vol. 36, no. 6, pp. 2511-2527, 2021.
[19] N. B. Kar, D. R. Nayak, K. S. Babu, and Y. D. Zhang, "A hybrid feature descriptor with Jaya optimised least squares SVM for facial expression recognition," IET Image Process., vol. 15, no. 7, pp. 1471-1483, 2021.
[20] X. Jin and Z. Jin, "MiniExpNet: A small and effective facial expression recognition network based on facial local regions," Neurocomputing, vol. 462, no. 10, pp. 353-364, 2021.
[21] R. C. Zhi, C. X. Zhou, T. T. Li, S. Liu, and Y. Jin, "Action unit analysis enhanced facial expression recognition by deep neural network evolution," Neurocomputing, vol. 425, no. 3, pp. 135-148, 2021.
[22] W. Yang, H. Gao, Y. Jiang, J. Yu, J. Sun, J. Liu, and Z. Ju, "A cascaded feature pyramid network with non-backward propagation for facial expression recognition," IEEE Sens. J., vol. 21, no. 10, pp. 11382-11392, 2021.
[23] D. Liu, X. Ouyang, S. Xu, P. Zhou, K. He, and S. Wen, "SAANet: Siamese action-units attention network for improving dynamic facial expression recognition," Neurocomputing, vol. 413, no. 9, pp. 145-157, 2020.
[24] X. J. Wang, M. C. Fairhurst, and A. M. P. Canuto, "Improving multi-view facial expression recognition through two novel texture-based feature representations," Intell. Data Anal., vol. 24, no. 6, pp. 1455-1476, 2020.
[25] D. K. Jain, Z. Zhang, and K. Q. Huang, "Multi angle optimal pattern-based deep learning for automatic facial expression recognition," Pattern Recognit. Lett., vol. 139, no. 9, pp. 157-165, 2020.
[26] C. S. Ana, C. E. Jacobo, and O. S., "Social networks, emotions, and education: Design and validation of e-COM, a scale of socio-emotional interaction competencies among adolescents," Sustainability, vol. 14, no. 5, pp. 2566-2567, 2022.
[27] A. K. Ursula, G. G. Jose, L. P. Cristina, and M. L. Jose, "Interaction and emotional connection with pets: A descriptive analysis from Puerto Rico," Animals, vol. 10, no. 11, pp. 2316-2317, 2020.
[28] A. Y. Alhaddad, J. J. Cabibihan, and A. Bonarini, "Influence of reaction time in the emotional response of a companion robot to a child's aggressive interaction," Int. J. Social Rob., vol. 12, no. 6, pp. 1279-1291, 2020.
[29] A. Hong, N. Lunscher, T. Hu, Y. Tsbooi, X. Zhang, and S. F. Alves, "A multimodal emotional human-robot interaction architecture for social robots engaged in bidirectional communication," IEEE Trans. Cybern., vol. 51, no. 12, pp. 5954-5968, 2020.
[30] P. Preethi and H. R. Mamatha, "Region-based convolutional neural network for segmenting text in epigraphical images," Artif. Intell. Appl., vol. 1, no. 2, pp. 119-127, Sep. 2023.
Zhiqi Fan
../../Resources/ieie/IEIESPC.2025.14.1.68/author1.png

Zhiqi Fan obtained her master’s degree in artistic theory from Zhengzhou University in 2018. Presently, she is working as a teacher in the School of Applied Foreign Languages, Henan Industry and Trade Vocational College. She has published a number of articles in domestic journals. Her areas of interest include Artistic Theory and English teaching.