Su Xing*, Wang Wei
(Digital Technology School, Sias University, Zhengzhou, Henan 451150, China)

Copyright © The Institute of Electronics and Information Engineers (IEIE)

Keywords
Classroom behavior, Deep learning, YOLO v5s
            
                  1. Introduction
               Student behavior determines, to a certain extent, the amount of knowledge students
                  acquire in the classroom. Therefore, automatic analysis of classroom behavior is of
great importance in evaluating teaching effectiveness [1]. Traditionally, observing students' classroom behavior has relied on teachers' direct observation, which is very time-consuming and may distract from teaching. With the rapid development
                  of deep learning technology in fields such as image recognition and object detection,
                  some scholars have begun to study how to use deep learning to automatically identify
                  student behavior in the classroom. Among the related research on behavior identification,
Wu constructed a classroom behavior recognition model [2] that combines the particle swarm optimization k-nearest neighbors (PSO-kNN) algorithm with emotional image processing and proved highly accurate in identifying both emotion and behavior. Xie et al. proposed a deep learning algorithm based on spatio-temporal
                  representation learning to evaluate college students' classroom posture [3]. The results revealed that the proposed algorithm had a 5% higher accuracy than the
                  baseline 3D convolutional neural network (CNN), and it was an effective tool for identifying
                  abnormal behavior in college classrooms. Lin et al. [4] proposed a system that uses a deep neural network to classify actions and identify
                  student behavior. The experiment results showed that the proposed system had a 15.15%
                  higher average accuracy rate and a 12.15% higher average recall rate than the skeleton-based
                  approach. Pang [5] combined a conventional clustering analysis algorithm and the random forest algorithm
                  with a human skeleton model to recognize students' classroom behavior. Through experiments,
                  the effectiveness of recognizing behavior based on human skeleton models was verified.
                  Mao [6] proposed an intelligent image recognition system for students' classroom behavior.
Through a large number of experiments simulating many classroom behaviors, the system was shown to help students accurately and quickly identify incorrect classroom behavior and issue timely reminders. Ma and Yang [7] constructed a system for analyzing and assessing classroom behavior using deep learning
                  face recognition technology and found that the system could effectively evaluate students'
                  classroom behavior.
               
               This article explores the recognition and identification of students' classroom behavior
                  in universities. By collecting data for input into the constructed You Only Look Once
                  Version 5 Small (YOLO v5s) model for experimentation, the behavior of students in
the classroom was identified and analyzed. The results were compared with those of other object
                  detection models to prove the effectiveness and feasibility of the student classroom
                  behavior recognition model proposed in this article.
               
             
            
                  2. Algorithms for Behavior Recognition
               Currently, most universities have smart classrooms equipped with video surveillance
                  devices to record students' behavior, which can be recognized based on the content
                  of the surveillance video. Target detection and recognition algorithms based on deep
learning can be either single-step or two-step. Compared with two-step recognition algorithms, single-step algorithms are more stable to train and faster at recognition. YOLO [8] is one of the best algorithms in the field of target detection and is a single-step
                  recognition algorithm based on target regression. YOLO v5 is one of the newer versions,
                  has high accuracy, and is fast. It comes in v5s, v5m, v5l, and v5x versions based
                  on networks with different depths and widths. This paper evaluated these four models
                  based on actual application scenarios, and we ultimately chose YOLO v5s [9] as the model for our target detection. YOLO v5s consists of six modules: Focus, CBL,
                  CSP, SPP, upsampling, and Concat, as described below.
               
(1) Focus structure: The original image (640×640×3) is imported and sliced into a feature map (320×320×12). Then, after a convolution operation, it becomes a 320×320×32 feature map.
               
(2) CBL: This module consists of a convolution layer (Conv), batch normalization (BN), and the Leaky ReLU activation function. The feature map is convolved, normalized, and activated sequentially. A convolution with a kernel size of 3 and a stride of 2 downsamples the feature map, whereas a convolution with a kernel size of 1 and a stride of 1 is used for feature mapping.
               
(3) Cross Stage Partial (CSP): The CSP1_X structure is used in the main backbone network, and the CSP2_X structure is used in the neck.
               
(4) Spatial Pyramid Pooling (SPP): The feature map is subjected to k×k maximum pooling operations to enlarge the receptive field over the main data features.
               
               (5) Upsampling: This module uses the nearest-neighbor method to double the size of
                  the advanced feature map.
               
               (6) Concat: This module adds advanced features to low-level features to create a new
                  feature map.
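The Focus slicing described in (1) can be illustrated with a minimal NumPy sketch; the real module is a network layer whose subsequent learned convolution is omitted here:

```python
import numpy as np

def focus_slice(img):
    """Slice an (H, W, C) image into an (H/2, W/2, 4C) map by taking every
    second pixel in four phase-shifted branches, as in the Focus module."""
    return np.concatenate(
        [img[0::2, 0::2], img[1::2, 0::2], img[0::2, 1::2], img[1::2, 1::2]],
        axis=-1,
    )

feat = focus_slice(np.zeros((640, 640, 3), dtype=np.float32))
print(feat.shape)  # (320, 320, 12); a convolution then maps it to 320x320x32
```

The slicing loses no pixels: each branch keeps one of the four positions in every 2×2 block, so spatial resolution is traded for channel depth before the first convolution.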
               
             
            
                  3. Case Analysis
               
                     3.1 Data Acquisition and Processing
                  A dataset was constructed for the experiment in this study and was divided into training
                     and testing sets at a 7:3 ratio. The experimental data were classroom videos obtained
                     from the Digital Technology School, Sias University, Zhengzhou, China. After obtaining
                     consent from teachers and students, cameras installed in the classroom were used to
                     record videos of the classes. Student behavior was evaluated as shown in Table 1. Since there were very few cases of sleeping or standing up to answer questions in
                     the initial collection, we deliberately asked students to perform these actions in
                     subsequent collections to supplement the dataset.
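The 7:3 division into training and testing sets can be sketched as follows; the file names and the random seed are illustrative assumptions, not the authors' exact procedure:

```python
import random

def split_dataset(image_names, train_ratio=0.7, seed=42):
    """Shuffle reproducibly, then split image names 7:3 into train/test lists."""
    names = sorted(image_names)
    random.Random(seed).shuffle(names)
    cut = int(len(names) * train_ratio)
    return names[:cut], names[cut:]

# hypothetical file names for the ~4200-image dataset
train, test = split_dataset([f"img_{i:04d}.jpg" for i in range(4200)])
print(len(train), len(test))  # 2940 1260
```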
                  
                  After data acquisition and before the experiment began, the collected data were processed.
                     First, five classroom behaviors (raising the head to listen, standing up to answer
                     questions, sleeping, playing with a mobile phone, and turning to chat) were collected
                     from the video data. Each type of behavior was limited to approximately 20 minutes
                     of video. Video frames were then extracted at equal intervals. Finally, a dataset
of approximately 4200 images was obtained. Then, the images were labeled using LabelImg, a Python-based image annotation tool [10]. The tool was installed on a Windows system; a folder named JPEGImages was created to store the images, and a folder named Annotations was created to store the annotation files. All the images were then imported into LabelImg at once, and labeling each image generated a corresponding XML file. The images before labeling are shown in Figs. 1-1 and 1-3, and the images after labeling are shown in Figs. 1-2 and 1-4. After labeling, the XML files and the corresponding images were uniformly named and stored in the Annotations and JPEGImages folders, respectively.
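The equal-interval frame extraction described above can be sketched as follows. The numbers are illustrative (a 20-minute clip at 25 fps sampled down to roughly 840 images per behavior class, consistent with the ~4200-image dataset); the selected frames themselves would then be decoded with a tool such as OpenCV:

```python
def frame_indices(total_frames, n_samples):
    """Indices of n_samples frames spread at equal intervals over a video;
    the frames at these indices would then be read and saved as images."""
    step = total_frames / n_samples
    return [int(i * step) for i in range(n_samples)]

# a 20-minute clip at 25 fps has 30000 frames; sample ~840 images from it
idx = frame_indices(30000, 840)
print(len(idx), idx[:3])  # 840 [0, 35, 71]
```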
                  
                  
                        Fig. 1. Comparison of student classroom behavior video images before and after labeling.
 
                  
Table 1. Criteria for Determining Classroom Behavior.

| Behavior | Judgment criteria |
| --- | --- |
| Raising the head to listen to the lecture | Looking up at the teacher, blackboard, or PowerPoint presentation, and taking notes |
| Standing up to answer questions | Standing in front of the chair |
| Sleeping on the table | Bending over the table |
| Playing with a mobile phone | Looking down at a phone and holding it in the hand |
| Turning the head to chat | Turning the head and talking to another student |
                   
                
               
                     3.2 Experimental Steps and Parameter Settings
                  The steps of the experiment are as follows. First, the data were collected and processed,
                     the data images were labeled using the LabelImg tool before the experiment, and the
training set in the processed dataset was input to the model for training. Second, the model was trained iteratively and refined according to the results, after which the test set was input for evaluation. Third, different algorithms were used to identify
                     students' classroom behaviors with different numbers of people in the classroom, considering
                     there are large and small classes in universities. Fourth, considering that students'
                     classroom behaviors are different, and each behavior has a different posture, the
                     recognition results of the models for the five different classroom behaviors were
                     evaluated by using three different algorithms. Finally, at the end of the experiment,
                     the results of the three algorithms were evaluated under different intersection over
                     union (IoU) thresholds. The three algorithms were YOLO v5s, a single shot multibox
                     detector (SSD), and a region-based convolutional neural network (R-CNN). The purpose
                     of this paper is to demonstrate that the YOLO v5s object detection model is feasible
                     for recognizing student behavior in the classroom.
                  
To ensure valid and effective recognition results from the YOLO v5s model, the training parameters were kept consistent across the comparison experiments with the proposed method. The parameter settings are presented below, and the stochastic
                     gradient descent (SGD) algorithm was chosen for network optimization. The initial
                     learning rate was 0.001. The batch size was 8. The number of epochs was 100. The damping
                     index was set to 0.5. The length-width ratio of the anchor box was 1:2. The binary
cross-entropy loss function was used for classification, and the CIoU border loss function [11] was used for bounding-box regression:

$$L_{CIoU}=1-IoU+\frac{D^{2}}{D_{c}^{2}}+\alpha V$$

where $L_{CIoU}$ represents the border prediction loss, $IoU$ represents the overlap between the predicted and true boxes, $D^{2}$ represents the squared distance between their centers, $D_{c}^{2}$ represents the squared diagonal length of the smallest box enclosing both, $V$ is a parameter that measures consistency in the length-width ratio, and $\alpha$ is the trade-off weight of $V$.
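A minimal plain-Python sketch of this border loss, using the symbols above together with the standard CIoU definitions of $V$ and its weight $\alpha$ (an assumption, since the paper does not spell them out):

```python
import math

def ciou_loss(box_p, box_t):
    """CIoU border loss for (x1, y1, x2, y2) boxes:
    L = 1 - IoU + D^2 / D_c^2 + alpha * V."""
    px1, py1, px2, py2 = box_p
    tx1, ty1, tx2, ty2 = box_t
    # intersection and union for IoU
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    iou = inter / (area_p + area_t - inter)
    # squared center distance D^2 and squared diagonal D_c^2 of the enclosing box
    d2 = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4.0
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    dc2 = cw ** 2 + ch ** 2
    # V measures aspect-ratio consistency; alpha is its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan((tx2 - tx1) / (ty2 - ty1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + d2 / dc2 + alpha * v

print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0 for identical boxes
```

The loss is zero only when the boxes coincide; any center offset, enclosing-box growth, or aspect-ratio mismatch adds a penalty beyond plain IoU.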
                  
                
               
                     3.3 Model Evaluation Indicators
                  Precision (P), recall (R), average precision (AP), and mean average precision (mAP)
                     [12] were used as evaluation indicators to assess the recognition results of the model.
                     Considering the practical application requirements of the model, real-time detection
                     and recognition of students' classroom behaviors were required. Therefore, the detection
                     speed (in frames per second) was also evaluated. The precision and recall rate expressions
are:

$$P=\frac{TP}{TP+FP}$$

and

$$R=\frac{TP}{TP+FN}$$
                  where TP indicates a positive example recognized as positive, FP indicates a negative
                     example recognized as positive, and FN indicates a positive example recognized as
                     negative.
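These two indicators can be computed directly from the counts; the example numbers below (67 true positives, 4 false positives, 3 false negatives) are hypothetical counts that happen to reproduce the medium-class precision and recall of Table 2:

```python
def precision_recall(tp, fp, fn):
    """Precision P = TP / (TP + FP); recall R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# hypothetical counts: 67 correct detections, 4 false alarms, 3 missed behaviors
p, r = precision_recall(67, 4, 3)
print(round(p, 4), round(r, 4))  # 0.9437 0.9571
```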
                  
In addition to the above two indicators, AP and mAP were used, where AP measures
                     the detection performance of a specific class, while mAP measures the detection performance
                     of the model for all classes. The calculation of AP is explained as follows. When
                     recognizing students' classroom behaviors, the precision and recall rates of each
                     behavioral class can be calculated, and a precision/recall curve can be obtained for
                     each class of behavior. The area under the curve is the AP value, whereas mAP is the
mean of the AP values over all categories. The expressions for AP and mAP are:

$$AP=\int_{0}^{1}P(R)\,dR$$

and

$$mAP=\frac{1}{C}\sum_{i=1}^{C}AP_{i}$$
                  where P and R stand for precision and recall, respectively, and C is the number of
                     categories.
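These two computations can be sketched as follows; the trapezoidal rule is one common way to approximate the area under the precision/recall curve:

```python
def average_precision(precisions, recalls):
    """AP: area under the precision/recall curve (trapezoidal rule)."""
    pts = sorted(zip(recalls, precisions))
    return sum((r1 - r0) * (p0 + p1) / 2
               for (r0, p0), (r1, p1) in zip(pts, pts[1:]))

def mean_average_precision(aps):
    """mAP: mean of the per-class AP values over all C categories."""
    return sum(aps) / len(aps)

# the five per-class AP values of YOLO v5s from Table 3 average to its mAP
print(round(mean_average_precision([97.4, 93.5, 93.6, 95.9, 98.6]), 1))  # 95.8
```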
                  
                
               
                     3.4 Result Analysis
                  Considering the different class sizes in universities, and in order to better apply
                     the recognition model in the future, the experiment covered two situations: a moderate
                     class size and a large class size. For the moderate class size, one class was selected
                     for video recording, while for the large class size, two to three classes were selected.
Classroom behavior identification was performed with different recognition models and for different class sizes. From the results presented in Table 2, we can see that under the different classroom densities, the precision of the YOLO
                     v5s model for medium and large classes was 94.37% and 94.29%, respectively. Recall
                     was 95.71% and 94.29%, respectively, and mAP was 96.02 and 95.48, respectively, suggesting
                     that detection and recognition effects were similar under different class sizes. At
                     the same time, there was not much difference in detection speed for medium and large
class sizes (118.25 fps and 117.65 fps, respectively, a difference of only 0.6 fps).
                     Compared with the SSD and R-CNN models, the detection speed of YOLO~v5s was much higher,
                     indicating it was more suitable for real-time detection of students' classroom behavior.
                     Although the evaluation index data of the YOLO~v5s recognition model in the large
                     class situation was slightly lower than for the moderate class, the overall difference
                     was not significant, indicating the YOLO v5s recognition model can be applied to classroom
                     behavior recognition under different class sizes.
                  
                  Each student's behavior in the classroom varies, and the body posture varies. Therefore,
                     it is crucial for the model to accurately classify and recognize each behavior. This
                     article evaluated AP for recognition results from three different algorithms. As shown
in Table 3, the AP values of the YOLO v5s model for the five behaviors were 97.4 for raising the head and listening, 93.5 for sleeping, 93.6 for looking down and playing with a phone, 95.9 for turning the head to chat, and 98.6 for standing up to answer questions. These
                     values were all higher than the AP under the SSD and R-CNN models for these behaviors.
                     Moreover, by analyzing the overall AP of the three different models, it was found
                     that the AP for standing up to answer questions, raising the head and listening, and
                     turning the head to chat were higher than the AP for sleeping and looking down and
                     playing with a phone. This implies that some students have similar upper body postures
                     for sleeping on the table and looking down and playing with a phone, leading to detection
                     errors in these two categories.
                  
                  mAP is the average accuracy of the model under different IoU thresholds. A higher
                     mAP means a more accurate model. Therefore, IoU is a crucial function for calculating
                     mAP. This paper evaluated the recognition results of three different algorithms at
different thresholds [15]. In Table 4, mAP@0.5 and mAP@0.75 represent the mAP at IoU thresholds of 0.5 and 0.75, respectively; mAP@0.5:0.95 represents the mean of the mAP values at IoU thresholds from 0.5 to 0.95 (in increments of 0.05). A higher mAP indicates better detections by the model.
                     Based on the data in Table 4, we can see that YOLO v5s had a higher mAP than SSD and R-CNN models under the different
                     IoU thresholds, reaching 95.8, 94.3, and 92.9, respectively. This indicates the YOLO
                     v5s performance is excellent in the field of object detection, and it can be used
                     for real-time recognition of classroom behavior in college students, achieving the
                     expected experimental results.
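How the IoU threshold decides whether a detection counts as correct can be illustrated with a small sketch:

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# a detection counts as a true positive only if IoU >= the threshold,
# so mAP@0.75 is stricter than mAP@0.5
iou = box_iou((0, 0, 10, 10), (2, 0, 12, 10))
print(round(iou, 3))  # 0.667 -> TP at the 0.5 threshold, FP at 0.75
```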
                  
                  
Table 2. Recognition Results of Three Algorithms for Different Class Sizes.

| Category | Model | Precision | Recall | mAP | Detection speed (fps) |
| --- | --- | --- | --- | --- | --- |
| Medium class size | YOLO v5s | 94.37% | 95.71% | 96.02 | 118.34 |
| Medium class size | SSD [13] | 87.88% | 82.86% | 88.66 | 92.14 |
| Medium class size | R-CNN [14] | 80.65% | 71.43% | 80.27 | 91.93 |
| Large class size | YOLO v5s | 94.29% | 94.29% | 95.48 | 117.65 |
| Large class size | SSD | 84.36% | 77.14% | 84.91 | 89.37 |
| Large class size | R-CNN | 76.19% | 68.57% | 79.35 | 87.64 |
                   
                  
Table 3. Evaluation of Identification Results under Different Classroom Behaviors.

| Model | AP: Raise the head and listen | AP: Sleep on the table | AP: Look down and play with a phone | AP: Turn the head to chat | AP: Stand up to answer a question | mAP |
| --- | --- | --- | --- | --- | --- | --- |
| YOLO v5s | 97.4 | 93.5 | 93.6 | 95.9 | 98.6 | 95.8 |
| SSD | 93.3 | 81.1 | 82.9 | 91.2 | 93.5 | 88.4 |
| R-CNN | 82.2 | 74.8 | 75.1 | 80.5 | 82.9 | 79.1 |
                   
                  
Table 4. Evaluation of Recognition Results under Different IoU Thresholds.

| Model | mAP@0.5 | mAP@0.75 | mAP@0.5:0.95 |
| --- | --- | --- | --- |
| YOLO v5s | 95.8 | 94.3 | 92.9 |
| SSD | 88.4 | 85.9 | 80.2 |
| R-CNN | 79.1 | 78.5 | 76.3 |
                   
                
             
            
                  4. Discussion
               Real-time recognition of students' classroom behavior using deep learning techniques
                  can evaluate classroom situations and help improve the quality of teaching. In this
                  study, the YOLO v5s recognition model was used to detect and recognize students' behavior
in the classroom. Teachers can use this information to evaluate students in their
                  regular classes. The experiment results of this study showed that under different
                  classroom densities and IoU thresholds, the YOLO v5s model was superior to SSD and
                  R-CNN models in terms of precision, recall, AP, mAP, and detection speed. The results
                  revealed that YOLO v5s can be applied to real-time classroom behavior recognition
under different classroom densities. After the model identifies classroom behaviors, the different types of behavior still need to be managed; some bad behavior often stems from difficulty in effectively engaging students with the content and from failing to establish a genuine relationship with them [16]. Some studies have suggested that providing students with social rewards, such as
                  praise, encouragement, and care, to promote good classroom behavior is the most accepted
                  management approach [17]. This paper argues that the key to implementing student behavior management in the
                  classroom is teacher behavior [18], and that both classroom management and student interactions should be involved,
                  such as strengthening the establishment of attendance systems and frequently asking
                  students to answer questions. In the future, research directions will focus on classroom
                  management, such as building a classroom management system based on the YOLO v5s recognition
model. According to the teaching needs of colleges and universities, such a system would be divided into three portals: student, teacher, and administrator [19].
               
               (1) Students can view their attendance and video recognition results by entering their
                  student ID and password [20].
               
               (2) Teachers can identify and view behavior detection results and attendance records
                  through the classroom video. They can send the results and attendance records to students,
                  and have permission to modify the data. For example, if a student attends class for
                  only a short time, but the attendance record was miscalculated, the teacher can change
                  it.
               
               (3) The operation rights of the administrator include and exceed those of students
                  and teachers. The administrator can organize courses and process class videos, such
                  as saving them in different locations according to different courses and semesters,
                  and can delete videos from the previous semester.
               
             
            
                  5. Conclusion
               This article provides a brief introduction to student classroom behavior and the YOLO
                  object detection algorithm. Prior to the experiment, video data were converted to
                  images and labeled using the Python-based LabelImg tool. Then, the YOLO v5s model
                  was used to build an object detection and recognition model to identify and analyze
                  student classroom behavior. The performance of the model was evaluated based on precision,
                  recall, AP, mAP, and detection speed. The experiment results showed that under medium
                  and large classroom densities, respectively, the YOLO v5s model achieved precision
of 94.37% and 94.29%, recall rates of 95.71% and 94.29%, and mAP of 96.02 and 95.48,
                  with respective detection speeds of 118.25 fps and 117.65 fps. The recognition results
                  were consistent at both classroom densities. The mAP values from YOLO v5s at different
                  IoU thresholds were higher than those of the SSD and R-CNN models, reaching 95.8,
                  94.3, and 92.9. This paper demonstrates that YOLO v5s is an excellent model in the
                  field of object detection, and it can be effectively applied to real-time recognition
                  of college students' behavior in the classroom.
               
             
          
         
            
                  
                     REFERENCES
                  
                     
                        
B. Yang, Z. Yao, H. Lu, Y. Zhou, and J. Xu, ``In-classroom learning analytics based on student behavior, topic and teaching characteristic mining,'' Pattern Recognition Letters, Vol. 129, pp. 224-231, Jan. 2020.

 
                     
                        
                        S. Wu, ``Simulation of classroom student behavior recognition based on PSO-kNN algorithm
                           and emotional image processing,'' Journal of Intelligent and Fuzzy Systems, Vol. 40,
                           No. 4, pp. 1-11, Dec. 2020.

 
                     
                        
                        Y. Xie, S. Zhang, and Y. Liu, ``Abnormal Behavior Recognition in Classroom Pose Estimation
                           of College Students Based on Spatiotemporal Representation Learning,'' Traitement
                           du Signal: signal image parole, Vol. 38, No. 1, pp. 89-95, Feb. 2021.

 
                     
                        
                        F. Lin, H. Ngo, C. Dow, K. H. Lam, and H. L. Le, ``Student Behavior Recognition System
                           for the Classroom Environment Based on Skeleton Pose Estimation and Person Detection,''
                           Sensors (Basel, Switzerland), Vol. 21, No. 16, pp. 1-20, Aug. 2021.

 
                     
                        
                        C. Pang, ``Simulation of student classroom behavior recognition based on cluster analysis
                           and random forest algorithm,'' Journal of Intelligent and Fuzzy Systems, Vol. 40,
                           No. 2, pp. 2421-2431, Feb. 2021.

 
                     
                        
                        L. Mao, ``Remote classroom action recognition based on improved neural network and
                           face recognition,'' Journal of Intelligent and Fuzzy Systems, Vol. 2021, No. 1, pp.
                           1-11, March. 2021.

 
                     
                        
                        C. Ma, and P. Yang, ``Research on Classroom Teaching Behavior Analysis and Evaluation
                           System Based on Deep Learning Face Recognition Technology,'' Journal of Physics: Conference
                           Series, Vol. 1992, No. 3, pp. 1-7, Aug. 2021.

 
                     
                        
                        X. Feng, Y. Piao, and S. Sun, ``Vehicle tracking algorithm based on deep learning,''
                           Journal of Physics: Conference Series, Vol. 1920, No. 1, pp. 1-7, May. 2021.

 
                     
                        
                        Z. Ying, Z. Lin, Z. Wu, K. Liang, and X. Hu, ``A modified-YOLOv5s model for detection
                           of wire braided hose defects,'' Measurement, Vol. 190, pp. 110683.1-110683.11, Jan.
                           2022.

 
                     
                        
                        S. Tabassum, S. Ullah, N. H. Al-Nur, and S. Shatabda, ``Poribohon-BD: Bangladeshi
                           local vehicle image dataset with annotation for classification,'' Data in Brief, Vol.
                           33, No. 1, pp. 1-6, Dec. 2020.

 
                     
                        
S. Wu and X. Li, ``IoU-balanced loss functions for single-stage object detection,'' Pattern Recognition Letters, Vol. 156, pp. 96-103, Apr. 2022.

 
                     
                        
                        S. Li, Y. Li, Y. Li, M. Li, and X. Xu, ``YOLO-FIRI: Improved YOLOv5 for Infrared Image
                           Object Detection,'' IEEE Access, Vol. 9, pp. 141861-141875, Oct. 2021.

 
                     
                        
                        R. Ranjan, A. Bansal, J. Zheng, H. Xu, J. Gleason, B. Lu, A. Nanduri, J. C. Chen,
                           C. D. Castillo, and R. Chellappa, ``A Fast and Accurate System for Face Detection,
                           Identification, and Verification,'' IEEE Transactions on Biometrics Behavior & Identity
                           Science, Vol. 1, No. 2, pp. 82-96, April. 2019.

 
                     
                        
                        U. H. Gawande, K. O. Hajari, and Y. G. Golhar, ``Scale Invariant Mask R-CNN for Pedestrian
                           Detection,'' Electronic Letters on Computer Vision and Image Analysis, Vol. 19, No.
                           3, pp. 98-117, Nov. 2020.

 
                     
                        
                        D. Sun, Y. Yang, M. Li, J. Yang, B. Meng, R. Bai, L. Li, and J. Ren, ``A Scale Balanced
                           Loss for Bounding Box Regression,'' IEEE Access, Vol. 8, pp. 108438-108448, June.
                           2020.

 
                     
                        
W. C. Hunter, A. D. Jasper, K. Barnes, L. L. Davis, K. Davis, J. D. Singleton, S. Barton-Arwood, and T. M. Scott, ``Promoting positive teacher-student relationships through creating a plan for Classroom Management On-boarding,'' Multicultural Learning and Teaching, Vol. 18, No. 1, Feb. 2021.

 
                     
                        
                        J. D. McLennan, H. Sampasa-Kanyinga, K. Georgiades, and E. Duku, ``Variation in Teachers'
                           Reported Use of Classroom Management and Behavioral Health Strategies by Grade Level,''
                           School Mental Health, Vol. 12, No. 1, pp. 67-76, March. 2020.

 
                     
                        
                        A. Al-Bahrani, ``Classroom management and student interaction interventions: Fostering
                           diversity, inclusion, and belonging in the undergraduate economics classroom,'' The
                           Journal of Economic Education, Vol. 53, No. 3, pp. 259-272, May. 2022.

 
                     
                        
                        J. Zhang, ``Computer Assisted Instruction System Under Artificial Intelligence Technology,''
                           Pediatric Obesity, Vol. 16, No. 5, pp. 1-13, March. 2021.

 
                     
                        
N. P. Putra, S. Loppies, and R. Zubaedah, ``Prototype of College Student Attendance Using Radio Frequency Identification (RFID) at Musamus University,'' IOP Conference Series: Materials Science and Engineering, Vol. 1125, No. 1, pp. 1-8, May 2021.

 
                   
                
             
            Author
            
            
Ms. Xing Su is currently a lecturer at Sias University in Zhengzhou. She graduated from Fort Hays State University in the United States with a master's degree. Her research interests include economic management and educational management.
            
            
            
               			Wei Wang is a lecturer at Sias University in Zhengzhou, China. He graduated from
               Fort Hays State University in the United States with a master’s degree, and from Peking
               University HSBC Business School with an EMBA. He is working on his PhD at the University
               of Kuala Lumpur, Malaysia. His research interests include resource and environmental
               economics and industrial management. He has published one paper and one book.