Moon Seunghun¹
Yang Changhee¹
Kang Beoungwoo¹
Kang Suk-Ju¹
(¹ Department of Electronic Engineering, Sogang University / Seoul, Korea
{anstmdgns97, beoungwoo, qazw5741}@naver.com
)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Object detection, Deep learning, Computer vision, Fall down, Real-time detector, Dataset
1. Introduction
Falls are currently one of the leading causes of death, and according to the Centers
for Disease Control and Prevention (CDC), more than 800,000 people are hospitalized
each year with fall injuries [1]. Falls can be caused by abnormal health conditions, violence, or accidents. Therefore,
it is important that these falls are detected as quickly as possible and that appropriate
actions are taken. The consequences may be irreversible if the response is late.
Recently, many studies have been conducted to detect falls in various fields. Park
et al. [2] introduced a method that uses Mask R-CNN [3] to detect a fallen person in CCTV footage. Chen et al. [4] presented a method for detecting falls by extracting optical features with a CCD
camera and a thermal imaging camera in a dark outdoor environment.
If a person falls on a deserted road or during dark hours such as night or dawn, immediate
help may be unavailable, and prompt action may not be taken. In these situations, taking
advantage of CCTV makes it possible to detect a fallen person economically and effectively.
For example, Xu et al. [5] detected falls using tracking methods with prior information such as motion information
obtained from object detection in a CCTV environment. Salimi et al. [6] focused on a method of detecting falls through 2D-level human pose estimation by
sorting out key points such as a person’s head or joints in a CCTV environment. These
studies improved fallen person detection in CCTV environments, but they depend on
prior information and require considerable hardware resources to keep that information
in memory, whereas CCTV systems themselves offer only limited hardware resources.
These limitations make real-time operation difficult.
Another line of work is creating specialized datasets for fall detection. An et al. [7] proposed the VFP290K benchmark dataset, which introduces a dedicated fallen person class, and their approach is capable of real-time
operation. However, as shown in Fig. 1, this method tends to falsely detect dark objects such as dark shoes
and black scooters as fallen people.
Moreover, in video-level operation, object detection can be interrupted in the middle
of consecutive frames due to occlusion events. Also, situations that do not have high
risk, such as stumbling and slipping, can be wrongly detected as a falling event.
This results in both precision and recall degradation for the human class in the object
detector, and unnecessary human resources may be wasted due to false alarms when the
system is applied to an actual CCTV environment.
To address these limitations, we propose a dataset that merges the fallen person
class and non-fallen class of the VFP290K dataset [7] into a single person class while excluding the data in night conditions. We included additional classes of common
objects from the AI-Hub dataset to avoid falsely detecting dark objects as a fallen
person. In addition, we introduce a bounding box ratio algorithm, an efficient
yet robust method for detecting a fall event; its overhead is negligible compared
to the computational cost of object detection itself.
Furthermore, we exploit the observation that a fallen person barely moves. Based on
this, we propose a novel algorithm specialized in tracking the person ID
of a fallen person: the bounding box overlap algorithm. It achieves robust
performance in video-level operation by adopting a time-merge method, which aggregates
frames over time at the video level to cope with intermittent object detections.
Combining all of these methods, we improved the fall detection performance.
Fig. 1. Comparison of VFP290K [7] and Our Method.
2. Related Works
2.1 You Only Look Once (YOLO)
Object detection is used to find the location of given classes in an image. Vision-based
object detection is a huge research topic in the field of computer vision and is used
in a wide range of applications such as recognizing faces or obstacles. YOLO methods
combine vision-based object detection with deep learning and act as a one-stage detector
that carries out classification and localization simultaneously. Since YOLOv1 [8] was first introduced in 2015, YOLO methods have been continuously updated. It has
a lightweight structure and high detecting accuracy, so YOLO methods have been recognized
as state-of-the-art methods in the field of object detection.
YOLOv4 [9], Scaled-YOLOv4 [10], and YOLOR [11] were introduced in 2020 and 2021. The latest model, YOLOv7 [12], was introduced in 2022 and outperformed well-known object detectors such as R-CNN
[13], YOLOv5, YOLOX [14], PPYOLO [15], DETR [16], etc. It also reduced the number of parameters and computational cost due to the
optimization of the model structure and the training process. YOLOv4 [9] increased detection accuracy but did not reduce the inference cost, while YOLOv7
[12] achieved both by presenting a planned re-parameterization method that trains multiple
convolutional layers in parallel and merges them into a single convolutional layer for
inference.
YOLOv7 [12] has a model structure like that of YOLOv5, which consists of an input, backbone,
and head. The backbone layer extracts features of an input image, and the head layer
predicts and outputs the prediction in a bounding box format. The image is pre-processed
at the input stage, passes through the backbone layer, and is converted into a feature
map. Finally, the prediction result is exported by Rep convolution and Imp convolution.
We conducted experiments with VFP290K [7] as a baseline and performed a comparative analysis with YOLOv5, the backbone of
VFP290K [7], and YOLOv7 [12], the state-of-the-art real-time object detector.
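For context, per-frame person detections from a pretrained YOLOv5 model can be obtained through the public ultralytics/yolov5 torch.hub interface. The following is a minimal inference sketch, not the training setup used in our experiments; the image path and thresholds are illustrative.

```python
import torch

# Minimal inference sketch with a pretrained YOLOv5 model (ultralytics/yolov5 hub interface).
# It only illustrates how per-frame person boxes are obtained; the image path and the
# confidence threshold are illustrative values, not the settings used in the experiments.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.classes = [0]            # COCO class 0 = person; keep only person detections
model.conf = 0.25              # confidence threshold (illustrative)

results = model('frame.jpg')   # an image path, URL, or numpy array
boxes = results.xyxy[0]        # tensor of [x1, y1, x2, y2, confidence, class]
print(boxes)
```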
2.2 Fall Detection Specialized Datasets
Various studies have been conducted to create a dataset specialized for fall detection.
Charfi et al. [17] proposed the Le2i dataset, which captured falls in four types of indoor environments
(i.e., home, cafe, office, and classroom) with a single Kinect camera. Auvinet et
al. [18] presented the MultiCam dataset, which captured falls in a living room environment
with 8 general cameras. Mastorakis et al. [19] introduced a dataset containing 48 types of falls photographed in an indoor environment
at a height of 2 m. Zhang et al. [20] proposed a dataset with occlusion cases. These approaches concentrated on robust
fall detection in natural situations by considering various factors such as camera
type, filming height, backgrounds, environments, and occlusion.
An et al. [7] proposed the VFP290K dataset [7], which is composed of 294,713 frames in diverse circumstances such as light conditions
(i.e., day and night), background conditions (i.e., street, park, and building), camera
heights (i.e., high and low), and occlusion cases. The dataset’s validity and usefulness
were verified through a performance evaluation experiment using well-known object
detectors like R-CNN [13] and YOLOv5.
Despite these attempts, the fall detection performance is not robust enough for real-world
applications across wild environments with diverse domains. The training process using
two classes (i.e., fallen and non-fallen) also generated false-positive cases in which
dark objects were predicted as a fallen person, which degraded precision and the
overall fall detection performance of the system.
3. The Proposed Method
Fig. 2 shows an overview of our method. The object detection part receives input frames
from the encoded video at the video level. After that, person detection takes place with the
bounding box as an output. This is followed by fall detection via the bounding box
ratio. The bounding box overlap algorithm calculates the IoU (i.e., Intersection over
Union) with the bounding box of the immediately preceding frame.
If the IoU exceeds 0.9, the bounding boxes overlap by more than 90\%, and the
system regards them as the same fallen person. After that, the information of the corresponding
bounding box is stored in the person ID memory. Duplicate person IDs are removed
in the time-merge part. Finally, the fall-down memory that stores the information
of the falls is exported as output. In addition, we mixed the VFP290K dataset [7] with the AI-Hub dataset [21] provided by the National Information Society Agency (NIA) to achieve robust fall detection
in natural conditions and to enable training on falls from diverse viewpoints
and environments.
Fig. 2. Overall Process of the Proposed Method.
3.1 Fallen Person Detection
3.1.1 Bounding Box Ratio based Fall Detection
Fig. 3 shows the details of the bounding box ratio method. The basic premise of this method
is that a standing person seen in an image would be vertically long, while a fallen
person would be horizontally long. The ratio of the bounding boxes $\text{ratio}_{\mathrm{wh}}$
can be expressed as:

$$\text{ratio}_{\mathrm{wh}} = \frac{W}{H},$$

where $W$ represents the width of the bounding box, and $H$ represents the height
of the bounding box. The $\text{ratio}_{\mathrm{wh}}$ denotes the ratio of the width
and the height of the bounding box. We used $\text{ratio}_{\mathrm{wh}}$ of a predicted
person through object detection to determine a fall. We empirically obtained a threshold
value of 1.2 by analyzing $\text{ratio}_{\mathrm{wh}}$ of the ground truth of a fallen
person in our dataset. This simple rule-based method determines a fall if $\text{ratio}_{\mathrm{wh}}$
is greater than 1.2 and non-fall if it is not. As shown in Fig. 3, it is intuitive, fast, and powerful. Using this method, training and fall detection
can be done with only a single class, and robust operation of multiple fall detections
can also be done in a CCTV environment.
Fig. 3. Bounding Box Ratio Algorithm.
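A minimal sketch of the bounding box ratio rule follows, assuming an (x1, y1, x2, y2) corner format for the boxes; the 1.2 threshold is the empirically chosen value described above.

```python
def is_fallen(box, ratio_threshold=1.2):
    """Rule-based fall check: a box that is wider than it is tall suggests a fallen person."""
    x1, y1, x2, y2 = box          # assumed (x1, y1, x2, y2) corner format
    width, height = x2 - x1, y2 - y1
    if height <= 0:               # guard against degenerate boxes
        return False
    ratio_wh = width / height     # ratio_wh = W / H from the equation above
    return ratio_wh > ratio_threshold

# Example: a box that is 360 px wide and 120 px tall gives ratio_wh = 3.0 -> fall.
print(is_fallen((100, 300, 460, 420)))   # True
```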
3.1.2 Bounding Box Overlap for Video-level Fall Tracking
Fall events have an apparent restriction on the object’s movement compared to other
events. Two assumptions can be made from this restriction. First, a person who falls
will hardly move from one location, which means the bounding box will not change notably
in size or location. Second, since the person has fallen, the width of
the bounding box will be greater than the height, and $\text{ratio}_{\mathrm{wh}}$
will be greater than 1.2.
Using these assumptions, we propose the bounding box overlap algorithm, a tracking
method specialized for fall detection. The flow of the algorithm is shown in Fig. 2. If a fallen person is predicted, the person ID is assigned to the person ID memory.
The frame number and location information of the bounding box are saved in the memory.
Then, the person ID of this frame is matched with the person ID of the previous frame
stored in the person ID memory. If these boxes overlap with an IoU of more than
90\%, they are regarded as the same person ID, and the detection is stored in the corresponding person
ID memory.
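The overlap test can be sketched as follows; the person ID memory layout and helper names are illustrative assumptions, and the 0.9 IoU threshold follows the description above.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def match_or_create_id(box, frame_idx, person_id_memory, iou_threshold=0.9):
    """Assign the detection to an existing fallen-person ID if it overlaps by more than 90%,
    otherwise open a new ID. person_id_memory maps ID -> {'box': ..., 'frames': [...]}."""
    for pid, track in person_id_memory.items():
        if iou(box, track['box']) > iou_threshold:
            track['box'] = box                 # same fallen person: update its box
            track['frames'].append(frame_idx)  # remember the frame it was seen in
            return pid
    new_pid = max(person_id_memory, default=0) + 1
    person_id_memory[new_pid] = {'box': box, 'frames': [frame_idx]}
    return new_pid
```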
A fallen person may not be detected in every frame at the video level since an
unexpected situation such as occlusion can take place. To handle this, a time-merge
process is performed that calculates the IoUs between the person IDs stored for up to
10 seconds and the new person ID. If no bounding box that is considered to be the
same person ID is detected after 10 seconds, the person ID stored in the person ID
memory is moved to the merge fall-down memory, and the person ID memory is reinitialized.
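A sketch of this time-merge step under the same assumed memory layout; the frame rate used to convert the 10-second window into frames is an assumption.

```python
FPS = 30  # assumed frame rate; 10 s of inactivity corresponds to 10 * FPS frames


def time_merge(current_frame_idx, person_id_memory, merge_fall_down_memory, timeout_s=10):
    """Move person IDs that have not been matched for `timeout_s` seconds from the
    active person ID memory into the merge fall-down memory."""
    stale = [pid for pid, track in person_id_memory.items()
             if current_frame_idx - track['frames'][-1] > timeout_s * FPS]
    for pid in stale:
        merge_fall_down_memory.append({'id': pid, **person_id_memory.pop(pid)})
```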
In the merge fall-down memory, the person IDs generated up to the n$^{\mathrm{th}}$
frame are stored, and the person IDs with IoU higher than 0.9 are integrated before
storage. This method was optimized for application to environments where hardware
resources are scarce, like a CCTV environment. Once a new fall event has been detected
and the merge fall-down memory has been created, the system checks the person ID one
more time to prevent cases where the same person is incorrectly assigned a different
person ID after fall-down tracking. If another person ID is created for a fall detection
every 30 seconds, the bounding box overlap algorithm is applied again because the new ID
is likely to be the same person ID as that stored in the merge fall-down memory. If this
check verifies that the new ID belongs to the same person, the several person IDs
assigned to a single person can finally be merged into one.
In addition, unintentional occlusions may occur, where other objects pass between
the fallen person and the camera. This may make the system give the passing object
the same person ID as the fallen person. In this case, the bounding box ratio method
is applied again, and the person ID is given only to those satisfying the threshold.
This allows the fallen person to be detected accurately and tracked efficiently
through the time-merge process.
3.2 Mixed-up Dataset Configuration for Fall Detection
It is important that the model detects both people who have fallen and those who have
not as the same single person class so that it is possible to track them at the video
level. In addition, to solve the problem of mistaking a black object for a fallen
person, as shown in Fig. 4, we created a new dataset by mixing in the AI-Hub dataset, as shown in the fall-down
dataset in Fig. 2.
In the VFP290K dataset [7], the backgrounds are limited, and the camera positions are limited to ``high'' and ``low.'' As shown
in Fig. 4, training using the VFP290K dataset [7] leads to falsely detecting dark objects as fallen people. To address this limitation,
we added common objects that are easily found in real scenes to our dataset for the training process.
We also added people wearing dark clothes to our training data.
Fig. 4. False positives of fall event detection via VFP290K.
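The single-class configuration can be illustrated with a small label-remapping sketch; YOLO-format label files (one `class x_center y_center width height` line per object) and the specific class indices are assumptions for illustration, not the released preprocessing code.

```python
from pathlib import Path

# Illustrative remapping: VFP290K's fallen (0) and non-fallen (1) classes both become the
# single person class (0); AI-Hub object classes would keep their own indices.
CLASS_MAP = {0: 0, 1: 0}


def remap_labels(src_dir, dst_dir):
    """Rewrite YOLO-format label files so that fallen and non-fallen share one person class."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for label_file in Path(src_dir).glob('*.txt'):
        lines = []
        for line in label_file.read_text().splitlines():
            if not line.strip():
                continue
            cls, *coords = line.split()
            lines.append(' '.join([str(CLASS_MAP.get(int(cls), int(cls)))] + coords))
        (dst / label_file.name).write_text('\n'.join(lines) + '\n')
```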
4. Experiments
4.1.1 Experimental Setting
The proposed method was verified by evaluating object detection performance after
training the backbone models YOLOv5 and YOLOv7 [12] with the VFP290K dataset [7] and with our dataset. Fall detection performance was then evaluated in
wild conditions with diverse domains using our test dataset. Our experiments were
conducted with the same model as that of An et al. [7]: since the VFP290K dataset [7] was evaluated using YOLOv5, our method also used YOLOv5 as the backbone. The
current state-of-the-art real-time object detector, YOLOv7 [12], was also used as a backbone model to evaluate the proposed method. This experiment
verified that the fall detection application remains versatile when new YOLO methods are
proposed in the future.
4.1.2 Test Dataset
Since the backgrounds, environments, and camera heights of the VFP290K dataset [7] show limited variety, it was necessary to verify that a model trained on our dataset robustly detects falls even
in natural situations. Therefore, we constructed a test dataset by dividing 1,531
seconds of video taken with a new camera height and background into 6,124 frames.
This video contains 21 fall events. All parts concerning personal information were
cropped, mosaicked, and anonymized.
4.1.3 Evaluation Metrics
Precision is the proportion of the objects that the model detects as falls that are
truly falls:

$$\mathrm{Precision} = \frac{TP}{TP + FP},$$

where true positives ($TP$) denote the number of correctly predicted falls, and false
positives ($FP$) are cases where a fall was predicted but did not actually occur.
Recall is the proportion of actual falls that are correctly predicted:

$$\mathrm{Recall} = \frac{TP}{TP + FN},$$

where false negatives ($FN$) are actual falls that the model failed to detect.
AP is calculated as the mean precision of each class at certain thresholds. $\mathrm{mAP}_{50}$
is the average of AP over all detected classes with an IoU threshold of 0.5. In the
case of VFP290K [7], it is the average value for the two classes, fallen and non-fallen. Since we trained
the system with a single person class, the ratio of IoU over 50\% between all the
bounding boxes predicted to be a person and the ground truth label for the answer
was calculated and averaged.
$\mathrm{mAP}_{95}$ is the average of the mAP values for IoU thresholds ranging
from 0.5 to 0.95 with a step size of 0.05. The F1 score is the harmonic mean of precision
and recall, and in our study, it represents the comprehensive fall detection performance
considering the tradeoff between precision and recall.
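For reference, precision, recall, and the F1 score can be computed directly from the detection counts; the following is a minimal sketch, and the counts in the example are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Example with made-up counts: 18 correctly detected falls, 2 false alarms, 3 missed falls.
print(precision_recall_f1(tp=18, fp=2, fn=3))   # (0.90, ~0.857, ~0.878)
```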
4.2 Overall Performance Compared to VFP290K (Object Detector)
In this experiment, we evaluated the object detecting performance compared to the
baseline VFP290K [7] with the same backbone models, YOLOv5 and YOLOv7 [12]. The baseline was trained with two classes, fallen and non-fallen, but we conducted
the model training process with a single person class. The result is shown in Table 1. When trained with YOLOv5 as the backbone model, our dataset improved precision,
recall, $\mathrm{mAP}_{50}$, and $\mathrm{mAP}_{95}$ by 0.126, 0.084, 0.156, and 0.11
over VFP290K, reaching 0.906, 0.724, 0.841, and 0.49, respectively.
This verified that our dataset is specialized for the fall detection task.
As shown in Fig. 4, the model trained on the baseline dataset repeatedly produced incorrect results and
detected dark objects as fallen people. This was because the baseline dataset was
constructed without considering objects other than humans. This led to false-positive
cases, which decreased the precision value, so it could consequently be inefficient when
applied to a real CCTV environment.
Table 1 also shows the experimental results with YOLOv7 [12]. The precision, recall, $\mathrm{mAP}_{50}$, and $\mathrm{mAP}_{95}$ were 0.954,
0.844, 0.904, and 0.531, respectively, showing the best performance. In Fig. 5, it can be seen that the false-positive cases that occurred with the baseline method
disappeared when YOLOv5 was trained with our dataset. This verified that even if YOLO
methods or other object detectors with better performance are proposed in the future,
our method can be applied universally with robust fall detection performance.
Fig. 5. Object Detection Performance Comparison Between Our Method and the Baseline.
Table 1. Overall Performance Evaluation: Object Detector Trained with Different Datasets.
| Evaluation | YOLOv5 Precision | YOLOv5 Recall | YOLOv5 $\mathrm{mAP}_{50}$ | YOLOv5 $\mathrm{mAP}_{95}$ | YOLOv7 Precision | YOLOv7 Recall | YOLOv7 $\mathrm{mAP}_{50}$ | YOLOv7 $\mathrm{mAP}_{95}$ |
|---|---|---|---|---|---|---|---|---|
| VFP290K | 0.78 | 0.64 | 0.685 | 0.38 | 0.79 | 0.578 | 0.651 | 0.351 |
| Our dataset | 0.906 | 0.724 | 0.841 | 0.49 | 0.954 | 0.844 | 0.904 | 0.531 |
4.3 Overall Performance Compared to VFP290K (Fall Detector)
We created a new test video to evaluate the fall detection performance in wild conditions
with diverse domains that were not included in the training dataset. The video
contained a total of 21 fall events, and an experiment was conducted to compare the
fall detection test results with the model trained with YOLOv5 on the baseline dataset.
The results are shown in Table 2.
Our method enhanced precision by 0.349 compared to the baseline. Our system achieved
an F1 score of 0.624, which is 0.104 higher than that of the baseline. While the VFP290K
dataset [7] showed relatively poor performance at the video level, our method’s fall detection
performance was verified through the experiment.
Table 2. Comparison of VFP290K [7] and Our Proposed Method on Our Fall Dataset.
| Method | Precision | Recall | F1-score |
|---|---|---|---|
| VFP290K | 0.56 | 0.48 | 0.52 |
| Our method | 0.909 | 0.48 | 0.624 |
5. Conclusion
We proposed new methods to detect fallen people and constructed a new dataset specialized
for a fall detection task. Our method showed better performance in object detection
and fall detection at the video level than the baseline. The algorithm could be extended
to learn the characteristics or clothing of a fallen person and store them together
with the person ID. It could also be applied in a variety of ways, such as
transmitting alerts to a hospital or police station, leading to quick action for people
who have fallen.
ACKNOWLEDGMENTS
This work used datasets from The Open AI Dataset Project (AI-Hub, S. Korea). All
data can be accessed through AI-Hub (www.aihub.or.kr).
This work was also supported by the Korea Agency for Infrastructure Technology
Advancement (KAIA) grant funded by the Ministry of the Interior and Safety (Grant
22PQWO-C153359-04).
REFERENCES
"Facts About Falls," Centers for Disease Control, Aug. 2021.
Park et al., "Emergency Situation Recognition System Using CCTV and Deep Learning,"
Korea Information Processing Society, Nov. 2020.
He, Kaiming, Georgia Gkioxari, Piotr Dollar, and Ross Girshick, ``Mask R-CNN,'' 2017
IEEE International Conference on Computer Vision (ICCV), Oct. 2017.
Chen, Ying-Nong, Chi-Hung Chuang, Chih-Chang Yu, and Kuo-Chin Fan, ``Fall Detection
in Dusky Environment,'' SpringerLink, Nov. 2013.
Xu, Teng, et al., "Fall Detection Based on Person Detection and Multi-target Tracking,"
2021 11th International Conference on Information Technology in Medicine and Education
(ITME). IEEE, Nov. 2021.
Salimi, Mohammadamin, José JM Machado, and João Manuel RS Tavares, ``Using Deep Neural
Networks for Human Fall Detection Based on Pose Estimation,'' Sensors, 22(12), Jun.
2022.
An, Jaeju, et al., "VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen
Person Detection," Thirty-fifth Conference on Neural Information Processing Systems
Datasets and Benchmarks Track (Round 2), Aug. 2021.
Redmon, Joseph, et al., "You only look once: Unified, real-time object detection,"
Proceedings of the IEEE conference on computer vision and pattern recognition, Jun.
2016.
M. Ning, Y. Lu, W. Hou and M. Matskin, "YOLOv4-object: an Efficient Model and Method
for Object Discovery," 2021 IEEE 45th Annual Computers, Software, and Applications
Conference (COMPSAC), Jul. 2021, pp. 31-36.
WANG, Chien-Yao; BOCHKOVSKIY, Alexey; LIAO, Hong-Yuan Mark, ``Scaled-yolov4: Scaling
cross stage partial network,'' Proceedings of the IEEE/cvf conference on computer
vision and pattern recognition, Jul. 2021. p. 13029-13038.
WANG, Chien-Yao; YEH, I.-Hau; LIAO, Hong-Yuan Mark, ``You only learn one representation:
Unified network for multiple tasks,'' arXiv preprint, arXiv:2105.04206, May. 2021,
WANG, Chien-Yao; BOCHKOVSKIY, Alexey; LIAO, Hong-Yuan Mark, ``YOLOv7: Trainable bag-of-freebies
sets new state-of-the-art for real-time object detectors,'' arXiv preprint, arXiv:2207.02696,
Jul. 2022.
GIRSHICK, Ross, et al, ``Rich feature hierarchies for accurate object detection and
semantic segmentation,'' Proceedings of the IEEE conference on computer vision and
pattern recognition, Jun. 2014. p. 580-587.
GE, Zheng, et al, ``Yolox: Exceeding yolo series in 2021,'' arXiv preprint, arXiv:2107.08430,
Aug. 2021.
LONG, Xiang, et al., ``PP-YOLO: An effective and efficient implementation of object
detector,'' arXiv preprint, arXiv:2007.12099, Aug. 2020.
ZHU, Xizhou, et al., ``Deformable detr: Deformable transformers for end-to-end object
detection,'' arXiv preprint, arXiv:2010.04159, Oct. 2020.
CHARFI, Imen, et al., ``Definition and performance evaluation of a robust SVM based
fall detection solution,'' 2012 eighth international conference on signal image technology
and internet based systems. IEEE, Nov. 2012, p. 218-224.
AUVINET, Edouard, et al., ``Multiple cameras fall dataset,'' DIRO-Université de Montréal,
Tech. Rep}, Jul. 2010, 1350: 24.
MASTORAKIS, Georgios; MAKRIS, Dimitrios, ``Fall detection system using Kinect’s infrared
sensor,'' Journal of Real-Time Image Processing, Dec. 2014, 9.4: 635-646.
ZHANG, Zhong; CONLY, Christopher; ATHITSOS, Vassilis, ``Evaluating depth-based computer
vision methods for fall detection under occlusions,'' International symposium on visual
computing. Springer, Cham, 2014, p. 196-207.
Author
Seunghun Moon received the B.S. degree in electronics engineering from Sogang University,
Seoul, South Korea, in 2023, where he is currently pursuing the M.S. degree in electronics
engineering. His current research interests include deep learning, anomaly detection,
and computer vision.
Changhee Yang received the B.S. degree in electronics engineering from Dankook
University, Jukjeon, South Korea, in 2022, and he is currently pursuing the M.S. degree
in electronics engineering. His current research interests include image processing,
3D pose estimation, and computer vision.
Beoungwoo Kang received the B.S. degree in electronics engineering from Sogang
University, Seoul, South Korea, in 2022, where he is currently pursuing the M.S. degree
in electronics engineering. His current research interests include deep learning,
semantic segmentation, and computer vision.
Suk-Ju Kang (Member, IEEE) received a B.S. degree in electronic engineering from
Sogang University, South Korea, in 2006, and a Ph.D. degree in electrical and computer
engineering from the Pohang University of Science and Technology, in 2011. From 2011
to 2012, he was a Senior Researcher with LG Display, where he was a project leader
for resolution enhancement and multi-view 3D system projects. From 2012 to 2015, he
was an Assistant Professor of Electrical Engineering at Dong-A University, Busan.
He is currently a Professor of Electronic Engineering at Sogang University. He was
a recipient of the IEIE/IEEE Joint Award for Young IT Engineer of the Year, in 2019.
His current research interests include image analysis and enhancement, video processing,
multimedia signal processing, circuit design for display systems, and deep learning
systems.