R-CNN Auto-system for Detecting Text Road Signs in Baghdad
Omar M. S. Ali1
Ali A. D. Al-Zuky1
Fatin E. M. Al-Obaidi1
(Department of Physics, College of Science, Mustansiriyah University, Baghdad, Iraq
{omar_m_sultan, prof.alialzuky, sci.phy.fam}@uomustansiriyah.edu.iq
)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
R-CNN, Labeling, Epoch, Detection, Baghdad, Recognition
1. Introduction
Artificial intelligence (AI) and pattern recognition are being used in traffic sign
detection and recognition for many applications, like autonomous and assisted driving.
A driver who can understand text traffic signs is alerted to potentially unsafe situations
and inappropriate behavior [1]. Without text traffic signs to direct and inform motorists, pedestrians, and other
road users, the world's traffic system would not function as it does. Text traffic
signs reflect the closeness of various sites as well as the accessibility of services.
Text road signs are frequently posted on poles in the center or to the side of the
road. Textual traffic signs vary from country to country according to local norms
and legislation. Technology, traffic laws, and regulations all have an impact [2].
Construction of an autonomous driving system is a fascinating subject that is growing
in popularity. In some cases, the vehicle is equipped with sensors such as radar,
laser, GPS, and cameras to monitor the environment. Combining a camera with computer
vision technologies is the most typical method. When compared to other sensors, the
camera's low cost and high output make it an appealing alternative [3].
Text traffic signs were created as a result of efforts to modernize the traffic system
and improve driving safety. Government entities in charge of enforcing traffic regulations
and collecting data on traffic collisions and patterns are essential resources for
the scientific study of text traffic signs. International organizations and scientific
institutions perform studies and research on text traffic signs and make recommendations and proposals to render them more effective at increasing traffic safety [4]. Many nations across the world utilize instructional, cautionary, and directional text traffic signs, which are divided into distinct categories. The colors of text traffic signs serve as a classification system: for example, red represents danger or caution, yellow represents a warning, and green represents an instruction [5].
Every object-detecting system must go through two steps: a detection procedure followed by a recognition process. During detection, color distinguishes the traffic signs in each frame; after color detection, the signs are classified using a large database built by training on video, which makes training an essential component of any object-detecting system [6]. Text traffic signs provide users with a range of helpful information and aid in improving road safety, reducing traffic collisions, and improving traffic control. Correctly following textual traffic signs will improve road safety and reduce collisions [7].
The advancement of technology and cognitive science has enabled a more sophisticated detection system that notifies drivers of text traffic signs inside an automobile via a display screen utilizing region-based convolutional neural network (R-CNN) algorithms [8]. Several studies on textual traffic sign detection have been conducted. One approach employs a CNN-based traffic sign classification algorithm together with a camera detection feature for traffic signs. A motorist can glance at the screen while keeping focus on the road, saving the time needed to study each sign [9].
2. Related Work
Joint transform correlation (JTC) and image segmentation were used to automatically
recognize road signs from any nation, regardless of color or shape. These methods
made several contributions, including the development of distortion-invariant fringe-adjusted
JTC, the introduction of two new criteria, and the reclassification of rectangular
signs, as proposed by J. F. Khan et al. [10]. Techniques for locating and extracting text from traffic sign panels were also employed,
and an OCR algorithm was utilized to recognize a variety of characters present on
the traffic panel for effective text string extraction, as demonstrated by A. Mammeri
et al. [11].
Experimental results using the German Traffic Sign Detection Benchmark (GTSDB) and
Chinese Traffic Sign Detection Benchmark (CTSDB) datasets showed that the combination
of Single Shot Multibox Detector (SSD) with Receptive Field Module (RFM) and Path
Aggregation Network (PAN), abbreviated SSD-RP, achieved a higher mean average precision (mAP) than other SSD algorithms and exhibited superior detection precision
for identifying small traffic signs. SSD-RP surpassed well-known object recognition
algorithms such as Faster R-CNN, Retina-Net, and YOLOv3 in terms of balancing detection
speed and precision, as indicated by J. Wu et al. [12]. Furthermore, a lightweight, YOLOv4-based integration framework was suggested for
real-time traffic sign detection using deep learning techniques. The architecture facilitated information sharing and flow at different levels while reducing network computation overhead to address latency issues. The goal was to ensure a certain level
of generalization and resilience while enhancing the detection performance of traffic
signs in various objective environments, including scale and illumination fluctuations,
as proposed by Y. Gu et al. [13].
3. The Proposed Scheme
As one of the most significant technologies, an R-CNN is frequently used to carry
out image processing tasks. It consists of rectangular area proposals with CNN features.
An R-CNN is a kind of neural network that resembles how the visual cortex in the human
brain functions. Recognizing the most significant elements in an image is the main
objective of a CNN, of which convolutional layers make up the majority of layers.
There is a difference between the CNN and R-CNN algorithms: a CNN identifies, distinguishes, and can track a target within an image, whereas an R-CNN identifies a target within an image and follows it readily, owing to the comprehensive manner in which it deals with every pixel in the image. An R-CNN has five stages [14]:
1. First, the region of interest is defined by labels that outline a set of intersecting
squares.
2. The second stage, convolution, is used to apply filters and identify characteristics.
3. Third, a max pooling step is used to reduce the image's size while preserving its
key details.
4. The image is converted to a 1D array (vector) after flattening.
5. All necessary connections in the full connection stage are completed. The phases
are shown in Figs. 1 and 2.
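To make these five stages concrete, the following is a minimal MATLAB (Deep Learning Toolbox) sketch of a layer stack of this kind. It is an illustration only, not the authors' exact network; the input size, filter counts, and layer widths are assumptions.

layers = [
    imageInputLayer([32 32 3])               % stage 1: labeled region proposals resized to a fixed input
    convolution2dLayer(3, 16, 'Padding', 1)  % stage 2: convolution applies filters to detect features
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)        % stage 3: max pooling shrinks the map, preserving key details
    fullyConnectedLayer(64)                  % stages 4-5: features are flattened to a vector and fully connected
    reluLayer
    fullyConnectedLayer(2)                   % two classes: text road sign vs. background
    softmaxLayer
    classificationLayer];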
The applications of R-CNN categorization techniques are numerous. For example, botany
relies on precise standards to categorize and arrange plant specimens [17]. To find and diagnose cavities, an R-CNN is extensively utilized in dentistry [18]. Medical uses of classification include the identification and classification of
brain tumors [19].
In the area of remote sensing, one study compared the efficiency of the R-CNN approach with classical methods for automatically identifying and mapping trees in UAV imagery [20]. Additionally, classification techniques are frequently used in object detection: an R-CNN is a popular method that has been effective in identifying items in photographs, including faces, automobiles, and people [21]. The versatility of classification systems is demonstrated by these numerous applications.
Fig. 1. Stage of convolutional learning for filtering and feature detection [15].
Fig. 2. A fully connected region-based convolutional neural network (R-CNN) [16].
3.1 Tools and Methodology
In the current research, several short video clips, each no longer than nine seconds, were captured of textual traffic signs at different times of day (before noon and in the afternoon), as shown in Fig. 3. The signs had white text and were divided into two groups: one group had a fully blue background, and the second had a green rectangular shape, mounted at a height of 1.5-1.6 meters above the ground on highways in Baghdad. The videos were captured using an iPhone X equipped with a 10-megapixel (MP) camera and a mobile phone holder mounted inside a moving vehicle traveling at different speeds, as shown in Fig. 4. In addition, data analysis and post-processing were performed on a computer running MATLAB (R2020a).
Text road signs are mounted on a platform in the middle or to the side of the road so that drivers can read them from a distance of 100 meters or less. In Baghdad, residential streets have a speed limit of roughly 60 km/h, and motorways have a limit of 100 km/h, so an automobile travels 100 meters in a little more than three seconds. The motorist must decide what to do within this interval.
This study adopts a frame rate of 30 frames per second for the video recording system. For a driver to make a correct decision, the system must detect and recognize a sign in at least one of every 90 consecutive frames; that is, the proposed system must reliably and accurately identify the sign at least once within this window. It is therefore sufficient and reasonable for drivers to take the correct driving direction if a sign is confirmed and detected correctly every three seconds.
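A quick MATLAB check of this arithmetic, using only the speed limit and frame rate stated above:

vKmh = 100;                % motorway speed limit in Baghdad (km/h)
vMs  = vKmh / 3.6;         % about 27.8 m/s
tCover = 100 / vMs;        % time to cover the 100 m reading distance: about 3.6 s
nFrames = 3 * 30;          % a 3 s decision window at 30 frames/s gives 90 frames
fprintf('t = %.1f s, window = %d frames\n', tCover, nFrames);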
The R-CNN system must determine the appropriate course of action if drivers are distracted from their targets because they are occupied with something other than the road. Text traffic signs can be detected by using a trained database and an R-CNN object detector. As can be seen in Fig. 5, a manual image labeler was used to outline, in each frame of the collected video, the desired textual road signs. To begin, we used the training code in Fig. 6 to train the model by extracting as many features as possible from 463 images with 544 targets for blue signs and 482 images with 582 targets for blue-green signs in the datasets.
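A hedged sketch of this training step, in the spirit of algorithm I in Fig. 6, is given below. The variable names and option values are assumptions rather than the authors' exact code: blueSignTable stands for a ground-truth table exported from MATLAB's Image Labeler, with one image file name and one set of [x y width height] sign boxes per row, and layers is a CNN layer array such as the one sketched earlier in this section.

% Training options: 'MaxEpochs' would be set to 20 or 60, as in this work.
options = trainingOptions('sgdm', ...
    'MaxEpochs', 20, ...
    'MiniBatchSize', 32, ...
    'InitialLearnRate', 1e-4);

% Train the R-CNN object detector on the labeled blue-sign dataset and save it.
detector = trainRCNNObjectDetector(blueSignTable, layers, options);
save('blueSignDetector.mat', 'detector');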
After numerous experimental runs, the layers were purposefully chosen, and the outcomes of these experiments guided our selection of layer values. The term "epoch" refers to the process of optimizing and training neural networks and deep learning models. A single iteration of the model-building process, known as a "training pass," involves processing the entire training dataset through the model, calculating losses, and updating parameters. When training for a certain number of epochs, the complete training dataset is cycled through the model several times; in this manner, the model may "learn" from the data and improve over time. Increasing the number of epochs has the potential to improve model performance, but caution must be exercised to avoid overfitting the training set [22].
The current work used 20 and 60 epochs to observe sensitivity during training. The training time for the model with 1-20 epochs was 7 minutes and 34 seconds, resulting in 16,380 images; for 1-60 epochs, training took about 22 minutes and 42 seconds, resulting in 49,140 images. Hence, the recognition stage in Fig. 7 was initialized.
The recall (R), sensitivity (S), precision (P), and F1 score can be calculated through the following equations [23]:

$$R = \frac{TP}{TP + FN}, \qquad S = R \times 100\%, \qquad P = \frac{TP}{TP + FP}, \qquad F_1 = \frac{2PR}{P + R}$$

where TP (true positives) represents the cases in which the model correctly predicted the presence of the target, FN (false negatives) represents the cases where the target was present but the model failed to detect it, and FP (false positives) represents the cases where no target was present but the model predicted one. Recall is thus TP divided by the total number of target instances in the dataset, N = TP + FN.
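As a worked example, the blue-sign counts from Table 1 reproduce the reported metrics:

TP = 258; FN = 286; FP = 0;    % blue-sign counts at 20 epochs (Table 1)
R  = TP / (TP + FN);           % recall = 0.4743
S  = 100 * R;                  % sensitivity = 47.43%
P  = TP / (TP + FP);           % precision = 1
F1 = 2 * P * R / (P + R);      % F1 score, about 0.643 (Table 1 reports 0.6435)
fprintf('R = %.4f, S = %.2f%%, P = %.4f, F1 = %.4f\n', R, S, P, F1);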
Fig. 4. Camera setup inside a car.
Fig. 5. Manual Labeling stage.
Fig. 6. Training algorithm (I).
Fig. 7. Recognition algorithm (II).
4. Performance Evaluation
R-CNN technology plays an effective role in accurately identifying text road signs.
In order to evaluate the performance improvement achieved by the proposed scheme, we considered a small test at 20 and 60 epochs. The variation of the parameters is shown in Tables 1 and 2, while Fig. 8 illustrates the algorithm's achievement in detecting the specified textual road signs. For all employed signs, it is apparent that precision and epoch variation have a polynomial relationship. A comparison between 20 and 60 epochs can be seen in Fig. 9
a polynomial relationship. A comparison between 20, and 60 epochs can be seen in Fig. 9 for all used parameters. The R-CNN approach succeeded in detecting blue and blue-green
textual road signs with recall values equal to 0.4743 and 0.9519, and sensitivity
values of 47.43% and 95.19%, respectively for 20 epochs. For 60 epochs, the recall
values were equal to 0.4835 and 0.9519, with sensitivity values of 48.35% and 95.19%,
respectively. For all textual road signs, the precision values were unity for both 20 and 60 epochs. The F1 scores were 0.6435 and 0.9753 for 20 epochs, while for 60 epochs they were 0.6518 and 0.9753 for blue and blue-green signs, respectively.
From Tables 1 and 2, one can notice the excellent results in the detection of blue-green signs. This is due to several factors, such as the contrast between the colors of the signs (blue, green, and white), in addition to the quality of daytime imaging, which reduces the scattering that may occur on the surface of the sign. The lower contrast of the blue signs, as well as the time of imaging (afternoon, near sunset), led to weaker detection. In general, the model is considered successful in the detection process.
Fig. 8. R-CNN application in detecting text road signs: (a) Marking the detected target with red rectangular shape; (b) Extracting target with score; (c) Extracted detected sign.
Fig. 9. The variation of (a) Tp, Fn; (b) recall; (c) sensitivity; (d) F1 Score for the used text road signs with the aid of the R-CNN model.
Table 1. Result of applying the R-CNN model at 20 epochs.
Data          | Blue Sign | Blue-Green Sign
No. of images | 463       | 482
No. of signs  | 544       | 582
TP            | 258       | 554
FN            | 286       | 28
FP            | 0         | 0
Recall        | 0.4743    | 0.9519
Sensitivity   | 47.43%    | 95.19%
Precision     | 1         | 1
F1 Score      | 0.6435    | 0.9753
Table 2. Result of applying the R-CNN model at 60 epochs.
Data          | Blue Sign | Blue-Green Sign
No. of images | 463       | 482
No. of signs  | 544       | 582
TP            | 263       | 554
FN            | 281       | 28
FP            | 0         | 0
Recall        | 0.4835    | 0.9519
Sensitivity   | 48.35%    | 95.19%
Precision     | 1         | 1
F1 Score      | 0.6518    | 0.9753
5. Conclusion
The R-CNN technique is one of the technologies used in computer vision and object recognition. It was developed as a practical answer to the problem of accurately identifying and categorizing objects in visual data. The idea behind an R-CNN is to exploit parts of images that may include objects of interest, and it is one of the most successful approaches for detecting and localizing targets in images. Its high classification and identification accuracy can be attributed to its use of deep learning and prospective regions.
The results showed that the contrast of text road signs affects their detection. The R-CNN approach succeeded in detecting blue and blue-green textual road signs with recall values equal to 0.4743 and 0.9519, and sensitivity values of 47.43% and 95.19%, for 20 epochs, while for 60 epochs the recall values were equal to 0.4835 and 0.9519, with sensitivity values of 48.35% and 95.19%, respectively. For all textual road signs, the precision values were unity for both 20 and 60 epochs. The F1 scores were 0.6435 and 0.9753 for 20 epochs, while for 60 epochs they were 0.6518 and 0.9753 for blue and blue-green signs, respectively.
Thus, the issue of automatic text road sign detection and identification has been addressed. The scientific originality of the acquired results is that the presented detection method can accurately and precisely identify blue and blue-green sign indicators in various situations. Prospects for future research include examining and contrasting various object detection methods on other text road signs.
Recommendation
Even though this study covered only a small geographical region (a few streets in the capital, Baghdad), the technology used is regarded as cutting-edge in the field of artificial intelligence. Hence, more research into this area is necessary.
Compliance with Ethical Standards
The research was conducted as part of the authors' employment and received no outside funding. Therefore, there are no conflicts of interest.
REFERENCES
Albelwi, S., & Mahmood, A. (2017). A framework for designing the architectures of
deep convolutional neural networks. Entropy, 19(6), 242.
Brahim, J., Khalid El, M., & Noureddine, F. (2023). Developing an Efficient System
with Mask R-CNN for Agricultural Applications. AGRIS on-line Papers in Economics and
Informatics, 15(1), 61-72.
Gu, Y., & Si, B. (2022). A novel lightweight real-time traffic sign detection integration
framework based on YOLOv4. Entropy, 24(4), 487.
Hameed, K., Chai, D., & Rassau, A. (2022). Score-based mask edge improvement of Mask-RCNN
for segmentation of fruit and vegetables. Expert Systems with Applications, 190, 116205.
He, P., Zuo, L., Zhang, C., & Zhang, Z. (2019). A value recognition algorithm for
pointer meter based on improved Mask-RCNN. 9th International Conference on Information
Science and Technology (ICIST), (pp. 108-113). Hulunbuir, China.
Hussien, R. S., Elkhidir, A. A., & Elnourani, M. (2015). Optical character recognition
of Arabic handwritten characters using neural network. 2015 International Conference
on Computing, Control, Networking, Electronics and Embedded Systems Engineering (ICCNEEE),
(pp. 456-461).
Hyder, A. A., Norton, R., Pérez-Núñez, R., Mojarro-Iñiguez, F. R., Peden, M., Kobusingye,
O., et al. (2016). The Road Traffic Injuries Research Network: a decade of research
capacity strengthening in low- and middle-income countries. Health Res Policy Sys
, 14(14), 1-9.
Jain, S. (2020). Pushing the boundary of Semantic Image Segmentation. ETH Zurich:
KTH, School of Electrical Engineering and Computer Science (EECS).
Kattenborn, T., Leitloff, J., Schiefer, F., & Hinz, S. (2021). Review on Convolutional
Neural Networks (CNN) in vegetation remote sensing. ISPRS Journal of Photogrammetry
and Remote Sensing, 173, 24-49.
Kesav, N., & Jibukumar, M. G. (2022). Efficient and low complex architecture for detection
and classification of Brain Tumor using RCNN with Two Channel CNN. Journal of King
Saud University - Computer and Information Sciences, 34(8), 6229-6242.
Khan, J. F., Bhuiyan, S. A., & Adhami, R. R. (2011). Image segmentation and shape analysis for road-sign detection. IEEE Transactions on Intelligent Transportation
Systems, 12(1), 83-96.
Li, W. (2021). Analysis of object detection performance based on Faster R-CNN. 6th
International Conference on Electronic Technology and Information Science (ICETIS
2021), 1827. Harbin, China.
Mammeri, A., Khiari, E. H., & Boukerche, A. (2014). Road-sign text recognition architecture
for intelligent transportation systems. IEEE 80th Vehicular Technology Conference
(VTC2014-Fall), (pp. 1-5). Vancouver, BC, Canada.
Mehta, S., Paunwala, C., & Vaidya, B. (2019). CNN based traffic sign classification
using adam optimizer. 2019 International Conference on Intelligent Computing and Control
Systems (ICCS), (pp. 1293-1298). Madurai, India.
Mogelmose, A., Trivedi, M. M., & Moeslund, T. B. (2012). Vision-Based Traffic Sign
Detection and Analysis for Intelligent Driver Assistance Systems: Perspectives and
Survey. IEEE Transactions on Intelligent Transportation Systems, 13, pp. 1484-1497.
Qin, F., Fang, B., & Zhao, H. (2010). Traffic sign segmentation and recognition in
scene images. 2010 Chinese Conference on Pattern Recognition (CCPR), (pp. 1-5). Chongqing,
China.
Reinius, S. (2013). Object recognition using the OpenCV Haar cascade-classifier
on the iOS platform. Institutionen för informationsteknologi, Department of Information
Technology, Uppsala Universitet.
Robielos, R., & Lin, C. J. (2022). Traffic Sign Comprehension among Filipino Drivers
and Nondrivers in Metro Manila. Appl. Sci., 12(16), 8337.
Sai, B. N., & Sasikala, T. (2019, February). Object detection and count of objects
in image using tensor flow object detection API. 2019 International Conference on
Smart Systems and Inventive Technology (ICSSIT), (pp. 542-546). Tirunelveli, India.
Wang, Y., Jiang, Z., Li, Y., Hwang, J. N., & Xing, G. (2021). RODNet: A Real-Time Radar Object Detection Network Cross-Supervised by Camera-Radar Fused Object 3D Localization. IEEE Journal of Selected Topics in Signal Processing.
Wu, J., & Liao, S. (2022). Traffic sign detection based on SSD combined with receptive
field module and path aggregation network. Computational Intelligence and Neuroscience,
Hindawi, 2022, 1-13.
Yu, K., Hao, Z., Post, C. J., Mikhailova, E. A., Lin, L., Zhao, G., et al. (2022).
Comparison of classical methods and mask R-CNN for automatic tree detection and mapping
using UAV imagery. Remote Sensing, 14(2), 295.
Zhu, Y., Xu, T., Peng, L., Cao, Y., Zhao, X., Li, S., et al. (2022). Faster-RCNN
based intelligent detection and localization of dental caries. Displays, 74, 102201.
Author
Omar M. S. Ali is a Ph.D. student at the Physics Department, College of Science, Mustansiriyah University, Baghdad, Iraq. He obtained his M.Sc. degree in remote sensing and image processing from the Physics Department, College of Science, Baghdad University, in 2019, and his B.Sc. degree in physics from the same department in 2012. His interests are remote sensing, image processing, GIS, robotics programming, mathematics, and website interface design. He can be contacted by email at omar_m_sultan@uomustansiriyah.edu.iq.
Ali A. D. Al-Zuky is a Professor at the Physics Department, College of Science, Mustansiriyah University, Baghdad, Iraq. He holds a Ph.D. degree in physics (digital image processing) from the Physics Department, College of Science, University of Baghdad (1999). He has supervised more than 40 M.Sc. and 20 Ph.D. projects for postgraduate students in physics, computer science, computer engineering, and medical physics, and has published more than 200 papers in scientific journals and at local and international scientific conferences, in addition to two patents. He received Science Day awards from the Ministry of Higher Education and Scientific Research in Iraq in 2011 and 2012, and the Education Award for Science in 2013. He can be contacted by email at prof.alialzuky@uomustansiriyah.edu.iq.
Fatin E. M. Al-Obaidi is an Assistant Professor at the Physics Department, College of Science, Mustansiriyah University, Baghdad, Iraq. She holds a Ph.D. degree in physics from the Physics Department, College of Science, Mustansiriyah University. She received a Science Day award from the Ministry of Higher Education and Scientific Research in Iraq in 2011. Her research areas are image/signal processing, analysis, pattern recognition, and numerical analysis. She can be contacted by email at sci.phy.fam@uomustansiriyah.edu.iq.