3.1. LOGO Image Classification Based on Region Guided and Enhanced Networks
To achieve intelligent design of trademark logos, features must first be extracted from the image information and classified. The traditional neural network classification model uses a CNN, which mainly extracts features from the various levels of an image through convolutional operations. The CNN optimizes its network parameters and weights through forward propagation and backpropagation of the error, so that the output values approach the target values as closely as possible [14,15]. The specific implementation of the CNN is shown in Fig. 1.
Fig. 1. The specific implementation of CNN.
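As an illustration of this convolution–pooling–classification pipeline and the forward/backward weight update, a minimal PyTorch sketch is given below; the layer sizes, input resolution, and number of classes are illustrative and not those of the network used in the study.

```python
# Minimal sketch of a CNN classifier trained by forward/backward propagation.
# Layer sizes, input resolution and class count are illustrative only.
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level features
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # assumes 224x224 input

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SimpleCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)   # dummy batch
labels = torch.randint(0, 10, (4,))
logits = model(images)                  # forward propagation
loss = criterion(logits, labels)
loss.backward()                         # backpropagation of the error
optimizer.step()                        # weight update toward the target values
```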
With the increasing complexity and diversity of logo images, traditional CNN classification algorithms can no longer effectively extract robust image features. Therefore, fine-grained image classification technology is introduced in this study. Fine-grained classification is an important branch of image classification, characterized by subtle differences between subcategories but significant differences within each subcategory. Because logo images share these characteristics, logo classification can be treated as a fine-grained image classification problem [16]. The dataset selected for the study is the Logo-2K+ classification dataset, whose logo images feature high shape similarity across classes and high background complexity.
The study proposes DRGE-Net, a CNN-based network trained with a self-supervised mechanism. The method first locates the LOGO regions that carry relatively large amounts of information, then strengthens the data under the guidance of these regional features, and finally applies data augmentation strategies to further enhance the informative regions, thereby achieving more effective feature learning [17]. DRGE-Net consists of a regional enhancer sub-network, a teacher sub-network, a guidance sub-network, and an inspection sub-network. Its model structure is shown in Fig. 2.
Fig. 2. DRGE-Net classification network model.
The role of the guidance sub-network in DRGE-Net is to calculate the information content of all predicted regions in the image and obtain the regions with the highest information content. Specifically, given an input image X, the guidance sub-network first generates regions A at different scales using convolutional layers, max pooling, and ReLU activation. It then applies Non-Maximum Suppression (NMS) to these regions to reduce redundancy and obtain the top M informative regions. Finally, it feeds the obtained regions into the teacher sub-network to determine the most informative regions. The teacher sub-network mainly calculates the confidence of the regions provided by the guidance sub-network. In general, regions with high confidence are regions with a high probability of belonging to the true category, as expressed in Eq. (1).
In Eq. (1), $R$ represents the candidate region; $C$ refers to the confidence of the region; $I$ represents the amount of information corresponding to each region. The teacher sub-network mainly optimizes the detected regions and, through a ranking loss, makes the ordering of region confidence consistent with the ordering of region information content. The ranking loss function is calculated as shown in Eq. (2).
In Eq. (2), $a$ and $b$ represent the indices of the regions. The calculation of the $f$ function is shown in Eq. (3).
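For concreteness, a hedged sketch of such a pairwise ranking loss is given below, assuming a hinge-style form for $f$ (the exact form used in Eqs. (2)-(3) is not reproduced here): the informativeness scores produced by the guidance sub-network are pushed to follow the same ordering as the teacher confidences.

```python
# Hedged sketch of a pairwise ranking loss in the spirit of Eqs. (2)-(3).
# The hinge form of f is an assumption; the paper's exact f may differ.
import torch

def ranking_loss(informativeness: torch.Tensor, confidence: torch.Tensor, margin: float = 1.0):
    """informativeness, confidence: tensors of shape (M,) for the M candidate regions."""
    loss = informativeness.new_zeros(())
    M = informativeness.shape[0]
    for a in range(M):
        for b in range(M):
            if confidence[a] < confidence[b]:
                # region b is more confidently classified, so its information
                # score should exceed region a's by at least the margin
                loss = loss + torch.clamp(margin - (informativeness[b] - informativeness[a]), min=0.0)
    return loss

# usage: I = guidance scores, C = teacher confidences for the same M regions
I = torch.tensor([0.2, 0.9, 0.5], requires_grad=True)
C = torch.tensor([0.1, 0.8, 0.6])
print(ranking_loss(I, C))
```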
The teacher sub-network then performs optimization through its loss function to minimize the difference between the classification probability of the complete image and the classification probability of the region. Its calculation is shown in Eq. (4).
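The region-filtering step performed by the guidance sub-network, described above, can be sketched as follows; the NMS threshold and the value of M are illustrative rather than the study's settings.

```python
# Hedged sketch of the guidance sub-network's region filtering: candidate
# regions A are scored, redundant overlaps are removed with NMS, and the
# top-M most informative regions are kept for the teacher sub-network.
import torch
from torchvision.ops import nms

def select_informative_regions(boxes: torch.Tensor, scores: torch.Tensor,
                               iou_threshold: float = 0.3, top_m: int = 6):
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,) informativeness from the guidance net."""
    keep = nms(boxes, scores, iou_threshold)   # drop highly overlapping candidates
    keep = keep[:top_m]                        # nms() returns indices sorted by score
    return boxes[keep], scores[keep]

# usage with dummy proposals at several scales
boxes = torch.tensor([[0., 0., 64., 64.], [4., 4., 68., 68.], [100., 100., 180., 180.]])
scores = torch.tensor([0.9, 0.8, 0.7])
top_boxes, top_scores = select_informative_regions(boxes, scores, top_m=2)
```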
Under the interaction of the guidance sub-network and the teacher sub-network, the $J$ regions most closely related to the image information are obtained. The enhancer sub-network then enhances these $J$ regions, using region cropping and region discarding operations to obtain finer-grained LOGO regions. The mapping calculation of the enhanced feature map is shown in Eq. (5).
Region cropping extracts local features with a larger amount of information by amplifying the logo region and then determining whether each part belongs to the foreground or the background; parts detected as background are cropped away. Region discarding mainly hides the background region, as shown in Eq. (6).
In Eq. (6), $\theta _{c} $ and $\theta _{d} $ are both preset thresholds. The calculation of the cross-entropy loss is shown in Eq. (7).
In Eq. (7), $E$ is the probability that the enhanced region maps to the true category label.
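A possible implementation of the cropping and discarding operations described above is sketched below; the use of a normalized response map with the thresholds $\theta _{c} $ and $\theta _{d} $ is an assumption consistent with the description of Eq. (6), not the paper's exact formulation.

```python
# Hedged sketch of region cropping / discarding. A normalized response map
# decides foreground vs. background: values above theta_c define a box that is
# enlarged (cropping), while values below theta_d are treated as background and
# hidden (discarding), as described in the text.
import torch
import torch.nn.functional as F

def crop_and_discard(image: torch.Tensor, response: torch.Tensor,
                     theta_c: float = 0.5, theta_d: float = 0.5):
    """image: (C, H, W); response: (H, W) normalized to [0, 1]."""
    C, H, W = image.shape
    fg = response > theta_c
    ys, xs = torch.nonzero(fg, as_tuple=True)
    if len(ys) == 0:                                  # nothing exceeds the threshold
        return image, image
    y1, y2 = int(ys.min()), int(ys.max()) + 1
    x1, x2 = int(xs.min()), int(xs.max()) + 1
    cropped = image[:, y1:y2, x1:x2]                  # keep the informative foreground
    cropped = F.interpolate(cropped.unsqueeze(0), size=(H, W),
                            mode="bilinear", align_corners=False).squeeze(0)
    discarded = image * (response > theta_d).to(image.dtype)  # hide the background region
    return cropped, discarded
```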
Through the synergy of the guidance sub-network, the teacher sub-network, and the regional enhancer sub-network, the feature regions that are most relevant and carry the largest amount of information are obtained, which helps the inspection sub-network make its decision. Specifically, the inspection sub-network first performs global and regional feature extraction and fusion on the $J$ enhanced information regions. Next, it concatenates the input image feature vectors with the enhanced feature vectors and feeds them into the convolutional and classification layers. Finally, the inspection sub-network produces the final prediction result, which is expressed in Eq. (8).
In Eq. (8), $F$ represents a function transformation. After the losses of the four sub-networks are computed, DRGE-Net uses stochastic gradient descent for joint loss optimization, as shown in Eq. (9). In Eq. (9), $\alpha $ and $\beta $ are hyperparameters. Through the synergistic effect of the four sub-networks, the regions containing the most relevant information can be obtained and enhanced, thereby achieving more accurate and effective logo image classification. The detailed network structure of DRGE-Net is displayed in Fig. 3.
Fig. 3. DRGE-Net detailed network structure.
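To make the inspection sub-network's fusion step (Eq. (8)) and the weighted joint objective (Eq. (9)) concrete, a hedged sketch follows; the concatenation-plus-linear-head design and the particular weighting of the sub-losses are assumptions consistent with the description rather than the exact DRGE-Net implementation.

```python
# Hedged sketch of the inspection sub-network's feature fusion and a weighted
# joint objective with hyperparameters alpha and beta, optimized jointly by
# stochastic gradient descent as described in the text.
import torch
import torch.nn as nn

class InspectionHead(nn.Module):
    def __init__(self, feat_dim: int, num_regions: int, num_classes: int):
        super().__init__()
        # fuse the global image feature with the J enhanced region features
        self.classifier = nn.Linear(feat_dim * (num_regions + 1), num_classes)

    def forward(self, global_feat, region_feats):
        """global_feat: (B, D); region_feats: (B, J, D)."""
        fused = torch.cat([global_feat, region_feats.flatten(1)], dim=1)
        return self.classifier(fused)           # final prediction, Eq. (8)-style

def joint_loss(loss_teacher, loss_rank, loss_enhance, alpha=0.5, beta=0.5):
    # weighted combination of the sub-network losses (assumed weighting)
    return loss_teacher + alpha * loss_rank + beta * loss_enhance
```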
3.2. LOGO Detection Based on YOLOv3 Improved Loss Function
To better achieve the localization and classification of target logos, it is necessary to detect the key information in the target area, determine the bounding box of the target logo, and give its category. Current detection models have improved in accuracy, but their detection speed has decreased significantly. Moreover, because logo images contain many distorted and occluded targets against complex backgrounds, it is difficult to accurately recognize and detect the target logo [18,19]. Therefore, the YOLOv3 detection algorithm is introduced in the study. The basic YOLO network consists of 2 fully connected layers and 24 convolutional layers. It mainly predicts the bounding box from the top-level feature map and estimates the probability of each category. YOLOv3 uses the DarkNet-53 network for feature extraction, performs multi-scale feature extraction at three scales, and also introduces a dual-dimensional attention mechanism. The residual structure used to extract features at different scales ensures the convergence of the deep network and avoids overfitting [20]. YOLOv3 not only ensures the distinguishability of features, but also effectively achieves real-time detection from image to target classification and regression. Its specific structure is shown in Fig. 4. The feature pyramid matches the feature information of 9 detection boxes to 3 feature maps of different sizes: the 38 × 38 output corresponds to large target boxes, the 76 × 76 output to medium target boxes, and the 152 × 152 output to small target boxes.
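A simple illustration of how 9 anchor boxes can be distributed over the three output scales is given below; the anchor sizes are invented for illustration, and the assignment rule (largest anchors to the coarsest feature map, which has the largest receptive field) is the conventional YOLOv3 scheme.

```python
# Illustrative sketch of assigning 9 anchor boxes to three output scales:
# anchors are sorted by area and the largest three go to the coarsest feature
# map (largest receptive field, large targets), down to the finest map.
def assign_anchors_to_scales(anchors, grid_sizes=(38, 76, 152)):
    """anchors: list of (w, h); grid_sizes ordered coarse -> fine."""
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1], reverse=True)
    return {grid: ordered[i * 3:(i + 1) * 3] for i, grid in enumerate(grid_sizes)}

# example (w, h) pairs, not taken from any dataset
anchors = [(12, 18), (25, 40), (38, 70), (60, 45), (80, 120),
           (110, 90), (150, 200), (220, 160), (300, 320)]
print(assign_anchors_to_scales(anchors))
```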
The logo images in the LogoDet-3K dataset are characterized by high background complexity, logo regions of varying shape and size, and the presence of distortion and occlusion. A logo detection method called Logo YOLO is therefore developed to detect target logos faster and more accurately [21]. The specific steps of this method are as follows. First, the anchor sizes for the LogoDet-3K dataset are recalculated using the K-means clustering algorithm to improve the output scales of the network. Then, the effective features in the logo image are extracted by the DarkNet network, and the residual module is used to remove redundant information. Next, the feature pyramid is used to fuse multi-scale features and perform detection at three scales. Finally, an improved classification loss function is applied to reduce the negative impact of the imbalance between hard and easy samples [22]. The detection framework of Logo YOLO is shown in Fig. 5.
Fig. 4. Specific structural diagram of YOLOv3.
Fig. 5. Logo YOLO's detection framework.
Anchor boxes are candidate boxes with fixed height and width. The original YOLOv3 clusters the COCO dataset to generate 9 anchor boxes and produces outputs at 3 scales. Among them, the 13 × 13 output has the largest receptive field and is therefore used to detect large targets; the 26 × 26 output detects medium-sized targets; and the 52 × 52 output detects small targets. However, because the anchor proportions of the original YOLOv3 network are no longer suitable for detection on the LogoDet-3K dataset, the K-means clustering algorithm is used to cluster the target bounding boxes in LogoDet-3K, with the average overlap used as the evaluation indicator of the result. The objective function of the average overlap is shown in Eq. (10).
In Eq. (10), $N$ is the total number of samples; $k$ denotes the number of clusters; $B$ represents the rectangular box of a real sample; $C$ represents the clustering result of K-means.
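A hedged sketch of the anchor clustering and the average-overlap evaluation of Eq. (10) is shown below; the highest-IoU assignment and median cluster updates are a common choice and an assumption here, and the box sizes are dummy data rather than LogoDet-3K statistics.

```python
# Hedged sketch of K-means anchor clustering with an IoU-based assignment and
# the average overlap of Eq. (10) as the evaluation metric.
import numpy as np

def wh_iou(boxes, anchors):
    """boxes: (N, 2) widths/heights; anchors: (k, 2). IoU as if boxes were top-left aligned."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(boxes, anchors), axis=1)      # nearest = highest IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(boxes[assign == j], axis=0)
    avg_iou = np.mean(np.max(wh_iou(boxes, anchors), axis=1))   # Eq. (10)-style average overlap
    return anchors, avg_iou

boxes = np.abs(np.random.default_rng(1).normal(80, 40, size=(500, 2))) + 5  # dummy box sizes
anchors, avg_iou = kmeans_anchors(boxes, k=9)
```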
YOLOv3 mainly extracts image features through the Darknet-53 network, which introduces a residual module that effectively controls the propagation of gradients, thereby reducing the difficulty of training deep networks. YOLOv3 uses the 53 convolutional layers of Darknet-53 to achieve feature extraction, multi-scale feature fusion, and detection; the number of convolutional kernels in the last convolutional layer equals the final number of classification categories. In addition, the loss function of the target detection task includes a regression loss and a classification loss. The original loss function of YOLOv3 mainly includes the bounding box coordinate loss, category loss, confidence loss, and width-height loss, and its expression is shown in Eq. (11). In Eq. (11), $Loss_{xy} $, $Loss_{class} $, $Loss_{conf} $, and $Loss_{wh} $ represent the bbox coordinate loss, category loss, confidence loss, and width-height loss, respectively.
Owing to sample imbalance, the training loss is dominated by a large number of easy samples whose individual losses are very small, which reduces the detection accuracy. To solve this problem, Focal loss is introduced on the basis of the original loss function, and its calculation is shown in Eq. (12).
In Eq. (12), $y$ represents the true category; $y'$ refers to the probability estimated by the model after the activation function; $\omega $ is the hyperparameter that balances positive and negative samples; $\lambda $ is the hyperparameter that adjusts the weight of hard and easy samples and controls the attenuation of the sample loss.
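A minimal sketch of the binary Focal loss of Eq. (12), written with the symbols used above, is given below; the default values of $\omega $ and $\lambda $ are the commonly used ones and are assumptions rather than the study's settings.

```python
# Hedged sketch of the binary Focal loss of Eq. (12): y is the true label,
# y' the predicted probability, omega the positive/negative balancing weight,
# and lam the focusing (attenuation) exponent.
import torch

def focal_loss(y_pred: torch.Tensor, y_true: torch.Tensor,
               omega: float = 0.25, lam: float = 2.0, eps: float = 1e-7):
    """y_pred: probabilities after the activation function; y_true: 0/1 labels."""
    y_pred = y_pred.clamp(eps, 1.0 - eps)
    pos = -omega * (1.0 - y_pred) ** lam * torch.log(y_pred)          # hard positives weigh more
    neg = -(1.0 - omega) * y_pred ** lam * torch.log(1.0 - y_pred)    # easy negatives are damped
    return torch.where(y_true == 1, pos, neg).mean()
```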
In addition, the Intersection over Union (IoU) is an indicator of the agreement between the real and predicted boxes, which characterizes the degree of target localization loss. Its calculation is shown in Eq. (13).
In Eq. (13), $B_{gt} $ denotes the target box and $B^{pd} $ the prediction box. Although IoU is scale-invariant, it only acts when the bounding boxes intersect. On the basis of IoU, the minimum bounding rectangle of the real box and the prediction box has been introduced to enable gradient optimization for disjoint boxes, but its convergence speed is still limited. Therefore, the CIoU loss is introduced on the basis of the YOLOv3 loss function to achieve effective regression of the prediction box, and its calculation is shown in Eq. (14).
In Eq. (14), $R_{CIoU} $ represents the penalty term of the target box and the prediction box,
and its calculation is shown in Eq. (15).
In Eq. (15), $e^{gt} $ and $e^{pd} $ denote the center points of the target box and the prediction box, respectively; $\varphi (\cdot )$ represents the Euclidean distance function between two points; $\mu \upsilon $ refers to the aspect-ratio term, in which $\upsilon $ measures the consistency of the aspect ratios and $\mu $ is its trade-off weight.
The modeling diagram of the CIoU loss is shown in Fig. 6, in which the blue box represents the real box, the purple box represents the predicted box, and the red box represents the smallest box containing both. $h$ represents the diagonal distance of the smallest enclosing box, and $d$ represents the distance between the center points of the two boxes.
Fig. 6. CIoU loss modeling diagram.
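For reference, a hedged sketch of the IoU of Eq. (13) and the CIoU loss of Eqs. (14)-(15) for axis-aligned boxes is given below, using the quantities $d$ and $h$ from Fig. 6; the exact symbol conventions are assumptions where they are not fixed by the text.

```python
# Hedged sketch of IoU and the CIoU loss for boxes given as (x1, y1, x2, y2):
# d is the distance between the two box centers and h the diagonal of the
# smallest enclosing box, as in Fig. 6.
import math
import torch

def iou(box_a: torch.Tensor, box_b: torch.Tensor, eps: float = 1e-7):
    x1 = torch.max(box_a[0], box_b[0]); y1 = torch.max(box_a[1], box_b[1])
    x2 = torch.min(box_a[2], box_b[2]); y2 = torch.min(box_a[3], box_b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + eps)

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    i = iou(pred, target)
    # d^2: squared distance between the center points of the two boxes
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_t, cy_t = (target[0] + target[2]) / 2, (target[1] + target[3]) / 2
    d2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    # h^2: squared diagonal of the smallest box enclosing both boxes
    cw = torch.max(pred[2], target[2]) - torch.min(pred[0], target[0])
    ch = torch.max(pred[3], target[3]) - torch.min(pred[1], target[1])
    h2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (torch.atan((target[2] - target[0]) / (target[3] - target[1] + eps))
                              - torch.atan((pred[2] - pred[0]) / (pred[3] - pred[1] + eps))) ** 2
    alpha = v / (1 - i + v + eps)
    r_ciou = d2 / h2 + alpha * v          # penalty term of Eq. (15)
    return 1 - i + r_ciou                 # CIoU loss

pred = torch.tensor([10., 10., 60., 80.])
target = torch.tensor([15., 20., 70., 90.])
print(ciou_loss(pred, target))
```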