1. Introduction
With the advent of the information age, the Internet has penetrated into everyone’s
life, and the rapid development of multimedia database applications has made databases
grow sharply in scale. Advances in information technology have in turn driven progress
in digital image processing on large amounts of data, and image processing is now widely
used in daily life, especially in application environments such as remote sensing [1], crop detection [2,3], medicine and meteorology [4], and electric power systems [5]. As the processes of describing things and their corresponding characteristics become
more complicated and diverse, image information greatly helps users. It is fair to say
that increasingly innovative image processing technology is an important part of modern life.
<New paragraph> Graphic information is stored on a computer by converting it into
digital information; once converted, the user can operate the computer to manipulate
the image [6]. Therefore, the key to image processing is the computing power of the computer, which
is limited by both its software and its hardware. In the past, because of the limitation
of a computer’s computing power, a user could only carry out image preprocessing on a
computer. With the development of personal computers, however, digital image processing
technology has changed: image processing has become more diversified, its accuracy has
improved, and the reproducibility of image processing algorithms is better. Therefore,
after continuous development, today’s graphics processing technology can go beyond
preprocessing and allow the computer to carry out image understanding, and image
understanding and computer vision have become new challenges in image processing.
<New paragraph> The importance of image processing is now widely recognized, and research
effort and resources are being devoted to it around the world. Faced with a huge number
of digital images, how to effectively organize and manage these data to meet the needs of users has become an
urgent and meaningful research topic. Image retrieval [7] provides an effective solution to these problems to some extent. In the early days,
people added text annotation to images by hand and then did image retrieval. When
the size of an image database is small, the accuracy of manual annotation is high,
and the annotation process is relatively simple. With the growth of digital images,
traditional manual labeling has shown great limitations: (1) when the size of the
image dataset is large, manual labeling needs much time and labor, and (2) for the
same digital image, different people have different understanding, which has great
subjectivity. Therefore, automatic image annotation algorithms have received extensive
attention.
<New paragraph> Image captioning [8,9] is a comprehensive problem combining computer vision, natural language processing,
and machine learning and requires generating a paragraph of text describing a given
image. The task is remarkably simple for humans, but it is a daunting challenge for
computers. The model not only needs to be able to extract the features of the image,
but also to identify the objects in the image and finally use natural language to
express the relationship between them. As two-dimensional data, an image has abundant
spatial distribution information, including the spatial relationship between the objects
contained in the image and the spatial structure of the object itself, which is of
great significance for image retrieval and classification. However, in the real world,
besides complete images, there are also a large number of images that are damaged or
intentionally obscured.
<New paragraph> To some extent, the incompleteness of image information brings inconvenience
and challenges to the interpretation and understanding of images. Therefore, it is
necessary and important to find an effective method to label the incomplete images.
Therefore, many scholars have carried out in-depth discussion and research on image
annotation. Srinivas discussed the importance of automatic and intelligent image annotation,
given how time-consuming manual annotation is, and analyzed the research results on
automatic image annotation from the last decade to help remedy the shortcomings of
existing automatic image annotation methods [10].
<New paragraph> Zhang et al. proposed an image region annotation framework based on
the syntactic and semantic correlation between image segmentation regions. The results
show that the image annotation using this method has good performance on the Corel
5K dataset, and the annotation accuracy is high [11]. Mehmood et al. designed a support vector machine based on the weighted average of
a triangle histogram and applied the improved support vector machine to image retrieval
and automatic semantic annotation. Qualitative and quantitative analysis of three
image benchmarks confirmed the effectiveness of this method [12].
<New paragraph> However, the existing intelligent image annotation methods still have
some shortcomings, such as low efficiency and low precision. Therefore, by combining
the scale-invariant feature transform (SIFT) algorithm with image region selection and
similarity transfer, an automatic image annotation algorithm for a mobile computing
environment is proposed. The proposed algorithm can achieve intelligent image annotation
efficiently and accurately, which plays an important role in image processing, image
recognition, and other fields.
2. Overall Design of Automatic Marking of Incomplete Images in Mobile Computing Environment
The main steps of automatic labeling are image preprocessing, image feature extraction,
image feature similarity measurement, model training, and automatic labeling; a schematic
sketch of how these steps fit together is given after the list.
(1) Image preprocessing: In image annotation, the quality of an image will directly
affect the effect of the annotation algorithm, so it is necessary to perform image
preprocessing before extracting image features. Image preprocessing mainly removes
the useless noise information in the image, strengthens the useful information, and
enhances the robustness of image annotation.
(2) Image feature extraction: An image feature is a unique property of a certain type
of image, and feature extraction realizes the quantitative expression of these image
properties by programming with some mathematical means. Good features can often greatly
improve the effect of image annotation.
(3) Similarity measurement of image features: The image to be labeled is incomplete,
and the missing part provides no help in understanding the content of the image. Therefore,
the missing part is excluded from the similarity measurement, and the similarity of the
overall content of the image is determined by the similarity of the information contained
in its visible part. The missing part is enclosed with a rectangular selection box tangent
to its edges, and the image is divided proportionally by extending the sides of this box
across the image. If the image contains more information, it is divided further as
appropriate. The aim of image segmentation is to obtain meaningful partitions from the
input image [10]; it is basic work in the field of image processing and an important step for subsequent
image processing and analysis. Therefore, before the low-level feature extraction of the
image, all images are segmented uniformly in this way to improve the annotation effect.
(4) Model training: Model training is the core part of the image labeling algorithm.
Whatever image labeling method is used, after a good image feature representation is
obtained, a model is constructed and trained on the image features to find the relevance
between images and semantic keywords.
(5) Automatic labeling: Automatic labeling of test images is done for different data
or application scenarios.
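To show how these five steps fit together, a minimal Python sketch of the pipeline is
given below; the step functions are passed in as placeholders because each step is
specified in Section 3, and none of the function names come from the paper.

```python
# Minimal sketch of the five-step annotation pipeline; the step functions are
# supplied as callables because each step is detailed later in Section 3.
def annotate_image(image, training_set, preprocess, segment,
                   extract_features, nearest_blocks, transfer_labels):
    clean = preprocess(image)                                     # (1) preprocessing
    blocks = segment(clean)                                       # region division / segmentation
    feats = [extract_features(b) for b in blocks]                 # (2) feature extraction
    neighbors = [nearest_blocks(f, training_set) for f in feats]  # (3) similarity measurement
    return transfer_labels(neighbors)                             # (4)-(5) model training + labeling
```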
3. Design of Automatic Labeling Algorithm for Incomplete Image
3.1 Image Preprocessing
According to the characteristics of an incomplete image, the missing part of the image
cannot help us understand its content. Therefore, when we measure the similarity of an
image, we do not consider the missing part, and the similarity of the overall content
of the image is determined by the similarity of the information contained in the visible
part of the image. Image segmentation is basic work in the field of image processing
and an important step in the subsequent image processing and analysis. Thus, before
extracting the low-level features of the image, all the images to be annotated are
divided into regions and segmented uniformly in order to improve the effect of image
annotation.
In region division and image segmentation, the method of region selection was used.
The aim of region selection is to select several regions from the whole image in a
certain way, describe the content of the image based on these basic regions, and better
mine the information about the different objects in the image [12]. Image region selection methods are mainly divided into fixed division, image segmentation,
and salient points. Among these, image segmentation is the most effective and most widely
used method of region selection. Image segmentation aims to divide the image into regions
corresponding to several objects so that each region corresponds to one object.
<New paragraph> Image segmentation is basic research content in the field of computer
vision. The regions obtained after segmentation are often irregular, and a region covariance
description method can be used to represent each of them. Let I be a one-dimensional gray
or three-dimensional color image and F be the W × H × d feature image extracted from I:

$F\left(x,y\right)=\phi \left(I,x,y\right)$

where $\left(x,y\right)$ are the coordinates of the feature point, W is the width of the
image, H is the height of the image, d is the number of features extracted, and $\phi
\left(\cdot \right)$ can be any mapping function, such as the image gray value, color,
gradient, or filter response. For a given region R, let $z_{i}$ denote the d-dimensional
feature points inside R. Region R is represented by the covariance of the feature points:

$C_{R}=\frac{1}{n-1}\sum _{i=1}^{n}\left(z_{i}-\mu \right)\left(z_{i}-\mu \right)^{T}$

where $\mu =\frac{1}{n}\sum _{i=1}^{n}z_{i}$ is the mean of all feature points and n is
the number of pixels in the region.
Firstly, an image is segmented into different regions to make each region correspond
to an object. Then, the region after image segmentation is described using covariance.
The difference from the original covariance description is that the region corresponding
to the original covariance is a regular rectangular region, while the region corresponding
to the covariance matrix is an irregular region. Regular regions usually contain multiple
objects, and the region after image segmentation usually corresponds to a semantic
object. Covariance description is a regional representation method from the perspective
of regional feature point distribution, which is independent of the size of the region.
For the same semantic goal, the corresponding regional distribution will be similar,
and the covariance will also be similar. Thus, representing each segmented region by its
covariance makes it possible to distinguish regions belonging to different objects.
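As an illustration, a minimal sketch of the region covariance descriptor for an irregular
region is given below; the specific per-pixel features (coordinates, gray value, gradient
magnitude) stand in for the mapping $\phi$ and are assumptions made for the example.

```python
# Illustrative computation of the region covariance descriptor C_R for an
# irregular region given as a boolean mask; the per-pixel features are
# (x, y, gray value, gradient magnitude), one possible choice of phi.
import numpy as np

def region_covariance(gray, mask):
    """gray: 2-D image array; mask: boolean array marking the irregular region R."""
    ys, xs = np.nonzero(mask)                     # pixel coordinates inside R
    gy, gx = np.gradient(gray.astype(float))      # image gradients
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    # d-dimensional feature points z_i for every pixel in R
    Z = np.stack([xs, ys, gray[ys, xs], grad_mag[ys, xs]], axis=1).astype(float)
    mu = Z.mean(axis=0)                           # mean feature vector
    diff = Z - mu
    n = Z.shape[0]
    return np.dot(diff.T, diff) / (n - 1)         # d x d covariance matrix C_R
```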
3.2 Image Feature Extraction
3.2.1 Pre-extraction of Image Features
SIFT is a feature description method used in the field of image processing. It is invariant
to scale and extracts features by detecting the key points in the image. Therefore, SIFT
is used here to extract image features.
SIFT feature detection can be summarized in four basic steps. First, extrema in scale
space are detected: the positions of all images in different scale spaces are computed,
and potential feature positions that are invariant to scale and rotation are identified
using difference-of-Gaussian functions. Then, the candidate key points are fitted to
determine their precise location and scale, and the information at each location is
compared; the more stable a point is, the better it serves as a feature point. Next,
the dominant orientation is computed by comparing gradient directions based on the local
image information.
<New paragraph> These steps are repeated so that the algorithm achieves a relatively high
degree of scale invariance. In the last stage, we compute the descriptor of each feature
point from the gradients in its neighborhood. The advantage of this gradient-based
description is that it captures the variation around the measured position strongly,
while tolerating larger local shape deformation and illumination changes.
According to the data structure and data type selected in this paper, we can choose the
optimal number of feature points returned by the algorithm. We use an absolute contrast
threshold to filter out weak feature points: the larger this threshold is, the fewer
feature points are returned. We also use a threshold to filter out edge responses: the
larger this threshold is, the fewer feature points are filtered out.
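For illustration, the following sketch shows SIFT keypoint detection with OpenCV, assuming
opencv-python 4.4 or later (where SIFT is exposed as cv2.SIFT_create); the file name and
parameter values are placeholders, not values fixed by the paper.

```python
# Minimal sketch of SIFT keypoint detection and description with OpenCV.
import cv2

img = cv2.imread("incomplete_image.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file name

sift = cv2.SIFT_create(
    nfeatures=500,           # best N keypoints to return
    contrastThreshold=0.04,  # larger -> fewer weak keypoints kept
    edgeThreshold=10,        # larger -> fewer edge-like keypoints filtered out
    sigma=1.6)               # base smoothing factor of the Gaussian pyramid

keypoints, descriptors = sift.detectAndCompute(img, None)
print(descriptors.shape)     # (number of keypoints, 128)
```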
A Gaussian pyramid is a concept introduced in the scale-invariant feature transform. A
Gaussian pyramid is composed of several groups, and each group contains several layers.
The pyramid is constructed by doubling the original image to form the 1st layer of Group 1
and smoothing this layer to form the 2nd layer of Group 1. The Gaussian smoothing function
is as follows:

$G\left(r\right)=\frac{1}{\sqrt{2\pi }\sigma }e^{-r^{2}/\left(2\sigma ^{2}\right)}$
$\sigma $ is the standard deviation of the normal distribution; the larger the standard
deviation is, the more blurred the image is. The blur radius r refers to the distance
from the target point to the center. The Gaussian function in two-dimensional space is:

$G\left(x,y\right)=\frac{1}{2\pi \sigma ^{2}}e^{-\left(x^{2}+y^{2}\right)/\left(2\sigma ^{2}\right)}$
For parameter $\sigma $, a fixed value of 1.6 is used in SIFT. $\sigma $ is multiplied
by a scale factor k to obtain a new smoothing factor, which is used to smooth the Group 1,
Layer 2 image, and the resulting image is used as Layer 3. We repeat this process to
obtain the L layers of the group.
Generally, the number of layers L is determined by the size of the image, where t is the
logarithm of the dimension of the topmost image in the pyramid. Within the same group,
the layers all have the same size, but their smoothing coefficients differ; the corresponding
smoothing coefficients are $0,\sigma ,k\sigma ,k^{2}\sigma ,k^{3}\sigma ,\cdots ,k^{L-2}\sigma $.
The third layer from the top of Group 1 is downsampled by a factor of 2, and the resulting
image is taken as the first layer of Group 2. Gaussian smoothing is then applied to the
first layer of Group 2 to obtain its second layer. Proceeding as in the previous step, the
L layers of Group 2 are obtained; layers in the same group have the same size, and the
corresponding smoothing coefficients are again $0,\sigma ,k\sigma ,k^{2}\sigma ,k^{3}\sigma
,\cdots ,k^{L-2}\sigma $, but the images in Group 2 are half the size of those in Group 1.
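The construction just described can be sketched as follows, assuming OpenCV and NumPy;
the parameter names, the default scale factor, and the simplification of smoothing each
layer directly from the group's base image are choices made only for this example.

```python
# Minimal sketch of the Gaussian pyramid construction described above.
import cv2

def gaussian_pyramid(image, n_groups=4, n_layers=5, sigma0=1.6, k=2 ** 0.5):
    """Each group has n_layers images with smoothing 0, sigma0, k*sigma0, ...,
    k^(n_layers-2)*sigma0; each new group starts from a 2x-downsampled image."""
    pyramid = []
    # Group 1, Layer 1: the doubled original image (no smoothing, coefficient 0)
    base = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)
    for _ in range(n_groups):
        group = [base]
        for l in range(1, n_layers):
            sigma = sigma0 * (k ** (l - 1))        # sigma, k*sigma, k^2*sigma, ...
            group.append(cv2.GaussianBlur(group[0], (0, 0), sigma))
        pyramid.append(group)
        # the third layer from the top is downsampled by 2 as the next group's base
        base = cv2.resize(group[-3], None, fx=0.5, fy=0.5,
                          interpolation=cv2.INTER_NEAREST)
    return pyramid
```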
3.2.2 Image Feature Extraction
A controllable-direction Kalman filter analysis model of the incomplete image was constructed
by using the method of region merging, and the sparse feature points of the incomplete image
were described by $I(i,j)$ using the method of region-equivalent histogram analysis. The
target template $I_{\left(k\right)}(i,j)$ is as follows:
k represents the equivalent area control coefficient of the regional equivalent histogram,
and the output of the layered feature extraction of the controllable direction of
the incomplete image is:
When $W_{i,j}=\left\{\begin{array}{ll}
0 & \left| W_{i,j}\right| \leq \lambda \\
W_{i,j}-\beta (W_{i,j}-\mu ) & W_{i,j}>\lambda \end{array}\right.$ is used,
the histogram of each window is weighted. Combining this with the method of texture
recognition, the difference function of template matching is as follows:
$h_{k}$ and $g_{k}$ represent image fusion and filtering coefficients. $X_{i}$ is
the linear distribution sequence of the original image texture, $y_{i}$ and $z_{i}$
are the fusion coefficients of image features, and the current processing area $R_{i}$
is $A_{i}$. The gray level distribution sequence of the incomplete image is obtained
as:
Based on the histogram analysis of the matching template and linear tracking recognition,
the iterative function of feature extraction output of the incomplete image is:
Under the optimal feature matching, the edge feature extraction output of the incomplete
image is as follows:
The edge information weighting coefficients $\lambda $ and $\nu $ are both constants,
with $\lambda >0$. The distribution area of the feature extraction of the incomplete
image is:
$I(x,y)$ is the gray histogram of the incomplete image, $\mathrm{sgn}\left(\cdot \right)$
is the sign function, and $G_{\sigma }$ is the error coefficient. The directional histogram
fusion algorithm is adopted to realize the feature extraction of the incomplete image.
3.3 Similarity Measure of Image Features
The similarity of the image to be labeled is measured within the set of annotated segmented
sub-blocks. That is, the segmented sub-blocks $I=\left\{I_{1},I_{2},\cdots ,I_{s}\right\}$
of the image are used to obtain the nearest annotated neighbors $I_{i}=\left\{I_{1}^{i},I_{2}^{i},\cdots
,I_{s}^{i}\right\}$. The annotated segmentation blocks considered for each sub-block are
denoted by $K$, and all the annotated segmentation blocks in the training set are represented
by the matrix $I'=\left\{I_{1},I_{2},\cdots ,I_{i}\right\}\in R^{1\times si}$. Because each
part of the image carries certain spatial information, each block of the image to be labeled
is localized within the training set; that is, only the subset $I_{1}^{i}$ of the training
set is taken into account when the nearest neighbors of $I_{1}$ are obtained.
For the lower-level eigenvectors extracted from the two partitioned blocks, a and
b are represented as $F_{a}\left(hsv_{1}^{a},hsv_{2}^{a},\cdots ,hsv_{256}^{a},tex_{1}^{a},tex_{2}^{a},\cdots
,tex_{t}^{a}\right)$ and $F_{b}\left(hsv_{1}^{b},hsv_{2}^{b},\cdots ,hsv_{256}^{b},tex_{1}^{b},tex_{2}^{b},\cdots
,tex_{t}^{b}\right)$, respectively. The distance between the two partitioned blocks
is:
Then, we can obtain the distance vector $D_{s}=\left\{d_{s1},d_{s2},\cdots ,d_{sk}\right\}$
between $K$ and $I_{s}$, with $d_{s1}<d_{s2}<\cdots <d_{sk}$; $d_{s1}$ is the distance to
the most similar block. The matrix $D=\left[D_{1},D_{2},\cdots ,D_{s}\right]\in R^{1\times sk}$
collects the distances between all partition blocks of the incomplete image to be marked
and their corresponding neighborhood blocks [13,14]. We define the similarity metric between sub-blocks as:
$d_{ab}$ denotes the distance between partition block a of the incomplete image and
its nearest neighbor block b. The closer the distance is between two blocks, the greater
the similarity measure is. The similarity measure between segmented sub-block $I_{s}$
of the incomplete image to be labeled and its corresponding neighborhood segmented
sub-block is $W_{s}\,\left(w_{s1},w_{s2},\cdots ,w_{sk}\right)\,,$ and the similarity
measure matrix of $I$ is $W=\left[W_{1},W_{2},\cdots ,W_{s}\right]\in R^{1\times sk}$.
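To illustrate the distance and similarity computation between sub-blocks, the following
sketch assumes a Euclidean distance between the low-level feature vectors and an exponential
transfer from distance to similarity ($w=e^{-d}$); both choices are assumptions, since the
text does not fix the distance or the similarity function explicitly.

```python
# Illustrative sketch of the block-level similarity measure.
import numpy as np

def block_distance(f_a, f_b):
    """Distance between the low-level feature vectors (HSV histogram + texture
    features) of two segmented sub-blocks; Euclidean distance is assumed."""
    return np.linalg.norm(np.asarray(f_a, dtype=float) - np.asarray(f_b, dtype=float))

def similarity_weights(block_features, neighbor_features, k=5):
    """Return the indices of the k nearest annotated neighbor blocks and the
    similarity measure w for each of them (closer block -> larger weight)."""
    d = np.array([block_distance(block_features, nb) for nb in neighbor_features])
    order = np.argsort(d)[:k]          # d_s1 < d_s2 < ... < d_sk
    w = np.exp(-d[order])              # assumed transfer from distance to similarity
    return order, w
```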
3.4 Training of Models
The training process for the model is shown in Fig. 1. During training, the input of the model is the extended training set of sentence
descriptions generated for the model:

Here, $m$ ranges from 1 to $M$, and $C_{\theta c}\left(I\right)$ represents the output
of the last layer of the CNN. $h_{m}$ is the output of the hidden layer, $h_{0}$ is
initialized as a zero vector, and the input of the neurons in the hidden layer includes
the expanded word vector $x_{m}$ and the previous moment’s information $h_{m-1}$ (contextual
information). However, we only consider the influence of the image information $b_{v}$
in the first step of training; experiments have shown that this works better than adding
$b_{v}$ at every step. $x_{1}$ is a specific START vector, $x_{2}$ is the first word of
the sentence, $x_{3}$ is the second word, and $x_{M}$ is the last word.
<New paragraph> $y_{m}$ is the output of the output layer, indicating the probability
of each word in the dictionary and the probability of the terminator. During training,
the $y_{1}$ target corresponds to the first word in the sentence, the $y_{2}$ target
corresponds to the second word, and the $y_{M}$ target is a specific END vector. The
training of the model is then carried out.
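For reference, a standard recurrent formulation consistent with this description is given
below as a sketch only; the weight matrices $W_{hi}$, $W_{hx}$, $W_{hh}$, $W_{oh}$ and the
biases $b_{h}$, $b_{o}$ are introduced here for illustration and are not named in the text:

$b_{v}=W_{hi}\,C_{\theta c}\left(I\right)$,
$h_{m}=f\left(W_{hx}x_{m}+W_{hh}h_{m-1}+b_{h}+\mathbb{1}\left(m=1\right)b_{v}\right)$,
$y_{m}=\operatorname{softmax}\left(W_{oh}h_{m}+b_{o}\right)$,

where $\mathbb{1}\left(m=1\right)$ expresses that the image information $b_{v}$ is fed in
only at the first training step, as stated above.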
Fig. 1. Training process for the model.
3.5 Image Annotation Algorithm
In an incomplete image, the missing part and the visible part of the image information
are correlated. To eliminate the interference of the missing part with image recognition,
we use the spatial relationships among the sub-blocks of the visible part of the image.
This kind of spatial relation is reflected more between the partitioned sub-blocks of the
image than between the objects in the image. In view of this characteristic, in the process
of automatically labeling an incomplete image, the fused spatial information is the
proportion of each segmented sub-block of the image in the overall spatial distribution
of the image (that is, the spatial structure information).
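As a small illustration, this spatial structure information can be computed as the area
proportion of each sub-block in the whole image; representing sub-blocks as binary masks
is an assumption made for the example.

```python
# Sketch of the fused spatial-structure information: the proportion of each
# segmented sub-block in the whole image, with sub-blocks given as boolean masks.
import numpy as np

def spatial_proportions(masks):
    """masks: list of boolean arrays, one per sub-block, same shape as the image."""
    total = float(masks[0].size)                 # total number of pixels in the image
    return [np.count_nonzero(m) / total for m in masks]
```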
Table 1. Software and hardware conditions during the experiment.

Project  | Item                                        | Parameter
Hardware | CPU                                         | Intel® Core™ i7-4790
         | Physical memory                             | 16 GB
         | Dominant frequency                          | 3.60 GHz
Software | Operating system                            | CentOS 7
         | Development language                        | Python 2.7
         | Corpus preprocessing tool                   | Wiki Extractor
         | Word vector training tool                   | word2vec
         | Keyword extraction tool                     | gensim
         | Automatic image annotation evaluation tool  | coco-caption
According to the idea of similarity transfer, the similarity between images and the
relevance of their annotation words are related: the more similar two images are, the
closer their annotation words should be. Therefore, the similarity between images can
be transferred to the relevance between their corresponding annotation words. The
similarity measure between image blocks is used to transfer the similarity relationship
between images during annotation and to transform the similarity between images into
relevance between annotation words. The similarity measure transfer matrix of image I
is defined as $W^{*}$:
In this equation, $f\left(d\right)$ is the transfer function of the similarity measure.
In this paper, all the annotation words corresponding to the partition blocks $I_{i}\,(i=1,2,\ldots
,9)$ of the unknown annotated image I are represented by the annotation word vector $T(t_{1},t_{2},\ldots
,t_{p})$, where p is the total number of annotation words and repeated annotation words
are counted. For the nearest neighbor block j of sub-block i of image I, each annotation
word that appears for block j is marked as 1 in T, and each annotation word that does not
appear is marked as 0; this yields $T'$. We multiply $T'$ by the corresponding similarity
measure transfer value $w_{ij}^{*}$ to obtain the similarity measure transfer vector of
the annotation words:
For the complete image,

$M^{*}$ is the transfer vector of the similarity measure corresponding to the annotation
words of the image. To keep the data on a consistent scale, $M^{*}$ is normalized; a
threshold is then set according to the actual situation, and the annotation words above
the threshold are retained. The automatic labeling of the incomplete image is thus realized.
During the annotation, we set the iteration step $r_{t}=r_{0}/(1+0.001\eta t)$, with
$\eta =10\mathrm{e}{-5}$. The label sub-concept parameter is set to K = 5; K should not
be too large, since an overly large K increases the training time of the algorithm. The
overall algorithm flow is:
(1) Training process:
Input: Training dataset $\left\{\left(B_{1},Y_{1}\right),\left(B_{2},Y_{2}\right),\cdots
,\left(B_{M},Y_{M}\right)\right\}$ and parameters $K$ and $\gamma _{t}$. Output: $u_{lk},V_{lk}\left(l=1,2,\cdots ,L+1;\right.$ $\left.k=1,2,\cdots ,K\right)$.
1) Training.
2) Initialize $u_{lk}$ and $V_{lk}\left(l=1,2,\cdots ,L+1;k=1,2,\cdots ,K\right)$.
3) Circular execution.
The training images to be labeled are divided into regions and segmented, and the
segmented image features are extracted to measure the similarity of the images.
(a) A package B and one of its associated labels are randomly selected.
(b) We obtain the key example and its corresponding sub-concept: $\left(X,k\right)=\arg
\max _{X\in B,k\in \left\{1,\cdots ,K\right\}}f_{yk}\left(X\right)$.
(c) If $y$ is not a virtual label $\hat{y}$, then $\overline{Y}=\overline{Y}\cup \left\{\hat{y}\right\}$.
(d) Circular implementation: $i=1\colon \left| \overline{Y}\right| $.
(e) Random selection of an unrelated tag $\overline{y}$ from the $\overline{Y}$ and
selection of the key example $\overline{X}$ and its corresponding sub-concept $\overline{k}\colon
\left(\overline{X},\overline{k}\right)=\arg \max _{\overline{X}\in B,\overline{k}\in
\left\{1,\cdots ,K\right\}}f_{\overline{y}\overline{k}}\left(\overline{X}\right)$.
(f) If $f_{y}\left(X\right)-1<f_{\overline{y}}\left(X\right)$.
(g) Order $v=i$.
(h) Updating and standardizing $u_{yk},v_{yk},u_{\overline{y}\overline{k}},v_{\overline{y}\overline{k}}$.
4) Ending the cycle: The cycle ends when the termination conditions are met.
(2) Testing stage:
The associated label set for the test package $B_{test}$ is $\left\{l\left| 1+f_{l}\left(B_{test}\right)>f_{\hat{y}}\left(B_{test}\right)\right.\right\}$.
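As a closing illustration, the control flow above can be sketched as follows, assuming a
linear scoring function $f_{lk}\left(X\right)=u_{lk}\cdot X+v_{lk}$ for label l and sub-concept
k and a simple margin-based update; the scoring form, the update rule, the random choice of
a single unrelated label, and the omission of the normalization step are all simplifying
assumptions made only to make the flow concrete, not the authors' exact procedure.

```python
# Schematic skeleton of the training and testing flow described above.
import numpy as np

def train(bags, labels, n_labels, dim, K=5, r0=0.1, eta=1e-5, epochs=10):
    # u[l, k] is the weight vector and v[l, k] the bias of sub-concept k of label l;
    # index n_labels is reserved for the virtual label y_hat.
    u = np.zeros((n_labels + 1, K, dim))
    v = np.zeros((n_labels + 1, K))

    def f(l, k, x):                                   # instance score under (l, k)
        return np.dot(u[l, k], x) + v[l, k]

    def key_example(bag, l):                          # arg max over instances and sub-concepts
        _, i, k = max((f(l, k, x), i, k) for i, x in enumerate(bag) for k in range(K))
        return bag[i], k

    t = 0
    for _ in range(epochs):
        for bag, pos in zip(bags, labels):
            for y in pos:
                t += 1
                r_t = r0 / (1 + 0.001 * eta * t)      # iteration step from the text
                x, k = key_example(bag, y)
                # unrelated labels, plus the virtual label
                neg = [l for l in range(n_labels) if l not in pos] + [n_labels]
                y_bar = int(np.random.choice(neg))
                x_bar, k_bar = key_example(bag, y_bar)
                if f(y, k, x) - 1 < f(y_bar, k_bar, x_bar):   # margin violated
                    u[y, k] += r_t * x
                    v[y, k] += r_t
                    u[y_bar, k_bar] -= r_t * x_bar
                    v[y_bar, k_bar] -= r_t
    return u, v

def predict(bag, u, v, n_labels, K=5):
    def F(l):                                         # bag-level score for label l
        return max(np.dot(u[l, k], x) + v[l, k] for x in bag for k in range(K))
    # labels whose score satisfies 1 + f_l(B_test) > f_yhat(B_test)
    return [l for l in range(n_labels) if 1 + F(l) > F(n_labels)]
```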