2.1 Operation of the SIFT Algorithm in Multi-source Information Image Monitoring
Image registration is an important step in image monitoring: it geometrically aligns multiple images of the same scene that contain overlapping regions and were captured from different viewpoints. This step is essential in multi-source information image monitoring. Current remote sensing image monitoring usually relies on automatic registration, which requires appropriate algorithms. This research uses the SIFT algorithm as the main remote-sensing image registration technology.
The SIFT algorithm is used mainly on general optical images with a small amount of additive Gaussian noise and related problems. Its design is therefore oriented toward general optical images, and the image types to which it applies are limited [16]. In multi-source image monitoring, template registration in the image domain is needed first. The mutual information between the two images is maximized to solve for the best matching model parameters between them.
The entropy of image A is expressed as Eq. (1).

$$H(A)=-\sum_{a}P_{A}(a)\log P_{A}(a) \qquad (1)$$

where $P_{A}(a)$ represents the probability that a certain area in image A appears in the overall information. The entropy of image B is defined in the same way. The joint entropy of images A and B is shown in Eq. (2).

$$H(A,B)=-\sum_{a,b}P_{AB}(a,b)\log P_{AB}(a,b) \qquad (2)$$

where $P_{AB}(a,b)$ represents the probability that a certain area in the overlapping region of images A and B appears in the overall information. According to Eqs. (1) and (2), the mutual information of images A and B is expressed as Eq. (3).

$$I(A,B)=H(A)+H(B)-H(A,B) \qquad (3)$$
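To make Eqs. (1)-(3) concrete, the following minimal Python sketch (an illustration, not the paper's implementation) estimates the mutual information of two equally sized grayscale images from their joint histogram; the function name and the bin count are assumed for the example.

```python
import numpy as np

def mutual_information(img_a, img_b, bins=64):
    """Estimate I(A,B) = H(A) + H(B) - H(A,B) via joint histograms."""
    # Joint histogram of co-occurring intensities approximates P_AB(a, b)
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1)  # marginal P_A(a), as in Eq. (1)
    p_b = p_ab.sum(axis=0)  # marginal P_B(b)

    def entropy(p):
        p = p[p > 0]                    # skip empty bins to avoid log(0)
        return -np.sum(p * np.log2(p))  # Shannon entropy in bits

    return entropy(p_a) + entropy(p_b) - entropy(p_ab.ravel())
```

Maximizing this quantity over candidate transform parameters yields the best matching model between the two images.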
After obtaining the information entropy between the images, the phase correlation algorithm is used to perform the Fourier transform, and the corresponding transform parameters are obtained. Assume that image $f_{1}$ is translated by $(x_{0},y_{0})$, scaled by a factor $\alpha$, and rotated by an angle $\theta$ to yield the linearly transformed image $f_{2}$. Eq. (4) expresses the relationship between $f_{1}$ and $f_{2}$ in terms of their Fourier transforms $F_{1}$ and $F_{2}$ in polar coordinates.
where $\rho$ is the rotation angle corresponding to the peak position after the inverse Fourier transform.
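For the translation-only case, the sketch below recovers $(x_{0},y_{0})$ as the peak of the inverse Fourier transform of the normalized cross-power spectrum; recovering the scale $\alpha$ and rotation $\theta$ additionally requires a log-polar resampling of the Fourier magnitudes, which is omitted here.

```python
import numpy as np

def phase_correlation(f1, f2):
    """Recover the translation between two equally sized grayscale images."""
    F1, F2 = np.fft.fft2(f1), np.fft.fft2(f2)
    cross_power = F1 * np.conj(F2)
    cross_power /= np.abs(cross_power) + 1e-12  # keep phase information only
    corr = np.abs(np.fft.ifft2(cross_power))
    # The peak position gives the shift, modulo the image dimensions
    y0, x0 = np.unravel_index(np.argmax(corr), corr.shape)
    return x0, y0
```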
The SIFT algorithm performs several types of feature extraction. The first is point feature extraction, which relies only on local information and uses a small number of image elements to represent the entire image. In line feature extraction, an edge detection algorithm extracts the line features, which are then matched according to their feature descriptors. Surface feature extraction is performed after the line features are obtained: the closed regions of the image are extracted, a feature is calculated for each closed region, and that feature is used as the description of the region.
Surface features often identify large areas, such as large bodies of water, cities, forests, and deserts. The multispectral images of these areas have different spectral components, which makes them easy to identify and monitor. The virtual structure feature is extracted last, after the surface feature extraction is completed. The virtual structure is built from the above three basic actual structure features, and the corresponding virtual features are obtained by matching according to a similarity criterion [17]. Fig. 1 shows the basic process of image registration with the SIFT algorithm.
In key point detection, the Gaussian scale difference space is generated using the Gaussian function. The scale space of the image, $S(x,y,z)$, is the convolution of the Gaussian function with scale parameter $z$ and the image $s(x,y)$, as shown in Eq. (5).

$$S(x,y,z)=G(x,y,z)\ast s(x,y) \qquad (5)$$

where $\ast$ is the convolution operator and $G(x,y,z)$ is the two-dimensional Gaussian function, calculated as in Eq. (6).

$$G(x,y,z)=\frac{1}{2\pi z^{2}}e^{-\frac{x^{2}+y^{2}}{2z^{2}}} \qquad (6)$$
In the scale space, the layers are organized into groups, with adjacent scales separated by a constant factor, and the response value image $D(x,y,z)$ is calculated as in Eq. (7).

$$D(x,y,z)=\left(G(x,y,kz)-G(x,y,z)\right)\ast s(x,y)=S(x,y,kz)-S(x,y,z) \qquad (7)$$

where $k$ is the constant scale multiple between adjacent layers. Using the response value image simplifies the computation, and the SIFT algorithm also constructs an image pyramid from the response values to realize the operation [18]. The pyramid is divided into $o$ groups of $s$ layers each, and the images of each group are obtained by down-sampling the previous group of images. Adjacent layers within each group correspond to adjacent scales in the scale space. To obtain the response values $D$, a pyramid is constructed from two groups of Gaussian scale-space images, each containing five layers at different scales. Subtracting each pair of adjacent layers yields four layers of $D$ response values, and the key points are detected on these four layers. The second group is obtained by halving the resolution of the first group's scale space, so its image area is 25% of the previous group's. Fig. 2 presents the scale pyramid.
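The two-group, five-layer construction above can be sketched as follows; the base scale $\sigma_{0}=1.6$ and the step $k=\sqrt{2}$ are conventional SIFT choices assumed for the example, not values taken from the paper.

```python
import cv2
import numpy as np

def dog_pyramid(image, octaves=2, layers=5, sigma0=1.6, k=2 ** 0.5):
    """Gaussian scale space S(x,y,z) per group, with adjacent layers
    subtracted to give the D(x,y,z) response images of Eq. (7)."""
    pyramid = []
    base = image.astype(np.float32)
    for _ in range(octaves):
        gauss = [cv2.GaussianBlur(base, (0, 0), sigma0 * k ** s)
                 for s in range(layers)]
        # Five Gaussian layers -> four DoG response images per group
        pyramid.append([g2 - g1 for g1, g2 in zip(gauss, gauss[1:])])
        # Next group: half the resolution, 25% of the image area
        base = cv2.resize(base, (base.shape[1] // 2, base.shape[0] // 2))
    return pyramid
```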
At this time, in SIFT, the relationship among $z$, $o$, and $s$ is expressed as Eq. (8).
where $\omega$ is the number of groups and $\sigma$ is the scale of the reference layer.
After obtaining the key points in the scale space, which are invariant to image scaling and rotation, SIFT uses the principal direction and axis direction of each key point to generate descriptors and keeps these descriptors as invariant as possible. This method completes the monitoring of the image.
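For reference, the registration flow of Fig. 1 maps onto OpenCV's SIFT implementation roughly as below; the file names, ratio-test threshold, and RANSAC tolerance are illustrative assumptions rather than the paper's settings.

```python
import cv2
import numpy as np

# Hypothetical inputs: any grayscale pair with overlapping content
img1 = cv2.imread("scene_a.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene_b.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)  # key points + descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Keep only unambiguous correspondences (Lowe's ratio test)
matches = [m for m, n in cv2.BFMatcher().knnMatch(des1, des2, k=2)
           if m.distance < 0.75 * n.distance]

# Estimate the geometric transform between the two views
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
```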
Fig. 1. Basic flow of the SIFT algorithm registration.
Fig. 2. Schematic diagram of the scale space pyramid.
2.2 Improvement of the SIFT Algorithm based on Gabor and CNN
Although the conventional SIFT algorithm can solve many image monitoring problems, its effect is not ideal on some specific images, such as synthetic aperture radar (SAR) images. In SAR images, the speckle noise of radar imaging cannot be avoided, and the signal-to-noise ratio of the matching feature space is reduced significantly, so the interior (inlier) points are more difficult to filter out during feature matching. In addition, the Gabor filter has advantages in feature detection, and a CNN can grasp local and global features simultaneously when processing images. Combining a Gabor filter with a CNN can therefore achieve intelligent recognition of images. The research combines the Gabor descriptor and CNN with SIFT to form an improved algorithm that solves the monitoring problem of multi-source image information.
Gabor texture features do not depend on the color and brightness of the image and offer high efficiency and homogeneity. In the spatial domain, the Gabor filter is usually regarded as a sinusoidal plane wave modulated by a Gaussian function, and its functional representation is expressed as Eq. (9).
where $\gamma$ is the frequency parameter, $\delta$ is the scale parameter, $\theta$ is the direction parameter, and $(x,y)$ is the pixel coordinate in the spatial domain. Any filter in the bank can be obtained from the mother Gabor filter by translation, rotation, or scale transformation. The Gabor kernel function is the waveform phase function. The Gabor kernel function $X$ is convolved with the image $s(x,y)$, and the resulting Gabor response is expressed as Eq. (10).

$$R(x,y)=X\ast s(x,y) \qquad (10)$$
The Gabor response is complex-valued and constitutes the extracted Gabor feature. The Gabor feature reflects specific characteristics of the image, including its edge direction, texture direction, and scale information. The Gabor filter bank used in the study contains 24 filters over three scales and eight directions. The scale of the Gabor kernel function increases from the top to the bottom of the bank, and the orientation rotates clockwise by ${\pi}$/8 radians from left to right.
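A bank of this shape can be generated with OpenCV's `getGaborKernel`; only the 3 ${\times}$ 8 layout and the ${\pi}$/8 orientation step come from the text, while the kernel sizes, wavelengths, and aspect ratio below are assumed for illustration.

```python
import cv2
import numpy as np

bank = []
for ksize, lambd in [(15, 4.0), (25, 8.0), (35, 16.0)]:  # 3 assumed scales
    for j in range(8):                                    # 8 directions
        theta = j * np.pi / 8                             # pi/8 steps
        # getGaborKernel returns the cosine-phase (real) kernel; pairing
        # psi=0 with psi=pi/2 gives the complex response described above.
        bank.append(cv2.getGaborKernel((ksize, ksize), sigma=0.5 * lambd,
                                       theta=theta, lambd=lambd,
                                       gamma=0.5, psi=0))

image = cv2.imread("scene_a.png", cv2.IMREAD_GRAYSCALE)
responses = [cv2.filter2D(image, cv2.CV_32F, k) for k in bank]  # Eq. (10)
```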
A 24-dimensional local Gabor feature descriptor is selected to reduce the computation time of the Gabor features as much as possible. Following the SIFT descriptor, the width of the support domain window is $m\cdot d\cdot \delta$, where $m$ is the size parameter of the sub-region (set to 3 in this research), $d$ is the number of sub-regions (set to 4), and $\delta$ is, as before, the key point scale parameter. To generate Gabor descriptors, the Gabor filter bank is generated first, and the images of the specific regions required to generate the descriptors are then selected. After obtaining the specific corresponding area, the SIFT algorithm is used to calculate the relevant scale of the feature map, and a feature map of 33 ${\times}$ 33 pixels is obtained. Finally, a Gaussian weighted average is computed, and the resulting eigenvalue vector is the Gabor descriptor [19]. After obtaining the Gabor descriptor, the feature vector needs to be normalized, as shown in Eq. (11).

$$F'_{Gabor}=\frac{F_{Gabor}}{\left\| F_{Gabor}\right\| } \qquad (11)$$
In Eq. (11), $F_{Gabor}$ is the descriptor before normalization and $F'_{Gabor}$ is the descriptor after normalization. For the key points $\delta$ of each scale, the research uses two mutually nested support regions as the basis for generating descriptors to complete the fusion of the descriptors. The two support regions are of different sizes, but each contributes a 24-dimensional descriptor.
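A minimal sketch of the normalization and fusion steps, reading Eq. (11) as an L2 normalization (the usual choice) and using stand-in vectors for the two nested support regions:

```python
import numpy as np

def normalize_descriptor(f_gabor):
    """Eq. (11) read as F'_Gabor = F_Gabor / ||F_Gabor||."""
    norm = np.linalg.norm(f_gabor)
    return f_gabor / norm if norm > 0 else f_gabor

# Stand-in 24-D descriptors from the inner and outer support regions
inner, outer = np.random.rand(24), np.random.rand(24)
fused = np.concatenate([normalize_descriptor(inner),
                        normalize_descriptor(outer)])
```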
The CNN provides the underlying convolution operations within the SIFT and Gabor processing and, because of its sensitivity to image features, strengthens the entire algorithm for multi-source information image monitoring. The components of a CNN include the convolution operator, convolution kernel, convolution layer, and pooling layer. Its structure is divided into an input layer, convolution layers, activation functions, pooling layers, and a fully connected layer. Fig. 3 presents this basic structure.
Fig. 3. Basic structure diagram of the CNN.
Fig. 4. Schematic diagram of the pooling operation.
The input layer is a pixel matrix on which various processing steps are performed, including data normalization, dimensionality reduction, pixel correction, and scale normalization. The convolutional layer contains multiple feature maps. By learning the feature expression, local areas are processed through local perception, with the corresponding feature data as the processing object. A synthesis operation is then performed on each part, integrating the information of all parts into global information through the convolution operation. The convolution process is relatively stable because parameter sharing keeps the convolution kernel weights unchanged during convolution. The output of the convolutional layer is fed into the activation layer, where the activation function is applied. The activation function is generally the sigmoid function; in some cases, a Gaussian kernel function or spatial function can be used [20]. The activation function performs a nonlinear mapping, which allows the convolution layers to extract more abstract features and improves the capability of the convolutional neural network. The sigmoid function is shown in Eq. (12).

$$\theta(x)=\frac{1}{1+e^{-x}} \qquad (12)$$
where $\theta$ is the mapping of the sigmoid function. The $k$-th feature map $f_{k}$ obtained after the sigmoid operation is expressed as Eq. (13).

$$f_{k}=\theta\left(W\ast x+b\right) \qquad (13)$$

where $x$ is the input value, $W$ is the weight, and $b$ is the bias. The pooling layer sits between two convolutional layers. Its function is to reduce the size of the parameter matrix and the overall number of parameters passed to the fully connected layer. Pooling operations usually include max pooling and average pooling. Fig. 4 shows the two pooling operations.
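The per-layer computations of Eqs. (12) and (13) and the pooling of Fig. 4 can be written out in plain numpy as a naive reference sketch, assuming single-channel inputs, "valid" borders, and non-overlapping pooling windows:

```python
import numpy as np

def sigmoid(x):
    # Eq. (12): squashes activations into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(x, w, b):
    """Eq. (13) in spirit: f_k = sigmoid(W * x + b) for one feature map."""
    kh, kw = w.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return sigmoid(out)

def pool(x, size=2, mode="max"):
    """Non-overlapping max or average pooling (Fig. 4)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))
```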
The pooling layer affects the parameters of the fully connected layer. The fully connected layer is generally at the end of the entire CNN structure and usually comprises several layers. Its main function is to combine, through weight operations, the local features extracted by the convolutional layers into a whole, producing a more complete and hierarchical overall feature. Assuming an initial input feature map $x_{j}$ for each convolutional layer, the convolution operation is expressed as Eq. (14).

$$x_{j}=f\left(\sum_{i\in M_{j}}x_{i}\ast k_{ij}+b_{j}\right) \qquad (14)$$

where $f(x)$ is the activation function, $M_{j}$ represents the set of initial feature maps, $i$ is the matching result, and $k_{ij}$ is the convolution kernel between the $i$-th input feature map and the $j$-th output feature map.
To solve for the weights and update values of all neurons in layer $l$, the sensitivity at each node must be found first. The sensitivity $\theta$ is then calculated, and the parameters required by layer $l$ are deduced from it. The sensitivities $\theta_{j}^{l+1}$ of layer $l+1$ are multiplied by the corresponding weights $W$ and summed, and the result is multiplied by the derivative of the activation function, $f'(u^{l})$, to obtain Eq. (15).

$$\theta^{l}=\left(W^{l+1}\right)^{T}\theta^{l+1}\circ f'(u^{l}) \qquad (15)$$
where $u^{l}$ is the input value of the neurons of layer $l$. A rectified linear activation is applied in the CNN structure. The 1${\times}$1 convolution can reduce the dimension of the feature maps to expand the application scale of the network, increase the width and depth of the convolutional neural network, and improve its application performance. The operation process of the CNN mainly strengthens the image features, which can theoretically shorten the operation time and increase the recognition accuracy.
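Assuming a sigmoid activation for illustration, so that $f'(u)=f(u)(1-f(u))$, the sensitivity propagation of Eq. (15) can be sketched as follows; the names and shapes are assumptions, not the paper's notation.

```python
import numpy as np

def layer_sensitivity(delta_next, W_next, u_l):
    """Sensitivity of layer l: the weighted sum of layer l+1 sensitivities,
    scaled elementwise by the activation derivative at the input u^l."""
    f_u = 1.0 / (1.0 + np.exp(-u_l))          # sigmoid f(u^l)
    return (W_next.T @ delta_next) * f_u * (1.0 - f_u)
```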
Gabor and CNN are integrated into the SIFT algorithm to form the Gabor-CNN-SIFT algorithm.
Fig. 5 presents the flow of the entire algorithm.
The improved algorithm was simulated and compared with other algorithms. The samples for the simulation experiment were 240 different multi-source information images. In addition to the improved Gabor-CNN-SIFT algorithm, the two comparison algorithms were Gabor-SIFT and plain SIFT, where Gabor-SIFT combines Gabor with SIFT but does not incorporate the CNN. The samples were randomly divided into two sets, a test set and a validation set, to obtain the test results more accurately.
Fig. 5. Basic flow chart of the Gabor-CNN-SIFT algorithm.