In this study, a model for pattern element feature extraction and matching and an image perceptual hash model are constructed. On this basis, a perceptual hash model combined with the SURF algorithm is introduced. To enhance rotation and translation invariance, a Siamese network model with an improved SURF algorithm is then proposed.
3.1 Pattern element feature extraction and matching and image perceptual hash model construction
When studying the feature extraction of Mazu patterns in the design of homestay spaces, the SURF algorithm was used to extract Mazu features. Compared with other feature extraction algorithms, SURF reduces computational complexity while retaining the stable performance of the SIFT algorithm. At the same time, SURF is robust and achieves a higher feature point recognition rate than SIFT; it is generally superior to SIFT under viewpoint, lighting, and scale changes. The research first focuses on identifying and classifying key areas in the image, especially when objects are moved, distorted, or occluded. Compared with global features, local features adapt better to these changes and effectively reduce mismatches. Local feature extraction involves key point detection, feature description, matching, and classification. Key point detection relies on the definition of salient points, and the descriptors are built from the characteristics of neighboring pixels. In addition, the difference-of-Gaussians operator is introduced, which is more efficient than the Hessian-Laplace operator in processing the local structure of the image, as shown in Eq. (1).
In Eq. (1), the expression on the left can be approximated as shown in Eq. (2).
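As a concrete illustration (not the paper's own implementation), a difference-of-Gaussians response of the kind referenced by Eqs. (1)-(2) can be sketched in Python with SciPy; the scale `sigma` and the ratio `k` are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussians(image: np.ndarray, sigma: float, k: float = 1.6) -> np.ndarray:
    """Difference of two Gaussian blurs; approximates the scale-normalized Laplacian."""
    img = image.astype(np.float64)  # avoid uint8 wrap-around on subtraction
    return gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)

# Candidate key points are the local extrema of this response across space and scale.
```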
In the field of feature point detection, both the Harris-Affine and Hessian-Affine operators are derived from the Harris-Laplace algorithm. These two operators relax the isotropy limitation of the original algorithm and introduce invariance to affine transformations. They enhance image recognition and retrieval by iteratively adjusting the position and scale of feature points. In contrast, the MSER operator achieves affine invariance through binarization [16]. The stable extremal regions identified by MSER are defined in Eq. (3).
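For reference, OpenCV ships an MSER detector; the following hedged sketch (the file name is hypothetical) extracts stable extremal regions of the kind described by Eq. (3):

```python
import cv2

# "mazu_pattern.png" is a hypothetical file name; any grayscale image works.
gray = cv2.imread("mazu_pattern.png", cv2.IMREAD_GRAYSCALE)

mser = cv2.MSER_create()                   # sweeps binarization thresholds internally
regions, boxes = mser.detectRegions(gray)  # each region is an array of pixel coordinates
print(f"{len(regions)} maximally stable extremal regions found")
```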
In the face of huge image datasets, traditional data processing methods are no longer suitable. In this study, a Perceptual Hashing (PH) model is proposed that converts key image content into unique binary sequences, thereby simplifying the storage and management of images; in image search in particular, it significantly improves query efficiency [17,18]. The perceptual hash model is shown in Fig. 1.
Fig. 1 shows that the perceptual hash model resembles the digitization of multimedia content: it creates a one-way mapping that serves as a unique fingerprint of the content and ensures the security and robustness of the technique. The feature volume of datasets A1, A2, and A3 is significantly reduced after perceptual hashing, while the processed features still carry the key information of the original data. The perceptual hashing algorithm effectively converts large-scale data objects into compact binary codes, and this conversion preserves a degree of consistency among similar data objects. The core of perceptual hashing technology, i.e., its mapping mechanism, is defined in Eq. (4).
In Eq. (4), $I$ represents the input data, $h$ represents the result of the mapping, and $PH$ stands for the perceptual hash model. In image retrieval, the hash code of each image is stored in the database, and similar images are found by comparing the hash value of the query image with the values in the database. Hash comparison typically uses the Hamming distance, one of the standard measures in machine learning for the similarity of two pieces of data.
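A minimal sketch of this retrieval step, using a simple average hash as a stand-in for the paper's $PH$ mapping (the actual mapping is developed through Eqs. (4)-(12)); an 8-bit grayscale input is assumed:

```python
import cv2
import numpy as np

def average_hash(image: np.ndarray, size: int = 8) -> np.ndarray:
    """Stand-in for PH in Eq. (4): downscale, then threshold at the mean gray level."""
    small = cv2.resize(image, (size, size), interpolation=cv2.INTER_AREA)
    return (small > small.mean()).astype(np.uint8).ravel()  # h: a 64-bit binary sequence

def hamming_distance(h1: np.ndarray, h2: np.ndarray) -> int:
    """Number of differing bits; a small distance means perceptually similar images."""
    return int(np.count_nonzero(h1 != h2))
```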
Fig. 1. PH mapping model structure diagram.
3.2 Feature extraction of pattern elements based on SURF algorithm
In the design of homestay (B&B) spaces, the key to feature extraction of Mazu patterns lies in effective image analysis methods. Traditionally, image retrieval has relied on perceptual hashing techniques, including grayscale-based thresholds, frequency thresholds based on the discrete cosine transform, and multi-dimensional global feature methods. These focus mainly on global properties of the image and have low sensitivity to local details. By contrast, the SURF algorithm uses the Hessian matrix to identify local extrema, which improves the accuracy of feature extraction. Although SURF falls short of real-time performance, its accuracy is remarkable. In this study, building on the feature extraction and matching of pattern elements and the image perceptual hash model, a perceptual hash model combined with the SURF algorithm is further proposed to improve retrieval efficiency and accuracy; its core principle is shown in Eq. (5).
In Eq. (5), $L_{xx} (\varepsilon ,\partial )$ denotes the convolution of the image at point $\varepsilon $ with the second-order Gaussian derivative at scale $\partial $. In practice, this expression is approximated because the Gaussian kernel must be discretized, as shown in Eq. (6).
In Eq. (6), $D_{xx} $ and $D_{yy} $ are box-filter approximations of $L_{xx} $ and $L_{yy} $, respectively. This approximation makes the identification of point $\varepsilon $ as a local extremum more efficient, especially when the computation is accelerated with integral images. The SURF algorithm builds the scale space by adjusting the filter size and selects the extrema as feature points. In addition, SURF refines the accuracy of feature points through interpolation, using the Hessian matrix and an interpolation algorithm to locate them precisely. The final selection of feature points depends on their stability, and unstable points are discarded, as shown in Eq. (7) [19].
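To make the box-filter approximation concrete, the hedged sketch below scores candidate points with the determinant form commonly used in the SURF literature, $D_{xx}D_{yy} - (0.9\,D_{xy})^2$; the 0.9 weight and the threshold value are assumptions, not values taken from this paper:

```python
import numpy as np

def hessian_response(Dxx: np.ndarray, Dyy: np.ndarray, Dxy: np.ndarray) -> np.ndarray:
    """Approximate Hessian determinant used to score candidate points.

    Dxx, Dyy, Dxy are box-filter responses (integral-image accelerated in
    practice); 0.9 compensates for approximating Gaussians with box filters.
    """
    return Dxx * Dyy - (0.9 * Dxy) ** 2

def is_stable(response: np.ndarray, y: int, x: int, threshold: float = 0.001) -> bool:
    """Keep a candidate only if it is a thresholded local maximum of its 3x3 neighborhood."""
    patch = response[y - 1:y + 2, x - 1:x + 2]
    return response[y, x] >= threshold and response[y, x] == patch.max()
```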
The circular area around each feature point is analyzed, and the Haar wavelet responses along the $x$ and $y$ axes are calculated. These responses are accumulated with a Gaussian-weighted sliding sector window to determine the main direction of the feature point, as shown in Fig. 2.
Fig. 2. Main direction diagram of feature points.
Next, for each feature point, a square region aligned with its main direction is selected and divided into 16 sub-regions. The Haar wavelet response is calculated for each sub-region and Gaussian-weighted. The resulting 4-dimensional vectors are assembled into a 64-dimensional feature vector. These vectors are normalized to ensure rotation, illumination, and scale invariance, as shown in Fig. 3.
Fig. 3. SURF feature point description substructure diagram.
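In practice, OpenCV's contrib build exposes this detect-and-describe pipeline directly. The sketch below assumes opencv-contrib-python compiled with the nonfree modules enabled (SURF is patented and disabled in the default wheels); the Hessian threshold and file name are illustrative:

```python
import cv2

gray = cv2.imread("mazu_pattern.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # threshold is illustrative
keypoints, descriptors = surf.detectAndCompute(gray, None)
print(descriptors.shape)  # (k, 64): one normalized 64-dimensional vector per point
```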
The direct application of the SURF algorithm to image retrieval is limited by its high time complexity. It consists of two steps, feature detection and descriptor construction, which ensure scale and rotation invariance, respectively. Building the scale pyramid is the time-consuming step, and although it improves scale adaptability, the neighborhood differences between feature points remain significant. To reduce this effect, this study fuses scale transformation with perceptual hash encoding to create a hashing algorithm resistant to rotation and scale changes. The SURF algorithm is first used to locate the image feature points, as shown in Eq. (8).
In Eq. (8), the total number of identified feature points is denoted as $k$. Subsequently, the K-means algorithm determines the center point of the set $P$, as defined in Eq. (9). Next, the Euclidean distances from all elements of the set $P$ to the pixel $(x_{z} ,y_{z} )$ are calculated and arranged in ascending order. The sorted result is shown in Eq. (10).
Next, $R$ is set to $10/k$, and a series of concentric circles is drawn with $(x_{z} ,y_{z} )$ as the center and radii $R/64$, $R/32$, ..., $R$ in turn. The number of feature points within each ring is counted, as shown in Eq. (11). To strengthen the encoding against rotation and scale changes, the image is scaled by factors from 0.5 to 4. The transformed image is processed according to Eqs. (8) to (11) to calculate the $N_{i} $ values, forming the set $K=\{ N_{1} , N_{2} , N_{3} , ..., N_{8} \} $. $K$ is then hashed to obtain the code $h_{2} $, and the final code is synthesized as $h=[h_{1} \; h_{2} ]$, as shown in Eq. (12).
3.3 Grayscale histogram and Siamese network feature extraction based on improved SURF
algorithm
To improve the accuracy of the retrieval algorithm, the global features of the image are incorporated into the coding results. A vector representation of the image is generated from its grayscale histogram, and cosine similarity measures the similarity between vectors. The grayscale histogram counts the frequency of each grayscale value in the image and reflects the global intensity distribution. Combined with the SURF-based perceptual hashing algorithm above, this method effectively improves sensitivity to global features [20]. First, the pixels of the grayscale image are counted and the histogram is divided into 64 bins to generate the vectors; the global grayscale feature similarity is then compared via the cosine distance, as shown in Eq. (13).
In Eq. (13), $\theta $ represents the angle between the vectors, and $a$ and $b$ represent the histogram vectors of the two images, respectively.
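As a concrete illustration of this global-feature comparison, a minimal sketch (assuming an 8-bit grayscale input):

```python
import numpy as np

def gray_histogram(image: np.ndarray, bins: int = 64) -> np.ndarray:
    """64-bin grayscale histogram used as the global feature vector."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist.astype(np.float64)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) of Eq. (13); 1.0 means identical gray distributions."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```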
Because the traditional SURF algorithm mainly recognizes low-level features such as edge and brightness changes, it is insufficient for revealing high-level semantics, which limits how accurately the search results reflect user intent. To make up for this shortcoming, a
Siamese network model with an improved SURF algorithm is proposed. The model transforms the input into a target space via a mapping function, where Euclidean distance is used for similarity comparison. The training phase aims to minimize the distance between samples of the same class and maximize the distance between samples of different classes. Convolutional neural networks process images through local feature abstraction, but the extracted features differ significantly under large rotations or translations. To enhance rotation and translation invariance, a module is integrated in front of the Siamese network: it first extracts SURF features, then matches them with a nearest-neighbor algorithm, and finally estimates the parameters of the associated affine transformation model, as shown in Fig. 4.
Fig. 4. Anti-rotation and anti-translation conversion module.
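A hedged sketch of this correction module. Because SURF requires a nonfree OpenCV build, the freely licensed ORB detector is substituted here (an assumption); the match-then-estimate-affine structure follows Fig. 4:

```python
import cv2
import numpy as np

def rectify_rotation_translation(img: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Align img to ref before the Siamese branches, as sketched in Fig. 4."""
    detector = cv2.ORB_create()
    kp1, des1 = detector.detectAndCompute(img, None)
    kp2, des2 = detector.detectAndCompute(ref, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # Rotation + translation (+ uniform scale) model, robust to mismatches via RANSAC.
    M, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    return cv2.warpAffine(img, M, (ref.shape[1], ref.shape[0]))
```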
Because the fully connected layer fixes the input image size, it is recommended to add a spatial pyramid pooling (SPP) layer before the fully connected layer of the network. This enables the network to process input images of any size, enhancing its scale invariance. SPP extracts fixed-dimensional feature vectors from feature maps: for example, with a simple pyramid of 4x4, 2x2, and 1x1 grids, a fixed 21-dimensional vector (16+4+1 bins) can be extracted per feature map regardless of the input image size, as shown in Fig. 5.
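A minimal SPP sketch in PyTorch, assuming the 4x4/2x2/1x1 pyramid described above; adaptive pooling is what makes the output size independent of the input resolution:

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_map: torch.Tensor, levels=(4, 2, 1)) -> torch.Tensor:
    """Pool a (batch, channels, H, W) map into fixed 16+4+1 = 21 bins per channel."""
    pooled = [F.adaptive_max_pool2d(feature_map, n).flatten(start_dim=1) for n in levels]
    return torch.cat(pooled, dim=1)  # shape: (batch, channels * 21)

x = torch.randn(1, 8, 37, 53)         # arbitrary spatial size
print(spatial_pyramid_pool(x).shape)  # torch.Size([1, 168]) regardless of H and W
```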
Each image in the initial image and its scale-pyramid group is used as a matching pair. Unlike the earlier Siamese network, this network adds a regularization term to the objective function to enhance scale invariance, so that images from the same scale group produce more consistent features. The parameters of the scale-invariant layer are denoted as $(W_{a} ,W_{b} )$, and its output is shown in Eq. (14).
In Eq. (14), $\kappa $ denotes the activation $\max (x,0)$; $O_{m} $ represents the input of the scale-invariant layer; $B_{a} $ represents the regularization term. In this study, the Siamese network model based on the SURF algorithm is improved, and its invariance to rotation, scale, and translation is enhanced. The network can process images of any size and consists of two branches that share parameters. Each image is corrected by the rotation and translation module and processed by a convolutional neural network. The SPP layer extracts fixed-dimensional features from the feature map and passes them to the fully connected layer. Finally, the network outputs feature vectors, with the training objective of minimizing the loss between similar images and keeping their feature vectors consistent, as shown in Fig. 6.
Fig. 6. Siamese network model structure diagram.
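For completeness, a standard contrastive objective is sketched below as a stand-in for the paper's training loss; the margin value and the pair-label convention are assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f1: torch.Tensor, f2: torch.Tensor,
                     same: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Pull features of similar pairs together; push dissimilar pairs beyond the margin.

    f1, f2: feature vectors from the two weight-sharing branches;
    same:   1 for image pairs of the same class/scale group, 0 otherwise.
    """
    d = F.pairwise_distance(f1, f2)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()
```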