In the context of the information explosion and the growth of multimedia, accurately obtaining effective information has become a key concern for researchers, and cross-modal graphic search (CMGS) plays a vital role in addressing it. To further improve the accuracy of CMR, this study builds a DL-based GFEM and a GF-based CMHR model.
3.1. Deep Learning-based Graphic Feature Extraction Model Construction
Graphic fusion learning aims to deeply explore the shared semantic information and potential correlations between image-modality and text-modality data, and to generate high-quality feature representations or decision results, with modality-specific information serving as supplementary input. Efficient analytical modeling of graphic fusion helps the model better understand the current task scenario and make decisions accordingly.
Graphic fusion strategies are categorized into hybrid fusion, decision-level fusion
and feature-level fusion according to the fusion stage, and the specific framework
is shown in Fig. 2.
Fig. 2. Image and text fusion strategy diagram.
Among them, feature-level fusion better captures potential correlations at the feature level, but it is prone to overfitting the training data. Decision-level fusion can effectively handle the asynchrony between heterogeneous data and better learn modality-specific information, but it ignores the correlations between heterogeneous modality feature representations and is difficult to implement. Hybrid fusion combines the advantages of feature-level and decision-level fusion. Therefore, this study adopts a hybrid fusion strategy, accounting for the overfitting problem that feature-level fusion faces on small-sample datasets and drawing on the modality-specific information preservation idea of decision-level fusion.
The BERT model, a deep bidirectional Transformer, and its extensions have achieved state-of-the-art results in a variety of natural language processing tasks, so this study adopts a pre-trained BERT model for text feature extraction [18]. To fully exploit the features of the 12-layer encoder network that makes up BERT, give the model a more suitable basis for discrimination, and improve its robustness and generalization ability, this study introduces the concept of a joint output that combines the outputs of multiple hidden layers, as shown in Eq. (1).
In Eq. (1), $L$ denotes the number of hidden layers in the BERT model. Because splicing the outputs of all hidden layers into a joint output would make the dimensionality too high, this study extracts features from the bottom, middle, and top layers of BERT and splices them as indicated by Eq. (2). In Eq. (2), $H_{i_j}$ is the output of the $i_j$-th selected hidden layer and $k$ is the number of hidden layers selected.
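As an illustration, the joint output described above can be sketched with the Hugging Face transformers library; this is a minimal sketch, and the particular layer indices (1, 6, and 12 as the bottom, middle, and top encoder layers) are assumptions rather than the paper's exact choice.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def joint_text_features(sentences, layers=(1, 6, 12)):
    """Splice the outputs of selected bottom / middle / top encoder layers
    into a joint text representation."""
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple of 13 tensors: the embedding layer plus the 12 encoder layers.
    hidden = outputs.hidden_states
    return torch.cat([hidden[i] for i in layers], dim=-1)   # (batch, seq_len, 3 * 768)

features = joint_text_features(["A dog plays in the park."])
```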
For the extraction of image features, the DenseNet network is used in this study.
DenseNet is a DNN in which the output of each layer is passed not only to the next layer but also directly to all subsequent layers. This dense connectivity enables better information flow and thus improves the effectiveness of the network [19]. The core structure of DenseNet is the dense block; the output of each dense block is compressed by a transition layer that reduces the number of output channels, thereby reducing the model parameters and accelerating training, as shown in Eq. (3).
In Eq. (3), $H_l$ denotes the convolution operation of the $l$-th layer, $x_l$ denotes the output of the dense block, $f_l$ denotes the compression operation of the $l$-th transition layer, and $x_{l-1}$ denotes the input to the transition layer. In addition, this study changes the last fully connected layer of the DenseNet network so that the sequence length of its output is consistent with that of the text features.
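As a concrete illustration of this image branch, the following is a minimal PyTorch sketch using torchvision's densenet121, in which the final fully connected layer is replaced so the output can be reshaped into a sequence; the sequence length and feature dimension below are illustrative assumptions chosen to match the text features.

```python
import torch
import torch.nn as nn
from torchvision import models

SEQ_LEN, FEAT_DIM = 32, 768   # illustrative values assumed to match the text-feature shape

class ImageEncoder(nn.Module):
    """DenseNet backbone whose last fully connected layer is replaced so that the
    image features can be reshaped into a sequence matching the text features."""
    def __init__(self):
        super().__init__()
        backbone = models.densenet121(weights=None)
        in_features = backbone.classifier.in_features          # 1024 for densenet121
        backbone.classifier = nn.Linear(in_features, SEQ_LEN * FEAT_DIM)
        self.backbone = backbone

    def forward(self, images):                                  # images: (batch, 3, 224, 224)
        flat = self.backbone(images)
        return flat.view(-1, SEQ_LEN, FEAT_DIM)                 # (batch, SEQ_LEN, FEAT_DIM)
```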
Different modalities have different forms of features, and the feature fusion process is prone to losing unimodal information, so intra-modal enhancement of the features is required before fusion. Transformer is a powerful sequence model that has achieved great success in feature extraction and enhancement, but its time and memory requirements grow quadratically with sequence length [20]. For this reason, this study employs the gated attention unit (GAU), which combines
a gated linear unit and an attention mechanism, to reduce the burden of self-attention
in Transformer. The gated linear unit is shown in Eq. (4).
In Eq. (4), $X$ denotes the original input vector, $\varphi $ denotes the activation function, $W$ denotes the weight matrix, $\odot $ denotes element-wise multiplication, and $O$ denotes the output. The core idea of the gated attention unit is to combine attention and the gated linear unit into a single layer; its calculation process is shown in Eq. (5).
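The following is a minimal PyTorch sketch of such a gated attention unit, following the published GAU design with squared-ReLU attention; the expansion factor, the shared dimension of $Z$, and the use of per-dimension scale-and-offset transforms for $Q$ and $K$ are assumptions consistent with that design rather than details taken from this paper. The symbols mirror those defined for Eq. (5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionUnit(nn.Module):
    """Single GAU layer: a gated linear unit whose value branch is reweighted by
    a lightweight squared-ReLU attention computed from a shared representation Z."""
    def __init__(self, dim, expansion=2, shared_dim=128):
        super().__init__()
        hidden = dim * expansion
        self.to_u = nn.Linear(dim, hidden)       # gate branch U
        self.to_v = nn.Linear(dim, hidden)       # value branch V
        self.to_z = nn.Linear(dim, shared_dim)   # shared representation Z
        # cheap per-dimension scale-and-offset transforms producing Q and K from Z
        self.q_scale = nn.Parameter(torch.ones(shared_dim))
        self.q_bias = nn.Parameter(torch.zeros(shared_dim))
        self.k_scale = nn.Parameter(torch.ones(shared_dim))
        self.k_bias = nn.Parameter(torch.zeros(shared_dim))
        self.to_out = nn.Linear(hidden, dim)

    def forward(self, x, rel_bias=0.0):          # x: (batch, n, dim); rel_bias plays the role of b
        n = x.size(1)
        u = F.silu(self.to_u(x))
        v = F.silu(self.to_v(x))
        z = F.silu(self.to_z(x))
        q = z * self.q_scale + self.q_bias       # cheap transformation Q
        k = z * self.k_scale + self.k_bias       # cheap transformation K
        a = F.relu(q @ k.transpose(-2, -1) / n + rel_bias) ** 2   # attention weights A
        return self.to_out(u * (a @ v))          # gated output O
```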
In Eq. (5), $Z$ refers to the shared representation, $Q$ and $K$ denote two cheap transformations of $Z$, $b$ denotes the relative positional bias, and $A$ denotes the attention weights. In this study, an autoencoder is used to fuse the image and text features, and the loss function of the feature fusion module is shown in Eq. (6).
In Eq. (6), the input and output text data are represented by $x_i$ and $\tilde{x}_i$, the number of modalities is represented by $M$, and the input and output image data are represented by $y_i$ and $\tilde{y}_i$, respectively.
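A minimal sketch of an autoencoder-style fusion module of this kind is given below; the layer sizes and the use of a mean-squared reconstruction error are assumptions, and the loss simply sums the per-modality reconstruction terms in the spirit of Eq. (6).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionAutoencoder(nn.Module):
    """Fuses enhanced text and image features into a shared code and reconstructs
    both modalities; the reconstruction error serves as the fusion loss."""
    def __init__(self, text_dim, image_dim, shared_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(text_dim + image_dim, shared_dim), nn.ReLU())
        self.text_decoder = nn.Linear(shared_dim, text_dim)
        self.image_decoder = nn.Linear(shared_dim, image_dim)

    def forward(self, x_text, y_image):
        z = self.encoder(torch.cat([x_text, y_image], dim=-1))   # shared fused representation
        return z, self.text_decoder(z), self.image_decoder(z)

def fusion_loss(x, y, x_rec, y_rec):
    """Sum of the reconstruction errors of the M = 2 modalities."""
    return F.mse_loss(x_rec, x) + F.mse_loss(y_rec, y)
```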
3.2. Cross-modal Hash Retrieval Model Building Based on Graphic Features
Once the image and text features are extracted, the study proposes a CMHR model based on multilevel adversarial attention, consisting of a modal attention module, a modal adversarial module, and a hash learning (HL) module, to further remove the heterogeneous differences between modalities and to supplement the fine-grained information of each modality. The channel attention mechanism can effectively weight and optimize the features of a single modality to improve their representational ability. In cross-modal hash retrieval, applying channel attention independently to the image and text features better preserves the unique information of each modality, so this study uses the channel attention mechanism in the modal attention module. The channel attention mechanism mimics the human behavior of making decisions from only a portion of the available data: it models each modality's data according to feature importance and assigns weights to the features based on their inputs. The mechanism is straightforward and efficient, and it includes two pooling operations, maximum pooling and average pooling, which aggregate the modality features, with maximum pooling additionally gathering spatial information. Fig. 3 illustrates the specific operation of the channel attention mechanism.
Fig. 3. Working principle diagram of channel attention mechanism.
Eq. (7) shows the channel attention computation for the image modality: the features of the text and image modalities are obtained and fed into a shared network to form a one-dimensional attention map.
In Eq. (7), $\rho $ denotes the sigmoid function, $F^{v} $ denotes the image features, $W$ denotes
the weights, and $F_{avg}^{v} $ and $F_{\max }^{v} $ denote the average pooling and
maximum pooling of the image, respectively. For the visual modality, Eq. (8) shows the pertinent and irrelevant information.
In Eq. (8), $F^{rv} $ denotes the relevant information, $F^{irv} $ denotes the irrelevant information, and $\otimes $ denotes element-wise multiplication. Similarly, the one-dimensional text channel attention map is shown in Eq. (9). In Eq. (9), $F^{t} $ denotes the text features, and $F_{avg}^{t} $ and $F_{\max }^{t} $ denote the average and maximum pooling of the text, respectively. Eq. (10) shows the computation of the relevant and irrelevant textual modality information.
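A minimal sketch of this channel attention computation for one modality is given below; the reduction ratio of the shared network and the treatment of features as a (batch, length, channels) sequence are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention over a feature sequence of shape (batch, length, channels):
    average- and max-pooled descriptors pass through a shared network and a sigmoid
    to give a one-dimensional attention map."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.shared_net = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):                                 # f: (batch, length, channels)
        avg = self.shared_net(f.mean(dim=1))              # average-pooled descriptor
        mx = self.shared_net(f.max(dim=1).values)         # max-pooled descriptor
        m = self.sigmoid(avg + mx).unsqueeze(1)           # one-dimensional attention map
        f_rel = f * m                                     # relevant information
        f_irr = f * (1.0 - m)                             # irrelevant information
        return f_rel, f_irr
```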
Each modality contains both relevant and irrelevant feature information, and the modal attention module obtains information with clear feature associations. The multilevel modal adversarial module, which includes intra-modal adversarial learning and inter-modal adversarial learning, is designed to convey the correlation representation of each modality more effectively. A generative adversarial network combines a generative model, which captures the data distribution, with a discriminative model, which judges the source of a sample; the main role of the discriminative model is to judge whether a sample comes from the generative model or from the real data [21]. Through this adversarial process, the generative model ultimately estimates the real data distribution. Fig. 4 illustrates the specific working principle.
Fig. 4. Schematic diagram of generating adversarial networks.
Through intra-modal adversarial learning, each modality's important feature information can be enhanced with irrelevant background information, giving each modality a richer set of relevant data. A generator is a key component of a generative adversarial network (GAN): it receives random noise or other samples as input and learns to generate data samples that are as similar to real data samples as possible. Generators typically interact with discriminators during adversarial training, aiming to get close enough to the real data samples to deceive the discriminators. The discriminator and generator learn in an adversarial manner, supplementing the relevant information of the modality with irrelevant background information. The objective functions of the discriminator and generator for adversarial learning within the image modality are shown in Eq. (11).
In Eq. (11), $\theta_D$ denotes the parameters of the discriminator $D$, $f_{ri}^{v} $ and $f_{iri}^{v} $ denote the intra-modal relevant and irrelevant information of the $i$-th image instance, $G_{r}^{v} $ and $G_{ir}^{v} $ denote the generators of image-relevant and image-irrelevant information, respectively, $\theta _{r}^{v} $ and $\theta _{ir}^{v} $ denote the parameters of these generators, and $v_i$ denotes the $i$-th image. Since intra-modal adversarial learning is symmetric for the two modalities, the discriminator and generator objective functions for text intra-modal adversarial learning have the same form. The discriminator and generator objective functions for inter-modal adversarial learning are shown in Eq. (12).
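To make the adversarial interplay concrete, the following is a minimal PyTorch sketch of one intra-modal adversarial update in the spirit of Eq. (11); the discriminator architecture, the binary cross-entropy formulation, and the choice to update only the irrelevant-information branch in the generator step are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Small MLP that predicts whether a feature vector carries modality-relevant information."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, f):
        return self.net(f)

bce = nn.BCEWithLogitsLoss()

def intra_modal_adversarial_step(disc, f_rel, f_irr, opt_d, opt_g):
    """One adversarial update; f_rel and f_irr are assumed to be the outputs of the
    upstream relevant / irrelevant feature generators (whose parameters opt_g updates)."""
    ones = torch.ones(f_rel.size(0), 1)
    zeros = torch.zeros(f_irr.size(0), 1)
    # Discriminator step: label relevant features 1 and irrelevant features 0.
    opt_d.zero_grad()
    d_loss = bce(disc(f_rel.detach()), ones) + bce(disc(f_irr.detach()), zeros)
    d_loss.backward()
    opt_d.step()
    # Generator step: push the irrelevant branch to look relevant, so background
    # information supplements the modality's relevant representation.
    opt_g.zero_grad()
    g_loss = bce(disc(f_irr), ones)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```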
In summary, the modal adversarial module can bridge the heterogeneity gap between modalities, make the relevant feature information more tightly distributed, and improve the accuracy of CMR; its objective function is shown in Eq. (13).
In Eq. (13), $\gamma $ denotes the hyperparameter. A hash function is a one-way mapping that converts an arbitrary input into a fixed-length, irreversible code whose values are typically combinations of numbers and letters. The input is unbounded, but the size of the output is specified in advance, so the probability of collision can be kept low. Hash functions are characterized by the avalanche effect, irreversibility, collision avoidance, and consistency. During training, the modalities learn the hash codes (HC) jointly, so the objective function of HL is shown in Eq. (14).
In Eq. (14), $L_p$ denotes the pairwise loss function, $\varepsilon $ denotes the hyperparameter, $L_q$ denotes the quantization loss, $\Theta _{ij} =\frac{1}{2} h_{*i}^{t} h_{*j} $, and $B$ denotes the hash codes. When optimizing the loss function, the model is trained iteratively by learning one set of parameters at a time while fixing the others. The GF-based CMHR model framework is shown in Fig. 5.
Fig. 5. Framework diagram of a cross modal hash retrieval model based on graphic and
textual features.
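To illustrate the hash learning objective, the sketch below follows a common formulation for joint hash-code learning (a pairwise likelihood term plus a quantization term); it is only a plausible reading of Eq. (14), with the similarity matrix S, the sign-based construction of the codes B, and the weight eps treated as assumptions.

```python
import torch

def hash_learning_loss(h_img, h_txt, S, eps=1.0):
    """Pairwise loss L_p plus quantization loss L_q for joint hash-code learning.
    h_img, h_txt: real-valued hash outputs of the two modalities, shape (batch, code_len).
    S: binary cross-modal similarity matrix, shape (batch, batch)."""
    theta = 0.5 * h_img @ h_txt.t()                              # Theta_ij = 1/2 * h_i^T h_j
    l_p = -(S * theta - torch.log1p(torch.exp(theta))).mean()    # pairwise likelihood term
    B = torch.sign(h_img.detach() + h_txt.detach())              # unified discrete hash codes B
    l_q = ((B - h_img) ** 2).mean() + ((B - h_txt) ** 2).mean()  # quantization term
    return l_p + eps * l_q
```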