In the context of the information explosion and the growth of multimedia, accurately obtaining effective information has become a key concern for researchers, and cross-modal graphic search (CMGS) plays a vital role in addressing it. To further improve the accuracy of CMR, this study builds a DL-based GFEM and a GF-based CMHR model.
3.1. Deep Learning-based Graphic Feature Extraction Model Construction
Graphic fusion learning aims to deeply explore the shared semantic information and potential correlations between image-modality and text-modality data, and to generate high-quality feature representations or decision results, with modality-specific information serving as supplementary input. Efficient analytical modeling of graphic fusion helps the model better understand the current task scenario and make decisions accordingly.
Graphic fusion strategies are categorized into hybrid fusion, decision-level fusion
and feature-level fusion according to the fusion stage, and the specific framework
is shown in Fig. 2.
Fig. 2. Image and text fusion strategy diagram.
Among them, feature-level fusion better captures potential correlations at the feature level, but it is prone to overfitting the training data. Decision-level fusion can effectively handle the asynchrony between heterogeneous data and better learn modality-specific information, but it ignores the correlations between heterogeneous modality feature representations and is difficult to implement. Hybrid fusion combines the advantages of feature-level and decision-level fusion. Therefore, this study adopts a hybrid fusion strategy, accounting for the overfitting problem that feature-level fusion faces on small-sample datasets and drawing on the modality-specific information preservation idea of decision-level fusion.
The BERT model, a deep bidirectional Transformer, and its extensions have achieved state-of-the-art results in a variety of natural language processing tasks, so this study adopts a pre-trained BERT model for text feature extraction [18]. To fully exploit the features of the 12-layer encoder network that makes up BERT, give the model a more suitable basis for discrimination, and improve its robustness and generalization ability, this study introduces the concept of a joint output that combines the outputs of multiple hidden layers, as shown in Eq. (1).
In Eq. (1), $L$ denotes the number of hidden layers in the BERT model. Because splicing the outputs of all hidden layers into a joint output would make the dimensionality too high, this study extracts features from the bottom, middle, and top layers of BERT and splices them as indicated by Eq. (2). In Eq. (2), $H_{i_j}$ is the output of the $i_j$-th selected hidden layer and $k$ is the number of hidden layers selected.
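As an illustration, the joint output described above can be sketched with the Hugging Face transformers library; this is a minimal sketch, and the particular layer indices (1, 6, and 12 as the bottom, middle, and top encoder layers) are assumptions rather than the paper's exact choice.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def joint_text_features(sentences, layers=(1, 6, 12)):
    """Splice the outputs of selected bottom / middle / top encoder layers
    into a joint text representation."""
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple of 13 tensors: the embedding layer plus the 12 encoder layers.
    hidden = outputs.hidden_states
    return torch.cat([hidden[i] for i in layers], dim=-1)   # (batch, seq_len, 3 * 768)

features = joint_text_features(["A dog plays in the park."])
```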
For the extraction of image features, the DenseNet network is used in this study.
DenseNet is a DNN in which the output of each layer is passed not only to the next layer but also directly to all subsequent layers. This dense connectivity enables better information flow and thus improves the effectiveness of the network [19]. The core structure of DenseNet is the dense block; the output of each dense block is compressed by a transition layer that reduces the number of output channels, thereby reducing the model parameters and accelerating training, as shown in Eq. (3).
In Eq. (3), $H_l$ denotes the convolution operation of the $l$-th layer, $x_l$ denotes the output of the dense block, $f_l$ denotes the compression operation of the $l$-th transition layer, and $x_{l-1}$ denotes the input to the transition layer. In addition, this study changes the last fully connected layer of the DenseNet network so that the sequence length of its output is consistent with that of the text features.
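As a concrete illustration of this image branch, the following is a minimal PyTorch sketch using torchvision's densenet121, in which the final fully connected layer is replaced so the output can be reshaped into a sequence; the sequence length and feature dimension below are illustrative assumptions chosen to match the text features.

```python
import torch
import torch.nn as nn
from torchvision import models

SEQ_LEN, FEAT_DIM = 32, 768   # illustrative values assumed to match the text-feature shape

class ImageEncoder(nn.Module):
    """DenseNet backbone whose last fully connected layer is replaced so that the
    image features can be reshaped into a sequence matching the text features."""
    def __init__(self):
        super().__init__()
        backbone = models.densenet121(weights=None)
        in_features = backbone.classifier.in_features          # 1024 for densenet121
        backbone.classifier = nn.Linear(in_features, SEQ_LEN * FEAT_DIM)
        self.backbone = backbone

    def forward(self, images):                                  # images: (batch, 3, 224, 224)
        flat = self.backbone(images)
        return flat.view(-1, SEQ_LEN, FEAT_DIM)                 # (batch, SEQ_LEN, FEAT_DIM)
```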
Different modalities have different forms of features, and the feature fusion process is prone to losing unimodal information, so intra-modal enhancement of the features is required before fusion. Transformer is a powerful sequence model that has achieved great success in feature extraction and enhancement, but its time and memory requirements grow quadratically with sequence length [20]. For this reason, this study employs the gated attention unit (GAU), which combines
a gated linear unit and an attention mechanism, to reduce the burden of self-attention
in Transformer. The gated linear unit is shown in Eq. (4).
In Eq. (4), $X$ denotes the original input vector, $\varphi $ denotes the activation function, $W$ denotes the weight matrix, $\odot $ denotes element-wise multiplication, and $O$ denotes the output. The core idea of the gated attention unit is to combine attention and the gated linear unit into a single layer; its calculation process is shown in Eq. (5).
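The following is a minimal PyTorch sketch of such a gated attention unit, following the published GAU design with squared-ReLU attention; the expansion factor, the shared dimension of $Z$, and the use of per-dimension scale-and-offset transforms for $Q$ and $K$ are assumptions consistent with that design rather than details taken from this paper. The symbols mirror those defined for Eq. (5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionUnit(nn.Module):
    """Single GAU layer: a gated linear unit whose value branch is reweighted by
    a lightweight squared-ReLU attention computed from a shared representation Z."""
    def __init__(self, dim, expansion=2, shared_dim=128):
        super().__init__()
        hidden = dim * expansion
        self.to_u = nn.Linear(dim, hidden)       # gate branch U
        self.to_v = nn.Linear(dim, hidden)       # value branch V
        self.to_z = nn.Linear(dim, shared_dim)   # shared representation Z
        # cheap per-dimension scale-and-offset transforms producing Q and K from Z
        self.q_scale = nn.Parameter(torch.ones(shared_dim))
        self.q_bias = nn.Parameter(torch.zeros(shared_dim))
        self.k_scale = nn.Parameter(torch.ones(shared_dim))
        self.k_bias = nn.Parameter(torch.zeros(shared_dim))
        self.to_out = nn.Linear(hidden, dim)

    def forward(self, x, rel_bias=0.0):          # x: (batch, n, dim); rel_bias plays the role of b
        n = x.size(1)
        u = F.silu(self.to_u(x))
        v = F.silu(self.to_v(x))
        z = F.silu(self.to_z(x))
        q = z * self.q_scale + self.q_bias       # cheap transformation Q
        k = z * self.k_scale + self.k_bias       # cheap transformation K
        a = F.relu(q @ k.transpose(-2, -1) / n + rel_bias) ** 2   # attention weights A
        return self.to_out(u * (a @ v))          # gated output O
```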
In Eq. (5), $Z$ refers to the shared representation, $Q$ and $K$ denote two cheap transformations of $Z$, $b$ denotes the relative positional bias, and $A$ denotes the attention weights. In this study, an autoencoder is used to fuse the image and text features, and the loss function of the feature fusion module is shown in Eq. (6).
In Eq. (6), the input and output text data are represented by $x_i$ and $\tilde{x}_i$, the number of modalities is represented by $M$, and the input and output image data are represented by $y_i$ and $\tilde{y}_i$, respectively.
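A minimal sketch of an autoencoder-style fusion module of this kind is given below; the layer sizes and the use of a mean-squared reconstruction error are assumptions, and the loss simply sums the per-modality reconstruction terms in the spirit of Eq. (6).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionAutoencoder(nn.Module):
    """Fuses enhanced text and image features into a shared code and reconstructs
    both modalities; the reconstruction error serves as the fusion loss."""
    def __init__(self, text_dim, image_dim, shared_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(text_dim + image_dim, shared_dim), nn.ReLU())
        self.text_decoder = nn.Linear(shared_dim, text_dim)
        self.image_decoder = nn.Linear(shared_dim, image_dim)

    def forward(self, x_text, y_image):
        z = self.encoder(torch.cat([x_text, y_image], dim=-1))   # shared fused representation
        return z, self.text_decoder(z), self.image_decoder(z)

def fusion_loss(x, y, x_rec, y_rec):
    """Sum of the reconstruction errors of the M = 2 modalities."""
    return F.mse_loss(x_rec, x) + F.mse_loss(y_rec, y)
```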
3.2. Cross-modal Hash Retrieval Model Building Based on Graphic Features
Once the image and text features are extracted, the study proposes a CMHR model based on multilevel adversarial attention, consisting of a modal attention module, a modal adversarial module, and a hash learning (HL) module, to further remove the heterogeneous differences between modalities and to supplement the fine-grained information of each modality. The channel attention mechanism can effectively weight and optimize the features of a single modality to improve their representational ability. In cross-modal hash retrieval, applying channel attention independently to the image and text features better preserves the unique information of each modality, so this study uses the channel attention mechanism in the modal attention module. The channel attention mechanism mimics the human behavior of making decisions from only a portion of the available data: it models each modality's data according to feature importance and assigns weights to the features based on their inputs. The mechanism is straightforward and efficient, and it includes two pooling operations, maximum pooling and average pooling, which aggregate the modality features, with maximum pooling additionally gathering spatial information. Fig. 3 illustrates the specific operation of the channel attention mechanism.
Fig. 3. Working principle diagram of channel attention mechanism.
Eq. (7) shows the channel attention computation for the image modality: the features of the text and image modalities are obtained and fed into a shared network to form a one-dimensional attention map.
In Eq. (7), $\rho $ denotes the sigmoid function, $F^{v} $ denotes the image features, $W$ denotes
the weights, and $F_{avg}^{v} $ and $F_{\max }^{v} $ denote the average pooling and
maximum pooling of the image, respectively. For the visual modality, Eq. (8) shows the pertinent and irrelevant information.
In Eq. (8), $F^{rv} $ denotes the relevant information, $F^{irv} $ denotes the irrelevant information, and $\otimes $ denotes element-wise multiplication. Similarly, the one-dimensional text channel attention map is shown in Eq. (9). In Eq. (9), $F^{t} $ denotes the text features, and $F_{avg}^{t} $ and $F_{\max }^{t} $ denote the average and maximum pooling of the text, respectively. Eq. (10) shows the computation of the relevant and irrelevant textual modality information.
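A minimal sketch of this channel attention computation for one modality is given below; the reduction ratio of the shared network and the treatment of features as a (batch, length, channels) sequence are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention over a feature sequence of shape (batch, length, channels):
    average- and max-pooled descriptors pass through a shared network and a sigmoid
    to give a one-dimensional attention map."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.shared_net = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):                                 # f: (batch, length, channels)
        avg = self.shared_net(f.mean(dim=1))              # average-pooled descriptor
        mx = self.shared_net(f.max(dim=1).values)         # max-pooled descriptor
        m = self.sigmoid(avg + mx).unsqueeze(1)           # one-dimensional attention map
        f_rel = f * m                                     # relevant information
        f_irr = f * (1.0 - m)                             # irrelevant information
        return f_rel, f_irr
```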
Each modality contains both relevant and irrelevant feature information, and the modal attention module obtains information with clear feature associations. The multilevel modal adversarial module, which includes intra-modal adversarial learning and inter-modal adversarial learning, is designed to convey the correlation representation of each modality more effectively. A generative adversarial network combines a generative model, which captures the data distribution, with a discriminative model, which judges the source of a sample; the main role of the discriminative model is to judge whether a sample comes from the generative model or from the real data [21]. Through this adversarial process, the generative model ultimately estimates the real data distribution. Fig. 4 illustrates the specific working principle.
Fig. 4. Schematic diagram of generating adversarial networks.
Through intra-modal adversarial learning, each modality's important feature information can be enhanced with irrelevant background information, giving each modality a richer set of relevant data. A generator is a key component of a generative adversarial network (GAN): it receives random noise or other samples as input and learns to generate data samples that are as similar to real data samples as possible. Generators typically interact with discriminators during adversarial training, aiming to get close enough to the real data samples to deceive the discriminators. The discriminator and generator learn in an adversarial manner, supplementing the relevant information of the modality with irrelevant background information. The objective functions of the discriminator and generator for adversarial learning within the image modality are shown in Eq. (11).
In Eq. (11), $\theta_D$ denotes the parameters of the discriminator $D$, $f_{ri}^{v} $ and $f_{iri}^{v} $ denote the intra-modal relevant and irrelevant information of the $i$-th image instance, $G_{r}^{v} $ and $G_{ir}^{v} $ denote the generators of image-relevant and image-irrelevant information, respectively, $\theta _{r}^{v} $ and $\theta _{ir}^{v} $ denote the parameters of these generators, and $v_i$ denotes the $i$-th image. Since intra-modal adversarial learning is symmetric for the two modalities, the discriminator and generator objective functions for text intra-modal adversarial learning have the same form. The discriminator and generator objective functions for inter-modal adversarial learning are shown in Eq. (12).
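To make the adversarial interplay concrete, the following is a minimal PyTorch sketch of one intra-modal adversarial update in the spirit of Eq. (11); the discriminator architecture, the binary cross-entropy formulation, and the choice to update only the irrelevant-information branch in the generator step are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Small MLP that predicts whether a feature vector carries modality-relevant information."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, f):
        return self.net(f)

bce = nn.BCEWithLogitsLoss()

def intra_modal_adversarial_step(disc, f_rel, f_irr, opt_d, opt_g):
    """One adversarial update; f_rel and f_irr are assumed to be the outputs of the
    upstream relevant / irrelevant feature generators (whose parameters opt_g updates)."""
    ones = torch.ones(f_rel.size(0), 1)
    zeros = torch.zeros(f_irr.size(0), 1)
    # Discriminator step: label relevant features 1 and irrelevant features 0.
    opt_d.zero_grad()
    d_loss = bce(disc(f_rel.detach()), ones) + bce(disc(f_irr.detach()), zeros)
    d_loss.backward()
    opt_d.step()
    # Generator step: push the irrelevant branch to look relevant, so background
    # information supplements the modality's relevant representation.
    opt_g.zero_grad()
    g_loss = bce(disc(f_irr), ones)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```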
In summary, the modal adversarial module can bridge the heterogeneity gap between modalities, make the relevant feature information more tightly distributed, and improve the accuracy of CMR; its objective function is shown in Eq. (13).
In Eq. (13), $\gamma $ denotes the hyperparameter. A hash function is a one-way mapping that converts an arbitrary input into a fixed-length, irreversible code whose values are typically combinations of numbers and letters. The input is unbounded, but the size of the output is specified in advance, so the probability of collision can be kept low. Hash functions are characterized by the avalanche effect, irreversibility, collision avoidance, and consistency. During training, the modalities learn the hash codes (HC) jointly, so the objective function of HL is shown in Eq. (14).
In Eq. (14), $L_p$ denotes the pairwise loss function, $\varepsilon $ denotes the hyperparameter, $L_q$ denotes the quantization loss, $\Theta _{ij} =\frac{1}{2} h_{*i}^{t} h_{*j} $, and $B$ denotes the hash codes. When optimizing the loss function, the model is trained iteratively by learning one set of parameters at a time while fixing the others. The GF-based CMHR model framework is shown in Fig. 5.
Fig. 5. Framework diagram of a cross modal hash retrieval model based on graphic and
textual features.
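To illustrate the hash learning objective, the sketch below follows a common formulation for joint hash-code learning (a pairwise likelihood term plus a quantization term); it is only a plausible reading of Eq. (14), with the similarity matrix S, the sign-based construction of the codes B, and the weight eps treated as assumptions.

```python
import torch

def hash_learning_loss(h_img, h_txt, S, eps=1.0):
    """Pairwise loss L_p plus quantization loss L_q for joint hash-code learning.
    h_img, h_txt: real-valued hash outputs of the two modalities, shape (batch, code_len).
    S: binary cross-modal similarity matrix, shape (batch, batch)."""
    theta = 0.5 * h_img @ h_txt.t()                              # Theta_ij = 1/2 * h_i^T h_j
    l_p = -(S * theta - torch.log1p(torch.exp(theta))).mean()    # pairwise likelihood term
    B = torch.sign(h_img.detach() + h_txt.detach())              # unified discrete hash codes B
    l_q = ((B - h_img) ** 2).mean() + ((B - h_txt) ** 2).mean()  # quantization term
    return l_p + eps * l_q
```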