
  1. (School of Information Engineering, Guangxi Technological College of Machinery and Electricity, Nanning, 530007, China)
  2. (Faculty of Civil Engineering, Guangxi Technological College of Machinery and Electricity, Nanning, 530007, China)

Keywords: Multisensor, Moving image, Information fusion, Multi-objective PSO, Color space model (CSM)

1. Introduction

Image fusion technology aims to associate and synthesize multi-source sensor image data to generate a unified estimation and judgment. As an effective information fusion technology, it has been used widely in automatic machine recognition, earth remote sensing, computer vision, military reconnaissance, medical image pathological change recognition, and other fields, and is more complete, reliable, and accurate than a single information source [1,2]. Because an image is a unique signal form, its signal characteristics and data information are subject to objective and subjective environments and to uncontrollable factors, and therefore show distinct differences [3]. At the same time, the limitations of single-sensor image processing make it difficult to meet the requirements of the fusion target. Multi-sensor image information fusion, despite its particularity and complexity, can better achieve a multi-source combination of data information [4].

The interference of internal and external environmental factors greatly restricts the image processing and fusion process, and the actual presentation effect differs significantly. Different types of images have different spatial scales, forms of expression, information characteristics, and properties, which must be considered during image fusion. At the same time, the complexity of network environment changes and the differences of images in time and space make data processing more difficult. The multi-sensor image fusion method proposed in this paper can process and transform images of different scales, achieving better processing results. Moreover, it can effectively reduce the oscillating effect of moving targets on image information during motion and improve the accuracy of target image detection in a dynamic environment.

Y. B et al. [5] proposed a multi-exposure image fusion method based on tensor decomposition (TD) and convolutional sparse representation (CSR). H. B et al. [6] proposed marking infrared image targets by semantic segmentation and applying different loss functions to the target and background areas during fusion. The experimental results showed that the fusion results have higher contrast in the target area and more texture details in the background area. Y. S et al. [7] reported the fusion of multi-exposure images with three image quality attributes: contrast, saturation, and brightness. Multi-scale image fusion under different coefficients was realized by attribute weighting and by eliminating pixels with a poor visual effect, and image quality was improved using a local-saturation color correction method. The experimental data showed that the algorithm could maintain image details and correct the color of an exposure fusion image. M. Zheng, G et al. [8] used adaptive structure decomposition to defog multi-exposure images, i.e., spatial linear adjustment, image sequence extraction, and a fusion scheme were applied to increase the acquired image information. A texture-energy-based method was studied to select the block size of the image structure decomposition adaptively; according to the experimental data, the method had high effectiveness and applicability. Y. Yang et al. [9] proposed a multi-layer feature convolutional neural network to fuse multiple input images, generating the fused image using a weighted summation decision. This method showed a good fusion effect in the experimental evaluation. Q. Zhang et al. [10] proposed a multi-sensor image fusion method under computer vision. The method was based on sparse representation and was compared with a dictionary learning method while ensuring that misregistration between source images remained unaffected. In the experimental evaluation, sparse representation could better fuse multi-sensor images.

Robust principal component analysis (RPCA) is often used in moving target detection, but its performance degrades with dynamic backgrounds and object movement. Therefore, S. Javed et al. [11] proposed a spatiotemporally structured algorithm, i.e., spatial and temporal normalization of the sparse components. The spatiotemporal subspace structure could effectively constrain the sparse components and yield a new target function. The experimental results showed that the algorithm achieved better target detection on different data sets with good performance. D. P. Bavirisetti et al. [12] realized the induced transformation of different source images under the structure transfer attribute to guide visual-significance detection of image filtering and the selection of weight maps, achieving the extraction of image information and the integration of pixels. The video sequence can show the changes in objects during motion capture. Yadav S. P. [13] improved the traditional frame difference algorithm with the help of MATLAB and considered noise interference and structural differences to achieve diverse coding methods. The results show that the video sequence test algorithm can effectively recognize moving images with high robustness. Considering the current reliance on single medical imaging methods, Yadav S. P. et al. [14] used wavelet transform, independent component analysis, and principal component analysis to perform multimodal medical image fusion, denoising, and data dimensionality reduction. This research idea can effectively improve the effect of medical diagnosis. Because original driving target image detection relies on RGB features, Ma X et al. [15] innovatively proposed using a 2D image plane for a 3D point cloud spatial point representation. They introduced the PointNet backbone and a multimodal feature fusion module to realize 3D detection and image inference of automatic driving targets, improving the performance of monocular 3D target recognition. The previous RGB-D salient object detection (SOD) methods could not fully capture the complex correlation between RGB images and depth maps and did not consider the cross-hierarchy and continuity of information. Therefore, Li G et al. realized information conversion interactively and adaptively and distinguished cross-modal features and enhanced RGB features from different sources by a cross-modal depth-weighted combination and depth algorithm. The method showed a good application effect and effectiveness on five test data sets.

A two-way guarantee of image quality and signal-to-noise ratio (SNR) is difficult in image information fusion, and the feature differences of moving images in different temporal and spatial backgrounds are noticeable. A multi-sensor moving-image information fusion analysis algorithm is therefore proposed to improve the SNR and information entropy of moving-image information fusion, reduce the standard mean square error, and improve the visual effect of the fusion. Fusion performance analysis showed that the proposed method performs well in visual quality and fusion metrics with less running time.

Image information fusion technology is an important method of image information processing and analysis and can be carried out at three levels: pixel, feature, and decision. Ensuring the clarity and integrity of image fusion is a problem requiring attention in information processing. Multi-sensor information fusion is, in effect, a functional simulation of how the human brain processes complex problems. The method can observe and process various image information and exhibit complementarity and redundancy avoidance in space and time under different image optimization criteria. Research on image processing can account for the scale differences among image data with the help of the multi-sensor information transmission concept. As a locally connected network, a convolutional neural network (CNN) has the characteristics of local connectivity and weight sharing. It can convolve a given image and extract features, and transforming its convolution kernel parameters can meet the requirements of image processing. Wavelet decomposition and color mode conversion of moving images can effectively reduce the impact of interference factors on image processing. The proposed method considers image feature extraction, information processing, and image fusion, and differs significantly from previous research in that it can generate a sequence of moving images, effectively avoiding the precision interference caused by missing moving images and detection errors.

2. Moving Image Preprocessing

Moving images show different image scales owing to differences in the information environment and information content. Therefore, the image must be preprocessed before image information fusion. The primary purpose of image preprocessing is to eliminate irrelevant information, restore useful and true information, enhance the detectability of relevant information, and simplify the data to the maximum extent so as to achieve the feature extraction and recognition of relevant information. At the same time, wavelet decomposition and color space conversion of moving images can reduce the impact of interference factors on image processing, ensure the quality of image information, and allow a better judgment of the fusion quality. A moving image originally refers to changes in the speed, time, and displacement of the physical properties of the image, i.e., the movement of objects expressed in image form. An image change caused by relative motion between an object and its surroundings can be called a moving image. The image may be blurred, distorted, or overlapping because of the uncertainty and difference of the motion, which also makes the fusion of moving images more difficult. The richness and integrity of the image information captured at different time intervals also differ, and the network environment and sensor types remove and retain the content of the image information to varying degrees [17]. At the same time, when moving image information is used for data input and digital image conversion, it inevitably reflects signal characteristics at different dimensional levels. Identifying these signal characteristics can effectively capture the commonality of different moving images in the fusion process.
Therefore, based on the characteristics of the moving image and while retaining those characteristics, the information of the moving image is analyzed to obtain the image time series. The image time series are decomposed by a wavelet to obtain the signal characteristics under different time-frequency resolutions, providing a richer information basis for image information fusion.

2.1 Moving Image Feature Extraction

In the feature extraction of moving images, the time and amount of the pixel frames that constitute the image must first be divided and then sampled step by step. A pixel frame refers to the complete sampling of an image, while a single-frame pixel refers to the image pixel frame at a specific time. The single-frame pixels are then processed by a CNN, i.e., the network extracts convolutional features representing the semantic information of the image and covers each pixel in the feature map with a reference suggestion box to determine whether the suggestion box is foreground or background. The foreground classification is determined according to the comprehensive characteristics of the subnet and the information of the mapping box [18]. The convolution-sharing feature is used to transmit and share the features of each single-frame pixel to form a feature network and determine the feature edges of the single-frame pixels of a moving image.

An image edge is a collection of pixels whose surrounding pixels have a step change in grayscale, lying between the target and the background. Edge features can better reflect the clarity of the target. On the other hand, the actual image is blurred by optical effects, the performance of the image acquisition equipment, and the sampling rate [19]. Therefore, based on the characteristics of the human visual system, the key to whether the target image is clear is whether the edges between the target and the background are clear: the larger the image edge features, the clearer the presented effect. The specific calculation steps of the feature edge of a single-frame pixel are as follows:

The number of edge feature systems is defined as $E$ according to the motion position of a single-frame pixel and its corresponding number of time systems, $E=e\left(n\right)$, where $e\left(n\right)$ is the amount of edge feature information corresponding to the pixel. The one-dimensional curve composed of a single-frame pixel and its adjacent single frame is discretized to obtain two motion feature curves, $p\left(n\right)$ and $q\left(n\right)$, from which the number of edge pixel feature systems corresponding to a single-frame pixel can be calculated as follows:

$ E=\left\{p\left(n\right),q\left(n\right)\right\} $

where $n$ represents the position of a single-frame pixel on the motion curve, $n=1,2,\ldots ,m$, and $m$ represents the total amount of feature information.

Considering that the noise error of the moving image will affect $p\left(n\right)$ and $q\left(n\right)$, let $\varepsilon \left(n\right)$ be the discrete coefficient of the number of feature systems to ensure its stability, and carry out $x+1$ iterative calculations. The number of feature systems after stable optimization is obtained as follows:

$ \varepsilon ^{x+1}\left(n\right)=D\left(V^{g}\times e_{1}\left(i\right)+V^{g}\times e_{2}\left(j\right)\right) $

where $V^{g}$ represents the weighted value of the CNN convolution coefficient, and $e_{1}\left(i\right)$ and $e_{2}\left(j\right)$ represent the statistical characteristic quantities.

After the above calculations, the single-frame pixel point $e\left(n\right)$ is defined as the center of the characteristic pixels, the outward radiation distance is $D$, and the corresponding angle of the characteristic pixel point is calculated to obtain a more accurate characteristic error coefficient. The feature information association area is composed of the pixel point $e\left(n\right)$ and all pixels within the radiation distance $D$, and can be described as

$ H\left(n\right)=D\left[e\left(n\right)-\left(p\left(n\right),q\left(n\right)\right)\right]=\left[p\left(k\right),q\left(k\right)\right] $

where $k$ represents the characteristic labels of all pixels within the radiation distance range $D$ of the setting area $e\left(n\right)$.

The centers of two adjacent groups of feature single-frame pixels of $e\left(n\right)$ are $e^{1}\left(n\right)$ and $e^{2}\left(n\right)$, and the feature angle relations formed by $e^{1}\left(n\right)$ with $e\left(n\right)$ and by $e\left(n\right)$ with $e^{2}\left(n\right)$ are $\theta ^{1}\left(n\right)$ and $\theta ^{2}\left(n\right)$, respectively. The edge coefficient curvature angle of the feature area composed of single-frame pixels is calculated as follows:

$ \theta \left(n\right)=\frac{D_{n}\left[\theta ^{1}\left(n\right)-\theta ^{2}\left(n\right)\right]}{\sum _{i=1}^{n}\exp \left(h\left(n\right)\right)} $

$\theta \left(n\right)$ and $e\left(n\right)$ satisfy a positive gradient relationship. Let the reference value of the moving image feature coefficient be $f$; if $\theta \left(n\right)>f$, the following equation is obtained:

$ \theta \left(n\right)=1-\frac{e\left(n\right)}{\max e\left(n\right)\times f} $

Repeated iterative calculations on Eq. (5) are carried out to acquire the optimal coefficient number; the range of the coefficient number corresponding to $f$ is set to be less than 0.4, and the value range of $D$ is (4, 18).

The pixel characteristic parameters of the final moving image are

$ G\left(n\right)=\left\{\begin{array}{ll} 0, & \theta \left(n\right)=f'\\ 1, & \theta \left(n\right)>f'\\ -1, & \theta \left(n\right)<f' \end{array}\right. $

In Eq. (6), $G\left(n\right)$ represents the pixel characteristic parameters of the final moving image, and $f'$ represents the number of curvature angle systems of the single-frame edge feature pixels.
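Treating Eq. (6) as a three-way threshold on the curvature angle, the classification can be sketched as follows. This is a minimal sketch; since the three cases in Eq. (6) overlap as written, it assumes they mean below, at, and above the threshold $f'$, and the function name is illustrative:

```python
import numpy as np

def pixel_feature_parameter(theta, f_prime):
    """Classify single-frame edge pixels by curvature angle theta(n)
    against the threshold f' (Eq. 6, read as below/at/above):
    returns -1 below the threshold, 0 at it, and 1 above it."""
    theta = np.asarray(theta, dtype=float)
    G = np.where(theta > f_prime, 1, -1)
    G[np.isclose(theta, f_prime)] = 0   # the boundary case theta(n) = f'
    return G
```

For example, `pixel_feature_parameter([0.1, 0.4, 0.7], 0.4)` yields `[-1, 0, 1]`.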

2.2 Generation of Moving Image Time Series

The moving image time series is generated by the mixed function control curve method. The HC-B\'{e}zier curve and the uniform B-spline curve with one shape parameter are defined in hyperbolic function space [16] to generate the moving image time series. The second-order $\lambda $ function can be written as

$ B=h\left(1-\lambda \right)\left(1-k^{2}\right)+h\left(t-\lambda k\right)\overline{h}\left(1-\lambda \right) $

Let $\alpha =\left(1-\lambda \right)\left(1-k^{2}\right)$ and $\beta =h\left(t-\lambda k\right)\overline{h}\left(1-\lambda \right)$. The mixed function with the parameters defined above forms the $\lambda $ function for control-edge coincidence interpolation, and the image contour curve is defined as

$ B\left(\lambda \right)=\sum _{i=1}^{n}\left[\alpha _{i}\left(t\right)*\beta \left(t-\lambda \right)\right] $

where $i=1,2,\ldots ,n$, and $n$ is the position of a single pixel in the motion time series curve. Feature extraction is carried out for each contour point that produces the maximum gray value, and a combined surface composed of $A$ and $N\times M$-order surface patches is defined using the control point $C_{k}$:

$ A\left(\lambda \right)=\lambda ^{n}+\sum _{i=1}^{n}a_{N\times M}^{i}\left(t\right) $

The bright spot area at the edge of the studied image is decomposed into a set of two-dimensional network points, which is expressed as

$ W=\left\{w\left(x,y\right)\right\} $

where $1\leq x\leq N$ and $1\leq y\leq M$. The elements of this set are the moving image time series.

2.3 Wavelet Decomposition of Moving Image

Based on the time series of moving images obtained in Section 2.2, the time series of moving images are further decomposed by a wavelet. The local features of moving images are generally determined by multiple pixels. The wavelet transform method [18] is adopted to decompose the moving images considering the differences in semantic information and detail features of moving images at different scales.

Wavelet transform differs from the Fourier transform, and its multi-scale representation also differs from the traditional image pyramid decomposition. It can obtain decomposition sub-bands of images with different resolutions and spatial scales. The wavelet transform method generates the corresponding scale and displacement functions through a wavelet basis [19].

The definition of wavelet basis is expressed as

$ U\left(t\right)=\exp \left\{-\frac{1}{2}\left(\frac{z-z_{i}}{\sigma _{i}}\right)^{2}\right\} $

where $z_{i}$ and $\sigma _{i}$ represent the scale factor and displacement factor, respectively.

The wavelet transform of the signal $L\left(t\right)$ can be expressed as

$ W\left(c_{1},c_{2}\right)=\int U\left(t\right)L\left(t\right)dt $

A moving image is a kind of digital signal and generally does not meet the continuity condition, so $c_{1}$ and $c_{2}$ take discrete forms. The corresponding wavelet function is obtained after power-series processing of $c_{1}$, i.e., $c=c_{1}^{n}$, and can be expressed as

$ c_{1}\left(t\right)=\int _{i=1}^{N}c_{1}^{i}\left(c-c_{2}\right) $

where $c_{2}$ is discretized uniformly. The form of $U\left(t\right)$ is shown below:

$ U\left(t-c_{2}\right)=\psi \left(c_{2}^{-i}-kc_{1}\right) $

Finally, the discrete wavelet transform [20] is defined as

$ W\left(c_{1}^{i},c_{2}^{i}\right)=\eta _{i}+\frac{1}{2}\sum _{t=1}^{N}U\left(t\right) $

The wavelet decomposition of moving images is similar to filtering the image with multiple groups of filter banks that automatically adjust their parameters to obtain subbands of different frequencies. Among the many commonly used wavelet decomposition algorithms is the Mallat fast decomposition algorithm [21], whose mathematical expression is

$ V\left(a,b\right)=\frac{\mu \left(d_{1}+d_{2}+d_{3}\right)}{\frac{1}{\ln U\left(t\right)}\sum _{a,b=1}^{N}p_{a}\left(i\right)\ln p_{b}\left(j\right)} $

where $p_{a}\left(i\right)$ refers to a high-pass filter; $p_{b}\left(j\right)$ refers to a low-pass filter; $a$ and $b$ refer to the row and category of the image; $\mu $ refers to the low-frequency part of the image; and $d_{1}$, $d_{2}$, and $d_{3}$ refer to the horizontal, vertical, and diagonal edge details of the image, respectively, i.e., the high-frequency part.

This paper selects the haar, db6, bior4.4, and sym8 wavelets as the wavelet basis functions for a three-level decomposition of the image. Each decomposition yields four subbands: the low-frequency approximation sub-band $LL$, the high-frequency vertical detail sub-band $HL$, the high-frequency diagonal detail sub-band $HH$, and the high-frequency horizontal detail sub-band $LH$. At each subsequent level, the low-frequency subband is decomposed further. Taking the three-level Haar wavelet decomposition as an example, the decomposition schematic is shown in Fig. 2.
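The three-level decomposition described above can be illustrated with a plain-NumPy sketch of the Haar case. The function names are illustrative; in practice a wavelet library (e.g., PyWavelets) would also supply the db6, bior4.4, and sym8 bases, and subband naming conventions vary:

```python
import numpy as np

def haar_decompose_2d(img):
    """One level of 2-D Haar decomposition: returns (LL, LH, HL, HH).
    img must have even height and width."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    LL = (a + b + c + d) / 4.0   # low-frequency approximation
    LH = (a + b - c - d) / 4.0   # horizontal detail (per the paper's naming)
    HL = (a - b + c - d) / 4.0   # vertical detail
    HH = (a - b - c + d) / 4.0   # diagonal detail
    return LL, LH, HL, HH

def multilevel_haar(img, levels=3):
    """Recursively decompose the low-frequency LL subband, as in Fig. 2."""
    subbands = []
    ll = img.astype(float)
    for _ in range(levels):
        ll, lh, hl, hh = haar_decompose_2d(ll)
        subbands.append((lh, hl, hh))
    return ll, subbands
```

Because each $LL$ entry is the average of a 2x2 block, three levels applied to an 8x8 image reduce $LL$ to a single value equal to the global mean, while the detail subbands at each level carry the high-frequency edge information.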

Fig. 1. Moving image decomposition.
Fig. 2. Schematic diagram of wavelet decomposition.

3. Moving Image Fusion Algorithm

Based on the moving image features, time series, and decomposition results obtained in Section 2, a CSM is established to ensure the color consistency of image fusion, and a multi-sensor approach is adopted to fuse the moving images. As shown in Fig. 2, the image information is decomposed using the wavelet basis function to obtain subbands of different frequencies and directions. Because the wavelet decomposition of moving images is similar to filter decomposition, the filter parameters can be adjusted to achieve a hierarchical division of signal features. The multi-objective PSO algorithm is used to optimize the image fusion result and improve the fusion effect.

3.1 Construction of CSM

A moving image can be decomposed into three channel components: R, G, and B. Because the three channel components are correlated, they affect each other in the image fusion process, which is not conducive to the fusion calculation. IHS (intensity, hue, saturation) image fusion is based on color space conversion. The method converts RGB (red, green, blue) spatial information, transforming the image into one containing three independent components: I (intensity) represents the brightness information, H (hue) the distinguishing property between colors, and S (saturation) the depth and concentration of the image colors. Compared with RGB space, IHS color space is closer to the color perception of the human visual system. Different regions of the same unprocessed moving image will be blurred or clear owing to different viewfinder depths. As with grayscale multi-focus image fusion, moving image fusion combines the clear regions of different focus targets in multiple color multi-focus images into a single image. The IHS CSM is established to achieve this goal [22].

The IHS model is a CSM based on the three elements of human visual color. The I component mainly contains the gray information of the source image, while the H and S components jointly contain its spectral information [23]. In addition, the I component is the weighted average of the three color channels and is insensitive to noise. Therefore, the I component is selected to calculate the focusing degree, which characterizes the fusion degree of the pixel of interest in the resulting image. The IHS CSM is established for the above reasons. Taking two moving images as an example, the schematic diagram of the CSM is shown in Fig. 3.

According to Fig. 3, the main steps of building the IHS CSM are as follows:

(1) The two moving images $A$ and $B$ are converted from RGB space to IHS space, and the brightness components $I_{A}$ and $I_{B}$ of the two moving images are separated;

(2) The luminance component $I_{A}$ and the luminance component $I_{B}$ are decomposed by DT-CWT to acquire the low-frequency component and high-frequency component of $I_{A}$ and $I_{B}$, respectively;

(3) In accordance with the fusion rules based on fuzzy theory, the moving images are preliminarily fused to acquire the low-frequency component $S_{A}\left(x,y\right)$ and the six high-frequency detail components $H_{B}\left(x,y\right)$;

(4) These low-frequency and high-frequency components are inversely transformed by DT-CWT to obtain the fusion result $I_{AB}$;

(5) $S_{A}^{'}$ and $H_{B}^{'}$ are obtained by the weighted average of $S_{A}$ and $H_{B}$, respectively, and the IHS inverse transformation is performed together with $I_{AB}$;

(6) From IHS space to RGB space, the preliminary fusion results of the moving images are obtained.
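Steps (1) and (6) rely on the forward and inverse IHS transforms. A minimal sketch is given below using one common linear IHS variant; the paper does not fix a specific formula, so the exact transform matrix is an assumption:

```python
import numpy as np

S6, S2 = np.sqrt(6.0), np.sqrt(2.0)

def rgb_to_ihs(rgb):
    """Linear IHS forward transform (one common variant).
    rgb: float array of shape (..., 3) with channels R, G, B."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    i = (r + g + b) / 3.0            # intensity: mean of the channels
    v1 = (2.0 * b - r - g) / S6      # chromatic axis 1
    v2 = (r - g) / S2                # chromatic axis 2
    h = np.arctan2(v2, v1)           # hue angle
    s = np.hypot(v1, v2)             # saturation magnitude
    return i, h, s

def ihs_to_rgb(i, h, s):
    """Inverse of rgb_to_ihs (step 6: back from IHS to RGB)."""
    v1, v2 = s * np.cos(h), s * np.sin(h)
    r = i - v1 / S6 + v2 / S2
    g = i - v1 / S6 - v2 / S2
    b = i + 2.0 * v1 / S6
    return np.stack([r, g, b], axis=-1)
```

The transform is exactly invertible, so the brightness components $I_{A}$ and $I_{B}$ can be fused in IHS space and the result mapped back to RGB without loss.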

Fig. 3. IHS CSM principle.

3.2 Moving Image Fusion Method based on Multi-sensor

Multi-sensor image fusion is a comprehensive analysis technology that spatially registers different image data of the same scene obtained by multiple different types of sensors. The advantageous information of each image is then complementary and organically combined to produce new, more informative images [24]. Image fusion, a key branch and research hotspot of information fusion, has been applied extensively in fields such as machine vision, military remote sensing, and medical diagnosis [25].

Multi-sensor image fusion is a process that integrates the images or image sequence information of a specific scene acquired by multiple sensors, simultaneously or at different times, to generate new information for scene interpretation. Let $r_{1},r_{2},\ldots ,r_{n}$ represent the measured data obtained by the sensors from the measured parameters. Owing to the influence of sensor accuracy and environmental interference, $r_{i}$ is random. Let its corresponding random variable be $R_{i}$; in practical applications, $R_{i}$ generally obeys a normal distribution, and the measured values of the sensors are assumed to be independent of each other. Owing to the randomness of environmental interference factors, the authenticity of $R_{i}$ can be determined only from the information contained in the measurement data $r_{1},r_{2},\ldots ,r_{n}$. A higher authenticity of $r_{i}$ corresponds to a higher degree of support for $r_{i}$ from the other measurement data. The degree to which $r_{i}$ is supported by $r_{j}$ is the possibility that the measured data $r_{i}$ is real, judged from the measured data $r_{j}$. The concept of relative distance is introduced for the support degree between two sets of sensor-measured data.

The relative distance $d_{ij}$ between the measured data of two sensors is defined as the following expression:

$ d_{ij}=\sqrt{\left| r_{i}-r_{j}\right| ^{2}} $

A larger $d_{ij}$ corresponds to a greater difference between the two measured data, i.e., smaller mutual support between them. The relative distance is defined from the information implicit in the data, reducing the requirement for a priori information. A support function $\vartheta _{ij}$ is defined to further quantify the mutual support between the measured data. $\vartheta _{ij}$ should meet two conditions:

(1) $\vartheta _{ij}$ should have an inverse proportional relationship with the relative distance;

(2) $\vartheta _{ij}\in \left[0,1\right]$ enables the measurement data processing to take advantage of the membership function advantages in fuzzy set theory and prevent the absoluteness of mutual support between two measurement data.

The support function $\vartheta _{ij}$ is defined as

$ \vartheta _{ij}=\frac{\pi \left(d_{ij}\right)}{\sum _{i,j=1}^{N}\operatorname{arccot} \left(d_{ij}\right)} $

According to Eq. (18), a smaller relative distance between two measured data corresponds to greater mutual support between them. The support degree is one when the relative distance between equivalent measurement data and itself is zero. In contrast, a very small $\vartheta _{ij}$ indicates a considerable relative distance between the two data; in that case, there is no mutual support between them and $\vartheta _{ij}$ is meaningless. According to the practical background of the problem, a parameter $\xi \geq 0$ can be determined such that $\vartheta _{ij}=0$ when $d_{ij}\geq \xi $. When the relative distance between two measured data is largest, there is no mutual support between them and the support function reaches zero. Because the $\vartheta _{ij}$ value declines from 1 to 0 on $d_{ij}\in \left[0,+\infty \right)$, it satisfies the properties of the support function. Furthermore, the definition of the fuzzy support function $\vartheta _{ij}$ complies more closely with the authenticity of the practical problem. The method is easy to implement and makes the fusion result more accurate and stable.
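A minimal sketch of the support-degree computation, with a simple weighted fusion built on it, follows. The final weighting step is an assumption: the paper defines the relative distance and support function (Eqs. 17-18) but does not spell out how supports combine into a fused estimate, so this sketch weights each reading by the total support it receives:

```python
import numpy as np

def fused_estimate(r, xi=None):
    """Support-degree fusion of n sensor readings r.
    d_ij = |r_i - r_j|; support is arccot(d_ij) normalised over all pairs
    and zeroed beyond the cut-off xi when given. Each reading is weighted
    by the total support it receives from the others (an assumed rule)."""
    r = np.asarray(r, dtype=float)
    d = np.abs(r[:, None] - r[None, :])   # relative distances d_ij
    theta = np.arctan2(1.0, d)            # arccot(d), decreasing in d
    if xi is not None:
        theta[d >= xi] = 0.0              # no support beyond the cut-off
    support = theta / theta.sum()         # normalise over all pairs
    w = support.sum(axis=1)               # total support for each r_i
    return float(np.dot(w / w.sum(), r))
```

Readings far from the rest (large $d_{ij}$ to everything) collect little support, so an outlier is pulled toward, but weighted below, the consensus of the remaining sensors.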

3.3 Moving Image Fusion Optimization based on the Multi-objective PSO Algorithm

The spatial conversion of image information and the multi-sensor fusion of different image data improve the accuracy and stability of the fusion results. In addition, a multi-objective PSO algorithm is used to optimize the fusion parameters and further improve the retention of image details and the richness of image information [26]. The multi-objective particle swarm optimization algorithm is efficient and straightforward, requires no complex parameter settings, and can satisfy many objective evaluation indicators and fusion requirements in moving image fusion; these objective evaluation indicators can serve as optimization objective functions [27]. A single objective function can hardly cover all features of image fusion, whereas the multi-objective particle swarm optimization algorithm enhances the fusion by considering multiple objective functions. The algorithm takes multiple motion sub-images obtained from training as training objects, mapped to the dimensions of the PSO space; it preprocesses the sampled images, takes the divided image blocks as input values, selects the best-fitness individual to initialize the weights of the network structure, repeats the particle iteration optimization process, and calculates the best fitness of each example to obtain the optimal particle solution. The PSO algorithm produces a global optimal particle when solving single-objective problems and various non-dominated solutions, i.e., non-inferior solutions, for multi-objective problems. Therefore, appropriate particles can be selected from this set of non-dominated solutions according to specific fusion needs [28,29].

The fusion parameter optimization algorithm is as follows:

(1) Set two fusion parameters $\delta _{1}$ and $\delta _{2}$, and initialize two particle populations $\lambda _{{\delta _{1}}}$ and $\lambda _{{\delta _{2}}}$, corresponding to the fusion parameters $\delta _{1}$ and $\delta _{2}$, respectively. The search space of the two particle populations is [0,1], and the number of particles is $N_{k}$. The initial position $\tau _{0}$ of each particle in the swarm is generated randomly, and the initial velocity $v_{0}$ is set to 0. The individual extremum $\delta _{best}$ is initialized.

(2) Calculate the optimization objective function value $\partial _{i}\left(k\right)$ corresponding to each particle in the population, $k=1,2,\ldots ,N_{s}$. $N_{s}$ is the number of objective functions, and the optimization objective function is the objective evaluation index of the selected image fusion.

(3) Initialize the external archive $O$ and store the non-dominated particles in $\delta _{best}$ into the external archive.

(4) Perform the following operations, iterating until the maximum number of generations is reached.

1) Calculate the crowding distance of the non-dominated solution set in the external archive, and sort archive $O$ in descending order of crowding distance. The crowding distance is calculated as

$ \varpi _{dist}\left(i\right)=\partial _{i}\left(k+1\right)-\partial _{i}\left(k-1\right) $

where $\varpi _{dist}\left(i\right)$ represents the crowding distance of individual $i$, which is 0 at initialization. $\partial _{i}\left(k+1\right)$ and $\partial _{i}\left(k-1\right)$ represent the objective function values of the adjacent individuals, respectively.

2) Update the particle velocity. The equation is

$ \begin{array}{l} V\left[i,j\right]=w\times \max \left(A\right)\times \delta _{best}\left(i,j\right)\\ A=F\left[c_{1},c_{2}\right]-E\left[i,j-1\right]+E_{W}\left[i-1,j\right] \end{array} $

where $w$ represents the inertia weight, $E_{W}$ is the learning factor, and $\delta _{best}\left(i,j\right)$ is the individual extreme value, i.e., the best position found by the particle.

3) Update the particle positions, keeping each particle within the search space. If a particle crosses the boundary, its position is set to the corresponding boundary value and its velocity to $-V\left[i,j\right]$, so that a reverse search is carried out.

4) Set the iteration number to $T$ and mutate the particle swarm. The mutation operator is

$ \rho =0.5\times \left(\rho _{Up}-\rho _{Low}\right)\times \overline{\rho } $

where $\rho _{Up}$ and $\rho _{Low}$ refer to the upper and lower bounds of the search space, respectively.

5) For particles in the population, calculate and evaluate their objective function values.

6) Update the external archive and insert the non-dominated particles in the current population into the external archive.

7) Update the individual extreme value: if the current position of the particle is better than the stored individual extreme value, set $\delta _{best}\left[i\right]=\delta \left[i\right]$.

(5) Output the external archive as the non-dominated solution set.
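The steps above can be sketched in Python as follows. This is a simplified, hypothetical rendering: the velocity rule uses the standard PSO form with inertia weight $w$ and two acceleration coefficients as a stand-in, because the terms $F$, $E$, and $E_{W}$ in the update equation above are not fully specified here; the mutation probability is likewise illustrative.

```python
import random

# Sketch of the fusion-parameter optimization loop, steps (1)-(5).
# Assumptions: standard PSO velocity update (the paper's F/E/E_W terms are
# not fully specified here); boundary reflection, mutation with operator
# rho = 0.5 * (rho_Up - rho_Low) * random factor, and archive upkeep follow the text.
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def mopso(objectives, n_particles=20, n_iter=50, w=0.5, c1=1.5, c2=1.5,
          low=0.0, up=1.0, seed=0):
    rng = random.Random(seed)
    pos = [rng.uniform(low, up) for _ in range(n_particles)]   # tau_0, random
    vel = [0.0] * n_particles                                  # v_0 = 0
    pbest = list(pos)                                          # delta_best
    pbest_f = [objectives(p) for p in pbest]
    archive = []                                               # external archive O

    def update_archive(p, f):
        nonlocal archive
        if any(dominates(fa, f) for _, fa in archive):
            return                                             # f is dominated
        archive = [(pa, fa) for pa, fa in archive if not dominates(f, fa)]
        archive.append((p, f))

    for p, f in zip(pbest, pbest_f):                           # step (3)
        update_archive(p, f)

    for _ in range(n_iter):                                    # step (4)
        leader = rng.choice(archive)[0]      # a guide drawn from the archive
        for i in range(n_particles):
            vel[i] = (w * vel[i]
                      + c1 * rng.random() * (pbest[i] - pos[i])
                      + c2 * rng.random() * (leader - pos[i]))
            pos[i] += vel[i]
            if pos[i] < low or pos[i] > up:  # boundary: clamp, reverse search
                pos[i] = min(max(pos[i], low), up)
                vel[i] = -vel[i]
            if rng.random() < 1.0 / n_particles:  # illustrative mutation rate
                pos[i] = min(max(pos[i] + 0.5 * (up - low) * rng.uniform(-1, 1),
                                 low), up)
            f = objectives(pos[i])
            if dominates(f, pbest_f[i]):     # update individual extremum
                pbest[i], pbest_f[i] = pos[i], f
            update_archive(pos[i], f)
    return archive                           # step (5): non-dominated set

# Toy bi-objective problem: minimize (x^2, (x-1)^2); Pareto set is x in [0, 1].
front = mopso(lambda x: (x * x, (x - 1.0) ** 2))
```

With the toy objectives above, every archived position lies in the search space and no archived solution dominates another, matching the role of the external archive in the algorithm.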

The fusion processing of moving images is realized through the above steps. The proposed method will be experimentally analyzed to verify its practical application value.

4. Experimental Research

Experimental analysis was conducted to verify the comprehensiveness and effectiveness of the multi-sensor-based moving image information fusion algorithm. In the experiment, the TD- and CSR-based multi-exposure image fusion method and the semantic segmentation-based infrared and visible image fusion method were compared with the proposed method in several respects.

4.1 Experimental Hardware Environment and Data Source

The experimental hardware environment included an Intel Core i5 M480 CPU @ 2.67 GHz, 8 GB of memory, and a 64-bit Windows 10 operating system. The images used in this study were obtained from the ImageNet database, the largest known image database. Five data sets were set up with 1000 images of various types from the database. The experimental images were 512 ${\times}$ 512 pixels with 256 gray levels after spatial registration.

4.2 Analysis of Experimental Results

The experimental indices were divided into objective and subjective evaluation indices. The objective evaluation indices included standard mean-square deviation, information entropy, and signal-to-noise ratio. The subjective evaluation index was the human visual effect.

(1) Objective evaluation

The standard mean-square error (MSE) measures how well the fused image retains the information of the original image: a smaller value indicates a closer approximation. The SNR behaves oppositely: a larger value indicates a better fusion effect. The two are calculated as follows:

$ MSE=\frac{1}{MN}\sum _{i=1}^{M}\sum _{j=1}^{N}\left(\varepsilon _{i}-\varepsilon _{j}\right)^{2} $

$ SNR=10\ast \log _{10}\frac{\sum _{i=1}^{M}\sum _{j=1}^{N}\varepsilon _{ij}^{2}}{\sum _{i=1}^{M}\sum _{j=1}^{N}\left(\varepsilon _{i}-\varepsilon _{j}\right)^{2}} $

where $\varepsilon _{i}$ represents the pixel gray value of the target scene, and $\varepsilon _{j}$ is the pixel gray value of the fused image.
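The two metrics can be computed directly from the definitions above. A small pure-Python sketch, with images represented as nested lists of gray values:

```python
import math

# MSE and SNR between a reference (target-scene) image and a fused image,
# following the two equations above; M x N images as lists of rows.
def mse(ref, fused):
    m, n = len(ref), len(ref[0])
    return sum((ref[i][j] - fused[i][j]) ** 2
               for i in range(m) for j in range(n)) / (m * n)

def snr(ref, fused):
    signal = sum(v * v for row in ref for v in row)          # sum of eps_ij^2
    noise = sum((ref[i][j] - fused[i][j]) ** 2               # squared errors
                for i in range(len(ref)) for j in range(len(ref[0])))
    return 10 * math.log10(signal / noise)
```

For a 2 x 2 reference `[[10, 20], [30, 40]]` and fused image `[[11, 19], [29, 41]]`, every pixel error is 1, so the MSE is 1.0 and the SNR is `10 * log10(3000 / 4)` dB.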

Image information entropy (IIE) is a key index for measuring the richness of image information, and comparing the IIE allows the detail-rendering ability of images to be compared. The information entropy reflects how much information the image carries: a larger entropy indicates better fused-image quality. The amount of information tends to be largest when the probabilities of all gray levels in the image tend to be equal. The IIE is defined as

$ \sigma =-\sum _{o=1}^{N}\theta _{o}\ln \theta _{o} $

where $\theta _{o}$ is the ratio of the number of pixels with gray value $o$ to the total number of image pixels, and $N$ is the number of gray levels.
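A minimal sketch of the IIE computed from the gray-level histogram, using the natural logarithm as in the definition above:

```python
import math
from collections import Counter

# Image information entropy from the gray-level histogram
# (Shannon entropy with natural log, per the definition above).
def image_entropy(pixels):
    counts = Counter(pixels)   # occurrences of each gray level o
    total = len(pixels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

As a sanity check, four equally likely gray levels give entropy `ln 4` (about 1.386), the maximum for four levels, while a constant image gives 0, matching the statement that equal gray-level probabilities maximize the information content.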

According to Eqs. (22) and (23), the MSE and SNR values of images were calculated under different data set numbers of the proposed method and the other two comparison methods. The experiment was repeated three times, taking the average MSE value and average SNR value as the experimental results. The results are shown in Tables 1 and 2.

According to an analysis of the data in Table 1, across the data sets, the MSE values of the TD- and CSR-based multi-exposure image fusion method and the semantic segmentation-based infrared and visible image fusion method were all above 500, and the maximum average MSE of the TD-CSR method reached 1006.98. The minimum MSE values of the two comparison methods were 579.13 and 488.56, respectively, both higher than the 410.65 of the proposed method. The minimum average MSE of the proposed method was 286.25, and its maximum signal-to-noise ratio reached 24.66. These results show that the proposed fusion method can effectively improve image similarity and reduce the error value.

According to the data analysis in Table 2, the SNR values of the TD- and CSR-based multi-exposure image fusion method and of the semantic segmentation-based infrared and visible image fusion method were lower than those of the proposed method. The maximum SNR values of the two comparison methods were 12.36 and 12.39, respectively, while the maximum SNR value of the proposed method was 24.75, indicating that the proposed fusion method achieves a better fusion effect.

Table 1. Comparison results of the MSE values.

Columns: Dataset Number; MSE value for the method put forward, TD and CSR, and semantic segmentation.

Table 2. Comparison results of the SNR values.

Columns: Dataset Number; SNR value for the method put forward, TD and CSR, and semantic segmentation.

According to the data analysis in Table 3, the feature classification accuracy of the proposed method was above 91%, with a maximum of 95.16%, higher than that of the other two fusion algorithms, whose accuracies were below 90% with maximum values of 85.24% and 86.14%. Table 4 shows that the maximum information entropy of the proposed method is 9.2, significantly higher than that of the two traditional methods. These results show that after fusion with the proposed method, image detail retention was higher, the feature classification effect was better, the image richness was improved, and the fusion quality was better.

The information entropy of different methods was calculated using Eq. (24), and the results are shown in Fig. 4.

Table 3. Comparison results of the feature classification accuracy.

Columns: Dataset Number; feature classification accuracy (%) for the method put forward, TD and CSR, and semantic segmentation.

Table 4. Comparison results of the information entropy.

Columns: Dataset Number; information entropy for the method put forward, TD and CSR, and semantic segmentation.

(2) Subjective evaluation

The above objective evaluation results show that the proposed method has a good image fusion effect. To verify its application value, an image in the experimental set was selected arbitrarily for fusion processing. The visual effects of image fusion were compared with those of different methods. The results are shown in Fig. 5.

The original remote sensing data to be fused were preprocessed to reduce the impact of spectral and temporal differences on the image information and to limit data errors. The multi-source image data were spatially registered, i.e., the high-resolution image data were used as the reference datum, and control points were selected to perform geometric correction on the other images. The original image in Fig. 5 is the image to be processed, and the target image is the image after preprocessing. The original image was sparsely represented and convolved so that the multi-sensor images could be fused better. In addition, the redundant information in the lower-right part of Fig. 5(d) is eliminated while the main feature information is preserved. The small rectangular box in the semantic segmentation result in Fig. 5(d) shows that when the differently processed target images are fused, the displayed information features are relatively rich (the rectangular box on the right side of Fig. 5(d) marks the part added relative to the original image), and the image information can be smoothed on the basis of its integration.

The original image was input and preprocessed, including data format conversion and image filtering (wavelet transform). The color space information of the processed image was then converted, and the data information was fused by the multi-sensor. Before image fusion, the other images were calibrated and registered against the high-resolution image. The processed object image was sparsely represented and convolved, and particle swarm optimization was carried out under the multi-objective fusion requirements to find the optimal particle solution. The fused image, including semantic and scene segmentation, was resampled and post-processed to obtain the information features of the multiple motion images after fusion.

Fig. 5 shows that the fusion result of the proposed method retains rich image information and improves the spatial detail expression of the fused image. Compared to the traditional methods, the proposed method produced clearer images with more obvious detailed features, less distortion, higher definition, and a better visual effect.

According to the experimental results, the image fused using the proposed method was better than those of the traditional fusion methods in both subjective visual effect and objective statistical data. Its ability to preserve image information and enhance spatial detail is improved.

Remote sensing image data were selected for the fusion algorithm comparison, and Figs. 6 and 7 present the results. The dynamic visual effects of the panchromatic images, which were processed by a wavelet transform and inverse wavelet transform and fused with spectral images by multi-sensor processing, differed considerably. The evaluation focused on the differences in spatial detail features and spectral information, for which information entropy and sharpness reflect the image fusion quality. Figs. 6 and 7 show that the images (b and c) after the wavelet transform and inverse wavelet transform exhibit some information distortion and blurring, with a specific deviation between their feature information content and the original image. The proposed multi-sensor image fusion can effectively account for the differences between different image scales. The spectral distortion in Figs. 6(d) and 7(d) was less than in the other images, and the sharpness increased significantly, giving better visual effects and presentation quality.

Fig. 4. Comparison results of the information entropy.
Fig. 5. Comparison of image fusion effects.
Fig. 6. Image fusion effect based on the region features.
Fig. 7. Image fusion effect based on the region features.

5. Conclusion

A new image fusion algorithm, a multi-sensor-based moving image information fusion analysis algorithm, was proposed to address the shortcomings of traditional methods. According to the experimental results, the proposed method is effective for moving image fusion: it retains the original image detail and improves the definition and spatial resolution of the original image. The fused image was clearer, and the detailed features were more obvious. The method is thus more in line with machine perception and human vision characteristics.

Image fusion quality is affected mainly by the fusion method and the registration quality, as most image fusion methods are implemented on top of image registration. Many indicators, both subjective and objective, are used to evaluate image fusion, and attention to human vision is essential when analyzing the fusion effect. The multi-sensor-based image fusion method proposed in this research can highlight the reality of the target information, retain detailed information, and present an excellent visual effect. The fusion method can process dynamic image information and has good application prospects in sports and remote sensing.


[1] I. Abas, N. A. Baykan, "Multi-Focus Image Fusion with Multi-Scale Transform Optimized by Metaheuristic Algorithms," Traitement du Signal, 2021, vol. 38, pp. 247-259. DOI
[2] D. Khaustov, Y. Khaustov, R. Ye, E. Lychkowskyy, N. Yu, "Jones formalism for image fusion," Ukrainian Journal of Physical Optics, 2021, vol. 22, pp. 165-180. URL
[3] P. Singh, M. Diwakar, X. Cheng, A. Shankar, "A new wavelet-based multi-focus image fusion technique using method noise and anisotropic diffusion for real-time surveillance application," Journal of Real-Time Image Processing, 2021, vol. 18, no. 4, pp. 1051-1068. DOI
[4] S. Babu, R. V. Krishna, J. S. Rao, P. R. Kumar, "NSCT and Eigen Features Based Image Fusion," Solid State Technology, 2021, vol. 63, no. 5, pp. 5806-5813. URL
[5] Y. B. Qi, M. Yu, H. Jiang, H. Shao, G. Y. Jiang, "Multi-exposure image fusion based on tensor decomposition and convolution sparse representation," Opto-Electronic Engineering, 2019, vol. 46, no. 1, pp. 4-16. URL
[6] H. B. Zhou, J. L. Hou, W. Wu, Y. Zhang, Y. Wu, J. Y. Ma, "Infrared and Visible Image Fusion Based on Semantic Segmentation," Journal of Computer Research and Development, 2021, vol. 58, no. 2, pp. 436-443. URL
[7] Y. S. Du, C. B. Huang, "Multi-exposure image fusion algorithm based on quality metric coupled with color correction," Journal of Electronic Measurement and Instrumentation, 2019, vol. 33, no. 1, pp. 90-98. URL
[8] M. Zheng, G. Qi, Z. Zhu, et al., "Image dehazing by an artificial image fusion method based on adaptive structure decomposition," IEEE Sensors Journal, 2020, 14, pp. 8062-8072. URL
[9] Y. Yang, Z. Nie, S. Huang, et al., "Multilevel features convolutional neural network for multifocus image fusion," IEEE Transactions on Computational Imaging, 2019, vol. 5, no. 2, pp. 262-273. DOI
[10] Q. Zhang, Y. Liu, R. S. Blum, et al., "Sparse representation based multi-sensor image fusion for multi-focus and multi-modality images: A review," Information Fusion, 2018, vol. 40, no. 1, pp. 57-75. DOI
[11] S. Javed, A. Mahmood, S. Al-Maadeed, et al., "Moving object detection in complex scene using spatiotemporal structured-sparse RPCA," IEEE Transactions on Image Processing, 2018, vol. 28, no. 2, pp. 1007-1022. DOI
[12] D. P. Bavirisetti, G. Xiao, J. Zhao, et al., "Multi-scale guided image and video fusion: A fast and efficient approach," Circuits, Systems, and Signal Processing, 2019, vol. 38, no. 12, pp. 5576-5605. DOI
[13] S. P. Yadav, "Vision-based detection, tracking, and classification of vehicles," IEIE Transactions on Smart Processing & Computing, 2020, vol. 9, no. 6, pp. 427-434. URL
[14] S. P. Yadav, S. Yadav, "Image fusion using hybrid methods in multimodality medical images," Medical & Biological Engineering & Computing, 2020, vol. 58, pp. 669-687. DOI
[15] X. Ma, Z. Wang, H. Li, et al., "Accurate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6851-6860. URL
[16] G. Li, Z. Liu, H. Ling, "ICNet: Information conversion network for RGB-D based salient object detection," IEEE Transactions on Image Processing, 2020, vol. 29, pp. 4873-4884. DOI
[17] V. Mandrikova, A. I. Rodomanskaya, B. S. Mandrikova, "Application of the New Wavelet-Decomposition Method for the Analysis of Geomagnetic Data and Cosmic Ray Variations," Geomagnetism and Aeronomy, 2021, vol. 61, no. 4, pp. 492-507. DOI
[18] E. H. Hssayni, N. E. Joudar, M. Ettaouil, "KRR-CNN: kernels redundancy reduction in convolutional neural networks," Neural Computing and Applications, 2021, vol. 34, no. 3, pp. 2443-2454. DOI
[19] G. Rousseau, C. S. Maniu, S. Tebbani, M. Babel, N. Martin, "Minimum-time B-spline trajectories with corridor constraints. Application to cinematographic quadrotor flight plans," Control Engineering Practice, 2019, vol. 89, no. 8, pp. 190-203. DOI
[20] T. Tadakuma, M. Rogers, K. Nishi, M. Joko, M. Shoyama, "Carrier Stored Layer Density Effect Analysis of Radiated Noise at Turn-On Switching via Gabor Wavelet Transform," IEEE Transactions on Electron Devices, 2021, vol. 68, no. 4, pp. 1827-1834. DOI
[21] N. A. Sheikh Ahmad, "Novel special affine wavelet transform and associated uncertainty principles," International Journal of Geometric Methods in Modern Physics, 2021, vol. 18, no. 4, pp. 1801-1803. DOI
[22] T. Patcharoen, A. Ngaopitakkul, "Transient Inrush and Fault Current Signal Extraction Using Discrete Wavelet Transform for Detection and Classification in Shunt Capacitor Banks," IEEE Transactions on Industry Applications, 2020, vol. 56, no. 2, pp. 1226-1239. DOI
[23] X. Liu, J. Ma, D. Chen, L. Y. Zhang, "Real-time Unmanned Aerial Vehicle Cruise Route Optimization for Road Segment Surveillance using Decomposition Algorithm," Robotica, 2021, vol. 39, no. 6, pp. 1007-1022. DOI
[24] P. Sandeep, T. Jacob, "Joint Color Space GMMs for CFA Demosaicking," IEEE Signal Processing Letters, 2019, vol. 26, no. 2, pp. 232-236. DOI
[25] S. Cope, E. Hines, R. Bland, J. D. Davis, B. Tougher, V. Zetterlind, "Multi-sensor integration for an assessment of underwater radiated noise from common vessels in San Francisco Bay," The Journal of the Acoustical Society of America, 2021, vol. 149, no. 4, pp. 2451-2464. DOI
[26] L. L. Luan, "Visual Information Visual Communication Simulation of Image and Graphics Fusion," Computer Simulation, 2021, vol. 38, no. 10, pp. 424-428. URL
[27] S. Einy, C. Oz, Y. D. Navaei, "Network Intrusion Detection System Based on the Combination of Multi-objective Particle Swarm Algorithm-Based Feature Selection and Fast-Learning Network," Wireless Communications and Mobile Computing, 2021, vol. 2021, no. 10, pp. 1-12. DOI
[28] W. D. Li, Z. J. Xie, M. Li, "Research on the algorithm of sweeping robot based on multi-sensor fusion," Electronic Design Engineering, 2021, vol. 29, no. 2, pp. 6. URL
[29] M. Kaucic, "Equity portfolio management with cardinality constraints and risk parity control using multi-objective particle swarm optimization," Computers & Operations Research, 2019, vol. 109, no. 9, pp. 300-316. DOI


Shucheng Wei

Shucheng Wei graduated from Guangxi University, majoring in information management and information systems, and received his master's degree in engineering from Wuhan University in 2016. He works at the School of Mechanical Engineering, Guangxi Technological College of Machinery and Electricity. He has been invited by several software development companies in Nanning, Guangxi, as a senior consultant. He has presided over and participated in a number of scientific research projects of government departments and has published 13 papers in well-known international and domestic journals. His interests include computer application technology, artificial intelligence, and information security.

Hui Wang

Hui Wang graduated from Chongqing University, majoring in architecture and civil engineering, and is a lecturer at the School of Architectural Engineering, Guangxi Technological College of Machinery and Electricity. She has presided over and participated in a number of scientific research projects of government departments, has published 8 papers in well-known international and domestic journals, and has won 10 honors above the university level. Her interests include civil engineering, geotechnical engineering, and architectural engineering.