A novel style transfer algorithm, AMS-Cycle-GAN, is designed and studied in this section. The algorithm introduces position normalization and moment shortcut modules to preserve the feature information of the input image, thereby achieving better image style transfer results.
3.1. Generator Network Structure Design
The purpose of designing AMS-Cycle-GAN is to address the limitations of existing image style transfer algorithms and to meet the varied needs of practical applications. Traditional style transfer models lack consistency in both content and style when generating images, and their generation efficiency is low. AMS-Cycle-GAN was designed and studied to address these issues. In addition, the design takes into account the potential needs of image style transfer in fields such as digital art creation, advertising design, and film and animation production, aiming to improve the visual quality of generated images and to ensure consistency of image style while generating images efficiently.
The generator network of the AMS-Cycle-GAN image style transfer algorithm is designed to provide an improved style transfer effect. To achieve this goal, the network is optimized in three aspects. First, PONO-MS is used alternately in the encoder-decoder part, which not only effectively retains the feature information of the input image but also improves the optimization and convergence of the network. Second, the loss function is the same as that of the Cycle-DPN-GAN model, including the generative adversarial loss, cycle consistency loss, and so on. Finally, a channel-based attention mechanism is introduced into the discriminator to better focus on important content during transfer. The whole network contains two generators and two discriminators, realizing bidirectional data generation.
The PONO-MS module in the generator is a variant of the skip connection. It takes the mean and standard deviation extracted by the encoder and injects them directly into the corresponding decoder layer, so that the feature maps produced by the decoder have statistics similar to those of the corresponding encoder layer. Fig. 1 shows the overall structure.
Fig. 1. Overall network structure.
Fig. 2. Generator network structure.
In Fig. 1, the $x$ and $y$ regions represent the two image domains, and $G$ and $F$ both represent generators. An image is mapped from the $x$ region to the $y$ region by generator $G$, and then from the $y$ region back to the $x$ region by generator $F$. A PONO-MS module is introduced between the encoder and decoder of each generator to effectively preserve the style information of the image. The generator network consists of an encoder, a converter, and a decoder. The encoder captures feature information from the input image through position normalization and injects it directly into the moment shortcut network layer of the decoder, which effectively improves network optimization and convergence. The converter keeps the size of the feature map unchanged, transfers the features of the art image, and uses reflection padding to fill the boundary of the input tensor. The generator network structure is shown in Fig. 2.
The moment shortcut network layer of the decoder receives the data normalized by PONO, converts the feature map from $64\times64\times256$ to $256\times256\times64$ through two transposed convolution layers, and maps the content and style information of the image back to pixels. Finally, a network layer with a $7\times7$ convolution kernel outputs the generated three-channel $256\times256$ image. The role of the PONO-MS (position normalization and moment shortcut) module is to capture the feature mean and standard deviation extracted from the input image and inject them directly into the moment shortcut network layer of the decoder, thereby improving the convergence of network training. Fig. 3 shows a schematic diagram of the PONO-MS module.
Fig. 3. Schematic diagram of PONO-MS module.
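As a shape-level illustration of PONO-MS (a minimal sketch; the $64\times64\times256$ feature size is taken from the decoder description above, everything else is an illustrative assumption), the PONO statistics hold one value per spatial position:

import tensorflow as tf

# A hypothetical encoder feature map: batch of 1, 64x64 spatial grid, 256 channels.
features = tf.random.normal([1, 64, 64, 256])
# PONO computes moments across the channel axis at each spatial position.
mean, var = tf.nn.moments(features, axes=[-1], keepdims=True)
std = tf.sqrt(var + 1e-8)
print(mean.shape, std.shape)   # both (1, 64, 64, 1)
normalized = (features - mean) / std
# The moment shortcut later re-injects mean and std into the decoder
# feature map of the same spatial size, so its statistics match the encoder's.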
All calculation formulas are derived from the GAN algorithm and the PatchGAN discriminator network structure. The calculation of the MS layer is shown in Eq. (1).
In Eq. (1), $F(x)$ represents the intermediate-layer modeling, and $\beta $ and $\gamma $ represent the mean and standard deviation parameters. The extracted mean and standard deviation are used as the new parameters, as shown in Eq. (2).
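Since the equation bodies are not reproduced here, a plausible form of Eqs. (1) and (2), based on the standard position normalization and moment shortcut formulation (the exact notation of the original may differ), is
$MS(F(x)) = \gamma \, F(x) + \beta$
$\beta = \mu (x) = \frac{1}{C}\sum_{c=1}^{C} x_{c}, \qquad \gamma = \sigma (x) = \sqrt{\frac{1}{C}\sum_{c=1}^{C}\left(x_{c}-\mu (x)\right)^{2} + \epsilon }$
where $\mu (x)$ and $\sigma (x)$ are computed over the channel dimension at each spatial position and $\epsilon $ is a small constant for numerical stability.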
3.2. Discriminator Network Structure Design
The discriminator network follows the basic PatchGAN model with a $70\times70$ receptive field, and an attention mechanism is introduced into the fourth and fifth convolution layers. This mechanism helps the network focus on the key pixel regions in the image while ignoring or filtering out irrelevant parts. The attention mechanism further improves the refinement ability of the generative adversarial model. The discriminator network structure is shown in Fig. 4.
Fig. 4. Discriminator network structure.
The attention mechanism is realized through the convolutional block attention module (CBAM), a method for improving the performance of convolutional networks that mainly models the relationships between feature channels. When the down-sampled feature map has size $1\times31\times31\times512$, the intermediate feature map gathers spatial information through global average pooling (GAP) and global max pooling (GMP), generating two different descriptors that are passed to a shared network. The feature map is defined as in Eq. (3).
In Eq. (3), $R$ denotes the space of vector elements, $C$ the number of channels, $H$ the height of the feature map, and $W$ the width of the feature map.
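Given the symbols defined above, Eq. (3) is presumably the statement that the intermediate feature map lives in the space
$F \in \mathbb{R}^{C\times H\times W}$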
The shared network is a multi-layer perceptron with a hidden layer, which computes the importance of each channel of the feature map. After the weight of each channel is obtained, the weights are applied to the corresponding channels of the intermediate feature map. This process is shown in Eq. (4).
Then Eq. (5) can be obtained.
In Eqs. (4) and (5), $F_{gap} $ represents the global average pooling feature map, $F_{gmp} $ represents the global max pooling feature map, $W_{0} $ and $W_{1} $ represent the shared weights, $GlobalAvgPool(F)$ represents global average pooling, $GlobalMaxPool(F)$ represents global max pooling, $MLP$ represents the multi-layer perceptron, and $concat$ represents concatenation of feature maps; both weights satisfy the condition shown in Eq. (6).
In Eq. (6), $\frac{C}{r} $ represents the reduced number of channels, and $r$ represents the reduction ratio.
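A plausible reading of Eqs. (4)-(6), assuming a CBAM-style channel attention with the concatenation described here (the exact original expressions may differ), is
$F_{gap} = MLP(GlobalAvgPool(F)) = W_{1}(W_{0}(GlobalAvgPool(F)))$
$F_{gmp} = MLP(GlobalMaxPool(F)) = W_{1}(W_{0}(GlobalMaxPool(F))), \qquad F' = concat(F_{gap}, F_{gmp})$
$W_{0}\in \mathbb{R}^{\frac{C}{r}\times C}, \qquad W_{1}\in \mathbb{R}^{C\times \frac{C}{r}}$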
After this operation, a weighted GAP feature map and a weighted GMP feature map are produced. The two feature maps are then combined: after the concat operation, the number of feature map channels is doubled, and a network layer with a $1\times1$ convolution kernel is used to reduce the number of channels back to 512, followed by a leaky rectified linear unit (Leaky ReLU) activation function. This completes the construction of the channel-based attention mechanism module. In the discriminator of the AMS-Cycle-GAN model, a network layer with a $4\times4$ convolution kernel is used for the output, and finally a single-channel prediction map of size $1\times30\times30\times1$ is generated. The internal structure of the discriminator is shown in Fig. 5.
Fig. 5. Discriminator internal structure information.
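The following is a minimal TensorFlow sketch of the channel attention block as described above (GAP and GMP descriptors, a shared MLP, channel re-weighting, concatenation, a $1\times1$ convolution back to 512 channels, and Leaky ReLU); the layer names and the reduction ratio of 16 are illustrative assumptions rather than the exact model configuration.

import tensorflow as tf
from tensorflow.keras import layers

def channel_attention_block(x, channels=512, reduction=16):
    # Shared two-layer MLP (W0, W1) applied to both pooled descriptors.
    w0 = layers.Dense(channels // reduction, activation='relu')
    w1 = layers.Dense(channels)
    # Global average pooling and global max pooling descriptors.
    gap = w1(w0(layers.GlobalAveragePooling2D()(x)))   # shape (batch, channels)
    gmp = w1(w0(layers.GlobalMaxPooling2D()(x)))       # shape (batch, channels)
    # Apply the channel weights to the intermediate feature map.
    gap_weighted = x * layers.Reshape((1, 1, channels))(gap)
    gmp_weighted = x * layers.Reshape((1, 1, channels))(gmp)
    # Concatenation doubles the channel count; a 1x1 convolution reduces it
    # back to 512 channels, followed by Leaky ReLU.
    merged = layers.Concatenate(axis=-1)([gap_weighted, gmp_weighted])
    out = layers.Conv2D(channels, (1, 1), padding='same')(merged)
    return layers.LeakyReLU(alpha=0.2)(out)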
To make model training more stable and satisfy the 1-Lipschitz condition, spectral normalization (SN) is applied to multiple convolution layers of the discriminator during training. The total generator loss of the AMS-Cycle-GAN model is shown in Eq. (7).
In Eq. (7), $L_{Generator} $ represents the total loss function of the generators, $L_{lsgan\_Generator} $ represents the generative adversarial (LSGAN) loss, $\lambda _{1} L_{identity} (G,F)$ represents the weighted identity loss, $\lambda _{2} L_{cycle} (G,F,X,Y)$ represents the weighted cycle consistency loss, $\lambda _{3} L_{MS\text{-}SSIM} (G,F)$ represents the weighted MS-SSIM loss, and $G$ and $F$ represent the generators.
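Based on the terms listed above, Eq. (7) presumably takes the form
$L_{Generator} = L_{lsgan\_Generator} + \lambda _{1} L_{identity} (G,F) + \lambda _{2} L_{cycle} (G,F,X,Y) + \lambda _{3} L_{MS\text{-}SSIM} (G,F)$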
The discriminator loss is given in Eq. (8).
In Eq. (8), $L_{discriminators} $ represents the loss function of the discriminators, and $\min _{D_{Y}} L_{lsgan} (G,D_{Y} ,X,Y)+\min _{D_{X}} L_{lsgan} (F,D_{X} ,Y,X)$ represents the game between the generators and the discriminators. On this basis, Eq. (9) can be obtained.
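From the terms described above, plausible forms of Eqs. (8) and (9) (the exact original expressions may differ) are
$L_{discriminators} = \min _{D_{Y}} L_{lsgan} (G,D_{Y} ,X,Y) + \min _{D_{X}} L_{lsgan} (F,D_{X} ,Y,X)$
$L(G,F,D) = L_{Generator} + L_{discriminators}$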
In Eq. (9), $G$ and $F$ represent the generators, $D_{Y} $ and $D_{X} $ represent the discriminators, $L_{lsgan} $ represents the generative adversarial process, $L_{Generator} $ represents the generator loss function, and $L_{discriminators} $ represents the discriminator loss function. $L(G,F,D)$ is the overall objective to be minimized. $MS\text{-}SSIM(x,y)$ is shown in Eq. (10).
In Eq. (10), $M$ is set to 5, $MS\text{-}SSIM(x,y)$ represents the multi-scale structural similarity index, $l_{M} (x,y)$ represents the luminance comparison function, $c_{j} (x,y)$ represents the contrast comparison function, $s_{j} (x,y)$ represents the structure comparison function, and $\alpha $, $\beta $, $\gamma $ represent the weights of the respective comparison functions; the other parameters are set as in Eq. (11).
The setting of the $\alpha $ parameter is shown in Eq. (12).
Thus, the MS-SSIM loss function $L_{MS\text{-}SSIM} $ can be obtained, as shown in Eq. (13).
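For reference, the standard multi-scale SSIM index and the loss derived from it (a plausible form of Eqs. (10) and (13); the original weights and notation may differ) are
$MS\text{-}SSIM(x,y) = \left[l_{M} (x,y)\right]^{\alpha _{M}} \prod _{j=1}^{M}\left[c_{j} (x,y)\right]^{\beta _{j}}\left[s_{j} (x,y)\right]^{\gamma _{j}}$
$L_{MS\text{-}SSIM} (G,F) = 1 - MS\text{-}SSIM(x,y)$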
$L_{identity} (G,F)$ in Eq. (7) is shown in Eq. (14).
In Eq. (14), $E_{x\sim P_{data(x)} } [\| F(x)-x\| _{1} ]$ represents the identity loss in domain $X$, i.e., the penalty for $F$ changing an image that already belongs to domain $X$, and $E_{y\sim P_{data(y)} } [\| G(y)-y\| _{1} ]$ represents the identity loss in domain $Y$, i.e., the penalty for $G$ changing an image that already belongs to domain $Y$. $L_{cycle} (G,F,X,Y)$ is shown in Eq. (15).
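Putting the two expectation terms above together, and using the standard cycle consistency form (an assumption for Eq. (15)), Eqs. (14) and (15) presumably read
$L_{identity} (G,F) = E_{y\sim P_{data(y)} } [\| G(y)-y\| _{1} ] + E_{x\sim P_{data(x)} } [\| F(x)-x\| _{1} ]$
$L_{cycle} (G,F,X,Y) = E_{x\sim P_{data(x)} } [\| F(G(x))-x\| _{1} ] + E_{y\sim P_{data(y)} } [\| G(F(y))-y\| _{1} ]$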
The AMS-Cycle-GAN model first randomly samples an image from the natural image domain and feeds it into the generator. At the same time, position normalization extracts the mean and standard deviation from image $x$, and this information is passed to the MS network layer. Similarly, an image is randomly sampled from the art image domain and fed into the generator to obtain an output; this process also passes through the PONO-MS module. The generated image and the original art image are then fed into the discriminator. After the attention mechanism module, the importance weights of the different channels are obtained and applied to the corresponding channels of the intermediate feature map. By minimizing the LSGAN loss of the discriminator, the discriminator is optimized and its parameters are updated. Similarly, the generated image and the original natural image are fed into the other discriminator, whose LSGAN loss is minimized to optimize it and update its parameters. Next, the reconstructed images are obtained and the cycle loss is calculated. These updates are repeated over the iterations of the training process. Finally, after training is complete, the model establishes an image conversion model between the natural image domain and the artistic image domain, which can generate images with artistic style. During training, if the number of iterations exceeds a preset threshold, the learning rate gradually decays.
import tensorflow as tf
from tensorflow.keras import layers

def pono(x, epsilon=1e-8):
    # Position normalization (PONO): normalize across the channel axis at each
    # spatial position and return the extracted mean and standard deviation.
    mean, var = tf.nn.moments(x, axes=[-1], keepdims=True)
    std = tf.sqrt(var + epsilon)
    return (x - mean) / std, mean, std

def moment_shortcut(x, mean, std):
    # Moment shortcut (MS): re-inject the moments extracted by PONO into a
    # decoder feature map (broadcast over the channel axis).
    return x * std + mean

def build_generator(input_shape):
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(64, (7, 7), padding='same')(inputs)
    x, mean, std = pono(x)              # PONO on the encoder features
    x = layers.ReLU()(x)
    x = layers.Conv2D(256, (3, 3), padding='same')(x)
    x = layers.ReLU()(x)
    x = moment_shortcut(x, mean, std)   # MS injection into the decoder features
    outputs = layers.Conv2D(3, (7, 7), padding='same', activation='tanh')(x)
    return tf.keras.Model(inputs, outputs)

def build_discriminator(input_shape):
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(64, (4, 4), strides=2, padding='same')(inputs)
    x = layers.LeakyReLU(alpha=0.2)(x)
    x = layers.Conv2D(128, (4, 4), strides=2, padding='same')(x)
    x = layers.LeakyReLU(alpha=0.2)(x)
    outputs = layers.Conv2D(1, (4, 4), padding='same', activation='sigmoid')(x)
    return tf.keras.Model(inputs, outputs)
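Finally, a minimal single-step sketch of the training procedure described above, using the builders defined here; the $256\times256$ input size, optimizers, and loss weights are illustrative assumptions, and the identity, MS-SSIM, attention, and learning-rate decay terms are omitted for brevity.

generator_g = build_generator((256, 256, 3))        # natural -> art
generator_f = build_generator((256, 256, 3))        # art -> natural
discriminator_x = build_discriminator((256, 256, 3))
discriminator_y = build_discriminator((256, 256, 3))

gen_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
disc_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
mse = tf.keras.losses.MeanSquaredError()             # LSGAN uses a least-squares loss
mae = tf.keras.losses.MeanAbsoluteError()            # L1 for the cycle consistency loss

@tf.function
def train_step(real_x, real_y, lambda_cycle=10.0):
    with tf.GradientTape(persistent=True) as tape:
        fake_y = generator_g(real_x, training=True)      # x -> y
        fake_x = generator_f(real_y, training=True)      # y -> x
        cycled_x = generator_f(fake_y, training=True)    # x -> y -> x
        cycled_y = generator_g(fake_x, training=True)    # y -> x -> y

        disc_real_y = discriminator_y(real_y, training=True)
        disc_fake_y = discriminator_y(fake_y, training=True)
        disc_real_x = discriminator_x(real_x, training=True)
        disc_fake_x = discriminator_x(fake_x, training=True)

        # LSGAN losses: generators push fakes toward the "real" label (1);
        # discriminators separate real (1) from fake (0).
        gen_loss = (mse(tf.ones_like(disc_fake_y), disc_fake_y)
                    + mse(tf.ones_like(disc_fake_x), disc_fake_x)
                    + lambda_cycle * (mae(real_x, cycled_x) + mae(real_y, cycled_y)))
        disc_loss = (mse(tf.ones_like(disc_real_y), disc_real_y)
                     + mse(tf.zeros_like(disc_fake_y), disc_fake_y)
                     + mse(tf.ones_like(disc_real_x), disc_real_x)
                     + mse(tf.zeros_like(disc_fake_x), disc_fake_x))

    gen_vars = generator_g.trainable_variables + generator_f.trainable_variables
    disc_vars = discriminator_x.trainable_variables + discriminator_y.trainable_variables
    gen_optimizer.apply_gradients(zip(tape.gradient(gen_loss, gen_vars), gen_vars))
    disc_optimizer.apply_gradients(zip(tape.gradient(disc_loss, disc_vars), disc_vars))
    del tape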