ARTWORK STYLE TRANSFER MODEL USING DEEP LEARNING APPROACH

Art in general, and fine arts in particular, play a significant role in human life, entertaining, dispelling stress, and motivating creativity in specific ways. Many well-known artists have left a rich treasure of paintings for humanity, preserving their exquisite talent and creativity through unique artistic styles. In recent years, a technique called 'style transfer' has allowed computers to apply famous artistic styles to a picture or photograph while retaining the shape of the image, creating superior visual experiences. The basic model of that process, named 'Neural Style Transfer,' was introduced promisingly by Leon A. Gatys; however, it suffers from several limitations in output quality and execution time, making it challenging to apply in practice. Building on that basic model, this paper proposes an image transformation network that generates higher-quality artwork and scales to a larger number of images. The proposed model significantly shortens the execution time and can be deployed in real-time applications, providing promising results and performance. The outcomes are auspicious and can serve as a reference model for color grading or semantic image segmentation, and future research will focus on improving its applications.


Introduction
Nowadays, tremendous efforts have been put into accelerating different techniques that assist computers in performing human-like tasks such as classification and communication, driven by fast-growing advances in artificial intelligence. Most conventional deep learning techniques contribute to improving productivity and quality of life, and artwork is one of the promising areas. Preserving artistic features thereby becomes noteworthy for both human duties and technology, as artworks represent heritage and culture through time [Doulamis and Varvarigou, 2012]. The first image transformation algorithm was introduced to change a picture's style while preserving its shape, and it has drawn substantial attention [Gatys et al., 2016b]. The term 'neural style transfer' denotes a machine learning technique for converting an image from one style to another by blending a content image with a style reference image, such as an artwork by a renowned artist. The fundamental purpose is to generate an image that resembles the content image but appears artistically drawn in the style of the reference image, as illustrated in Figure 1.
Although several researchers have been involved in this trend, there is still much room for development and for resolving remaining limitations [Liu et al., 2017; Chen et al., 2018; Gatys et al., 2016a]. Within the scope of this study, the authors point out those remaining limitations in image transformation models and propose methods to resolve them. Moreover, a web application was implemented to apply the built model, attempting to create exquisite art paintings. The approach of this study is divided into three steps: building a model to generate uncomplicated images and recording observations about its drawbacks; proposing an image conversion network that enhances the pattern conversion model based on the initial model; and applying the transformation model to build an artistic image generator website. The proposed model significantly shortens the execution time and can serve as a reference model for real-time applications.

Literature Review
In recent years, many researchers have addressed style image transformation problems by training convolutional neural networks with per-pixel loss functions [Selim et al., 2016; Zhao et al., 2020]. A typical example is the 'Neural Style Transfer' model proposed by Leon A. Gatys [Gatys et al., 2016b], which uses the content image and reference image directly for training, without any large training dataset. The model produced new images of high perceptual quality that blend the appearance of famous artworks into the content of an arbitrary photograph, offering insights into deep image representations. Although this model has unavoidable shortcomings, it is a prerequisite for later studies on image generation methods [Chen et al., 2018; Jing et al., 2019]. The AdaIN model, for example, was developed to match the mean and variance of a content image with those of a reference image to remodel features [Huang and Belongie, 2017]. A patch-match procedure was introduced in the style-swap model [Chen and Schmidt, 2016] for replacing content features with the nearest matching style features. Li et al. [Li et al., 2017] proposed a multilevel stylization that applies whitening and coloring transforms recursively, improving output quality while preserving the content structure. The style transfer problem can be widely applied in other neural-network image processing applications, such as text classification, semantic parsing, and information extraction. The deepfake image approach [?], for example, might be considered a solution to comparable facial feature transfer challenges. Moreover, these studies reveal a trade-off between content and style losses. Therefore, many researchers have also considered image super-resolution and image segmentation methods.
Image super-resolution is a classic problem in image processing [Bazhanov et al., 2018], and a number of researchers have been involved in this area [Cheng et al., 2019; Ma et al., 2020]. Yang et al. [Yang et al., 2014] provided a general review of previously standard techniques when applying convolutional neural networks in their research. Other output quality improvement methods have been suggested, such as the model of Chao Dong et al. [Dong et al., 2015], which takes a low-resolution content image through a convolutional neural network [Andreev and Maksimenko, 2019] and produces a high-resolution image. While their neural network architecture is uncomplicated and lightweight, it achieves high-quality image recovery and runs quickly enough for practical applications. These studies are the driving force behind this research to produce transition images as high in quality as conventional image super-resolution. Image segmentation methods divide an image into many different regions [Long et al., 2015]. Image segmentation shares an objective with the object detection problem: detecting the image areas containing objects and labeling them appropriately [Noh et al., 2015]. Although image segmentation requires a higher level of detail, in return the algorithm gains a deeper understanding of the image's content. It simultaneously reveals the position of each object in the image, the shape of the object, and which object each pixel belongs to [Zheng et al., 2015]. This method generates labels for the regions of an input image by running it through a fully convolutional neural network trained with a per-pixel loss function.
The objective of this research is to develop a model that provides smooth transitions and results as sharp as those in the study of Chao Dong et al. [Dong et al., 2015]. This study also proposes an image transformation network inspired by the studies of Long J. [Long et al., 2015] and Noh H. [Noh et al., 2015], improving the quality of the output image and shortening the transition time. The results are promising for transferring artwork styles to other images, contributing to applications in many natural science fields, including materials science and physics.

Standard Image Style Conversion Model
In the standard image style conversion model proposed by Leon A. Gatys et al. [Gatys et al., 2016b], its input data consists of a content image and a reference image. The resulting (target) image is initialized as white noise before applying convolutional neural networks (CNN) to transform the white noise image closer to the content and visual style.
As illustrated in Figure 2, this standard model comprises two steps: (1) training a convolutional neural network to extract features; (2) calculating the loss functions of the content image and the reference image to update the target image closer to both. The result is obtained when the total loss is minimal.

Convolutional Neural Network Feature Extraction
The standard model applies the VGG-16 pre-trained neural network to extract the features of the content image and the reference image. Despite using the same VGG-16 network, content images and reference images have different extracted features. When an image passes through the convolutional neural network, the higher-order layers capture the 'high-value' content about the objects and their arrangement in the input image without exact pixel values. In contrast, the lower layers reproduce the exact pixel values of the original image. Therefore, features extracted from higher-order layers are considered image content features, while features at lower layers are considered image style features, as shown in Figure 3. Based on the study of Leon A. Gatys et al. [Gatys et al., 2016b], the following layers of the VGG-16 network were used to extract the features of the: − Content image: conv5_2; − Reference image: conv1_1, conv2_1, conv3_1, conv4_1, conv5_1.
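The layer-tapping idea above can be sketched in a few lines of PyTorch. A tiny `nn.Sequential` stands in for VGG-16 here so the example stays self-contained; with torchvision one would substitute `vgg16(weights=...).features` and the indices of conv1_1 … conv5_2. The helper name and layer choices are illustrative, not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

def extract_features(image, model, layer_indices):
    """Run `image` through a sequential `model`, collecting the activations
    produced at each index in `layer_indices` (content/style taps)."""
    feats, x = {}, image
    with torch.no_grad():
        for i, layer in enumerate(model):
            x = layer(x)
            if i in layer_indices:
                feats[i] = x
    return feats

# Stand-in for VGG-16's `features` module: two conv layers with a ReLU between.
toy_vgg = nn.Sequential(
    nn.Conv2d(3, 4, 3, padding=1),   # "lower" layer -> style-like features
    nn.ReLU(),
    nn.Conv2d(4, 8, 3, padding=1),   # "higher" layer -> content-like features
)
feats = extract_features(torch.randn(1, 3, 8, 8), toy_vgg, {0, 2})
```

With the real VGG-16, the same loop would collect `conv5_2` for content and `conv1_1`–`conv5_1` for style.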

Building Loss Function
This section calculates the loss functions of the content image and the reference image to update the target image closer to the content and reference image, as presented below.

Loss Function for Recreating Content
To extract content features, the content image was passed through the filters of a convolutional neural network. Each layer l of the CNN with $N_l$ filters generates $N_l$ feature maps of size $M_l$ ($M_l$ = width × height of the feature map). Therefore, each layer l stores a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$, in which $F^l_{ij}$ is the activation of the $i$-th filter at position $j$ in layer $l$. Let $\vec{p}$ and $\vec{x}$ be the input and output images, and $P^l$ and $F^l$ their respective feature representations in layer $l$. The loss function based on the content features of the input and output images is:

$L_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$ (1)
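A minimal numeric sketch of Eq. (1), with NumPy arrays standing in for the VGG feature matrices $F^l$ and $P^l$ (the function name is illustrative):

```python
import numpy as np

def content_loss(F, P):
    """Eq. (1): 0.5 * sum over filters i and positions j of (F_ij - P_ij)^2.
    F, P: arrays of shape (N_l, M_l) -- feature matrices of the output and
    content images at layer l."""
    return 0.5 * np.sum((F - P) ** 2)

# Toy example: six activations each differing by 1 -> loss = 0.5 * 6 = 3.0
loss = content_loss(np.ones((2, 3)), np.zeros((2, 3)))
```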

Loss Function For Recreating Style
Calculating the style loss function is relatively more complicated, although it follows the same principle. Instead of comparing the intermediate outputs of the content image and reference image directly, a Gram matrix is used to compare the two outputs.
Calculating the Gram matrix: after the target image and the reference image pass through a convolutional neural network, $n_C$ feature maps of size $n_H \times n_W$ are obtained. To calculate the similarity of the two images, the $n_C$ feature maps are taken and the scalar product of two feature vectors (each feature map flattened into a feature vector) is computed for each pair of maps. As a result, a Gram matrix of size $n_C \times n_C$ is created, as shown in Figure 4. Given a set of $n_C$ vectors of size $n_H \times n_W$, the Gram matrix $G$ is the matrix of all possible inner products:

$G_{ij} = F_i^{T} F_j$ (2)

The Gram matrix determines the vectors $F_i$ up to isometry and indicates the correlation between filters. The style features are given by the Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, in which $G^l_{ij}$ is the inner product between the vectorized feature maps $i$ and $j$ in layer $l$:

$G^l_{ij} = \sum_{k} F^l_{ik} F^l_{jk}$ (3)

Let $\vec{a}$ and $\vec{x}$ be the reference image and the target output image, and $A^l$ and $G^l$ their corresponding style representations in layer $l$. Then, the contribution of layer $l$ to the total loss is:

$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$ (4)

And the total style loss function is:

$L_{style}(\vec{a}, \vec{x}) = \sum_{l} w_l E_l$ (5)
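The Gram-matrix construction and the per-layer style loss of Eqs. (3)–(4) can be sketched directly in NumPy (function names are illustrative):

```python
import numpy as np

def gram_matrix(feature_maps):
    """feature_maps: (n_C, n_H, n_W) -> Gram matrix of shape (n_C, n_C).
    Each map is flattened to a vector; the Gram matrix holds all pairwise
    inner products of those vectors (Eq. (3))."""
    n_C = feature_maps.shape[0]
    F = feature_maps.reshape(n_C, -1)
    return F @ F.T

def style_layer_loss(G, A, N_l, M_l):
    """Eq. (4): E_l = sum((G - A)^2) / (4 * N_l^2 * M_l^2)."""
    return np.sum((G - A) ** 2) / (4 * N_l ** 2 * M_l ** 2)

# One 2x2 feature map with activations [0, 1, 2, 3]:
# its Gram matrix is the 1x1 matrix [0^2 + 1^2 + 2^2 + 3^2] = [14].
G = gram_matrix(np.arange(4.0).reshape(1, 2, 2))
```

The total style loss of Eq. (5) is then just a weighted sum of `style_layer_loss` over the chosen layers.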

Synthesizing the Loss Function
To convert the style of a work of art $\vec{a}$ onto an image $\vec{p}$, a new image must be synthesized that simultaneously matches the content representation of $\vec{p}$ and the style representation of $\vec{a}$. Therefore, this study uses an aggregate of the content and style losses.
$L_{total} = \alpha L_{content} + \beta L_{style}$ (6)

in which $\alpha$ and $\beta$ are the weights for content reproduction and style reproduction, respectively. The selection of $\alpha$ and $\beta$ also affects the quality of the final output image. In general, the standard model was able to apply the style of the given image to the content image. However, as Figure 5 shows, the quality of the output image remains relatively low even after a long period of training. Color regions in the resulting image were disordered, the image contained noise, and its resolution was severely reduced. Furthermore, on average, a transfer takes 10 to 15 minutes for 100 loops, and this time grows with the number of loops and the resolution of the input image. This is a comparatively long time to achieve an acceptable pattern conversion in real practice.
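The overall Gatys-style procedure, starting from white noise and descending on Eq. (6), can be sketched as a toy optimization loop. A plain identity "feature extractor" stands in for VGG-16 so the example stays self-contained; the function names and the default $\alpha$, $\beta$ values are illustrative assumptions, not the paper's settings.

```python
import torch

def gram(F):
    """F: (C, H*W) feature matrix -> (C, C) Gram matrix."""
    return F @ F.T

def stylize(content_feat, style_feat, alpha=1.0, beta=1e-3, steps=200, lr=0.1):
    """Minimize alpha*L_content + beta*L_style over the target image x,
    which is initialized as white noise (Eq. (6)). Here the 'features' of x
    are x itself, i.e. an identity extractor replaces VGG-16."""
    x = torch.randn_like(content_feat, requires_grad=True)  # white noise init
    opt = torch.optim.Adam([x], lr=lr)
    A = gram(style_feat)                                    # style target
    for _ in range(steps):
        opt.zero_grad()
        l_content = 0.5 * ((x - content_feat) ** 2).sum()   # Eq. (1)
        l_style = ((gram(x) - A) ** 2).sum()                # Eqs. (3)-(5)
        (alpha * l_content + beta * l_style).backward()
        opt.step()
    return x.detach()
```

Because every iteration re-optimizes the image from scratch, this loop is exactly what makes the standard model slow; the proposed model replaces it with a single feed-forward pass.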

Results
In the upcoming section, the study proposes a new style conversion model based on the standard model to improve image generation speed and image quality.

Proposed Image Style Conversion Model
From the standard image style conversion model outlined above, this section proposes a new image style conversion model. The concept, architecture, and experiment results of the proposed model are presented below.

The Concept
The standard conversion model initializes the target image as white noise and then iteratively matches it to the content image and reference image, leading to a time-consuming process and low-quality output. Accordingly, the concept of this model is to propose an image conversion network that can learn features similar to those of a content image. When an image needs to be transformed, its features can be obtained by the model faster and more accurately, thereby reducing the conversion time to the target image, as indicated in Figure 6.

Model Architecture

Image Transformation Network

The image transformation network is a convolutional neural network with residual blocks, parameterized by weights W; it converts the input image x to the output image ŷ through the mapping ŷ = f_W(x). Each loss function $l_i(\hat{y}, y_i)$ measures the difference between the output image ŷ and a target image $y_i$. The image transformation network was trained using stochastic gradient descent to minimize a weighted combination of the loss functions:

$W^{*} = \arg\min_{W} \; \mathbb{E}_{x} \left[ \sum_{i} \lambda_i \, l_i(f_W(x), y_i) \right]$

Using the above loss function, the details of the convolutional neural network used in the proposed image transformation network are presented in Table 1.
There are a few points of interest in this image transformation network model: strided-convolution downsampling and upsampling blocks were used instead of pooling layers. The network body consists of five residual blocks, following the architecture in the research of Kaiming H. et al. [He et al., 2016]. All convolutional layers (outside the residual blocks) are followed by batch normalization and ReLU non-linearities, as presented in Figure 7.

The Function of the Image Conversion Network Layers
Downsampling with strided convolutions: The first convolution in the proposed network has stride = 1, but the following two layers have stride = 2. This means that each time the filter is relocated, it shifts 2 pixels instead of 1, and the output has size n/2 × n/2. A convolution layer with stride = 2 halves the input size, which is called downsampling. Because the input is downsampled, each pixel of the output results from a computation involving a larger number of pixels from the original input image. This approach lets the kernel filters access a much larger portion of the original input image without increasing the kernel size. Applying the conversion pattern over the entire image this way provides each filter with more information about the original input, making the conversion network perform better.
Upsampling with strided convolutions: Upsampling is the opposite of downsampling. After the two downsampling layers, each spatial dimension of the image is reduced to 1/4 of the original. The desired output of the conversion network is a stylized image with the same resolution as the original content image. To achieve this, two fractionally strided convolution layers with stride = 1/2 were applied. These layers, which increase the output image size, perform the upsampling.
Residual blocks [He et al., 2016]: Between the downsampling and upsampling layers, there are five residual blocks. These consist of ordinary convolution layers, but unlike conventional convolution layers, the block's input directly contributes to its output.
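The three building blocks described above (stride-2 downsampling, five residual blocks, stride-1/2 upsampling, with batch norm + ReLU after each non-residual convolution) can be sketched as a compact PyTorch module. The channel widths follow the common Johnson et al. setup and are assumptions, not Table 1 verbatim; stride = 1/2 is realized with a transposed convolution of stride 2.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs; the block input feeds directly into the output."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)

class TransformNet(nn.Module):
    def __init__(self):
        super().__init__()
        def conv(cin, cout, k, s):      # conv + batch norm + ReLU
            return nn.Sequential(nn.Conv2d(cin, cout, k, stride=s, padding=k // 2),
                                 nn.BatchNorm2d(cout), nn.ReLU())
        def upconv(cin, cout):          # "stride = 1/2": transposed conv, stride 2
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, 3, stride=2, padding=1, output_padding=1),
                nn.BatchNorm2d(cout), nn.ReLU())
        self.net = nn.Sequential(
            conv(3, 32, 9, 1),                          # stride-1 entry conv
            conv(32, 64, 3, 2), conv(64, 128, 3, 2),    # two stride-2 downsamples
            *[ResidualBlock(128) for _ in range(5)],    # network body
            upconv(128, 64), upconv(64, 32),            # two stride-1/2 upsamples
            nn.Conv2d(32, 3, 9, padding=4))             # back to RGB
    def forward(self, x):
        return self.net(x)
```

One forward pass maps an input image to a same-resolution stylized output, which is what replaces the standard model's per-image optimization loop.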

Constructing Loss Function
Like the standard model, the proposed model also utilizes the VGG16 pre-trained neural network [Van Hieu and Hien, 2020a; Van Hieu and Hien, 2020b] to measure the loss in the input images' content or artistic style. However, the content-feature reconstruction loss function was computed at layer relu2_2, and the style-feature reconstruction loss function was calculated at layers relu1_2, relu2_2, relu3_3, and relu4_3. Feature Reconstruction Loss: instead of forcing the pixels of the output image ŷ = f_W(x) to exactly match the pixels of the target image, they are encouraged to have the same feature representations as computed by the loss network φ. Let $\varphi_j(x)$ be the activations of the $j$-th layer of the network φ when processing image x; if $j$ is a convolutional layer, $\varphi_j(x)$ will be a feature map of shape $C_j \times H_j \times W_j$. The feature reconstruction loss is the (normalized, squared) Euclidean distance between the feature representations:

$l^{\varphi, j}_{feat}(\hat{y}, y) = \frac{1}{C_j H_j W_j} \left\| \varphi_j(\hat{y}) - \varphi_j(y) \right\|_2^2$

Style Reconstruction Loss: the idea of the style loss function is to compare the Gram matrices of the feature outputs, as in the standard model.
The Gram matrix here is normalized by the feature-map volume:

$G^{\varphi}_j(x)_{c,c'} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \varphi_j(x)_{h,w,c} \, \varphi_j(x)_{h,w,c'}$

Then, the style loss function is the squared Frobenius norm of the difference between the Gram matrices of the output image and the input image:

$l^{\varphi, j}_{style}(\hat{y}, y) = \left\| G^{\varphi}_j(\hat{y}) - G^{\varphi}_j(y) \right\|_F^2$

To perform style reproduction over a set $J$ of layers instead of a single layer $j$, let $l^{\varphi, J}_{style}(\hat{y}, y)$ be the sum of the losses over $j \in J$. It can clearly be seen from Figures 8 and 9 that the improved conversion network of the proposed model is significantly better than the standard model in general. The improved conversion is many times faster than the basic transition model, and small images can even be processed in real time. The artwork generated by the proposed model is also more artistic and sharper than that of the first model. Moreover, once a style has been trained, the proposed model can apply that style to any content image in later use.
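Both perceptual losses above can be sketched with NumPy arrays standing in for the activations $\varphi_j(x)$ of shape $(C_j, H_j, W_j)$ (function names are illustrative):

```python
import numpy as np

def feature_loss(phi_y_hat, phi_y):
    """Normalized squared Euclidean distance between feature maps."""
    C, H, W = phi_y.shape
    return np.sum((phi_y_hat - phi_y) ** 2) / (C * H * W)

def gram(phi):
    """Gram matrix normalized by the feature-map volume C*H*W."""
    C, H, W = phi.shape
    F = phi.reshape(C, H * W)
    return (F @ F.T) / (C * H * W)

def style_loss(phi_y_hat, phi_y):
    """Squared Frobenius norm of the Gram-matrix difference."""
    diff = gram(phi_y_hat) - gram(phi_y)
    return np.sum(diff ** 2)
```

Summing `style_loss` over the chosen layers relu1_2, relu2_2, relu3_3, and relu4_3 gives the set-of-layers version $l^{\varphi, J}_{style}$.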

Dataset Preparation and Configuration
In this study, the datasets used for implementation include the Flickr8k dataset (8,100 images, 1 GB) and the Microsoft COCO dataset (80,000 images, 13 GB). The content of these image datasets is mostly scenes of diverse environments such as mountains, rivers, and trees, as illustrated in Figure 10. Given the diversity of these datasets, the model was expected to be able to stylize any content image, even one it has never been trained on. The Flickr8k dataset was trained on a Google Colab GPU (Tesla K80), while the Microsoft COCO dataset was trained on a personal computer with an Intel Core i7-4700MQ. The standard conversion model was configured with the loss function shown in Figure 11 and the following characteristics.
And the proposed conversion model was configured with the loss function shown in Figure 12 and the following characteristics.

Results
Figures 13 and 14 and Table 2 showcase the results and comparisons of the models across different criteria, from which the advantages and disadvantages of each model can be concluded. Traditional metrics were used to evaluate output image quality: peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [Wang et al., 2004], both of which reflect the human perception of image quality [Hanhart et al., 2013; Huynh-Thu and Ghanbari, 2008; Sheikh et al., 2006]. The goal of these analyses was not to achieve the best PSNR or SSIM results but simply to show differences in output image quality and characteristic loss from the original image between the trained models.
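PSNR follows directly from its definition, $10 \log_{10}(MAX^2 / MSE)$; this short sketch computes it for 8-bit images. SSIM is considerably more involved and in practice would come from a library such as scikit-image's `structural_similarity`; the paper does not name its implementation, so that choice is an assumption.

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.
    Higher is better; identical images give infinity."""
    mse = np.mean((original.astype(np.float64)
                   - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Worst case for 8-bit images: every pixel off by the full range (255),
# so MSE = 255^2 and PSNR = 10 * log10(1) = 0 dB.
worst = psnr(np.zeros((4, 4)), np.full((4, 4), 255.0))
```

As the Discussion notes, PSNR averages the noise over the whole picture, which is why it can rank the standard model close to (or above) the proposed model even when the latter's perceived quality is better.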

Discussion
It is noticed from Figure 14 that the PSNR indices of the images from the standard and proposed models are relatively close; sometimes the standard model is higher than the proposed model (the higher the PSNR, the less the interference effect). However, this index does not precisely determine the quality of the output image. Specifically, for images from the standard model, the noise was spread across the whole picture, whereas in the proposed model it was concentrated in the picture's unimportant parts; therefore, the image quality of the proposed model was still better than that of the standard model. Furthermore, the proposed model's output always retains the main characteristics of the original image, performing much better than the standard model. This is reflected in the SSIM of the proposed model, which was always higher than that of the standard one. Figure 15 shows that, with the same content and reference images, the proposed model produces markedly better image quality than the standard model, which demonstrates the effectiveness of the additional conversion network. Figure 16 shows that, for the proposed model, training on a larger dataset gives superior results compared to training on the small dataset (the Microsoft COCO dataset is 10 times larger than the Flickr8k dataset). This supremacy once again confirms the efficiency of the proposed model's image conversion network when trained on a larger dataset.

Artwork Generator Website
To create exquisite artistic paintings and demonstrate that the proposed image transformation model can be applied in practice, an artistic photo generator website was built with the following main functions.

Conclusion and Future Work

In this paper, a new image conversion model was proposed and compared with a previous conversion architecture, using a convolutional neural network for transformation and a pre-trained VGG16 model for measuring the loss in the input images' content. Despite substantial prior efforts [Huang and Belongie, 2017; Chen and Schmidt, 2016; Bourached et al., 2021], this research introduces a high-efficiency model for image style transfer with a focus on large-scale and time-efficient operation. The proposed model generates remarkable output image quality, performing much better than the standard model while retaining the main characteristics of the original image in both the PSNR and SSIM indices. Moreover, it reduced the execution time by more than 90%, which is significant for real-time execution. These outcomes open up new avenues for future research and can serve as a crucial reference for future image style transfer systems. The new image conversion model can be applied not only in the arts but also in various areas such as geotechnics, civil engineering, materials science, and physics.
Future work may focus on improving applications of the model, such as the ability to convert more artwork styles in a single training session or the flexibility to adjust the degree of transformation. This is also a promising image conversion model that can be applied in color grading or semantic image segmentation.