
  1. (Dept. of Computer Engineering, Myongji University.)




1. Introduction

ML is becoming increasingly popular and is used in various applications such as natural language processing (1), self-driving cars, computer vision, healthcare, and user behavior analytics (2), among many others. ML algorithms solve various data-specific problems such as classification, regression, and clustering. Each algorithm has a number of hyperparameters that determine its structure and help the algorithm learn during training. These hyperparameters directly impact the learning process, and their selection has direct control over the performance of the models. Therefore, researchers focus on designing an ML algorithm with a proper set of hyperparameters (3).

Hyperparameter Optimization (HPO) is the process of choosing a suitable set of hyperparameters to tune ML models. It also involves the selection of optimal values to enhance the accuracy and effectiveness of the models. Optimizing these hyperparameters helps prevent overfitting, which minimizes the cost and speeds up the computation. Traditionally, the selection of hyperparameters relies on the experience of individuals. However, manual optimization is unreliable and costly because it requires a lot of reasoning for complex problems (4). To minimize the cost and the computing time, it is essential to develop automatic HPO approaches that automate the tuning process.

Automatic HPO approaches improve the performance of the models, lead to lightweight models with fewer parameters, and assist in selecting appropriate hyperparameters (5), (6). The most common HPO approaches are grid search (GS) and random search (RS), which attempt to find the best hyperparameters to minimize the loss function (7), (8). Bayesian Optimization (BO) (9) determines the next configuration of hyperparameters based on previous search results and avoids evaluations that do not directly impact the model's performance. Unlike GS and RS, BO finds the optimal hyperparameters within a limited number of evaluations. The Tree-structured Parzen Estimator (TPE) is an iterative algorithm that optimizes conditional hyperparameters based on historical measurements (7). It finds the configuration of hyperparameters that achieves the best performance. Recently, automatic HPO frameworks have been developed which consist of more than one optimization algorithm. For example, Optuna provides GS, RS, TPE, Hyperband, and pruning optimization algorithms. Automatic HPO frameworks provide a user-friendly interface for the implementation of optimization methods and help increase the models' efficiency and accuracy. Thus, they solve large-scale problems efficiently.

This paper evaluates the comparative performance of the latest HPO frameworks: BO, Optuna, HyperOpt, and Keras Tuner (9)-(12). In order to find an optimal combination of hyperparameters for each framework and improve the performance of various models, two different sets of experiments were carried out. First, different ML classifiers were optimized using the HPO frameworks on publicly available datasets. The selected classifiers are Random Forest (RF), Extreme Gradient Boosting (XGB), and Support Vector Machine (SVM), each with its own set of hyperparameters to tune. The classifiers were trained and optimized using the HPO frameworks on the dry beans, raisin, and nomao datasets in order to obtain the best combination of hyperparameters. Second, a CNN architecture was built and optimized using the HPO frameworks on the CIFAR-10 dataset by tuning various hyperparameters such as the convolutional layers, fully connected layers, number of nodes, batch size, and learning rate. The accuracy, F1 score, and computing time were considered as the performance metrics. The obtained results show that the ML models and the CNN optimized with HPO frameworks achieved improved performance. For the ML models, Optuna and HyperOpt performed well and found the best combinations efficiently. Both frameworks used the TPE optimization algorithm and achieved accuracies of 93.97% and 94.12%, respectively, on the nomao dataset. HyperOpt was a good choice when accurate prediction matters the most. For smaller tasks where the cost matters a lot, BO was quite efficient. On the other hand, Optuna was effective and worked well for large spaces which required more computing time. Considering the trade-off between the accuracy and the computing time, Optuna obtained the optimal set of hyperparameters within a shorter computing time than the others. For the CNN model, almost all the HPO frameworks performed well. HyperOpt-TPE achieved the largest gain, improving the training accuracy of the CNN model by 34%, higher than all the other HPO frameworks.

The rest of the paper is organized as follows: Section 2 introduces the overview of HPO and its techniques. The HPO frameworks, such as BO, Optuna, HyperOpt, and Keras Tuner are explained in Section 3. Section 4 presents the comparative performance evaluations with experimental results and analyses. Section 5 reviews the previous research on performance studies using state-of-the-art HPO frameworks. Section 6 concludes the paper.

2. Hyperparameter Optimization

HPO is an important process to select the best combinations of hyperparameters that result in the best performance for ML models. Several automatic HPO techniques have been developed to tune the hyperparameters for designing efficient ML models. We first describe the popular hyperparameters and then HPO techniques.

2.1 Hyperparameters

Hyperparameters are the variables of a model that determine its structure and behavior. Recently, hyperparameters have received remarkable attention as a way to deal with the computational complexity of the models. Each ML algorithm has its own specific set of hyperparameters that need to be tuned. Studies such as (13)-(15) explain the hyperparameters in detail. There are two major categories of hyperparameters: model and optimizer hyperparameters. Model hyperparameters define the structure of the model, while optimizer hyperparameters control the training optimization algorithms.

The activation function is a model-specific hyperparameter that introduces non-linearity into the network when transforming the input signals. Activation functions such as Sigmoid, Softmax, Tanh, and Rectified Linear Units (ReLU) are commonly used (16). ReLU is suggested as a default activation function because it mitigates the vanishing gradient problem and converges six times faster than the Tanh function. The learning rate (LR) is a well-known hyperparameter that controls how quickly the network updates its weights during learning. The LR is used to adjust the weights of the hidden layers that extract the complex features of the input. Optimal LR values need to be selected in order not to miss the local minimum. Recently, LR annealing approaches have been developed to find an optimal value during training (17), (18). However, in most cases, users manually set LR values. Data augmentation is a data generation technique that creates modified copies of the training data through operations such as transformation, rotation, and cropping. It helps the NNs improve their performance during learning.

Optimization algorithms play a prominent role in the tuning of neural networks (NNs). They enable the networks to learn and perform better. In DL, Gradient Descent (GD), Root Mean Square Propagation (RMSprop), and Adaptive Moment Estimation (Adam) (19)-(21) are widely used optimizers. Gradient Descent minimizes the cost function by updating the model's parameters at each step. It updates the weights and biases iteratively, following the gradient, to reach a local minimum. RMSprop optimizes the gradient by balancing the momentum (step size) and decreasing the step size for large gradients (20). Another state-of-the-art optimizer is Adam, which combines two optimizers, Adaptive Gradient Descent (AdaGrad) (22) and RMSprop. It computes an adaptive LR for each parameter and is often used as the default optimization algorithm for training.
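
As a simple illustration (not from the paper), the sketch below shows where these hyperparameters appear in code: the activation function is a model hyperparameter of each layer, while the optimizer and its learning rate are optimizer hyperparameters passed at compile time. The layer sizes and LR value are arbitrary placeholders.

```python
import tensorflow as tf

# Model hyperparameters: layer sizes and activation functions.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16,)),
    tf.keras.layers.Dense(128, activation="relu"),   # ReLU hidden layer
    tf.keras.layers.Dense(7, activation="softmax"),  # Softmax output for multi-class
])

# Optimizer hyperparameters: Adam with an explicitly chosen learning rate.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```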

2.2 HPO Techniques

In this section, we give an overview of HPO techniques such as GS, RS, BO, the Genetic Algorithm (GA), and TPE (8), (9), (31)-(33).

2.2.1 Grid Search

Grid search (GS) is an optimization technique that searches the configuration space by checking all possible combinations of hyperparameters (26). In GS, a user selects the search space and divides it into a grid. Each hyperparameter in the grid has the same probability of affecting the optimization process. The selection of hyperparameters requires the user's prior knowledge. GS can find optimal combinations with limited resources and achieve accurate predictions for different tasks (23). GS can be parallelized, and the result of one trial does not affect the others.

However, GS requires a lot of training time in complex, high-dimensional spaces (see table 1). Also, GS is not sensitive to hyperparameter scaling, which may affect model performance. Furthermore, the method is only suitable for tuning a model, not for model selection.
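
A minimal GS sketch with scikit-learn's GridSearchCV is shown below; the classifier, dataset, and grid values are illustrative placeholders rather than the paper's search space. Every combination in the grid is evaluated, and trials can run in parallel.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# User-defined grid: GS exhaustively evaluates every combination (3 x 3 = 9 here).
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10, 20],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)  # n_jobs=-1: independent trials in parallel
search.fit(X, y)
print(search.best_params_, search.best_score_)
```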

2.2.2 Random Search

Random Search (RS) was proposed in (8) to address the limitations of GS. RS selects candidate values randomly and continues searching until the desired fitness function value is achieved. RS is suitable for lower budgets and investigates a larger search space than GS. However, it requires more computational resources (8). RS allows stopping criteria for the experiments; it stops when the required output (objective function value) is achieved. Overall, RS performs better than GS and can also be run in parallel.

RS is faster than GS; however, it is still time-consuming when training complex models. The method randomly selects combinations, so it may not find the optimal ones due to its lack of flexibility. It also has limitations in hyperparameter scaling and may not find the global minimum, as shown in table 1. Additionally, this method is less efficient in considering the relationships between the hyperparameters within the search space (27).
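
For comparison, a minimal RS sketch with scikit-learn's RandomizedSearchCV follows; the distributions and the n_iter budget are illustrative assumptions. Unlike GS, only n_iter randomly drawn combinations are evaluated.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Hyperparameters are drawn at random from these distributions.
param_distributions = {
    "n_estimators": randint(100, 1000),
    "max_features": uniform(0.1, 0.9),   # uniform over [0.1, 1.0)
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=50, cv=5,
                            random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```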

2.2.3 Bayesian Optimization

BO is a robust iterative algorithm in ML (9). Unlike GS and RS, BO is more efficient because it utilizes past results to guide future evaluations. This allows BO to find the global minimum with fewer iterations. It uses Bayesian statistics to search for the best combinations of hyperparameters. The two primary elements of BO are surrogate models and an acquisition function (28). The surrogate models approximate the objective function with a probability distribution based on the samples, while the acquisition function determines where to sample next and balances the trade-off between exploration and exploitation. Surrogate models allow the BO method to work efficiently by minimizing the number of expensive function evaluations needed to reach the optimum. Popular surrogate models include the BO Gaussian Process (BO-GP) (29), BO Random Forest (BO-RF) (30), and BO Tree Parzen Estimator (BO-TPE) (31).

The BO method can be computationally expensive when the objective function must be evaluated frequently. It can also take a long time to converge when the objective function has local optima. Additionally, BO is conceptually complex and challenging to parallelize (see table 1).
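
A minimal sketch using the BayesianOptimization package (34) is shown below; the objective (a cross-validated RF score) and the bounds are illustrative assumptions. The surrogate model is a Gaussian process, and the acquisition function picks the next point after a few random warm-up evaluations.

```python
from bayes_opt import BayesianOptimization
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def cv_score(n_estimators, max_depth):
    # bayes_opt works over continuous bounds, so integer hyperparameters are rounded.
    clf = RandomForestClassifier(n_estimators=int(n_estimators),
                                 max_depth=int(max_depth), random_state=0)
    return cross_val_score(clf, X, y, cv=5).mean()   # objective to maximize

optimizer = BayesianOptimization(
    f=cv_score,
    pbounds={"n_estimators": (100, 1000), "max_depth": (2, 20)},
    random_state=0,
)
optimizer.maximize(init_points=5, n_iter=25)  # 5 random points, then 25 guided evaluations
print(optimizer.max)                          # best score and hyperparameters found
```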

2.2.4 Genetic Algorithm

GA is a search optimization algorithm for solving multi-objective optimization problems (24). GA mimics the biological evolution process and selects those individuals capable of adapting to environmental changes. These individuals reproduce in subsequent generations and pass on their characteristics. The generations with better performance survive longer, while worse ones gradually disappear. The population in each generation represents the search space, and each individual is a candidate solution. Thus, in each iteration, the individuals correspond to hyperparameter configurations whose values are the actual inputs. The selection of individuals is based on the optimal value of the fitness function (32).

In HPO, GA is easy to implement and robust when optimizing over a large population. However, GA has limitations in configuring additional hyperparameters with the fitness functions (see table 1). Moreover, the algorithm is difficult to parallelize due to its sequential execution nature.

Many HPO algorithms, such as TPE and Hyperband, minimize the objective function for independent data. TPE (7) is a BO-based algorithm used to optimize hyperparameter configurations, including quantization settings, to achieve better performance. The applicability of TPE in ML with its key advantages and disadvantages is shown in table 1. Hyperband is another optimization algorithm, developed to solve the pure-exploration problem for infinitely many-armed bandits (33).

Table 1. Comparison of HPO techniques and the complexity (n: the number of hyperparameter values; k: the total number of hyperparameters used)

| HPO Approaches | Advantages | Disadvantages | Complexity | Applicability for DNN |
|---|---|---|---|---|
| Grid Search | Simplicity and baseline method; used for parallelism | Time-consuming; computational cost; poor scaling | O(n^k) | A few parameters need to be tuned |
| Random Search | Used for parallelism; early stopping methods; computational efficiency | Low efficiency; limited search space; lack of flexibility; hard to find local optima | O(n) | Convenient for the early stage; random combinations |
| Bayesian Optimization | Fast and reliable; efficiency and flexibility; the foundation of other algorithms | Difficult to parallelize; computational cost; convergence; robustness | O(n log n) | Default algorithm; variants of BO are applicable |
| Genetic Algorithm | Fast convergence speed; efficient and flexible; no need for optimal initialization of values | Lack of parallelism; long time to get the best model; lack of interpretability | O(n^2) | Mutation testing; filtering and signal processing; learning fuzzy rules |
| TPE | Efficient search method; better with conditional dependencies; flexibility | Poor performance for parallelization; computational cost; convergence and robustness | O(n log n) | For quantization configuration |

3. Hyperparameter Optimization Frameworks

Hyperparameter optimization frameworks are automatic tools to tune the hyperparameters of ML models. Each tool typically includes a set of optimization techniques and a user-friendly interface for defining the search space, evaluating the objective function, and monitoring the performance of models. These frameworks are in high demand for tackling complex machine-learning problems. In this section, we briefly overview the HPO frameworks that we have used: Bayesian optimization (9), Optuna (10), HyperOpt (11), and Keras Tuner (12). Note that BO is omitted here, as it was explained in detail in Section 2.

3.1 Optuna

Optuna (10) is an open-source tool released in 2019 by a Japanese AI company for ML and DL applications. The Optuna framework is built in Python and is designed for the efficient optimization of complex problems (10). Existing HPO tools have many limitations, such as constructing a search space for each model individually, lacking a pruning approach, and handling previous techniques within the allocated resources (6). Optuna addresses these problems and provides a better solution. It is a next-generation framework that allows users to create a search space dynamically. Optuna delivers a variety of user-customized sampling, searching, and pruning algorithms for an efficient implementation. The versatile architecture of Optuna is easy to set up and can be deployed for different types of problems, ranging from scalable to lightweight experiments. The overall framework of Optuna for an ML model is shown in fig 1, where it automatically finds optimal combinations of hyperparameters using specific HPO sampling algorithms (GS, RS, BO, Hyperband). The ML model is then evaluated with a validation strategy to produce the final results.

Optuna is designed to address the problems and limitations of black-box optimization frameworks. The implementation of Optuna is based on studies and trials. A study requires an objective function, which samples hyperparameter values in each trial and returns the score to be optimized, whereas a trial is a single execution of the objective function. The key features and available HPO algorithms of Optuna are highlighted in table 2.
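
A minimal Optuna sketch of this study/trial structure is given below; the RF classifier, dataset, and ranges are illustrative assumptions rather than the paper's exact configuration.

```python
import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def objective(trial):
    # The search space is declared dynamically inside the objective.
    n_estimators = trial.suggest_int("n_estimators", 100, 1000)
    max_depth = trial.suggest_int("max_depth", 2, 20)
    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 max_depth=max_depth, random_state=0)
    return cross_val_score(clf, X, y, cv=5).mean()

# A study manages the optimization; each call of the objective is one trial.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```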

Fig. 1. Overall framework of the Optuna HPO process


3.2 HyperOpt

HyperOpt (11) is a Python-based HPO tool built on sequential model-based optimization. It provides a user-friendly interface to configure the variables of the hyperparameter search space and evaluate the objective function. The configured variables can take continuous or discrete values, can be conditional, and can have different sensitivities (uniform, log scaling). HyperOpt helps find the best values for the selected variables. These selected variables define the search space from which the hyperparameter configuration that minimizes the objective function is found, as shown in fig 2.

The key factors of HyperOpt are a search space, an objective function, and an optimization algorithm. A search space in HyperOpt is defined with random variables or parameters whose distributions have

Fig. 2. Overall framework of the HyperOpt HPO process


a high prior probability of containing good combinations. It uses Python functions and operators to combine random parameter values for the specific objective function. The objective function can be defined with any conditional structure and maps the sampled random parameter values to the score that the selected optimization algorithm minimizes. HyperOpt supports the following optimization algorithms: RS, TPE, and Adaptive TPE. The HyperOpt search functions select the optimization algorithm, configure the best-performing optimal hyperparameters, and store the configuration results (see table 2).
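
A minimal HyperOpt sketch of these three key factors is shown below; the search space and classifier are illustrative assumptions. Because fmin minimizes, the cross-validated accuracy is negated, and the Trials object stores the score and configuration of every evaluation.

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Search space built from hp.* expressions (uniform, quantized uniform, etc.).
space = {
    "n_estimators": hp.quniform("n_estimators", 100, 1000, 50),
    "max_features": hp.uniform("max_features", 0.1, 1.0),
}

def objective(params):
    clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                 max_features=params["max_features"],
                                 random_state=0)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    return {"loss": -acc, "status": STATUS_OK}   # minimize the negative accuracy

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)
```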

Table 2. Comparison of HPO frameworks with key features and components, along with GitHub repository information

| HPO Frameworks | Key Features | Available HPO Algorithms | Essential Components for Optimization | GitHub Link |
|---|---|---|---|---|
| Bayesian Optimization | Efficient; versatile optimizer; useful for high-cost functions | Bayesian Optimization | Define and change the bounds; build the surrogate model; acquisition function; update the model over the search space | (34) |
| Optuna | Easy parallelization; quick visualization; versatile with platform-agnostic architecture | GS, RS, TPE, Hyperband, pruning algorithm | Objective function with trial; creation of the Optuna optimization; obtain the optimal search space; visualization of results | (35) |
| HyperOpt | High speed and parallelization; complex search spaces; persisting and resuming the optimization process | RS, TPE, Adaptive TPE | Define the objective function and search space; minimize the objective over the space; database for scores and configurations | (36) |
| Keras Tuner | Intuitive and efficient; lightweight; distributed optimization; dashboard | RS, Hyperband, Bayesian Optimization | Hyperparameter selection choice; selection of optimization algorithms; perform tuning | (12) |

3.3 Keras Tuner

In ML, there is no fixed way to select model parameters such as the number of layers and the kernel size, or optimization parameters such as the LR, decay, and normalization, when building a model. Keras is an open-source Python API, and Keras Tuner is an HPO framework built on top of it (12) that defines the search space of hyperparameters and finds the optimal combination of values for training ML models. These hyperparameters play a vital role in generalizing the models to perform better.

Keras Tuner is a library for tuning sets of hyperparameters to obtain high performance, for example in imaging studies (37). The idea of Keras Tuner is to define a range of values for each hyperparameter and obtain the optimal combination that improves the validation performance of the model, as shown in fig 3. Moreover, it helps to build a lightweight and efficient ML model by selecting the optimal search space configuration for tuning (see table 2). Keras Tuner has various built-in HPO methods: RS, GS, BO, Hyperband, and evolutionary algorithms. It allows researchers to conduct experiments with different techniques. The RS, GS, and BO methods are explained in detail in Section 2. The Keras Tuner-based Hyperband is an extended version of RS with an early stopping function to optimize the speed (38). This framework is helpful for tuning CNN models and allows tuning of the model's different hyperparameters (convolutional layers, number of neurons and epochs, learning rate, etc.). Keras Tuner has also been used as an HPO framework to optimize a long short-term memory (LSTM) network for earthquake prediction (39).
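
A minimal Keras Tuner sketch is given below; it uses a simplified dense model on CIFAR-10 (not the paper's CNN), and the hyperparameter ranges, directory, and project name are illustrative assumptions. The model-building function receives an hp object from which the tuner samples values.

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 3)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(hp.Int("num_nodes", 32, 512, step=32),
                              activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    lr = hp.Float("learning_rate", 1e-6, 1e-2, sampling="log")
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# Hyperband tuner: weak trials are stopped early to save computation.
tuner = kt.Hyperband(build_model, objective="val_accuracy", max_epochs=30,
                     directory="kt_logs", project_name="cifar10_demo")

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_train = x_train / 255.0
tuner.search(x_train, y_train, validation_split=0.2)
print(tuner.get_best_hyperparameters(1)[0].values)
```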

Fig. 3. Overall framework of Keras Tuner to perform HPO


4. Comparative Performance Evaluation of HPO Frameworks

In this section, we present experimental results for the comparative performance evaluation of the four HPO frameworks described in Section 3: BO, Optuna, HyperOpt, and Keras Tuner. We conducted ML experiments using the BO, Optuna, and HyperOpt frameworks to classify the datasets and make predictions. (Previous research suggests that Keras Tuner is primarily used for image classification rather than tabular classification tasks, so we did not consider it for the ML tasks.) For the DL CNN model experiments on the image dataset, we used Keras Tuner along with the other HPO frameworks. We first explain the experimental settings, including the datasets and system specifications used for the experiments. Experimental results are then presented along with the analyses.

4.1 Experimental Settings

For the performance evaluation, we selected four publicly available datasets: dry beans (40), raisin (41), nomao (42), and CIFAR-10 (43). The selection of datasets is based on real-world scenarios. The details of each dataset, including the number of samples, classes, and numerical and categorical features, are summarized in table 3. CIFAR-10 is the only image dataset. It consists of 60K images in ten different classes, with 6K images per class. The dataset is extracted from the 80 million tiny images database, where each RGB image is 32 X 32 pixels. The dataset was split into training and testing sets of 50K and 10K images, respectively.
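
As a minimal sketch, CIFAR-10 with this standard split can be loaded directly through Keras; the pixel scaling shown is a common preprocessing choice rather than a step stated in the paper.

```python
import tensorflow as tf

# Standard CIFAR-10 split: 50K training and 10K testing images of shape 32 x 32 x 3.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
print(x_train.shape, x_test.shape)   # (50000, 32, 32, 3) (10000, 32, 32, 3)

x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixel values to [0, 1]
```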

Table 3. Details of the datasets for the experiments

| Dataset | Classes | Samples | Attribute characteristics | Year Donated | Reference |
|---|---|---|---|---|---|
| dry beans | 17 | 13611 | Integer, Real | 2020 | (40) |
| raisin | 8 | 900 | Integer, Real | 2021 | (41) |
| nomao | 120 | 34465 | Real | 2012 | (42) |
| CIFAR-10 | 10 | 60000 | 3072 | 2009 | (43) |

We used the Python programming language with all the libraries needed to conduct the experiments. The training and testing of the ML models were performed on a computer system with an Intel® Core™ i7-8700 CPU with a clock speed of 3.20 GHz, 16 GB of DRAM, and an NVIDIA GeForce GTX 1060 GPU with 8 GB of graphics DRAM. We evaluated the accuracy, F1 score, and computing time as the performance evaluation metrics for the models.

4.2 Machine Learning Classifiers and the Experimental Results

We implemented three machine learning classification models including RF, XGB, and SVM, and evaluated their performance on the classification datasets (dry beans, raisin, nomao).

Random Forest (RF)

Random forest is an ensemble learning method that can handle both classification and regression problems. It is an extension of decision tree algorithms: it creates multiple trees, where each tree is trained on a subset of the data. The results of all the trees are merged to build the final prediction, resulting in a more robust and accurate model. RF can efficiently overcome the problem of overfitting. It has multiple hyperparameters to tune, as discussed in table 4.

Extreme Gradient Boosting (XGB)

XGB is a supervised ML algorithm that improves traditional gradient boosting using decision trees. It is a highly efficient classifier that has won numerous data science challenges. It builds a sequence of decision trees iteratively, where each new tree corrects the errors of the previous trees, resulting in performance improvement. The key hyperparameters of XGB to be optimized are gamma, n_estimators, and max_depth.

Table 4. Configuration search space of hyperparameters with the type, space, and range of values

| ML Classifiers | Hyperparameters | Type | Space | Range |
|---|---|---|---|---|
| RF | n_estimators | integer | linear | [100, 1000] |
| RF | max_depth | real | - | [2, 20] |
| RF | min_samples_split | integer | - | [0.1, 1.0] |
| RF | min_samples_leaf | - | - | [0.1, 0.5] |
| RF | max_features | - | - | [0.1, 1.0] |
| RF | criterion | real | - | - |
| XGB | colsample_bytree | real | - | [0.6, 1.0] |
| XGB | gamma | integer | - | [0, 1] |
| XGB | max_depth | real | - | [2, 20] |
| XGB | min_child_weight | - | - | - |
| XGB | n_estimators | integer | linear | [50, 1000] |
| XGB | subsample | numeric | - | [0.5, 1] |
| SVM | C | integer | float | [0.1, 10] |
| SVM | degree | - | - | [2, 4] |
| SVM | kernel | discrete | linear | - |
| SVM | gamma | numeric | - | [True, False] |

Table 5. Experimental results of the ML classifiers using the selected HPO frameworks

| Datasets | HPO Frameworks | Accuracy (%) | F1 score (%) | Computing Time (min) |
|---|---|---|---|---|
| dry beans | BO | 87.17 | 86.23 | 22 |
| dry beans | Optuna-TPE | 86.23 | 87.45 | 25 |
| dry beans | HyperOpt-TPE | 87.45 | 87.44 | 28 |
| raisin | BO | 87.44 | 87.33 | 20 |
| raisin | Optuna-TPE | 87.33 | 87.33 | 23 |
| raisin | HyperOpt-TPE | 87.33 | 90.12 | 18 |
| nomao | BO | 90.12 | 93.97 | 27 |
| nomao | Optuna-TPE | 93.97 | 94.12 | 24 |
| nomao | HyperOpt-TPE | 94.12 | 92.17 | 30 |

Support Vector Machine (SVM)

SVM is an ML method commonly used for classification and regression analysis. It finds the best boundary to separate the classes in the data. New data points fall on either side of the boundary and can be classified easily. The data points closest to the boundary are called support vectors. The SVM classifier works well for smaller datasets and may struggle to handle high-dimensional data.

Each classifier has its own configuration space of hyperparameters that need to be tuned. The configuration space includes the tuned hyperparameters, their type, space, and the range of values for each hyperparameter (see table 4). The complete process involves the following steps: data pre-processing, selection of the search space with its values, and selection of the algorithm to perform the implementation. In data pre-processing, we handled missing values by imputation (mean, forward, backward), applied one-hot encoding to the categorical features, used a label encoder for the target feature, and applied Min-Max scaling to all the numerical features in the dataset. The dataset was split into training and testing sets using the K-fold cross-validation technique. We used 5-fold cross-validation and 50 iterations to tune the hyperparameters in each fold. Hyperparameter tuning was performed with the BO, Optuna, and HyperOpt frameworks for the machine learning tasks; as noted above, Keras Tuner was not considered for these tasks. As mentioned earlier, our experiments used the accuracy, F1 score, and computing time as performance metrics.
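
A minimal sketch of this preprocessing and 5-fold evaluation pipeline is shown below; the synthetic data, column names, and the RF classifier are placeholders, not the paper's code, and the tuning loop itself (50 iterations per framework) is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder

# Placeholder data standing in for a tabular dataset such as dry beans or raisin.
rng = np.random.default_rng(0)
X = pd.DataFrame({"feature_a": rng.normal(size=40),
                  "feature_b": rng.uniform(size=40),
                  "feature_c": rng.choice(["a", "b"], size=40)})
X.loc[::7, "feature_a"] = np.nan                   # a few missing values to impute
y = np.repeat(["class0", "class1"], 20)

# Mean imputation + Min-Max scaling for numerical features, one-hot for categorical.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", MinMaxScaler())]), ["feature_a", "feature_b"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["feature_c"]),
])

y_encoded = LabelEncoder().fit_transform(y)        # label-encode the target feature

model = Pipeline([("prep", preprocess),
                  ("clf", RandomForestClassifier(random_state=0))])

# 5-fold cross-validation; hyperparameter tuning would wrap this evaluation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X, y_encoded, cv=cv, scoring="accuracy").mean())
```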

Experiments were conducted on the benchmark datasets for classification. The configuration search space of each classifier is based on the list of hyperparameters provided in table 4. In each experiment, the models with the highest accuracy were returned and are highlighted in table 5. HyperOpt-TPE performed the best and achieved the highest accuracy for dry beans and nomao, with a score similar to the others for the raisin dataset. On the other hand, BO had the lowest score.

Comparing the performance of HyperOpt and Optuna in terms of accuracy, both frameworks performed well and achieved almost the same results, while HyperOpt had a slight advantage: HyperOpt achieved an accuracy of 94.12% and Optuna 93.97%. Both frameworks used TPE to tune their hyperparameters. TPE requires upper and lower limits and a distribution, which makes optimization easier. Compared with RS, GS, BO, and many other optimization algorithms, TPE can better utilize the resources by evaluating the hyperparameters efficiently, can handle non-linear relationships between the selected hyperparameters and the objective function more effectively, and requires lower computational cost to find the best hyperparameters.

The raisin dataset has a smaller number of instances than the other datasets. All frameworks had satisfactory accuracy scores. BO showed a higher accuracy score of 87.44%, despite a longer run time than the others. The use of the acquisition function to find the optimal combinations for smaller datasets led to better accuracy and optimization. The results in table 5 highlight that BO achieved quite reasonable accuracies of 87.17% and 87.44% for the dry beans and raisin datasets. Optuna and HyperOpt achieved the same results for the raisin dataset, as shown in fig 4.

On the high-dimensional nomao dataset, BO had a lower score while HyperOpt had a higher score with a slightly longer runtime. HyperOpt achieved an accuracy score of 94.12%. Each framework took a long time to optimize compared with the other datasets due to the larger number of samples. Optuna also achieved results similar to HyperOpt. Optuna generally performed well in terms of computing time due to an adaptive sampling feature that adjusts the search space dynamically and required less computation to find the optimal combinations. Overall, HyperOpt took more computing time than the others due to the number of iterations for the classifiers during the experiments, as shown in fig 5. It explored more of the search space and evaluated one set of hyperparameters at a time rather than in parallel, which resulted in a longer computing time.

HyperOpt and Optuna had good performance scores on all datasets; however, HyperOpt had a longer runtime. The performance

Fig. 4. Accuracy of all selected HPO frameworks in percentage (%)


Fig. 5. Computing time (min) of HPO frameworks during the training


scores of HyperOpt and Optuna in terms of accuracy were almost the same; however, Optuna had a shorter run time. From the experiments, we ranked Optuna as the best choice for HPO considering the trade-off between the accuracy and F1 score vs. the computing time. HyperOpt can also be a good choice when accurate prediction matters, because it prioritizes accuracy over speed. Furthermore, it has the potential for parallel computation for complex problems with large data. Finally, it is important to note that the performance of an HPO framework depends on many factors, such as the selected HPO method, the size of the dataset, the chosen hyperparameter search space and its values, and the computational resources.

4.3 Deep Learning CNN Model and the Experimental Results

In this section, we built a CNN architecture for the CIFAR-10 benchmark image dataset. A traditional CNN architecture has various layers such as convolution, dense, pooling, and fully connected layers. The selection of these layers and of the hyperparameters within each layer, as well as the number of neurons and the padding, was made using the HPO frameworks. The schematic diagram of the CNN model is presented in fig 6. The convolutional block in the CNN architecture consists of a convolutional layer with the ReLU activation function and a MaxPooling layer. Each convolutional layer comprises a 5 X 5 convolution filter, zero padding, and a stride of 1. The pooling layer has a MaxPooling filter with a size of 2 X 2. Both the input and output of a convolutional layer have the same size. The convolutional blocks take the input images, extract the salient features, and forward them to the hidden layers. These are converted into a single feature vector and passed to the fully connected layers of the network. Finally, the resulting output is classified using the Softmax activation function to compute the classification score.

A good choice of hyperparameters (learning rate, momentum, convolution and dense layers, hidden units, batch size, etc.) can lead to higher performance. All the possible hyperparameters used to build the CNN model are highlighted in table 6. Each optimization algorithm has its specific number of hyperparameters with a range of values. Table 6 provides the search space of hyperparameters with values and the optional hyperparameters of the CNN. The configured values were obtained by using the HPO frameworks. For example, using BO resulted in the following configured values: learning_rate 0.000544, conv_layers 2, dense_layers 1, activation relu, and num_nodes 512. The optimal combinations of hyperparameters were validated on the testing set to predict the performance. The performance of each framework was measured and evaluated using the accuracy. Additionally, the computing time during the training of the CNN with each HPO method was measured to evaluate the efficiency.
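
The sketch below illustrates how such a CNN could be assembled from the hyperparameters of table 6; the filter count and other architectural details are assumptions, and the default arguments correspond to the configuration reported above for BO.

```python
import tensorflow as tf

def build_cnn(conv_layers=2, dense_layers=1, num_nodes=512,
              learning_rate=0.000544, activation="relu"):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=(32, 32, 3)))
    for _ in range(conv_layers):
        # Convolutional block: 5x5 filters, zero padding, stride 1, then 2x2 max pooling.
        model.add(tf.keras.layers.Conv2D(32, kernel_size=5, strides=1,
                                         padding="same", activation=activation))
        model.add(tf.keras.layers.MaxPooling2D(pool_size=2))
    model.add(tf.keras.layers.Flatten())              # single feature vector
    for _ in range(dense_layers):
        model.add(tf.keras.layers.Dense(num_nodes, activation=activation))
    model.add(tf.keras.layers.Dense(10, activation="softmax"))  # classification score
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_cnn()
model.summary()
```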

Fig. 6. CNN architecture for our experimental setup (the number of convolutional blocks and dense layers is selected from the hyperparameters' search space configuration)


Table 6. Configuration search space and optional hyperparameters to optimize the CNN model

| DL Model | Hyperparameters | Type | Space | Range | Optional Hyperparameters |
|---|---|---|---|---|---|
| CNN | learning_rate | real | log | [1e-6, 1e-2] | kernel_size, strides, momentum, padding, and dropout |
| CNN | dense_layers | integer | linear | [1, 3] | |
| CNN | conv_layers | - | - | [1, 3] | |
| CNN | num_nodes | - | - | [5, 512] | |
| CNN | batch_size | - | - | [10, 250] | |

Experiments were conducted using the CIFAR-10 dataset. The performance results of the CNN with and without the HPO frameworks are shown in table 7. First, the training of the CNN model was performed with the default hyperparameters for each optimization algorithm. The default values were chosen based on prior knowledge, and the model was trained for 30 epochs. Then, the CNN model was trained with the assigned hyperparameter search space for each HPO framework. The performance results of the optimized CNN were analyzed using the accuracy metric and the computing time.

Table 7. Experimental results of the CNN model using the selected HPO frameworks on the CIFAR-10 dataset (Acc.: Accuracy; M: million; K: thousand; %: percentage; h: hour; m: minute)

| HPO Framework | No. of Training Parameters | Training Acc. with Default Hyperparameters | Training Acc. with HPO | Testing Acc. with HPO | Computing Time |
|---|---|---|---|---|---|
| BO | 1.20M | 45.87% | 72.88% | 71.42% | 1h 6m |
| Optuna-TPE | 1.24M | 48.65% | 74.15% | 72.68% | 2h 32m |
| HyperOpt-TPE | 364K | 42.28% | 76.98% | 71.62% | 2h 24m |
| Keras Tuner-Hyperband | 53K | 72.37% | 90.76% | 84.66% | 18m |

The performance comparison of the four HPO frameworks is summarized in table 7. With the default values of the hyperparameters, Keras Tuner-Hyperband achieved the best performance with 72.37% training accuracy, while BO, Optuna-TPE, and HyperOpt-TPE achieved 45.87%, 48.65%, and 42.28% training accuracy, respectively. With HPO, HyperOpt-TPE improved the training accuracy by 34%, which was higher than all the other HPO frameworks.

Using BO, we achieved 72.88% training accuracy and 71.42% testing accuracy, while it took 1 hour and 6 minutes of computing time. BO explored the search space of optimal hyperparameters effectively using a probabilistic model. The efficient exploration, the ability to handle both categorical and continuous values, and the focus on the most promising regions of the search space allowed BO to improve its performance within a short training time.

Keras Tuner improved the CNN performance by a substantial margin. The complexity of the Keras Tuner framework is low due to fewer training parameters and a smaller search space. Keras Tuner-Hyperband achieved the highest testing accuracy of 84.66%, which outperformed the other frameworks. However, Keras Tuner showed the smallest improvement in training accuracy, 18.39%. With fewer parameters to optimize, it took only 18 minutes to complete the 30 epochs. Features such as the easy-to-use interface, the ability to perform parallelization and early stopping, and multiple runs of the optimization process led to better results for the CNN.

For Optuna and HyperOpt, the selected optimization algorithm was TPE. The achieved training accuracies for Optuna-TPE and HyperOpt-TPE were 74.15% and 76.98%, respectively, as shown in fig 7. The search space of both HPO frameworks was larger than those of Keras Tuner and BO, which increased the number of parameters and the complexity, thus resulting in a higher computing time (Optuna took 2 hours and 32 minutes). The CNN performance improvement using Optuna was due to the regularization and the pruning strategy. By allowing regularization, which prevents overfitting, Optuna improved the generalization of the CNN model. The pruning strategy in Optuna eliminated unpromising trials during the CNN optimization, which resulted in better accuracy and computing time. HyperOpt-TPE detected the best combinations with configured values early and improved the training accuracy by 34%, which was higher than all the other HPO frameworks. It allowed efficient exploration of the larger space to identify the combinations of hyperparameters that improved the overall performance of the CNN model.

Fig. 7. Training Accuracy comparison of all selected HPO Frameworks for CNN model


To summarize, the experimental results show that the CNN with HPO frameworks achieved significant improvements in training accuracy with the detected optimal combinations of hyperparameters. The training accuracies with and without HPO in table 7 show that all frameworks produced significant improvements: Keras Tuner improved by 18.39%, while BO, Optuna, and HyperOpt improved by 27.01%, 25.5%, and 34%, respectively. Keras Tuner and BO were simple to apply for detecting the optimal hyperparameter configuration for smaller tasks where the cost matters a lot. Although BO took more computing time due to poor parallelization, there is still a trade-off between the accuracy and the computing time when using BO. On the other hand, Optuna and HyperOpt are effective optimization frameworks that worked well for large spaces and reliably detected the optimal combinations with the configured values. Larger spaces may include unimportant hyperparameters that increase the complexity of the problem, resulting in more computing time.

5. Previous Research

This section reviews previous research on using HPO frameworks (Optuna, BO, HyperOpt, and Keras Tuner) for performance studies.

In (44), Optuna was used to optimize the hyperparameters of an XGB classifier to diagnose cardiovascular disease. Multiple hyperparameters such as n_estimators, max_depth, gamma, and the learning rate were tuned to improve the evaluation performance of XGB. The model achieved an accuracy of 94.7% on the Cleveland dataset, which outperformed previous approaches. In (45), Optuna improved the performance of the LightGBM model by tuning its hyperparameters for the prediction of circuit impedance values. Hyperparameters such as n_estimators, learning_rate, max_depth, lambda, and max_leaves were optimized. Optuna provided the optimal values for LightGBM, which outperformed other models and achieved an R2 value of 0.79. In (46), the Optuna tool was used for a comprehensive performance study of SVM, decision tree (DT), and RF classifiers on a multi-class problem. The RS, GS, TPE, and CMA-ES optimization techniques within Optuna were analyzed. In (47), (48), Optuna was used to optimize the architecture and multiple hyperparameters of DL applications to improve the performance of models such as CNN and LSTM.

In (49), the authors attempted to select a learning algorithm together with its hyperparameters and tune them automatically using BO. Different classifiers were optimized to select an optimal algorithm with its appropriate hyperparameters, achieving better results. In (28), BO was used to automatically select ML algorithms and their hyperparameters in the WEKA approach. This approach was implemented on multiple datasets, and the improved experimental results were compared with other optimization techniques. In (50), BO searched the best configuration space of a CNN model for gastroenterology, which resulted in 10% higher accuracy compared with the previous method.

In (49), a comparative study was conducted using HyperOpt to improve the performance of ML classifiers. A performance comparison of HyperOpt-BO with GS and RS for DT, XGBoost, C-SVM, RF, and NN was performed on six different state-of-the-art datasets. Multiple hyperparameters for each classifier were tuned, and XGBoost using HyperOpt-BO selected the best combinations of hyperparameters and achieved high accuracy within a short computing time. In (51), the authors improved the overall performance of SVM, AdaBoost, logistic linear regression, RF, and NNs using the HyperOpt tool for the prediction of drugs. The selected hyperparameters were tuned with different ranges of values to obtain the configuration set. Finally, the models were trained with the configured hyperparameters, and 33 out of 36 models improved their validation performance. In (52), the HyperOpt-TPE algorithm was used to tune the hyperparameters of an LSTM network to predict future taxi demands from the New York City dataset. The optimized LSTM achieved an MSE of 0.172 compared with other prediction models.

In (53), a CNN was optimized using Keras Tuner to obtain the optimal combination of hyperparameters and achieve better performance within a short computing time. The optimized CNN achieved 94% accuracy on the FER2013 dataset for detecting emotions. In (39), Keras Tuner was used to optimize a long short-term memory (LSTM) DL network. The model's hyperparameters were optimized to build an efficient architecture that improves performance, and an accuracy of 74.67% was achieved in predicting earthquakes.

Although automated optimization approaches perform well, manual optimization approaches are also effective for NNs in DL tasks. Studies such as (54), (55) used manual tuning of hyperparameters and improved the classification accuracy by a substantial margin.

6. Conclusion

In this paper, a comparative performance evaluation study was conducted to analyze the direct impact of the choice of hyperparameters on the performance of ML models. For this purpose, the hyperparameters of each model were optimized using the latest HPO frameworks: BO, Optuna, HyperOpt, and Keras Tuner. Each of these frameworks consists of multiple state-of-the-art HPO algorithms. Two different sets of experiments were conducted to obtain the best configuration of hyperparameters, and the resulting performance was analyzed. First, multiple ML classifiers were optimized with the HPO frameworks on publicly available datasets. Second, a CNN model was built and optimized with the HPO frameworks for an image classification task. Experimental results showed that, considering both the accuracy and the computing time, Optuna offered the best trade-off for the ML models, while HyperOpt-TPE achieved the largest accuracy improvement for the CNN model.

Acknowledgements

This work was supported by the Supercomputer Development Leading Program of the National Research Foundation of Korea (NRF) funded by the Korean government (MSIT) (No. 2020M3H6A1084984).

References

1 
T. Young, D. Hazarika, S. Poria, E. Cambria, 2018, Recent Trends in Deep Learning Based Natural Language Processing, IEEE Computational Intelligence Magazine, Vol. 13, pp. 55-75DOI
2 
M. I. Jordan, T. M. Mitchell, Jul 2015, Machine learning: Trends, perspectives, and prospects, Science, Vol. 349, No. 6245, pp. 255-260DOI
3 
R. Elshawi, M. Maher, S. Sakr, 2019, Automated Machine Learning: State-of-The-Art and Open Challenges, ArXiv190602287 Cs StatDOI
4 
S. Abreu, 2019, Automated Architecture Design for Deep Neural Networks, ArXivDOI
5 
K. He, X. Zhang, S. Ren, J. Sun, 2016, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778DOI
6 
N. Ma, X. Zhang, H.-T. Zheng, J. Sun, 2018, ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design, CoRR, Vol. abs/1807.11164, pp. -DOI
7 
J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, 2011, Algorithms for Hyper-Parameter Optimization, Advances in Neural Information Processing Systems, Vol. 24, pp. 2546-2554DOI
8 
J. Bergstra, Y. Bengio, 2012, Random Search for Hyper- Parameter Optimization, J. Mach. Learn. Res., Vol. 13, No. 10, pp. 281-305DOI
9 
B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. de Freitas, 2016, Taking the Human Out of the Loop: A Review of Bayesian Optimization, Proc. IEEE, Vol. 104, No. 1, pp. 148-175DOI
10 
T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, 2019, Optuna: A Next-generation Hyperparameter Optimization Framework, Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623-2631DOI
11 
J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D. D. Cox, 2015, Hyperopt: a Python library for model selection and hyperparameter optimization, Comput. Sci. Discov., Vol. 8, No. 1, pp. 014008-DOI
12 
2022, KerasTuner
13 
P. Probst, A.-L. Boulesteix, B. Bischl, 2019, Tunability: Importance of Hyperparameters of Machine Learning Algorithms, J. Mach. Learn. Res., Vol. 20, No. 53, pp. 1-32DOI
14 
M. Claesen, B. De Moor, Apr. 06, 2015, Hyperparameter Search in Machine Learning, arXivDOI
15 
H. J. P. Weerts, A. C. Mueller, J. Vanschoren, Jul. 15, 2020, Importance of Tuning Hyperparameters of Machine Learning Algorithms, arXivDOI
16 
V. Nair, G. E. Hinton, 2010, Rectified linear units improve restricted boltzmann machines, in Proceedings of the 27th International Conference on International Conference on Machine Learning, Madison, WI, USA, pp. 807-814DOI
17 
J. Brownlee, Jan. 22, 2019, How to Configure the Learning Rate When Training Deep Learning Neural Networks, Machine Learning MasteryDOI
18 
Y. Bengio, 2012, Practical Recommendations for Gradient-Based Training of Deep Architectures, in Neural Networks: Tricks of the Trade: Second Edition, G. Montavon, G. B. Orr, and K.-R. Müller, Eds. Berlin, Heidelberg: Springer, pp. 437-478DOI
19 
S. Agrawal, 2021, Hyperparameters in Deep Learning, MediumDOI
20 
[Coursera] Neural Networks for Machine Learning (University of Toronto) (neuralnets)DOI
21 
D. P. Kingma, J. Ba, 2015, Adam: A Method for Stochastic Optimization, in 3rd International Conference on Learning Representations, San Diego, CA, USA, May 7-9, 2015, Conference Track ProceedingsDOI
22 
J. Duchi, E. Hazan, Y. Singer, 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, Journal of machine learning research, Vol. 12, No. 7, pp. 39-DOI
23 
P. Liashchynskyi, P. Liashchynskyi, 2019, Grid search, random search, genetic algorithm: a big comparison for NAS, arXiv preprint arXiv:1912.06059DOI
24 
M. A. J. Idrissi, H. Ramchoun, Y. Ghanou, M. Ettaouil, 2016, Genetic algorithm for neural network architecture optimization, in 2016 3rd International Conference on Logistics Operations Management (GOL), pp. 1-4DOI
25 
J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, 2011, Algorithms for Hyper-Parameter Optimization, in Advances in Neural Information Processing Systems, Vol. 24DOI
26 
R. Joseph, 2018, Grid Search for model tuning, MediumDOI
27 
M.-A. Zöller, M. F. Huber, 2021, Benchmark and Survey of Automated Machine Learning Frameworks, ArXiv190412054 Cs StatDOI
28 
A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter, 2017, Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets, in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 528-536DOI
29 
M. Seeger, 2004, Gaussian processes for machine learning, Int. J. Neural Syst., Vol. 14, No. 2, pp. 69-106DOI
30 
F. Hutter, H. H. Hoos, K. Leyton-Brown, 2011, Sequential Model-Based Optimization for General Algorithm Configuration, in Learning and Intelligent Optimization, pp. 507-523DOI
31 
D. Maclaurin, D. Duvenaud, R. Adams, 2015, Gradient-based Hyperparameter Optimization through Reversible Learning, in Proceedings of the 32nd International Conference on Machine Learning, pp. 2113-2122DOI
32 
A. S. Wicaksono, A. A. Supianto, 2018, Hyper Parameter Optimization using Genetic Algorithm on Machine Learning Methods for Online News Popularity Prediction, Int. J. Adv. Comput. Sci. Appl. IJACSA, Vol. 9, No. 12, pp. 33-31DOI
33 
L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, 2022, Hyperband: Bandit-Based Configuration Evaluation for Hyperparameter Optimization, International Conference on Learning RepresentationsDOI
34 
2022, GitHub - fmfn/BayesianOptimization: A Python implementation of global optimization with gaussian processes, https://github.com/fmfn/BayesianOptimization
35 
Mar. 18, 2022, Optuna: A hyperparameter optimization framework, optunaDOI
36 
Jan. 12, 2023, Hyperopt: Distributed Hyperparameter Optimization, hyperoptDOI
37 
K. Team, Jan. 13, 2023, Keras documentation: KerasTunerDOI
38 
L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, 2017, Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization, pp. 6765-6816DOI
39 
Md. H. A. Banna, 2021, Attention-Based Bi-Directional Long-Short Term Memory Network for Earthquake Prediction, IEEE Access, Vol. 9, pp. 56589-56603DOI
40 
Murat Koklu, Ilker Ali Ozkan, 2020, Multiclass classification of dry beans using computer vision and machine learning techniques, Comput. Electron. Agric., Vol. 174, pp. 105507-DOI
41 
İ. Çinar, M. Koklu, P. D. Ş. Taşdemi̇r, Dec. 2020, Classification of Raisin Grains Using Machine Vision and Artificial Intelligence Methods, Gazi Mühendis. Bilim. Derg., Vol. 6, No. 3, pp. -DOI
42 
L. Candillier, V. Lemaire, Aug. 2013, Active learning in the real-world design and analysis of the Nomao challenge, in The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1-8DOI
43 
A. Krizhevsky, I. Sutskever, G. E. Hinton, 2012, ImageNet Classification with Deep Convolutional Neural Networks, in Advances in Neural Information Processing Systems, Vol. 25DOI
44 
P. Srinivas, R. Katarya, Mar. 2022, hyOPTXg: OPTUNA hyper- parameter optimization framework for predicting cardiovascular disease using XGBoost, Biomed. Signal Process. Control, Vol. 73, pp. 103456-DOI
45 
J.-P. Lai, Y.-L. Lin, H.-C. Lin, C.-Y. Shih, Y.-P. Wang, P.-F. Pai, Feb. 2023, Tree-Based Machine Learning Models with Optuna in Predicting Impedance Values for Circuit Analysis, Micromachines, Vol. 14, No. 2, pp. -DOI
46 
J. Joy, M. P. Selvan, 2022, A comprehensive study on the performance of different Multi-class Classification Algorithms and Hyperparameter Tuning Techniques using Optuna, in 2022 International Conference on Computing, Communication, Security and Intelligent Systems (IC3SIS), pp. 1-5DOI
47 
Y. Nishitsuji, J. Nasseri, Mar. 2022, LSTM with forget gates optimized by Optuna for lithofacies prediction, DOI
48 
I. Ekundayo, 2020, OPTUNA Optimization Based CNN-LSTM Model for Predicting Electric Power Consumption, masters, Dublin, National College of IrelandDOI
49 
S. Putatunda, K. Rama, 2018, A Comparative Analysis of Hyperopt as Against Other Approaches for Hyper-Parameter Optimization of XGBoost, in Proceedings of the 2018 International Conference on Signal Processing and Machine Learning, Shanghai China, pp. 6-10DOI
50 
R. J. Borgli, H. Kvale Stensland, M. A. Riegler, P. Halvorsen, 2019, Automatic Hyperparameter Optimization for Transfer Learning on Medical Image Datasets Using Bayesian Optimization, in 2019 13th International Symposium on Medical Information and Communication Technology (ISMICT), pp. 1-6DOI
51 
J. Zhang, Q. Wang, W. Shen, Dec 2022, Hyper-parameter optimization of multiple machine learning algorithms for molecular property prediction using hyperopt library, Chin. J. Chem. Eng., Vol. 52, No. , pp. -DOI
52 
N. Schwemmle, T.-Y. Ma, May 2021, Hyperparameter Optimization for Neural Network based Taxi Demand Prediction, presented at the BIVEC-GIBET Benelux Interuniversity Association of Transport Researchers: Transport Research Days 2021DOI
53 
B. Abdellaoui, A. Moumen, Y. Idrissi, A. Remaida, 2021, Training the Fer2013 Dataset with Keras Tuner., pp. 412-DOI
54 
A. Jafar, M. Lee, 2021, High-speed hyperparameter optimization for deep ResNet models in image recognition, in Cluster Computing, pp. 1-9DOI
55 
A. Jafar, L. Myungho, Aug. 2020, Hyperparameter Optimization for Deep Residual Learning in Image Classification, in 2020 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C), pp. 24-29DOI

About the Authors

Abbas Jafar

Abbas Jafar received his B.S. in Software Engineering from Government College University Faisalabad, Pakistan. He joined Myongji University, Korea for a Master's degree, which he completed, and he is now enrolled in a Ph.D. program. Currently, he is a Research Assistant in the HPC Lab at Myongji University. His research interests are AI in healthcare systems, deep learning, high-performance computing, and performance optimization, with a special interest in GPU computing.

이명호 (Myungho Lee)

Myungho Lee received his B.S. in Computer Science and Statistics from Seoul National University, Korea, M.S. in Computer Science, Ph.D. in Computer Engineering from the University of Southern California, USA. He was a Staff Engineer in the Scalable Systems Group at Sun Microsystems, Inc, Sunnyvale, California, USA. He is currently a Full Professor in the Dept. of Computer Science & Engineering at Myongji University. His research interests are in High-Performance Computing: architecture, compiler, and applications.