Qi Wang
(College of Architecture and Information Engineering, Shandong Vocational College of Industry, Zibo, 256414, China; qiwang_qwqw@outlook.com)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Machine-learning, Bayesian, Automation, CASH, Artificial intelligence
1. Introduction
Automated machine-learning technology can reduce or eliminate manual involvement in model selection and parameter tuning, thereby improving the efficiency and performance of machine-learning models. Machine-learning pipeline design is integral to automated machine learning and has received extensive attention [1,2]. In practical applications, however, existing machine-learning pipeline design algorithms are mostly applied to the automatic modeling of static data sets. They cannot accurately capture concept drift in the data, so a model trained at one stage fails to adapt to the data of the next stage and loses accuracy. In addition, for the CASH (Combined Algorithm Selection and Hyperparameter optimization) problem, the solution quality and efficiency of existing machine-learning pipeline design algorithms are unsatisfactory [3,4].
To this end, this study divides the machine-learning pipeline design problem into two sub-problems: searching the pipeline structure with reinforcement learning, and finding the optimal configuration of the pipeline hyperparameters with a Bayesian network model. On this basis, a Bayesian-model-based framework for automated machine-learning pipeline design (AutoML for PipeLine Design, Auto-PLD) is proposed, in which the Bayesian proxy model and the Bayesian acquisition function are improved during hyperparameter optimization to enhance the performance of Auto-PLD. The research makes two main contributions. First, the machine-learning pipeline design problem is divided into two sub-problems so that the pipeline structure and its hyperparameters can be optimized simultaneously. Second, the Bayesian proxy model and the Bayesian acquisition function are improved, thereby improving the performance of Auto-PLD. The research provides theoretical guidance and ideas for the practical application of automated machine-learning technology and serves as a reference for its further development in China.
2. Related Works
With the rapid development and popularization of information technology, application data in various fields have grown considerably. Artificial intelligence technology based on machine-learning models has received increasing attention as a means of using these data more efficiently. Automated machine-learning technology makes machine learning more automatic, efficient, and intelligent, and it lowers the application threshold of artificial intelligence technology, which has attracted considerable academic interest. Tan et al. proposed a COVID-19 prediction method based on automated machine-learning technology to analyze the chest CT scans of patients with COVID-19 pneumonia, enabling clinical prediction of the disease; the AUC value of the method was greater than 0.95, proving its effectiveness [5]. Alsharef et al. used an automated machine-learning framework for time-series forecasting, improving the efficiency and performance of data modeling and providing a reference for related researchers and industries [6]. Wever et al. applied automated machine learning, which supports the construction of pipelined algorithm models, to multi-label classification; the results showed a good application effect [7]. Baudart et al. proposed an orthogonal combinator to address the defect that progressive automated machine-learning techniques must change large-scale non-combined code; applying this combinator to progressive automated machine-learning techniques improved their operational efficiency [8]. Li et al. introduced the VolcanoML framework for end-to-end automated machine learning, which effectively improved the decomposition of the search space [9]. Automated machine-learning technology cannot always achieve the best prediction performance of a model within a limited time. Zogaj et al. addressed this problem by reducing the number of rows in the input data set, improving the efficiency of automated machine learning; experimental data confirmed the effectiveness of the method [10]. Li et al. combined Internet of Things technology, blockchain technology, and automated machine-learning technology to build an open, intelligent customer service platform that helps users carry out data transactions while ensuring their safety [11]. Yakovlev et al. noted that machine-learning models could not be deployed quickly because of massive data growth and introduced automated machine-learning technology to achieve fast and accurate modeling; the experimental results verified the effectiveness of the method [12].
The Bayesian classification algorithm achieves classification based on probability and statistics. It offers a simple classification procedure, high accuracy, and fast classification speed, and it performs well on large databases. Alade et al. used a Bayesian algorithm to optimize a support vector machine (SVM) prediction model so that the specific heat capacity of alumina/ethylene glycol nanofluids could be predicted accurately; the experimental results showed that the model accuracy reached 99.95% [13]. Scanagatta et al. surveyed Bayesian network structure learning, discussed approaches for handling incomplete data and continuous variables, and tested the current software tools [14]. Yao et al. explored the influence of silk processing parameters on the physical properties of silk fibers using a fast Bayesian algorithm and improved the silk processing procedure; the experimental results showed that the mechanical properties of the silk improved significantly after the fast Bayesian algorithm was introduced [15]. Maheswari et al. combined a decision tree with a naive Bayesian algorithm to mine healthcare data and predict heart disease; the experimental results validated the prediction accuracy of the method [16]. Joseph et al. used sparse Bayesian methods to solve the dictionary learning problem and verified their global convergence and stability; the method also performed well in image denoising [17]. Mat et al. proposed a Bayesian classification model to prevent malicious attacks by Android malware; in experiments on samples from the AndroZoo and Drebin databases, the accuracy of the model exceeded 90% [18]. Salvato et al. used a Bayesian algorithm to cross-match the counterparts of all-sky X-ray surveys; the experimental results showed that the method was faster and more accurate [19]. Liu et al. proposed a hybrid Bayesian algorithm and applied it to evaluate the collaborative capability of related equipment in retrieving ice cloud microphysics; the experimental results verified the effectiveness of the algorithm [20].
Automated machine-learning technology and Bayesian algorithms are thus widely used. In existing research, however, automated machine learning has mostly been applied to the automatic modeling of static data sets, and its effect in actual application scenarios has been poor. In response to this problem, this study proposes a machine-learning pipeline automation design method that combines Bayesian algorithms and reinforcement learning so that it also performs well in practical applications. The research provides new ideas for the practical application of automated machine-learning technology and helps promote the development of artificial intelligence technology.
3. Construction of the Auto-PLD Algorithm Framework for Classic Scenarios
3.1 Basic Structure Design of the Machine-learning Pipeline
Automated machine-learning technology can automatically select an algorithm for a given data set and tune its hyperparameters through a particular control strategy, reducing manual intervention and improving the performance of machine-learning algorithms on the data set. The main problem faced by automated machine-learning techniques is the CASH problem, which combines algorithm selection and hyperparameter optimization. CASH can be described as follows. Suppose there is a set of machine-learning algorithms $A=\left\{A_{1},A_{2},\ldots ,A_{n}\right\}$, and a data set is divided into two disjoint subsets: a training set $D_{1}$ and a test set $D_{2}$. The goal of the CASH problem is to find the algorithm $A_{i}\in A$ that, after being trained on $D_{1}$ with its hyperparameters $\lambda $ tuned over its hyperparameter space $\Lambda _{i}$, performs best on $D_{2}$. This can be expressed using formula (1).

$$\left(A^{*},\lambda ^{*}\right)=\underset{A_{i}\in A,\ \lambda \in \Lambda _{i}}{\arg \min }\ L\left(A_{i}^{\lambda },D_{1},D_{2}\right) \quad (1)$$

where $L\left(A_{i}^{\lambda },D_{1},D_{2}\right)$ is the loss of algorithm $A_{i}$ with hyperparameters $\lambda $ after training on $D_{1}$ and evaluation on $D_{2}$. In the application scenarios of machine learning, it is often necessary to consider the design of data preprocessing and feature preprocessing algorithms as well. Taking the classic classification task as an example, a machine-learning pipeline contains multiple algorithms that participate in data preprocessing, feature preprocessing, and final classification. A complete machine-learning pipeline structure can be expressed using formula (2).

$$m=\left(m_{1},m_{2},\ldots ,m_{l}\right) \quad (2)$$

where $l$ is the number of algorithms and $m_{1},m_{2},\ldots ,m_{l}$ are the algorithms that form the pipeline in turn. The input data of a machine-learning pipeline are $<F,y>$, where $F$ is the set of input features and $y$ is the corresponding data label. $F$ can be represented by two sets: discrete features $f_{1}$ and continuous features $f_{2}$. Based on the above, the machine-learning pipeline can be designed as shown in Fig. 1.
In Fig. 1, $M_{d1},M_{d2},M_{d3},M_{f},M_{c}$ represent, respectively, the set of algorithms for preprocessing discrete data, the set of algorithms that can preprocess discrete and continuous data simultaneously, the set of algorithms for preprocessing continuous data, the set of feature preprocessing algorithms, and the set of classification algorithms.
Fig. 1. Machine-learning pipeline feature transformation.
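To make this three-stage structure concrete, the following minimal scikit-learn sketch (an illustration, not the implementation used in the study) assembles one algorithm from each stage of Fig. 1; the algorithm choices and feature column indices are assumptions:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

discrete_cols = [0, 1]        # indices of discrete features f1 (assumed)
continuous_cols = [2, 3, 4]   # indices of continuous features f2 (assumed)

pipeline = Pipeline([
    # Data preprocessing: per-type transforms in the spirit of M_d1/M_d3
    ("data_prep", ColumnTransformer([
        ("discrete", OneHotEncoder(handle_unknown="ignore"), discrete_cols),
        ("continuous", StandardScaler(), continuous_cols),
    ])),
    # Feature preprocessing: one algorithm standing in for M_f
    ("feature_prep", SelectKBest(f_classif, k=10)),
    # Final classification: one algorithm standing in for M_c
    ("classifier", RandomForestClassifier(n_estimators=100)),
])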
3.2 Machine-learning Pipeline Structure Search Based on Reinforcement Learning
To realize machine-learning automation, the machine-learning pipeline structure must be determined. In addition, the hyperparameters corresponding to that structure must be optimized simultaneously. This study therefore proposes an Auto-PLD algorithm framework consisting of two parts, dividing the machine-learning pipeline design problem into two sub-problems, as shown in Fig. 2.
In Fig. 2, the two stages, A and B, are optimized alternately to realize the simultaneous optimization of the machine-learning pipeline structure and hyperparameters. The training process of reinforcement learning is a process of continuous interaction between the agent and the environment, in which the decision-making strategy is updated through the interaction information and further actions are taken. The essence of reinforcement learning is that the agent maximizes its reward through its decision-making strategy, as shown in Fig. 3.
The workflow of stage A is essentially that of reinforcement learning. Because stage A has the Markov property, it can be modeled as a reinforcement learning problem, and reinforcement learning is used to determine the pipeline structure. The goal of stage A is to find a sequence of the form given in formula (2). Accordingly, the state space of reinforcement learning can be determined using formula (3).
The study uses a 0/1 sequence to represent the state space; that is, a coding table combines all the candidate algorithms into a single fixed sequence, and each bit in the sequence represents one algorithm. A value of 0 indicates that the algorithm at that position is not selected, and a value of 1 indicates that it is selected. The set of states in reinforcement learning is denoted by $S$. One extra bit is appended at the end of the sequence to express the terminal state explicitly.
In summary, the length of the 0/1 sequence is $\left| M_{d1}\cup M_{d2}\cup M_{d3}\cup M_{f}\cup M_{c}\right| +1$. The machine-learning pipeline structure corresponds directly to this sequence, which has the advantages of low dimensionality, constant length, and simple implementation. In the problem corresponding to stage A, the action space of reinforcement learning contains two kinds of actions: selecting an algorithm and evaluating the entire pipeline. $X$ denotes the set of actions; by executing an action, the agent transitions between states. To avoid generating unreasonable pipeline structures, different candidate action sets must be designed for different states. From the definition of the state space, the algorithm corresponding to the last 1 in the 0/1 sequence can be identified; it is defined as $m_{s}^{last}$, the last algorithm in the pipeline corresponding to state $s\in S$. $s_{0}$ is the start state, in which the sequence is all zeros. $s_{e}$ is the terminal state, in which the last bit of the sequence is 1. The remaining definitions are given in formula (4).
Let $X_{s}$ be the set of possible actions of the agent in state $s$, and let $a_{e}$ denote the action of evaluating the machine-learning pipeline; then $X_{s}$ is given by formula (5).
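A minimal sketch of this state encoding and candidate-action rule, with placeholder algorithm names and a simplified validity check standing in for formulas (4) and (5), is:

ALGORITHMS = ["scaler", "one_hot", "imputer", "select_k_best", "svc", "rf"]
SEQ_LEN = len(ALGORITHMS) + 1          # extra bit marks the terminal state s_e

def initial_state():
    return (0,) * SEQ_LEN              # s_0: all-zero sequence

def apply_action(state, action):
    """Action = index of an algorithm bit to set, or SEQ_LEN-1 to evaluate
    the pipeline and enter the terminal state (last bit becomes 1)."""
    s = list(state)
    s[action] = 1
    return tuple(s)

def candidate_actions(state):
    """X_s: unselected algorithm bits, plus the evaluate action a_e once at
    least one algorithm has been chosen (a simplified validity rule)."""
    actions = [i for i in range(SEQ_LEN - 1) if state[i] == 0]
    if any(state[:-1]):
        actions.append(SEQ_LEN - 1)
    return actions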
In reinforcement learning, the reward function describes the value of the agent's actions in the environment, and the training process is the process of maximizing the cumulative reward. In the machine-learning pipeline structure search problem, the study uses performance evaluation as the reference index for the reward value. The performance of a machine-learning pipeline is closely related to the selection of its hyperparameters. To minimize the noise produced by different hyperparameter configurations, the reward value in stage A is defined as the best performance evaluated so far for the given pipeline structure $s$. Thus, the reward function in stage A is defined as formula (6).
In formula (6), the initial value of $r_{s}$ is 0; $s'$ denotes the next state, with the terminal and non-terminal cases distinguished; and $r_{s}^{now}$ is the performance of the machine-learning pipeline with structure $s$. Based on the above content, the machine-learning pipeline structure search is completed.
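In code, the spirit of formula (6) can be sketched as follows, where the best score per structure is cached so that individual noisy hyperparameter draws do not distort the reward (the exact update rule of formula (6) may differ):

best_score = {}   # structure bits (without terminal flag) -> best score so far

def reward(state, is_terminal, score_now=None):
    """0 for non-terminal transitions; at the terminal state, return the best
    performance observed so far for this pipeline structure."""
    if not is_terminal:
        return 0.0
    structure = state[:-1]
    best_score[structure] = max(best_score.get(structure, 0.0), score_now)
    return best_score[structure]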
Fig. 2. Auto-PLD framework.
Fig. 3. Basic process of reinforcement learning.
3.3 Bayesian-based Machine-learning Pipeline Hyperparameter Optimization
Assuming that the machine-learning pipeline structure is $m=\left(m_{1},m_{2},\ldots ,m_{l}\right)$, a set of optimal hyperparameters must be configured in its hyperparameter space $\theta _{1},\theta _{2},\ldots ,\theta _{l}$. The research proposes a shared Bayesian model following the SMBO (sequential model-based global optimization) algorithm framework to optimize the hyperparameters under different machine-learning pipeline structures. The Bayesian model is a common and effective global optimization algorithm that obtains the globally optimal solution by seeking the extremum of the objective function, as shown in formula (7).

$$x^{*}=\underset{x\in \chi }{\arg \min }\ f\left(x\right) \quad (7)$$

where $\chi $ is the search space, $f\left(\cdot \right)$ is the objective function, and $x$ is a given query point. In the automatic design of machine-learning pipelines, hyperparameter optimization is the problem of optimizing the loss function over the hyperparameter space. Hence, the hyperparameter space for Bayesian optimization must be defined.
First, the hyperparameter space should meet four basic requirements: support for integer and floating-point parameters, support for categorical parameters, support for conditional clauses, and support for forbidden clauses. The research uses a 0/1 sequence to represent the state space of reinforcement learning, with each bit representing an algorithm. Each bit in the sequence is therefore treated as a categorical parameter with the optional values 0 and 1. When a bit takes the value 1, the hyperparameter space of the algorithm it represents becomes one component of the pipeline hyperparameter space. As shown in Fig. 4, the machine-learning pipeline structure search is completed through reinforcement learning, which determines the 0/1 sequence and the pipeline structure. When a bit in the sequence takes the value 0, the hyperparameter space of its child node is defined as None. When a bit takes the value 1 (the AdaBoost position in Fig. 4, for example), the child node carries the hyperparameter space of the corresponding algorithm; for AdaBoost, the hyperparameters are the learning rate, the number of estimators, and the maximum depth.
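One way to realize such a conditional space is with the ConfigSpace library (0.x API), which supports exactly the four requirements listed above; the paper does not name its tooling, so the following is only an illustrative sketch with assumed value ranges:

from ConfigSpace import ConfigurationSpace
from ConfigSpace.conditions import EqualsCondition
from ConfigSpace.hyperparameters import (
    CategoricalHyperparameter, UniformFloatHyperparameter, UniformIntegerHyperparameter,
)

cs = ConfigurationSpace()
ada_bit = CategoricalHyperparameter("adaboost", [0, 1])  # one sequence bit
lr = UniformFloatHyperparameter("adaboost:learning_rate", 0.01, 2.0, log=True)
n_est = UniformIntegerHyperparameter("adaboost:n_estimators", 50, 500)
depth = UniformIntegerHyperparameter("adaboost:max_depth", 1, 10)
cs.add_hyperparameters([ada_bit, lr, n_est, depth])
# Child hyperparameters are active only when the AdaBoost bit equals 1.
for hp in (lr, n_est, depth):
    cs.add_condition(EqualsCondition(hp, ada_bit, 1))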
Through the above construction, a complete mapping from machine-learning pipeline structures into a shared hyperparameter space is obtained. In the SMBO framework, the core component is the proxy (surrogate) model. Compared with other models, the Gaussian process is more flexible in representing distributions over functions, so it is usually selected as the Bayesian proxy model. However, a proxy model constructed in this way depends heavily on a parameterized kernel function that is only suitable for continuous hyperparameters, so its effect in the automatic design of machine-learning pipelines is not ideal. This study therefore introduces a weighted-Hamming-distance kernel to optimize the proxy model and handle categorical parameters better. The method still constructs the proxy model with a Gaussian process, but it measures the distance between categorical parameters with a weighted Hamming distance and combines the categorical and continuous parts into a single kernel function for the proxy model, as in formula (8).

$$k_{\textit{mixed}}\left(x,x'\right)=\exp \left(\sum _{l\in P_{cont}}-\lambda _{l}\left(x_{l}-x'_{l}\right)^{2}+\sum _{l\in P_{cat}}-\lambda _{l}\left[1-\delta \left(x_{l},x'_{l}\right)\right]\right) \quad (8)$$
where $k_{\textit{mixed}}\left(\cdot ,\cdot \right)$ is the combined kernel function; $P_{cont}$ and $P_{cat}$ are the continuous numerical parameter set and the categorical parameter set, respectively; $\delta \left(\cdot ,\cdot \right)$ is the Kronecker delta function; and $\lambda _{l}$ is the kernel parameter of the $l$-th dimension. However, a Gaussian process proxy model has high complexity and is time-consuming. This study therefore uses the random forest algorithm as the proxy model instead; its advantages are a small computational load and a short processing time, which make it more suitable for machine-learning pipeline design. After the proxy model is determined, Expected Improvement (EI) is used as the Bayesian acquisition function, and the hyperparameters of the machine-learning pipeline are finally determined. Based on the above content, the Auto-PLD algorithm framework is constructed to realize the automatic design of the machine-learning pipeline.
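The resulting optimization loop can be sketched as follows: a random forest surrogate is fitted to the evaluated configurations, its per-tree predictions provide a mean and variance, and EI selects the next candidate. Here sample_configs, encode_config, and evaluate_pipeline are hypothetical helpers (drawing random configurations, vectorizing them, and training/scoring the pipeline); the sketch illustrates the mechanism rather than reproducing Auto-PLD:

import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def expected_improvement(mu, sigma, best_loss, xi=0.01):
    # EI for minimization, from the surrogate's per-candidate mean/std.
    sigma = np.maximum(sigma, 1e-9)
    z = (best_loss - mu - xi) / sigma
    return (best_loss - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def smbo(sample_configs, encode_config, evaluate_pipeline, n_iter=50, n_cand=1000):
    # Warm start with random configurations, then iterate fit/select/evaluate.
    configs = sample_configs(10)
    X = [encode_config(c) for c in configs]
    y = [evaluate_pipeline(c) for c in configs]
    for _ in range(n_iter):
        surrogate = RandomForestRegressor(n_estimators=100).fit(np.array(X), np.array(y))
        candidates = sample_configs(n_cand)
        enc = np.array([encode_config(c) for c in candidates])
        # Mean/std across the forest's trees approximate the posterior.
        per_tree = np.stack([tree.predict(enc) for tree in surrogate.estimators_])
        ei = expected_improvement(per_tree.mean(axis=0), per_tree.std(axis=0), min(y))
        chosen = candidates[int(np.argmax(ei))]
        X.append(encode_config(chosen))
        y.append(evaluate_pipeline(chosen))
    best = int(np.argmin(y))
    return X[best], y[best]   # best encoded configuration and its loss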
Fig. 4. Machine-learning pipeline hyperparameter space.
4. Performance Evaluation of the Auto-PLD Algorithm
Comparative experiments were designed and conducted to evaluate the performance of the Auto-PLD algorithm framework based on the Bayesian model. Table 1 lists the experimental environment.
The 10 datasets used in the experiments were all classification datasets from OpenML-CC18. Approximately 70% of the samples in each dataset were used as the training set, and the remaining 30% were used as the test set. Each experiment was run 10 times, and the average value was taken as the final result. In the Auto-PLD framework, the reinforcement learning method adopted was Q-learning. In addition to the proposed method, two baselines, Auto-sklearn and Auto-PLD-random, were constructed for comparison. The meta-learning of Auto-sklearn was pre-trained on large-scale public data sets that contained some of the datasets used in the experiments, so the meta-learning function was turned off to reduce experimental error. In Auto-PLD-random, the reinforcement learning and Bayesian optimization components were replaced with random search, i.e., the structure and hyperparameter configuration of the machine-learning pipeline were chosen entirely at random. The three methods were compared on each test set using balanced accuracy as the evaluation index under time budgets of 1 h, 4 h, and 8 h.
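For reference, balanced accuracy can be computed with scikit-learn as in the short sketch below, shown here on a stock dataset and a plain classifier rather than on the OpenML-CC18 datasets and pipelines actually used:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# 70/30 split mirroring the protocol above.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr)
print(balanced_accuracy_score(y_te, model.predict(X_te)))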
Auto-sklearn showed the best performance when the time budget was 1h, as shown in
Table 2. Its average balanced accuracy value was 0.840, which was 0.005 higher than the balanced
accuracy of Auto-PLD-random and 0.009 higher than the balanced accuracy of Auto-PLD.
Auto-PLD showed the best performance when the time budget was 4h. Its average balanced
accuracy value was 0.842, which was 0.003 higher than the balanced accuracy of Auto-PLD-random
and 0.001 higher than the balanced accuracy of Auto-sklearn. When the time budget
was 8h, Auto-PLD had the best performance, and its average balanced accuracy value
was 0.845, which was 0.007 higher than the balanced accuracy of Auto-PLD-random and
0.003 higher than that of Auto-sklearn. Auto-sklearn had an advantage when the time budget was small because Auto-PLD must search the machine-learning pipeline structure thoroughly before reinforcement learning accumulates a sufficient number of training samples. As the time budget increased, the performance of Auto-PLD became significantly better than that of Auto-sklearn and Auto-PLD-random. The above results verified the performance of Auto-PLD.
The machine-learning pipeline evaluation success rate of the three methods on each
data set was compared, as shown in Table 3. Different time budgets had little impact on the success rate of machine-learning pipeline evaluation. Among the three methods, the success rate of Auto-PLD was the highest, exceeding 92%. The success rate of Auto-sklearn was slightly lower than that of Auto-PLD, exceeding 91%. Auto-PLD-random had the lowest success rate, approximately 81%. This showed that adopting the search strategy proposed in the study could improve the performance of the machine-learning pipelines found during the search.
The average number of machine-learning pipelines per hour attempted by the three methods
under different time budgets was compared, as shown in Fig. 5. Auto-PLD-random made the largest number of machine-learning pipeline attempts per hour, exceeding 140,000. The average numbers of attempts per hour for Auto-PLD and Auto-sklearn were comparable, between 30,000 and 50,000. When the time budget was 8 h, the average number of machine-learning pipeline attempts per hour of Auto-PLD was 5034 fewer than that of Auto-sklearn.
The number of algorithm occurrences in the optimal machine-learning pipeline searched
by the three methods was compared, as shown in Table 4. No single algorithm obtained the optimal result for all problems; different problems require different algorithms to reach the optimal solution. Therefore, it is necessary to ensure the diversity of the AutoML algorithm library.
Fig. 6 shows how the performance of the best machine-learning pipeline found on the dataset with id 14 changed over time. When the time budget was 1 h, 4 h, and 8 h, the balanced accuracy values of Auto-PLD were 0.849, 0.858, and 0.863, respectively, higher than those of the other two methods. To avoid chance effects in the experimental results, the same test was repeated on the dataset with id 307, as shown in Fig. 7. When the time budget was 1 h, 4 h, and 8 h, the balanced accuracy values of Auto-PLD were 0.982, 0.985, and 0.987, respectively, again higher than those of the other two methods. These results indicated that Auto-PLD performed better. In summary, Auto-PLD, based on reinforcement learning and the Bayesian model, performed well in the automatic design of machine-learning pipelines.
Fig. 5. Average number of machine-learning pipeline attempts per hour under different time budgets.
Fig. 6. Time-varying performance of the best machine-learning pipeline on dataset id-14.
Fig. 7. Time-varying performance of the best machine-learning pipeline on dataset id-307.
Table 1. Experimental environment.
Project | Configuration information
Operating system | CentOS 7.0
CPU | 2x Intel(R) Xeon(R) E5-2620 v3 @ 2.40GHz (6C 12T 3.19GHz, 3.2GHz IMC, 6x 256kB L2, 15MB L3)
Memory | 64GB (8x8GB) 1866MHz
Hard disk | 2TB x2 3.5-inch with RAID-1
Programming language | Python 3.6.10
Machine-learning library | Scikit-learn 0.21.3
Ray | Ray 0.8.2
Network | 1Gb/s Ethernet
Table 2. Balanced accuracy values of the three methods (columns are dataset ids).
Time Budget | Method | 11 | 14 | 18 | 31 | 50 | 54 | 307 | 1053 | 1461 | 1480 | Average
1 h | Auto-PLD | 0.974 | 0.847 | 0.745 | 0.719 | 1.000 | 0.826 | 0.980 | 0.676 | 0.853 | 0.689 | 0.831
1 h | Auto-PLD-random | 1.000 | 0.855 | 0.746 | 0.719 | 1.000 | 0.833 | 0.985 | 0.675 | 0.852 | 0.685 | 0.835
1 h | Auto-sklearn | 1.000 | 0.849 | 0.753 | 0.736 | 0.997 | 0.840 | 0.981 | 0.683 | 0.854 | 0.703 | 0.840
4 h | Auto-PLD | 0.987 | 0.857 | 0.748 | 0.736 | 1.000 | 0.857 | 0.979 | 0.681 | 0.861 | 0.716 | 0.842
4 h | Auto-PLD-random | 1.000 | 0.859 | 0.747 | 0.728 | 1.000 | 0.841 | 0.982 | 0.675 | 0.860 | 0.693 | 0.839
4 h | Auto-sklearn | 1.000 | 0.865 | 0.755 | 0.728 | 1.000 | 0.844 | 0.983 | 0.678 | 0.856 | 0.699 | 0.841
8 h | Auto-PLD | 1.000 | 0.858 | 0.756 | 0.733 | 1.000 | 0.856 | 0.986 | 0.682 | 0.863 | 0.712 | 0.845
8 h | Auto-PLD-random | 1.000 | 0.855 | 0.750 | 0.730 | 1.000 | 0.833 | 0.984 | 0.678 | 0.862 | 0.705 | 0.838
8 h | Auto-sklearn | 1.000 | 0.876 | 0.755 | 0.730 | 0.999 | 0.836 | 0.979 | 0.678 | 0.858 | 0.709 | 0.842
Table 3. Machine-learning pipeline evaluation success rate of the three methods (%; columns are dataset ids).
Time Budget | Method | 11 | 14 | 18 | 31 | 50 | 54 | 307 | 1053 | 1461 | 1480 | Average
1 h | Auto-PLD | 92.42 | 91.05 | 94.34 | 95.13 | 93.02 | 91.77 | 92.58 | 90.26 | 89.73 | 92.06 | 92.24
1 h | Auto-PLD-random | 83.07 | 80.16 | 79.23 | 85.14 | 80.03 | 78.17 | 82.14 | 81.33 | 80.25 | 83.27 | 81.28
1 h | Auto-sklearn | 90.05 | 92.34 | 93.46 | 91.08 | 88.42 | 89.73 | 92.05 | 91.44 | 93.02 | 90.08 | 91.17
4 h | Auto-PLD | 93.26 | 90.58 | 93.49 | 96.22 | 91.05 | 92.38 | 91.46 | 91.04 | 89.42 | 93.05 | 92.20
4 h | Auto-PLD-random | 81.85 | 82.34 | 80.96 | 86.33 | 81.02 | 77.93 | 83.18 | 82.71 | 81.04 | 82.98 | 82.03
4 h | Auto-sklearn | 91.04 | 91.58 | 94.07 | 90.96 | 89.14 | 90.13 | 90.25 | 91.46 | 91.05 | 90.17 | 90.99
8 h | Auto-PLD | 92.84 | 91.03 | 92.45 | 95.66 | 90.72 | 93.11 | 92.45 | 90.01 | 90.42 | 92.17 | 92.07
8 h | Auto-PLD-random | 80.72 | 81.08 | 79.34 | 87.25 | 82.33 | 78.96 | 82.15 | 83.44 | 80.42 | 84.03 | 81.97
8 h | Auto-sklearn | 90.05 | 92.64 | 95.41 | 89.46 | 88.74 | 92.08 | 91.42 | 90.05 | 90.44 | 89.53 | 90.98
Table 4. Number of algorithm occurrences in the optimal machine-learning pipeline.
Time Budget | Method | AdaBoost | Bernoulli NB | ExtraTrees | GBDT | Gaussian NB | KNeighbors | LinearSVC | SGD | SVC | RF
1 h | Auto-PLD | 14 | 0 | 21 | 16 | 0 | 4 | 9 | 1 | 26 | 26
1 h | Auto-PLD-random | 10 | 1 | 17 | 32 | 0 | 4 | 11 | 1 | 25 | 20
1 h | Auto-sklearn | 13 | 0 | 20 | 7 | 2 | 10 | 14 | 2 | 14 | 27
4 h | Auto-PLD | 9 | 0 | 17 | 26 | 0 | 4 | 9 | 1 | 27 | 21
4 h | Auto-PLD-random | 5 | 0 | 16 | 32 | 0 | 6 | 10 | 0 | 33 | 17
4 h | Auto-sklearn | 20 | 0 | 24 | 8 | 0 | 9 | 8 | 2 | 11 | 23
8 h | Auto-PLD | 4 | 0 | 24 | 33 | 0 | 3 | 8 | 1 | 30 | 20
8 h | Auto-PLD-random | 8 | 0 | 18 | 33 | 2 | 6 | 10 | 0 | 30 | 18
8 h | Auto-sklearn | 17 | 1 | 24 | 7 | 0 | 6 | 14 | 2 | 15 | 25
5. Conclusion
Automated machine learning is a technology that replaces manual model selection and parameter optimization with machines, automating model design and improving modeling speed and performance. Machine-learning pipeline automation design is integral to machine-learning automation technology. This study took the classic classification problem as an example, completed the machine-learning pipeline structure search based on reinforcement learning, realized the optimal configuration of hyperparameters based on the Bayesian network, and proposed the Auto-PLD algorithm framework. Auto-PLD was tested, and the experimental results showed that when the time budget was four hours, the average balanced accuracy of Auto-PLD was 0.842, which was 0.003 higher than that of Auto-PLD-random and 0.001 higher than that of Auto-sklearn. When the time budget was eight hours, the average balanced accuracy of Auto-PLD was 0.845, which was 0.007 higher than that of Auto-PLD-random and 0.003 higher than that of Auto-sklearn. Under different time budgets, Auto-PLD had the highest machine-learning pipeline evaluation success rate on the various datasets, exceeding 92%. When the budget was eight hours, the average number of machine-learning pipeline attempts per hour of Auto-PLD was 5034 fewer than that of Auto-sklearn. On the dataset with id 14, the balanced accuracy values of Auto-PLD under time budgets of one, four, and eight hours were 0.849, 0.858, and 0.863, respectively, higher than those of the other two methods. On the dataset with id 307, the corresponding values were 0.982, 0.985, and 0.987, again higher than those of the other two methods. In summary, the proposed Auto-PLD performs well and has practical value in the automated design of machine-learning pipelines. The scale of the data used in the experiments is limited, which may introduce experimental error; therefore, subsequent work should expand the scale and number of the datasets and conduct more experiments to reduce the influence of accidental factors.
6. Funding
The research is supported by Zibo Key Research and Development Program (city school-city
integration) project “Building an integrated platform for industry-academia-research
based on digital twin technology to empower Zibo’s digital economy” (No. 2021SNPT0055).
REFERENCES
[1] Tsiakmaki M, Kostopoulos G, Kotsiantis S, Ragos O, ``Fuzzy-based active learning for predicting student academic performance using autoML: a step-wise approach,'' Journal of Computing in Higher Education, vol. 33, no. 3, pp. 635-667, 2021.
[2] Gupta G, Katarya R, ``EnPSO: An AutoML technique for generating ensemble recommender system,'' Arabian Journal for Science and Engineering, vol. 46, no. 9, pp. 8677-8695, 2021.
[3] Roman D, Saxena S, Robu V, Pecht M, Flynn D, ``Machine learning pipeline for battery state-of-health estimation,'' Nature Machine Intelligence, vol. 3, no. 5, pp. 447-456, 2021.
[4] Ajirlou AF, Partin-Vaisband I, ``A machine learning pipeline stage for adaptive frequency adjustment,'' IEEE Transactions on Computers, vol. 71, no. 3, pp. 587-598, 2021.
[5] Tan HB, Xiong F, Jiang YL, Huang WC, Wang Y, Li HH, You T, Fu TT, Peng BW, ``The study of automatic machine learning base on radiomics of non-focus area in the first chest CT of different clinical types of COVID-19 pneumonia,'' Scientific Reports, vol. 10, no. 1, pp. 1-10, 2020.
[6] Alsharef A, Aggarwal K, Kumar M, Mishra, ``A review of ML and AutoML solutions to forecast time-series data,'' Archives of Computational Methods in Engineering, vol. 29, pp. 5297-5311, 2022.
[7] Wever M, Tornede A, Mohr F, Hüllermeier E, ``AutoML for multi-label classification: Overview and empirical evaluation,'' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 9, pp. 3037-3054, 2021.
[8] Baudart G, Hirzel M, Kate K, Ram R, Shinnar A, Tsay J, ``Pipeline combinators for gradual AutoML,'' Advances in Neural Information Processing Systems, vol. 34, pp. 19705-19718, 2021.
[9] Li Y, Shen Y, Zhang W, Zhang C, Cui B, ``VolcanoML: speeding up end-to-end AutoML via scalable search space decomposition,'' The VLDB Journal, 2022.
[10] Zogaj F, Cambronero JP, Rinard MC, Cito J, ``Doing more with less: characterizing dataset downsampling for AutoML,'' Proceedings of the VLDB Endowment, vol. 14, no. 11, pp. 2059-2072, 2021.
[11] Li Z, Guo H, Wang WM, Guan YJ, Barenji AV, Huang GQ, McFall KS, Chen X, ``A blockchain and AutoML approach for open and automated customer service,'' IEEE Transactions on Industrial Informatics, vol. 15, no. 6, pp. 3642-3651, 2019.
[12] Yakovlev A, Moghadam HF, Moharrer A, Cai JX, Chavoshi N, Varadarajan V, Agrawal SR, Idicula S, Karnagel T, Jinturkar S, Agarwal N, ``Oracle AutoML: a fast and predictive AutoML pipeline,'' Proceedings of the VLDB Endowment, vol. 13, no. 12, pp. 3166-3180, 2020.
[13] Alade IO, Abd Rahman MA, Saleh TA, ``Predicting the specific heat capacity of alumina/ethylene glycol nanofluids using support vector regression model optimized with Bayesian algorithm,'' Solar Energy, vol. 183, pp. 74-82, 2019.
[14] Scanagatta M, Salmerón A, Stella F, ``A survey on Bayesian network structure learning from data,'' Progress in Artificial Intelligence, vol. 8, no. 4, pp. 425-439, 2019.
[15] Yao Y, Allardyce BJ, Rajkhowa R, Hegh D, Sutti A, Subianto S, Rana S, Greenhill S, et al., ``Improving the tensile properties of wet spun silk fibers using rapid Bayesian algorithm,'' ACS Biomaterials Science & Engineering, vol. 6, no. 5, pp. 3197-3207, 2020.
[16] Maheswari S, Pitchai R, ``Heart disease prediction system using decision tree and naive Bayes algorithm,'' Current Medical Imaging, vol. 15, no. 8, pp. 712-717, 2019.
[17] Joseph G, Murthy CR, ``On the convergence of a Bayesian algorithm for joint dictionary learning and sparse recovery,'' IEEE Transactions on Signal Processing, vol. 68, pp. 343-358, 2019.
[18] Mat SRT, Ab Razak MF, Kahar MNM, Arif JM, Firdaus A, ``A Bayesian probability model for Android malware detection,'' ICT Express, vol. 8, no. 3, pp. 424-431, 2022.
[19] Salvato M, Buchner J, Budavári T, Dwelly T, Merloni A, Brusa M, Rau A, Fotopoulou S, Nandra K, ``Finding counterparts for all-sky X-ray surveys with NWAY: a Bayesian algorithm for cross-matching multiple catalogs,'' Monthly Notices of the Royal Astronomical Society, vol. 473, no. 4, pp. 4937-4955, 2018.
[20] Liu Y, Mace GG, ``Assessing synergistic radar and radiometer capability in retrieving ice cloud microphysics based on hybrid Bayesian algorithms,'' Atmospheric Measurement Techniques, vol. 15, no. 4, pp. 927-944, 2022.
Author
Qi Wang is a lecturer at Shandong Vocational College of Industry and a member of the Shandong Electronics Society. He received a bachelor's degree in computer science and technology from Shandong University of Technology in 2005 and a master's degree in software engineering from the University of Electronic Science and Technology in 2008. He holds the title of Zibo Technical Expert and is a Huawei HCIP holder, an H3C certified lecturer, and a Microsoft MCSE. He has guided students to first prize in the virtual reality competition of the National Vocational College Skills Competition. His main research interests include network communication protocols, graphics, edge computing, and artificial intelligence.