• 대한전기학회
Mobile QR Code QR CODE : The Transactions of the Korean Institute of Electrical Engineers
  • COPE
  • kcse
  • 한국과학기술단체총연합회
  • 한국학술지인용색인
  • Scopus
  • crossref
  • orcid




Unconstrained nonlinear conjugate gradient methods, Adaptive Restart, Stagnation

1. Introduction

Conjugate Gradient (CG) methods are widely employed for optimizing objective functions with multiple variables, especially for minimizing such functions without the need to store matrices. In determining directional vectors, it is crucial not only to ascertain the direction but also the magnitude of the vector to pinpoint the optimal point of the objective function. In the realm of nonlinear systems, CG methods typically yield directional vectors in approximately n steps for quadratics with n variables, where these vectors are conjugate and orthogonal to each other[1-2]. However, achieving exact line searches in nonquadratic CG for nonlinear systems within n steps is unattainable.

One inherent characteristic of CG methods is the selection of a new directional vector based on previously determined vectors in CG iterations. The relationship between these vectors can contribute to stagnation in the convergence process. Addressing this issue necessitates mitigating or eliminating the correlation between directional vectors. To counteract the deterioration in convergence for nonquadratic CG, a restart process is often implemented every nth iteration [3,4,10]. However, this approach proves less effective for nonquadratic problems with numerous variables [5-7].

In response to these challenges, this paper introduces a modified nonquadratic CG method termed Adaptive Restarting Conjugate Gradient (ARCG). The ARCG method is designed to achieve rapid and effective convergence in nonquadratic CG, thereby minimizing stagnation. The proposed algorithm adapts decision-making approaches based on the surface shape of the objective function, meticulously monitoring convergence dynamics to enhance performance.

Section 2 will explore directional parameters for both quadratic and nonquadratic CG methods. In Section 3, the ARCG method will be elaborated upon, specifically addressing convergence stagnation in nonlinear conjugate gradients. Section 4 will present experimental results, showcasing the reduction of convergence stagnation in CG for a specific nonquadratic scenario. Finally, Section 5 will provide conclusions on the proposed methods, along with an analysis of the restarting process.

2. Iterative Methods

Many real-world optimization problems entail objective functions lacking readily available analytical solutions. In such instances, iterative methods become indispensable for approximating solutions. These iterative optimization techniques, such as conjugate gradient methods and their variants, play a vital role in numerically finding optimal or near-optimal solutions by iteratively refining the current estimate. They prove especially valuable in scenarios where analytical optimization is impractical or infeasible, facilitating the effective exploration of complex, high-dimensional solution spaces [8-9]. The iterative nature of these methods allows for systematic improvement of the solution approximation until a satisfactory level of optimality is attained.

2.1 Steepest Descent Methods

The method of Steepest Descent (SD) is a gradient algorithm where the step size is chosen to achieve the maximum decrease in the objective function at each individual step. Let $\left\{{ x}^{(k)}\right\}_{k=0}^{\infty}$ be a sequence in the domain of a function $f({ x}):{ R}^{n}\to { R}$ that is differentiable at each point where n and k are integers [10]. The direction of gradient descent can be obtained as the vector $-\nabla f({ x}^{(k)})$ where the gradient $\nabla f({ x}^{(k)})$ is the first order derivative of $f({ x}^{(k)})$,

(1)
$\nabla f({x}^{({k})})=(\dfrac{\partial}{\partial{x}_{1}^{({k})}},\: \dfrac{\partial}{\partial{x}_{2}^{({k})}},\: ...,\: \dfrac{\partial}{\partial{x}_{{n}}^{({k})}})^{{T}}$

Then starting at ${ x}^{(k)}$, the next point ${ x}^{(k+1)}$ is searched by moving with an amount of in the orthogonal steps,

(2)
${ x}^{(k+1)}={ x}^{(k)}-\alpha_{k}\nabla f({ x}^{(k)})$

where $\alpha_{k}$ is a positive scalar called the step size, which is chosen to achieve the maximum amount of decrease of the objective function at each individual step. The $\alpha_{k}$ is selected to minimize

(3)
$f({ x}^{(k+1)})= f({ x}^{(k)}-\alpha_{k}\nabla f({ x}^{(k)}))$

with respect to $\alpha_{k}\ge 0$ where the variable is defined as

(4)
$\alpha_{k}=\arg\min f({ x}^{(k)}-\alpha\nabla f({ x}^{(k)}))$ for $\alpha\ge 0$.

For a quadratic objective function of the form:

(5)
$f({ x})=\dfrac{1}{2}{ x}^{T}{ Q}{ x}-{b}^{T}{ x}$

where ${ Q}\in{ R}^{n\times n}$ is a symmetric positive definite matrix, ${ b}\in{ R}^{n}$ and ${ x}\in{ R}^{n}$, the unique minimizer $\nabla f({ x})$ and the Hessian matrix ${ H}({ x})$ of $f({ x})$ can be found by:

(6)
$\nabla f({ x})={ Q}{ x}-{b}$ and ${ H}({}{x})={}{Q}$ (for quadratic case:${}{Q}^{{T}}={}{Q}$).

The gradient descent algorithm for an objective function $f({}{x})$ follows the method of steepest descent, moving in orthogonal steps, as indicated by $\left <\nabla f({}{x}^{({k})},\: \nabla{f}({}{x}^{({k}+1)}\right > =0$ for each k in the sequence $\left\{{ x}^{(k)}\right\}_{k=0}^{\infty}$, where the operator, $ < >$, denotes the inner product. In steepest descent, the algorithm iteratively updates the current solution by moving in the direction opposite to the gradient of the objective function at the current point. This direction is considered the steepest descent as it corresponds to the direction of the fastest decrease in the function. Nevertheless, this method may display zig-zagging behavior and slow convergence in certain cases [12,17].

2.2 Conjugate Gradient Methods

To address the limitations of the steepest descent method, the conjugate gradient (CG) method is introduced. The key idea behind the conjugate gradient method is to search for conjugate directions, which are directions that are orthogonal with respect to the problem's quadratic form. By choosing conjugate directions, the method aims to avoid the zig-zagging behavior and improve the convergence rate compared to the steepest descent method.

The CG algorithm is widely used for optimization problems, especially in the context of solving linear systems of equations or unconstrained optimization problems. It is known for its efficiency, particularly for quadratic functions, where it can converge to the solution in a number of iterations equal to the dimension of the problem. The CG algorithm is an iterative optimization technique that belongs to the family of iterative solvers for linear systems and can be extended to unconstrained optimization problems as well. The Conjugate Gradient algorithm is described below. Given a quadratic function, Equation (5), the conjugate gradient algorithm is summarized as follows [2,10,17]:

==============================================

Initialize the process by choosing an initial guess ${ x}^{(0)}$ and the residual ${ r}^{(0)}={ b}-{ Qx}^{(0)}$. Then the search direction becomes ${ d}^{(0)}={ r}^{(0)}= -\nabla f({ x}^{(0)})={ b}-{ Qx}^{(0)}$

such that

$\nabla{f}({}{x}^{(0)})={}{Qx}^{(0)}-{}{b}$.

If $\nabla{f}({}{x}^{(0)})=0$, stop, else set ${}{d}^{(0)}={}{r}^{(0)}$.

For each iteration $k=0,\: 1,\: 2,\: ...$

- Compute the step size by minimizing

(7)
$f({}{x}^{({k})}+\alpha_{{k}}{}{p}^{({k})})$,
(8)
$\alpha_{k}=-\dfrac{\nabla f({}{x}^{({k})})^{{T}}{ d}^{({k})}}{{}{d}^{({k})^{{T}}}{}{Qd}^{({k})}}$.

- Update the step:

(9)
${ x}^{(k+1)}={ x}^{(k)}+\alpha_{k}{ d}^{(k)}$.

- Update the gradient:

(10)
$-\nabla{f}({}{x}^{({k}+1)})={}{b}-{}{Qx}^{({k}+1)}$.

- Compute the conjugate gradient parameter to determine the next search direction:

(11)
$\beta_{k}=-\dfrac{\nabla f({}{x}^{({k}+1)})^{{T}}{}{Qd}^{({k})}}{{}{d}^{({k})^{{T}}}{}{Qd}^{({k})}}$.

- Update the search direction:

(12)
${ d}^{(k+1)}= -\nabla f({ x}^{(k+1)})+\beta_{k}{ d}^{(k)}$

==============================================

The iterations terminate when a predefined convergence criterion is met, such as when the norm of the gradient is below a certain threshold or after a fixed number of iterations. The CG methods involve more sophisticated update rules. The search directions are adjusted based on the history of previous iterations to ensure conjugacy, while the update rule in steep descent is simple and involves multiplying the negative gradient by a step size (learning rate) to determine the next iterate. The CG methods are designed to converge more quickly than the SD method when they minimize the objective function exactly for quadratic problems in a certain number of iterations (up to machine precision). The CG methods may require additional memory to store vectors representing the history of previous iterations while steep descent requires less memory typically as it only needs to store the current gradient vector.

The fact that the CG methods solve quadratics of n variables in n steps is well known, where the set of search directions is known as the best. The CG methods may be regarded as an intermediate between the SD methods and Newton’s methods, which implies that the CG methods typically perform better than the SD methods but not as well as Newton’s methods [11-12].

Newton’s methods have the property that the minimum may be found without much effort under the following conditions. First, the objective function is approximately quadratic near the minimum. Second, the objective function is twice differentiable at the minimum. Third, the second derivative is non-singular at the minimum point [10,15,16].

Newton’s methods practically utilize that the non-quadratic function is approximately quadratic near the minimum points if the objective function is twice differentiable at the minimum and the second derivative is non-singular at the point. However, in other words, if the initial point is not near to the solution, Newton’s methods are not practical.

CG methods can solve quadratics of n variables in n steps. Matrix inversion and the storage of a matrix are not required in the CG procedure compared to Newton’s methods. Thus, the usual implementation of the quadratic CG algorithm does not require Hessian matrix evaluations in iterations, while the nonquadratic CG algorithm needs Hessian evaluations in every iteration [1,10,15].

2.3 Nonquadratic Conjugate Gradient Methods

Researchers have tackled this issue by devising a range of line search techniques. In general, for non-quadratic problems, evaluating the Hessian matrix at every iteration of the CG algorithm is computationally demanding. In contrast to the constant Hessian matrix found in quadratic objective function, the CG methods extend the CG approach to non-quadratic numerical optimization, eliminating the necessity for Hessian evaluations at every step. In non-quadratic CG methods, the closed-form expression for the conjugate gradient parameter $\beta_{k}$ in Equation (11) can be substituted with a line-search procedure, obviating the need for the Hessian matrix Q in the Equation (5). This alteration enables the algorithm to rely solely on the values of the objective function and its gradient at each iteration, without requiring matrix involvement.

The conjugate gradient parameter $\beta_{k}$ in Equation (11) can be reformulated as depicted in Equation (13) by replacing $[\nabla f({ x}^{(k+1)}-\nabla f({ x}^{(k)})]/\alpha_{k}$ with the term ${ Qd}^{(k)}$. This term ${ Qd}^{(k)}$ is obtained by multiplying ${ x}^{(k+1)}={ x}^{(k)}+\alpha_{k}{ d}^{(k)}$ (from Equation (9)) with Q, which yields ${ Qx}^{(k+1)}={ Qx}^{(k)}+\alpha_{k}{ Qd}^{(k)}$. From the relation ${}{Qx}^{({k}+1)}-{}{Qx}^{({k})}=\alpha_{{k}}{}{Qd}^{({k})}$ and $\nabla f({ x}^{(k+1)})+{ b}={ Qx}^{(k+1)}$ in Equation (10), the term $[\nabla f({ x}^{(k+1)}-\nabla f({ x}^{(k)})]/\alpha_{k}$ is derived. Hence, Equation (11), $\beta_{k}=\dfrac{\nabla f({ x}^{(k+1)}){ Qd}^{k)}}{{ d}^{(k)^{T}}{ Qd}^{(k)}}$, is reformulated by incorporating the relations we established, resulting in a new conjugate gradient parameter $\beta_{k}^{HS}$. This leads to

(13)
$\beta_{k}^{HS}=\dfrac{\nabla f({}{x}^{({k}+1)})^{{T}}[\nabla{f}({}{x}^{({k}+1)})-\nabla{f}({}{x}^{({k})})]}{{ d}^{(k)^{T}}[\nabla f({}{x}^{({k}+1)})-\nabla{f}({}{x}^{({k})})]}$

Equation (13) is known as the Hestenes-Stiefel formula [1], which is derived by eliminating the parameter $\alpha_{k}$ and removing the parameter Q through mathematical manipulations. In this context, $\nabla f({ x}^{(k)})$ represents the gradient of the objective function at the k-th iteration, and ${ d}^{(k)}$ denotes the previous conjugate direction. The Hestenes-Stiefel formula for $\beta_{k}^{HS}$ ensures that the new conjugate direction is maintained conjugate to the previous ones with respect to the quadratic objective function. This prevents the algorithm from retracing its steps, thereby contributing to efficient convergence. Therefore, determining the conjugate gradient parameter without explicit knowledge of the Hessian matrix Q becomes essential, necessitating only the objective function and gradient values at the selected location for each iteration [1,10,16].

However, for nonquadratic problems involving n variables, the algorithm typically fails to converge in n steps [10,16]. As the algorithm advances, the condition of conjugate orthogonality among the direction vectors tends to weaken. This underscores the importance of line search in identifying the minimum of a nonquadratic objective function. Popular choices for the parameter include the Fletcher-Reeves (FR), Polak-Ribiere (PR), Polak-Ribiere-Polyak (PRP), and Hestenes-Stiefel (HS) formulas, among others, to obviate the need for Q [11-12].

In general, conjugate gradient algorithms for nonlinear problems require only the objective function and gradient values at each iteration, bypassing the need for explicit knowledge of the Hessian matrix, with various options available for selecting the parameter $\beta_{k}$. For quadratic problems, the formulas for $\beta_{k}$—specifically those of HS, PR, and FR—are identical. However, this is not the case with nonquadratic problems, where different expressions for $\beta_{k}$ can lead to distinct outcomes. Nonquadratic objective functions can increase computational demands and lead to stagnation in practical applications. Consequently, the choice of $\beta_{k}$ significantly influences the effectiveness of the line search process, making it a critical factor in solving minimization problems involving nonquadratic functions [10,15].

3. Adaptive Restarting Conjugate Gradient Methods

3.1 Conjugate Gradient methods with Restarting

While the process of finding the objective function's minimum involves conducting searches along directions that are mutually conjugate for a quadratic function of n variables in n steps, the CG algorithm often does not converge to the optimum within n steps for non-quadratic problems.

To counteract the issue of slow convergence, the direction search process is typically reset to follow the steepest descent direction at least every n or n+1 iterations in practice for an objective function with n variables, as noted in references [7,10]. However, this approach might not be suitable for large n values. The reason is that restarting the process every n iteration may not effectively counteract the deterioration of convergence due to the extended intervals between restarts, which become too lengthy with a large number of variables.

Your explanation provides a detailed view on the dynamics of the conjugate gradient (CG) method and its restart mechanism, but it can be made clearer and more concise. Indeed, the conjugate gradient (CG) method tends to exhibit linear convergence unless the iterative process is periodically restarted. At the outset of CG iterations, starting from an initial point ${ x}^{(0)}$, the subsequent point ${}{x}^{(1)}$ is determined using the Steepest Descent (SD) method. Without a restart, the vector leading from the initial to the minimum point can be expressed as a linear combination of conjugate directional vectors. Thus, a restart at ${}{x}^{({k}+1)}$ is affected from ${ x}^{(k)}$ by reforming the search for directional vectors, specifically by setting $\beta_{k}=0$ in Equation (12). This ensures that the new directional vector ${ d}^{(k+1)}$ is no longer a linear combination of the previous conjugate directional vectors.

Recent studies have introduced various restarting algorithms for nonquadratic conjugate gradient (CG) methods, developing a method for restarting by monitoring the $l_{2}$ norm of the gradients of $\nabla f({ x}^{(k)})$ and $||{ d}^{(k+1)}||$ at the point ${ x}^{(k)}$. These studies leverage the characteristics of typical conditions related to gradient directions to establish global convergence, as documented in references [16-17].

In addition, at each iteration, the backtracking line search method calculates a step size that ensures a suitable reduction in the non-convex objective function. This process begins with an initially large estimate of the step size for the line search direction, then iteratively reduces—or 'backtracks'—the step size until a decrease in the objective function is achieved. This backtracking process ensures an appropriate balance between the step size reduction and the local gradient of the objective function. Restarting methods utilize not only the inner product of the gradient and directional vector but also consider the norm of the directional vector at each point [14,18,19].

The proposed adaptive method, however, capitalizes on vector properties, specifically the vector that extends from the initial point to the directional vector, to monitor changes in convergence without directly searching for the step size. This contrasts with many restarting methods, which impose a significant computational load by relying on the degree of backtracking for restart criteria, focusing on finding the next directional vector and its size without addressing the overall trajectory of conjugate directional vectors. Instead, the proposed method detects convergence stagnation and initiates a CG restart, effectively mitigating the risk of stagnation.

3.2 Stagnation of Conjugate Gradient and Restart

Stagnation in optimization occurs when an algorithm's progress toward the optimal solution significantly slows down or stops altogether. This phenomenon, particularly within the conjugate gradient (CG) method, manifests as negligible progress toward the optimal solution, evidenced by minimal changes in the objective function or the values of decision variables across successive iterations. Despite the CG method's effectiveness, stagnation can occur, especially in complex optimization challenges. Identifying the causes of stagnation and implementing suitable strategies or adjusting parameters are crucial steps to enhance the algorithm's performance.

Several strategies can be employed to address stagnation in the CG method. First, restarting the CG method involves resetting the algorithm's state after a certain number of iterations or when specific conditions are met, helping to overcome stagnation by reinitializing the search process. Second, adjusting parameters such as the step size or the preconditioner can sometimes alleviate stagnation issues. Third, employing adaptive methods can enhance the algorithm's robustness. Incorporating adaptive strategies, like adjusting the search direction based on the objective function's behavior, can effectively overcome stagnation.

The relationship between stagnation and the Nonlinear Conjugate Gradient (NCG) method merits exploration. In nonquadratic optimization, the NCG method adapts the conventional approach to accommodate more complex, nonquadratic objective functions, whereas the CG method is optimized for quadratic objectives. In the realm of non-quadratic problems with n variables, the optimization process may fail to converge within n steps. Additionally, the quality of conjugate direction vectors can decline, exacerbating stagnation. This decline is often due to the method of expanding direction vectors, where each new directional vector must be conjugately orthogonal to those of previous iterations, potentially leading to stagnation.

Stagnation in the NCG method may arise from various challenges, such as ill-conditioned problems, poor scaling, or unsuitable step size choices. The inherent characteristics of nonquadratic objective functions, including nonlinearity and variable curvature, further contribute to stagnation when the algorithm cannot adapt to the evolving landscape. Causes of stagnation within the NCG framework include loss of conjugacy, inadequate step size selection, and the complexity of navigating nonquadratic surfaces. Given the wide variability in the curvature of objective functions in nonquadratic optimization, selecting appropriate step sizes becomes a significant challenge across the optimization landscape.

Addressing stagnation in the NCG method can involve several strategies, such as adopting adaptive step size selection, implementing preconditioning techniques, and employing restarting strategies. These approaches aim to enhance the method's adaptability and efficiency in overcoming the unique challenges presented by nonquadratic optimization problems.

Adaptive methods play a pivotal role in addressing stagnation within optimization, particularly in the Nonlinear Conjugate Gradient (NCG) method. These methods dynamically adjust parameters in response to the behavior of the optimization process, especially changes in the objective function's curvature. Restarting the optimization algorithm, by resetting its state, offers another effective strategy for overcoming stagnation. This approach enables the algorithm to explore new search directions, potentially navigating out of local minima or areas of slow progress. The challenges of nonquadratic objective functions inherently contribute to stagnation, making adaptive methods and restarting strategies crucial for mitigating these effects and enhancing optimization efficiency in the face of nonlinearities.

Stagnation in optimization is characterized by a significant slowdown or complete halt in progress towards the optimal solution. Within the CG method, it manifests as minor changes in the objective function or decision variables across iterations. Despite the method's proven effectiveness, stagnation can occur, especially in the face of complex optimization challenges. Understanding the underlying causes of stagnation and implementing effective mitigation strategies, such as parameter adjustments, is essential for improving the performance of the CG method.

3.3 The Adaptive Restarting Conjugate Gradient methods

Adaptive restarting strategies, which dynamically adjust the criteria for restarting based on the optimization algorithm's progress, have garnered interest from numerous researchers. This paper proposes adaptive methods using the CG approach, leveraging information about the objective function's curvature. The proposed algorithm has demonstrated an improvement in convergence by mitigating or even eliminating the phenomena of stagnation through an adaptive CG restarting procedure.

The criterion for restarting is set in an adaptive manner, responding to the dynamic behavior of the directional vectors. Although the convergence characteristics are significantly influenced by the initial starting point of the CG method, the adaptive restarting approach can reduce the computational load and time required for convergence. The dynamics of CG directional vectors are affected by the gradient, which reflects the surface curvature of the objective function. Therefore, generally, the deterioration of convergence in nonquadratic CG problems may be avoided not simply by choosing an arbitrary initial point but through the implementation of adaptive control strategies.

The restarting process in the CG method involves eliminating the conjugacy of directional vectors at the point of restart. This means that the requirement for all directional vectors to be conjugate orthogonal to each other is lifted at this point, allowing for a new set of CG directional vectors to commence. This constraint is identified as a potential cause of stagnation in nonconvex CG methods. Stagnation indicates that convergence is not as effective as expected, partly due to the conjugacy constraint. To select a new directional vector that is not bound by conjugacy, we employ the Steepest Gradient (SG) method whenever stagnation is detected.

Consider an initial point ${ x}^{(0)}$ as the centroid of a sphere $||{ x}-{ x}_{c}||$for every ${}{x}\in{}{R}^{{n}}$, where distances from the position ${ x}_{c}= x^{(0)}$ to ${ x}^{(k)}$ and ${ x}^{(k+1)}$ are measured for $k>1$.

When the distance from the centroid ${ x}_{c}$ to ${ x}^{(k+1)}$, $||{ x}^{(k+1)}-{ x}_{c}||$, is less than the distance $||{ x}^{(k)}-{ x}_{c}||$, the trajectory of the positions starts spiraling inwards with respect to the centroid of the sphere, signaling a potential onset of stagnation. Upon detecting the beginning of this spiral, ${ x}^{(k+1)}$ is designated as the new centroid for the subsequent sphere. The task then shifts to detecting a spiral pattern anew relative to this updated centroid. Detection of a spiral-in pattern triggers the initiation of a new CG cycle, where the steepest descent method is applied to remove the property of conjugacy. This approach effectively adapts to the curvature characteristics of the objective function's surface.

The beginning of the spiral in can be measured by computing the ratio $\rho$ between two successive reference vectors ${}{x}^{({k})}$ and ${ x}^{(k+1)}$ with respect to a chosen centroid ${ x}_{c}$, as given by Equation (15) for (k > 1). The ratio of the distances is defined as Equation (14): $\rho =\dfrac{||{ x}^{(k+1)}-{ x}_{c}||}{||{ x}^{(k)}-{ x}_{c}||}$.

Here, || · || denotes a norm. The ratio $\rho$ is utilized to detect the possible onset of stagnation at the (k+1)th directional vector. When the ratio $\rho$ is less than the predetermined threshold $\eta\in(0,\: 1]$, we restart the CG process by resetting the centroid ${ x}_{c}$ to the new position ${ x}^{(k+1)}$ and setting ${ d}^{(k+1)}= -\nabla f({ x}^{(k+1)})$. This restart breaks the 'conjugacy' connection of the directional vectors with the new directional vector ${ d}^{(k+1)}$. The properties of the vector measure not only the step size of a directional vector but also the characteristics of the trajectory of the directional vectors. However, if the ratio is greater than the threshold $\eta$, the directional vectors are free from stagnation, and we proceed without restarting CG.

In the proposed ARCG method, we find that it is rather simple and effective, as it accelerates the convergence process while avoiding stagnation. For instance, some restarting methods in previous works could potentially slow down the convergence speed, although they may improve stability. Additionally, possible round-off errors may cause misbehavior of the CG method, a concern that does not arise in ARCG. Moreover, ARCG enhances both the convergence speed and stability properties of non-quadratic optimization by eliminating stagnation processes during convergence. Below is the algorithm for ARCG.

==============================================

Algorithm: Adaptive Restarting Conjugate Gradient (ARCG)

==============================================

Input:${ x}^{(0)}$, set $\nabla f({ x}^{(0)})$, ${}{d}^{(0)}=\nabla{f}({}{x}^{(0)})$, ${ x}_{c}= x^{(0)}$, $\eta\in(0,\: 1]$.

for k=0, 1, 2, ...,

do

Compute $\alpha_{k}$ such that

$f({}{x}^{({k})}+\alpha_{{k}}{ d}^{({k})})= f({}{x}^{({k})})+\alpha_{{k}}\nabla{f}({}{x}^{({k})})^{{T}}{ d}^{({k})}$.

Set ${}{x}^{({k}+1)}={}{x}^{({k})}+\alpha_{{k}}{}{d}^{({k})}$ and $\nabla f({ x}^{(k+1)})$.

if( $\rho <\eta$ ) and (k >1),

Compute the ratio

(14)
$\rho =\dfrac{||{ x}^{(k+1)}-{ x}_{c}||}{||{ x}^{(k)}-{ x}_{c}||}$

Reset the centroid

(15)
${ x}_{c}= x^{(k+1)}$

Restart the CG algorithm by setting

(16)
${ d}^{(k+1)}= -\nabla f({ x}^{(k+1)})$.

Choose a conjugate directional parameter.

Set
(17)
${ d}^{(k+1)}= -\nabla f({ x}^{(k+1)})+\beta_{k+1}{ d}^{(k)}$.

end for

==============================================

4. Numerical Simulations

The experiments were conducted by randomly selecting initial points to examine convergence towards the local optimum point and the global optimum point in relation to the objective function. The choice of initial points was influenced by considerations of convergence speed and the potential for stagnation. This paper does not explore the selection of initial points, a critical factor in achieving convergence. Instead, it focuses on addressing the issue of stagnation that can occur with arbitrary initial points. In this context, the objective was to demonstrate the effectiveness of the proposed method, ARCG, in overcoming stagnation by assessing the speed of convergence in situations prone to stagnation. A nonlinear function is used as the objective function for variables and ${x}_{1}$, ${x}_{2}$ represented by the equation $f({x}_{1,\: }{x}_{2})= 0.5{x}_{1}^{4}-{x}_{1}^{3}-8{x}_{1}^{2}+ 2.5{x}_{1}+ 0.5{x}_{2}^{4}-{x}_{2}^{3}- 8{x}_{2}^{2}+ 2.5{x}_{2}- 10{x}_{1}{x}_{2}$. The shape of the objective function is depicted in Figure 1.

Fig. 1. An objective function of a nonquadratic nonlinear form: \begin{align*}

../../Resources/kiee/KIEE.2024.73.8.1437/fig1.png

$f({x}_{1,\:}{x}_{2})=0.5{x}_{1}^{4}-{x}_{1}^{3}-8{x}_{1}^{2}+2.5{x}_{1}+0.5{x}_{2}^{4}\\-{x}_{2}^{3}-8{x}_{2}^{2}+2.5{x}_{2}-10{x}_{1}{x}_{2}$

Fig 2 depicts the track of directional vectors on the contour map of the objective function, which is of a nonquadratic nonlinear form, as shown in Fig 1. Observing the behavior of convergence using the CG method for the initial point (-0.5, -0.9) where f (-0.5, -0.9) = -15, it becomes evident that the convergent directions deteriorate as the direction vectors approach the optimal point. This phenomenon, known as convergence stagnation, is characterized by the directional vector sequence converging very slowly to the optimal point, forming a spiraling shape as illustrated in Fig 2 through our experiments. All numerical experiments are established with allowed maximum error 1.0e-9.

Fig. 2. The track of directional vectors of the contour map of the CG Method with the initial point (-0.5, -0.9) is presented. The minimum value (–277) of the objective function is found at the optimum point (4.39390103568654, 4.39390103568654) at the 78th directional vector for the allowed error specification 1.0e-9.

../../Resources/kiee/KIEE.2024.73.8.1437/fig2.png

For the initial point (-0.5, -0.9), where f (-0.5, -0.9) is 15, the directional vector sequence of CG method converges to the global minimum point (4.39390103568654, 4.39390103568654) with an error of 8.881267727026257e-10 at 78 iterations, reaching a minimum value of –277 with an allowable error less than 1.0e-9. However, the iterations reveal a deterioration in the convergence process, leading to stagnation, as indicated in Fig 2. As it is well known, this stagnation is a challenge that the conventional nonconvex CG method needs to overcome. Fig 3 illustrates the convergence of the directional vectors of the ARCG method starting from the initial point (-0.5, -0.9) with an allowable error which is less than 1.0e-9. The objective function reaches its minimum value (-277) at the optimal point (4.39390103568654, 4.39390103568654) after only 9 iterations. Table 1 presents the sequence of converging directional vectors. The proposed ARCG method demonstrates faster convergence with fewer iterations compared to the CG method shown in Fig 2. Restarting occurs at the second, fourth, and sixth iterations, effectively reducing the overall number of iterations, as depicted in Figure 3 and Table 1.

Fig. 3. Convergence of the ARCG Method to the minimum point (4.39390103568654, 4.39390103568654) with the initial point (-0.5, -0.9) takes 10 iterations. Restarting occurs at the second, fourth, and sixth iterations.

../../Resources/kiee/KIEE.2024.73.8.1437/fig3.png

Table 1 Convergence of the ARCG Method with Initial Point (-0.5, -0.9). NI: Number of Iterations.

NI

Optimal Point $({x}_{1},\: {x}_{2})$

Restart

0: Off

1: On

1

(-0.5, -0.9)

0

2

(4.63042015891167, 4.09508799471984)

1

3

(4.37292930860144, 4.07464340741611)

0

4

(4.34834980311952, 4.38421178718921)

1

5

(4.39379845522371, 4.3949020707615)

0

6

(4.39402354706745, 4.39394511570709)

1

7

(4.39390103254279, 4.39390104905889)

0

8

(4.39390103711054, 4.39390103635959)

0

9

(4.39390103568654, 4.39390103568654)

0

Fig 4 depicts the convergence of the CG method, starting from the initial point (5, -5) with f (5, −5) = 474. The objective function reaches its minimum value -277 at the optimal point (4.39390103439962, 4.39390103477718) at the 109th iteration, with an error 8.194639732233808e-10.

Fig. 4. The error that occurred at the 122th iteration is 8.194639732233808e-10, where an initial objective function value of 475 is given at the starting point (5, -5). The global convergent point reached is (4.39390103439962, 4.39390103477718) with a minimum value of –277, which is denoted as (4.39391, 4.393901) in Fig.4 as approximation

../../Resources/kiee/KIEE.2024.73.8.1437/fig4.png

Figure 5 illustrates a scenario in which the number of iterations decreases from 109, as obtained by the Conjugate Gradient (CG) method, to 10 with the proposed Adaptive Residual Conjugate Gradient (ARCG) method, starting from the initial point (5, -5) and achieving an error of . The value of objective function at the initial point is 475, and the global minimum point (4.39390103568654, 4.39390103568654) is reached at the 10th iteration, achieving a minimum value of -277. The convergence directional vector sequence obtained from the 10 iterations with three restarts. While the conventional CG method without restarting requires 109 iterations, the proposed ARCG achieves the same optimal point with 10 iterations under the same conditions.

Fig. 5. The experiments with ARCG yield an error of 2.141417408469820e-10 for the initial point (5, -5) with an objective function value of 475. The minimum value reached at -277 after 11 iterations at the location (4.39390103568654, 4.39390103568654), which is denoted as (4.393901, 4.393901) in Figure.

../../Resources/kiee/KIEE.2024.73.8.1437/fig5.png

For a given initial point (-4, -6) with an objective function value of 375, the convergence to the optimal minimum point (-2.98907601341104, -2.98907601325043) is achieved after 11 iterations in Fig 6. Both CG and ARCG yield the same results without any restarting process described in Table 2. In this case, no restart is needed to reach the minimum, indicating that the initial point converges to a local minimum instead of the global minimum. This suggests that finding global minima depends on choosing a suitable initial point, a topic worthy of further discussion in future studies.

Fig. 6. For a given initial point (-4, -6) where f(x) is 375, the local optimal minimum point (-2.98907601341104, -2.98907601325043) with a value of -114 is reached after 11 iterations. Both the nonconvex CG and ARCG show no restarting for the initial point as shown in Table 2.

../../Resources/kiee/KIEE.2024.73.8.1437/fig6.png

Fig 7 illustrates that the minimum value of –277 at (4.39390104015781, 4.39390104218252) is reached with an error of 9.901277842301936e-10 after the 1687th iteration for the given initial point (-4, 6) where f (−4, 6) = 453. As observed in the track of directional vectors of the nonconvex CG in Fig 7, the phenomenon of stagnation incurs significant computational overload, posing a serious problem in optimization. Experimental results indicate that the convergence process is stable but not sufficiently fast, taking time to spiral down to the minimum point slowly with 1687 iterations with the allowed error 1.0e-09.

Fig. 7. The obtained error is 9.901277842301936e-10 for the initial point (-4,6) with an objective function value of 453. After 1687 iterations, the nonconvex CG method converges to (4.39390104015781, 4.39390104218252) with the minimum value of –277.

../../Resources/kiee/KIEE.2024.73.8.1437/fig7.png

In Fig 8, we depict the track of directional vectors, which is obtained using the ARCG method, for the given initial point (-4, 6) where f (−4, 6) = 453. The sequence of directional vectors of this case converges to the optimal point (4.39390103568864, 4.39390103572041) with the minimum value of –277 with the 29 directional vectors. The track of directional vectors demonstrates that the proposed ARCG method effectively eliminates the stagnation of the nonconvex CG process. The convergence process of ARCG is stable and fast enough to find the optimal point without convergence deterioration compared to the nonconvex CG method. The restarting occurs only once at the 20th directional vector in ARCG, reducing the number of iterations from 1687 to 29. The ARCG method employs a restarting only at the 20th iteration.

Fig. 8. The ARCG method reaches the minima –277 at (4.39390103568864, 4.39390103572041) after 28 iterations with the initial point (-4, 6), where the function value f (-4, 6) is 453.

../../Resources/kiee/KIEE.2024.73.8.1437/fig8.png

In the previously conducted experiments, we established the criterion for detecting stagnation as the moment when the variable ρ, as defined in equation (15), drops below a threshold of 1.0. We then investigated the effects that arise when the established threshold is either below or above 1.0. For the nonquadratic nonlinear objective function, $f({x}_{1,\:}{x}_{2})=0.5{x}_{1}^{4}-{x}_{1}^{3}-8{x}_{1}^{2}+2.5{x}_{1}+0.5{x}_{2}^{4}-{x}_{2}^{3}-8{x}_{2}^{2}+2.5{x}_{2}-10{x}_{1}{x}_{2}$, we set the threshold η to be 1.0 for the initial points, such as (-0.5, -0.9), (5, -5), (-4, -6), (-4, 6), (4, -6), and (0, 0) cases. The number of directional vectors used for the CG, ARCG, and SD methods with the initial points is presented in Table 2 where the ratio in Equation (15) is set to be less than or equal to 1.0. As shown in Table 2, the objective function has two minimum values, -277 and -3, at the points (4.3939010, 4.3939010) and (-2.9890760, -2.9890760) respectively, which is approximated with seven digits after decimal point. The initial point (-4, -6) converges to the local minimum (-2.9890760, -2.9890760), while the other initial points such as (-0.5, -0.9), (5, -5), (-4, 6), (4, -6), and (0, 0) cases converge to the global minimum point (4.3939010, 4.3939010). Essentially, the initial points tend to converge to an optimal point located near them respectively.

However, the convergence to a local minimum point (-2.989076, -2.9890760) or to the global minimum point (4.3939010, 4.3939010) depends on the choice of the initial point in general, which is known as the initial value problem in optimization. This implies that the convergence to the local or global minimum entirely depends on the surface dynamics of the objective functions in optimization, necessitating further studies.

Table 2 The number of direction vectors for CG and ARCG for the initial points when the threshold is set to η=1.0. The initial point (-4, -6) converges to the local minimum (-2.298907, -2.9890760), while the other five initial points converge to the global minimum point (4.3939010, 4.3939010).

Objective Function

\begin{align*} f({x}_{1,\:}{x}_{2})=0.5{x}_{1}^{4}-{x}_{1}^{3}-8{x}_{1}^{2}+2.5{x}_{1}+0.5{x}_{2}^{4}\\ -{x}_{2}^{3}-8{x}_{2}^{2}+2.5{x}_{2}-10{x}_{1}{x}_{2} \end{align*}$\rho\le\eta ,\: \eta =1.0$

Methods

CG

ARCG

SD

Case 1

Iterations

78

10

8

Converged ${x}_{1}^{*}$

4.39390103449538

4.39390103568654

4.39390103568655

Converged ${x}_{2}^{*}$

4.3939010352255

4.39390103568654

4.39390103568652

Error

8.881267727026257e-10

8.881784197001252e-16

7.017681150351688e-12

Initial Point

${x}_{1}=-0.5$, ${x}_{2}=-0.9$ $f({x}_{1},\: {x}_{2})= -15$

Converged Point value

$f({x}_{1}^{*},\: {x}_{2}^{*})= -277$

Case 2

Iterations

122

11

12

Converged ${x}_{1}^{*}$

4.39390103439962

4.39390103568654

4.39390103567406

Converged ${x}_{2}^{*}$

4.39390103477718

4.39390103568654

4.39390103567906

Error

8.194639732233808e-10

2.141417408469820e-10

1.586372380921446e-10

Initial

Point

${x}_{1}= 5$, ${x}_{2}=-5$ $f({x}_{1},\: {x}_{2})= 475$

Converged Point value

$f({x}_{1}^{*},\: {x}_{2}^{*})= -277$

Case 3

Iterations

11

11

15

Converged ${x}_{1}^{*}$

-2.98907601312605

-2.98907601312605

-2.98907601315108

Converged ${x}_{2}^{*}$

-2.98907601314108

-2.98907601314108

-2.989076013204

Error

4.109032155931346e-10

3.052403315124787e-10

4.109032155931346e-10

Initial

Point

${x}_{1}= -4$, ${x}_{2}=-6$ $f({x}_{1},\: {x}_{2})= 375$

Converged Point value

$f({x}_{1}^{*},\: {x}_{2}^{*})= -114$

Case 4

Iterations

1495

29

13

Converged ${x}_{1}^{*}$

4.39390104015781

4.39390103569077

4.3939010356884

Converged ${x}_{2}^{*}$

4.39390104218252

4.39390103568738

4.39390103568305

Error

9.901277842301936e-10

3.309872148302396e-11

7.428821742253838e-11

Initial

Point

${x}_{1}= -4$, ${x}_{2}= 6$ $f({x}_{1},\: {x}_{2})= 475$

Converged Point value

$f({x}_{1}^{*},\: {x}_{2}^{*})=-277$

Case 5

Iterations

15

10

14

Converged ${x}_{1}^{*}$

4.39390103563522

4.39390103568654

4.39390103565784

Converged ${x}_{2}^{*}$

4.39390103563426

4.39390103568654

4.39390103567814

Error

2.163227819026375e-10

8.881784197001252e-16

2.163227819026375e-10

Initial

Point

${x}_{1}= 4$, ${x}_{2}=-6$ $f({x}_{1},\: {x}_{2})= 747$

$f({x}_{1}^{*},\: {x}_{2}^{*})= -277$

Case 6

Iterations

1

1

3

Converged ${x}_{1}^{*}$

4.39390103568654

4.39390103568654

4.39390103568654

Converged ${x}_{2}^{*}$

4.39390103568654

4.39390103568654

4.39390103568654

Error

0.0

0.0

0.0

Initial

Point

${x}_{1}= 0$, ${x}_{2}=0$ $f({x}_{1},\: {x}_{2})= 0$

Converged Point value

$f({x}_{1}^{*},\: {x}_{2}^{*})= -277$

However, the convergence to a local minimum point (-2.989076, -2.9890760) or to the global minimum point (4.3939010, 4.3939010) depends on the choice of the initial point in general, which is known as the initial value problem in optimization. This implies that the convergence to the local or global minimum entirely depends on the surface dynamics of the objective functions in optimization, necessitating further studies. In experiments, the case of the initial point (-4, 6) using CG exhibits much worse convergence stagnation compared to the other five initial positions, as shown in Table 2. It requires 1495 iterations to satisfy the error requirement, where the error needs to be less than $1.0\times 10^{-9}$. As depicted in Table 2, the initial points located near the minimum point in a nonconvex region with respect to an optimal point tend to converge fast to the optimal points, as observed with the initial points (0, 0) and (4, -6).

However, for the initial points (-0.5, -0.9), (5, 5), and (-4, 6), the CG method generates stagnation, resulting in slow convergence. This phenomenon appears to reflect the shape of the surface of the objective function, indicating that the surface structure is somewhat far from being nonconvex with respect to the optimal points. The initial point (-4, -6) converges to a local minimum, while the other five initial points converge to the global optimal point.

In this work, we proposed a new restarting scheme, the Adaptive Restarting Conjugate Gradient (ARCG) method, to effectively remove stagnation in CG, significantly reducing the convergence time when the initial value in CG deteriorates due to convergence stagnation. However, for a quadratic nonlinear objective function , both the CG and ARCG methods generally outperform the SD method in terms of convergence speed. For example, as demonstrated in Table 3, convergence with the CG method is significantly faster than with the SD method. Furthermore, Table 3 shows that the ARCG method achieves effective convergence at a rate comparable to that of the CG method.

Table 3 The number of iterations of a convex objective function for CG, SD and ARCG with the threshold and the permitted convergence error $\epsilon_{s}=1.0\times 10^{-9}$, which converges to (2, -1) with f (2, -1) = -3.

Objective Function

$f({x}_{1,\:}{x}_{2})={x}_{1}^{2}+{x}_{2}^{2}+{x}_{1}{x}_{2}-3{x}_{1}$

,

$\rho\le\eta ,\: \eta =1.0$, $\epsilon_{s}=1.0\times 10^{-9}$

Initial Point

CG

SD

ARCG

Converging Point

(0,0)

2

30

2

(2, -1)

(4,0)

2

30

2

(2, -1)

(4,4)

2

13

2

(2, -1)

(-4, -4)

2

11

2

(2, -1)

The threshold $\eta$ serves as a reference for comparing the ratios between two successive reference vectors, as described in Equations (14) and (15). The CG method restart is initiated only if the ratio falls below this threshold. The experiment demonstrates that the proposed method, the ARCG method, outperforms the conventional CG method in the convergence process. By eliminating or weakening the dependence on previously found conjugate directional vectors when applying the steepest gradient method as a restart, the proposed method addresses the issue of deterioration. This improvement in the convergence process mitigates the tendency to meander around the optimal point caused by the effect of conjugates, which is a serious drawback of CG.

5. Conclusion

The issue of stagnation in the CG methods, especially when tackling nonquadratic objective functions, is widely recognized and poses a significant challenge to the convergence efficiency of these methods. This problem, fundamentally linked to the methods’s reliance on conjugation between directional vectors, can severely impede progress towards an optimal solution. The ARCG method introduces a novel solution to enhance convergence by systematically addressing the root causes of stagnation observed in traditional CG iterations.

One of the hallmark features of the ARCG method is its innovative use of a restarting mechanism which plays a pivotical role in steering the convergence process more swiftly towards the optimal minimum. Unlike backtracking methods, which primarly focus on step size adjustments, the ARCG employs advanced vector properties to sensitively detect and adapt to convergence variations. This is achived through a dynamic restarting algorithm that adjusts to the changing curvatures encountered on the surface of nonlinear objective functions, showcasing a nuanced understanding of the problem space.

A particularly distinctive aspect of the ARCG method is its strategic focus on the cumulative trajectory of conjugate directional vectors. This approach, diverging from traditional methods that may only consider the magnitude of directional vectors, offers a more holistic and potentially more effective strategy for overcoming stagnation. Such a comprehensive view on navigating through the optimization process underscores the method’s innovative approach to stagnation detection and avoidance, contributing to its superiority.

The importance of selecting appropriate initial values is also highlighted, suggesting a promising area for future research in nonlinear optimization. The ARCG method stands out not only for its technical innovations but also for its practical benefits, including the potential for significant time and resourxe savings. This advantage positions the ARCG method as a valuable too for a wide range of applications across various fields.

References

1 
Magnus R. Hestenes, Eduard Stiefel, “Methods of Conjugate Gradients for Solving Linear Systems,” Journal of Research of the National Bureau of Standards, vol. 49, no. 6, pp. 409-436, December 1952.URL
2 
R. Fletcher, and C. M. Reeves, “Function minimization by conjugate gradients,” Computer. J. vol. 7, no. 2, pp. 149-154, 1964.URL
3 
Boris T. Polyak, “The Conjugate Gradient method in extreme problem,” pp. 94-112, Nov. 1969.URL
4 
E. Polak, G. Ribiére, “Note sur la convergence de méthodes de directions conjugués,” Revnue franaise d’informatique et de recherche opérationnelle, série rouge, tome 3, no. 1, pp. 35-43, 1969.URL
5 
M. J. D. Powell, “Restart Procedures for the Conjugate Gradient Method,” Mathematical Programming, vol. 12, pp. 241-254, 1977.URL
6 
R. Fletcher, Practical Methods of Optimization, vol. 1: Unconstrained Optimization, John Wiley \& Sons, New York, pp. 80-87, 1987.URL
7 
Borris T. Polyak, Introduction to Optimization, Optimization Software, Inc., Publications Division, pp. 95-116, 1987.URL
8 
Gene H. Golub, Dianne P. O’Leary, “Some history of the Conjugate Gradient and Lanczos Algorithms: 1948-1976,” SIAM Review vol. 31, no. 1, pp. 50-102, March 1989.DOI
9 
D. Touati-Ahmed, and C. Storey, “Efficient Hybrid Conjugate Techniques,” J. of Optimization Theory and Applications, vol. 64, no. 2, Feb. 1990.URL
10 
Edwin K. O. Chong, Stanislaw H. Zak, An Introduction to Optimization, Wiley, pp. 151-186, 1996.URL
11 
William W. Hager, Hongchao Zhang, “A New Conjugate Gradient Method with Guaranteed Descent and an Efficient Line Search,” SIAM J. OPTIM., vol. 16, no. 1, pp. 170-192, 2005.DOI
12 
William W. Hager, Hongchao Zhang, “A Survey of Nonlinear Conjugate Gradient Method,” Pacific Journal of Optimization, January 2006.URL
13 
Gonglin Yuan, Xiwen Lu. “A Modified PRP conjugate gradient method,” Ann Oper Res, vol. 166, pp. 73-90, 2009. DOI 10.1007/s10479 –008-0420-4,URL
14 
Gonglin Yuan, Xiwen Lu, Zhan Wang, “The PRP conjugate gradient algorithm with a modified WWP line search and its application in the image restoration,” Applied Numerical Mathematics, vol. 152, pp. 1-11, 2020.DOI
15 
Neculai Andrei, Nonlinear Conjugate Gradient Methods for Unconstrained Optimization, vol. 158, pp. 67-214, ISBN: 978-3-030-42949-2, Springer, 2020.URL
16 
Reme Chan-Renous-Legoubin, Clement W. Royer, “A Nonlinear conjugate gradient method with complexity guarantees and its application to non-convex regression,” EURO J. Comput. Optim. 10, 100043, 2022.URL
17 
Awad Abdelrahman, Morgataba Mohammed, Osman O., Murtada K., “A Nonlinear conjugate gradient Coefficients with Exact and Strong Wolfe Line Searches Techniques,” Journal of Mathematics, vol. 2022. no. 1, 2022. http://doi.org/10.1155/2022/1383129.DOI
18 
C. Cartis, N. I. M. Gould, and Ph. L. Toint, “Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization,” Mathematical Programming, vol. 163, no. 1, pp. 359-368, 2017DOI
19 
Cartis, C., Sampaio, Ph. R., and Toint, Ph. L., “Worst-case evaluation complexity of non-monotone gradient-related algorithms for unconstrained optimization.” Optimization, vol. 64, no. 5, pp. 1349–1361, 2014. https://doi.org/10.1080/02331934.2013.869809.DOI

저자소개

김성수 (Sung-Soo Kim)
../../Resources/kiee/KIEE.2024.73.8.1437/au1.png

He received his B.S. from Chungbuk National University February 1983, M.S. from University of Arkansas at Fayetteville August 1989, and Ph.D. from University of Central Florida December 1997 in the area of Electrical engineering. He was a associate professor in the Department of Electrical Engineering, Woosuk University from March 1999 to August 2001. Since September 2001 he has been with the Department of Electrical Engineering, Chungbuk National University, Korea. His research interests include artificial intelligence, optimization, multidimensional statistical signal processing, and nonlinear system analysis.