
Lecture Notes for MBE 2036 Jia Pan

7. One Dimensional
Unconstrained Optimization
Definition of Optimization

Given a function $f(x)$, optimization is the process of finding $x$ to minimize or maximize the function.
The function $f(x)$ is called the objective function or cost function.
The point $x^*$ is called a minimizer (denoted as $x^* = \arg\min_x f(x)$) or a maximizer (denoted as $x^* = \arg\max_x f(x)$) of the objective function.
The value $f(x^*)$ is called a minimum or maximum of the objective function, i.e., $f(x^*) = \min_x f(x)$ or $f(x^*) = \max_x f(x)$. Or we can simply say $f(x^*)$ is an optimum of the objective function.

Global and Local Optimum


The point $x^*$ is called a global minimizer of the objective function if its function value is smaller than or equal to any other function value, i.e., $\forall x,\; f(x^*) \le f(x)$. Similarly, the point $x^*$ is called a global maximizer if its function value is larger than or equal to any other function value, i.e., $\forall x,\; f(x^*) \ge f(x)$.
The point $x^*$ is called a local minimizer of the objective function if its function value is smaller than or equal to the function values of the points in its neighborhood, i.e., $\forall x \in N(x^*),\; f(x^*) \le f(x)$, where $N(x^*)$ is a local neighborhood around $x^*$. Similarly, the point $x^*$ is called a local maximizer if $\forall x \in N(x^*),\; f(x^*) \ge f(x)$.
Note: a global minimizer/maximizer is always a local minimizer/maximizer; but the converse is NOT always true.

Figure 1 illustrates the difference between the global and local optimum.

Figure 1: Global and local optimum

Note: the minimizer of $f(x)$ must be the maximizer of $-f(x)$, and vice versa.
For functions with two or more variables, e.g., $f(x, y)$, the optimization becomes a high-dimensional optimization. Figure 2 shows the graph of a 2D function $f(x, y)$, where the z-axis is the function value. We can observe that it also has global/local minimizers.

Figure 2: 2D optimization problem


Convex functions
From Note 1, we know that a local optimum may not be a global optimum. However, for a special type of functions called convex functions, a local optimum must be a global optimum.
Figure 3 shows a convex 1D function. Intuitively, a function is convex if, connecting any two points on the curve, the segment always lies above the curve. More formally,
$\forall x_1, x_2,\; \forall \lambda \in [0,1]:\quad f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$.
This property is called the convexity property.
If $-f(x)$ is a convex function, then $f(x)$ is called a concave function.

Figure 3: A Convex 1D function
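To make the inequality concrete, the following is a minimal numerical sanity check of the convexity property (a Python sketch; the helper name is our own). It samples random pairs $x_1, x_2$ and random $\lambda \in [0,1]$ and tests the inequality; passing all samples is evidence of convexity, not a proof.

```python
import numpy as np

def convex_on_samples(f, lo, hi, trials=10000, seed=0):
    """Randomly test f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(lo, hi, trials)
    x2 = rng.uniform(lo, hi, trials)
    lam = rng.uniform(0.0, 1.0, trials)
    lhs = f(lam * x1 + (1.0 - lam) * x2)
    rhs = lam * f(x1) + (1.0 - lam) * f(x2)
    return bool(np.all(lhs <= rhs + 1e-12))   # small tolerance for round-off

print(convex_on_samples(lambda x: x**2, -5.0, 5.0))   # True: x^2 is convex
print(convex_on_samples(np.sin, -5.0, 5.0))           # False: sin is not
```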

We can also define the convexity property for high-dimensional functions. For instance, for a 2D convex function, we must have
$\forall (x_1, y_1), (x_2, y_2),\; \forall \lambda \in [0,1]:\quad f(\lambda (x_1, y_1) + (1-\lambda)(x_2, y_2)) \le \lambda f((x_1, y_1)) + (1-\lambda) f((x_2, y_2))$.
A 2D convex function is shown in Figure 4.


Figure 4: A 2D convex function

Optimal conditions for differentiable functions


Given a differentiable function $f(x)$, a necessary condition for a point $x^*$ to be an optimizer is $f'(x^*) = 0$. Such points are called critical points. If $f''(x^*) > 0$, then the point is a local minimizer; if $f''(x^*) < 0$, then the point is a local maximizer; if $f''(x^*) = 0$, we cannot make a decision (and need more knowledge about the function's higher-order derivatives).
To find all the optimizers of a function $f$, we first need to solve the root-finding problem $f'(x) = 0$ to obtain all the critical points $\{x_c\}$. Next, we need to evaluate $f''(x_c)$ to determine whether each critical point is a local maximizer or minimizer.
However, the root-finding problem is difficult and cannot always be solved easily. In addition, there are many functions that are not differentiable. Thus, we must develop numerical methods that can work in most cases.
Note: A brief proof of the above conclusion. For any $x$ locally around $x^*$, Taylor's formula tells us: $f(x) = f(x^*) + f'(x^*)(x - x^*) + \frac{1}{2} f''(x^*)(x - x^*)^2 + O((x - x^*)^3)$. Thus if $f'(x^*) = 0$, then $f(x) = f(x^*) + \frac{1}{2} f''(x^*)(x - x^*)^2 + O((x - x^*)^3)$. If $f''(x^*) > 0$, then $f(x) > f(x^*)$; if $f''(x^*) < 0$, then $f(x) < f(x^*)$; if $f''(x^*) = 0$, then we need more information.
Note: Based on $f'(x^*) = 0$ and $f''(x^*) > 0$, we can only say $x^*$ is a local minimizer. It is possible that $x^*$ is not a global minimizer, unless we perform some further checks (e.g., checking whether $f$ is a convex function).

Note: Suppose the function is $f(x) = x^3 - 3x$. It is easy to show that $x = 1$ and $x = -1$ are two critical points. $x = 1$ is a local minimizer, $x = -1$ is a local maximizer; but neither of them is a global minimizer or maximizer.
Note: Suppose the function is $f(x) = x^3$. At $x = 0$ there is $f'(0) = 0$, $f''(0) = 0$, and thus we cannot tell anything about $x = 0$. Actually $x = 0$ is neither a local minimizer nor a local maximizer.
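The two notes above can be checked mechanically. Below is a small sketch using the sympy symbolic library (assuming it is available) to find the critical points of $f(x) = x^3 - 3x$ and classify them with the second-derivative test:

```python
import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x                         # the example function above

fp = sp.diff(f, x)                     # f'(x) = 3x^2 - 3
fpp = sp.diff(f, x, 2)                 # f''(x) = 6x
for c in sp.solve(sp.Eq(fp, 0), x):    # critical points: -1 and 1
    s = fpp.subs(x, c)
    kind = 'local min' if s > 0 else ('local max' if s < 0 else 'inconclusive')
    print(c, kind)                     # -1 local max, 1 local min
```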

Golden Ratio
Before discussing a numerical method for optimization, we first deviate a bit from the optimization topic and discuss the Golden ratio.
In mathematics, two quantities are in the golden ratio if their ratio is the same as the ratio of their sum to the larger of the two quantities. Figure 5 illustrates the geometric relationship. Expressed algebraically, for quantities $a$ and $b$ with $a > b > 0$, $\frac{a+b}{a} = \frac{a}{b} = \varphi$, where $\varphi$ is the golden ratio. Then we have $\varphi = 1 + \frac{1}{\varphi}$, and thus $\varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618$ and $\frac{1}{\varphi} = \frac{\sqrt{5} - 1}{2} \approx 0.618$.

Figure 5: Geometric Relationship of Golden ratio. The point A is called the Golden Ratio Point of the
segment.

From $\frac{a+b}{a} = \frac{a}{b}$, we can obtain $a^2 = b(a+b) = ab + b^2$, which implies $\frac{b}{a} = \frac{a}{a+b}$.
As a result, there is the important relationship $\frac{b}{a} = \frac{a}{a+b} = \frac{1}{\varphi} \approx 0.618$.
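A quick numerical check of this relationship (plain Python; the variable names are our own):

```python
import math

phi = (1 + math.sqrt(5)) / 2              # golden ratio, ~1.618
a, b = 1.0, 1.0 / phi                     # pick a = 1 so that a/b = phi
print(b / a, a / (a + b))                 # both ~0.618
print(math.isclose(b / a, a / (a + b)))   # True: b/a = a/(a+b)
```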

In Figure 5, the point $A$ splits the segment into two sub-segments with lengths $a$ and $b$ satisfying the Golden ratio relationship; the point $A$ is called the Golden ratio point of the segment. By symmetry, there are two Golden ratio points on a segment $AB$: in Figure 6, we denote them as point $C$ and point $D$ respectively. In addition, we note one important and interesting property: the point $C$, which is a Golden ratio point of the segment $AB$, is also a Golden ratio point of the sub-segment $BD$. This is due to the relationship $\frac{b}{a} = \frac{a}{a+b}$. Similarly, the point $D$, which is a Golden ratio point of the segment $AB$, is also a Golden ratio point of the sub-segment $AC$. We will see that this property plays an important role in the numerical algorithm discussed below.


Figure 6: Two Golden ratio points C and D on the segment AB. Note that the point C is also the Golden ratio point of the sub-segment BD. Similarly, the point D is also the Golden ratio point of the sub-segment AC.

Iterative search
Iterative search is a technique for finding the minimum or maximum of a strictly unimodal function by successively narrowing the range of values inside which the minimum or maximum is known to exist. Unlike root finding, where two function evaluations with opposite signs are sufficient to bracket a root, when searching for a minimum or maximum, three values are necessary.
Definition: A function $f(x)$ is a unimodal function if it belongs to one of the following two cases. Case 1: for some value $m$, $f(x)$ is monotonically increasing for $x \le m$ and monotonically decreasing for $x \ge m$; Case 2: for some value $m$, $f(x)$ is monotonically decreasing for $x \le m$ and monotonically increasing for $x \ge m$. In Case 1, the maximum value of $f(x)$ is $f(m)$ and there are no other local maxima; in Case 2, the minimum value of $f(x)$ is $f(m)$ and there are no other local minima. As a result, we call Case 1 the unimodal maximum case, and Case 2 the unimodal minimum case. A function that is not unimodal is called a multimodal function. Figure 7 shows the difference between unimodal and multimodal functions.

Figure 7: Comparison of unimodal and multimodal functions

Note: a convex/concave function must be unimodal. But a unimodal function may not be convex or concave, as shown by the two examples in Figure 8.

Figure 8: Two examples of unimodal functions that are not convex or concave

Next, we can look into the details of the iterative search. Let's first consider the unimodal maximum case (i.e., Case 1). Suppose we want to find the maximum of a unimodal function within the interval $[x_l^{(0)}, x_u^{(0)}]$; how can we successively narrow this search range?
One solution is as follows: first, we choose two points $x_1^{(0)}$ and $x_2^{(0)}$ inside the segment $[x_l^{(0)}, x_u^{(0)}]$ with $x_1^{(0)} < x_2^{(0)}$. These two points divide the segment $[x_l^{(0)}, x_u^{(0)}]$ into three sub-segments I, II, and III, as shown in Figure 9.


Then we evaluate the function values $f(x_1^{(0)})$ and $f(x_2^{(0)})$ at the points $x_1^{(0)}$ and $x_2^{(0)}$. According to the relative magnitude of these two values, there are two different cases.
Case 1 is when $f(x_1^{(0)}) \le f(x_2^{(0)})$, as shown in Figure 9. In which regions among I, II and III can the maximizer $x^*$ of a unimodal function appear?

Figure 9: Case 1 for a unimodal maximum function, where $f(x_1^{(0)}) \le f(x_2^{(0)})$.

Can the maximizer lie in regions II and III? Yes, it can; two sample unimodal functions are shown in the first two sub-figures of Figure 10, corresponding to the situations where the maximizer lies in region II and region III respectively.
Can the maximizer lie in region I? No, it is impossible; otherwise the function could not be unimodal, as shown in the last sub-figure of Figure 10. Why? Because $x_1^{(0)} < x_2^{(0)}$ but $f(x_1^{(0)}) \le f(x_2^{(0)})$, and thus to the right of $x_1^{(0)}$ the function is not monotonically decreasing. This contradicts the property of the unimodal maximum function and thus is not possible.


Figure 10: How the function would look if its maximizer were located in region II, III, or I, respectively.

Since the maximizer can only lie in regions II and III, we can narrow the search in the next iteration. In particular, in the next iteration, the lower bound should be updated from $x_l^{(0)}$ to $x_l^{(1)} = x_1^{(0)}$, the upper bound is unchanged, i.e., $x_u^{(1)} = x_u^{(0)}$; and we need to generate two new points $x_1^{(1)}$ and $x_2^{(1)}$ within the interval $[x_l^{(1)}, x_u^{(1)}]$.
Case 2 is when $f(x_1^{(0)}) \ge f(x_2^{(0)})$, and the discussion is similar. We can conclude that the maximizer can only lie in regions I and II. As a result, in the next iteration, the lower bound is unchanged, i.e., $x_l^{(1)} = x_l^{(0)}$, and the upper bound should be updated from $x_u^{(0)}$ to $x_u^{(1)} = x_2^{(0)}$. We also need to generate two new points $x_1^{(1)}$ and $x_2^{(1)}$ within the interval $[x_l^{(1)}, x_u^{(1)}]$.
Okay, we have finished discussing the main part of the algorithm for the unimodal maximum function, which can be briefly summarized as follows; a code sketch is given below. We initialize with four points $x_l^{(0)} < x_1^{(0)} < x_2^{(0)} < x_u^{(0)}$ and then go into iterations. In the $i$-th step, we first evaluate the values of $f(x_1^{(i)})$ and $f(x_2^{(i)})$. If $f(x_1^{(i)}) \le f(x_2^{(i)})$ (case 1), we perform the update:
$x_l^{(i+1)} \leftarrow x_1^{(i)}$ and $x_u^{(i+1)} \leftarrow x_u^{(i)}$;
if $f(x_1^{(i)}) \ge f(x_2^{(i)})$ (case 2), we perform the update:
$x_l^{(i+1)} \leftarrow x_l^{(i)}$ and $x_u^{(i+1)} \leftarrow x_2^{(i)}$.
Then we generate two points $x_1^{(i+1)} < x_2^{(i+1)}$ in the narrowed segment $[x_l^{(i+1)}, x_u^{(i+1)}]$. With the new set of four points $x_l^{(i+1)} < x_1^{(i+1)} < x_2^{(i+1)} < x_u^{(i+1)}$, we can start the next step of the iteration.
The iteration stops when the stopping criterion is satisfied, which we will discuss later.
For the unimodal minimum function, there are some small differences from the algorithm presented above for the unimodal maximum function. The details of these differences are left as exercises. [Hint: case 1 should now become $f(x_1^{(i)}) \ge f(x_2^{(i)})$, with $x^*$ in regions II and III, and case 2 should become $f(x_1^{(i)}) \le f(x_2^{(i)})$, with $x^*$ in regions I and II.]
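The update rule above translates directly into code. The following Python sketch (the function name is ours, and the interior points are placed at 1/3 and 2/3 of the bracket purely for illustration) narrows a bracket around the maximizer of a unimodal-maximum function:

```python
def iterative_search_max(f, xl, xu, n_iter=50):
    """Narrow a bracket [xl, xu] around the maximizer of a unimodal-maximum
    function f. Each iteration evaluates two interior points and discards
    the region that cannot contain the maximizer (2*n_iter evaluations)."""
    for _ in range(n_iter):
        x1 = xl + (xu - xl) / 3.0        # interior points with x1 < x2
        x2 = xl + 2.0 * (xu - xl) / 3.0
        if f(x1) <= f(x2):               # case 1: x* cannot be in region I
            xl = x1                      # raise the lower bound
        else:                            # case 2: x* cannot be in region III
            xu = x2                      # lower the upper bound
    return 0.5 * (xl + xu)               # any point of the final bracket

# f(x) = -(x - 2)^2 is unimodal-maximum with maximizer x* = 2
print(iterative_search_max(lambda x: -(x - 2.0)**2, 0.0, 5.0))   # ~2.0
```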

Golden-Section search
In the iterative search algorithm presented above, we need to perform two function evaluations (i.e., evaluating the values of $f(x_1^{(i)})$ and $f(x_2^{(i)})$) in each step of the iteration. Suppose we run $N$ iterations; then we need to perform $2N$ function evaluations. If the function is complicated (e.g., an expression involving many transcendental terms such as $\sin$), the computation can be slow. Can we reduce the number of function evaluations that are necessary in the iterative search?
The answer is yes, we can. The key lies in the property we discussed before in Figure 6: suppose the segment $AB$ is the segment $[x_l^{(i)}, x_u^{(i)}]$, and we choose the point $x_1^{(i)}$ as the Golden ratio point $D$ and the point $x_2^{(i)}$ as the Golden ratio point $C$. Suppose we meet case 1, and the narrowed segment becomes $[x_1^{(i)}, x_u^{(i)}] = DB$. We further need to generate $x_1^{(i+1)}$ and $x_2^{(i+1)}$ as the Golden ratio points on $DB$. But according to our discussion before, the point $C$ (that is, $x_2^{(i)}$) is also a Golden ratio point of $DB$, i.e., $x_1^{(i+1)} = x_2^{(i)}$. We only need to generate one new Golden ratio point $x_2^{(i+1)}$. In this way, we only need to evaluate the function value $f(x_2^{(i+1)})$ in the next iteration, since the other function value $f(x_1^{(i+1)}) = f(x_2^{(i)})$ has already been evaluated in the $i$-th iteration. Similarly, for case 2, we have $x_2^{(i+1)} = x_1^{(i)}$, and we only need to generate a new Golden ratio point $x_1^{(i+1)}$ and evaluate its value.
By using the Golden ratio points as $x_1^{(i)}, x_2^{(i)}$, given $N$ iterations, we only need to perform one function evaluation in each iteration, except the first one, where we need to evaluate both $f(x_1^{(0)})$ and $f(x_2^{(0)})$. As a result, we only need $2 + (N - 1) = N + 1$ function evaluations in total. Compared to the general iterative search, the Golden-section search takes only about 50% of the computation, which is a great improvement.
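Below is a sketch of the Golden-section search for the unimodal maximum case (function and variable names are ours; the test function is just a sample). Note how the surviving interior point and its already-computed value are reused, so each loop pass performs exactly one new function evaluation:

```python
import math

R = (math.sqrt(5) - 1) / 2               # 1/phi ~ 0.618

def golden_section_max(f, xl, xu, n_iter=50):
    """Golden-section search for the maximizer of a unimodal-maximum f,
    using n_iter + 1 function evaluations in total."""
    x1 = xu - R * (xu - xl)              # lower golden-ratio point
    x2 = xl + R * (xu - xl)              # upper golden-ratio point
    f1, f2 = f(x1), f(x2)                # the only double evaluation
    for _ in range(n_iter - 1):
        if f1 <= f2:                     # case 1: keep [x1, xu]
            xl, x1, f1 = x1, x2, f2      # old x2 becomes the new x1
            x2 = xl + R * (xu - xl)      # one fresh point,
            f2 = f(x2)                   # one fresh evaluation
        else:                            # case 2: keep [xl, x2]
            xu, x2, f2 = x2, x1, f1      # old x1 becomes the new x2
            x1 = xu - R * (xu - xl)
            f1 = f(x1)
    return x2 if f1 <= f2 else x1        # the point the notes return

print(golden_section_max(lambda x: 2*math.sin(x) - x**2/10, 0.0, 4.0))
# ~1.4276 for this sample function
```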

Error bound and Stopping Criteria


If we stop at the $N$-th iteration, the set of four points is $x_l^{(N)} < x_1^{(N)} < x_2^{(N)} < x_u^{(N)}$. How can we estimate the error after this iteration? If we are in case 1, then $x^*$ must lie in regions II and III, and we return $x_2^{(N)}$ as the approximate solution to $x^*$. The actual $x^*$ can be anywhere in regions II and III, and thus the error bound is $|x^* - x_2^{(N)}| \le \max\left(x_2^{(N)} - x_1^{(N)},\; x_u^{(N)} - x_2^{(N)}\right)$, i.e., the larger of the lengths of intervals II and III. Similarly, if we are in case 2, we return $x_1^{(N)}$ as the approximate solution to $x^*$, and the error bound is $|x^* - x_1^{(N)}| \le \max\left(x_1^{(N)} - x_l^{(N)},\; x_2^{(N)} - x_1^{(N)}\right)$.
For the Golden ratio case, let $R = \frac{1}{\varphi} \approx 0.618$. We know that $x_1^{(N)} - x_l^{(N)} = x_u^{(N)} - x_2^{(N)} = (1 - R)(x_u^{(N)} - x_l^{(N)}) \approx 0.382\,(x_u^{(N)} - x_l^{(N)})$, and $x_2^{(N)} - x_1^{(N)} = (\sqrt{5} - 2)(x_u^{(N)} - x_l^{(N)}) \approx 0.236\,(x_u^{(N)} - x_l^{(N)})$. As a result, the error is always bounded by $(1 - R)(x_u^{(N)} - x_l^{(N)})$.
According to the iteration, each time the search interval is narrowed by the factor $R$, and thus $x_u^{(N)} - x_l^{(N)} = R^N (x_u^{(0)} - x_l^{(0)})$. As a result, the error bound when we stop at the $N$-th iteration is:
$(1 - R)(x_u^{(N)} - x_l^{(N)}) = (1 - R)\, R^N (x_u^{(0)} - x_l^{(0)})$.

Based on the above inequality, we can estimate the number of iterations required to achieve a given error bound.
Example: Suppose the relative error bound is $\varepsilon_s$, and assume we already know $x^*$. Then the relative error satisfies $\varepsilon = \frac{(1 - R)\, R^N (x_u^{(0)} - x_l^{(0)})}{|x^*|} \le \varepsilon_s$, which requires $R^N \le \frac{\varepsilon_s |x^*|}{(1 - R)(x_u^{(0)} - x_l^{(0)})}$, i.e., $N \ge \log_{1/R} \frac{(1 - R)(x_u^{(0)} - x_l^{(0)})}{\varepsilon_s |x^*|}$. In practice we don't know $x^*$ beforehand, so we usually use a weaker (more conservative) estimate, e.g., replacing $|x^*|$ with a lower bound derived from the initial interval.
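For instance, for an absolute error tolerance, the smallest sufficient $N$ follows directly from the bound $(1 - R) R^N (x_u^{(0)} - x_l^{(0)})$; a small sketch:

```python
import math

R = (math.sqrt(5) - 1) / 2                       # ~0.618

def iterations_needed(x_lo, x_hi, abs_tol):
    """Smallest N with (1 - R) * R**N * (x_hi - x_lo) <= abs_tol."""
    return math.ceil(math.log(abs_tol / ((1 - R) * (x_hi - x_lo)))
                     / math.log(R))

print(iterations_needed(0.0, 4.0, 1e-6))         # 30
```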

Note: The iterative search cannot guarantee to find the global optimum for a non-unimodal function; Figure 11 shows one example.


Figure 11: Iterative search may not find the global optimum for a non-unimodal function.

Note: No matter whether the function is unimodal or not, the iterative search can always return a local optimum solution. Why? [Hint: are all functions unimodal locally?]
Note: Will the iterative search algorithm work for non-continuous or non-differentiable functions as shown in Figure 12?

Figure 12: Two examples of non-continuous or non-differentiable functions.


Newton's method
The Newton-Raphson method is an approach for finding a root of a function, i.e., a point such that $f(x) = 0$. It is an iterative method, and at the $i$-th step there is $x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}$.
For a differentiable function, as we discussed before, the necessary condition for $x^*$ to be an optimizer is $f'(x^*) = 0$, which can be solved by the Newton-Raphson method: $x_{i+1} = x_i - \frac{f'(x_i)}{f''(x_i)}$. This is called Newton's method.
Newton's method does not require initial guesses that bracket the optimum. Like the Newton-Raphson method, depending on the nature of the function and the quality of the initial guess, this method may diverge, i.e., it may not find the answer. However, when it works, it is faster than the Golden-Section search.
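A minimal sketch of Newton's method for 1D optimization follows (the test function $f(x) = 2\sin x - x^2/10$ and the starting point are our own choice; note there is no bracketing, so a poor initial guess may diverge):

```python
import math

def newton_opt(df, d2f, x0, n_iter=20):
    """Newton's method for 1D optimization: Newton-Raphson applied to
    f'(x) = 0, i.e. x_{i+1} = x_i - f'(x_i) / f''(x_i)."""
    x = x0
    for _ in range(n_iter):
        x = x - df(x) / d2f(x)
    return x

# maximize f(x) = 2 sin x - x^2/10:
#   f'(x) = 2 cos x - x/5,  f''(x) = -2 sin x - 1/5
print(newton_opt(lambda x: 2*math.cos(x) - x/5,
                 lambda x: -2*math.sin(x) - 0.2,
                 x0=2.5))                        # ~1.4276
```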
Note: How will Newton's method work for the non-continuous or non-differentiable functions shown in Figure 12?
Note: How will Newton's method work for $f(x) = \dots$ with initial guess $x_0 = 0$?
Note: How will Newton's method work for $f(x) = \dots$ with initial guess $x_0 = 0$?


8. Multi-Dimensional
Unconstrained Optimization
Analytical method
Given a multi-dimensional differentiable function $f(\mathbf{x})$, a necessary condition for the point $\mathbf{x}^* = [x_1^*, x_2^*, \dots, x_n^*]^T$ to be a minimizer or maximizer is $\nabla f(\mathbf{x}^*) = \mathbf{0}$, where $\nabla f(\mathbf{x}^*)$ is the gradient evaluated at the point $\mathbf{x}^*$:

$\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]^T,$

which is analogous to the first derivative for a one-dimensional function. These points are also called critical points, similar to the one-dimensional case.
To determine whether $\mathbf{x}^*$ is a local maximizer or minimizer, we need to further check the definiteness of the Hessian matrix of $f$.
The Hessian matrix of $f$ evaluated at $\mathbf{x}^*$ is

$H = \left[ \frac{\partial^2 f}{\partial x_i \partial x_j} \right]_{i,j = 1,\dots,n},$

which is a symmetric matrix.


Definition: A matrix $H$ is positive definite if for every non-zero vector $\mathbf{z}$ we have $\mathbf{z}^T H \mathbf{z} > 0$; $H$ is negative definite if for every non-zero vector $\mathbf{z}$ we have $\mathbf{z}^T H \mathbf{z} < 0$.
According to the Taylor expansion, for any point $\mathbf{x}$ in the neighborhood of $\mathbf{x}^*$, there is:

$f(\mathbf{x}) = f(\mathbf{x}^*) + \nabla f(\mathbf{x}^*)^T (\mathbf{x} - \mathbf{x}^*) + \frac{1}{2} (\mathbf{x} - \mathbf{x}^*)^T H (\mathbf{x} - \mathbf{x}^*) + O(\|\mathbf{x} - \mathbf{x}^*\|^3).$

Since $\mathbf{x}^*$ is a critical point, $\nabla f(\mathbf{x}^*) = \mathbf{0}$, and then we have

$f(\mathbf{x}) = f(\mathbf{x}^*) + \frac{1}{2} (\mathbf{x} - \mathbf{x}^*)^T H (\mathbf{x} - \mathbf{x}^*) + O(\|\mathbf{x} - \mathbf{x}^*\|^3).$

Thus, if $H$ is positive definite, then for any $\mathbf{x} \ne \mathbf{x}^*$ there is $(\mathbf{x} - \mathbf{x}^*)^T H (\mathbf{x} - \mathbf{x}^*) > 0$, and thus $f(\mathbf{x}) > f(\mathbf{x}^*)$, i.e., $\mathbf{x}^*$ is a local minimizer. Similarly, if $H$ is negative definite, then for any $\mathbf{x} \ne \mathbf{x}^*$ there is $(\mathbf{x} - \mathbf{x}^*)^T H (\mathbf{x} - \mathbf{x}^*) < 0$, and thus $f(\mathbf{x}) < f(\mathbf{x}^*)$, i.e., $\mathbf{x}^*$ is a local maximizer. If $H$ is neither positive nor negative definite, then we cannot tell much about $\mathbf{x}^*$.
Intuitively, $H$ being positive definite is analogous to $f''(x) > 0$ in the one-dimensional case, and $H$ being negative definite is analogous to $f''(x) < 0$ in the one-dimensional case.

Theorem: For a function $f(x, y)$, the Hessian $H = \begin{bmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{bmatrix}$ is positive definite if and only if $|H| > 0$ and $f_{xx} > 0$; $H$ is negative definite if and only if $|H| > 0$ and $f_{xx} < 0$. If $|H| < 0$, then $(x, y)$ is called a saddle point.


Proof: Write $H = \begin{bmatrix} a & b \\ b & c \end{bmatrix}$ with $a = f_{xx}$, $b = f_{xy}$, $c = f_{yy}$. Let $\mathbf{z} = (u, v)$ be any non-zero vector; then $\mathbf{z}^T H \mathbf{z} = a u^2 + 2b uv + c v^2 > 0$ holds if and only if $a > 0$ and $(2b)^2 - 4ac < 0$. This is equivalent to saying $a > 0$ and $|H| = ac - b^2 > 0$. Similarly, $a u^2 + 2b uv + c v^2 < 0$ holds for every non-zero vector $\mathbf{z}$ if and only if $a < 0$ and $|H| > 0$.

One example of a saddle point is shown in Figure 13.

Figure 13: For the function $f(x, y) = \dots$, $(0, 0)$ is a saddle point.

Example: For the 2D function $f(x, y) = x^2 + y^2 (1 - x)^3$, discuss its optimum values.

Solution: $\nabla f(x, y) = \left[ 2x - 3y^2 (1 - x)^2,\; 2y (1 - x)^3 \right] = \mathbf{0}$, and thus the critical point is $(0, 0)$. The Hessian is
$H = \begin{bmatrix} 2 - 6y^2 (x - 1) & -6y (1 - x)^2 \\ -6y (1 - x)^2 & 2 (1 - x)^3 \end{bmatrix},$
and thus $H = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$ at $(0, 0)$. $H$ is positive definite, and thus $(0, 0)$ is a local minimizer. It is not a global minimizer, since $f(2, 3) = -5 < 0 = f(0, 0)$.
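This example is easy to verify symbolically; a sketch using sympy (assuming it is available):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + y**2 * (1 - x)**3                  # the example function

grad = [sp.diff(f, v) for v in (x, y)]
H = sp.hessian(f, (x, y))
print([g.subs({x: 0, y: 0}) for g in grad])   # [0, 0]: (0,0) is critical
print(H.subs({x: 0, y: 0}))                   # Matrix([[2, 0], [0, 2]])
print(f.subs({x: 2, y: 3}))                   # -5 < f(0,0) = 0: not global
```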

Numerical method: Gradient descent

In any multi-dimensional optimization search algorithm, there are two important components: 1) choose a search direction $\mathbf{d}_i$; 2) choose a step size $\alpha$ (i.e., how far along the chosen direction we should pursue a solution). If we start from the point $\mathbf{x}_0$, the iteration runs as: $\mathbf{x}_{i+1} = \mathbf{x}_i + \alpha \mathbf{d}_i$.
Gradient descent chooses the gradient as the search direction, i.e., $\mathbf{d}_i = \nabla f(\mathbf{x}_i)$ when maximizing (ascent), or $\mathbf{d}_i = -\nabla f(\mathbf{x}_i)$ when minimizing (descent).
For the step size, one ideal but not practical way is to choose a very small step size, so that the optimization process always keeps to the steepest direction while walking only a short distance each step. However, the disadvantage lies in the fact that we would need to perform a huge number of gradient computations, which would be very expensive.

Numerical method: Steepest ascent/descent

This method provides a better way of determining the step size. In particular, the step size $\alpha$ is chosen as the travel distance that achieves the best function value along the search direction $\mathbf{d}_i$.
For a maximization problem $\max_{\mathbf{x}} f(\mathbf{x})$, we choose the step size as

$\alpha^* = \arg\max_{\alpha} f(\mathbf{x}_i + \alpha \mathbf{d}_i);$

for a minimization problem $\min_{\mathbf{x}} f(\mathbf{x})$, we choose the step size as

$\alpha^* = \arg\min_{\alpha} f(\mathbf{x}_i + \alpha \mathbf{d}_i).$

Example: Use the steepest ascent method to find the maximum point of $f(x, y) = 2xy + 2x - x^2 - 2y^2$ with the initial guess $\mathbf{x}_0 = (x_0, y_0) = (-1, 1)$.
Solution: $\mathbf{d}_0 = \nabla f(x_0, y_0) = \begin{bmatrix} 2y + 2 - 2x \\ 2x - 4y \end{bmatrix}_{(-1,\,1)} = \begin{bmatrix} 6 \\ -6 \end{bmatrix}$.
Then $\alpha^* = \arg\max_{\alpha} f(\mathbf{x}_0 + \alpha \mathbf{d}_0)$.
$f(\mathbf{x}_0 + \alpha \mathbf{d}_0) = f(-1 + 6\alpha,\; 1 - 6\alpha) = -180\alpha^2 + 72\alpha - 7 = g(\alpha)$.
The optimal $\alpha^*$ needs to satisfy $g'(\alpha^*) = 0$, and thus $\alpha^* = 0.2$.
$\mathbf{x}_1 = \mathbf{x}_0 + \alpha^* \mathbf{d}_0 = (0.2, -0.2)$.
After several steps, we get the solution sequence shown in Figure 14.

Figure 14: Solution sequence when applying steepest ascent to maximize $f(x, y) = 2xy + 2x - x^2 - 2y^2$.
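A sketch of the steepest ascent iteration for this example follows. We borrow scipy's bounded scalar minimizer for the 1D line search purely as a convenience (an assumption; any 1D method from the previous chapter would serve just as well):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f(p):                                     # the example objective
    x, y = p
    return 2*x*y + 2*x - x**2 - 2*y**2

def grad(p):
    x, y = p
    return np.array([2*y + 2 - 2*x, 2*x - 4*y])

p = np.array([-1.0, 1.0])                     # initial guess (-1, 1)
for _ in range(30):
    d = grad(p)                               # steepest-ascent direction
    # line search: maximize g(a) = f(p + a*d) over the step size a
    a = minimize_scalar(lambda a: -f(p + a*d),
                        bounds=(0.0, 2.0), method='bounded').x
    p = p + a * d
print(p)                                      # ~(2, 1), the true maximizer
```

On the first pass the direction is $(6, -6)$ and the line search returns $\alpha^* = 0.2$, reproducing the hand computation above.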


From the above example, we can observe that successive search directions seem to be perpendicular to each other. Is this just a coincidence?
No, actually this is always true for the steepest descent (or ascent) method.
Theorem: In the steepest descent method, the descent direction $\mathbf{d}_{i+1}$ is perpendicular to $\mathbf{d}_i$, the descent direction in the last step.
Proof: Let $g(\alpha) = f(\mathbf{x}_i + \alpha \mathbf{d}_i)$. According to the selection of the steepest descent step size, there is $g'(\alpha^*) = 0$, i.e., $\nabla f(\mathbf{x}_i + \alpha^* \mathbf{d}_i)^T \mathbf{d}_i = 0$. Since $\mathbf{x}_{i+1} = \mathbf{x}_i + \alpha^* \mathbf{d}_i$ and $\mathbf{d}_{i+1}$ is parallel to $\nabla f(\mathbf{x}_{i+1})$, this actually means $\mathbf{d}_{i+1}^T \mathbf{d}_i = 0$.
This property actually implies one important limitation of the steepest descent method: if the initial guess is not good, the solution sequence will zigzag and converge very slowly, as shown in Figure 15.

Figure 15: Comparison of different initial guesses for the steepest descent algorithm while solving an optimization problem.

Numerical method: Newton's method

One solution to the above limitation of the steepest descent method is Newton's method.


In particular, we perform the Taylor expansion around the point $\mathbf{x}_i$:

$f(\mathbf{x}) = f(\mathbf{x}_i) + \nabla f(\mathbf{x}_i)^T (\mathbf{x} - \mathbf{x}_i) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_i)^T H_i (\mathbf{x} - \mathbf{x}_i) + O(\|\mathbf{x} - \mathbf{x}_i\|^3),$

where $H_i$ is the Hessian matrix evaluated at the point $\mathbf{x}_i$.
At the optimal point $\nabla f(\mathbf{x}) = \mathbf{0}$, and thus (dropping the higher-order term) $\nabla f(\mathbf{x}_i) + H_i (\mathbf{x} - \mathbf{x}_i) = \mathbf{0}$. As a result, we can choose the next point as $\mathbf{x}_{i+1} = \mathbf{x}_i - H_i^{-1} \nabla f(\mathbf{x}_i)$.
Newton's method is more complex than the steepest descent method, but usually it converges much faster than the steepest descent method, as shown in Figure 16.

Figure 16: Comparison between the steepest descent method and Newton's method while maximizing $f(x, y) = \dots$, where the black curve is the steepest descent and the blue curve is Newton's method.
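A sketch of the multi-dimensional Newton iteration follows (names are ours). Because the earlier objective $f(x, y) = 2xy + 2x - x^2 - 2y^2$ is quadratic, its Hessian is constant and Newton's method reaches the maximizer $(2, 1)$ in a single step:

```python
import numpy as np

def newton_nd(grad, hess, x0, n_iter=20):
    """x_{i+1} = x_i - H_i^{-1} grad f(x_i); we solve the linear system
    H_i * step = grad f(x_i) instead of inverting H_i explicitly."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

grad = lambda p: np.array([2*p[1] + 2 - 2*p[0], 2*p[0] - 4*p[1]])
hess = lambda p: np.array([[-2.0, 2.0], [2.0, -4.0]])
print(newton_nd(grad, hess, [-1.0, 1.0]))     # [2. 1.] after one step
```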


9. Curve Fitting: Least Squares Regression
Polynomial curve fitting
Suppose we observe a real-valued input variable $x$ and we wish to use this observation to predict the value of a real-valued target variable $t$. Now suppose that we are given a training set comprising $N$ observations of $x$, denoted as $x_1, x_2, \dots, x_N$, together with corresponding observations of the values of $t$, denoted as $t_1, t_2, \dots, t_N$. Our goal is to exploit the given data set in order to make predictions of the value of the target variable for some new value $x$. The simplest approach to achieve this is via curve fitting. In particular, we shall fit the data using a polynomial function of the form
$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j,$
where $M$ is the order of the polynomial.
The values of the coefficients $\mathbf{w}$ will be determined by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function $y(x, \mathbf{w})$, for any given value of $\mathbf{w}$, and the given data set points. One simple choice of error function, which is widely used, is given by the sum of the squares of the errors between the predictions $y(x_n, \mathbf{w})$ for each data point $x_n$ and the corresponding target value $t_n$, so that we minimize
$E(\mathbf{w}) = \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2.$
In this way, the curve fitting problem is converted into a multi-dimensional optimization problem $\min_{\mathbf{w}} E(\mathbf{w})$.


Figure 17 shows a set of polynomial fitting results using different orders of polynomials. We can observe the following: if the polynomial order is too small (e.g., $M = 0, 1$ in Figure 17), the fitted curve cannot fit the points well; if the polynomial order is too large (e.g., $M = 9$ in Figure 17), the fitted curve can pass through every data point, but it may not fit the underlying real function well and thus cannot provide a high-quality prediction. Only when the polynomial order is appropriate (e.g., $M = 3$ in Figure 17) can the fitted curve fit the given data points well and also provide a good prediction.

Figure 17: Polynomials with different orders M, shown as red curves, fitted to the data set shown as blue
dots.

Below we discuss two special cases of the polynomial curve fitting: the linear regression and the quadratic regression.


Special Case 1: Linear Regression

In this special case $M = 1$, $y(x, w_0, w_1) = w_0 + w_1 x$. Then the error function is $E(w_0, w_1) = \sum_{n=1}^{N} (w_0 + w_1 x_n - t_n)^2$.
The optimal $w_0, w_1$ will satisfy $\frac{\partial E}{\partial w_0} = 0$ and $\frac{\partial E}{\partial w_1} = 0$. By solving these two equations we have:
$\sum_{n=1}^{N} (w_0 + w_1 x_n - t_n) = 0$ and $\sum_{n=1}^{N} (w_0 + w_1 x_n - t_n)\, x_n = 0,$
and thus we can obtain
$w_1 = \frac{N \sum_n x_n t_n - \sum_n x_n \sum_n t_n}{N \sum_n x_n^2 - \left( \sum_n x_n \right)^2}$ and $w_0 = \bar{t} - w_1 \bar{x},$
where $\bar{x}$ and $\bar{t}$ are the means of the $x_n$ and the $t_n$.
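These closed-form formulas translate directly into code; a sketch with made-up sample data:

```python
import numpy as np

def fit_line(x, t):
    """Least-squares line t ~ w0 + w1*x via the formulas above."""
    n = len(x)
    w1 = (n * np.sum(x * t) - np.sum(x) * np.sum(t)) / \
         (n * np.sum(x**2) - np.sum(x)**2)
    w0 = np.mean(t) - w1 * np.mean(x)
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])        # made-up sample data
print(fit_line(x, t))                     # intercept ~1.09, slope ~1.94
```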

Special Case 2: Quadratic Regression

In this special case $M = 2$, $y(x, w_0, w_1, w_2) = w_0 + w_1 x + w_2 x^2$. Then the error function is $E(w_0, w_1, w_2) = \sum_{n=1}^{N} (w_0 + w_1 x_n + w_2 x_n^2 - t_n)^2$.
The optimal $w_0, w_1, w_2$ will satisfy $\frac{\partial E}{\partial w_0} = 0$, $\frac{\partial E}{\partial w_1} = 0$, and $\frac{\partial E}{\partial w_2} = 0$. By solving these three equations we have:

$\begin{bmatrix} N & \sum_n x_n & \sum_n x_n^2 \\ \sum_n x_n & \sum_n x_n^2 & \sum_n x_n^3 \\ \sum_n x_n^2 & \sum_n x_n^3 & \sum_n x_n^4 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} \sum_n t_n \\ \sum_n x_n t_n \\ \sum_n x_n^2 t_n \end{bmatrix}.$

Solving the above equation, we can obtain the optimal quadratic regression.

The general case

For general polynomial fitting, the error function can be formulated as

$E(\mathbf{w}) = \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2 = \| X \mathbf{w} - \mathbf{t} \|^2.$

Let
$\mathbf{t} = \begin{bmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^M \\ 1 & x_2 & x_2^2 & \cdots & x_2^M \\ \vdots & & & & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^M \end{bmatrix};$
the above function can be reformulated as $E(\mathbf{w}) = (X \mathbf{w} - \mathbf{t})^T (X \mathbf{w} - \mathbf{t})$. The optimal $\mathbf{w}^*$ needs to satisfy $\nabla E(\mathbf{w}^*) = \mathbf{0}$, which results in $\mathbf{w}^* = (X^T X)^{-1} X^T \mathbf{t}$.
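A sketch of the general case using the normal equations (note that for large $M$ the matrix $X^T X$ becomes ill-conditioned, which is why library routines such as numpy's polyfit solve the least-squares problem through more stable factorizations):

```python
import numpy as np

def polyfit_normal_eq(x, t, M):
    """Order-M polynomial fit via w* = (X^T X)^{-1} X^T t."""
    X = np.vander(x, M + 1, increasing=True)   # columns 1, x, ..., x^M
    return np.linalg.solve(X.T @ X, X.T @ t)   # solve, don't invert

x = np.linspace(0.0, 1.0, 10)
t = np.sin(2*np.pi*x) + 0.1*np.random.default_rng(0).normal(size=10)
print(polyfit_normal_eq(x, t, M=3))            # coefficients w0..w3
```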

10. Fourier Series

A function $f(t)$ is periodic if there exists $T$ such that $f(t + T) = f(t)$. Here $T$ is called the period of the function, and $\omega = \frac{2\pi}{T}$ is the (angular) frequency of the function.
Any periodic function with frequency $\omega$ (and some other appropriate properties) can be represented by the Fourier series, i.e.,
$f(t) = a_0 + a_1 \cos(\omega t) + b_1 \sin(\omega t) + a_2 \cos(2\omega t) + b_2 \sin(2\omega t) + \dots + a_k \cos(k \omega t) + b_k \sin(k \omega t) + \dots$
or $f(t) = a_0 + \sum_{k=1}^{\infty} \left[ a_k \cos(k \omega t) + b_k \sin(k \omega t) \right]$.
In other words, we can approximate $f(t)$ by a set of basis functions
$B = \{ 1, \sin(\omega t), \cos(\omega t), \dots, \sin(k \omega t), \cos(k \omega t), \dots \},$
and we need to find the corresponding Fourier coefficients $a_0, a_k, b_k$.

Theorem: The basis functions have the following properties:
1. Given any two different basis functions $\phi, \psi \in B$, we have $\int_0^T \phi(t)\, \psi(t)\, dt = 0$.
2. For any basis function $\phi \in B$, $\int_0^T \phi(t)^2\, dt \ne 0$. In particular, for $\phi \ne 1$, $\int_0^T \phi(t)^2\, dt = \frac{T}{2}$, and $\int_0^T 1^2\, dt = T$.
Proof: Simple integration math.

Given the above theorem, we can compute the Fourier coefficients easily.
$\int_0^T f(t)\, dt = \int_0^T \left[ a_0 + a_1 \cos(\omega t) + b_1 \sin(\omega t) + a_2 \cos(2\omega t) + b_2 \sin(2\omega t) + \dots \right] dt = a_0 T$, and thus $a_0 = \frac{1}{T} \int_0^T f(t)\, dt$.
$\int_0^T \cos(k \omega t)\, f(t)\, dt = \int_0^T \cos(k \omega t) \left[ a_0 + a_1 \cos(\omega t) + b_1 \sin(\omega t) + \dots + a_k \cos(k \omega t) + b_k \sin(k \omega t) + \dots \right] dt = a_k \frac{T}{2}$, and thus $a_k = \frac{2}{T} \int_0^T f(t) \cos(k \omega t)\, dt$.
Similarly, $b_k = \frac{2}{T} \int_0^T f(t) \sin(k \omega t)\, dt$.

Examples of Fourier coefficient computation are available on the slides.
For some special functions, we can actually compute their Fourier coefficients using the concepts above instead of complex calculation.
Example: Compute the Fourier coefficients for the function $f(t) = \sin\frac{t}{2} + \cos\frac{t}{3}$.


Solution: Of course we can follow the standard steps of computing Fourier coefficients. But notice that this function is already a sum of trigonometric functions, so maybe we can find a more lightweight solution. First, the period of $\sin\frac{t}{2}$ is $T_1 = 4\pi$, and the period of $\cos\frac{t}{3}$ is $T_2 = 6\pi$. Thus we can guess that the period of the function $f(t)$ is $12\pi$, the lowest common multiple of $6\pi$ and $4\pi$. We can verify this by checking $f(t + 12\pi) = f(t)$, and that there is no smaller $T$ that can satisfy $f(t + T) = f(t)$. Thus $\omega = \frac{2\pi}{T} = \frac{1}{6}$. The function can then be written in the standard format of a Fourier series: $f(t) = \sin\frac{t}{2} + \cos\frac{t}{3} = \sin(3 \omega t) + \cos(2 \omega t)$. This means $b_3 = 1$ and $a_2 = 1$, and all other $a_k, b_k$ are zero.
Note: This may help you to greatly simplify the computation when
the function involves trigonometric components.
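The coefficient formulas can also be checked numerically; a sketch using scipy's quad integrator (assumed available) on the example above:

```python
import numpy as np
from scipy.integrate import quad

T = 12 * np.pi                      # period of f(t) = sin(t/2) + cos(t/3)
w = 2 * np.pi / T                   # fundamental frequency omega = 1/6
f = lambda t: np.sin(t/2) + np.cos(t/3)

a = lambda k: (2/T) * quad(lambda t: f(t) * np.cos(k*w*t), 0, T)[0]
b = lambda k: (2/T) * quad(lambda t: f(t) * np.sin(k*w*t), 0, T)[0]

print(a(2), b(3))                   # ~1.0 and ~1.0, as argued above
print(a(1), b(1), a(3), b(2))       # ~0.0 for every other coefficient
```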

Try it yourself: Let $f(t) = \begin{cases} 1, & -\frac{T}{2} < t < 0 \\ 0, & 0 < t < \frac{T}{2} \end{cases}$, where $T = 24\pi$. What are the Fourier coefficients of $f(t) + \sin\frac{t}{2} + \cos\frac{t}{3}$? [Hint: the period now should be the lowest common multiple of $24\pi$ (the period of $f(t)$) and $12\pi$ (the period of $\sin\frac{t}{2} + \cos\frac{t}{3}$).]
