
[5]:
import numpy as np
import pandas as pd
import seaborn as sns

from estimagic import minimize

Why optimization is difficult#

This tutorial shows why optimization is difficult and why you need some background knowledge in order to solve optimization problems efficiently. It is meant for people who have no previous experience with numerical optimization and wonder why there are so many optimization algorithms, yet none that works for all problems. For each potential problem we highlight, we also give some ideas on how to solve it.

If you simply want to know how to choose an optimization algorithm, check out our very brief guide or the longer version.

If you simply want to learn the mechanics of doing optimization with estimagic, check out the first optimization with estimagic tutorial.

The message of this notebook can be summarized as follows:

  • The only algorithms that are guaranteed to solve all problems are grid search or other algorithms that evaluate the criterion function almost everywhere in the parameter space.

  • If you have more than a handful of parameters, these methods would take too long.

  • Thus, you have to know the properties of your optimization problem and have knowledge of different optimization algorithms in order to choose the right algorithm for your problem.

This tutorial uses variants of the sphere function from the first optimization with estimagic tutorial to illustrate problems.

[31]:
def sphere(params):
    return (params["value"] ** 2).sum()


def sphere_gradient(params):
    return params["value"] * 2
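
Both functions expect a pandas DataFrame with a "value" column, which is the parameter format used throughout this tutorial. As a quick check, they can be evaluated on a small, purely illustrative DataFrame:

check_params = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=["x_0", "x_1", "x_2"],
)

sphere(check_params)           # 14.0 = 1 + 4 + 9
sphere_gradient(check_params)  # a Series with entries 2.0, 4.0, 6.0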

Why grid search is infeasible#

Sampling based optimizers and grid search require the parameter space to be bounded in all directions. Let’s assume we know that the optimum of the sphere function lies between -0.5 and 0.5 in every dimension, but not where exactly it is.

In order to get a precision of 2 digits with a grid search, i.e. to locate each parameter within 0.01, we need 100 grid points per parameter and therefore the following number of function evaluations (depending on the number of parameters):

[32]:
dimensions = np.arange(10) + 1
n_evals = 100**dimensions
sns.lineplot(x=dimensions, y=n_evals)
[32]:
<AxesSubplot:>
../_images/getting_started_why_optimization_is_hard_4_1.png

If you have 10 dimensions and your criterion function takes one second to evaluate, you need about 3 billion years on a 1000-core cluster. Many real-world criterion functions have hundreds of parameters and take minutes to evaluate once. This is called the curse of dimensionality.
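
The back-of-the-envelope calculation behind the 3 billion years (one second per evaluation, with 1000 evaluations running in parallel at all times):

n_grid_points = 100**10                 # grid points for 10 parameters
seconds = n_grid_points / 1000          # one second per evaluation on 1000 cores
years = seconds / (60 * 60 * 24 * 365)
years                                   # roughly 3.2e9, i.e. about 3 billion years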

Sampling based algorithms typically fix the number of criterion evaluations and try to spend them a bit more cleverly than a purely random search would. However, the clever tricks only work under additional assumptions. Thus, either you make assumptions about your problem or the curse of dimensionality comes back through the backdoor. For easier analysis, assume we fix the number of function evaluations of a grid search instead of a sampling based algorithm and ask which precision we can get, depending on the dimension.

With a budget of 1 million function evaluations, each of the d dimensions only gets about 1e6 ** (1 / d) grid points, so we can expect the following precision:

[33]:
dimensions = np.arange(10) + 1
precision = 1e-6 ** (1 / dimensions)
sns.lineplot(x=dimensions, y=precision)
[33]:
<AxesSubplot:>
../_images/getting_started_why_optimization_is_hard_7_1.png

How derivatives can solve the curse of dimensionality#

Derivative based methods do not try to evaluate the criterion function everywhere in the search space. Instead, they start at some point and go “downhill” from there. The gradient of the criterion function indicates which direction is downhill; different methods then differ in how they determine how far to go in that direction. The time it takes to evaluate a derivative grows at most linearly in the number of parameters. Using derivative information, optimizers can therefore often find an optimum with very few criterion evaluations, as the sketch below illustrates.
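
As a rough illustration, the following sketch minimizes a 100-dimensional version of the sphere function with an analytical gradient. It calls scipy.optimize directly (scipy’s L-BFGS-B is the algorithm used as scipy_lbfgsb below) so that the example stays self-contained; the function names are made up for this sketch.

import numpy as np
from scipy.optimize import minimize as scipy_minimize


def sphere_np(x):
    return (x**2).sum()


def sphere_np_gradient(x):
    return 2 * x


x0 = np.full(100, 5.0)  # 100 parameters would be hopeless for grid search
res = scipy_minimize(sphere_np, x0, jac=sphere_np_gradient, method="L-BFGS-B")

res.nfev  # only a handful of criterion evaluations, not 100**100
res.x     # all entries are (numerically) zero

Of course, the sphere function is an especially easy case: it is smooth and convex, so going downhill always leads to the global optimum.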

How derivative based methods can fail#

To see how derivative based methods can fail, we use simple modifications of the sphere function.

[34]:
def sphere_with_noise(params):
    return sphere(params) + np.random.normal(scale=0.01)


start_params = pd.DataFrame(
    data=np.arange(5) + 1,
    columns=["value"],
    index=[f"x_{i}" for i in range(5)],
)
start_params
[34]:
value
x_0 1
x_1 2
x_2 3
x_3 4
x_4 5
[35]:
grid = np.linspace(-1, 1, 1000)
sns.lineplot(
    x=grid,
    y=(grid**2) + np.random.normal(scale=0.01, size=len(grid)),
)
[35]:
<AxesSubplot:>
../_images/getting_started_why_optimization_is_hard_11_1.png
[36]:
minimize(
    criterion=sphere_with_noise,
    params=start_params,
    algorithm="scipy_lbfgsb",
    logging=False,
)
[36]:
{'solution_x': array([1., 2., 3., 4., 5.]),
 'solution_criterion': 54.99812797024216,
 'solution_derivative': array([1246767.68971634,   72225.04557204,  518929.12042475,
         325149.00625277,   59972.20947514]),
 'solution_hessian': None,
 'n_criterion_evaluations': 21,
 'n_derivative_evaluations': None,
 'n_iterations': 0,
 'success': False,
 'reached_convergence_criterion': None,
 'message': b'ABNORMAL_TERMINATION_IN_LNSRCH',
 'solution_params':      lower_bound  upper_bound  value
 x_0         -inf          inf    1.0
 x_1         -inf          inf    2.0
 x_2         -inf          inf    3.0
 x_3         -inf          inf    4.0
 x_4         -inf          inf    5.0}
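
The enormous entries in solution_derivative give away what went wrong: no analytical gradient was passed, so the optimizer approximates it with finite differences, and dividing noise of scale 0.01 by a tiny step size inflates the noise by many orders of magnitude. A minimal numpy sketch of that mechanism (the step size is only an assumed, typical value):

rng = np.random.default_rng(1234)


def noisy_sphere_np(x):
    return (x**2).sum() + rng.normal(scale=0.01)


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = 1e-8  # assumed finite difference step size

step = np.zeros_like(x)
step[0] = h
(noisy_sphere_np(x + step) - noisy_sphere_np(x)) / h  # typically of order 1e6, while the true derivative is 2.0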

So the algorithm failed, but at least it tells you that it did not succeed. Let’s look at a different kind of numerical noise that could come from rounding.

[37]:
def piecewise_constant_sphere(params):
    params = params.copy(deep=True)
    params["value"] = params["value"].round(2)
    return sphere(params)
[38]:
sns.lineplot(x=grid, y=grid.round(2) ** 2)
[38]:
<AxesSubplot:>
../_images/getting_started_why_optimization_is_hard_15_1.png
[39]:
minimize(
    criterion=piecewise_constant_sphere,
    params=start_params,
    algorithm="scipy_lbfgsb",
    logging=False,
)
[39]:
{'solution_x': array([1., 2., 3., 4., 5.]),
 'solution_criterion': 55.0,
 'solution_derivative': array([0., 0., 0., 0., 0.]),
 'solution_hessian': None,
 'n_criterion_evaluations': 1,
 'n_derivative_evaluations': None,
 'n_iterations': 0,
 'success': True,
 'reached_convergence_criterion': None,
 'message': b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL',
 'solution_params':      lower_bound  upper_bound  value
 x_0         -inf          inf    1.0
 x_1         -inf          inf    2.0
 x_2         -inf          inf    3.0
 x_3         -inf          inf    4.0
 x_4         -inf          inf    5.0}

This time the algorithm failed silently: it reports success even though it never moved away from the start parameters.
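
The reason is again the finite difference approximation of the gradient: a step of roughly 1e-8 (an assumed, typical value) is far too small to push any parameter across a rounding boundary at the second decimal, so the approximated gradient is exactly zero and the convergence criterion on the gradient norm is satisfied immediately. A minimal numpy sketch:

def rounded_sphere_np(x):
    return (np.round(x, 2) ** 2).sum()


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = 1e-8  # assumed finite difference step size

np.array(
    [(rounded_sphere_np(x + h * e) - rounded_sphere_np(x)) / h for e in np.eye(len(x))]
)
# array([0., 0., 0., 0., 0.]): the rounded parameters do not change at all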

How derivative free methods can solve the curse of dimensionality#

To be written

How derivative free methods can fail#

To be written