How to scale optimization problems#

Real world optimization problems often comprise parameters of vastly different orders of magnitudes. This is typically not a problem for gradient based optimization algorithms but can considerably slow down derivative free optimizers. Below we describe three simple heuristics to improve the scaling of optimization problems and discuss the pros and cons of each approach.

What does well scaled mean#

In short, an optimization problem is well scaled if a fixed step in any direction yields a roughly similar sized change in the criterion function.

In practice, this can never be achieved perfectly (at least for nonlinear problems). However, one can easily improve over simply ignoring the problem altogether.

Heuristics to improve scaling#

Divide by absolute value of start parameters#

In many applications, parameters with very large start values will vary over a wide range and a change in that parameter will only lead to a relatively small change in the criterion function. If this is the case, the scaling of the optimization problem can be improved by simply dividing all parameter vectors by the start parameters.

Advantages:

  • straightforward

  • works with any type of constraints

Disadvantages:

  • Makes scaling dependent on start values

  • Parameters with zero start value need special treatment

How to specify this scaling:

import estimagic as em


def sphere(params):
    return (params["value"] ** 2).sum()


start_params = pd.DataFrame(data=np.arange(5), columns=["value"])
start_params["lower_bound"] = 0
start_params["upper_bound"] = 2 * np.arange(5) + 1

res = em.minimize(
    criterion=sphere,
    params=start_params,
    algorithm="scipy_lbfgsb",
    scaling=True,
    scaling_options={"method": "start_values", "clipping_value": 0.1},
)

Divide by bounds#

In many optimization problems, one has additional information on bounds of the parameter space. Some of these bounds are hard (e.g. probabilities or variances are non negative), others are soft and derived from simple considerations (e.g. if a time discount factor were smaller than 0.7, we would not observe anyone to pursue a university degree in a structural model of educational choices; or if an infection probability was higher than 20% for distant contacts, the covid pandemic would have been over after a month). For parameters that strongly influence the criterion function, the bounds stemming from these considerations are typically tighter than for parameters that have a small effect on the criterion function.

Thus, a natural approach to improve the scaling of the optimization problem is to re-map all parameters such that the bounds are [0, 1] for all parameters. This has the additional advantage that absolute and relative convergence criteria on parameter changes become the same.

Advantages:

  • straightforward

  • works well in many practical applications

  • scaling is independent of start values

  • No problems with division by zero

Disadvantages:

  • Only works if all parameters have bounds

  • This prohibits some kinds of other constraints in estimagic

How to specify this scaling:

def sphere(params):
    return (params["value"] ** 2).sum()


start_params = pd.DataFrame(data=np.arange(5), columns=["value"])
start_params["lower_bound"] = 0
start_params["upper_bound"] = 2 * np.arange(5) + 1

res = em.minimize(
    criterion=sphere,
    params=start_params,
    algorithm="scipy_lbfgsb",
    scaling=True,
    scaling_options={"method": "bounds", "clipping_value": 0.0},
)

Influencing the magnitude of parameters#

The above approaches align the scale of parameters relative to each other. However, the overall magnitude is set rather arbitrarily. For example, when dividing by start values, the magnitude of the scaled parameters is around one. When dividing by bounds, it is somewhere between zero and one.

For the performance of numerical optimizers, only the relative scales are important.

However, influencing the overall magnitude can be helpful to trick some optimizers into doing things they do not want to do. For example, when there is a minimal allowed initial trust region radius, increasing the magnitude of parameters allows to effectively make the trust region radius smaller.

Setting the magnitude means simply adding one more entry to the scaling options. For example, if you want to scale by bounds and increase the magnitude by a factor of five:

def sphere(params):
    return (params["value"] ** 2).sum()


start_params = pd.DataFrame(data=np.arange(5), columns=["value"])
start_params["lower_bound"] = 0
start_params["upper_bound"] = 2 * np.arange(5) + 1

res = em.minimize(
    criterion=sphere,
    params=start_params,
    algorithm="scipy_lbfgsb",
    scaling=True,
    scaling_options={"method": "bounds", clipping_value: 0.0, "magnitude": 5},
)

Remarks#

What is the clipping_value#

In all of the above heuristics, the parameter vector is divided (elementwise) by some other vector and it is possible that some entries of the divisor are zero or close to zero.

The clipping value bounds the elements of the divisor away from zero. It should be set to a strictly non-zero number for the "start_values" and "gradient" approach. The "bounds" approach avoids division by exact zeros by construction. The "clipping_value" can still be used to avoid extreme upscaling of parameters with very tight bounds. However, this means that the bounds of the re-scaled problem are not exactly [0, 1] for all parameters.

Default values#

Scaling is disabled by default. If enabled, but no scaling_options are provided, we use the "start_values" method with a "clipping_value" of 0.1. This is the default method because it can be used for all optimization problems and has low computational cost. We strongly recommend you read the above guidelines and choose the method that is most suitable for your problem.