(scaling)=

# How to scale optimization problems

Real-world optimization problems often comprise parameters of vastly different orders of
magnitude. This is typically not a problem for gradient-based optimization algorithms
but can considerably slow down derivative-free optimizers. Below we describe two
simple heuristics to improve the scaling of optimization problems and discuss the pros
and cons of each approach.

## What does well scaled mean

In short, an optimization problem is well scaled if a fixed step in any direction yields
a roughly similar sized change in the criterion function.

In practice, this can never be achieved perfectly (at least for nonlinear problems).
However, one can easily improve over simply ignoring the problem altogether.
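
To make the definition concrete, here is a minimal sketch with a made-up two-parameter
criterion (`criterion`, `change_per_step` and `scaled_criterion` are illustrative names,
not part of estimagic). It measures how much the criterion changes for a fixed step
along each coordinate, before and after rescaling the second parameter.

```python
def criterion(x):
    # Hypothetical badly scaled criterion: the second parameter lives on a
    # scale that is 1000 times larger than the first.
    return x[0] ** 2 + (x[1] / 1000) ** 2


def change_per_step(func, x, step=1.0):
    """Absolute change in func for a fixed step along each coordinate."""
    base = func(x)
    changes = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += step
        changes.append(abs(func(shifted) - base))
    return changes


def scaled_criterion(y):
    # Same problem, but the optimizer now works with y = x / [1, 1000].
    return criterion([y[0] * 1.0, y[1] * 1000.0])


# Unscaled: a unit step in x[0] matters far more than a unit step in x[1].
changes = change_per_step(criterion, [1.0, 1000.0])

# Scaled: a unit step in either direction changes the criterion equally.
scaled_changes = change_per_step(scaled_criterion, [1.0, 1.0])

print(changes, scaled_changes)
```

In the unscaled problem the two changes differ by more than three orders of magnitude;
after rescaling they coincide.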

## Heuristics to improve scaling

### Divide by absolute value of start parameters

In many applications, parameters with very large start values will vary over a wide
range and a change in that parameter will only lead to a relatively small change in the
criterion function. If this is the case, the scaling of the optimization problem can be
improved by simply dividing all parameter vectors elementwise by the absolute values of
the start parameters.

**Advantages:**

- Straightforward
- Works with any type of constraints

**Disadvantages:**

- Makes scaling dependent on start values
- Parameters with zero start value need special treatment

**How to specify this scaling:**

```python
import estimagic as em
import numpy as np
import pandas as pd


def sphere(params):
    return (params["value"] ** 2).sum()


start_params = pd.DataFrame(data=np.arange(5), columns=["value"])
start_params["lower_bound"] = 0
start_params["upper_bound"] = 2 * np.arange(5) + 1

res = em.minimize(
    criterion=sphere,
    params=start_params,
    algorithm="scipy_lbfgsb",
    scaling=True,
    scaling_options={"method": "start_values", "clipping_value": 0.1},
)
```
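
For intuition, the transformation behind this method can be sketched in a few lines of
plain Python. This is an illustration of the idea, not estimagic's actual
implementation; `scale_by_start_values` and `unscale` are hypothetical helper names, and
the clipping rule (replace divisors below `clipping_value` by `clipping_value`) is an
assumption based on the description of the clipping value in the remarks.

```python
def scale_by_start_values(x, start, clipping_value=0.1):
    """Divide x elementwise by the clipped absolute start values (sketch)."""
    # Clipping keeps the divisor away from zero for start values at or near 0.
    factors = [max(abs(s), clipping_value) for s in start]
    scaled = [xi / f for xi, f in zip(x, factors)]
    return scaled, factors


def unscale(scaled, factors):
    """Map scaled parameters back to the original parameter space."""
    return [si * f for si, f in zip(scaled, factors)]


# Same start values as np.arange(5) in the example above.
start = [0.0, 1.0, 2.0, 3.0, 4.0]
scaled_start, factors = scale_by_start_values(start, start)

print(factors)       # divisors used for each parameter
print(scaled_start)  # start values as an optimizer would see them
```

All non-zero start values are mapped to one, while the zero start value stays at zero
and is divided by the clipping value instead.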

### Divide by bounds

In many optimization problems, one has additional information on bounds of the parameter
space. Some of these bounds are hard (e.g. probabilities or variances are non-negative),
others are soft and derived from simple considerations (e.g. if a time discount factor
were smaller than 0.7, we would not observe anyone pursuing a university degree in a
structural model of educational choices; or if an infection probability were higher than
20% for distant contacts, the covid pandemic would have been over after a month). For
parameters that strongly influence the criterion function, the bounds stemming from
these considerations are typically tighter than for parameters that have a small effect
on the criterion function.

Thus, a natural approach to improve the scaling of the optimization problem is to re-map
all parameters such that the bounds are \[0, 1\] for all parameters. This has the
additional advantage that absolute and relative convergence criteria on parameter
changes become the same.

**Advantages:**

- Straightforward
- Works well in many practical applications
- Scaling is independent of start values
- No problems with division by zero

**Disadvantages:**

- Only works if all parameters have bounds
- This prohibits some kinds of other constraints in estimagic

**How to specify this scaling:**

```python
def sphere(params):
    return (params["value"] ** 2).sum()


start_params = pd.DataFrame(data=np.arange(5), columns=["value"])
start_params["lower_bound"] = 0
start_params["upper_bound"] = 2 * np.arange(5) + 1

res = em.minimize(
    criterion=sphere,
    params=start_params,
    algorithm="scipy_lbfgsb",
    scaling=True,
    scaling_options={"method": "bounds", "clipping_value": 0.0},
)
```
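
The re-mapping to \[0, 1\] can be sketched as follows. Again, this is only an
illustration under stated assumptions, not estimagic's implementation:
`scale_by_bounds` is a hypothetical helper, every upper bound is assumed to be strictly
larger than its lower bound, and the lower bound is subtracted before dividing by the
width of the interval.

```python
def scale_by_bounds(x, lower, upper, clipping_value=0.0):
    """Map x elementwise from [lower, upper] to [0, 1] (sketch)."""
    # The width of each interval is the divisor; clipping limits extreme
    # upscaling when an interval is very narrow.
    widths = [max(u - l, clipping_value) for l, u in zip(lower, upper)]
    return [(xi - l) / w for xi, l, w in zip(x, lower, widths)]


# Same values as in the example above: bounds of [0, 2 * i + 1] for i = 0..4.
start = [0.0, 1.0, 2.0, 3.0, 4.0]
lower = [0.0, 0.0, 0.0, 0.0, 0.0]
upper = [1.0, 3.0, 5.0, 7.0, 9.0]

scaled_start = scale_by_bounds(start, lower, upper)
scaled_lower = scale_by_bounds(lower, lower, upper)
scaled_upper = scale_by_bounds(upper, lower, upper)

print(scaled_start)
```

With a clipping value of zero, every lower bound maps to 0 and every upper bound to 1,
so all parameters live on the same scale regardless of their start values.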

## Influencing the magnitude of parameters

The above approaches align the scale of parameters relative to each other. However, the
overall magnitude is set rather arbitrarily. For example, when dividing by start values,
the magnitude of the scaled parameters is around one. When dividing by bounds, it is
somewhere between zero and one.

For the performance of numerical optimizers, only the relative scales are important.

However, influencing the overall magnitude can be helpful to trick some optimizers into
doing things they do not want to do. For example, when there is a minimal allowed
initial trust region radius, increasing the magnitude of the parameters effectively
makes the trust region radius smaller in terms of the original parameters.

Setting the magnitude means simply adding one more entry to the scaling options. For
example, if you want to scale by bounds and increase the magnitude by a factor of five:

```python
def sphere(params):
    return (params["value"] ** 2).sum()


start_params = pd.DataFrame(data=np.arange(5), columns=["value"])
start_params["lower_bound"] = 0
start_params["upper_bound"] = 2 * np.arange(5) + 1

res = em.minimize(
    criterion=sphere,
    params=start_params,
    algorithm="scipy_lbfgsb",
    scaling=True,
    scaling_options={"method": "bounds", "clipping_value": 0.0, "magnitude": 5},
)
```
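
The trust region argument can be checked with a little arithmetic. The sketch below
assumes that the magnitude simply multiplies the bounds-scaled parameters;
`external_radius` is a hypothetical helper and the minimal radius of 0.1 is a made-up
number, not a property of any particular optimizer.

```python
def external_radius(internal_radius, lower, upper, magnitude=1.0):
    """Size of an internal step, expressed in original parameter units.

    Assumes internal = magnitude * (x - lower) / (upper - lower).
    """
    return internal_radius * (upper - lower) / magnitude


# Hypothetical optimizer that does not allow an initial radius below 0.1.
min_radius = 0.1

# Parameter with bounds [0, 9], as the last parameter in the example above.
radius_default = external_radius(min_radius, lower=0.0, upper=9.0, magnitude=1.0)
radius_magnified = external_radius(min_radius, lower=0.0, upper=9.0, magnitude=5.0)

print(radius_default, radius_magnified)
```

The same internal radius corresponds to a step that is five times smaller in the
original parameter space once the magnitude is set to 5.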

## Remarks

### What is the `clipping_value`

In all of the above heuristics, the parameter vector is divided (elementwise) by some
other vector and it is possible that some entries of the divisor are zero or close to
zero.

The clipping value bounds the elements of the divisor away from zero. It should be set
to a strictly positive number for the `"start_values"` and `"gradient"` approaches. The
`"bounds"` approach avoids division by exact zeros by construction. The
`"clipping_value"` can still be used to avoid extreme upscaling of parameters with very
tight bounds. However, this means that the bounds of the re-scaled problem are not
exactly \[0, 1\] for all parameters.
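
A small numerical sketch of this trade-off for the `"bounds"` approach follows. It
assumes that the divisor is the distance between the bounds, clipped from below at the
clipping value; `scaled_upper_bound` is a hypothetical helper, not an estimagic
function.

```python
def scaled_upper_bound(lower, upper, clipping_value):
    """Upper bound of one parameter after scaling by (clipped) bounds."""
    width = max(upper - lower, clipping_value)
    return (upper - lower) / width


# Parameter with very tight bounds [0, 1e-6].
unclipped = scaled_upper_bound(0.0, 1e-6, clipping_value=0.0)
clipped = scaled_upper_bound(0.0, 1e-6, clipping_value=0.01)

# Parameter with wide bounds [0, 9] is unaffected by the same clipping value.
wide = scaled_upper_bound(0.0, 9.0, clipping_value=0.01)

print(unclipped, clipped, wide)
```

Without clipping, the tightly bounded parameter is stretched to the full unit interval.
With a clipping value of 0.01 it is upscaled by a factor of 100 at most, so its
re-scaled upper bound stays well below 1.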

### Default values

Scaling is disabled by default. If enabled, but no `scaling_options` are provided, we
use the `"start_values"` method with a `"clipping_value"` of 0.1. This is the default
method because it can be used for all optimization problems and has low computational
cost. We strongly recommend you read the above guidelines and choose the method that is
most suitable for your problem.