Download the notebook here
Ordered Logit Example
Let’s suppose, completely hypothetically, that we are not a big fan of Stata or simply want to learn the mechanics behind an ordered logit model by coding it up ourselves.
The resulting function should be user-friendly and its usage should look approximately like this:
data = pd.read_pickle("ologit.pickle")
formula = 'apply ~ pared + public + gpa'
estimates = ordered_logit(formula, data)
The example is taken from the Stata Documentation. The correct parameters for this model are [1.047664, -0.0586828, 0.6157458, 2.203323, 4.298767].
In this notebook we show you how estimagic can help you to implement such a model very easily.
[1]:
import numpy as np
import pandas as pd
from scipy import stats
from patsy import dmatrices
from estimagic import maximize
Process the user input
First we have to take the formula and dataset, extract all relevant information about the model and construct the inputs for the likelihood function.
We will need four inputs:
A DataFrame with start parameters for the optimization.
An array with the dependent variable.
A 2d array with explanatory variables.
Constraints for the optimization that keep the cutoffs increasing.
[2]:
def ordered_logit_processing(formula, data):
    """Process user input for an ordered logit model."""
    # extract data arrays
    y, x = dmatrices(formula + ' - 1', data, return_type='dataframe')
    y = y[y.columns[0]]

    # extract dimensions
    num_choices = len(y.unique())
    beta_names = list(x.columns)
    num_betas = len(beta_names)
    num_cutoffs = num_choices - 1

    # set up index for params_df
    names = beta_names + list(range(num_cutoffs))
    categories = ['beta'] * num_betas + ['cutoff'] * num_cutoffs
    index = pd.MultiIndex.from_tuples(
        zip(categories, names), names=['type', 'name'])

    # make params_df
    np.random.seed(5471)
    start_params = pd.DataFrame(index=index)
    start_params['value'] = np.hstack(
        [np.random.uniform(low=-0.5, high=0.5, size=len(x.columns)),
         np.arange(num_cutoffs) * 2])
    start_params["group"] = start_params.index.get_level_values("type")

    # make constraints
    constr = [{'loc': 'cutoff', 'type': 'increasing'}]

    return start_params, y.to_numpy().astype(int), x.to_numpy(), constr
If you have never programmed an estimator before, you might be surprised how much code is spent on processing compared to calculating the actual likelihood function. This will almost always be the case, at least if you try to make your estimator flexible and user-friendly. Estimagic is there to shorten this type of code as much as possible.
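To make the result of the processing step concrete, here is a minimal sketch of the params DataFrame layout it builds, hand-coded for the three explanatory variables of the example formula and two cutoffs (the start values are placeholders, not the random draws used above):

```python
import pandas as pd

# sketch of the params DataFrame layout: a two-level index with the
# parameter group ("beta" or "cutoff") on top and the name below
names = ["pared", "public", "gpa"] + [0, 1]
categories = ["beta"] * 3 + ["cutoff"] * 2
index = pd.MultiIndex.from_tuples(zip(categories, names), names=["type", "name"])
params = pd.DataFrame({"value": [0.1, -0.2, 0.3, 0.0, 2.0]}, index=index)

# the MultiIndex lets later code pick out each parameter group by label
betas = params.loc["beta", "value"].to_numpy()      # three slope coefficients
cutoffs = params.loc["cutoff", "value"].to_numpy()  # two increasing cutoffs
```

This label-based slicing is exactly what makes the likelihood function below so short.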
Calculate the Likelihood
Next we have to evaluate the likelihood function, given parameters and data. There are more efficient ways of calculating the likelihood for an ordered logit, but this one was chosen for brevity and readability.
[3]:
def ordered_logit_loglike(params, y, x):
    """Likelihood function of an ordered logit model."""
    # parse the parameter vector into its quantities
    beta = params.loc["beta", "value"].to_numpy()
    cutoffs = params.loc["cutoff", "value"].to_numpy()

    # calculate deterministic part of utilities
    xb = x.dot(beta)

    # evaluate likelihood
    upper_cutoffs = np.hstack([cutoffs, np.inf])[y]
    lower_cutoffs = np.hstack([-np.inf, cutoffs])[y]
    upper_cdf = stats.logistic.cdf(upper_cutoffs - xb)
    lower_cdf = stats.logistic.cdf(lower_cutoffs - xb)
    return np.log(upper_cdf - lower_cdf).sum()
The ordered_logit command
Finally we have to maximize the likelihood function.
[4]:
def ordered_logit(formula, data, dashboard=False):
    """Estimate an ordered logit model with maximum likelihood.

    Args:
        formula (str): A patsy formula.
        data (pd.DataFrame): A pandas DataFrame with the data.
        dashboard (bool): Switch on the dashboard.

    Returns:
        res: optimization result.

    """
    params, y, x, constr = ordered_logit_processing(formula, data)

    res = maximize(
        criterion=ordered_logit_loglike,
        params=params,
        algorithm='scipy_L-BFGS-B',
        constraints=constr,
        criterion_kwargs={"y": y, "x": x},
        dashboard=dashboard)
    return res
Note that estimagic has a maximize function, so you don’t have to switch the sign of your objective function to do a maximization!
Use the command
[5]:
df = pd.read_pickle("ologit.pickle")
form = "apply ~ pared + public + gpa"
[6]:
info, params = ordered_logit(form, df, dashboard=False)
[7]:
params["correct"] = [1.047664, -.0586828, .6157458, 2.203323, 4.298767]
params[["value", "correct"]]
[7]:
| type   | name   | value     | correct   |
|--------|--------|-----------|-----------|
| beta   | pared  | 1.047661  | 1.047664  |
|        | public | -0.058685 | -0.058683 |
|        | gpa    | 0.615745  | 0.615746  |
| cutoff | 0      | 2.203319  | 2.203323  |
|        | 1      | 4.298763  | 4.298767  |
This looks pretty good! I actually had to try three optimizers to get at least one differing digit, which makes the result more credible. Other optimizers like nlopt_bobyqa and nlopt_neldermead hit it on all digits!
Of course, this model is way too simple to show all the benefits of estimagic, but we wanted to keep it simple. Still, the very concise way of parsing the parameters and expressing the constraint that the cutoffs have to be increasing hints at estimagic’s usefulness in models with a hundred or more parameters and plenty of constraints!