{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bootstrap Tutorial\n", "\n", "This notebook is a tutorial on how to use the bootstrap functionality provided by estimagic. We start with the simplest possible example: calculating standard errors and confidence intervals for an OLS estimator, both without and with clustering. Then we progress to more advanced examples.\n", "\n", "We will work with the \"exercise\" example dataset from the seaborn library.\n", "\n", "The working example is a linear regression that investigates the effect of exercise time on pulse." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import estimagic as em\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import statsmodels.api as sm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare the dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>id</th><th>diet</th><th>pulse</th><th>time</th><th>kind</th><th>constant</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>low fat</td><td>85</td><td>1</td><td>rest</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>low fat</td><td>85</td><td>15</td><td>rest</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>low fat</td><td>88</td><td>30</td><td>rest</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>2</td><td>low fat</td><td>90</td><td>1</td><td>rest</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>2</td><td>low fat</td><td>92</td><td>15</td><td>rest</td><td>1</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " id diet pulse time kind constant\n", "0 1 low fat 85 1 rest 1\n", "1 1 low fat 85 15 rest 1\n", "2 1 low fat 88 30 rest 1\n", "3 2 low fat 90 1 rest 1\n", "4 2 low fat 92 15 rest 1" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = sns.load_dataset(\"exercise\", index_col=0)\n", "replacements = {\"1 min\": 1, \"15 min\": 15, \"30 min\": 30}\n", "df = df.replace({\"time\": replacements})\n", "df[\"constant\"] = 1\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Doing a very simple bootstrap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing we need is a function that calculates the bootstrap outcome, given an empirical or re-sampled dataset. The bootstrap outcome is the quantity for which you want to calculate standard errors and confidence intervals. In most applications those are just parameter estimates.\n", "\n", "In our case, we want to regress \"pulse\" on \"time\" and a constant. Our outcome function looks as follows:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def ols_fit(data):\n", " y = data[\"pulse\"]\n", " x = data[[\"constant\", \"time\"]]\n", " params = sm.OLS(y, x).fit().params\n", "\n", " return params" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, the user-specified outcome function may return any pytree (e.g. numpy.ndarray, pandas.DataFrame, dict etc.). In the example here, it returns a pandas.Series." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to calculate confidence intervals and standard errors." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(constant 90.858983\n", " time 0.151361\n", " dtype: float64,\n", " constant 96.880057\n", " time 0.654426\n", " dtype: float64)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_without_cluster = em.bootstrap(data=df, outcome=ols_fit)\n", "results_without_cluster.ci()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "constant 1.548116\n", "time 0.126410\n", "dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_without_cluster.se()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The call above is the minimum a user has to specify. It relies on the default options: taking 1_000 bootstrap draws, using the \"percentile\" bootstrap confidence interval, and running without parallelization.\n", "\n", "If, for example, we wanted to take 10_000 draws, parallelize over two cores, and use a \"bc\" type confidence interval, we would call the following:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(constant 91.309379\n", " time 0.192349\n", " dtype: float64,\n", " constant 96.286624\n", " time 0.607616\n", " dtype: float64)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_without_cluster2 = em.bootstrap(\n", "    data=df, outcome=ols_fit, n_draws=10_000, n_cores=2\n", ")\n", "\n", "results_without_cluster2.ci(ci_method=\"bc\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Doing a clustered bootstrap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cluster robust variant of the bootstrap, the original dataset is divided into clusters according to the values of some user-specified 
variable, and then clusters are drawn uniformly with replacement to create the bootstrap samples.\n", "\n", "To use the cluster robust bootstrap, we simply specify which variable to cluster by. In our example, it seems sensible to cluster on individuals, i.e. on the column \"id\" of our dataset." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "constant 1.185239\n", "time 0.101723\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_with_cluster = em.bootstrap(data=df, outcome=ols_fit, cluster_by=\"id\")\n", "\n", "results_with_cluster.se()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The estimated standard errors are smaller when we use the cluster robust bootstrap.\n", "\n", "Finally, we can compare our bootstrap results to a regression on the full sample using statsmodels' OLS function.\n", "We see that the cluster robust bootstrap yields standard error estimates very close to those of the cluster robust regression, while the regular bootstrap seems to overestimate the standard errors of both coefficients.\n", "\n", "**Note**: We would not expect the asymptotic statsmodels standard errors to be exactly the same as the bootstrapped standard errors.\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: pulse R-squared: 0.096\n", "Model: OLS Adj. R-squared: 0.086\n", "Method: Least Squares F-statistic: 13.75\n", "Date: Sat, 14 Jan 2023 Prob (F-statistic): 0.000879\n", "Time: 17:54:58 Log-Likelihood: -365.51\n", "No. 
Observations: 90 AIC: 735.0\n", "Df Residuals: 88 BIC: 740.0\n", "Df Model: 1 \n", "Covariance Type: cluster \n", "==============================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "constant 93.7611 1.205 77.837 0.000 91.400 96.122\n", "time 0.3873 0.104 3.708 0.000 0.183 0.592\n", "==============================================================================\n", "Omnibus: 20.828 Durbin-Watson: 0.827\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 26.313\n", "Skew: 1.173 Prob(JB): 1.93e-06\n", "Kurtosis: 4.231 Cond. No. 31.7\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors are robust to cluster correlation (cluster)\n" ] } ], "source": [ "y = df[\"pulse\"]\n", "x = df[[\"constant\", \"time\"]]\n", "\n", "# Fit OLS with standard errors clustered on the subject id\n", "cluster_robust_ols = sm.OLS(y, x).fit(cov_type=\"cluster\", cov_kwds={\"groups\": df[\"id\"]})\n", "print(cluster_robust_ols.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Splitting up the process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In many situations, the above procedure is enough. However, sometimes it may be important to split the bootstrapping process into smaller steps. Examples of such situations are:\n", "\n", "1. You want to look at the bootstrap estimates\n", "2. You want to do a bootstrap with a low number of draws first and add more draws later without duplicated calculations\n", "3. You want to compute summary statistics on only a subset of the available bootstrap outcomes\n", "\n", "### 1. Accessing bootstrap outcomes\n", "\n", "The bootstrap outcomes are stored in the results object you get back when calling the bootstrap function." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[constant 93.732040\n", " time 0.580057\n", " dtype: float64,\n", " constant 92.909468\n", " time 0.309198\n", " dtype: float64,\n", " constant 94.257886\n", " time 0.428624\n", " dtype: float64,\n", " constant 93.872576\n", " time 0.410508\n", " dtype: float64,\n", " constant 92.076689\n", " time 0.542170\n", " dtype: float64]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = em.bootstrap(data=df, outcome=ols_fit, seed=1234)\n", "my_outcomes = result.outcomes\n", "\n", "my_outcomes[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To further compare the cluster bootstrap to the uniform bootstrap, let's plot the bootstrap distribution of the coefficient on time. We can again see that the standard error is smaller when we cluster on the subject id." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "result_clustered = em.bootstrap(data=df, outcome=ols_fit, seed=1234, cluster_by=\"id\")\n", "my_outcomes_clustered = result_clustered.outcomes" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# clustered distribution in blue\n", "sns.histplot(\n", "    pd.DataFrame(my_outcomes_clustered)[\"time\"], kde=True, stat=\"density\", linewidth=0\n", ")\n", "\n", "# non-clustered distribution in orange\n", "sns.histplot(\n", "    pd.DataFrame(my_outcomes)[\"time\"],\n", "    kde=True,\n", "    stat=\"density\",\n", "    linewidth=0,\n", "    color=\"orange\",\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating standard errors and confidence intervals from existing bootstrap result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you've already run ``bootstrap`` once, you can simply pass the existing result object to a new call of ``bootstrap``. Estimagic reuses the existing bootstrap outcomes and only draws the missing ``n_draws`` - ``n_existing`` outcomes instead of ``n_draws`` entirely new ones. Depending on the ``n_draws`` you specified (1_000 by default), this may save considerable computation time.\n", "\n", "We can then compute confidence intervals and standard errors in the same way as before." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(constant 90.709236\n", " time 0.151193\n", " dtype: float64,\n", " constant 96.827145\n", " time 0.627507\n", " dtype: float64)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_results = em.bootstrap(\n", "    data=df,\n", "    outcome=ols_fit,\n", "    existing_result=result,\n", ")\n", "my_results.ci(ci_method=\"t\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use this to calculate confidence intervals with several methods (e.g. \"percentile\" and \"bc\") without duplicated evaluations of the bootstrap outcome function." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Extending bootstrap results with more draws\n", "\n", "For speed reasons, you will often set the number of bootstrap draws quite low at first so that you can look at the results early, and only later decide that you need more draws.\n", "\n", "In the long run, we will offer a Dashboard integration for this. For now, you can do it manually.\n", "\n", "As an example, we will take an initial sample of 500 draws. We then extend it with another 1500 draws.\n", "\n", "**Note**: It is very important to use a different random seed when you calculate the additional outcomes. Otherwise, the additional draws may simply repeat the initial ones." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(constant 90.768859\n", " time 0.137692\n", " dtype: float64,\n", " constant 96.601067\n", " time 0.607616\n", " dtype: float64)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "initial_result = em.bootstrap(data=df, outcome=ols_fit, seed=5471, n_draws=500)\n", "initial_result.ci(ci_method=\"t\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(constant 90.689112\n", " time 0.128597\n", " dtype: float64,\n", " constant 96.696522\n", " time 0.622954\n", " dtype: float64)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined_result = em.bootstrap(\n", "    data=df, outcome=ols_fit, existing_result=initial_result, seed=2365, n_draws=2000\n", ")\n", "combined_result.ci(ci_method=\"t\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Using fewer draws than the available bootstrap outcomes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have a large sample of bootstrap outcomes but want to compute summary statistics only on a subset? No problem! Estimagic has you covered. 
You can simply pass any number of ``n_draws`` to your next call of ``bootstrap``, regardless of the size of the existing sample you want to use. We already covered the case where ``n_draws`` > ``n_existing`` above, in which case estimagic draws the remaining bootstrap outcomes for you.\n", "\n", "If ``n_draws`` <= ``n_existing``, estimagic takes a random subset of the existing outcomes - and voilà! " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(constant 90.619182\n", " time 0.130242\n", " dtype: float64,\n", " constant 96.557777\n", " time 0.625645\n", " dtype: float64)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subset_result = em.bootstrap(\n", " data=df, outcome=ols_fit, existing_result=combined_result, seed=4632, n_draws=500\n", ")\n", "subset_result.ci(ci_method=\"t\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing the bootstrap samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to just access the bootstrap samples. You may do so, for example, if you want to calculate your bootstrap outcomes in parallel in a way that is not yet supported by estimagic (e.g. on a large cluster or super-computer)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>id</th><th>diet</th><th>pulse</th><th>time</th><th>kind</th><th>constant</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>88</th><td>30</td><td>no fat</td><td>111</td><td>15</td><td>running</td><td>1</td></tr>\n",
"    <tr><th>87</th><td>30</td><td>no fat</td><td>99</td><td>1</td><td>running</td><td>1</td></tr>\n",
"    <tr><th>88</th><td>30</td><td>no fat</td><td>111</td><td>15</td><td>running</td><td>1</td></tr>\n",
"    <tr><th>34</th><td>12</td><td>low fat</td><td>103</td><td>15</td><td>walking</td><td>1</td></tr>\n",
"    <tr><th>15</th><td>6</td><td>no fat</td><td>83</td><td>1</td><td>rest</td><td>1</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>78</th><td>27</td><td>no fat</td><td>100</td><td>1</td><td>running</td><td>1</td></tr>\n",
"    <tr><th>77</th><td>26</td><td>no fat</td><td>143</td><td>30</td><td>running</td><td>1</td></tr>\n",
"    <tr><th>87</th><td>30</td><td>no fat</td><td>99</td><td>1</td><td>running</td><td>1</td></tr>\n",
"    <tr><th>29</th><td>10</td><td>no fat</td><td>100</td><td>30</td><td>rest</td><td>1</td></tr>\n",
"    <tr><th>75</th><td>26</td><td>no fat</td><td>95</td><td>1</td><td>running</td><td>1</td></tr>\n",
"  </tbody>\n", "</table>\n", "<p>90 rows × 6 columns</p>\n", "</div>
" ], "text/plain": [ " id diet pulse time kind constant\n", "88 30 no fat 111 15 running 1\n", "87 30 no fat 99 1 running 1\n", "88 30 no fat 111 15 running 1\n", "34 12 low fat 103 15 walking 1\n", "15 6 no fat 83 1 rest 1\n", ".. .. ... ... ... ... ...\n", "78 27 no fat 100 1 running 1\n", "77 26 no fat 143 30 running 1\n", "87 30 no fat 99 1 running 1\n", "29 10 no fat 100 30 rest 1\n", "75 26 no fat 95 1 running 1\n", "\n", "[90 rows x 6 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from estimagic.inference import get_bootstrap_samples\n", "\n", "rng = np.random.default_rng(1234)\n", "my_samples = get_bootstrap_samples(data=df, rng=rng)\n", "my_samples[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "estimagic", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:27:35) [Clang 14.0.6 ]" }, "vscode": { "interpreter": { "hash": "e8a16b1bdcc80285313db4674a5df2a5a80c75795379c5d9f174c7c712f05b3a" } } }, "nbformat": 4, "nbformat_minor": 4 }