1"""
2Overview
3--------
5This module implements the Multiple Imputation through Chained
6Equations (MICE) approach to handling missing data in statistical data
7analyses. The approach has the following steps:
90. Impute each missing value with the mean of the observed values of
10the same variable.
121. For each variable in the data set with missing values (termed the
13'focus variable'), do the following:
151a. Fit an 'imputation model', which is a regression model for the
16focus variable, regressed on the observed and (current) imputed values
17of some or all of the other variables.
191b. Impute the missing values for the focus variable. Currently this
20imputation must use the 'predictive mean matching' (pmm) procedure.
222. Once all variables have been imputed, fit the 'analysis model' to
23the data set.
253. Repeat steps 1-2 multiple times and combine the results using a
26'combining rule' to produce point estimates of all parameters in the
27analysis model and standard errors for them.
29The imputations for each variable are based on an imputation model
30that is specified via a model class and a formula for the regression
31relationship. The default model is OLS, with a formula specifying
32main effects for all other variables.
34The MICE procedure can be used in one of two ways:
36* If the goal is only to produce imputed data sets, the MICEData class
37can be used to wrap a data frame, providing facilities for doing the
38imputation. Summary plots are available for assessing the performance
39of the imputation.
41* If the imputed data sets are to be used to fit an additional
42'analysis model', a MICE instance can be used. After specifying the
43MICE instance and running it, the results are combined using the
44`combine` method. Results and various summary plots are then
45available.
47Terminology
48-----------
50The primary goal of the analysis is usually to fit and perform
51inference using an 'analysis model'. If an analysis model is not
52specified, then imputed datasets are produced for later use.
54The MICE procedure involves a family of imputation models. There is
55one imputation model for each variable with missing values. An
56imputation model may be conditioned on all or a subset of the
57remaining variables, using main effects, transformations,
58interactions, etc. as desired.
60A 'perturbation method' is a method for setting the parameter estimate
61in an imputation model. The 'gaussian' perturbation method first fits
62the model (usually using maximum likelihood, but it could use any
63statsmodels fit procedure), then sets the parameter vector equal to a
64draw from the Gaussian approximation to the sampling distribution for
65the fit. The 'bootstrap' perturbation method sets the parameter
66vector equal to a fitted parameter vector obtained when fitting the
67conditional model to a bootstrapped version of the data set.
69Class structure
70---------------
72There are two main classes in the module:
74* 'MICEData' wraps a Pandas dataframe, incorporating information about
75 the imputation model for each variable with missing values. It can
76 be used to produce multiply imputed data sets that are to be further
77 processed or distributed to other researchers. A number of plotting
78 procedures are provided to visualize the imputation results and
79 missing data patterns. The `history_func` hook allows any features
80 of interest of the imputed data sets to be saved for further
81 analysis.
83* 'MICE' takes both a 'MICEData' object and an analysis model
84 specification. It runs the multiple imputation, fits the analysis
85 models, and combines the results to produce a `MICEResults` object.
86 The summary method of this results object can be used to see the key
87 estimands and inferential quantities.
89Notes
90-----
92By default, to conserve memory 'MICEData' saves very little
93information from one iteration to the next. The data set passed by
94the user is copied on entry, but then is over-written each time new
95imputations are produced. If using 'MICE', the fitted
96analysis models and results are saved. MICEData includes a
97`history_callback` hook that allows arbitrary information from the
98intermediate datasets to be saved for future use.
100References
101----------
103JL Schafer: 'Multiple Imputation: A Primer', Stat Methods Med Res,
1041999.
106TE Raghunathan et al.: 'A Multivariate Technique for Multiply
107Imputing Missing Values Using a Sequence of Regression Models', Survey
108Methodology, 2001.
110SAS Institute: 'Predictive Mean Matching Method for Monotone Missing
111Data', SAS 9.2 User's Guide, 2014.
113A Gelman et al.: 'Multiple Imputation with Diagnostics (mi) in R:
114Opening Windows into the Black Box', Journal of Statistical Software,
1152009.
116"""

import pandas as pd
import numpy as np
import patsy
from statsmodels.base.model import LikelihoodModelResults
from statsmodels.regression.linear_model import OLS
from collections import defaultdict
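
# An illustrative end-to-end sketch (not part of the original module) of
# the workflow described in the module docstring: wrap a data frame with
# missing values in MICEData, attach an analysis model via MICE, and
# combine the fits.  The toy dataframe, its column names, and the
# missingness pattern are assumptions made for this example.
_module_overview_example = """
    >>> import numpy as np
    >>> import pandas as pd
    >>> from statsmodels.imputation import mice
    >>> from statsmodels.regression.linear_model import OLS
    >>> df = pd.DataFrame(np.random.normal(size=(100, 3)),
    ...                   columns=['y', 'x1', 'x2'])
    >>> df.loc[df.sample(frac=0.1).index, 'x1'] = np.nan
    >>> imp = mice.MICEData(df)                     # step 0: initial fill-in
    >>> m = mice.MICE('y ~ x1 + x2', OLS, imp)      # analysis model
    >>> res = m.fit(n_burnin=10, n_imputations=10)  # steps 1-3
    >>> print(res.summary())"""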

_mice_data_example_1 = """
    >>> imp = mice.MICEData(data)
    >>> imp.set_imputer('x1', formula='x2 + np.square(x2) + x3')
    >>> for j in range(20):
    ...     imp.update_all()
    ...     imp.data.to_csv('data%02d.csv' % j)"""

_mice_data_example_2 = """
    >>> imp = mice.MICEData(data)
    >>> j = 0
    >>> for data in imp:
    ...     imp.data.to_csv('data%02d.csv' % j)
    ...     j += 1"""


class PatsyFormula(object):
    """
    A simple wrapper for a string to be interpreted as a Patsy formula.
    """
    def __init__(self, formula):
        self.formula = "0 + " + formula
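

# A hedged usage sketch (not from the original source): `PatsyFormula`
# wraps a keyword value so that, inside `_process_kwds`, it is evaluated
# through Patsy against the current imputed data and subset to the
# relevant rows.  The survival-model setup below (PHReg with a 'death'
# status column) is an illustrative assumption, not part of this module.
_patsy_formula_example = """
    >>> from statsmodels.duration.hazard_regression import PHReg
    >>> imp = mice.MICEData(data)
    >>> imp.set_imputer('time', 'x1 + x2', model_class=PHReg,
    ...                 init_kwds={'status': mice.PatsyFormula('death')})"""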


class MICEData(object):

    __doc__ = """\
    Wrap a data set to allow missing data handling with MICE.

    Parameters
    ----------
    data : Pandas data frame
        The data set, which is copied internally.
    perturbation_method : str
        The default perturbation method.
    k_pmm : int
        The number of nearest neighbors to use during predictive mean
        matching.  Can also be specified in `fit`.
    history_callback : function
        A function that is called after each complete imputation
        cycle.  The return value is appended to `history`.  The
        MICEData object is passed as the sole argument to
        `history_callback`.

    Examples
    --------
    Draw 20 imputations from a data set called `data` and save them in
    separate files with filename pattern `dataXX.csv`.  The variables
    other than `x1` are imputed using linear models fit with OLS, with
    mean structures containing main effects of all other variables in
    `data`.  The variable named `x1` has a conditional mean structure
    that includes an additional term for x2^2.
    %(_mice_data_example_1)s

    Impute using default models, using the MICEData object as an
    iterator.
    %(_mice_data_example_2)s

    Notes
    -----
    Allowed perturbation methods are 'gaussian' (the model parameters
    are set to a draw from the Gaussian approximation to the posterior
    distribution), and 'boot' (the model parameters are set to the
    estimated values obtained when fitting a bootstrapped version of
    the data set).

    `history_callback` can be implemented to have side effects such as
    saving the current imputed data set to disk.
    """ % {'_mice_data_example_1': _mice_data_example_1,
           '_mice_data_example_2': _mice_data_example_2}

    def __init__(self, data, perturbation_method='gaussian',
                 k_pmm=20, history_callback=None):

        if data.columns.dtype != np.dtype('O'):
            msg = "MICEData data column names should be string type"
            raise ValueError(msg)

        self.regularized = dict()

        # Drop observations where all variables are missing.  This
        # also has the effect of copying the data frame.
        self.data = data.dropna(how='all').reset_index(drop=True)

        self.history_callback = history_callback
        self.history = []
        self.predict_kwds = {}

        # Assign the same perturbation method for all variables.
        # Can be overridden when calling 'set_imputer'.
        self.perturbation_method = defaultdict(lambda:
                                               perturbation_method)

        # Map from variable name to indices of observed/missing
        # values.
        self.ix_obs = {}
        self.ix_miss = {}
        for col in self.data.columns:
            ix_obs, ix_miss = self._split_indices(self.data[col])
            self.ix_obs[col] = ix_obs
            self.ix_miss[col] = ix_miss

        # Most recent model instance and results instance for each variable.
        self.models = {}
        self.results = {}

        # Map from variable names to the conditional formula.
        self.conditional_formula = {}

        # Map from variable names to init/fit args of the conditional
        # models.
        self.init_kwds = defaultdict(lambda: dict())
        self.fit_kwds = defaultdict(lambda: dict())

        # Map from variable names to the model class.
        self.model_class = {}

        # Map from variable names to most recent params update.
        self.params = {}

        # Set default imputers.
        for vname in data.columns:
            self.set_imputer(vname)

        # The order in which variables are imputed in each cycle.
        # Impute variables with the fewest missing values first.
        vnames = list(data.columns)
        nmiss = [len(self.ix_miss[v]) for v in vnames]
        nmiss = np.asarray(nmiss)
        ii = np.argsort(nmiss)
        ii = ii[sum(nmiss == 0):]
        self._cycle_order = [vnames[i] for i in ii]

        self._initial_imputation()

        self.k_pmm = k_pmm

    def next_sample(self):
        """
        Returns the next imputed dataset in the imputation process.

        Returns
        -------
        data : array_like
            An imputed dataset from the MICE chain.

        Notes
        -----
        `MICEData` does not have a `skip` parameter.  Consecutive
        values returned by `next_sample` are immediately consecutive
        in the imputation chain.

        The returned value is a reference to the data attribute of
        the class and should be copied before making any changes.
        """

        self.update_all(1)
        return self.data

    def _initial_imputation(self):
        """
        Use a PMM-like procedure for initial imputed values.

        For each variable, missing values are imputed as the observed
        value that is closest to the mean over all observed values.
        """

        for col in self.data.columns:
            di = self.data[col] - self.data[col].mean()
            di = np.abs(di)
            ix = di.idxmin()
            imp = self.data[col].loc[ix]
            self.data[col].fillna(imp, inplace=True)

    def _split_indices(self, vec):
        null = pd.isnull(vec)
        ix_obs = np.flatnonzero(~null)
        ix_miss = np.flatnonzero(null)
        if len(ix_obs) == 0:
            raise ValueError("variable to be imputed has no observed values")
        return ix_obs, ix_miss

    def set_imputer(self, endog_name, formula=None, model_class=None,
                    init_kwds=None, fit_kwds=None, predict_kwds=None,
                    k_pmm=20, perturbation_method=None, regularized=False):
        """
        Specify the imputation process for a single variable.

        Parameters
        ----------
        endog_name : str
            Name of the variable to be imputed.
        formula : str
            Conditional formula for imputation.  Defaults to a formula
            with main effects for all other variables in the dataset.
            The formula should only include an expression for the mean
            structure, e.g. use 'x1 + x2' not 'x4 ~ x1 + x2'.
        model_class : statsmodels model
            Conditional model for imputation.  Defaults to OLS.  See
            below for more information.
        init_kwds : dict-like
            Keyword arguments passed to the model init method.
        fit_kwds : dict-like
            Keyword arguments passed to the model fit method.
        predict_kwds : dict-like
            Keyword arguments passed to the model predict method.
        k_pmm : int
            Determines the number of neighboring observations from
            which to randomly sample when using predictive mean
            matching.
        perturbation_method : str
            Either 'gaussian' or 'boot'.  Determines the method for
            perturbing parameters in the imputation model.  If None,
            uses the default specified at class initialization.
        regularized : bool
            If True, `fit_regularized` rather than `fit` is called
            when fitting imputation models for this variable.  In that
            case, `perturbation_method` must be set to 'boot'.

        Notes
        -----
        The model class must meet the following conditions:
            * A model must have a `fit` method that returns an object.
            * The object returned from `fit` must have a `params` attribute
              that is an array-like object.
            * The object returned from `fit` must have a `cov_params` method
              that returns a square array-like object.
            * The model must have a `predict` method.
        """

        if formula is None:
            main_effects = [x for x in self.data.columns
                            if x != endog_name]
            fml = endog_name + " ~ " + " + ".join(main_effects)
            self.conditional_formula[endog_name] = fml
        else:
            fml = endog_name + " ~ " + formula
            self.conditional_formula[endog_name] = fml

        if model_class is None:
            self.model_class[endog_name] = OLS
        else:
            self.model_class[endog_name] = model_class

        if init_kwds is not None:
            self.init_kwds[endog_name] = init_kwds

        if fit_kwds is not None:
            self.fit_kwds[endog_name] = fit_kwds

        if predict_kwds is not None:
            self.predict_kwds[endog_name] = predict_kwds

        if perturbation_method is not None:
            self.perturbation_method[endog_name] = perturbation_method

        self.k_pmm = k_pmm
        self.regularized[endog_name] = regularized

    def _store_changes(self, col, vals):
        """
        Fill in dataset with imputed values.

        Parameters
        ----------
        col : str
            Name of variable to be filled in.
        vals : ndarray
            Array of imputed values to use for filling-in missing values.
        """

        ix = self.ix_miss[col]
        if len(ix) > 0:
            self.data.iloc[ix, self.data.columns.get_loc(col)] = np.atleast_1d(vals)

    def update_all(self, n_iter=1):
        """
        Perform a specified number of MICE iterations.

        Parameters
        ----------
        n_iter : int
            The number of updates to perform.  Only the result of the
            final update will be available.

        Notes
        -----
        The imputed values are stored in the class attribute
        `self.data`.  After each complete cycle, `history_callback`
        (if provided) is called; a small example follows this method.
        """

        for k in range(n_iter):
            for vname in self._cycle_order:
                self.update(vname)

            if self.history_callback is not None:
                hv = self.history_callback(self)
                self.history.append(hv)
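
    # A hedged illustration (not part of the original source) of the
    # `history_callback` hook wired into `update_all` above: record the
    # column means of the working data after every completed cycle.
    # The lambda and the call pattern shown are illustrative assumptions.
    _history_callback_example = """
    >>> imp = mice.MICEData(data,
    ...                     history_callback=lambda x: x.data.mean())
    >>> imp.update_all(3)
    >>> means = imp.history  # one entry per completed cycle"""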

    def get_split_data(self, vname):
        """
        Return endog and exog for imputation of a given variable.

        Parameters
        ----------
        vname : str
            The variable for which the split data is returned.

        Returns
        -------
        endog_obs : DataFrame
            Observed values of the variable to be imputed.
        exog_obs : DataFrame
            Current values of the predictors where the variable to be
            imputed is observed.
        exog_miss : DataFrame
            Current values of the predictors where the variable to be
            imputed is missing.
        predict_obs_kwds : dict-like
            The predict keyword arguments for `vname`, processed
            through Patsy and subset to the observed cases.
        predict_miss_kwds : dict-like
            The predict keyword arguments for `vname`, processed
            through Patsy and subset to the missing cases.
        """

        formula = self.conditional_formula[vname]
        endog, exog = patsy.dmatrices(formula, self.data,
                                      return_type="dataframe")

        # Rows with observed endog
        ixo = self.ix_obs[vname]
        endog_obs = np.asarray(endog.iloc[ixo])
        exog_obs = np.asarray(exog.iloc[ixo, :])

        # Rows with missing endog
        ixm = self.ix_miss[vname]
        exog_miss = np.asarray(exog.iloc[ixm, :])

        predict_obs_kwds = {}
        if vname in self.predict_kwds:
            kwds = self.predict_kwds[vname]
            predict_obs_kwds = self._process_kwds(kwds, ixo)

        predict_miss_kwds = {}
        if vname in self.predict_kwds:
            kwds = self.predict_kwds[vname]
            predict_miss_kwds = self._process_kwds(kwds, ixm)

        return (endog_obs, exog_obs, exog_miss, predict_obs_kwds,
                predict_miss_kwds)

    def _process_kwds(self, kwds, ix):
        kwds = kwds.copy()
        for k in kwds:
            v = kwds[k]
            if isinstance(v, PatsyFormula):
                mat = patsy.dmatrix(v.formula, self.data,
                                    return_type="dataframe")
                mat = np.asarray(mat)[ix, :]
                if mat.shape[1] == 1:
                    mat = mat[:, 0]
                kwds[k] = mat
        return kwds

    def get_fitting_data(self, vname):
        """
        Return the data needed to fit a model for imputation.

        The data is used to impute variable `vname`, and therefore
        only includes cases for which `vname` is observed.

        Values of type `PatsyFormula` in `init_kwds` or `fit_kwds` are
        processed through Patsy and subset to align with the model's
        endog and exog.

        Parameters
        ----------
        vname : str
            The variable for which the fitting data is returned.

        Returns
        -------
        endog : DataFrame
            Observed values of `vname`.
        exog : DataFrame
            Regression design matrix for imputing `vname`.
        init_kwds : dict-like
            The init keyword arguments for `vname`, processed through
            Patsy as required.
        fit_kwds : dict-like
            The fit keyword arguments for `vname`, processed through
            Patsy as required.
        """

        # Rows with observed endog
        ix = self.ix_obs[vname]

        formula = self.conditional_formula[vname]
        endog, exog = patsy.dmatrices(formula, self.data,
                                      return_type="dataframe")

        endog = np.asarray(endog.iloc[ix, 0])
        exog = np.asarray(exog.iloc[ix, :])

        init_kwds = self._process_kwds(self.init_kwds[vname], ix)
        fit_kwds = self._process_kwds(self.fit_kwds[vname], ix)

        return endog, exog, init_kwds, fit_kwds

    def plot_missing_pattern(self, ax=None, row_order="pattern",
                             column_order="pattern",
                             hide_complete_rows=False,
                             hide_complete_columns=False,
                             color_row_patterns=True):
        """
        Generate an image showing the missing data pattern.

        Parameters
        ----------
        ax : AxesSubplot
            Axes on which to draw the plot.
        row_order : str
            The method for ordering the rows.  Must be one of
            'pattern', 'proportion', or 'raw'.
        column_order : str
            The method for ordering the columns.  Must be one of
            'pattern', 'proportion', or 'raw'.
        hide_complete_rows : bool
            If True, rows with no missing values are not drawn.
        hide_complete_columns : bool
            If True, columns with no missing values are not drawn.
        color_row_patterns : bool
            If True, color the unique row patterns, otherwise use grey
            and white as colors.

        Returns
        -------
        A figure containing a plot of the missing data pattern.
        """

        # Create an indicator matrix for missing values.
        miss = np.zeros(self.data.shape)
        cols = self.data.columns
        for j, col in enumerate(cols):
            ix = self.ix_miss[col]
            miss[ix, j] = 1

        # Order the columns as requested
        if column_order == "proportion":
            ix = np.argsort(miss.mean(0))
        elif column_order == "pattern":
            cv = np.cov(miss.T)
            u, s, vt = np.linalg.svd(cv, 0)
            ix = np.argsort(u[:, 0])
        elif column_order == "raw":
            ix = np.arange(len(cols))
        else:
            raise ValueError(
                column_order + " is not an allowed value for `column_order`.")
        miss = miss[:, ix]
        cols = [cols[i] for i in ix]

        # Order the rows as requested
        if row_order == "proportion":
            ix = np.argsort(miss.mean(1))
        elif row_order == "pattern":
            # Encode each row's missingness pattern as an integer key.
            x = 2**np.arange(miss.shape[1])
            rky = np.dot(miss, x)
            ix = np.argsort(rky)
        elif row_order == "raw":
            ix = np.arange(miss.shape[0])
        else:
            raise ValueError(
                row_order + " is not an allowed value for `row_order`.")
        miss = miss[ix, :]

        if hide_complete_rows:
            ix = np.flatnonzero((miss == 1).any(1))
            miss = miss[ix, :]

        if hide_complete_columns:
            ix = np.flatnonzero((miss == 1).any(0))
            miss = miss[:, ix]
            cols = [cols[i] for i in ix]

        from statsmodels.graphics import utils as gutils
        from matplotlib.colors import LinearSegmentedColormap

        if ax is None:
            fig, ax = gutils.create_mpl_ax(ax)
        else:
            fig = ax.get_figure()

        if color_row_patterns:
            x = 2**np.arange(miss.shape[1])
            rky = np.dot(miss, x)
            _, rcol = np.unique(rky, return_inverse=True)
            miss *= 1 + rcol[:, None]
            ax.imshow(miss, aspect="auto", interpolation="nearest",
                      cmap='gist_ncar_r')
        else:
            cmap = LinearSegmentedColormap.from_list("_",
                                                     ["white", "darkgrey"])
            ax.imshow(miss, aspect="auto", interpolation="nearest",
                      cmap=cmap)

        ax.set_ylabel("Cases")
        ax.set_xticks(range(len(cols)))
        ax.set_xticklabels(cols, rotation=90)

        return fig

    def plot_bivariate(self, col1_name, col2_name,
                       lowess_args=None, lowess_min_n=40,
                       jitter=None, plot_points=True, ax=None):
        """
        Plot observed and imputed values for two variables.

        Displays a scatterplot of one variable against another.  The
        points are colored according to whether the values are
        observed or imputed.

        Parameters
        ----------
        col1_name : str
            The variable to be plotted on the horizontal axis.
        col2_name : str
            The variable to be plotted on the vertical axis.
        lowess_args : dictionary
            A dictionary of dictionaries, keys are 'ii', 'io', 'oi'
            and 'oo', where 'o' denotes 'observed' and 'i' denotes
            'imputed'.  See Notes for details.
        lowess_min_n : int
            Minimum sample size to plot a lowess fit.
        jitter : float or tuple
            Standard deviation for jittering points in the plot.
            Either a single scalar applied to both axes, or a tuple
            containing x-axis jitter and y-axis jitter, respectively.
        plot_points : bool
            If True, the data points are plotted.
        ax : AxesSubplot
            Axes on which to plot, created if not provided.

        Returns
        -------
        The matplotlib figure on which the plot is drawn.
        """

        from statsmodels.graphics import utils as gutils
        from statsmodels.nonparametric.smoothers_lowess import lowess

        if lowess_args is None:
            lowess_args = {}

        if ax is None:
            fig, ax = gutils.create_mpl_ax(ax)
        else:
            fig = ax.get_figure()

        ax.set_position([0.1, 0.1, 0.7, 0.8])

        ix1i = self.ix_miss[col1_name]
        ix1o = self.ix_obs[col1_name]
        ix2i = self.ix_miss[col2_name]
        ix2o = self.ix_obs[col2_name]

        ix_ii = np.intersect1d(ix1i, ix2i)
        ix_io = np.intersect1d(ix1i, ix2o)
        ix_oi = np.intersect1d(ix1o, ix2i)
        ix_oo = np.intersect1d(ix1o, ix2o)

        vec1 = np.asarray(self.data[col1_name])
        vec2 = np.asarray(self.data[col2_name])

        if jitter is not None:
            if np.isscalar(jitter):
                jitter = (jitter, jitter)
            vec1 += jitter[0] * np.random.normal(size=len(vec1))
            vec2 += jitter[1] * np.random.normal(size=len(vec2))

        # Plot the points
        keys = ['oo', 'io', 'oi', 'ii']
        lak = {'i': 'imp', 'o': 'obs'}
        ixs = {'ii': ix_ii, 'io': ix_io, 'oi': ix_oi, 'oo': ix_oo}
        color = {'oo': 'grey', 'ii': 'red', 'io': 'orange',
                 'oi': 'lime'}
        if plot_points:
            for ky in keys:
                ix = ixs[ky]
                lab = lak[ky[0]] + "/" + lak[ky[1]]
                ax.plot(vec1[ix], vec2[ix], 'o', color=color[ky],
                        label=lab, alpha=0.6)

        # Plot the lowess fits
        for ky in keys:
            ix = ixs[ky]
            if len(ix) < lowess_min_n:
                continue

            if ky in lowess_args:
                la = lowess_args[ky]
            else:
                la = {}

            lfit = lowess(vec2[ix], vec1[ix], **la)
            if plot_points:
                ax.plot(lfit[:, 0], lfit[:, 1], '-', color=color[ky],
                        alpha=0.6, lw=4)
            else:
                lab = lak[ky[0]] + "/" + lak[ky[1]]
                ax.plot(lfit[:, 0], lfit[:, 1], '-', color=color[ky],
                        alpha=0.6, lw=4, label=lab)

        ha, la = ax.get_legend_handles_labels()
        pad = 0.0001 if plot_points else 0.5
        leg = fig.legend(ha, la, 'center right', numpoints=1,
                         handletextpad=pad)
        leg.draw_frame(False)

        ax.set_xlabel(col1_name)
        ax.set_ylabel(col2_name)

        return fig

    def plot_fit_obs(self, col_name, lowess_args=None,
                     lowess_min_n=40, jitter=None,
                     plot_points=True, ax=None):
        """
        Plot fitted versus imputed or observed values as a scatterplot.

        Parameters
        ----------
        col_name : str
            The variable to be plotted on the horizontal axis.
        lowess_args : dict-like
            Keyword arguments passed to the lowess fit.  A dictionary
            of dictionaries, keys are 'o' and 'i' denoting 'observed'
            and 'imputed', respectively.
        lowess_min_n : int
            Minimum sample size to plot a lowess fit.
        jitter : float or tuple
            Standard deviation for jittering points in the plot.
            Either a single scalar applied to both axes, or a tuple
            containing x-axis jitter and y-axis jitter, respectively.
        plot_points : bool
            If True, the data points are plotted.
        ax : AxesSubplot
            Axes on which to plot, created if not provided.

        Returns
        -------
        The matplotlib figure on which the plot is drawn.
        """

        from statsmodels.graphics import utils as gutils
        from statsmodels.nonparametric.smoothers_lowess import lowess

        if lowess_args is None:
            lowess_args = {}

        if ax is None:
            fig, ax = gutils.create_mpl_ax(ax)
        else:
            fig = ax.get_figure()

        ax.set_position([0.1, 0.1, 0.7, 0.8])

        ixi = self.ix_miss[col_name]
        ixo = self.ix_obs[col_name]

        vec1 = np.asarray(self.data[col_name])

        # Fitted values
        formula = self.conditional_formula[col_name]
        endog, exog = patsy.dmatrices(formula, self.data,
                                      return_type="dataframe")
        results = self.results[col_name]
        vec2 = results.predict(exog=exog)
        vec2 = self._get_predicted(vec2)

        if jitter is not None:
            if np.isscalar(jitter):
                jitter = (jitter, jitter)
            vec1 += jitter[0] * np.random.normal(size=len(vec1))
            vec2 += jitter[1] * np.random.normal(size=len(vec2))

        # Plot the points
        keys = ['o', 'i']
        ixs = {'o': ixo, 'i': ixi}
        lak = {'o': 'obs', 'i': 'imp'}
        color = {'o': 'orange', 'i': 'lime'}
        if plot_points:
            for ky in keys:
                ix = ixs[ky]
                ax.plot(vec1[ix], vec2[ix], 'o', color=color[ky],
                        label=lak[ky], alpha=0.6)

        # Plot the lowess fits
        for ky in keys:
            ix = ixs[ky]
            if len(ix) < lowess_min_n:
                continue

            if ky in lowess_args:
                la = lowess_args[ky]
            else:
                la = {}

            lfit = lowess(vec2[ix], vec1[ix], **la)
            ax.plot(lfit[:, 0], lfit[:, 1], '-', color=color[ky],
                    alpha=0.6, lw=4, label=lak[ky])

        ha, la = ax.get_legend_handles_labels()
        leg = fig.legend(ha, la, 'center right', numpoints=1)
        leg.draw_frame(False)

        ax.set_xlabel(col_name + " observed or imputed")
        ax.set_ylabel(col_name + " fitted")

        return fig

    def plot_imputed_hist(self, col_name, ax=None, imp_hist_args=None,
                          obs_hist_args=None, all_hist_args=None):
        """
        Display imputed values for one variable as a histogram.

        Parameters
        ----------
        col_name : str
            The name of the variable to be plotted.
        ax : AxesSubplot
            An axes on which to draw the histograms.  If not provided,
            one is created.
        imp_hist_args : dict
            Keyword arguments to be passed to pyplot.hist when
            creating the histogram for imputed values.
        obs_hist_args : dict
            Keyword arguments to be passed to pyplot.hist when
            creating the histogram for observed values.
        all_hist_args : dict
            Keyword arguments to be passed to pyplot.hist when
            creating the histogram for all values.

        Returns
        -------
        The matplotlib figure on which the histograms were drawn.
        """

        from statsmodels.graphics import utils as gutils

        if imp_hist_args is None:
            imp_hist_args = {}
        if obs_hist_args is None:
            obs_hist_args = {}
        if all_hist_args is None:
            all_hist_args = {}

        if ax is None:
            fig, ax = gutils.create_mpl_ax(ax)
        else:
            fig = ax.get_figure()

        ax.set_position([0.1, 0.1, 0.7, 0.8])

        ixm = self.ix_miss[col_name]
        ixo = self.ix_obs[col_name]

        imp = self.data[col_name].iloc[ixm]
        obs = self.data[col_name].iloc[ixo]

        for di in imp_hist_args, obs_hist_args, all_hist_args:
            if 'histtype' not in di:
                di['histtype'] = 'step'

        ha, la = [], []
        if len(imp) > 0:
            h = ax.hist(np.asarray(imp), **imp_hist_args)
            ha.append(h[-1][0])
            la.append("Imp")
        h1 = ax.hist(np.asarray(obs), **obs_hist_args)
        h2 = ax.hist(np.asarray(self.data[col_name]), **all_hist_args)
        ha.extend([h1[-1][0], h2[-1][0]])
        la.extend(["Obs", "All"])

        leg = fig.legend(ha, la, 'center right', numpoints=1)
        leg.draw_frame(False)

        ax.set_xlabel(col_name)
        ax.set_ylabel("Frequency")

        return fig

    # Try to identify any auxiliary arrays (e.g. status vector in
    # PHReg) that need to be bootstrapped along with exog and endog.
    def _boot_kwds(self, kwds, rix):

        for k in kwds:
            v = kwds[k]

            # This is only relevant for ndarrays
            if not isinstance(v, np.ndarray):
                continue

            # Handle 1d vectors
            if (v.ndim == 1) and (v.shape[0] == len(rix)):
                kwds[k] = v[rix]

            # Handle 2d arrays
            if (v.ndim == 2) and (v.shape[0] == len(rix)):
                kwds[k] = v[rix, :]

        return kwds

    def _perturb_bootstrap(self, vname):
        """
        Perturbs the model's parameters using a bootstrap.
        """

        endog, exog, init_kwds, fit_kwds = self.get_fitting_data(vname)

        m = len(endog)
        rix = np.random.randint(0, m, m)
        endog = endog[rix]
        exog = exog[rix, :]

        init_kwds = self._boot_kwds(init_kwds, rix)
        fit_kwds = self._boot_kwds(fit_kwds, rix)

        klass = self.model_class[vname]
        self.models[vname] = klass(endog, exog, **init_kwds)

        if vname in self.regularized and self.regularized[vname]:
            self.results[vname] = (
                self.models[vname].fit_regularized(**fit_kwds))
        else:
            self.results[vname] = self.models[vname].fit(**fit_kwds)

        self.params[vname] = self.results[vname].params

    def _perturb_gaussian(self, vname):
        """
        Gaussian perturbation of model parameters.

        The normal approximation to the sampling distribution of the
        parameter estimates is used to define the mean and covariance
        structure of the perturbation distribution.
        """

        endog, exog, init_kwds, fit_kwds = self.get_fitting_data(vname)

        klass = self.model_class[vname]
        self.models[vname] = klass(endog, exog, **init_kwds)
        self.results[vname] = self.models[vname].fit(**fit_kwds)

        cov = self.results[vname].cov_params()
        mu = self.results[vname].params
        self.params[vname] = np.random.multivariate_normal(mean=mu, cov=cov)

    def perturb_params(self, vname):

        if self.perturbation_method[vname] == "gaussian":
            self._perturb_gaussian(vname)
        elif self.perturbation_method[vname] == "boot":
            self._perturb_bootstrap(vname)
        else:
            raise ValueError("unknown perturbation method")

    def impute(self, vname):
        # Wrap this in case we later add additional imputation
        # methods.
        self.impute_pmm(vname)

    def update(self, vname):
        """
        Impute missing values for a single variable.

        This is a two-step process in which first the parameters are
        perturbed, then the missing values are re-imputed.

        Parameters
        ----------
        vname : str
            The name of the variable to be updated.
        """

        self.perturb_params(vname)
        self.impute(vname)

    # work-around for inconsistent predict return values
    def _get_predicted(self, obj):

        if isinstance(obj, np.ndarray):
            return obj
        elif isinstance(obj, pd.Series):
            return obj.values
        elif hasattr(obj, 'predicted_values'):
            return obj.predicted_values
        else:
            raise ValueError(
                "cannot obtain predicted values from %s" % obj.__class__)

    def impute_pmm(self, vname):
        """
        Use predictive mean matching to impute missing values.

        Notes
        -----
        The `perturb_params` method must be called first to define the
        model.  A standalone sketch of the matching step follows this
        method.
        """

        k_pmm = self.k_pmm

        endog_obs, exog_obs, exog_miss, predict_obs_kwds, predict_miss_kwds = (
            self.get_split_data(vname))

        # Predict imputed variable for both missing and non-missing
        # observations
        model = self.models[vname]
        pendog_obs = model.predict(self.params[vname], exog_obs,
                                   **predict_obs_kwds)
        pendog_miss = model.predict(self.params[vname], exog_miss,
                                    **predict_miss_kwds)

        pendog_obs = self._get_predicted(pendog_obs)
        pendog_miss = self._get_predicted(pendog_miss)

        # Jointly sort the observed and predicted endog values for the
        # cases with observed values.
        ii = np.argsort(pendog_obs)
        endog_obs = endog_obs[ii]
        pendog_obs = pendog_obs[ii]

        # Find the closest match to the predicted endog values for
        # cases with missing endog values.
        ix = np.searchsorted(pendog_obs, pendog_miss)

        # Get the indices for the closest k_pmm values on
        # either side of the closest index.
        ixm = ix[:, None] + np.arange(-k_pmm, k_pmm)[None, :]

        # Account for boundary effects
        msk = np.nonzero((ixm < 0) | (ixm > len(endog_obs) - 1))
        ixm = np.clip(ixm, 0, len(endog_obs) - 1)

        # Get the distances
        dx = pendog_miss[:, None] - pendog_obs[ixm]
        dx = np.abs(dx)
        dx[msk] = np.inf

        # Closest positions in ix, row-wise.
        dxi = np.argsort(dx, 1)[:, 0:k_pmm]

        # Choose a column for each row.
        ir = np.random.randint(0, k_pmm, len(pendog_miss))

        # Unwind the indices
        jj = np.arange(dxi.shape[0])
        ix = dxi[(jj, ir)]
        iz = ixm[(jj, ix)]

        imputed_miss = np.array(endog_obs[iz]).squeeze()
        self._store_changes(vname, imputed_miss)
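

# A minimal, self-contained sketch (not part of the original module) of
# the matching step that `impute_pmm` vectorizes above: for each
# predicted value at a missing case, find the k_pmm closest predicted
# values among the observed cases and draw one of their observed endog
# values at random.  The argument names mirror those in `impute_pmm`,
# but this helper itself is illustrative only.
def _pmm_match_sketch(endog_obs, pendog_obs, pendog_miss, k_pmm=20):
    imputed = np.empty(len(pendog_miss))
    for j, p in enumerate(pendog_miss):
        # Distances from this predicted value to all observed predictions.
        dx = np.abs(np.asarray(pendog_obs) - p)
        # Indices of the k_pmm nearest observed cases (the donor pool).
        donors = np.argsort(dx)[:k_pmm]
        # Impute with the observed endog value of a randomly chosen donor.
        imputed[j] = np.asarray(endog_obs)[np.random.choice(donors)]
    return imputed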


_mice_example_1 = """
    >>> imp = mice.MICEData(data)
    >>> fml = 'y ~ x1 + x2 + x3 + x4'
    >>> mice = mice.MICE(fml, sm.OLS, imp)
    >>> results = mice.fit(10, 10)
    >>> print(results.summary())

    .. literalinclude:: ../plots/mice_example_1.txt
    """

_mice_example_2 = """
    >>> imp = mice.MICEData(data)
    >>> fml = 'y ~ x1 + x2 + x3 + x4'
    >>> mice = mice.MICE(fml, sm.OLS, imp)
    >>> results = []
    >>> for k in range(10):
    ...     x = mice.next_sample()
    ...     results.append(x)
    """


class MICE(object):

    __doc__ = """\
    Multiple Imputation with Chained Equations.

    This class can be used to fit most statsmodels models to data sets
    with missing values using the 'multiple imputation with chained
    equations' (MICE) approach.

    Parameters
    ----------
    model_formula : str
        The model formula to be fit to the imputed data sets.  This
        formula is for the 'analysis model'.
    model_class : statsmodels model
        The model to be fit to the imputed data sets.  This model
        class is for the 'analysis model'.
    data : MICEData instance
        MICEData object containing the data set for which
        missing values will be imputed.
    n_skip : int
        The number of imputed datasets to skip between consecutive
        imputed datasets that are used for analysis.
    init_kwds : dict-like
        Dictionary of keyword arguments passed to the init method
        of the analysis model.
    fit_kwds : dict-like
        Dictionary of keyword arguments passed to the fit method
        of the analysis model.

    Examples
    --------
    Run all MICE steps and obtain results:
    %(mice_example_1)s

    Obtain a sequence of fitted analysis models without combining
    to obtain summary::
    %(mice_example_2)s
    """ % {'mice_example_1': _mice_example_1,
           'mice_example_2': _mice_example_2}

    def __init__(self, model_formula, model_class, data, n_skip=3,
                 init_kwds=None, fit_kwds=None):

        self.model_formula = model_formula
        self.model_class = model_class
        self.n_skip = n_skip
        self.data = data
        self.results_list = []

        self.init_kwds = init_kwds if init_kwds is not None else {}
        self.fit_kwds = fit_kwds if fit_kwds is not None else {}

    def next_sample(self):
        """
        Perform one complete MICE iteration.

        A single MICE iteration updates all missing values using their
        respective imputation models, then fits the analysis model to
        the imputed data.

        Returns
        -------
        params : array_like
            The model parameters for the analysis model.

        Notes
        -----
        This function fits the analysis model and returns its
        parameter estimate.  The parameter vector is not stored by the
        class and is not used in any subsequent calls to `combine`.
        Use `fit` to run all MICE steps together and obtain summary
        results.

        The complete cycle of missing value imputation followed by
        fitting the analysis model is repeated `n_skip + 1` times and
        the analysis model parameters from the final fit are returned.
        """

        # Impute missing values
        self.data.update_all(self.n_skip + 1)
        start_params = None
        if len(self.results_list) > 0:
            start_params = self.results_list[-1].params

        # Fit the analysis model.
        model = self.model_class.from_formula(self.model_formula,
                                              self.data.data,
                                              **self.init_kwds)
        self.fit_kwds.update({"start_params": start_params})
        result = model.fit(**self.fit_kwds)

        return result

    def fit(self, n_burnin=10, n_imputations=10):
        """
        Fit a model using MICE.

        Parameters
        ----------
        n_burnin : int
            The number of burn-in cycles to skip.
        n_imputations : int
            The number of data sets to impute.
        """

        # Run without fitting the analysis model
        self.data.update_all(n_burnin)

        for j in range(n_imputations):
            result = self.next_sample()
            self.results_list.append(result)

        self.endog_names = result.model.endog_names
        self.exog_names = result.model.exog_names

        return self.combine()

    def combine(self):
        """
        Pools MICE imputation results.

        This method can only be used after the `fit` method has been
        called.  Returns estimates and standard errors of the analysis
        model parameters.

        Returns a MICEResults instance.  A standalone sketch of the
        combining rule used here follows this class.
        """

        # Extract a few things from the models that were fit to
        # imputed data sets.
        params_list = []
        cov_within = 0.
        scale_list = []
        for results in self.results_list:
            results_uw = results._results
            params_list.append(results_uw.params)
            cov_within += results_uw.cov_params()
            scale_list.append(results.scale)
        params_list = np.asarray(params_list)
        scale_list = np.asarray(scale_list)

        # The estimated parameters for the MICE analysis
        params = params_list.mean(0)

        # The average of the within-imputation covariances
        cov_within /= len(self.results_list)

        # The between-imputation covariance
        cov_between = np.cov(params_list.T)

        # The estimated covariance matrix for the MICE analysis
        f = 1 + 1 / float(len(self.results_list))
        cov_params = cov_within + f * cov_between

        # Fraction of missing information
        fmi = f * np.diag(cov_between) / np.diag(cov_params)

        # Set up a results instance
        scale = np.mean(scale_list)
        results = MICEResults(self, params, cov_params / scale)
        results.scale = scale
        results.frac_miss_info = fmi
        results.exog_names = self.exog_names
        results.endog_names = self.endog_names
        results.model_class = self.model_class

        return results
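

# A compact, standalone restatement (an illustrative sketch, not part of
# the original module) of Rubin's combining rule implemented in
# `MICE.combine` above: pool the per-imputation estimates by their mean,
# and pool uncertainty by adding the average within-imputation covariance
# to an inflated between-imputation covariance.  The helper name and its
# arguments are assumptions made for the example.
def _combining_rule_sketch(params_list, cov_list):
    m = len(params_list)
    params_arr = np.asarray(params_list)
    # Pooled point estimate: mean across the m imputed data sets.
    params = params_arr.mean(0)
    # Average within-imputation covariance.
    cov_within = np.mean(cov_list, axis=0)
    # Between-imputation covariance of the estimates.
    cov_between = np.cov(params_arr.T)
    # Total covariance with the finite-m inflation factor (1 + 1/m).
    cov_params = cov_within + (1 + 1 / m) * cov_between
    return params, cov_params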


class MICEResults(LikelihoodModelResults):

    def __init__(self, model, params, normalized_cov_params):

        super(MICEResults, self).__init__(model, params,
                                          normalized_cov_params)

    def summary(self, title=None, alpha=.05):
        """
        Summarize the results of running MICE.

        Parameters
        ----------
        title : str, optional
            Title for the top table.  If not None, then this replaces
            the default title.
        alpha : float
            Significance level for the confidence intervals.

        Returns
        -------
        smry : Summary instance
            This holds the summary tables and text, which can be
            printed or converted to various output formats.
        """

        from statsmodels.iolib import summary2
        from collections import OrderedDict

        smry = summary2.Summary()
        float_format = "%8.3f"

        info = OrderedDict()
        info["Method:"] = "MICE"
        info["Model:"] = self.model_class.__name__
        info["Dependent variable:"] = self.endog_names
        info["Sample size:"] = "%d" % self.model.data.data.shape[0]
        info["Scale"] = "%.2f" % self.scale
        info["Num. imputations"] = "%d" % len(self.model.results_list)

        smry.add_dict(info, align='l', float_format=float_format)

        param = summary2.summary_params(self, alpha=alpha)
        param["FMI"] = self.frac_miss_info

        smry.add_df(param, float_format=float_format)
        smry.add_title(title=title, results=self)

        return smry