1""" 

2Overview 

3-------- 

4 

5This module implements the Multiple Imputation through Chained 

6Equations (MICE) approach to handling missing data in statistical data 

7analyses. The approach has the following steps: 

8 

90. Impute each missing value with the mean of the observed values of 

10the same variable. 

11 

121. For each variable in the data set with missing values (termed the 

13'focus variable'), do the following: 

14 

151a. Fit an 'imputation model', which is a regression model for the 

16focus variable, regressed on the observed and (current) imputed values 

17of some or all of the other variables. 

18 

191b. Impute the missing values for the focus variable. Currently this 

20imputation must use the 'predictive mean matching' (pmm) procedure. 

21 

222. Once all variables have been imputed, fit the 'analysis model' to 

23the data set. 

24 

253. Repeat steps 1-2 multiple times and combine the results using a 

26'combining rule' to produce point estimates of all parameters in the 

27analysis model and standard errors for them. 
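
(The 'combining rule' used here is Rubin's rule: given m imputed data
sets with analysis model estimates Q_1, ..., Q_m and within-imputation
covariance matrices U_1, ..., U_m, the pooled estimate is the mean of
the Q_i and the pooled covariance matrix is mean(U_i) +
(1 + 1/m) * cov(Q_i); see `MICE.combine`.)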

The imputations for each variable are based on an imputation model
that is specified via a model class and a formula for the regression
relationship. The default model is OLS, with a formula specifying
main effects for all other variables.

The MICE procedure can be used in one of two ways:

* If the goal is only to produce imputed data sets, the MICEData class
can be used to wrap a data frame, providing facilities for doing the
imputation. Summary plots are available for assessing the performance
of the imputation.

* If the imputed data sets are to be used to fit an additional
'analysis model', a MICE instance can be used. After specifying the
MICE instance and running it, the results are combined using the
`combine` method. Results and various summary plots are then
available.

Terminology
-----------

The primary goal of the analysis is usually to fit and perform
inference using an 'analysis model'. If an analysis model is not
specified, then imputed datasets are produced for later use.

The MICE procedure involves a family of imputation models. There is
one imputation model for each variable with missing values. An
imputation model may be conditioned on all or a subset of the
remaining variables, using main effects, transformations,
interactions, etc. as desired.

A 'perturbation method' is a method for setting the parameter estimate
in an imputation model. The 'gaussian' perturbation method first fits
the model (usually using maximum likelihood, but it could use any
statsmodels fit procedure), then sets the parameter vector equal to a
draw from the Gaussian approximation to the sampling distribution for
the fit. The 'bootstrap' perturbation method sets the parameter
vector equal to a fitted parameter vector obtained when fitting the
conditional model to a bootstrapped version of the data set.
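
For example, the 'gaussian' method amounts to the draw below, where
`rslt` stands for the fitted results of one imputation model (an
illustrative name; see `MICEData._perturb_gaussian`):

    params = np.random.multivariate_normal(mean=rslt.params,
                                           cov=rslt.cov_params())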

Class structure
---------------

There are two main classes in the module:

* 'MICEData' wraps a Pandas dataframe, incorporating information about
  the imputation model for each variable with missing values. It can
  be used to produce multiply imputed data sets that are to be further
  processed or distributed to other researchers. A number of plotting
  procedures are provided to visualize the imputation results and
  missing data patterns. The `history_callback` hook allows any
  features of interest of the imputed data sets to be saved for
  further analysis.

* 'MICE' takes both a 'MICEData' object and an analysis model
  specification. It runs the multiple imputation, fits the analysis
  models, and combines the results to produce a `MICEResults` object.
  The summary method of this results object can be used to see the key
  estimands and inferential quantities.

Notes
-----

By default, to conserve memory 'MICEData' saves very little
information from one iteration to the next. The data set passed by
the user is copied on entry, but then is over-written each time new
imputations are produced. If using 'MICE', the fitted analysis models
and results are saved. MICEData includes a `history_callback` hook
that allows arbitrary information from the intermediate datasets to be
saved for future use.

References
----------

JL Schafer: 'Multiple Imputation: A Primer', Stat Methods Med Res,
1999.

TE Raghunathan et al.: 'A Multivariate Technique for Multiply
Imputing Missing Values Using a Sequence of Regression Models', Survey
Methodology, 2001.

SAS Institute: 'Predictive Mean Matching Method for Monotone Missing
Data', SAS 9.2 User's Guide, 2014.

A Gelman et al.: 'Multiple Imputation with Diagnostics (mi) in R:
Opening Windows into the Black Box', Journal of Statistical Software,
2009.
"""

import pandas as pd
import numpy as np
import patsy
from statsmodels.base.model import LikelihoodModelResults
from statsmodels.regression.linear_model import OLS
from collections import defaultdict


_mice_data_example_1 = """
    >>> imp = mice.MICEData(data)
    >>> imp.set_imputer('x1', formula='x2 + np.square(x2) + x3')
    >>> for j in range(20):
    ...     imp.update_all()
    ...     imp.data.to_csv('data%02d.csv' % j)"""

_mice_data_example_2 = """
    >>> imp = mice.MICEData(data)
    >>> for j in range(20):
    ...     data = imp.next_sample()
    ...     data.to_csv('data%02d.csv' % j)"""


class PatsyFormula(object):
    """
    A simple wrapper for a string to be interpreted as a Patsy formula.
    """
    def __init__(self, formula):
        # "0 + " suppresses the intercept that Patsy would otherwise
        # add, so only the mean-structure terms supplied by the user
        # are kept.
        self.formula = "0 + " + formula
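

# Illustrative use of PatsyFormula (the variable names below are
# hypothetical): any value in `init_kwds`, `fit_kwds` or
# `predict_kwds` that is wrapped in a PatsyFormula is re-evaluated
# against the current imputed data each time the imputation model is
# set up, then subset to the relevant rows (see `_process_kwds`), e.g.
#
#     imp.set_imputer('x1', model_class=sm.PHReg,
#                     init_kwds={'status': PatsyFormula('death')})
#
# would pass the current values of the `death` column as the `status`
# argument of each imputation model fit.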


class MICEData(object):

    __doc__ = """\
    Wrap a data set to allow missing data handling with MICE.

    Parameters
    ----------
    data : Pandas data frame
        The data set, which is copied internally.
    perturbation_method : str
        The default perturbation method.
    k_pmm : int
        The number of nearest neighbors to use during predictive mean
        matching. Can also be specified in `set_imputer`.
    history_callback : function
        A function that is called after each complete imputation
        cycle. The return value is appended to `history`. The
        MICEData object is passed as the sole argument to
        `history_callback`.

    Examples
    --------
    Draw 20 imputations from a data set called `data` and save them in
    separate files with filename pattern `dataXX.csv`. The variables
    other than `x1` are imputed using linear models fit with OLS, with
    mean structures containing main effects of all other variables in
    `data`. The variable named `x1` has a conditional mean structure
    that includes an additional term for x2^2.
    %(_mice_data_example_1)s

    Impute using default models, drawing successive imputed data sets
    with `next_sample`.
    %(_mice_data_example_2)s

    Notes
    -----
    Allowed perturbation methods are 'gaussian' (the model parameters
    are set to a draw from the Gaussian approximation to the posterior
    distribution), and 'boot' (the model parameters are set to the
    estimated values obtained when fitting a bootstrapped version of
    the data set).

    `history_callback` can be implemented to have side effects such as
    saving the current imputed data set to disk.
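
    For example, an illustrative callback that records the column
    means of the imputed data (the return value is appended to
    `history`):

    >>> imp = mice.MICEData(data,
    ...                     history_callback=lambda x: x.data.mean())
    >>> imp.update_all(20)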

193 """ % {'_mice_data_example_1': _mice_data_example_1, 

194 '_mice_data_example_2': _mice_data_example_2} 

    def __init__(self, data, perturbation_method='gaussian',
                 k_pmm=20, history_callback=None):

        if data.columns.dtype != np.dtype('O'):
            msg = "MICEData data column names should be string type"
            raise ValueError(msg)

        self.regularized = dict()

        # Drop observations where all variables are missing. This
        # also has the effect of copying the data frame.
        self.data = data.dropna(how='all').reset_index(drop=True)

        self.history_callback = history_callback
        self.history = []
        self.predict_kwds = {}

        # Assign the same perturbation method for all variables.
        # Can be overridden when calling 'set_imputer'.
        self.perturbation_method = defaultdict(lambda:
                                               perturbation_method)

        # Map from variable name to indices of observed/missing
        # values.
        self.ix_obs = {}
        self.ix_miss = {}
        for col in self.data.columns:
            ix_obs, ix_miss = self._split_indices(self.data[col])
            self.ix_obs[col] = ix_obs
            self.ix_miss[col] = ix_miss

        # Most recent model instance and results instance for each variable.
        self.models = {}
        self.results = {}

        # Map from variable names to the conditional formula.
        self.conditional_formula = {}

        # Map from variable names to init/fit args of the conditional
        # models.
        self.init_kwds = defaultdict(lambda: dict())
        self.fit_kwds = defaultdict(lambda: dict())

        # Map from variable names to the model class.
        self.model_class = {}

        # Map from variable names to most recent params update.
        self.params = {}

        # Set default imputers.
        for vname in data.columns:
            self.set_imputer(vname)

        # The order in which variables are imputed in each cycle.
        # Impute variables with the fewest missing values first.
        vnames = list(data.columns)
        nmiss = [len(self.ix_miss[v]) for v in vnames]
        nmiss = np.asarray(nmiss)
        ii = np.argsort(nmiss)
        ii = ii[sum(nmiss == 0):]
        self._cycle_order = [vnames[i] for i in ii]

        self._initial_imputation()

        self.k_pmm = k_pmm

    def next_sample(self):
        """
        Returns the next imputed dataset in the imputation process.

        Returns
        -------
        data : array_like
            An imputed dataset from the MICE chain.

        Notes
        -----
        `MICEData` does not have a `skip` parameter. Consecutive
        values returned by `next_sample` are immediately consecutive
        in the imputation chain.

        The returned value is a reference to the data attribute of
        the class and should be copied before making any changes.
        """

        self.update_all(1)
        return self.data

    def _initial_imputation(self):
        """
        Use a PMM-like procedure for initial imputed values.

        For each variable, missing values are imputed as the observed
        value that is closest to the mean over all observed values.
        """

        for col in self.data.columns:
            di = self.data[col] - self.data[col].mean()
            di = np.abs(di)
            ix = di.idxmin()
            imp = self.data[col].loc[ix]
            self.data[col] = self.data[col].fillna(imp)

    def _split_indices(self, vec):
        null = pd.isnull(vec)
        ix_obs = np.flatnonzero(~null)
        ix_miss = np.flatnonzero(null)
        if len(ix_obs) == 0:
            raise ValueError("variable to be imputed has no observed values")
        return ix_obs, ix_miss

    def set_imputer(self, endog_name, formula=None, model_class=None,
                    init_kwds=None, fit_kwds=None, predict_kwds=None,
                    k_pmm=20, perturbation_method=None, regularized=False):
        """
        Specify the imputation process for a single variable.

        Parameters
        ----------
        endog_name : str
            Name of the variable to be imputed.
        formula : str
            Conditional formula for imputation. Defaults to a formula
            with main effects for all other variables in the dataset.
            The formula should only include an expression for the mean
            structure, e.g. use 'x1 + x2' not 'x4 ~ x1 + x2'.
        model_class : statsmodels model
            Conditional model for imputation. Defaults to OLS. See
            below for more information.
        init_kwds : dict-like
            Keyword arguments passed to the model init method.
        fit_kwds : dict-like
            Keyword arguments passed to the model fit method.
        predict_kwds : dict-like
            Keyword arguments passed to the model predict method.
        k_pmm : int
            Determines number of neighboring observations from which
            to randomly sample when using predictive mean matching.
        perturbation_method : str
            Either 'gaussian' or 'boot'. Determines the method for
            perturbing parameters in the imputation model. If None,
            uses the default specified at class initialization.
        regularized : bool
            If True, `fit_regularized` rather than `fit` is called
            when fitting imputation models for this variable. In this
            case, `perturbation_method` must be set to 'boot'.

        Notes
        -----
        The model class must meet the following conditions:
            * A model must have a `fit` method that returns an object.
            * The object returned from `fit` must have a `params`
              attribute that is an array-like object.
            * The object returned from `fit` must have a `cov_params`
              method that returns a square array-like object.
            * The model must have a `predict` method.
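
        Examples
        --------
        A non-default imputer for `x1`, mirroring the module-level
        example (`data` is the wrapped data frame):

        >>> imp = mice.MICEData(data)
        >>> imp.set_imputer('x1', formula='x2 + np.square(x2) + x3')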

353 """ 

354 

355 if formula is None: 

356 main_effects = [x for x in self.data.columns 

357 if x != endog_name] 

358 fml = endog_name + " ~ " + " + ".join(main_effects) 

359 self.conditional_formula[endog_name] = fml 

360 else: 

361 fml = endog_name + " ~ " + formula 

362 self.conditional_formula[endog_name] = fml 

363 

364 if model_class is None: 

365 self.model_class[endog_name] = OLS 

366 else: 

367 self.model_class[endog_name] = model_class 

368 

369 if init_kwds is not None: 

370 self.init_kwds[endog_name] = init_kwds 

371 

372 if fit_kwds is not None: 

373 self.fit_kwds[endog_name] = fit_kwds 

374 

375 if predict_kwds is not None: 

376 self.predict_kwds[endog_name] = predict_kwds 

377 

378 if perturbation_method is not None: 

379 self.perturbation_method[endog_name] = perturbation_method 

380 

381 self.k_pmm = k_pmm 

382 self.regularized[endog_name] = regularized 

    def _store_changes(self, col, vals):
        """
        Fill in dataset with imputed values.

        Parameters
        ----------
        col : str
            Name of variable to be filled in.
        vals : ndarray
            Array of imputed values to use for filling-in missing values.
        """

        ix = self.ix_miss[col]
        if len(ix) > 0:
            self.data.iloc[ix, self.data.columns.get_loc(col)] = (
                np.atleast_1d(vals))

    def update_all(self, n_iter=1):
        """
        Perform a specified number of MICE iterations.

        Parameters
        ----------
        n_iter : int
            The number of updates to perform. Only the result of the
            final update will be available.

        Notes
        -----
        The imputed values are stored in the class attribute `self.data`.
        """

        for k in range(n_iter):
            for vname in self._cycle_order:
                self.update(vname)

        if self.history_callback is not None:
            hv = self.history_callback(self)
            self.history.append(hv)

    def get_split_data(self, vname):
        """
        Return endog and exog for imputation of a given variable.

        Parameters
        ----------
        vname : str
            The variable for which the split data is returned.

        Returns
        -------
        endog_obs : DataFrame
            Observed values of the variable to be imputed.
        exog_obs : DataFrame
            Current values of the predictors where the variable to be
            imputed is observed.
        exog_miss : DataFrame
            Current values of the predictors where the variable to be
            imputed is missing.
        predict_obs_kwds : dict-like
            The predict keyword arguments for `vname`, processed
            through Patsy as required and subset to the observed cases.
        predict_miss_kwds : dict-like
            The predict keyword arguments for `vname`, processed
            through Patsy as required and subset to the missing cases.
        """

        formula = self.conditional_formula[vname]
        endog, exog = patsy.dmatrices(formula, self.data,
                                      return_type="dataframe")

        # Rows with observed endog
        ixo = self.ix_obs[vname]
        endog_obs = np.asarray(endog.iloc[ixo])
        exog_obs = np.asarray(exog.iloc[ixo, :])

        # Rows with missing endog
        ixm = self.ix_miss[vname]
        exog_miss = np.asarray(exog.iloc[ixm, :])

        predict_obs_kwds = {}
        if vname in self.predict_kwds:
            kwds = self.predict_kwds[vname]
            predict_obs_kwds = self._process_kwds(kwds, ixo)

        predict_miss_kwds = {}
        if vname in self.predict_kwds:
            kwds = self.predict_kwds[vname]
            predict_miss_kwds = self._process_kwds(kwds, ixm)

        return (endog_obs, exog_obs, exog_miss, predict_obs_kwds,
                predict_miss_kwds)

    def _process_kwds(self, kwds, ix):
        kwds = kwds.copy()
        for k in kwds:
            v = kwds[k]
            if isinstance(v, PatsyFormula):
                mat = patsy.dmatrix(v.formula, self.data,
                                    return_type="dataframe")
                mat = np.asarray(mat)[ix, :]
                if mat.shape[1] == 1:
                    mat = mat[:, 0]
                kwds[k] = mat
        return kwds

    def get_fitting_data(self, vname):
        """
        Return the data needed to fit a model for imputation.

        The data is used to impute variable `vname`, and therefore
        only includes cases for which `vname` is observed.

        Values of type `PatsyFormula` in `init_kwds` or `fit_kwds` are
        processed through Patsy and subset to align with the model's
        endog and exog.

        Parameters
        ----------
        vname : str
            The variable for which the fitting data is returned.

        Returns
        -------
        endog : DataFrame
            Observed values of `vname`.
        exog : DataFrame
            Regression design matrix for imputing `vname`.
        init_kwds : dict-like
            The init keyword arguments for `vname`, processed through
            Patsy as required.
        fit_kwds : dict-like
            The fit keyword arguments for `vname`, processed through
            Patsy as required.
        """

        # Rows with observed endog
        ix = self.ix_obs[vname]

        formula = self.conditional_formula[vname]
        endog, exog = patsy.dmatrices(formula, self.data,
                                      return_type="dataframe")

        endog = np.asarray(endog.iloc[ix, 0])
        exog = np.asarray(exog.iloc[ix, :])

        init_kwds = self._process_kwds(self.init_kwds[vname], ix)
        fit_kwds = self._process_kwds(self.fit_kwds[vname], ix)

        return endog, exog, init_kwds, fit_kwds

    def plot_missing_pattern(self, ax=None, row_order="pattern",
                             column_order="pattern",
                             hide_complete_rows=False,
                             hide_complete_columns=False,
                             color_row_patterns=True):
        """
        Generate an image showing the missing data pattern.

        Parameters
        ----------
        ax : AxesSubplot
            Axes on which to draw the plot.
        row_order : str
            The method for ordering the rows. Must be one of
            'pattern', 'proportion', or 'raw'.
        column_order : str
            The method for ordering the columns. Must be one of
            'pattern', 'proportion', or 'raw'.
        hide_complete_rows : bool
            If True, rows with no missing values are not drawn.
        hide_complete_columns : bool
            If True, columns with no missing values are not drawn.
        color_row_patterns : bool
            If True, color the unique row patterns, otherwise use grey
            and white as colors.

        Returns
        -------
        A figure containing a plot of the missing data pattern.
        """

        # Create an indicator matrix for missing values.
        miss = np.zeros(self.data.shape)
        cols = self.data.columns
        for j, col in enumerate(cols):
            ix = self.ix_miss[col]
            miss[ix, j] = 1

        # Order the columns as requested
        if column_order == "proportion":
            ix = np.argsort(miss.mean(0))
        elif column_order == "pattern":
            # Order by the leading left singular vector of the
            # covariance of the missingness indicators.
            cv = np.cov(miss.T)
            u, s, vt = np.linalg.svd(cv, 0)
            ix = np.argsort(u[:, 0])
        elif column_order == "raw":
            ix = np.arange(len(cols))
        else:
            raise ValueError(
                column_order + " is not an allowed value for `column_order`.")
        miss = miss[:, ix]
        cols = [cols[i] for i in ix]

        # Order the rows as requested
        if row_order == "proportion":
            ix = np.argsort(miss.mean(1))
        elif row_order == "pattern":
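            # Encode each row's missingness indicators as the bits of
            # an integer, so rows with identical missingness patterns
            # receive identical sort keys.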

            x = 2**np.arange(miss.shape[1])
            rky = np.dot(miss, x)
            ix = np.argsort(rky)
        elif row_order == "raw":
            ix = np.arange(miss.shape[0])
        else:
            raise ValueError(
                row_order + " is not an allowed value for `row_order`.")
        miss = miss[ix, :]

        if hide_complete_rows:
            ix = np.flatnonzero((miss == 1).any(1))
            miss = miss[ix, :]

        if hide_complete_columns:
            ix = np.flatnonzero((miss == 1).any(0))
            miss = miss[:, ix]
            cols = [cols[i] for i in ix]

        from statsmodels.graphics import utils as gutils
        from matplotlib.colors import LinearSegmentedColormap

        if ax is None:
            fig, ax = gutils.create_mpl_ax(ax)
        else:
            fig = ax.get_figure()

        if color_row_patterns:
            x = 2**np.arange(miss.shape[1])
            rky = np.dot(miss, x)
            _, rcol = np.unique(rky, return_inverse=True)
            miss *= 1 + rcol[:, None]
            ax.imshow(miss, aspect="auto", interpolation="nearest",
                      cmap='gist_ncar_r')
        else:
            cmap = LinearSegmentedColormap.from_list("_",
                                                     ["white", "darkgrey"])
            ax.imshow(miss, aspect="auto", interpolation="nearest",
                      cmap=cmap)

        ax.set_ylabel("Cases")
        ax.set_xticks(range(len(cols)))
        ax.set_xticklabels(cols, rotation=90)

        return fig

    def plot_bivariate(self, col1_name, col2_name,
                       lowess_args=None, lowess_min_n=40,
                       jitter=None, plot_points=True, ax=None):
        """
        Plot observed and imputed values for two variables.

        Displays a scatterplot of one variable against another. The
        points are colored according to whether the values are
        observed or imputed.

        Parameters
        ----------
        col1_name : str
            The variable to be plotted on the horizontal axis.
        col2_name : str
            The variable to be plotted on the vertical axis.
        lowess_args : dictionary
            A dictionary of dictionaries, keys are 'ii', 'io', 'oi'
            and 'oo', where 'o' denotes 'observed' and 'i' denotes
            imputed; the first letter refers to `col1_name` and the
            second to `col2_name`. The values are keyword arguments
            passed to the lowess fit for the corresponding subset of
            points.
        lowess_min_n : int
            Minimum sample size to plot a lowess fit.
        jitter : float or tuple
            Standard deviation for jittering points in the plot.
            Either a single scalar applied to both axes, or a tuple
            containing x-axis jitter and y-axis jitter, respectively.
        plot_points : bool
            If True, the data points are plotted.
        ax : AxesSubplot
            Axes on which to plot, created if not provided.

        Returns
        -------
        The matplotlib figure on which the plot is drawn.
        """

        from statsmodels.graphics import utils as gutils
        from statsmodels.nonparametric.smoothers_lowess import lowess

        if lowess_args is None:
            lowess_args = {}

        if ax is None:
            fig, ax = gutils.create_mpl_ax(ax)
        else:
            fig = ax.get_figure()

        ax.set_position([0.1, 0.1, 0.7, 0.8])

        ix1i = self.ix_miss[col1_name]
        ix1o = self.ix_obs[col1_name]
        ix2i = self.ix_miss[col2_name]
        ix2o = self.ix_obs[col2_name]

        ix_ii = np.intersect1d(ix1i, ix2i)
        ix_io = np.intersect1d(ix1i, ix2o)
        ix_oi = np.intersect1d(ix1o, ix2i)
        ix_oo = np.intersect1d(ix1o, ix2o)

        vec1 = np.asarray(self.data[col1_name])
        vec2 = np.asarray(self.data[col2_name])

        if jitter is not None:
            if np.isscalar(jitter):
                jitter = (jitter, jitter)
            vec1 += jitter[0] * np.random.normal(size=len(vec1))
            vec2 += jitter[1] * np.random.normal(size=len(vec2))

        # Plot the points
        keys = ['oo', 'io', 'oi', 'ii']
        lak = {'i': 'imp', 'o': 'obs'}
        ixs = {'ii': ix_ii, 'io': ix_io, 'oi': ix_oi, 'oo': ix_oo}
        color = {'oo': 'grey', 'ii': 'red', 'io': 'orange',
                 'oi': 'lime'}
        if plot_points:
            for ky in keys:
                ix = ixs[ky]
                lab = lak[ky[0]] + "/" + lak[ky[1]]
                ax.plot(vec1[ix], vec2[ix], 'o', color=color[ky],
                        label=lab, alpha=0.6)

        # Plot the lowess fits
        for ky in keys:
            ix = ixs[ky]
            if len(ix) < lowess_min_n:
                continue
            if ky in lowess_args:
                la = lowess_args[ky]
            else:
                la = {}
            lfit = lowess(vec2[ix], vec1[ix], **la)
            if plot_points:
                ax.plot(lfit[:, 0], lfit[:, 1], '-', color=color[ky],
                        alpha=0.6, lw=4)
            else:
                lab = lak[ky[0]] + "/" + lak[ky[1]]
                ax.plot(lfit[:, 0], lfit[:, 1], '-', color=color[ky],
                        alpha=0.6, lw=4, label=lab)

        ha, la = ax.get_legend_handles_labels()
        pad = 0.0001 if plot_points else 0.5
        leg = fig.legend(ha, la, 'center right', numpoints=1,
                         handletextpad=pad)
        leg.draw_frame(False)

        ax.set_xlabel(col1_name)
        ax.set_ylabel(col2_name)

        return fig

    def plot_fit_obs(self, col_name, lowess_args=None,
                     lowess_min_n=40, jitter=None,
                     plot_points=True, ax=None):
        """
        Plot fitted versus imputed or observed values as a scatterplot.

        Parameters
        ----------
        col_name : str
            The variable to be plotted on the horizontal axis.
        lowess_args : dict-like
            Keyword arguments passed to the lowess fit. A dictionary
            of dictionaries, keys are 'o' and 'i' denoting 'observed'
            and 'imputed', respectively.
        lowess_min_n : int
            Minimum sample size to plot a lowess fit.
        jitter : float or tuple
            Standard deviation for jittering points in the plot.
            Either a single scalar applied to both axes, or a tuple
            containing x-axis jitter and y-axis jitter, respectively.
        plot_points : bool
            If True, the data points are plotted.
        ax : AxesSubplot
            Axes on which to plot, created if not provided.

        Returns
        -------
        The matplotlib figure on which the plot is drawn.
        """

        from statsmodels.graphics import utils as gutils
        from statsmodels.nonparametric.smoothers_lowess import lowess

        if lowess_args is None:
            lowess_args = {}

        if ax is None:
            fig, ax = gutils.create_mpl_ax(ax)
        else:
            fig = ax.get_figure()

        ax.set_position([0.1, 0.1, 0.7, 0.8])

        ixi = self.ix_miss[col_name]
        ixo = self.ix_obs[col_name]

        vec1 = np.asarray(self.data[col_name])

        # Fitted values
        formula = self.conditional_formula[col_name]
        endog, exog = patsy.dmatrices(formula, self.data,
                                      return_type="dataframe")
        results = self.results[col_name]
        vec2 = results.predict(exog=exog)
        vec2 = self._get_predicted(vec2)

        if jitter is not None:
            if np.isscalar(jitter):
                jitter = (jitter, jitter)
            vec1 += jitter[0] * np.random.normal(size=len(vec1))
            vec2 += jitter[1] * np.random.normal(size=len(vec2))

        # Plot the points
        keys = ['o', 'i']
        ixs = {'o': ixo, 'i': ixi}
        lak = {'o': 'obs', 'i': 'imp'}
        color = {'o': 'orange', 'i': 'lime'}
        if plot_points:
            for ky in keys:
                ix = ixs[ky]
                ax.plot(vec1[ix], vec2[ix], 'o', color=color[ky],
                        label=lak[ky], alpha=0.6)

        # Plot the lowess fits
        for ky in keys:
            ix = ixs[ky]
            if len(ix) < lowess_min_n:
                continue
            if ky in lowess_args:
                la = lowess_args[ky]
            else:
                la = {}
            lfit = lowess(vec2[ix], vec1[ix], **la)
            ax.plot(lfit[:, 0], lfit[:, 1], '-', color=color[ky],
                    alpha=0.6, lw=4, label=lak[ky])

        ha, la = ax.get_legend_handles_labels()
        leg = fig.legend(ha, la, 'center right', numpoints=1)
        leg.draw_frame(False)

        ax.set_xlabel(col_name + " observed or imputed")
        ax.set_ylabel(col_name + " fitted")

        return fig

    def plot_imputed_hist(self, col_name, ax=None, imp_hist_args=None,
                          obs_hist_args=None, all_hist_args=None):
        """
        Display imputed values for one variable as a histogram.

        Parameters
        ----------
        col_name : str
            The name of the variable to be plotted.
        ax : AxesSubplot
            An axes on which to draw the histograms. If not provided,
            one is created.
        imp_hist_args : dict
            Keyword arguments to be passed to pyplot.hist when
            creating the histogram for imputed values.
        obs_hist_args : dict
            Keyword arguments to be passed to pyplot.hist when
            creating the histogram for observed values.
        all_hist_args : dict
            Keyword arguments to be passed to pyplot.hist when
            creating the histogram for all values.

        Returns
        -------
        The matplotlib figure on which the histograms were drawn.
        """

        from statsmodels.graphics import utils as gutils

        if imp_hist_args is None:
            imp_hist_args = {}
        if obs_hist_args is None:
            obs_hist_args = {}
        if all_hist_args is None:
            all_hist_args = {}

        if ax is None:
            fig, ax = gutils.create_mpl_ax(ax)
        else:
            fig = ax.get_figure()

        ax.set_position([0.1, 0.1, 0.7, 0.8])

        ixm = self.ix_miss[col_name]
        ixo = self.ix_obs[col_name]

        imp = self.data[col_name].iloc[ixm]
        obs = self.data[col_name].iloc[ixo]

        for di in imp_hist_args, obs_hist_args, all_hist_args:
            if 'histtype' not in di:
                di['histtype'] = 'step'

        ha, la = [], []
        if len(imp) > 0:
            h = ax.hist(np.asarray(imp), **imp_hist_args)
            ha.append(h[-1][0])
            la.append("Imp")
        h1 = ax.hist(np.asarray(obs), **obs_hist_args)
        h2 = ax.hist(np.asarray(self.data[col_name]), **all_hist_args)
        ha.extend([h1[-1][0], h2[-1][0]])
        la.extend(["Obs", "All"])

        leg = fig.legend(ha, la, 'center right', numpoints=1)
        leg.draw_frame(False)

        ax.set_xlabel(col_name)
        ax.set_ylabel("Frequency")

        return fig

    # Try to identify any auxiliary arrays (e.g. status vector in
    # PHReg) that need to be bootstrapped along with exog and endog.
    def _boot_kwds(self, kwds, rix):

        for k in kwds:
            v = kwds[k]

            # This is only relevant for ndarrays
            if not isinstance(v, np.ndarray):
                continue

            # Handle 1d vectors
            if (v.ndim == 1) and (v.shape[0] == len(rix)):
                kwds[k] = v[rix]

            # Handle 2d arrays
            if (v.ndim == 2) and (v.shape[0] == len(rix)):
                kwds[k] = v[rix, :]

        return kwds

    def _perturb_bootstrap(self, vname):
        """
        Perturbs the model's parameters using a bootstrap.
        """

        endog, exog, init_kwds, fit_kwds = self.get_fitting_data(vname)

        m = len(endog)
        rix = np.random.randint(0, m, m)
        endog = endog[rix]
        exog = exog[rix, :]

        init_kwds = self._boot_kwds(init_kwds, rix)
        fit_kwds = self._boot_kwds(fit_kwds, rix)

        klass = self.model_class[vname]
        self.models[vname] = klass(endog, exog, **init_kwds)

        if vname in self.regularized and self.regularized[vname]:
            self.results[vname] = (
                self.models[vname].fit_regularized(**fit_kwds))
        else:
            self.results[vname] = self.models[vname].fit(**fit_kwds)

        self.params[vname] = self.results[vname].params

    def _perturb_gaussian(self, vname):
        """
        Gaussian perturbation of model parameters.

        The normal approximation to the sampling distribution of the
        parameter estimates is used to define the mean and covariance
        structure of the perturbation distribution.
        """

        endog, exog, init_kwds, fit_kwds = self.get_fitting_data(vname)

        klass = self.model_class[vname]
        self.models[vname] = klass(endog, exog, **init_kwds)
        self.results[vname] = self.models[vname].fit(**fit_kwds)

        cov = self.results[vname].cov_params()
        mu = self.results[vname].params
        self.params[vname] = np.random.multivariate_normal(mean=mu, cov=cov)

    def perturb_params(self, vname):

        if self.perturbation_method[vname] == "gaussian":
            self._perturb_gaussian(vname)
        elif self.perturbation_method[vname] == "boot":
            self._perturb_bootstrap(vname)
        else:
            raise ValueError("unknown perturbation method")

    def impute(self, vname):
        # Wrap this in case we later add additional imputation
        # methods.
        self.impute_pmm(vname)

    def update(self, vname):
        """
        Impute missing values for a single variable.

        This is a two-step process in which first the parameters are
        perturbed, then the missing values are re-imputed.

        Parameters
        ----------
        vname : str
            The name of the variable to be updated.
        """

        self.perturb_params(vname)
        self.impute(vname)

    # work-around for inconsistent predict return values
    def _get_predicted(self, obj):

        if isinstance(obj, np.ndarray):
            return obj
        elif isinstance(obj, pd.Series):
            return obj.values
        elif hasattr(obj, 'predicted_values'):
            return obj.predicted_values
        else:
            raise ValueError(
                "cannot obtain predicted values from %s" % obj.__class__)

    def impute_pmm(self, vname):
        """
        Use predictive mean matching to impute missing values.

        Notes
        -----
        The `perturb_params` method must be called first to define the
        model.
        """

        k_pmm = self.k_pmm

        endog_obs, exog_obs, exog_miss, predict_obs_kwds, predict_miss_kwds = (
            self.get_split_data(vname))

        # Predict imputed variable for both missing and non-missing
        # observations
        model = self.models[vname]
        pendog_obs = model.predict(self.params[vname], exog_obs,
                                   **predict_obs_kwds)
        pendog_miss = model.predict(self.params[vname], exog_miss,
                                    **predict_miss_kwds)

        pendog_obs = self._get_predicted(pendog_obs)
        pendog_miss = self._get_predicted(pendog_miss)

        # Jointly sort the observed and predicted endog values for the
        # cases with observed values.
        ii = np.argsort(pendog_obs)
        endog_obs = endog_obs[ii]
        pendog_obs = pendog_obs[ii]

        # Find the closest match to the predicted endog values for
        # cases with missing endog values.
        ix = np.searchsorted(pendog_obs, pendog_miss)

        # Get the indices for the closest k_pmm values on
        # either side of the closest index.
        ixm = ix[:, None] + np.arange(-k_pmm, k_pmm)[None, :]

        # Account for boundary effects
        msk = np.nonzero((ixm < 0) | (ixm > len(endog_obs) - 1))
        ixm = np.clip(ixm, 0, len(endog_obs) - 1)

        # Get the distances
        dx = pendog_miss[:, None] - pendog_obs[ixm]
        dx = np.abs(dx)
        dx[msk] = np.inf

        # Closest positions in ix, row-wise.
        dxi = np.argsort(dx, 1)[:, 0:k_pmm]

        # Choose a column for each row.
        ir = np.random.randint(0, k_pmm, len(pendog_miss))

        # Unwind the indices
        jj = np.arange(dxi.shape[0])
        ix = dxi[(jj, ir)]
        iz = ixm[(jj, ix)]

        imputed_miss = np.array(endog_obs[iz]).squeeze()
        self._store_changes(vname, imputed_miss)


_mice_example_1 = """
    >>> imp = mice.MICEData(data)
    >>> fml = 'y ~ x1 + x2 + x3 + x4'
    >>> mice = mice.MICE(fml, sm.OLS, imp)
    >>> results = mice.fit(10, 10)
    >>> print(results.summary())

    .. literalinclude:: ../plots/mice_example_1.txt
    """

_mice_example_2 = """
    >>> imp = mice.MICEData(data)
    >>> fml = 'y ~ x1 + x2 + x3 + x4'
    >>> mice = mice.MICE(fml, sm.OLS, imp)
    >>> results = []
    >>> for k in range(10):
    ...     x = mice.next_sample()
    ...     results.append(x)
    """


class MICE(object):

    __doc__ = """\
    Multiple Imputation with Chained Equations.

    This class can be used to fit most statsmodels models to data sets
    with missing values using the 'multiple imputation with chained
    equations' (MICE) approach.

    Parameters
    ----------
    model_formula : str
        The model formula to be fit to the imputed data sets. This
        formula is for the 'analysis model'.
    model_class : statsmodels model
        The model to be fit to the imputed data sets. This model
        class is for the 'analysis model'.
    data : MICEData instance
        MICEData object containing the data set for which missing
        values will be imputed.
    n_skip : int
        The number of imputed datasets to skip between consecutive
        imputed datasets that are used for analysis.
    init_kwds : dict-like
        Dictionary of keyword arguments passed to the init method
        of the analysis model.
    fit_kwds : dict-like
        Dictionary of keyword arguments passed to the fit method
        of the analysis model.

    Examples
    --------
    Run all MICE steps and obtain results:
    %(mice_example_1)s

    Obtain a sequence of fitted analysis models without combining
    them to obtain a summary::
    %(mice_example_2)s
    """ % {'mice_example_1': _mice_example_1,
           'mice_example_2': _mice_example_2}

    def __init__(self, model_formula, model_class, data, n_skip=3,
                 init_kwds=None, fit_kwds=None):

        self.model_formula = model_formula
        self.model_class = model_class
        self.n_skip = n_skip
        self.data = data
        self.results_list = []

        self.init_kwds = init_kwds if init_kwds is not None else {}
        self.fit_kwds = fit_kwds if fit_kwds is not None else {}

    def next_sample(self):
        """
        Perform one complete MICE iteration.

        A single MICE iteration updates all missing values using their
        respective imputation models, then fits the analysis model to
        the imputed data.

        Returns
        -------
        results : results instance
            The fitted analysis model for one imputed data set.

        Notes
        -----
        This function fits the analysis model and returns the fitted
        results. The results object is not stored by the class and is
        not used in any subsequent calls to `combine`. Use `fit` to
        run all MICE steps together and obtain summary results.

        The complete cycle of missing value imputation followed by
        fitting the analysis model is repeated `n_skip + 1` times and
        the analysis model fit to the final imputed data set is
        returned.
        """

        # Impute missing values
        self.data.update_all(self.n_skip + 1)
        start_params = None
        if len(self.results_list) > 0:
            start_params = self.results_list[-1].params

        # Fit the analysis model.
        model = self.model_class.from_formula(self.model_formula,
                                              self.data.data,
                                              **self.init_kwds)
        self.fit_kwds.update({"start_params": start_params})
        result = model.fit(**self.fit_kwds)

        return result

    def fit(self, n_burnin=10, n_imputations=10):
        """
        Fit a model using MICE.

        Parameters
        ----------
        n_burnin : int
            The number of burn-in cycles to skip.
        n_imputations : int
            The number of data sets to impute.

        Returns
        -------
        A MICEResults instance containing the pooled analysis model
        parameter estimates and covariances (see `combine`).
        """

        # Run without fitting the analysis model
        self.data.update_all(n_burnin)

        for j in range(n_imputations):
            result = self.next_sample()
            self.results_list.append(result)

        self.endog_names = result.model.endog_names
        self.exog_names = result.model.exog_names

        return self.combine()

    def combine(self):
        """
        Pools MICE imputation results.

        This method can only be used after the `fit` method has been
        called. Returns estimates and standard errors of the analysis
        model parameters.

        Returns a MICEResults instance.
        """

        # Extract a few things from the models that were fit to
        # imputed data sets.
        params_list = []
        cov_within = 0.
        scale_list = []
        for results in self.results_list:
            results_uw = results._results
            params_list.append(results_uw.params)
            cov_within += results_uw.cov_params()
            scale_list.append(results.scale)
        params_list = np.asarray(params_list)
        scale_list = np.asarray(scale_list)

        # The estimated parameters for the MICE analysis
        params = params_list.mean(0)

        # The average of the within-imputation covariances
        cov_within /= len(self.results_list)

        # The between-imputation covariance
        cov_between = np.cov(params_list.T)

        # The estimated covariance matrix for the MICE analysis
        f = 1 + 1 / float(len(self.results_list))
        cov_params = cov_within + f * cov_between

        # Fraction of missing information
        fmi = f * np.diag(cov_between) / np.diag(cov_params)

        # Set up a results instance
        scale = np.mean(scale_list)
        results = MICEResults(self, params, cov_params / scale)
        results.scale = scale
        results.frac_miss_info = fmi
        results.exog_names = self.exog_names
        results.endog_names = self.endog_names
        results.model_class = self.model_class

        return results


class MICEResults(LikelihoodModelResults):

    def __init__(self, model, params, normalized_cov_params):

        super(MICEResults, self).__init__(model, params,
                                          normalized_cov_params)

    def summary(self, title=None, alpha=.05):
        """
        Summarize the results of running MICE.

        Parameters
        ----------
        title : str, optional
            Title for the top table. If not None, then this replaces
            the default title.
        alpha : float
            Significance level for the confidence intervals.

        Returns
        -------
        smry : Summary instance
            This holds the summary tables and text, which can be
            printed or converted to various output formats.
        """

        from statsmodels.iolib import summary2
        from collections import OrderedDict

        smry = summary2.Summary()
        float_format = "%8.3f"

        info = OrderedDict()
        info["Method:"] = "MICE"
        info["Model:"] = self.model_class.__name__
        info["Dependent variable:"] = self.endog_names
        info["Sample size:"] = "%d" % self.model.data.data.shape[0]
        info["Scale"] = "%.2f" % self.scale
        info["Num. imputations"] = "%d" % len(self.model.results_list)

        smry.add_dict(info, align='l', float_format=float_format)

        param = summary2.summary_params(self, alpha=alpha)
        param["FMI"] = self.frac_miss_info

        smry.add_df(param, float_format=float_format)
        smry.add_title(title=title, results=self)

        return smry