sensortoolkit.qc._outlier_detection.cooks_outlier_detection

cooks_outlier_detection(hourly_df_list, hourly_ref_df, param, serials, invalidate=False)

Estimate outliers via Cook’s distance for the 1-hour sensor vs. reference regression.

Cook’s distance flags a data point as a potential outlier when its distance exceeds a threshold of 4/L, where L is the total number of sensor-FRM/FEM data pairs. To confirm that the flagged points are likely outliers, the absolute difference (AD) and percent difference (PD) between sensor and reference data are computed, along with their respective standard deviations (SD). For both the AD and the PD, a threshold equal to the median plus twice the SD is computed, and each data point flagged by Cook’s distance is compared against these thresholds. If both the AD and the PD for a flagged data point exceed their thresholds, a QA/QC code is assigned to the corresponding timestamp.

If ‘invalidate’ is True, data points for the evaluation parameter that are flagged by Cook’s distance and also exceed both the AD and PD thresholds are set to null.
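The two-stage check described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the sensortoolkit implementation: the helper name flag_cooks_outliers is hypothetical, the regression is assumed to be sensor on reference, the PD is assumed to be a symmetric percent difference, and the SD is assumed to be the sample SD (ddof=1).

```python
import numpy as np


def flag_cooks_outliers(sensor, ref):
    """Hypothetical helper, not part of sensortoolkit.

    Returns a boolean mask of points flagged by Cook's distance that
    also exceed both the AD and PD thresholds.
    """
    sensor = np.asarray(sensor, dtype=float)
    ref = np.asarray(ref, dtype=float)
    n = sensor.size          # L: number of sensor-reference pairs
    p = 2                    # fitted parameters: slope and intercept

    # Ordinary least squares fit (assumed form: sensor = slope * ref + intercept)
    slope, intercept = np.polyfit(ref, sensor, 1)
    resid = sensor - (slope * ref + intercept)
    mse = np.sum(resid**2) / (n - p)

    # Leverage of each point for a one-predictor regression
    dev = ref - ref.mean()
    leverage = 1.0 / n + dev**2 / np.sum(dev**2)

    # Cook's distance, compared against the 4/L threshold
    cooks = resid**2 / (p * mse) * leverage / (1.0 - leverage) ** 2
    candidate = cooks > 4.0 / n

    # AD and PD with thresholds of median + 2 * SD.
    # Assumptions: symmetric percent difference, sample SD (ddof=1).
    abs_diff = np.abs(sensor - ref)
    pct_diff = 2.0 * abs_diff / (sensor + ref)
    ad_thresh = np.median(abs_diff) + 2.0 * abs_diff.std(ddof=1)
    pd_thresh = np.median(pct_diff) + 2.0 * pct_diff.std(ddof=1)

    return candidate & (abs_diff > ad_thresh) & (pct_diff > pd_thresh)


# Synthetic data: 20 well-behaved pairs with one injected spike
ref = np.arange(1.0, 21.0)
sensor = ref + 0.1 * (-1.0) ** np.arange(20)
sensor[10] = 60.0

mask = flag_cooks_outliers(sensor, ref)
# Mimic invalidate=True by nulling the confirmed outliers
cleaned = np.where(mask, np.nan, sensor)
```

Requiring both the AD and PD thresholds to be exceeded keeps high-leverage but otherwise well-agreeing points from being flagged on the basis of Cook’s distance alone.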

Parameters
  • hourly_df_list (list) – List of sensor datasets at 1-hour averaged intervals.

  • hourly_ref_df (pandas DataFrame) – Reference dataframe at 1-hour averaged intervals for the passed parameter.

  • param (str) – Column header name for the parameter values.

  • serials (dict) – A dictionary of unique serial identifiers for each sensor in the testing group.

  • invalidate (bool, optional) – If True, outlier entries are set to null (np.nan). Defaults to False.

Returns

A list of modified sensor datasets.

Return type

hourly_df_list (list)