API Documentation¶
The primary method of using RSMTool is via the command-line scripts rsmtool, rsmeval, rsmpredict, rsmcompare, and rsmsummarize. However, there are certain functions in the rsmtool
API that may also be useful to advanced users for use directly in their Python code. We document these functions below.
Note
RSMTool v5.7 and older provided the API functions metrics_helper, convert_ipynb_to_html, and remove_outliers. These functions have now been turned into static methods of different classes. If you are using these functions in your code and want to migrate to the new API, you should replace the following statements in your code:
from rsmtool.analysis import metrics_helper
metrics_helper(...)
from rsmtool.report import convert_ipynb_to_html
convert_ipynb_to_html(...)
from rsmtool.preprocess import remove_outliers
remove_outliers(...)
with the following, respectively:
from rsmtool.analyzer import Analyzer
Analyzer.metrics_helper(...)
from rsmtool.reporter import Reporter
Reporter.convert_ipynb_to_html(...)
from rsmtool.preprocessor import FeaturePreprocessor
FeaturePreprocessor.remove_outliers(...)
rsmtool Package¶
- rsmtool.run_experiment(config_file_or_obj, output_dir)[source]¶
Run an RSMTool experiment using the given configuration file and generate all outputs in the given directory.
Parameters: - config_file_or_obj (str or Configuration) – Path to the experiment configuration file. Users can also pass a Configuration object that is in memory.
- output_dir (str) – Path to the experiment output directory.
Raises: ValueError – If any of the required fields are missing or ill-specified.
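For example, a minimal sketch of running an experiment from Python; the configuration path and output directory below are hypothetical:
from rsmtool import run_experiment
run_experiment('my_experiment.json', 'my_experiment_output')
A Configuration object built in memory (see the configuration_parser module below) can be passed in place of the file path.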
- rsmtool.run_evaluation(config_file_or_obj, output_dir)[source]¶
Run an rsmeval experiment using the given configuration file and generate all outputs in the given directory.
Parameters: - config_file_or_obj (str or configuration_parser.Configuration) – Path to the experiment configuration file. Users can also pass a Configuration object that is in memory.
- output_dir (str) – Path to the experiment output directory.
Raises: ValueError – If any of the required fields are missing or ill-specified.
- rsmtool.run_comparison(config_file_or_obj, output_dir)[source]¶
Run an rsmcompare experiment using the given configuration file and generate the report in the given directory.
Parameters: - config_file_or_obj (str or Configuration) – Path to the experiment configuration file. Users can also pass a Configuration object that is in memory.
- output_dir (str) – Path to the experiment output directory.
Raises: ValueError – If any of the required fields are missing or ill-specified.
- rsmtool.run_summary(config_file_or_obj, output_dir)[source]¶
Run an rsmsummarize experiment using the given configuration file and generate all outputs in the given directory.
Parameters: - config_file_or_obj (str or configuration_parser.Configuration) – Path to the experiment configuration file. Users can also pass a Configuration object that is in memory.
- output_dir (str) – Path to the experiment output directory.
Raises: ValueError – If any of the required fields are missing or ill-specified.
- rsmtool.compute_and_save_predictions(config_file_or_obj, output_file, feats_file=None)[source]¶
Run an rsmpredict experiment using the given configuration file and generate predictions (and, optionally, pre-processed feature values).
Parameters: - config_file_or_obj (str or configuration_parser.Configuration) – Path to the experiment configuration file. Users can also pass a Configuration object that is in memory.
- output_file (str) – Path to the output file for saving the predictions.
- feats_file (str, optional) – Path to the output file for saving the preprocessed feature values. Defaults to None.
Raises: ValueError – If any of the required fields are missing or ill-specified.
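A minimal sketch with hypothetical paths; feats_file is only needed if you also want the pre-processed feature values saved:
from rsmtool import compute_and_save_predictions
compute_and_save_predictions('rsmpredict.json', 'predictions.csv', feats_file='preprocessed_features.csv')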
From analyzer Module¶
Classes for analyzing RSMTool predictions, metrics, etc.
author: Jeremy Biggs (jbiggs@ets.org)
author: Anastassia Loukina (aloukina@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
date: 10/25/2017
organization: ETS
- class rsmtool.analyzer.Analyzer[source]¶
Bases: object
Analyzer class, which performs analysis on all metrics, predictions, etc.
- static analyze_excluded_responses(df, features, header, exclude_zero_scores=True, exclude_listwise=False)[source]¶
Compute statistics on the responses that were excluded from analyses, either in the training set or in the test set.
Parameters: - df (pandas DataFrame) – Data frame containing the excluded responses.
- features (list of str) – List of column names containing the features to which we want to restrict the analyses.
- header (str) – String to be used as the table header for the output data frame.
- exclude_zero_scores (bool, optional) – Whether or not the zero-score responses should be counted in the exclusion statistics. Defaults to True.
- exclude_listwise (bool, optional) – Whether or not the candidates were excluded based on a minimum number of responses. Defaults to False.
Returns: df_full_crosstab – Two-dimensional data frame containing the exclusion statistics.
Return type: pandas DataFrame
- static analyze_used_predictions(df_test, subgroups, candidate_column)[source]¶
Compute statistics on the predictions that were used in analyses.
Parameters: - df_test (pandas DataFrame) – Data frame containing the test set predictions.
- subgroups (list of str) – List of column names that contain grouping information.
- candidate_column (str) – Column name that contains candidate identification information.
Returns: df_analysis – Data frame containing information about the used predictions.
Return type: pandas DataFrame
- static analyze_used_responses(df_train, df_test, subgroups, candidate_column)[source]¶
Compute statistics on the responses that were used in analyses, either in the training set or in the test set.
Parameters: - df_train (pandas DataFrame) – Data frame containing the response information for the training set.
- df_test (pandas DataFrame) – Data frame containing the response information for the test set.
- subgroups (list of str) – List of column names that contain grouping information.
- candidate_column (str) – Column name that contains candidate identification information.
Returns: df_analysis – Data frame containing information about the used responses.
Return type: pandas DataFrame
- static check_frame_names(data_container, dataframe_names)[source]¶
Check to make sure all specified DataFrames are in the data container object.
Parameters: - data_container (container.DataContainer) – A DataContainer object.
- dataframe_names (list of str) – The names of the DataFrames expected in the DataContainer object.
Raises: KeyError – If a given dataframe_name is not in the DataContainer object.
- static check_param_names(configuration_obj, parameter_names)[source]¶
Check to make sure all specified parameters are in the configuration object.
Parameters: - configuration_obj (configuration_parser.Configuration) – A Configuration object.
- parameter_names (list of str) – The names of the parameters (keys) expected in the Configuration object.
Raises: KeyError – If a given parameter_name is not in the Configuration object.
- static compute_basic_descriptives(df, selected_features)[source]¶
Compute basic descriptive statistics for the columns in the given data frame.
Parameters: - df (pandas DataFrame) – Input data frame containing the feature values.
- selected_features (list of str) – List of feature names for which to compute the descriptives.
Returns: df_desc – DataFrame containing the descriptives for each of the features.
Return type: pandas DataFrame
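A minimal sketch; the feature names and values here are hypothetical:
import pandas as pd
from rsmtool.analyzer import Analyzer
df = pd.DataFrame({'FEATURE1': [1.2, 2.3, 3.1, 2.8],
                   'FEATURE2': [0.5, 0.7, 0.6, 0.9]})
df_desc = Analyzer.compute_basic_descriptives(df, ['FEATURE1', 'FEATURE2'])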
- compute_correlations_by_group(df, selected_features, target_variable, grouping_variable, include_length=False)[source]¶
Compute various marginal and partial correlations of the given columns in the given data frame against the target variable for all data and for each level of the grouping variable.
Parameters: - df (pandas DataFrame) – Input data frame.
- selected_features (list of str) – List of feature names for which to compute the correlations.
- target_variable (str) – Feature name indicating the target variable, i.e., the dependent variable.
- grouping_variable (str) – Feature name that contains the grouping information.
- include_length (bool, optional) – Whether or not to include the length when computing the partial correlations. Defaults to False.
Returns: df_output – Data frame containing the correlations.
Return type: pandas DataFrame
- compute_degradation_and_disattenuated_correlations(df, use_all_responses=True)[source]¶
Compute the degradation in performance when using the machine to predict the score instead of a second human, and the disattenuated correlations between human and machine scores. The latter are computed as the Pearson's correlation between the human score and the machine score divided by the square root of the correlation between two human raters.
For this, we can compute the machine performance either only on the double-scored data or on the full dataset. Both options have their pros and cons. The default is to use the full dataset. This function also assumes that the sc2 column exists in the given data frame, in addition to sc1 and the various types of predictions.
Parameters: - df (pandas DataFrame) – Input data frame.
- use_all_responses (bool, optional) – Use the full data set instead of only using the double-scored subset, defaults to True.
Returns: - df_degradation (pandas DataFrame) – Data frame containing the degradation statistics.
- df_correlations (pandas DataFrame) – Data frame containing the HM correlation, HH correlation and disattenuated correlation
- static compute_disattenuated_correlations(human_machine_corr, human_human_corr)[source]¶
Compute the disattenuated correlations between human and machine scores. These are computed as the Pearson's correlation between the human score and the machine score divided by the square root of the correlation between two human raters.
Parameters: - human_machine_corr (pandas Series) – Series containing Pearson's correlation coefficients for human-machine correlations.
- human_human_corr (pandas Series) – Series containing Pearson's correlation coefficients for human-human correlations. This can contain a single value or have an index matching that of the human-machine correlations.
Returns: df_correlations – Data frame containing the HM correlation, HH correlation, and disattenuated correlation
Return type: pandas DataFrame
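In other words, the disattenuated correlation is r_HM / sqrt(r_HH); for example, with r_HM = 0.75 and r_HH = 0.70, it is 0.75 / sqrt(0.70) ≈ 0.90. A minimal sketch with hypothetical group names and values:
import pandas as pd
from rsmtool.analyzer import Analyzer
hm_corr = pd.Series([0.75, 0.80], index=['group_a', 'group_b'])  # human-machine
hh_corr = pd.Series([0.70, 0.72], index=['group_a', 'group_b'])  # human-human
df_correlations = Analyzer.compute_disattenuated_correlations(hm_corr, hh_corr)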
- compute_metrics(df, compute_shortened=False, use_scaled_predictions=False, include_second_score=False, population_sd_dict=None)[source]¶
Compute the evaluation metrics for the scores in the given data frame. This function computes metrics for all score types.
If include_second_score is True, then assume that a column called sc2 containing a second human score is available and use that to compute the human-human evaluation stats and the performance degradation stats.
If compute_shortened is set to True, then this function also computes a shortened version of the full human-machine metrics data frame. See filter_metrics() for a description of the default columns included in the shortened data frame.
Parameters: - df (pandas DataFrame) – Input data frame
- compute_shortened (bool, optional) – Also compute a shortened version of the full metrics data frame, defaults to False.
- use_scaled_predictions (bool, optional) – Use evaluations based on scaled predictions in the shortened version of the metrics data frame. Defaults to False.
- include_second_score (bool, optional) – Whether a second human score is available. Defaults to False.
- population_sd_dict (dict, optional) – Dictionary containing population standard deviation for each column containing human or system scores. This is used to compute SMD for subgroups.
Returns: - df_human_machine_eval (pandas DataFrame) – Data frame containing the full set of evaluation metrics.
- df_human_machine_eval_filtered (pandas DataFrame) – A shortened version of the first data frame; empty if compute_shortened is False.
- df_human_human_eval (pandas DataFrame) – Data frame containing the human-human statistics; empty if include_second_score is False.
- compute_metrics_by_group(df_test, grouping_variable, use_scaled_predictions=False, include_second_score=False)[source]¶
Compute a subset of the evaluation metrics for the scores in the given data frame by the group specified in grouping_variable. See filter_metrics() for a description of the subset that is selected.
Parameters: - df_test (pandas DataFrame) – Input data frame.
- grouping_variable (str) – Feature name indicating the column that contains grouping information.
- use_scaled_predictions (bool, optional) – Include scaled predictions when computing the evaluation metrics, defaults to False.
- include_second_score (bool, optional) – Include human-human association statistics, defaults to False.
Returns: - df_human_machine_eval_by_group (pandas DataFrame) – Data frame containing the human-machine association statistics.
- df_human_human_eval_by_group (pandas DataFrame) – Data frame that either contains the human-human statistics or is an empty data frame, depending on whether include_second_score is True.
- static compute_outliers(df, selected_features)[source]¶
Compute the number and percentage of outliers outside mean +/- 4 SD for the given columns in the given data frame.
Parameters: - df (pandas DataFrame) – Input data frame containing the feature values.
- selected_features (list of str) – List of feature names for which to compute outlier information.
Returns: df_output – Data frame containing outlier information for each of the features.
Return type: pandas DataFrame
- static compute_pca(df, selected_features)[source]¶
Compute the PCA decomposition of features in the given data frame, restricted to the given columns.
Parameters: - df (pandas DataFrame) – Input data frame containing feature values.
- selected_features (list of str) – List of feature names to be used in the PCA decomposition.
Returns: - df_components (pandas DataFrame) – Data frame containing the PCA components.
- df_variance (pandas DataFrame) – Data frame containing the variance information.
- static compute_percentiles(df, selected_features, percentiles=None)[source]¶
Compute percentiles and outlier descriptives for the columns in the given data frame.
Parameters: - df (pandas DataFrame) – Input data frame containing the feature values.
- selected_features (list of str) – List of feature names for which to compute the percentile descriptives.
- percentiles (list of ints, optional) – The percentiles to calculate. If None, use the percentiles {1, 5, 25, 50, 75, 95, 99}. Defaults to None.
Returns: df_output – Data frame containing the percentile information for each of the features.
Return type: pandas DataFrame
- static correlation_helper(df, target_variable, grouping_variable, include_length=False)[source]¶
A helper function to compute marginal and partial correlations of all the columns in the given data frame against the target variable separately for each level of the grouping variable. If include_length is True, it additionally computes partial correlations of each column in the data frame against the target variable after controlling for the length column.
Parameters: - df (pandas DataFrame) – Input data frame containing numeric feature values, the numeric target variable and the grouping variable.
- target_variable (str) – The name of the column used as a reference for computing correlations.
- grouping_variable (str) – The name of the column defining groups in the data.
- include_length (bool, optional) – If True, compute additional partial correlations of each column in the data frame against the target variable, partialling out only the length column. Defaults to False.
Returns: - df_target_cors (pandas DataFrame) – Data frame containing Pearson's correlation coefficients for marginal correlations between features and target_variable.
- df_target_partcors (pandas DataFrame) – Data frame containing Pearson's correlation coefficients for partial correlations between each feature and target_variable after controlling for all other features. If include_length is set to True, length will not be included in the partial correlation computation.
- df_target_partcors_no_length (pandas DataFrame) – If include_length is set to True, data frame containing Pearson's correlation coefficients for partial correlations between each feature and target_variable after controlling for length; otherwise an empty data frame.
- filter_metrics(df_metrics, use_scaled_predictions=False, chosen_metric_dict=None)[source]¶
Filter the data frame df_metrics, which contains all of the metric values by all score types (raw, raw_trim, etc.), to retain only the metrics defined in the given dictionary chosen_metric_dict. This dictionary maps score types ('raw', 'scale', 'raw_trim', etc.) to the list of metrics that should be computed for them. The full list is:
- 'corr' - 'kappa' - 'wtkappa' - 'exact_agr' - 'adj_agr' - 'SMD' - 'RMSE' - 'R2' - 'sys_min' - 'sys_max' - 'sys_mean' - 'sys_sd' - 'h_min' - 'h_max' - 'h_mean' - 'h_sd' - 'N'
Parameters: - df_metrics (pd.DataFrame) – The DataFrame to filter.
- use_scaled_predictions (bool, optional) – Whether to use scaled predictions. Defaults to False.
- chosen_metric_dict (dict, optional) – The dictionary mapping score types to the metrics that should be computed for them. Defaults to None.
Notes
Note that the last five metrics will be the same for all score types. If the dictionary is not specified, then the following dictionary, containing the recommended metrics, is used:
{'raw/scale_trim': ['N', 'h_mean', 'h_sd', 'sys_mean', 'sys_sd', 'corr', 'RMSE', 'R2', 'SMD'], 'raw/scale_trim_round': ['sys_mean', 'sys_sd', 'wtkappa', 'kappa', 'exact_agr', 'adj_agr', 'SMD']}
where raw/scale is chosen depending on whether use_scaled_predictions is False or True.
- static metrics_helper(human_scores, system_scores, population_human_score_sd=None, population_system_score_sd=None)[source]¶
A helper function that computes some basic agreement and association metrics between the system scores and the human scores.
Parameters: - human_scores (pandas Series) – Series containing numeric human (reference) scores.
- system_scores (pandas Series) – Series containing numeric scores predicted by the model.
- population_human_score_sd (float, optional) – Reference standard deviation for human scores. This is used to compute SMD and should be the standard deviation for the whole population when SMDs are computed for individual subgroups. When None, this will be computed as the standard deviation of human_scores. Defaults to None.
- population_system_score_sd (float, optional) – Reference standard deviation for system scores. This is used to compute SMD and should be the standard deviation for the whole population when SMDs are computed for individual subgroups. When None, this will be computed as the standard deviation of system_scores. Defaults to None.
Returns: metrics – Series containing different evaluation metrics comparing human and system scores. The following metrics are included
- `kappa`: unweighted Cohen's kappa - `wtkappa`: quadratic weighted kappa - `exact_agr`: exact agreement - `adj_agr`: adjacent agreement with tolerance set to 1 - `SMD`: standardized mean difference - `corr`: Pearson's r - `R2`: r squared - `RMSE`: root mean square error - `sys_min`: min system score - `sys_max`: max system score - `sys_mean`: mean system score (ddof=1) - `sys_sd`: standard deviation of system scores (ddof=1) - `h_min`: min human score - `h_max`: max human score - `h_mean`: mean human score (ddof=1) - `h_sd`: standard deviation of human scores (ddof=1) - `N`: total number of responses
Return type: pandas Series
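A minimal sketch with hypothetical human and system scores:
import pandas as pd
from rsmtool.analyzer import Analyzer
human = pd.Series([1, 2, 3, 4, 3, 2])
system = pd.Series([1.3, 2.1, 2.8, 3.9, 3.2, 2.4])
metrics = Analyzer.metrics_helper(human, system)
print(metrics['corr'], metrics['RMSE'])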
- run_data_composition_analyses_for_rsmeval(data_container, configuration)[source]¶
Similar to run_data_composition_analyses_for_rsmtool() but for rsmeval.
Parameters: - data_container (container.DataContainer) – The DataContainer object. This container must include the following DataFrames: {‘test_metadata’, ‘test_excluded’}
- configuration (configuration_parser.Configuration) –
The Configuration object. This configuration object must include the following parameters (keys)
{'subgroups', 'candidate_column', 'exclude_zero_scores', 'exclude_listwise'}
Returns: - data_container (container.DataContainer) – A new DataContainer object with the following DataFrames: test_excluded_composition, data_composition, data_composition_by_*.
- configuration (configuration_parser.Configuration) – A new Configuration object.
- run_data_composition_analyses_for_rsmtool(data_container, configuration)[source]¶
Run all data composition analyses for RSMTool.
Parameters: - data_container (container.DataContainer) – The DataContainer object. This container must include the following DataFrames: {'test_metadata', 'train_metadata', 'train_excluded', 'test_excluded', 'train_features'}
- configuration (configuration_parser.Configuration) –
The Configuration object. This configuration object must include the following parameters (keys)
{'subgroups', 'candidate_column', 'exclude_zero_scores', 'exclude_listwise'}
Returns: - data_container (container.DataContainer) – A new DataContainer object with the following DataFrames: test_excluded_composition, train_excluded_composition, data_composition, data_composition_by_*.
- configuration (configuration_parser.Configuration) – A new Configuration object.
- run_prediction_analyses(data_container, configuration)[source]¶
Run all the analyses on the machine predictions.
Parameters: - data_container (container.DataContainer) – The DataContainer object. This container must include the following DataFrames: {'train_features', 'train_metadata', 'train_preprocessed_features', 'train_length'}
- configuration (configuration_parser.Configuration) –
The Configuration object. This configuration object must include the following parameters (keys)
{'subgroups', 'second_human_score_column', 'use_scaled_predictions'}
Returns: - data_container (container.DataContainer) – A new DataContainer object with the following DataFrames: eval, eval_short, consistency, degradation, disattenuated_correlations, confMatrix, score_dist, eval_by_*, consistency_by_*, disattenuated_correlations_by_*.
- configuration (configuration_parser.Configuration) – A new Configuration object.
- run_training_analyses(data_container, configuration)[source]¶
Run all of the analyses on the training data.
Parameters: - data_container (container.DataContainer) – The DataContainer object. This container must include the following DataFrames: {'train_features', 'train_metadata', 'train_preprocessed_features', 'train_length'}
- configuration (configuration_parser.Configuration) –
The Configuration object. This configuration object must include the following parameters (keys)
{'length_column', 'subgroups', 'selected_features'}
Returns: - data_container (container.DataContainer) – A new DataContainer object with the following DataFrames:
- feature_descriptives
- feature_descriptivesExtra
- feature_outliers
- cors_orig
- cors_processed
- margcor_score_all_data
- pcor_score_all_data
- pcor_score_no_length_all_data
- margcor_length_all_data
- pcor_length_all_data
- pca
- pcavar
- margcor_length_by_*
- pcor_length_by_*
- margcor_score_by_*
- pcor_score_by_*
- pcor_score_no_length_by_*
- configuration (configuration_parser.Configuration) – A new Configuration object.
From comparer Module¶
Classes for comparing outputs of two RSMTool experiments.
author: Jeremy Biggs (jbiggs@ets.org)
author: Anastassia Loukina (aloukina@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
date: 10/25/2017
organization: ETS
- class rsmtool.comparer.Comparer[source]¶
Bases: object
A class to perform comparisons between two RSMTool experiments.
- static compute_correlations_between_versions(df_old, df_new, human_score='sc1', id_column='spkitemid')[source]¶
Compute correlations between respective feature values in the two given frames, as well as the correlations between each feature value and the human scores.
Parameters: - df_old (pandas DataFrame) – Data frame with feature values for the 'old' model.
- df_new (pandas DataFrame) – Data frame with feature values for the 'new' model.
- human_score (str, optional) – Name of the column containing the human score. Defaults to sc1. Must be the same for both data sets.
- id_column (str, optional) – Name of the column containing the ID for each response. Defaults to spkitemid. Must be the same for both data sets.
Returns: df_correlations – Data frame with a row for each feature and the following columns:
- N: total number of responses - human_old: correlation with human score in the old frame - human_new: correlation with human score in the new frame - old_new: correlation between old and new frames
Return type: pandas DataFrame
Raises: ValueError – If there are no shared features between the two sets or if there are no shared responses between the two sets.
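A minimal sketch with hypothetical IDs, scores, and a single hypothetical feature column shared by both frames:
import pandas as pd
from rsmtool.comparer import Comparer
df_old = pd.DataFrame({'spkitemid': ['r1', 'r2', 'r3', 'r4'],
                       'sc1': [2, 3, 4, 3],
                       'GRAMMAR': [0.21, 0.40, 0.55, 0.36]})
df_new = pd.DataFrame({'spkitemid': ['r1', 'r2', 'r3', 'r4'],
                       'sc1': [2, 3, 4, 3],
                       'GRAMMAR': [0.24, 0.38, 0.52, 0.39]})
df_correlations = Comparer.compute_correlations_between_versions(df_old, df_new)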
- load_rsmtool_output(filedir, figdir, experiment_id, prefix, groups_eval)[source]¶
Load all of the outputs of an rsmtool experiment. For each type of output, we first check whether the file exists to allow comparing experiments with different sets of outputs.
Parameters: - filedir (str) – Path to the directory containing output files.
- figdir (str) – Path to the directory containing output figures.
- experiment_id (str) – Original experiment_id used to generate the output files.
- prefix (str) – Must be set to scale or raw. Indicates whether the score is scaled or not.
- groups_eval (list) – List of subgroup names used for subgroup evaluation.
Returns: - files (dict) – A dictionary with outputs converted to pandas data frames. If a particular type of output did not exist for the experiment, its value will be an empty data frame.
- figs (dict) – A dictionary with experiment figures.
- static make_summary_stat_df(df)[source]¶
Compute summary statistics for the data in the given frame.
Parameters: df (pandas DataFrame) – Data frame containing numeric data. Returns: res – Data frame containing summary statistics for data in the input frame. Return type: pandas DataFrame
- static process_confusion_matrix(conf_matrix)[source]¶
Process confusion matrix to add 'human' and 'machine' to column names.
Parameters: conf_matrix (pandas DataFrame) – Data frame containing the confusion matrix. Returns: conf_matrix_renamed – Data frame containing the confusion matrix with the columns renamed. Return type: pandas DataFrame
From configuration_parser Module¶
Classes related to parsing configuration files and creating configuration objects.
author: Jeremy Biggs (jbiggs@ets.org)
author: Anastassia Loukina (aloukina@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
date: 10/25/2017
organization: ETS
- class rsmtool.configuration_parser.CFGConfigurationParser[source]¶
Bases: rsmtool.configuration_parser.ConfigurationParser
A subclass of ConfigurationParser for parsing Microsoft INI-style config files.
- class rsmtool.configuration_parser.Configuration(config_dict, filepath=None, context='rsmtool')[source]¶
Bases: object
Configuration class, which encapsulates all of the configuration parameters and methods to access these parameters.
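A minimal sketch of constructing and querying a Configuration object directly; the field values below are hypothetical, and a complete experiment configuration must contain all required fields (see validate_config() below):
from rsmtool.configuration_parser import Configuration
config_dict = {'experiment_id': 'test_api',
               'model': 'LinearRegression',
               'train_file': 'train.csv',
               'test_file': 'test.csv'}
config = Configuration(config_dict)
print(config.get('experiment_id'))           # 'test_api'
print(config.get('missing_key', 'default'))  # 'default'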
- check_exclude_listwise()[source]¶
Check if we are excluding candidates based on the number of responses, and add this to the configuration file.
Returns: exclude_listwise – Whether to exclude list-wise. Return type: bool
- check_flag_column(flag_column='flag_column')[source]¶
Make sure the flag_column field is in the correct format. Get flag columns and values for filtering, if any, and convert single values to lists. Raises an exception if flag_column is not correctly specified.
Parameters: flag_column ({'flag_column', 'flag_column_test'}, optional) – The flag column field to check. Defaults to 'flag_column'.
Returns: new_filtering_dict (dict) – Properly formatted flag_column dictionary.
Raises: ValueError – If the flag_column is not a dictionary.
- context¶
Get the context.
- filepath¶
Get the file path.
Returns: filepath – The path for the config file. Return type: str
- get(key, default=None)[source]¶
Get the value for a given key, or the default if the key does not exist.
Parameters: - key (str) – Key to check in the Configuration object.
- default (optional) – The default value to return if no key exists. Defaults to None.
Returns: value – The value in the Configuration object dictionary.
- get_default_converter()[source]¶
Get the default converter dictionary for the reader.
Returns: default_converter – The default converter for a train or test file. Return type: dict
- get_names_and_paths(keys, names)[source]¶
Get a list of values for the given keys, removing any values that are None.
Parameters: - keys (list) – A list of keys whose values to retrieve.
- names (list) – A list of names corresponding to the keys.
Returns: values – The list of values.
Return type: list
Raises: ValueError – If there are any duplicate keys or names.
- get_trim_min_max()[source]¶
Get the specified trim min and max, if any, and make sure they are numeric.
Returns: - spec_trim_min (float) – Specified trim min value.
- spec_trim_max (float) – Specified trim max value.
- items()[source]¶
Return items as a list of tuples.
Returns: items – A list of (key, value) tuples in the Configuration object. Return type: list of tuples
- keys()[source]¶
Return keys as a list.
Returns: keys – A list of keys in the Configuration object. Return type: list of str
- save(output_dir=None)[source]¶
Save the configuration file to the specified output directory.
Parameters: output_dir (str, optional) – The path to the output directory. Defaults to None.
- class rsmtool.configuration_parser.ConfigurationParser[source]¶
Bases: object
A ConfigurationParser class to create a Configuration object.
- static check_id_fields(id_field_values)[source]¶
Check whether the ID fields in the given dictionary are properly formatted, i.e., that they do not contain any spaces and are shorter than 200 characters.
Parameters: id_field_values (dict) – A dictionary containing the ID field names as the keys and the values from the configuration file as the values. Raises: ValueError – If the values for the ID fields in the given dictionary are not formatted correctly.
- classmethod get_configparser(filepath, *args, **kwargs)[source]¶
Get the correct ConfigurationParser object, based on the file extension.
Parameters: filepath (str) – The path to the configuration file. Returns: config – The configuration parser object. Return type: ConfigurationParser Raises: ValueError – If the config file is not .json or .cfg.
- load_config_from_dict(config_dict)[source]¶
Load the configuration from a dictionary.
Parameters: config_dict (dict) – A dictionary containing the configuration parameters to parse.
Raises: TypeError – If config_dict is not a dict. AttributeError – If the config has already been assigned.
- normalize_config(inplace=True)[source]¶
Normalize the field names in self._config in order to maintain backwards compatibility with old configuration files.
Parameters: inplace (bool) – Maintain the state of the config object produced by this method. Defaults to True. Returns: new_config – A normalized configuration object. Return type: Configuration Raises: ValueError – If no JSON configuration object exists, or if the value passed for use_scaled_predictions is in the wrong format.
- normalize_validate_and_process_config(context='rsmtool')[source]¶
Normalize, validate, and process data from a config file.
Parameters: context (str, optional) – Context of the tool in which we are validating. Possible values are {'rsmtool', 'rsmeval', 'rsmpredict', 'rsmcompare', 'rsmsummarize'}. Defaults to 'rsmtool'.
Returns: config_obj – A configuration object. Return type: Configuration Raises: NameError – If the config does not exist or no config was read.
- process_config(inplace=True)[source]¶
Convert fields that are read in as strings to the appropriate format. Fields that can take multiple string values are converted to lists if they have not already been formatted as such.
Parameters: inplace (bool) – Maintain the state of the config object produced by this method. Defaults to True. Returns: config_obj – A configuration object. Return type: Configuration Raises: NameError – If the config does not exist or no config was read.
- read_config_from_file(filepath)[source]¶
Read the configuration file.
Parameters: filepath (str) – The path to the configuration file. Raises: NotImplementedError – This method must be implemented in a subclass.
- read_normalize_validate_and_process_config(filepath, context='rsmtool')[source]¶
Read, normalize, validate, and process data from a config file.
Parameters: - filepath (str) – The path to the configuration file.
- context (str, optional) – Context of the tool in which we are validating. Possible values are {'rsmtool', 'rsmeval', 'rsmpredict', 'rsmcompare', 'rsmsummarize'}. Defaults to 'rsmtool'.
Returns: config_obj – A configuration object
Return type: Configuration
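A minimal sketch of going from a configuration file on disk to a processed Configuration object; 'experiment.json' is a hypothetical path:
from rsmtool.configuration_parser import ConfigurationParser
parser = ConfigurationParser.get_configparser('experiment.json')
config = parser.read_normalize_validate_and_process_config('experiment.json', context='rsmtool')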
- validate_config(context='rsmtool', inplace=True)[source]¶
Ensure that all required fields are specified, add default values for all unspecified fields, and ensure that all specified fields are valid.
Parameters: - context (str, optional) – Context of the tool in which we are validating. Possible values are {'rsmtool', 'rsmeval', 'rsmpredict', 'rsmcompare', 'rsmsummarize'}. Defaults to 'rsmtool'.
- inplace (bool) – Maintain the state of the config object produced by this method. Defaults to True.
Returns: config_obj – A configuration object
Return type: Configuration
Raises: ValueError – If the config does not exist and no config was passed.
- class rsmtool.configuration_parser.JSONConfigurationParser[source]¶
Bases: rsmtool.configuration_parser.ConfigurationParser
A subclass of ConfigurationParser for parsing JSON-style config files.
From container Module¶
Classes for storing any kind of data contained in a pd.DataFrame object.
author: Jeremy Biggs (jbiggs@ets.org)
author: Anastassia Loukina (aloukina@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
date: 10/25/2017
organization: ETS
- class rsmtool.container.DataContainer(datasets=None)[source]¶
Bases: object
A class to encapsulate datasets.
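A minimal sketch of building a container and retrieving a frame, assuming the dataset dictionaries use the {name, path, frame} keys described for to_datasets() below; the frame contents are hypothetical:
import pandas as pd
from rsmtool.container import DataContainer
df_feats = pd.DataFrame({'spkitemid': ['r1', 'r2'], 'FEATURE1': [0.5, 0.7]})
container = DataContainer([{'name': 'train_features', 'path': None, 'frame': df_feats}])
df = container.get_frame('train_features')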
- add_dataset(dataset_dict, update=False)[source]¶
Update or add a new DataFrame to the instance.
Parameters: - dataset_dict (dict) – The dataset dictionary to add.
- update (bool, optional) – Update an existing DataFrame if True. Defaults to False.
- copy(deep=True)[source]¶
Create a copy of the DataContainer object.
Parameters: deep (bool, optional) – If True, create a deep copy. Defaults to True.
- get_frame(key, default=None)[source]¶
Get the frame for a given key.
Parameters: - key (str) – Name for the data.
- default – The default value to return if the frame does not exist. Defaults to None.
Returns: frame – The DataFrame.
Return type: pd.DataFrame
- get_path(key, default=None)[source]¶
Get the path for a given key.
Parameters: key (str) – Name for the data. Returns: path – Path to the data. Return type: str
- items()[source]¶
Return items as a list of tuples.
Returns: items – A list of (key, value) tuples in the DataContainer object. Return type: list of tuples
- keys()[source]¶
Return keys as a list.
Returns: keys – A list of keys in the DataContainer object. Return type: list
- static to_datasets(data_container)[source]¶
Convert a DataContainer object to a list of dataset dictionaries with keys {name, path, frame}.
Parameters: data_container (DataContainer) – A DataContainer object. Returns: datasets_dict – A list of dataset dictionaries. Return type: list of dicts
From convert_feature_json Module¶
- rsmtool.convert_feature_json_file(json_file, output_file, delete=False)[source]¶
Convert the given feature JSON file into a tabular format inferred from the extension of the output file.
Parameters: - json_file (str) – Path to the feature JSON file to be converted.
- output_file (str) – Path to the CSV/TSV/XLS/XLSX output file.
- delete (bool, optional) – Whether to delete the original file after conversion. Defaults to False.
Raises: RuntimeError – If the given input file is not a valid feature JSON file or if the output file has an unsupported extension.
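A minimal sketch with hypothetical file names; the output format is inferred from the output file's extension:
from rsmtool import convert_feature_json_file
convert_feature_json_file('features.json', 'features.csv', delete=False)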
From modeler Module¶
Classes for training built-in or SKLL models and for making predictions on new data.
author: Jeremy Biggs (jbiggs@ets.org)
author: Anastassia Loukina (aloukina@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
date: 10/25/2017
organization: ETS
- class rsmtool.modeler.Modeler[source]¶
Bases: object
A class for training and predicting with either built-in or SKLL models. Also provides helper functions for generating predictions on train and test datasets.
- static create_fake_skll_learner(df_coefficients)[source]¶
Create a fake SKLL linear regression learner object using the coefficients in the given data frame.
Parameters: df_coefficients (pandas DataFrame) – Data frame containing the linear coefficients with which to create the fake SKLL model. Returns: learner – SKLL LinearRegression Learner object containing the specified coefficients. Return type: skll Learner object
- get_coefficients()[source]¶
Get the coefficients of the model, if available.
Returns: coefficients – The coefficients of the model. Return type: np.array or None
- get_feature_names()[source]¶
Get the feature names, if available.
Returns: feature_names – A list of feature names, or None if no learner was trained. Return type: list or None
- get_intercept()[source]¶
Get the intercept of the model, if available.
Returns: intercept – The intercept of the model. Return type: float or None
- classmethod load_from_file(model_path)[source]¶
Load a Modeler object from a file.
Parameters: model_path (str) – The path to the model. Returns: model – A Modeler instance. Return type: Modeler Raises: ValueError – If the model_path does not end with '.model'.
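A minimal sketch; the path below is hypothetical and must point to a .model file saved by a previous experiment:
from rsmtool.modeler import Modeler
modeler = Modeler.load_from_file('output/my_experiment.model')
print(modeler.get_feature_names())
print(modeler.get_intercept())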
- classmethod load_from_learner(learner)[source]¶
Load a Modeler object from an existing SKLL Learner object.
Parameters: learner (SKLL.Learner) – A SKLL Learner object. Returns: modeler – A Modeler instance. Return type: Modeler Raises: TypeError – If learner is not a SKLL.Learner instance.
- static model_fit_to_dataframe(fit)[source]¶
Take an object containing a statsmodels OLS model fit and extract the main model fit metrics into a data frame.
Parameters: fit (a statsmodels fit object) – Model fit object obtained from a linear model trained using statsmodels.OLS. Returns: df_fit – Data frame with the main model fit metrics. Return type: pandas DataFrame
- static ols_coefficients_to_dataframe(coefs)[source]¶
Take a series containing OLS coefficients and convert it to a data frame.
Parameters: coefs (pandas Series) – Series with feature names in the index and the coefficient values as the data, obtained from a linear model trained using statsmodels.OLS. Returns: df_coef – Data frame with two columns, the first being the feature name and the second being the coefficient value. Return type: pandas DataFrame Note
The first row in the output data frame is always for the intercept and the rest are sorted by feature name.
- predict(df, min_score, max_score, predict_expected=False)[source]¶
Get the raw predictions of the given SKLL model on the data contained in the given data frame.
Parameters: - df (pandas DataFrame) – Data frame containing features on which to make the predictions. The data must contain pre-processed feature values, an ID column named spkitemid, and a label column named sc1.
- min_score (int) – Minimum score level to be used if computing expected scores.
- max_score (int) – Maximum score level to be used if computing expected scores.
- predict_expected (bool, optional) – Predict expected scores for classifiers that return probability distributions over score. This will be ignored with a warning if the specified model does not support probability distributions. Note also that this assumes that the score range consists of contiguous integers - starting at min_score and ending at max_score. Defaults to False.
Returns: df_predictions – Data frame containing the raw predictions, the IDs, and the human scores.
Return type: pandas DataFrame
Raises: ValueError – If the model cannot predict probability distributions and predict_expected is set to True, or if the score range specified by min_score and max_score does not match what the model predicts in its probability distribution.
- predict_train_and_test(df_train, df_test, configuration)[source]¶
Generate raw, scaled, and trimmed predictions of the model on the given training and testing data.
Parameters: - df_train (pandas DataFrame) – Data frame containing the pre-processed training set features.
- df_test (pandas DataFrame) – Data frame containing the pre-processed test set features.
- configuration (configuration_parser.Configuration) – A configuration object containing trim_max and trim_min.
Returns: List of data frames containing predictions and other information.
- scale_coefficients(configuration)[source]¶
Scale coefficients and intercept using human scores and model predictions on the training set. This procedure approximates what is done in an operational setting but does not apply trimming to predictions.
Parameters: configuration (configuration_parser.Configuration) – A configuration object containing train_predictions_mean, train_predictions_sd, and human_labels_sd. Returns: data_container – A DataContainer object containing coefficients_scaled. This DataFrame contains the scaled coefficients and the feature names, along with the intercept. Return type: container.DataContainer
- static skll_learner_params_to_dataframe(learner)[source]¶
Take the given SKLL learner object and return a data frame containing its parameters.
Parameters: learner (SKLL.Learner) – A SKLL learner object. Returns: df_coef – A data frame containing the model parameters from the given SKLL learner object. Return type: pandas DataFrame Note
1. We use the underlying sklearn model object to get at the coefficients and the intercept, because the model_params attribute of the SKLL model ignores zero coefficients, which we do not want.
2. The first row in the output data frame is always for the intercept and the rest are sorted by feature name.
- train(configuration, data_container, filedir, figdir, file_format='csv')[source]¶
The main driver function to train the given model on the given data and save the results in the given directories, using the given experiment ID as the prefix.
Parameters: - configuration (configuration_parser.Configuration) – A configuration object containing experiment_id and model_name.
- data_container (container.DataContainer) – A DataContainer object containing train_preprocessed_features.
- filedir (str) – Path to the experiment output directory.
- figdir (str) – Path to the experiment figure directory.
- file_format ({'csv', 'tsv', 'xlsx'}, optional) – The format in which to save files. Defaults to ‘csv’.
Returns: learner
Return type: SKLL Learner object
- train_builtin_model(model_name, df_train, experiment_id, filedir, figdir, file_format='csv')[source]¶
Train one of the built-in linear regression models.
Parameters: - model_name (str) – Name of the built-in model to train.
- df_train (pandas DataFrame) – Data frame containing the features on which to train the model. The data frame must contain the ID column named spkitemid and the numeric label column named sc1.
- experiment_id (str) – The experiment ID.
- filedir (str) – Path to the experiment output directory.
- figdir (str) – Path to the experiment figure directory.
- file_format ({'csv', 'tsv', 'xlsx'}, optional) – The format in which to save files. Defaults to ‘csv’.
Returns: learner – SKLL LinearRegression Learner object containing the coefficients learned by training the built-in model.
Return type: Learner object
- train_equal_weights_lr(df_train, feature_columns)[source]¶
Train EqualWeightsLR (formerly eqWt): all features get equal weight.
Parameters: - df_train (pd.DataFrame) – Data frame containing the features on which to train the model.
- feature_columns (list) – A list of feature columns to use in training the model.
Returns: - learner (skll.Learner) – The SKLL learner object
- fit (statsmodels.RegressionResults) – A statsmodels regression results object.
- df_coef (pd.DataFrame) – The model coefficients in a data_frame
- used_features (list) – A list of features used in the final model.
- train_lasso_fixed_lambda(df_train, feature_columns)[source]¶
Train LassoFixedLambda (formerly lassoWtLasso): a Lasso model with a fixed lambda.
Parameters: - df_train (pd.DataFrame) – Data frame containing the features on which to train the model.
- feature_columns (list) – A list of feature columns to use in training the model.
Returns: - learner (skll.Learner) – The SKLL learner object
- fit (statsmodels.RegressionResults) – A statsmodels regression results object or None.
- df_coef (pd.DataFrame) – The model coefficients in a data_frame
- used_features (list) – A list of features used in the final model.
- train_lasso_fixed_lambda_then_lr(df_train, feature_columns)[source]¶
Train LassoFixedLambdaThenLR (formerly empWtLasso): first do feature selection using lasso regression with a fixed lambda and then use only those features to train a second linear regression.
Parameters: - df_train (pd.DataFrame) – Data frame containing the features on which to train the model.
- feature_columns (list) – A list of feature columns to use in training the model.
Returns: - learner (skll.Learner) – The SKLL learner object
- fit (statsmodels.RegressionResults) – A statsmodels regression results object.
- df_coef (pd.DataFrame) – The model coefficients in a data_frame
- used_features (list) – A list of features used in the final model.
- train_lasso_fixed_lambda_then_non_negative_lr(df_train, feature_columns)[source]¶
Train LassoFixedLambdaThenNNLR (formerly empWtDropNegLasso): first do feature selection using lasso regression and positive-only weights, then fit an NNLR (see below) on those features.
Parameters: - df_train (pd.DataFrame) – Data frame containing the features on which to train the model.
- feature_columns (list) – A list of feature columns to use in training the model.
Returns: - learner (skll.Learner) – The SKLL learner object
- fit (statsmodels.RegressionResults) – A statsmodels regression results object.
- df_coef (pd.DataFrame) – The model coefficients in a data_frame
- used_features (list) – A list of features used in the final model.
- train_linear_regression(df_train, feature_columns)[source]¶
Train LinearRegression (formerly empWt): a simple linear regression model.
Parameters: - df_train (pd.DataFrame) – Data frame containing the features on which to train the model.
- feature_columns (list) – A list of feature columns to use in training the model.
Returns: - learner (skll.Learner) – The SKLL learner object
- fit (statsmodels.RegressionResults) – A statsmodels regression results object.
- df_coef (pd.DataFrame) – The model coefficients in a data_frame
- used_features (list) – A list of features used in the final model.
- train_non_negative_lr(df_train, feature_columns)[source]¶
Train NNLR (formerly empWtNNLS): first do feature selection using non-negative least squares (NNLS) and then use only its non-zero features to train a regular linear regression. We do the regular LR at the end since we want an LR object so that we have access to R^2 and other useful statistics. There should be no difference between the non-zero coefficients from NNLS and the coefficients that end up coming out of the subsequent LR.
Parameters: - df_train (pd.DataFrame) – Data frame containing the features on which to train the model.
- feature_columns (list) – A list of feature columns to use in training the model.
Returns: - learner (skll.Learner) – The SKLL learner object
- fit (statsmodels.RegressionResults) – A statsmodels regression results object.
- df_coef (pd.DataFrame) – The model coefficients in a data_frame
- used_features (list) – A list of features used in the final model.
- train_positive_lasso_cv(df_train, feature_columns)[source]¶
Train PositiveLassoCV (formerly lassoWtLassoBest): feature selection using lasso regression optimized for log likelihood using cross-validation.
Parameters: - df_train (pd.DataFrame) – Data frame containing the features on which to train the model.
- feature_columns (list) – A list of feature columns to use in training the model.
Returns: - learner (skll.Learner) – The SKLL learner object
- fit (statsmodels.RegressionResults) – A statsmodels regression results object or None.
- df_coef (pd.DataFrame) – The model coefficients in a data_frame
- used_features (list) – A list of features used in the final model.
- train_positive_lasso_cv_then_lr(df_train, feature_columns)[source]¶
Train PositiveLassoCVThenLR (formerly empWtLassoBest): first do feature selection using lasso regression optimized for log likelihood using cross-validation and then use only those features to train a second linear regression.
Parameters: - df_train (pd.DataFrame) – Data frame containing the features on which to train the model.
- feature_columns (list) – A list of feature columns to use in training the model.
Returns: - learner (skll.Learner) – The SKLL learner object
- fit (statsmodels.RegressionResults) – A statsmodels regression results object.
- df_coef (pd.DataFrame) – The model coefficients in a data_frame
- used_features (list) – A list of features used in the final model.
- train_rebalanced_lr(df_train, feature_columns)[source]¶
Train RebalancedLR (formerly empWtBalanced): balanced empirical weights obtained by changing betas [adapted from http://bit.ly/UTP7gS].
Parameters: - df_train (pd.DataFrame) – Data frame containing the features on which to train the model.
- feature_columns (list) – A list of feature columns to use in training the model.
Returns: - learner (skll.Learner) – The SKLL learner object
- fit (statsmodels.RegressionResults) – A statsmodels regression results object.
- df_coef (pd.DataFrame) – The model coefficients in a data_frame
- used_features (list) – A list of features used in the final model.
- train_score_weighted_lr(df_train, feature_columns)[source]¶
Train ScoreWeightedLR: a linear regression model weighted by score.
Parameters: - df_train (pd.DataFrame) – Data frame containing the features on which to train the model.
- feature_columns (list) – A list of feature columns to use in training the model.
Returns: - learner (skll.Learner) – The SKLL learner object
- fit (statsmodels.RegressionResults) – A statsmodels regression results object or None.
- df_coef (pd.DataFrame) – The model coefficients in a data_frame
- used_features (list) – A list of features used in the final model.
- train_skll_model(model_name, df_train, experiment_id, filedir, figdir, file_format='csv', custom_objective=None, predict_expected_scores=False)[source]¶
Train a SKLL classification or regression model.
Parameters: - model_name (str) – Name of the SKLL model to train.
- df_train (pandas DataFrame) – Data frame containing the features on which to train the model.
- experiment_id (str) – The experiment ID.
- filedir (str) – Path to the output experiment output directory.
- figdir (str) – Path to the figure experiment output directory.
- file_format ({'csv', 'tsv', 'xlsx'}, optional) – The format in which to save files. For SKLL models, this argument does not actually change the format of the output files at this time, as no betas are computed. Defaults to ‘csv’.
- custom_objective (str, optional) – Name of custom user-specified objective. If not specified or None, neg_mean_squared_error is used as the objective. Defaults to None.
- predict_expected_scores (bool, optional) – Whether we want the trained classifiers to predict expected scores. Defaults to False.
Returns: A tuple containing a SKLL Learner object of the appropriate type and the chosen tuning objective.
From preprocessor Module¶
Classes for preprocessing input data in various contexts.
author: Jeremy Biggs (jbiggs@ets.org)
author: Anastassia Loukina (aloukina@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
date: 10/25/2017
organization: ETS
- class rsmtool.preprocessor.FeaturePreprocessor[source]¶
Bases: object
A class to pre-process training and testing features.
- static check_model_name(model_name)[source]¶
Check that the given model name is valid and determine its type.
Parameters: model_name (str) – Name of the model. Returns: model_type – One of BUILTIN or SKLL. Return type: str Raises: ValueError – If the model is not supported.
- static check_subgroups(df, subgroups)[source]¶
Check that all subgroups, if specified, correspond to columns in the provided data frame, and replace all NaNs in subgroup values with 'No info' for later convenience. Raises an exception if any specified subgroup columns are missing.
Parameters: - df (pd.DataFrame) – DataFrame with subgroups to check.
- subgroups (list of str) – List of column names that contain grouping information.
Returns: df – Modified input data frame with NaNs replaced.
Return type: pandas DataFrame
Raises: KeyError – If the data does not contain columns for all specified subgroups.
- filter_data(df, label_column, id_column, length_column, second_human_score_column, candidate_column, requested_feature_names, reserved_column_names, given_trim_min, given_trim_max, flag_column_dict, subgroups, exclude_zero_scores=True, exclude_zero_sd=False, feature_subset_specs=None, feature_subset=None, min_candidate_items=None, use_fake_labels=False)[source]¶
Filter the data to remove rows that have zero/non-numeric values for label_column. If feature names are specified in requested_feature_names, check whether any requested features are missing from the data. If no feature names are specified, they are generated based on the column names and subset information, if available. The function then excludes non-numeric values for any feature. If the user requested to exclude candidates with fewer than min_candidate_items responses, such candidates are excluded. It also generates fake labels between 1 and 10 if use_fake_labels is set to True. Finally, it renames the ID and label columns and splits the data into the data frame with feature values and score labels, the data frame with information about subgroups and candidates (metadata), and the data frame with all other columns.
Parameters: - df (pd.DataFrame) – The DataFrame to filter.
- label_column (str) – The label column in the data.
- id_column (str) – The ID column in the data.
- length_column (str) – The length column in the data.
- second_human_score_column (str) – The second human score column in the data.
- candidate_column (str) – The candidate column in the data.
- requested_feature_names (list) – A list of requested feature names.
- reserved_column_names (list) – A list of reserved column names.
- given_trim_min (int) – The minimum trim value.
- given_trim_max (int) – The maximum trim value.
- flag_column_dict (dict) – A dictionary of flag columns.
- subgroups (list, optional) – A list of subgroups, if any.
- exclude_zero_scores (bool) – Whether to exclude zero scores. Defaults to True.
- exclude_zero_sd (bool, optional) – Whether to exclude zero standard deviation. Defaults to False.
- feature_subset_specs (pd.DataFrame, optional) – The feature_subset_specs DataFrame. Defaults to None.
- feature_subset (str, optional) – The feature subset group (e.g. 'A'). Defaults to None.
- min_candidate_items (int, optional) – The minimum number of items needed to include a candidate. Defaults to None.
- use_fake_labels (bool, optional) – Whether to use fake labels. Defaults to False.
Returns: - df_filtered_features (pd.DataFrame) – DataFrame with filtered features
- df_filtered_metadata (pd.DataFrame) – DataFrame with filtered metadata
- df_filtered_other_columns (pd.DataFrame) – DataFrame with other columns filtered
- df_excluded (pd.DataFrame) – DataFrame with excluded records
- df_filtered_length (pd.DataFrame) – DataFrame with length column(s) filtered
- df_filtered_human_scores (pd.DataFrame) – DataFrame with human scores filtered
- df_responses_with_excluded_flags (pd.DataFrame) – A DataFrame containing responses with excluded flags
- trim_min (float) – The minimum trim value
- trim_max (float) – The maximum trim value
- feature_names (list) – A list of feature names
- static filter_on_column(df, column, id_column, exclude_zeros=False, exclude_zero_sd=False)[source]¶
Filter out the rows in the given data frame that contain non-numeric (or zero, if specified) values in the specified column. Additionally, it may exclude any columns if they have a standard deviation (\(\sigma\)) of 0.
Parameters: - df (pd.DataFrame) – The DataFrame to filter on.
- column (str) – Name of the column from which to filter out values.
- id_column (str) – Name of the column containing the unique response IDs.
- exclude_zeros (bool, optional) – Whether to exclude responses containing zeros in the specified column. Defaults to False.
- exclude_zero_sd (bool, optional) – Whether to perform the additional filtering step of removing columns that have \(\sigma = 0\). Defaults to False.
Returns: - df_filtered (pandas DataFrame) – Data frame containing the responses that were not filtered out.
- df_excluded (pandas DataFrame) – Data frame containing the non-numeric or zero responses that were filtered out.
Note
The columns with \(\sigma=0\) are removed from both output data frames.
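For example, here is a minimal sketch of how this static method could be called; the data frame and values below are purely illustrative:
import pandas as pd
from rsmtool.preprocessor import FeaturePreprocessor

df = pd.DataFrame({'spkitemid': ['r1', 'r2', 'r3', 'r4'],
                   'grammar': [1.2, 0.0, 'N/A', 3.4]})
df_kept, df_excluded = FeaturePreprocessor.filter_on_column(df, 'grammar', 'spkitemid', exclude_zeros=True)
# 'r2' (zero) and 'r3' (non-numeric) should end up in df_excluded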
-
filter_on_flag_columns
(df, flag_column_dict)[source]¶ Check that all flag_columns are present in the given data frame, convert these columns to strings and filter out the values which do not match the condition in flag_column_dict.
Parameters: - df (pd.DataFrame) – The DataFrame to filter on.
- flag_column_dict (dict) – Dictionary containing the flag column information.
Returns: - df_responses_with_requested_flags (pandas DataFrame) – Data frame containing the responses remaining after filtering using the specified flag columns.
- df_responses_with_excluded_flags (pandas DataFrame) – Data frame containing the responses filtered out using the specified flag columns.
Raises: KeyError
– If the columns listed in the dictionary are not actually present in the data frame.
ValueError
– If no responses remain after filtering based on the flag column information.
-
generate_feature_names
(df, reserved_column_names, feature_subset_specs, feature_subset)[source]¶ Generate the feature names from the column names of the given data frame and select the specified subset of features.
Parameters: - df (pd.DataFrame) – The DataFrame from which to generate feature names.
- reserved_column_names (list) – Names of reserved columns.
- feature_subset_specs (pd.DataFrame) – Feature subset specs
- feature_subset (str) – Feature subset column.
Returns: feature_names – A list of feature names.
Return type: list
-
preprocess_feature
(values, feature_name, feature_transform, feature_mean, feature_sd, exclude_zero_sd=False, raise_error=True)[source]¶ Remove outliers and transform the values in the given numpy array using the given outlier and transformation parameters. The values are assumed to be for the feature with the given name.
Parameters: - values (np.array) – The feature values to preprocess.
- feature_name (str) – Name of the feature being pre-processed.
- feature_transform (str) – Name of the transformation function to apply.
- feature_mean (float) – Mean value to use for outlier detection instead of the mean of the given feature values.
- feature_sd (float) – Std. dev. value to use for outlier detection instead of the std. dev. of the given feature values.
- exclude_zero_sd (bool, optional) – Whether to check if the feature values have zero std. dev. Defaults to False.
- raise_error (bool, optional) – Raise an error if any of the transformations lead to inf values or may change the ranking of feature values. Defaults to True.
Returns: transformed_feature – Numpy array containing the transformed and clamped feature values.
Return type: numpy array
Raises: ValueError
– If the given values have zero standard deviation and exclude_zero_sd is set to True.
-
preprocess_features
(df_train, df_test, df_feature_specs, standardize_features=True)[source]¶ Pre-process those features in the given training and testing data frames whose specifications are contained in df_feature_specs. Also return a third data frame containing the feature information.
Parameters: - df_train (pandas DataFrame) – Data frame containing the raw feature values for the training set.
- df_test (pandas DataFrame) – Data frame containing the raw feature values for the test set.
- df_feature_specs (pandas DataFrame) – Data frame containing the various specifications from the feature file.
- standardize_features (bool, optional) – Whether to standardize the features. Defaults to True.
Returns: - df_train_preprocessed (pd.DataFrame) – DataFrame with preprocessed training data
- df_test_preprocessed (pd.DataFrame) – DataFrame with preprocessed test data
- df_feature_info (pd.DataFrame) – DataFrame with feature information
-
preprocess_new_data
(df_input, df_feature_info, standardize_features=True)[source]¶ Process a data frame with feature values by applying preprocessing parameters stored in df_feature_info.
Parameters: - df_input (pandas DataFrame) – Data frame with raw feature values that will be used to generate the scores. Each feature is stored in a separate column. Each row corresponds to one response. There should also be a column named spkitemid containing a unique ID for each response.
- df_feature_info (pandas DataFrame) –
Data frame with preprocessing parameters stored in the following columns:
- feature : the name of the feature; should match the feature names in df_input.
- sign : 1 or -1; indicates whether the feature value needs to be multiplied by -1.
- transform : transformation that needs to be applied to this feature.
- train_mean, train_sd : mean and standard deviation for outlier truncation.
- train_transformed_mean, train_transformed_sd : mean and standard deviation for computing z-scores.
- standardize_features (bool, optional) – Whether the features should be standardized prior to prediction. Defaults to True.
Returns: - df_features_preprocessed (pd.DataFrame) – Data frame with processed feature values
- df_excluded (pd.DataFrame) – Data frame with responses excluded from further analysis due to non-numeric feature values in the original file or after applying transformations. The data frame always contains the original feature values.
Raises: KeyError
– If some of the features specified in df_feature_info are not present in df_input.
ValueError
– If all responses have at least one non-numeric feature value and therefore no score can be generated for any of the responses.
-
process_data
(config_obj, data_container_obj, context='rsmtool')[source]¶ Process the data for the given context.
Parameters: - config_obj (configuration_parser.Configuration) – A configuration object.
- data_container_obj (container.DataContainer) – A data container object.
- context ({'rsmtool', 'rsmeval', 'rsmpredict'}) – The context of the tool.
Returns: - config_obj (configuration_parser.Configuration) – A new configuration object.
- data_container (container.DataContainer) – A new data container object.
Raises: ValueError
– If the context is not in {‘rsmtool’, ‘rsmeval’, ‘rsmpredict’}.
-
process_data_rsmeval
(config_obj, data_container_obj)[source]¶ The main function that sets up the experiment by loading the training and evaluation data sets and preprocessing them, raising appropriate exceptions where necessary.
Parameters: - config_obj (configuration_parser.Configuration) – A configuration object.
- data_container_obj (container.DataContainer) – A data container object.
Returns: - config_obj (configuration_parser.Configuration) – A new configuration object.
- data_container (container.DataContainer) – A new data container object.
Raises: ValueError
-
process_data_rsmpredict
(config_obj, data_container_obj)[source]¶ Process the data for rsmpredict.
Parameters: - config_obj (configuration_parser.Configuration) – A configuration object.
- data_container_obj (container.DataContainer) – A data container object.
Returns: - config_obj (configuration_parser.Configuration) – A new configuration object.
- data_container (container.DataContainer) – A new data container object.
Raises: KeyError
– If columns in the config file do not exist in the data.
ValueError
– If the data contains duplicate response IDs.
-
process_data_rsmtool
(config_obj, data_container_obj)[source]¶ The main function that sets up the experiment by loading the training and evaluation data sets and preprocessing them, raising appropriate exceptions where necessary.
Parameters: - config_obj (configuration_parser.Configuration) – A configuration object.
- data_container_obj (container.DataContainer) – A data container object.
Returns: - config_obj (configuration_parser.Configuration) – A Configuration object.
- data_container (container.DataContainer) – A DataContainer object.
Raises: ValueError
– If the columns in the config file do not exist in the data.
-
static
process_predictions
(df_test_predictions, train_predictions_mean, train_predictions_sd, human_labels_mean, human_labels_sd, trim_min, trim_max)[source]¶ Process predictions to create scaled, trimmed and rounded predictions.
Parameters: - df_test_predictions (pd.DataFrame) – Data frame containing the test set predictions.
- train_predictions_mean (float) – The mean of the predictions on the training set.
- train_predictions_sd (float) – The std. dev. of the predictions on the training set.
- human_labels_mean (float) – The mean of the human scores used to train the model.
- human_labels_sd (float) – The std. dev. of the human scores used to train the model.
- trim_min (float) – The lowest score on the scoring scale, used for trimming the raw regression predictions.
- trim_max (float) – The highest score on the scoring scale, used for trimming the raw regression predictions.
Returns: df_pred_processed – Data frame containing the various trimmed and rounded predictions.
Return type: pd.DataFrame
-
static
remove_outliers
(values, mean=None, sd=None, sd_multiplier=4)[source]¶ Clamp any values in the given numpy array that are more than sd_multiplier (\(m\)) standard deviations (\(\sigma\)) away from the mean (\(\mu\)). Use the given mean and sd instead of computing \(\mu\) and \(\sigma\), if specified. The values are clamped to the interval \([\mu - m\sigma, \mu + m\sigma]\).
Parameters: - values (np.array) – The values from which to remove outliers.
- mean (int or float, optional) – Use the given mean value when computing outliers instead of the mean from the data. Defaults to None.
- sd (float, optional) – Use the given std. dev. value when computing outliers instead of the std. dev. from the data. Defaults to None.
- sd_multiplier (int, optional) – Use the given multiplier for the std. dev. when computing the outliers. Defaults to 4.
Returns: new_values – Numpy array with the outliers clamped.
Return type: np.array
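A minimal sketch of clamping an artificial outlier; the values below are illustrative:
import numpy as np
from rsmtool.preprocessor import FeaturePreprocessor

values = np.random.normal(0, 1, 1000)
values[0] = 25.0  # an artificial outlier
clamped = FeaturePreprocessor.remove_outliers(values)
# clamped[0] should now be pulled back to roughly mean + 4 * sd of the data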
-
static
rename_default_columns
(df, requested_feature_names, id_column, first_human_score_column, second_human_score_column, length_column, system_score_column, candidate_column)[source]¶ Standardize all column names and rename all columns with default names to ##NAME##.
Parameters: - df (pd.DataFrame) – The DataFrame whose columns to rename.
- requested_feature_names (list) – List of feature column names that we want to include in the scoring model.
- id_column (str) – Column name containing the response IDs.
- first_human_score_column (str or None) – Column name containing the H1 scores.
- second_human_score_column (str or None) – Column name containing the H2 scores. Should be None if no H2 scores are available.
- length_column (str or None) – Column name containing response lengths. Should be None if lengths are not available.
- system_score_column (str) – Column name containing the score predicted by the system. This is only used for RSMEval.
- candidate_column (str or None) – Column name containing identifying information at the candidate level. Should be None if such information is not available.
Returns: df – Modified input data frame with all the appropriate renamings.
Return type: pandas DataFrame
-
static
select_candidates
(df, N, candidate_col='candidate')[source]¶ Select only those candidates who have responses to N or more items.
Parameters: - df (pd.DataFrame) – The DataFrame from which to select candidates with N or more items.
- N (int) – Minimal number of items per candidate.
- candidate_col (str, optional) – Name of the column which contains candidate IDs. Defaults to ‘candidate’.
Returns: - df_included (pandas DataFrame) – Data frame with responses from candidates who responded to N or more items
- df_excluded (pandas DataFrame) – Data frame with responses from candidates who responded to fewer than N items
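For example, a minimal sketch with illustrative data:
import pandas as pd
from rsmtool.preprocessor import FeaturePreprocessor

df = pd.DataFrame({'candidate': ['c1', 'c1', 'c1', 'c2'],
                   'spkitemid': ['r1', 'r2', 'r3', 'r4']})
df_included, df_excluded = FeaturePreprocessor.select_candidates(df, 2)
# all of c1's responses should be kept; c2's single response should be excluded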
-
static
trim
(values, trim_min, trim_max, tolerance=0.49998)[source]¶ Trim the values contained in the given numpy array to trim_min - tolerance as the floor and trim_max + tolerance as the ceiling.
Parameters: - values (list or np.array) – The values to trim.
- trim_min (float) – The lowest score on the scoring scale, used for trimming the raw regression predictions.
- trim_max (float) – The highest score on the scoring scale, used for trimming the raw regression predictions.
- tolerance (float, optional) – The tolerance that will be used to compute the trim interval. Defaults to 0.49998.
Returns: trimmed_values – Trimmed values.
Return type: np.array
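A minimal sketch using illustrative raw predictions on a hypothetical 1-6 scale:
import numpy as np
from rsmtool.preprocessor import FeaturePreprocessor

raw_predictions = np.array([0.2, 2.7, 6.9])
trimmed = FeaturePreprocessor.trim(raw_predictions, 1, 6)
# values should be clamped to the interval [1 - 0.49998, 6 + 0.49998]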
-
class
rsmtool.preprocessor.
FeatureSpecsProcessor
[source]¶ Bases:
object
Encapsulate feature file processing methods.
-
classmethod
find_feature_sign
(feature, sign_dict)[source]¶ Get the sign for the given feature from the feature.csv file.
Parameters: - feature (str) – The name of the feature.
- sign_dict (dict) – A dictionary of feature signs.
Returns: feature_sign_numeric – The numeric sign of the feature.
Return type: float
-
classmethod
generate_default_specs
(feature_names)[source]¶ Generate default feature “specifications” for the features with the given names. The specifications are stored as a data frame with three columns “feature”, “transform”, and “sign”.
Parameters: feature_names (list) – List of feature names for which to generate specifications.
Returns: feature_specs – A data frame with feature specifications that can be saved as a feature list file.
Return type: pandas DataFrame
Note
Since these are default specifications, the values for the transform column for each feature will be “raw” and the value for the sign column will be 1.
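For instance, a minimal sketch with hypothetical feature names:
from rsmtool.preprocessor import FeatureSpecsProcessor

df_specs = FeatureSpecsProcessor.generate_default_specs(['grammar', 'fluency', 'vocabulary'])
# df_specs should have one row per feature with transform='raw' and sign=1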
-
classmethod
generate_specs
(df, feature_names, train_label, feature_subset=None, feature_sign=None)[source]¶ Generate feature specifications using the features.csv file for the sign and the correlation with score to identify the best transformation.
Parameters: - df (pd.DataFrame) – The DataFrame from which to generate specs.
- feature_names (list) – A list of feature names.
- train_label (str) – The label column for the training data.
- feature_subset (pd.DataFrame, optional) – A feature_subset_specs DataFrame. Defaults to None.
- feature_sign (int, optional) – The sign of the feature. Defaults to None.
Returns: df_feature_specs – A feature specifications DataFrame.
Return type: pd.DataFrame
-
classmethod
normalize_and_validate_json
(feature_json)[source]¶ Normalize the field names in feature_json in order to maintain backwards compatibility with old config files.
Parameters: feature_json (dict) – JSON object containing the information specified in the feature file, possibly containing the old-style names for feature fields.
Returns: new_feature_json – JSON object with all old-style names normalized to new-style names.
Return type: dict
Raises: KeyError
– If required fields are missing in the feature JSON file.
-
classmethod
validate_feature_specs
(df)[source]¶ Check the supplied feature specs to make sure that there are no duplicate feature names and that all columns are in the right format. Add the default values for transform and sign if none are supplied.
Parameters: df (pd.DataFrame) – The feature specification DataFrame to validate.
Returns: df_specs_new – A data frame with validated and normalized feature specifications.
Return type: pandas DataFrame
Raises: KeyError
– If the data frame does not have a feature column.
ValueError
– If there are duplicate values in the feature column or if the sign column contains invalid values.
-
class
rsmtool.preprocessor.
FeatureSubsetProcessor
[source]¶ Bases:
object
Encapsulate feature sub-setting methods.
-
classmethod
check_feature_subset_file
(df, subset=None, sign=None)[source]¶ Check that the file is in the correct format and contains all the requested values. Raises an exception if it finds any errors but otherwise returns nothing.
Parameters: - df (pd.DataFrame) – The feature subset file DataFrame.
- subset (str, optional) – Name of a pre-defined feature subset. Defaults to None.
- sign (str, optional) – Value of the sign. Defaults to None.
Raises: ValueError
– If any columns are missing from the subset file or if any of the columns contain invalid values.
-
classmethod
select_by_subset
(feature_columns, feature_subset_specs, subset)[source]¶ Select feature columns using feature subset specs.
Parameters: - feature_columns (list) – A list of feature columns.
- feature_subset_specs (pd.DataFrame) – The feature subset spec DataFrame.
- subset (str) – The name of the subset to select.
Returns: feature_names – A list of feature names to include.
Return type: list
-
From reader
Module¶
Classes for reading data files (or dictionaries) and converting them to DataContainer objects.
author: Jeremy Biggs (jbiggs@ets.org)
author: Anastassia Loukina (aloukina@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
date: 10/25/2017
organization: ETS
-
class
rsmtool.reader.
DataReader
(filepaths, framenames, file_converters=None)[source]¶ Bases:
object
A DataReader class to generate DataContainer objects.
-
static
locate_files
(filepaths, config_dir)[source]¶ Try to locate an experiment file, or a list of experiment files. If a given path does not exist, try interpreting it relative to the directory containing the configuration file. If the file still cannot be located, return None.
Parameters: - filepaths (str or list) – Path(s) to the experiment file(s) to locate.
- config_dir (str) – Path to the directory containing the experiment configuration file.
Returns: retval – Absolute path to the experiment file or None if the file could not be located. If the filepaths argument was a string, this method will return a string. Otherwise, it will return a list.
Return type: str or list
Raises: ValueError
– If filepaths is not a string or list.
-
read
(kwargs_dict=None)[source]¶ Read all files passed to the constructor.
Parameters: kwargs_dict (dict of dicts, optional) – Any additional keyword arguments to pass to a particular DataFrame. These arguments will be passed to the pandas IO reader function. Defaults to None.
Returns: datacontainer – A DataContainer object.
Return type: DataContainer
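A minimal sketch of constructing a reader and reading the frames; the file paths and frame names are hypothetical:
from rsmtool.reader import DataReader

reader = DataReader(['train.csv', 'test.csv'], ['train', 'test'])
container = reader.read()
# the resulting DataContainer holds the two data frames under the given names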
-
static
read_from_file
(filename, converters=None, **kwargs)[source]¶ Read a CSV/TSV/XLS/XLSX file and return a data frame.
Parameters: - filename (str) – Name of file to read.
- converters (dict, optional) – A dictionary specifying how the types of the columns in the file should be converted. Specified in the same format as for pandas.read_csv(). Defaults to None.
Returns: df – Data frame containing the data in the given file.
Return type: pandas DataFrame
Raises: ValueError
– If the file has an extension that we do not support.
pd.parser.CParserError
– If the file is badly formatted or corrupt.
Note
Keyword arguments are passed to the given pandas IO reader function.
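For example, a minimal sketch with a hypothetical file, using converters to keep response IDs as strings rather than letting pandas infer their type:
from rsmtool.reader import DataReader

df = DataReader.read_from_file('train.csv', converters={'spkitemid': str})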
-
From reporter
Module¶
Classes for dealing with report generation.
author: Jeremy Biggs (jbiggs@ets.org)
author: Anastassia Loukina (aloukina@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
date: 10/25/2017
organization: ETS
-
class
rsmtool.reporter.
Reporter
[source]¶ Bases:
object
A class for generating Jupyter notebook reports, and converting them to HTML.
-
static
check_section_names
(specified_sections, section_type, context='rsmtool')[source]¶ Check whether the specified section names are valid and raise an exception if they are not.
Parameters: - specified_sections (list of str) – List of report section names.
- section_type (str) – One of ‘general’ or ‘special’.
- context (str, optional) –
Context in which we are validating the section names. Possible values are
{'rsmtool', 'rsmeval', 'rsmcompare'}
Defaults to ‘rsmtool’.
Raises: ValueError
– If any of the section names of the given type are not valid in the context of the given tool.
-
static
check_section_order
(chosen_sections, section_order)[source]¶ Check the order of the specified sections.
Parameters: - chosen_sections (list of str) – List of chosen section names.
- section_order (list of str) – An ordered list of the chosen section names.
Raises: ValueError
– If any sections specified in the order are missing from the list of chosen sections or vice versa.
-
static
convert_ipynb_to_html
(notebook_file, html_file)[source]¶ Convert the given Jupyter notebook file (.ipynb) to HTML and write it out as the given .html file.
Parameters: - notebook_file (str) – Path to input Jupyter notebook file.
- html_file (str) – Path to output HTML file.
Note
This function is also exposed as the render_notebook command-line utility.
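For example (the file names are hypothetical):
from rsmtool.reporter import Reporter

Reporter.convert_ipynb_to_html('report.ipynb', 'report.html')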
-
create_comparison_report
(configuration, csvdir_old, figdir_old, csvdir_new, figdir_new, output_dir)[source]¶ The main driver function to generate a comparison report comparing the two RSMTool experiments as defined by the given arguments.
Parameters: - configuration (configuration_parser.Configuration) – A configuration object
- csvdir_old (str) – The old experiment CSV output directory.
- figdir_old (str) – The old figure output directory.
- csvdir_new (str) – The new experiment CSV output directory.
- figdir_new (str) – The new figure output directory.
- output_dir (str) – The output directory for the new report.
Returns: A Jupyter notebook
Return type: notebook
-
create_report
(configuration, csvdir, figdir, context='rsmtool')[source]¶ The main driver function to generate the RSMTool HTML report for the experiment as defined by the given arguments.
Parameters: - configuration (configuration_parser.Configuration) – A configuration object
- csvdir (str) – The CSV output directory.
- figdir (str) – The figure output directory
- context (str) – The context of the script. Defaults to ‘rsmtool’.
Returns: A Jupyter notebook
Return type: notebook
Raises: KeyError
– If test_file_location or pred_file_location are not in the configuration.
-
create_summary_report
(configuration, all_experiments, csvdir)[source]¶ The main function to generate a summary report comparing several RSMTool experiments as defined by the given arguments.
Parameters: - configuration (configuration_parser.Configuration) – A configuration object
- all_experiments (list) – A list of experiments to summarize.
- csvdir (str) – The experiment CSV output directory.
Returns: A Jupyter notebook
Return type: notebook
-
determine_chosen_sections
(general_sections, special_sections, custom_sections, subgroups, context='rsmtool')[source]¶ Determine the section names that have been chosen by the user and that will be generated in the report.
Parameters: - general_sections (list of str) – List of specified general section names.
- special_sections (list of str) – List of specified special section names, if any.
- custom_sections (list of str) – List of specified custom sections, if any.
- subgroups (list of str) – List of column names that contain grouping information.
- context (str, optional) – Context of the tool in which we are validating. Possible values are {‘rsmtool’, ‘rsmeval’, ‘rsmcompare’}. Defaults to ‘rsmtool’.
Returns: chosen_sections – Final list of chosen sections that are to be included in the HTML report.
Return type: list of str
Raises: ValueError
– If a subgroup report section is requested but no subgroups were specified in the configuration file.
-
get_ordered_notebook_files
(general_sections, special_sections=[], custom_sections=[], section_order=None, subgroups=[], model_type=None, context='rsmtool')[source]¶ Check all section names and section order, combine all section names with the appropriate file mapping, and generate an ordered list of notebook files that are needed to generate the final report.
Parameters: - general_sections (list of str) – List of specified general sections.
- special_sections (list, optional) – List of specified special sections, if any. Defaults to empty list.
- custom_sections (list, optional) – List of specified custom sections, if any. Defaults to empty list.
- section_order (list, optional) – Ordered list in which the user wants the specified sections. Defaults to None.
- subgroups (list, optional) – List of column names that contain grouping information. Defaults to empty list.
- model_type (str, optional) – Type of the model. Possible values are {‘BUILTIN’, ‘SKLL’}. We allow None here so that RSMEval can use the same function. Defaults to None.
- context (str, optional) – Context of the tool in which we are validating. Possible values are {‘rsmtool’, ‘rsmeval’, ‘rsmcompare’}. Defaults to ‘rsmtool’.
Returns: chosen_notebook_files – List of the IPython notebook files that have to be rendered into the HTML report.
Return type: list of str
-
get_section_file_map
(special_sections, custom_sections, model_type=None, context='rsmtool')[source]¶ Map the section names to IPython notebook filenames.
Parameters: - special_sections (list of str) – List of special sections.
- custom_sections (list of str) – List of custom sections.
- model_type (str, optional) – Type of the model. Possible values are {‘BUILTIN’, ‘SKLL’}. We allow None here so that RSMEval can use the same function. Defaults to None.
- context (str, optional) – Context of the tool in which we are validating. Possible values are {‘rsmtool’, ‘rsmeval’, ‘rsmcompare’}. Defaults to ‘rsmtool’.
Returns: section_file_map – Dictionary mapping each section name to the corresponding IPython notebook filename.
Return type: dict
-
static
locate_custom_sections
(custom_report_section_paths, config_dir)[source]¶ Get the absolute paths for custom report sections and check that the files exist. If a file does not exist, raise an exception.
Parameters: - custom_report_section_paths (list of str) – List of paths to IPython notebook files representing the custom sections.
- config_dir (str) – Path to the directory containing the experiment configuration file.
Returns: custom_report_sections – List of absolute paths to the custom section notebooks.
Return type: list of str
Raises: FileNotFoundError
– If any of the files cannot be found.
-
static
merge_notebooks
(notebook_files, output_file)[source]¶ Merge the given Jupyter notebooks into a single Jupyter notebook.
Parameters: - notebook_files (list of str) – List of paths to the input Jupyter notebook files.
- output_file (str) – Path to output Jupyter notebook file
Note
Adapted from: http://stackoverflow.com/questions/20454668/how-to-merge-two-ipython-notebooks-correctly-without-getting-json-error.
-
From transformer
Module¶
Class for transforming features.
author: Jeremy Biggs (jbiggs@ets.org)
author: Anastassia Loukina (aloukina@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
date: 10/25/2017
organization: ETS
-
class
rsmtool.transformer.
FeatureTransformer
[source]¶ Bases:
object
Encapsulate feature transformation methods.
-
classmethod
apply_add_one_inverse_transform
(name, values, raise_error=True)[source]¶ Apply the add one and invert transform to values.
Parameters: - name (str) – Name of the feature to transform.
- values (np.array) – Numpy array containing the feature values.
- raise_error (bool, optional) – When set to True, raise an error if the transform is applied to a feature that has zero or negative values. Defaults to True.
Returns: new_data – Numpy array containing the transformed feature values.
Return type: np.array
Raises: ValueError
– If the transform is applied to a feature that has zero or negative values and raise_error is set to True.
-
classmethod
apply_add_one_log_transform
(name, values, raise_error=True)[source]¶ Apply the add one and log transform to values.
Parameters: - name (str) – Name of the feature to transform.
- values (numpy array) – Numpy array containing the feature values.
- raise_error (bool, optional) – When set to True, raise an error if the transform is applied to a feature that has zero or negative values. Defaults to True.
Returns: new_data – Numpy array that contains the transformed feature values.
Return type: numpy array
Raises: ValueError
– If the transform is applied to a feature that has zero or negative values and raise_error is set to True.
-
classmethod
apply_inverse_transform
(name, values, raise_error=True, sd_multiplier=4)[source]¶ Apply the inverse transform to values.
Parameters: - name (str) – Name of the feature to transform.
- values (numpy array) – Numpy array containing the feature values.
- raise_error (bool, optional) – When set to True, raise an error if the transform is applied to a feature that can be zero or to a feature that can have different signs. Defaults to True.
- sd_multiplier (int, optional) – Use this std. dev. multiplier to compute the ceiling and floor for outlier removal and check that these are not equal to zero. Defaults to 4.
Returns: new_data – Numpy array containing the transformed feature values.
Return type: numpy array
Raises: ValueError
– If the transform is applied to a feature that can be zero or to a feature that can have different signs and raise_error is set to True.
-
classmethod
apply_log_transform
(name, values, raise_error=True)[source]¶ Apply the log transform to values.
Parameters: - name (str) – Name of the feature to transform.
- values (numpy array) – Numpy array containing the feature values.
- raise_error (bool, optional) – When set to True, raise an error if the transform is applied to a feature that has zero or negative values. Defaults to True.
Returns: new_data – Numpy array containing the transformed feature values.
Return type: numpy array
Raises: ValueError
– If the transform is applied to a feature that can be zero or negative and raise_error is set to True.
-
classmethod
apply_sqrt_transform
(name, values, raise_error=True)[source]¶ Apply the sqrt transform to values.
Parameters: - name (str) – Name of the feature to transform.
- values (numpy array) – Numpy array containing the feature values.
- raise_error (bool, optional) – When set to True, raise an error if the transform is applied to a feature that can have negative values. Defaults to True.
Returns: new_data – Numpy array containing the transformed feature values.
Return type: numpy array
Raises: ValueError
– If the transform is applied to a feature that has negative values and raise_error is set to True.
-
classmethod
find_feature_transform
(feature_name, feature_value, scores)[source]¶ Identify the best transformation based on the highest absolute Pearson correlation with human score.
Parameters: - feature_name (str) – Name of feature for which to find the transformation.
- feature_value (pandas Series) – Series containing feature values.
- scores (pandas Series) – Numeric human scores.
Returns: best_transformation – The name of the transformation which gives the highest correlation between the feature values and the human scores. See documentation for the full list of transformations.
Return type: str
-
classmethod
transform_feature
(values, column_name, transform, raise_error=True)[source]¶ Apply the given transform to all of the values in the given numpy array. The values are assumed to be for the feature with the given name.
Parameters: - values (numpy array) – Numpy array containing the feature values.
- column_name (str) – Name of the feature to transform.
- transform (str) –
Name of the transform to apply. Valid options include
{'inv', 'sqrt', 'log', 'addOneInv', 'addOneLn', 'raw', 'org'}
- raise_error (bool, optional) – Raise a ValueError if a transformation leads to Inf values or may change the ranking of the responses. Defaults to True.
Returns: new_data – Numpy array containing the transformed feature values.
Return type: np.array
Raises: ValueError
– If the given transform is not recognized.
Note
Many of these transformations may be meaningless for features which span both negative and positive values. Some transformations may throw errors for negative feature values.
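A minimal sketch applying two of the valid transforms to illustrative values (the feature name here is hypothetical):
import numpy as np
from rsmtool.transformer import FeatureTransformer

values = np.array([1.0, 2.0, 4.0, 8.0])
log_values = FeatureTransformer.transform_feature(values, 'word_length', 'log')
inv_values = FeatureTransformer.transform_feature(values, 'word_length', 'inv')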
-
From utils
Module¶
-
rsmtool.utils.
agreement
(score1, score2, tolerance=0)[source]¶ Compute the agreement between two raters, taking into account the provided tolerance.
Parameters: - score1 (list of int) – List of rater 1 scores.
- score2 (list of int) – List of rater 2 scores.
- tolerance (int, optional) – Difference in scores that is acceptable. Defaults to 0.
Returns: agreement_value – The percentage agreement between the two scores.
Return type: float
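For example, with illustrative score lists:
from rsmtool.utils import agreement

h1_scores = [1, 2, 3, 4]
h2_scores = [1, 2, 4, 3]
exact = agreement(h1_scores, h2_scores)                  # exact agreement
adjacent = agreement(h1_scores, h2_scores, tolerance=1)  # agreement within one score point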
-
rsmtool.utils.
partial_correlations
(df)[source]¶ This is a Python port of the pcor function implemented in the ppcor R package, which computes partial correlations of each pair of variables in the given data frame df, excluding all other variables.
Parameters: df (pd.DataFrame) – Data frame containing the feature values.
Returns: df_pcor – Data frame containing the partial correlations of each pair of variables in the given data frame df, excluding all other variables.
Return type: pd.DataFrame
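A minimal sketch with a small illustrative data frame:
import pandas as pd
from rsmtool.utils import partial_correlations

df = pd.DataFrame({'grammar': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'fluency': [2.0, 1.5, 3.5, 3.0, 5.5],
                   'length': [1.2, 2.1, 2.9, 4.2, 5.1]})
df_pcor = partial_correlations(df)
# df_pcor is a symmetric matrix with one row/column per feature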
-
rsmtool.utils.
get_thumbnail_as_html
(path_to_image, image_id)[source]¶ Given a path to an image file, generate the HTML for a clickable thumbnail version of the image. On click, this HTML will open the full-sized version of the image in a new window.
Parameters: - path_to_image (str) – The absolute or relative path to the image. If an absolute path is provided, it will be converted to a relative path.
- image_id (int) – The id of the <img> tag in the HTML. This must be unique for each <img> tag.
Returns: image – The HTML string generated for the image.
Return type: str
Raises: FileNotFoundError
– If the image file cannot be located.
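For example (the image path is hypothetical):
from rsmtool.utils import get_thumbnail_as_html

html = get_thumbnail_as_html('figures/eval_by_group.png', 1)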
-
rsmtool.utils.
show_thumbnail
(path_to_image, image_id)[source]¶ Given a path to an image file, display a clickable thumbnail version of the image. On click, open the full-sized version of the image in a new window.
Parameters: - path_to_image (str) – The absolute or relative path to the image. If an absolute path is provided, it will be converted to a relative path.
- image_id (int) – The id of the <img> tag in the HTML. This must be unique for each <img> tag.
Displays: display (IPython.core.display.HTML) – The HTML display of the thumbnail image.
-
rsmtool.utils.
compute_expected_scores_from_model
(model, featureset, min_score, max_score)[source]¶ Compute expected scores using probability distributions over the labels from the given SKLL model.
Parameters: - model (skll.Learner) – The SKLL Learner object to use for computing the expected scores.
- featureset (skll.data.FeatureSet) – The SKLL FeatureSet object for which predictions are to be made.
- min_score (int) – Minimum score level to be used for computing expected scores.
- max_score (int) – Maximum score level to be used for computing expected scores.
Returns: expected_scores – A numpy array containing the expected scores.
Return type: np.array
Raises: ValueError
– If the given model cannot predict probability distributions or if the score range specified by min_score and max_score does not match what the model predicts in its probability distribution.
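A minimal sketch, assuming a SKLL probabilistic classifier saved to disk and a feature file with an spkitemid ID column; the paths and the 1-6 score range are hypothetical:
from skll import Learner
from skll.data import Reader
from rsmtool.utils import compute_expected_scores_from_model

learner = Learner.from_file('probabilistic_classifier.model')  # hypothetical path
fs = Reader.for_path('test_features.csv', id_col='spkitemid').read()
expected_scores = compute_expected_scores_from_model(learner, fs, 1, 6)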
-
rsmtool.utils.
parse_json_with_comments
(filename)[source]¶ Parse a JSON file after removing any comments. Comments can use either // for single-line comments or /* ... */ for multi-line comments.
Parameters: filename (str) – Path to the input JSON file.
Returns: obj – JSON object representing the input file.
Return type: dict
Note
This code was adapted from: http://www.lifl.fr/~riquetd/parse-a-json-file-with-comments.html.
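For example (the file name is hypothetical):
from rsmtool.utils import parse_json_with_comments

# config.json may contain // and /* ... */ comments
config_dict = parse_json_with_comments('config.json')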
From writer
Module¶
Classes for writing DataContainer DataFrames to files.
author: Jeremy Biggs (jbiggs@ets.org)
author: Anastassia Loukina (aloukina@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
date: 10/25/2017
organization: ETS
-
class
rsmtool.writer.
DataWriter
(experiment_id=None)[source]¶ Bases:
object
A DataWriter class to write out DataContainer objects.
-
write_experiment_output
(csvdir, container_or_dict, dataframe_names=None, new_names_dict=None, include_experiment_id=True, reset_index=False, file_format='csv', index=False, **kwargs)[source]¶ Write out each of the given list of data frames as a .csv, .tsv, or .xlsx file in the given directory. Each data frame was generated as part of running an RSMTool experiment. All files are prefixed with the given experiment ID and suffixed with either the name of the data frame in the DataContainer (or dict) object, or a new name if new_names_dict is specified. Additionally, the indexes in the data frames are reset if so specified.
Parameters: - csvdir (str) – Path to the output experiment sub-directory that will contain the files corresponding to each of the data frames.
- container_or_dict (container.DataContainer or dict) – A DataContainer object or dict, where keys are data frame names and values are pd.DataFrame objects.
- dataframe_names (list of str, optional) – List of data frame names, one for each of the data frames. Defaults to None.
- new_names_dict (dict, optional) – Dictionary specifying new names for the data frames, if desired. Defaults to None.
- include_experiment_id (bool, optional) – Whether to include the experiment ID in the file name. Defaults to True.
- reset_index (bool, optional) – Whether to reset the index of each data frame before writing to disk. Defaults to False.
- file_format ({'csv', 'xlsx', 'tsv'}, optional) – The file format in which to output the data. Defaults to ‘csv’.
- index (bool, optional) – Whether to include the index when writing. Defaults to False.
Raises: KeyError
– If file_format is not valid, or a data frame is not found in the container or dictionary.
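A minimal sketch using a plain dictionary instead of a DataContainer; the experiment ID, directory, and data are hypothetical:
import pandas as pd
from rsmtool.writer import DataWriter

df_scores = pd.DataFrame({'spkitemid': ['r1', 'r2'], 'score': [3, 4]})
writer = DataWriter('my_experiment')
writer.write_experiment_output('output', {'scores': df_scores}, dataframe_names=['scores'])
# this should produce a file like output/my_experiment_scores.csv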
-
write_feature_csv
(featuredir, data_container, selected_features, include_experiment_id=True, file_format='csv')[source]¶ Write out the feature file to disk.
Parameters: - featuredir (str) – Path to the feature experiment output directory where the feature JSON file will be saved.
- data_container (container.DataContainer) – A DataContainer object.
- selected_features (list of str) – List of features that were selected for model building.
- include_experiment_id (bool, optional) – Whether to include the experiment ID in the file name. Defaults to True.
- file_format ({'csv', 'xlsx', 'tsv'}, optional) – The file format in which to output the data. Defaults to ‘csv’.
-