Advanced Uses of RSMTool

In addition to providing the rsmtool utility for training and evaluating regression-based scoring models, the RSMTool package also provides four other command-line utilities for more advanced users.

rsmeval - Evaluate external predictions

RSMTool provides the rsmeval command-line utility to evaluate existing predictions and generate a report with all the built-in analyses. This can be useful in scenarios where the user wants to use more sophisticated machine learning algorithms not available in RSMTool to build the scoring model but still wants to be able to evaluate that model’s predictions using the standard analyses.

For example, say a researcher has an existing automated scoring engine for grading short responses that extracts the features and computes the predicted score. This engine uses a large number of binary, sparse features. She cannot use rsmtool to train her model since it requires numeric features. So, she uses scikit-learn to train her model.

Once the model is trained, the researcher wants to evaluate her engine’s performance using the analyses recommended by the educational measurement community as well as conduct additional investigations for specific subgroups of test-takers. However, these kinds of analyses are not available in scikit-learn. She can use rsmeval to set up a customized report using a combination of existing and custom sections and quickly produce the evaluation that is useful to her.
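
A rough sketch of what such a workflow might look like is shown below. This is not part of RSMTool itself: the file names, column names, and the choice of learner (a ridge regressor from scikit-learn) are all hypothetical, but the predictions file it writes matches the kind of layout that rsmeval expects in the tutorial that follows.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# hypothetical input files with columns "ID", "text", and "human"
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# a large number of sparse, binary features, as in the scenario above
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train["text"])
X_test = vectorizer.transform(test["text"])

# train the scoring model outside of RSMTool
model = Ridge().fit(X_train, train["human"])

# write a predictions file that rsmeval can evaluate: one row per response
# with a unique ID, the human score, and the predicted system score
pd.DataFrame({
    "ID": test["ID"],
    "human": test["human"],
    "system": model.predict(X_test),
}).to_csv("predictions.csv", index=False)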

Tutorial

For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.

Workflow

rsmeval is designed for evaluating existing machine scores. Once you have the scores computed for all the responses in your data, the next steps are fairly straightforward:

  1. Create a data file in one of the supported formats containing the computed system scores and the human scores you want to compare against.
  2. Create an experiment configuration file describing the evaluation experiment you would like to run.
  3. Run that configuration file with rsmeval and generate the experiment HTML report as well as the intermediate CSV files.
  4. Examine the HTML report to check various aspects of model performance.

Note that the above workflow does not use any of rsmeval's customization features, e.g., choosing which sections to include in the report or adding custom analysis sections. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.
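
As a concrete illustration of step 1, a minimal predictions file in .csv format might look like the (entirely made-up) example below. The column names are arbitrary because you tell rsmeval what they are via the configuration file; a blank value in the second human score column simply means that response was not double-scored.

ID,human,human2,system
resp_001,4,4,3.82
resp_002,2,3,2.41
resp_003,5,,4.97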

ASAP Example

We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.

Generate scores

rsmeval is designed for researchers who have developed their own scoring engine for generating scores and would like to produce an evaluation report for those scores. For this tutorial, we will use the scores we generated for the ASAP2 evaluation set in the rsmtool tutorial.

Create a configuration file

The next step is to create an experiment configuration file in .json format.

 1  {
 2      "experiment_id": "ASAP2_evaluation",
 3      "description": "Evaluation of the scores generated using rsmtool.",
 4      "predictions_file": "ASAP2_scores.csv",
 5      "system_score_column": "system",
 6      "human_score_column": "human",
 7      "id_column": "ID",
 8      "trim_min": 1,
 9      "trim_max": 6,
10      "second_human_score_column": "human2",
11      "scale_with": "asis"
12  }

Let’s take a look at the options in our configuration file.

  • Line 2: We define an experiment ID.
  • Line 3: We also provide a description which will be included in the experiment report.
  • Line 4: We list the path to the file with the predicted and human scores. For this tutorial we used .csv format, but RSMTool also supports several other input file formats.
  • Line 5: This field indicates that the system scores in our .csv file are located in a column named system.
  • Line 6: This field indicates that the human (reference) scores in our .csv file are located in a column named human.
  • Line 7: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
  • Lines 8-9: These fields indicate that the lowest score on the scoring scale is a 1 and the highest score is a 6. This information is usually part of the rubric used by human graders.
  • Line 10: This field indicates that scores from a second set of human graders are also available (useful for comparing the agreement between human-machine scores to the agreement between two sets of humans) and are located in the human2 column in the .csv file.
  • Line 11: This field indicates that the provided machine scores are already re-scaled to match the distribution of human scores. rsmeval itself will not perform any scaling and the report will refer to these as scaled scores.

Documentation for all of the available configuration options is available here.

Run the experiment

Now that we have our scores in the right format and our configuration file in .json format, we can use the rsmeval command-line script to run our evaluation experiment.

$ cd examples/rsmeval
$ rsmeval config_rsmeval.json

This should produce output like:

Output directory: /Users/nmadnani/work/rsmtool/examples/rsmeval
Assuming given system predictions are already scaled and will be used as such.
 predictions: /Users/nmadnani/work/rsmtool/examples/rsmeval/ASAP2_scores.csv
Processing predictions
Saving pre-processed predictions and the metadata to disk
Running analyses on predictions
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3

Once the run finishes, you will see the output, figure, and report sub-directories in the current directory. Each of these directories contains useful information but we are specifically interested in the report/ASAP2_evaluation_report.html file, which is the final evaluation report.

Examine the report

Our experiment report contains all the information we would need to evaluate the provided system scores against the human scores. It includes:

  1. The distributions for the human versus the system scores.
  2. Several different metrics indicating how well the machine’s scores agree with the humans’.
  3. Information about human-human agreement and the difference between human-human and human-system agreement.

… and much more.

Input

rsmeval requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmeval will use the current directory as the output directory.

Here are all the arguments to the rsmeval command-line script.

config_file

The JSON configuration file for this experiment.

output_dir (optional)

The output directory where all the files for this experiment will be stored.

-f, --force

If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmeval experiment.

-h, --help

Show the help message and exit.

-V, --version

Show version number and exit.

Experiment configuration file

This is a file in .json format that provides overall configuration options for an rsmeval experiment. Here’s an example configuration file for rsmeval.

There are five required fields and the rest are all optional.

experiment_id

An identifier for the experiment that will be used to name the report and all intermediate files. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.

predictions_file

The path to the file with predictions to evaluate. The file should be in one of the supported formats. Each row should correspond to a single response and contain the predicted and observed scores for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file.

system_score_column

The name for the column containing the scores predicted by the system. These scores will be used for evaluation.

trim_min

The single numeric value for the lowest possible integer score that the machine should predict. This value will be used to compute the floor value for trimmed (bound) machine scores as trim_min - 0.49998.

trim_max

The single numeric value for the highest possible integer score that the machine should predict. This value will be used to compute the ceiling value for trimmed (bound) machine scores as trim_max + 0.49998.

Note

Although the trim_min and trim_max fields are optional for rsmtool, they are required for rsmeval.
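
To make the effect of these two fields concrete, here is a small sketch of the trimming (bounding) computation described above, assuming the trim_min of 1 and trim_max of 6 used in the tutorial configuration:

# bound (trim) raw machine scores using the formulas given above
trim_min, trim_max = 1, 6
floor = trim_min - 0.49998    # 0.50002
ceiling = trim_max + 0.49998  # 6.49998

def trim(score):
    # clip a raw prediction to the [floor, ceiling] range
    return min(max(score, floor), ceiling)

print(trim(7.3))   # 6.49998
print(trim(0.12))  # 0.50002
print(trim(3.7))   # 3.7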

description (Optional)

A brief description of the experiment. This will be included in the report. The description can contain spaces and punctuation. It’s blank by default.

file_format (Optional)

The format of the intermediate files. Options are csv, tsv, or xlsx. Defaults to csv if this is not specified.

id_column (Optional)

The name of the column containing the response IDs. Defaults to spkitemid, i.e., if this is not specified, rsmeval will look for a column called spkitemid in the prediction file.

human_score_column (Optional)

The name for the column containing the human scores for each response. The values in this column will be used as observed scores. Defaults to sc1.

Note

All responses with non-numeric values or zeros in either human_score_column or system_score_column will be automatically excluded from evaluation. You can use exclude_zero_scores (Optional) to keep responses with zero scores.

second_human_score_column (Optional)

The name for an optional column in the test data containing a second human score for each response. If specified, additional information about human-human agreement and degradation will be computed and included in the report. Note that this column must either contain numbers or be empty; non-numeric values are not accepted. Note also that the exclude_zero_scores (Optional) option below will apply to this column too.

Note

You do not need to have second human scores for all responses to use this option. The human-human agreement statistics will be computed as long as there is at least one response with a numeric value in this column. For responses that do not have a second human score, the value in this column should be left blank.

flag_column (Optional)

This field makes it possible to only use responses with particular values in a given column (e.g. only responses with a value of 0 in a column called ADVISORY). The field takes a dictionary in Python format where the keys are the names of the columns and the values are lists of values for responses that will be evaluated. For example, a value of {"ADVISORY": 0} will mean that rsmeval will only use responses for which the ADVISORY column has the value 0. Defaults to None.

Note

If several conditions are specified (e.g., {"ADVISORY": 0, "ERROR": 0}) only those responses which satisfy all the conditions will be selected for further analysis (in this example, these will be the responses where the ADVISORY column has a value of 0 and the ERROR column has a value of 0).

Note

When reading the values in the supplied dictionary, rsmeval treats numeric strings, floats and integers as the same value. Thus 1, 1.0, "1" and "1.0" are all treated as 1.0.

exclude_zero_scores (Optional)

By default, responses with human scores of 0 will be excluded from evaluations. Set this field to false if you want to keep responses with scores of 0. Defaults to true.

scale_with (Optional)

In many scoring applications, system scores are re-scaled so that their mean and standard deviation match those of the human scores for the training data.

If you want rsmeval to re-scale the supplied predictions, you need to provide, as the value for this field, the path to a second file in one of the supported formats containing the human scores and the predictions of the same system on its training data. This file must have two columns: the human scores under the sc1 column and the predicted scores under the prediction column.

This field can also be set to "asis" if the scores are already scaled. In this case, no additional scaling will be performed by rsmeval but the report will refer to the scores as “scaled”.

Defaults to "raw", which means that no re-scaling is performed and the report refers to the scores as “raw”.
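
To illustrate the idea behind this kind of re-scaling, here is a rough sketch (the actual computation is performed internally by rsmeval; the file name for the training-data predictions is hypothetical):

import pandas as pd

# training-set human scores ("sc1") and system predictions ("prediction"),
# i.e., the file that would be passed as the value of scale_with
train = pd.read_csv("train_predictions.csv")

# new predictions to be re-scaled, using the tutorial file and column names
test = pd.read_csv("ASAP2_scores.csv")

h_mean, h_sd = train["sc1"].mean(), train["sc1"].std()
p_mean, p_sd = train["prediction"].mean(), train["prediction"].std()

# shift and stretch the new predictions so that their distribution
# matches the human score distribution on the training data
test["scaled"] = (test["system"] - p_mean) / p_sd * h_sd + h_mean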

subgroups (Optional)

A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]. These subgroup columns need to be present in the input predictions file. If subgroups are specified, rsmeval will generate:

  • tables and barplots showing system-human agreement for each subgroup on the evaluation set.

general_sections (Optional)

RSMTool provides pre-defined sections for rsmeval (listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.

  • data_description: Shows the total number of responses, along with any responses that have been excluded due to non-numeric/zero scores or flag columns.

  • data_description_by_group: Shows the total number of responses for each of the subgroups specified in the configuration file. This section only covers the responses used to evaluate the model.

  • consistency: Shows metrics for human-human agreement, the difference (‘degradation’) between the human-human and human-system agreement, and the disattenuated human-machine correlations. This section is only generated if the configuration file specifies second_human_score_column.

  • evaluation: Shows the standard set of evaluations recommended for scoring models on the evaluation data:

    • a table showing system-human association metrics;
    • the confusion matrix; and
    • a barplot showing the distributions for both human and machine scores.
  • evaluation_by_group: Shows barplots with the main evaluation metrics for each of the subgroups specified in the configuration file.

  • intermediate_file_paths: Shows links to all of the intermediate files that were generated while running the evaluation.

  • sysinfo: Shows all Python packages along with versions installed in the current environment while generating the report.
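
For example, a hypothetical configuration that restricts the rsmeval report to just a few of these sections might look like this (the remaining fields are the same as in the tutorial configuration above):

{
    "experiment_id": "ASAP2_evaluation",
    "predictions_file": "ASAP2_scores.csv",
    "system_score_column": "system",
    "trim_min": 1,
    "trim_max": 6,
    "general_sections": ["data_description", "evaluation", "sysinfo"]
}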

custom_sections (Optional)

A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.

special_sections (Optional)

A list specifying special ETS-only sections to be included into the final report. These sections are available only to ETS employees via the rsmextra package.

section_order (Optional)

A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:

  1. Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
  2. All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension, and
  3. All special sections specified using special_sections.

use_thumbnails (Optional)

If set to true, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false, full-sized images will be displayed as usual. Defaults to false.

candidate_column (Optional)

The name for an optional column in the predictions file containing unique candidate IDs. Candidate IDs are different from response IDs since the same candidate (test-taker) might have responded to multiple questions.

min_items_per_candidate (Optional)

An integer value for the minimum number of responses expected from each candidate. If any candidates have fewer responses than the specified value, all responses from those candidates will be excluded from further analysis. Defaults to None.

Output

rsmeval produces a set of folders in the output directory.

report

This folder contains the final RSMEval report in HTML format as well as in the form of a Jupyter notebook (a .ipynb file).

output

This folder contains all of the intermediate files produced as part of the various analyses performed, saved as .csv files. rsmeval will also save in this folder a copy of the configuration file. Fields not specified in the original configuration file will be pre-populated with default values.

figure

This folder contains all of the figures generated as part of the various analyses performed, saved as .svg files.

Intermediate files

Although the primary output of rsmeval is an HTML report, we also want the user to be able to conduct additional analyses outside of rsmeval. To this end, all of the tables produced in the experiment report are saved as files in the format specified by the file_format parameter in the output directory. The following sections describe all of the intermediate files that are produced.

Note

The names of all files begin with the experiment_id provided by the user in the experiment configuration file. In addition, the names for certain columns are set to default values in these files irrespective of what they were named in the original data files. This is because RSMEval standardizes these column names internally for convenience. These values are:

  • spkitemid for the column containing response IDs.
  • sc1 for the column containing the human scores used as observed scores.
  • sc2 for the column containing the second human scores, if this column was specified in the configuration file.
  • candidate for the column containing candidate IDs, if this column was specified in the configuration file.

Predictions

filename: pred_processed

This file contains the post-processed predicted scores: the predictions from the model are truncated, rounded, and re-scaled (if requested).

Flagged responses

filename: test_responses_with_excluded_flags

This file contains all of the rows in the input predictions file that were filtered out based on conditions specified in flag_column.

Note

If the predictions file contained columns with internal names such as sc1 that were not actually used by rsmeval, they will still be included in these files but their names will be changed to ##name## (e.g., ##sc1##).

Excluded responses

filename: test_excluded_responses

This file contains all of the rows in the predictions file that were filtered out because of non-numeric or zero scores.

Response metadata

filename: test_metadata

This file contains the metadata columns (id_column and subgroups, if provided) for all rows in the predictions file that were used in the evaluation.

Unused columns

filename: test_other_columns

This file contains all of the columns from the input predictions file that are not present in the *_pred_processed and *_metadata files. It only includes the rows that were not filtered out.

Note

If the predictions file contained columns with internal names such as sc1 but these columns were not actually used by rsmeval, these columns will also be included in these files but their names will be changed to ##name## (e.g., ##sc1##).

Human scores

filename: test_human_scores

This file contains the human scores, if available in the input predictions file, under a column called sc1 with the response IDs under the spkitemid column.

If second_human_score_column was specified, then it also contains the values from that column in the predictions file under a column called sc2. Only the rows that were not filtered out are included.

Note

If exclude_zero_scores was set to true (the default value), all zero scores in the second_human_score_column will be replaced by nan.

Data composition

filename: data_composition

This file contains the total number of responses in the input predictions file. If applicable, the table will also include the number of different subgroups.

Excluded data composition

filename: test_excluded_composition

This file contains the composition of the set of excluded responses, i.e., why they were excluded and how many responses fall under each exclusion criterion.

Subgroup composition

filename: data_composition_by_<SUBGROUP>

There will be one such file for each of the specified subgroups and it contains the total number of responses in that subgroup.

Evaluation metrics

  • eval: This file contains the descriptives for predicted and human scores (mean, std. dev., etc.) as well as the association metrics (correlation, quadratic weighted kappa, SMD, etc.) for the raw as well as the post-processed scores.

  • eval_by_<SUBGROUP>: the same information as in *_eval.csv computed separately for each subgroup.

  • eval_short - a shortened version of eval that contains specific descriptives for predicted and human scores (mean, std. dev., etc.) and association metrics (correlation, quadratic weighted kappa, SMD, etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw or scale version is chosen depending on the value of use_scaled_predictions in the configuration file).

    • h_mean
    • h_sd
    • corr
    • sys_mean [raw/scale trim]
    • sys_sd [raw/scale trim]
    • SMD [raw/scale trim]
    • adj_agr [raw/scale trim_round]
    • exact_agr [raw/scale trim_round]
    • kappa [raw/scale trim_round]
    • wtkappa [raw/scale trim_round]
    • sys_mean [raw/scale trim_round]
    • sys_sd [raw/scale trim_round]
    • SMD [raw/scale trim_round]
    • R2 [raw/scale trim]
    • RMSE [raw/scale trim]
  • score_dist: the distributions of the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.

  • confMatrix: the confusion matrix between the human scores and the rounded raw/scaled predicted scores, depending on the value of use_scaled_predictions.

Note

Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.

Human-human Consistency

These files are created only if a second human score has been made available via the second_human_score_column option in the configuration file.

  • consistency: contains descriptives for both human raters as well as the agreement metrics between their ratings.
  • consistency_by_<SUBGROUP>: contains the same metrics as in the consistency file, computed separately for each subgroup.
  • degradation: shows the differences between human-human agreement and machine-human agreement for all association metrics and all forms of predicted scores.
  • disattenuated_correlations: shows the human-machine correlation, the human-human correlation, and the disattenuated human-machine correlation, computed as the human-machine correlation divided by the square root of the human-human correlation.
  • disattenuated_correlations_by_<SUBGROUP>: contains the same metrics as in the disattenuated_correlations file, computed separately for each subgroup.
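
Since the disattenuated correlation is defined above in terms of the other two correlations, it is easy to verify by hand, e.g.:

import math

# made-up correlation values for illustration
r_human_machine = 0.75
r_human_human = 0.81

# human-machine correlation divided by the square root of the
# human-human correlation, as described above
r_disattenuated = r_human_machine / math.sqrt(r_human_human)
print(round(r_disattenuated, 3))  # 0.833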

rsmpredict - Generate new predictions

RSMTool provides the rsmpredict command-line utility to generate predictions for new data using a model already trained using the rsmtool utility. This can be useful when processing a new set of responses to the same task without needing to retrain the model.

rsmpredict pre-processes the feature values according to user specifications before using them to generate the predicted scores. The generated scores are post-processed in the same manner as they are in rsmtool output.

Note

No score is generated for responses with non-numeric values for any of the features included in the model.

If the original model specified transformations for some of the features and these transformations led to NaN or Inf values when applied to the new data, rsmpredict will raise a warning. No score will be generated for such responses.

Tutorial

For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.

Workflow

Important

Although this tutorial provides feature values for the purpose of illustration, rsmpredict does not include any functionality for feature extraction; the tool is designed for researchers who use their own NLP/Speech processing pipeline to extract features for their data.

rsmpredict allows you to generate scores for new data using an existing model trained with RSMTool. Therefore, before starting this tutorial, you first need to complete the rsmtool tutorial, which will produce a trained RSMTool model. You will also need to process the new data to extract the same features as the ones used in the model.

Once you have the features for the new data and the RSMTool model, using rsmpredict is fairly straightforward:

  1. Create a file containing the features for the new data. The file should be in one of the supported formats.

  2. Create an experiment configuration file describing the experiment you would like to run.

  3. Run that configuration file with rsmpredict to generate the predicted scores.

    Note

    You do not need human scores to run rsmpredict since it does not produce any evaluation analyses. If you do have human scores for the new data and you would like to evaluate the system on this new data, you can first run rsmpredict to generate the predictions and then run rsmeval on the output of rsmpredict to generate an evaluation report.
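
For example, using the configuration files from the two tutorials, the chained workflow from the note above might look like this on the command line (the rsmeval configuration would simply point its predictions_file field at the predictions.csv produced by rsmpredict):

$ rsmpredict config_rsmpredict.json predictions.csv
$ rsmeval config_rsmeval.json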

ASAP Example

We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial. Specifically, we are going to use the linear regression model we trained in that tutorial to generate scores for new data.

Note

If you have not already completed that tutorial, please do so now. You may need to complete it again if you deleted the output files.

Extract features

We will first need to generate features for the new set of responses for which we want to predict scores. For this experiment, we will simply re-use the test set from the rsmtool tutorial.

Note

The features used with rsmpredict should be generated using the same NLP/Speech processing pipeline that generated the features used in the rsmtool modeling experiment.

Create a configuration file

The next step is to create an rsmpredict experiment configuration file in .json format.

1  {
2      "experiment_dir": "../rsmtool",
3      "experiment_id": "ASAP2",
4      "input_features_file": "../rsmtool/test.csv",
5      "id_column": "ID",
6      "human_score_column": "score",
7      "second_human_score_column": "score2"
8  }

Let’s take a look at the options in our configuration file.

  • Line 2: We give the path to the directory containing the output of the rsmtool experiment.
  • Line 3: We provide the experiment_id of the rsmtool experiment used to train the model. This can usually be read off the output/<experiment_id>.model file in the rsmtool experiment output directory.
  • Line 4: We list the path to the data file with the feature values for the new data. For this tutorial we used .csv format, but RSMTool also supports several other input file formats.
  • Line 5: This field indicates that the unique IDs for the responses in the .csv file are located in a column named ID.
  • Lines 6-7: These fields indicate that there are two sets of human scores in our .csv file, located in the columns named score and score2. The values from these columns will be added to the output file containing the predictions, which can be useful if we want to evaluate the predictions using rsmeval.

Documentation for all of the available configuration options is available here.

Run the experiment

Now that we have the model, the features in the right format, and our configuration file in .json format, we can use the rsmpredict command-line script to generate the predictions and to save them in predictions.csv.

$ cd examples/rsmpredict
$ rsmpredict config_rsmpredict.json predictions.csv

This should produce output like:

WARNING: The following extraneous features will be ignored: {'spkitemid', 'sc1', 'sc2', 'LENGTH'}
Pre-processing input features
Generating predictions
Rescaling predictions
Trimming and rounding predictions
Saving predictions to /Users/nmadnani/work/rsmtool/examples/rsmpredict/predictions.csv

You should now see a file named predictions.csv in the current directory which contains the predicted scores for the new data in the predictions column.

Input

rsmpredict requires two arguments to generate predictions: the path to a configuration file and the path to the output file where the generated predictions are saved in .csv format.

If you also want to save the pre-processed feature values, rsmpredict can take a third optional argument --features to specify the path to a .csv file in which to save these values.

Here are all the arguments to the rsmpredict command-line script.

config_file

The JSON configuration file for this experiment.

output_file

The output .csv file where predictions will be saved.

--features <preproc_feats_file>

If specified, the pre-processed values for the input features will also be saved in this .csv file.

-h, --help

Show the help message and exit.

-V, --version

Show version number and exit.

Experiment configuration file

This is a file in .json format that provides overall configuration options for an rsmpredict experiment. Here’s an example configuration file for rsmpredict.

There are three required fields and the rest are all optional.

experiment_dir

The path to the directory containing the rsmtool model to use for generating predictions. This directory must contain a sub-directory called output with the model files, feature pre-processing parameters, and score post-processing parameters. The path can be absolute or relative to the location of the configuration file.

experiment_id

The experiment_id used to create the rsmtool model files being used for generating predictions. If you do not know the experiment_id, you can find it by looking at the prefix of the .model file under the output directory.

input_features_file

The path to the file with the raw feature values that will be used for generating predictions. The file should be in one of the supported formats. Each row should correspond to a single response and contain the feature values for this response. In addition, there should be a column with a unique identifier (ID) for each response. The path can be absolute or relative to the location of the configuration file. Note that the feature names must be the same as those used in the original rsmtool experiment.

Note

rsmpredict will only generate predictions for responses in this file that have numeric values for the features included in the rsmtool model.

See also

rsmpredict does not require human scores for the new data since it does not evaluate the generated predictions. If you do have the human scores and want to evaluate the new predictions, you can use the rsmeval command-line utility.

file_format (Optional)

The format of the intermediate files. Options are csv, tsv, or xlsx. Defaults to csv if this is not specified.

predict_expected_scores (Optional)

If the original model was a probabilistic SKLL classifier, then expected scores (probability-weighted averages over the contiguous numeric score points) can be generated as the machine predictions instead of the most likely score point, which is the default. Set this field to true to compute expected scores as predictions. Defaults to false.

Note

  1. If the model in the original rsmtool experiment is an SVC, that original experiment must have been run with predict_expected_scores set to true. This is because SVC classifiers are fit differently if probabilistic output is desired, in contrast to other probabilistic SKLL classifiers.
  2. You may see slight differences in expected score predictions if you run the experiment on different machines or on different operating systems most likely due to very small probability values for certain score points which can affect floating point computations.
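
Conceptually, an expected score is just a probability-weighted average over the score points, as in the small sketch below (the class probabilities are made up for illustration; in practice they come from the probabilistic SKLL classifier):

score_points = [1, 2, 3, 4, 5, 6]
probabilities = [0.02, 0.08, 0.25, 0.40, 0.20, 0.05]

# probability-weighted average over the score points
expected_score = sum(s * p for s, p in zip(score_points, probabilities))
print(round(expected_score, 2))  # 3.83, rather than the most likely score point, 4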

id_column (Optional)

The name of the column containing the response IDs. Defaults to spkitemid, i.e., if this is not specified, rsmpredict will look for a column called spkitemid in the input features file.

There are several other options in the configuration file that, while not directly used by rsmpredict, can simply be passed through from the input features file to the output predictions file. This can be particularly useful if you want to subsequently run rsmeval to evaluate the generated predictions.

candidate_column (Optional)

The name for the column containing unique candidate IDs. This column will be named candidate in the output file with predictions.

human_score_column (Optional)

The name for the column containing human scores. This column will be renamed to sc1.

second_human_score_column (Optional)

The name for the column containing the second human score. This column will be renamed to sc2.

standardize_features (Optional)

If this option is set to false, features will not be standardized by subtracting the mean and dividing by the standard deviation. Defaults to true.
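
In other words, with the default setting each feature value is standardized roughly as in the sketch below, using the means and standard deviations saved with the original rsmtool model rather than statistics recomputed from the new data (the numbers here are made up):

# hypothetical training-set statistics for a single feature
train_mean, train_sd = 0.37, 0.12

def standardize(value, mean=train_mean, sd=train_sd):
    # subtract the training mean and divide by the training standard deviation
    return (value - mean) / sd

print(round(standardize(0.49), 2))  # 1.0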

subgroups (Optional)

A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"]. All of these columns will be included in the predictions file under their original names.

flag_column (Optional)

See description in the rsmtool configuration file for further information. No filtering will be done by rsmpredict, but the contents of all specified columns will be added to the predictions file using the original column names.

Output

rsmpredict produces a .csv file with predictions for all responses in the new data set and, optionally, a .csv file with pre-processed feature values. If any of the responses had non-numeric feature values in the original data or after applying transformations, these are saved in a file named PREDICTIONS_NAME_excluded_responses.csv, where PREDICTIONS_NAME is the name of the predictions file supplied by the user without the extension.

The predictions .csv file contains the following columns:

  • spkitemid : the unique response IDs from the original feature file.
  • sc1 and sc2 : the human scores for each response from the original feature file (human_score_column and second_human_score_column, respectively).
  • raw : raw predictions generated by the model.
  • raw_trim, raw_trim_round, scale, scale_trim, scale_trim_round : raw scores post-processed in different ways.
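
If you want to examine or post-process these predictions programmatically, the output file can be read like any other .csv file, for example:

import pandas as pd

predictions = pd.read_csv("predictions.csv")
# sc1 and sc2 are only present if the corresponding columns were
# specified in the rsmpredict configuration file
print(predictions[["spkitemid", "sc1", "raw", "scale_trim_round"]].head())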

rsmcompare - Create a detailed comparison of two scoring models

RSMTool provides the rsmcompare command-line utility to compare two models and to generate a detailed comparison report including differences between the two models. This can be useful in many scenarios, e.g., say the user wants to compare the changes in model performance after adding a new feature into the model. To use rsmcompare, the user must first run two experiments using either rsmtool or rsmeval. rsmcompare can then be used to compare the outputs of these two experiments to each other.

Note

Currently rsmcompare takes the outputs of the analyses generated during the original experiments and creates comparison tables. These comparison tables were designed with a specific comparison scenario in mind: comparing a baseline model with a model that includes new feature(s). The tool can certainly be used for other comparison scenarios if the researcher feels that the generated comparison output is appropriate.

rsmcompare can be used to compare:

  1. Two rsmtool experiments, or
  2. Two rsmeval experiments, or
  3. An rsmtool experiment with an rsmeval experiment (in this case, only the evaluation analyses will be compared).

Note

It is strongly recommended that the original experiments as well as the comparison experiment are all done using the same version of RSMTool.

Tutorial

For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.

Workflow

rsmcompare is designed to compare two existing rsmtool or rsmeval experiments. To use rsmcompare you need:

  1. Two experiments that were run using rsmtool or rsmeval.
  2. Create an experiment configuration file describing the comparison experiment you would like to run.
  3. Run that configuration file with rsmcompare and generate the comparison experiment HTML report.
  4. Examine the HTML report to compare the two models.

Note that the above workflow does not use the customization features of rsmcompare, e.g., choosing which sections to include in the report or adding custom analysis sections. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.

ASAP Example

We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.

Run rsmtool (or rsmeval) experiments

rsmcompare compares the results of two existing rsmtool (or rsmeval) experiments. For this tutorial, we will compare the model trained in the rsmtool tutorial to itself.

Note

If you have not already completed that tutorial, please do so now. You may need to complete it again if you deleted the output files.

Create a configuration file

The next step is to create an experiment configuration file in .json format.

 1  {
 2      "comparison_id": "ASAP2_vs_ASAP2",
 3      "experiment_id_old": "ASAP2",
 4      "experiment_dir_old": "../rsmtool/",
 5      "description_old": "RSMTool experiment.",
 6      "use_scaled_predictions_old": true,
 7      "experiment_id_new": "ASAP2",
 8      "experiment_dir_new": "../rsmtool",
 9      "description_new": "RSMTool experiment (copy).",
10      "use_scaled_predictions_new": true
11  }

Let’s take a look at the options in our configuration file.

  • Line 2: We provide an ID for the comparison experiment.
  • Line 3: We provide the experiment_id for the experiment we want to use as a baseline.
  • Line 4: We also give the path to the directory containing the output of the original baseline experiment.
  • Line 5: We give a short description of this baseline experiment. This will be shown in the report.
  • Line 6: This field indicates that the baseline experiment used scaled scores for some evaluation analyses.
  • Line 7: We provide the experiment_id for the new experiment. We use the same experiment ID for both experiments since we are comparing the experiment to itself.
  • Line 8: We also give the path to the directory containing the output of the new experiment. As above, we use the same path because we are comparing the experiment to itself.
  • Line 9: We give a short description of the new experiment. This will also be shown in the report.
  • Line 10: This field indicates that the new experiment also used scaled scores for some evaluation analyses.

Documentation for all of the available configuration options is available here.

Run the experiment

Now that we have the two experiments we want to compare and our configuration file in .json format, we can use the rsmcompare command-line script to run our comparison experiment.

$ cd examples/rsmcompare
$ rsmcompare config_rsmcompare.json

This should produce output like:

Output directory: /Users/nmadnani/work/rsmtool/examples/rsmcompare
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3

Once the run finishes, you will see an HTML file named ASAP2_vs_ASAP2_report.html. This is the final rsmcompare comparison report.

Examine the report

Our experiment report contains all the information we would need to compare the new model to the baseline model. It includes:

  1. Comparison of feature distributions between the two experiments.
  2. Comparison of model coefficients between the two experiments.
  3. Comparison of model performance between the two experiments.

Note

Since we are comparing the experiment to itself, the comparison is not very interesting, e.g., the differences between various values will always be 0.

Input

rsmcompare requires a single argument to run an experiment: the path to a configuration file. It can also take an output directory as an optional second argument. If the latter is not specified, rsmcompare will use the current directory as the output directory.

Here are all the arguments to the rsmcompare command-line script.

config_file

The JSON configuration file for this experiment.

output_dir (optional)

The output directory where the report files for this comparison will be stored.

-h, --help

Show the help message and exit.

-V, --version

Show version number and exit.

Experiment configuration file

This is a file in .json format that provides overall configuration options for an rsmcompare experiment. Here’s an example configuration file for rsmcompare.

There are seven required fields and the rest are all optional.

comparison_id

An identifier for the comparison experiment that will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.

experiment_id_old

An identifier for the “baseline” experiment. This ID should be identical to the experiment_id used when the baseline experiment was run, whether rsmtool or rsmeval. The results for this experiment will be listed first in the comparison report.

experiment_id_new

An identifier for the experiment with the “new” model (e.g., the model with new feature(s)). This ID should be identical to the experiment_id used when the experiment was run, whether rsmtool or rsmeval. The results for this experiment will be listed second in the comparison report.

experiment_dir_old

The directory with the results for the “baseline” experiment. This directory is the output directory that was used for the experiment and should contain subdirectories output and figure generated by rsmtool or rsmeval.

experiment_dir_new

The directory with the results for the experiment with the new model. This directory is the output directory that was used for the experiment and should contain subdirectories output and figure generated by rsmtool or rsmeval.

description_old

A brief description of the “baseline” experiment. The description can contain spaces and punctuation.

description_new

A brief description of the experiment with the new model. The description can contain spaces and punctuation.

use_scaled_predictions_old (Optional)

Set to true if the “baseline” experiment used scaled machine scores for confusion matrices, score distributions, subgroup analyses, etc. Defaults to false.

use_scaled_predictions_new (Optional)

Set to true if the experiment with the new model used scaled machine scores for confusion matrices, score distributions, subgroup analyses, etc. Defaults to false.

Warning

For rsmtool and rsmeval, primary evaluation analyses are computed on both raw and scaled scores, but some analyses (e.g., the confusion matrix) are only computed for either raw or re-scaled scores based on the value of use_scaled_predictions. rsmcompare uses the existing outputs and does not perform any additional evaluations. Therefore, if this field was set to true in the original experiment but is set to false for rsmcompare, the report will be internally inconsistent: some evaluations will use raw scores whereas others will use scaled scores.

subgroups (Optional)

A list of column names indicating grouping variables used for generating analyses specific to each of those defined subgroups. For example, ["prompt", "gender", "native_language", "test_country"].

Note

In order to include subgroups analyses in the comparison report, both experiments must have been run with the same set of subgroups.

general_sections (Optional)

RSMTool provides pre-defined sections for rsmcompare (listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.

  • feature_descriptives: Compares the descriptive statistics for all raw feature values included in the model:

    • a table showing mean, standard deviation, skewness and kurtosis;
    • a table showing the number of truncated outliers for each feature; and
    • a table with percentiles and outliers;
    • a table with correlations between raw feature values and human score in each model and the correlation between the values of the same feature in these two models. Note that this table only includes features and responses which occur in both training sets.
  • features_by_group: Shows boxplots for both experiments with distributions of raw feature values by each of the subgroups specified in the configuration file.

  • preprocessed_features: Compares analyses of preprocessed features:

    • histograms showing the distributions of preprocessed features values;
    • the correlation matrix between all features and the human score;
    • a table showing marginal correlations between all features and the human score; and
    • a table showing partial correlations between all features and the human score.
  • preprocessed_features_by_group: Compares analyses of preprocessed features by subgroups: marginal and partial correlations between each feature and human score for each subgroup.

  • consistency: Compares metrics for human-human agreement, the difference (‘degradation’) between the human-human and human-system agreement, and the disattenuated correlations for the whole dataset and by each of the subgroups specified in the configuration file.

  • score_distributions:

    • tables showing the distributions for both human and machine scores; and
    • confusion matrices for human and machine scores.
  • model: Compares the parameters of the two regression models. For linear models, it also includes the standardized and relative coefficients.

  • evaluation: Compares the standard set of evaluations recommended for scoring models on the evaluation data.

  • pca: Shows the results of principal components analysis on the processed feature values for the new model only:

    • the principal components themselves;
    • the variances; and
    • a Scree plot.
  • notes: Notes explaining the terminology used in comparison reports.

  • sysinfo: Shows all Python packages along with versions installed in the current environment while generating the report.

custom_sections (Optional)

A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.

special_sections (Optional)

A list specifying special ETS-only comparison sections to be included into the final report. These sections are available only to ETS employees via the rsmextra package.

section_order (Optional)

A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:

  1. Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
  2. All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension, and
  3. All special sections specified using special_sections.

use_thumbnails (Optional)

If set to true, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false, full-sized images will be displayed as usual. Defaults to false.

Output

rsmcompare produces the comparison report in HTML format as well as in the form of a Jupyter notebook (a .ipynb file) in the output directory.

rsmsummarize - Compare multiple scoring models

RSMTool provides the rsmsummarize command-line utility to compare multiple models and to generate a comparison report. Unlike rsmcompare, which creates a detailed comparison report between two models, rsmsummarize can be used to create a more general overview of multiple models.

rsmsummarize can be used to compare:

  1. Multiple rsmtool experiments, or
  2. Multiple rsmeval experiments, or
  3. A mix of rsmtool and rsmeval experiments (in this case, only the evaluation analyses will be compared).

Note

It is strongly recommended that the original experiments as well as the summary experiment are all done using the same version of RSMTool.

Tutorial

For this tutorial, you first need to install RSMTool and make sure the conda environment in which you installed it is activated.

Workflow

rsmsummarize is designed to compare several existing rsmtool or rsmeval experiments. To use rsmsummarize you need:

  1. Two or more experiments that were run using rsmtool or rsmeval.
  2. Create an experiment configuration file describing the comparison experiment you would like to run.
  3. Run that configuration file with rsmsummarize and generate the comparison experiment HTML report.
  4. Examine the HTML report to compare the models.

Note that the above workflow does not use the customization features of rsmsummarize, e.g., choosing which sections to include in the report or adding custom analysis sections. However, we will stick with this workflow for our tutorial since it is likely to be the most common use case.

ASAP Example

We are going to use the same example from the 2012 Kaggle competition on automated essay scoring that we used for the rsmtool tutorial.

Run rsmtool and rsmeval experiments

rsmsummarize compares the results of two or more existing rsmtool (or rsmeval) experiments. For this tutorial, we will compare the model trained in the rsmtool tutorial to the evaluations we obtained in the rsmeval tutorial.

Note

If you have not already completed these tutorials, please do so now. You may need to complete them again if you deleted the output files.

Create a configuration file

The next step is to create an experiment configuration file in .json format.

1  {
2    "summary_id": "model_comparison",
3    "description": "a comparison of the results of the rsmtool sample experiment, rsmeval sample experiment and once again the rsmtool sample experiment",
4    "experiment_dirs": ["../rsmtool", "../rsmeval", "../rsmtool"]
5  }

Let’s take a look at the options in our configuration file.

  • Line 2: We provide the summary_id for the comparison. This will be used to generate the name of the final report.
  • Line 3: We give a short description of this comparison experiment. This will be shown in the report.
  • Line 4: We also give the list of paths to the directories containing the outputs of the experiments we want to compare.

Documentation for all of the available configuration options is available here.

Run the experiment

Now that we have the list of the experiments we want to compare and our configuration file in .json format, we can use the rsmsummarize command-line script to run our comparison experiment.

$ cd examples/rsmsummarize
$ rsmsummarize config_rsmsummarize.json

This should produce output like:

Output directory: /Users/nmadnani/work/rsmtool/examples/rsmsummarize
Starting report generation
Merging sections
Exporting HTML
Executing notebook with kernel: python3

Once the run finishes, you will see an HTML file named model_comparison_report.html. This is the final rsmsummarize summary report.

Examine the report

Our experiment report contains an overview of the main aspects of model performance. It includes:

  1. Brief description of all experiments.
  2. Information about model parameters and model fit for all rsmtool experiments.
  3. Model performance for all experiments.

Note

Some information, such as model fit and model parameters, is only available for rsmtool experiments.

Input

rsmsummarize requires a single argument to run an experiment: the path to a configuration file, which specifies the models you want to compare and the name of the report. It can also take an output directory as an optional second argument. If the latter is not specified, rsmsummarize will use the current directory as the output directory.

Here are all the arguments to the rsmsummarize command-line script.

config_file

The JSON configuration file for this experiment.

output_dir (optional)

The output directory where the report and intermediate .csv files for this comparison will be stored.

-f, --force

If specified, the contents of the output directory will be overwritten even if it already contains the output of another rsmsummarize experiment.

-h, --help

Show the help message and exit.

-V, --version

Show version number and exit.

Experiment configuration file

This is a file in .json format that provides overall configuration options for an rsmsummarize experiment. Here’s an example configuration file for rsmsummarize.

There are two required fields and the rest are all optional.

summary_id

An identifier for the rsmsummarize experiment. This will be used to name the report. It can be any combination of alphanumeric values, must not contain spaces, and must not be any longer than 200 characters.

experiment_dirs

The list of directories with the results of the experiments to be summarized. These directories should be the output directories used for each experiment and should contain the subdirectories output and figure generated by rsmtool or rsmeval.

description (Optional)

A brief description of the summary. The description can contain spaces and punctuation.

file_format (Optional)

The format of the intermediate files generated by rsmsummarize. Options are csv, tsv, or xlsx. Defaults to csv if this is not specified.

Note

In the rsmsummarize context, the file_format parameter refers to the format of the intermediate files generated by rsmsummarize, not the intermediate files generated by the original experiment(s) being summarized. The format of these files does not have to match the format of the files generated by the original experiment(s).

general_sections (Optional)

RSMTool provides pre-defined sections for rsmsummarize (listed below) and, by default, all of them are included in the report. However, you can choose a subset of these pre-defined sections by specifying a list as the value for this field.

  • preprocessed_features: compares marginal and partial correlations between all features and the human score, and optionally response length if this was computed for any of the models.
  • model: Compares the parameters of the regression models. For linear models, it also includes the standardized and relative coefficients.
  • evaluation: Compares the standard set of evaluations recommended for scoring models on the evaluation data.
  • intermediate_file_paths: Shows links to all of the intermediate files that were generated while running the summary.
  • sysinfo: Shows all Python packages along with versions installed in the current environment while generating the report.

custom_sections (Optional)

A list of custom, user-defined sections to be included into the final report. These are IPython notebooks (.ipynb files) created by the user. The list must contain paths to the notebook files, either absolute or relative to the configuration file. All custom notebooks have access to some pre-defined variables.

special_sections (Optional)

A list specifying special ETS-only comparison sections to be included into the final report. These sections are available only to ETS employees via the rsmextra package.

section_order (Optional)

A list containing the order in which the sections in the report should be generated. Any specified order must explicitly list:

  1. Either all pre-defined sections if a value for the general_sections field is not specified OR the sections specified using general_sections, and
  2. All custom section names specified using custom_sections, i.e., file prefixes only, without the path and without the .ipynb extension, and
  3. All special sections specified using special_sections.

use_thumbnails (Optional)

If set to true, the images in the HTML will be set to clickable thumbnails rather than full-sized images. Upon clicking the thumbnail, the full-sized images will be displayed in a separate tab in the browser. If set to false, full-sized images will be displayed as usual. Defaults to false.

Output

rsmsummarize produces a set of folders in the output directory.

report

This folder contains the final rsmsummarize report in HTML format as well as in the form of a Jupyter notebook (a .ipynb file).

output

This folder contains all of the intermediate files produced as part of the various analyses performed, saved as .csv files. rsmsummarize will also save in this folder a copy of the configuration file. Fields not specified in the original configuration file will be pre-populated with default values.

figure

This folder contains all of the figures that may be generated as part of the various analyses performed, saved as .svg files. Note that no figures are generated by the existing rsmsummarize notebooks.

Intermediate files

Although the primary output of rsmsummarize is an HTML report, we also want the user to be able to conduct additional analyses outside of RSMTool. To this end, all of the tables produced in the experiment report are saved as files in the format specified by the file_format parameter in the output directory. The following sections describe all of the intermediate files that are produced.

Note

The names of all files begin with the summary_id provided by the user in the experiment configuration file.

Marginal and partial correlations with score

filenames: margcor_score_all_data, pcor_score_all_data, pcor_score_no_length_all_data

The first file contains the marginal correlations between each pre-processed feature and human score. The second file contains the partial correlation between each pre-processed feature and human score after controlling for all other features. The third file contains the partial correlations between each pre-processed feature and human score after controlling for response length, if length_column was specified in the configuration file.

Model information

  • model_summary: This file contains the main information about the models included in the report, including:

    • the total number of features;
    • the total number of features with non-negative coefficients;
    • the learner; and
    • the label used to train the model.
  • betas: standardized coefficients (for built-in models only).

  • model_fit: R squared and adjusted R squared computed on the training set. Note that these values are always computed on raw predictions without any trimming or rounding.

Note

If the report includes a combination of rsmtool and rsmeval experiments, the summary tables with model information will only include rsmtool experiments since no model information is available for rsmeval experiments.

Evaluation metrics

  • eval_short - descriptives for predicted and human scores (mean, std. dev., etc.) and association metrics (correlation, quadratic weighted kappa, SMD, etc.) for specific score types chosen based on recommendations by Williamson (2012). Specifically, the following columns are included (the raw or scale version is chosen depending on the value of use_scaled_predictions in the configuration file).

    • h_mean
    • h_sd
    • corr
    • sys_mean [raw/scale trim]
    • sys_sd [raw/scale trim]
    • SMD [raw/scale trim]
    • adj_agr [raw/scale trim_round]
    • exact_agr [raw/scale trim_round]
    • kappa [raw/scale trim_round]
    • wtkappa [raw/scale trim_round]
    • sys_mean [raw/scale trim_round]
    • sys_sd [raw/scale trim_round]
    • SMD [raw/scale trim_round]
    • R2 [raw/scale trim]
    • RMSE [raw/scale trim]

Note

Please note that for raw scores, SMD values are likely to be affected by possible differences in scale.