Overview of RSMTool Pipeline

The following figure gives an overview of the RSMTool pipeline:

[Figure: RSMTool processing pipeline]

As its primary input, RSMTool takes a data file containing a table with numeric, non-sparse features and human scores for all responses, pre-processes them, and lets you train a regression-based Scoring Model to predict the human score from the features. Available regression models include Ridge, SVR, AdaBoost, and Random Forests, among many others.

This trained model can then be used to generate scores for held-out evaluation data whose feature values are pre-processed using the same Pre-processing Parameters. In addition to the raw scores predicted by the model, the Prediction Analysis component of the pipeline generates several additional post-processed scores that are commonly used in automated scoring.

The primary output of RSMTool is a comprehensive, customizable HTML statistical report that contains the multiple analyses required for a thorough evaluation of an automated scoring model, including descriptive analyses for all features, model analyses, subgroup comparisons, as well as several different evaluation measures illustrating model efficacy [1]. More details about these analyses can be found in a separate technical paper.

In addition to the HTML report, RSMTool also saves the intermediate outputs of all of the performed analyses as CSV files.

Input file format

The input files containing feature values and scores for all responses in the training and evaluation data should be in tabular format, with features and scores stored in columns and each row corresponding to a single response.

RSMTool supports input files in .csv, .tsv, or .xls/.xlsx format. For Excel spreadsheets, all data must be stored in the first sheet. The format of the file is determined based on its extension. In all cases, the output files will be saved in .csv format.
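As an illustration, here is a minimal sketch, using pandas, of what such a training table might look like and how it could be saved as a .csv file. The column names spkitemid (response ID), FEATURE1, FEATURE2, and sc1 (human score) are illustrative assumptions; the actual ID and score column names are specified in your experiment configuration.

    import pandas as pd

    # a minimal training table: one row per response, one column per feature,
    # plus an ID column and a human score column (names here are illustrative)
    df_train = pd.DataFrame(
        {
            "spkitemid": ["RESP_001", "RESP_002", "RESP_003"],
            "FEATURE1": [0.42, 1.37, 0.95],
            "FEATURE2": [12.0, 8.5, 10.2],
            "sc1": [3, 4, 2],
        }
    )

    # save in one of the supported formats; the extension determines the format
    df_train.to_csv("train.csv", index=False)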

Feature pre-processing

Data filtering

  1. Remove all training and evaluation responses that have non-numeric values for any of the features (see column selection methods for different ways to select features).
  2. Remove all training and evaluation responses with non-numeric values for human scores.
  3. Optionally, remove all training and evaluation responses with zero values for human scores. Zero-scored responses are usually removed since, in many scoring rubrics, a zero score indicates a non-scorable response.
  4. Remove all features with values that do not change across responses (i.e., those with a standard deviation close to 0).
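The following is a simplified sketch of these filtering steps using pandas; it is not RSMTool's internal implementation, and the near-zero standard deviation threshold is an illustrative assumption.

    import pandas as pd

    def filter_responses(df, feature_columns, score_column="sc1", exclude_zero_scores=True):
        # 1 & 2: keep only responses with numeric values for all features and the human score
        numeric = df[feature_columns + [score_column]].apply(pd.to_numeric, errors="coerce")
        df = df.loc[numeric.notna().all(axis=1)].copy()
        df[feature_columns + [score_column]] = numeric.loc[df.index]

        # 3: optionally drop responses with a human score of zero
        if exclude_zero_scores:
            df = df[df[score_column] != 0]

        # 4: drop features whose values do not change across responses
        stds = df[feature_columns].std()
        kept_features = [f for f in feature_columns if stds[f] > 1e-10]
        return df, kept_features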

Data preprocessing

  1. Truncate/clamp any outlier feature values, where outliers are defined as values outside the range \(\mu \pm 4\sigma\), where \(\mu\) is the mean and \(\sigma\) is the standard deviation of the feature.
  2. Apply pre-specified transformations to feature values.
  3. Flip the signs for feature values if necessary.
  4. Standardize all transformed feature values into z-scores.
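A simplified sketch of these pre-processing steps for a single feature column is shown below; the helper function and the transformation names are illustrative assumptions, not RSMTool's API.

    import numpy as np
    import pandas as pd

    def preprocess_feature(values, transform=None, flip_sign=False):
        # 1: clamp outliers to the [mean - 4*sd, mean + 4*sd] range of the raw values
        mu, sigma = values.mean(), values.std()
        values = values.clip(lower=mu - 4 * sigma, upper=mu + 4 * sigma)

        # 2: apply a pre-specified transformation, e.g. log or inverse
        if transform == "log":
            values = np.log(values)
        elif transform == "inv":
            values = 1.0 / values

        # 3: flip the sign if necessary
        if flip_sign:
            values = -values

        # 4: standardize the transformed values into z-scores
        return (values - values.mean()) / values.std()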

Pre-processing parameters

Any held-out evaluation data on which the model is to be evaluated needs to be pre-processed in the same way as the training data. Therefore, the following parameters are computed on the training set, saved to disk, and re-used when pre-processing the evaluation set:

  • Mean and standard deviation of raw feature values. These are used to compute the floor and ceiling for truncating any outliers in the evaluation set;
  • Any transformation and sign changes that were applied;
  • Mean and standard deviation of transformed feature values. These are used to convert feature values in the evaluation set to z-scores.
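The following sketch illustrates how such parameters, computed on the training set, would be re-used on the evaluation set; the values are made up, and no transformation or sign change is applied, for simplicity.

    import pandas as pd

    # toy feature values; in practice these come from the training and evaluation files
    train_feature = pd.Series([0.42, 1.37, 0.95, 1.10, 0.60])
    eval_feature = pd.Series([0.50, 2.80, 0.30])

    # parameters computed on the training set (and saved to disk)
    raw_mean, raw_sd = train_feature.mean(), train_feature.std()
    train_clamped = train_feature.clip(raw_mean - 4 * raw_sd, raw_mean + 4 * raw_sd)
    trans_mean, trans_sd = train_clamped.mean(), train_clamped.std()

    # the same parameters are re-used when pre-processing the evaluation set
    eval_clamped = eval_feature.clip(raw_mean - 4 * raw_sd, raw_mean + 4 * raw_sd)
    eval_zscores = (eval_clamped - trans_mean) / trans_sd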

Score post-processing

RSMTool computes six different versions of scores commonly used in different applications of automated scoring:

raw

The raw predictions generated by the model.

raw_trim

The raw predictions “trimmed” to be in the score range acceptable for the item. The scores are trimmed to lie between \(score_{min} - 0.49998\) and \(score_{max} + 0.49998\), where \(score_{min}\) and \(score_{max}\) are the lowest and highest points on the scoring scale, respectively.

This approach represents a compromise: the trimmed scores are real-valued and therefore provide more information than human scores, which are likely to be integer-valued, while still being guaranteed to fall within the expected scale.
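As a concrete sketch, trimming amounts to clipping the raw predictions to this range; the 1–6 score scale below is only an example.

    import numpy as np

    score_min, score_max = 1, 6          # example scoring scale
    raw_predictions = np.array([0.3, 2.7, 6.9])

    # trim predictions to the acceptable range for the item
    raw_trim = np.clip(raw_predictions, score_min - 0.49998, score_max + 0.49998)
    # -> array([0.50002, 2.7, 6.49998])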

raw_trim_round

The raw_trim predictions rounded to the nearest integer.

Note

The rounding is done using the rint function from numpy. See the numpy documentation for the treatment of values such as 1.5.
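For example, numpy's rint rounds halfway values to the nearest even integer:

    import numpy as np

    np.rint([0.5, 1.5, 2.5, 3.7])   # -> array([0., 2., 2., 4.])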

scale

The raw predictions rescaled to match the human score distribution on the training set. The raw scores are first converted to z-scores using the mean and standard deviation of the machine scores predicted for the training set. The z-scores are then converted back to “scaled” scores using the mean and standard deviation of the human scores, also computed on the training set.
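A small sketch of this rescaling, with made-up numbers, is shown below; whether the standard deviation is computed with or without a degrees-of-freedom correction is an implementation detail not covered here.

    import numpy as np

    # statistics computed on the training set
    train_machine_scores = np.array([2.8, 3.4, 4.1, 3.0])   # model predictions on the training set
    train_human_scores = np.array([3, 3, 4, 3])

    raw_predictions = np.array([2.9, 3.8])                   # predictions on the evaluation set

    # convert raw predictions to z-scores using the machine-score statistics ...
    z = (raw_predictions - train_machine_scores.mean()) / train_machine_scores.std()
    # ... then back to the human score scale using the human-score statistics
    scale = z * train_human_scores.std() + train_human_scores.mean()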

scale_trim

The scaled scores trimmed in the same way as raw_trim scores.

scale_trim_round

The scale_trim scores rounded to the nearest integer.

Footnotes

[1] The primary evaluation analyses in the RSMTool report are conducted for all six types of scores. For some additional evaluations, the user can pick between raw and scaled scores.