learner Module
Provides an easy-to-use wrapper around scikit-learn.
author: Michael Heilman (mheilman@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
author: Dan Blanchard (dblanchard@ets.org)
author: Aoife Cahill (acahill@ets.org)
organization: ETS
-
class skll.learner.FilteredLeaveOneGroupOut(keep, example_ids)
Bases: sklearn.model_selection._split.LeaveOneGroupOut
Version of the LeaveOneGroupOut cross-validation iterator that only outputs indices of instances with IDs in a prespecified set.
Parameters:
- keep (set of str) – A set of IDs to keep.
- example_ids (list of str, of length n_samples) – A list of example IDs.
-
split(X, y, groups)
Generate indices to split data into training and test sets.
Parameters:
- X (array-like, with shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features.
- y (array-like, of length n_samples) – The target variable for supervised learning problems.
- groups (array-like, with shape (n_samples,)) – Group labels for the samples used while splitting the dataset into training and test sets.
Yields:
- train_index (np.array) – The training set indices for that split.
- test_index (np.array) – The testing set indices for that split.
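The filtering behavior described for this iterator can be pictured with a minimal pure-Python sketch. It is not SKLL's implementation: the function name filtered_leave_one_group_out and the toy data are hypothetical, and it assumes that IDs outside keep are dropped from both the training and the test indices of every split.

```python
def filtered_leave_one_group_out(groups, example_ids, keep):
    """Yield (train_index, test_index) pairs, one per distinct group.

    Illustrative only: every index whose example ID is not in `keep`
    is dropped from both the training and the test side (an assumption).
    """
    for held_out in sorted(set(groups)):
        train_index = [i for i, g in enumerate(groups)
                       if g != held_out and example_ids[i] in keep]
        test_index = [i for i, g in enumerate(groups)
                      if g == held_out and example_ids[i] in keep]
        yield train_index, test_index


groups = ["a", "a", "b", "b", "c"]
example_ids = ["ex1", "ex2", "ex3", "ex4", "ex5"]
keep = {"ex1", "ex3", "ex4", "ex5"}  # "ex2" is filtered out

splits = list(filtered_leave_one_group_out(groups, example_ids, keep))
for train_index, test_index in splits:
    print(train_index, test_index)
```

Index 1 (the example with ID "ex2") never appears in any split, while the group structure of the remaining examples is preserved.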
-
class skll.learner.Learner(model_type, probability=False, feature_scaling='none', model_kwargs=None, pos_label_str=None, min_feature_count=1, sampler=None, sampler_kwargs=None, custom_learner_path=None, logger=None)
Bases: object
A simpler learner interface around many scikit-learn classification and regression functions.
Parameters:
- model_type (str) – Name of estimator to create (e.g., 'LogisticRegression'). See the skll package documentation for valid options.
- probability (bool, optional) – Should learner return probabilities of all labels (instead of just the label with the highest probability)? Defaults to False.
- feature_scaling (str, optional) – How to scale the features, if at all. Options are
  - ‘with_std’: scale features using the standard deviation
  - ‘with_mean’: center features using the mean
  - ‘both’: do both scaling as well as centering
  - ‘none’: do neither scaling nor centering
  Defaults to ‘none’.
- model_kwargs (dict, optional) – A dictionary of keyword arguments to pass to the initializer for the specified model. Defaults to None.
- pos_label_str (str, optional) – The string for the positive label in the binary classification setting. If not specified, an arbitrary label is picked. Defaults to None.
- min_feature_count (int, optional) – The minimum number of examples a feature must have a nonzero value in to be included. Defaults to 1.
- sampler (str, optional) – The sampler to use for kernel approximation, if desired. Valid values are
  - ‘AdditiveChi2Sampler’
  - ‘Nystroem’
  - ‘RBFSampler’
  - ‘SkewedChi2Sampler’
  Defaults to None.
- sampler_kwargs (dict, optional) – A dictionary of keyword arguments to pass to the initializer for the specified sampler. Defaults to None.
- custom_learner_path (str, optional) – Path to module where a custom classifier is defined. Defaults to None.
- logger (logging object, optional) – A logging object. If None is passed, get logger from __name__. Defaults to None.
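To make the four feature_scaling options concrete, here is a small standard-library sketch applied to a single feature column. It is not SKLL code: the helper scale_column is hypothetical, and the use of the population standard deviation and the handling of constant columns are assumptions.

```python
import statistics


def scale_column(values, feature_scaling="none"):
    """Scale one feature column according to a feature_scaling option.

    Illustrative only. Uses the population standard deviation and
    leaves a constant column unscaled to avoid dividing by zero
    (both are assumptions).
    """
    if feature_scaling not in {"none", "with_mean", "with_std", "both"}:
        raise ValueError(f"Unknown feature_scaling: {feature_scaling!r}")
    center = feature_scaling in {"with_mean", "both"}
    scale = feature_scaling in {"with_std", "both"}
    mean = statistics.fmean(values) if center else 0.0
    std = statistics.pstdev(values) if scale else 1.0
    if std == 0.0:
        std = 1.0
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 6.0, 8.0]
print(scale_column(column, "with_mean"))  # centered, original spread
print(scale_column(column, "with_std"))   # unit spread, not centered
print(scale_column(column, "both"))       # centered and unit spread
```

With 'with_mean' the column keeps its original spread, with 'with_std' it keeps its offset, and 'both' yields zero mean and unit standard deviation.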
-
cross_validate(examples, stratified=True, cv_folds=10, grid_search=False, grid_search_folds=3, grid_jobs=None, grid_objective='f1_score_micro', output_metrics=[], prediction_prefix=None, param_grid=None, shuffle=False, save_cv_folds=False, use_custom_folds_for_grid_search=True)
Cross-validates a given model on the training examples.
Parameters:
- examples (skll.FeatureSet) – The FeatureSet instance to cross-validate learner performance on.
- stratified (bool, optional) – Should we stratify the folds to ensure an even distribution of labels for each fold? Defaults to True.
- cv_folds (int or dict, optional) – The number of folds to use for cross-validation, or a mapping from example IDs to folds. Defaults to 10.
- grid_search (bool, optional) – Should we do grid search when training each fold? Note: this makes cross-validation take much longer. Defaults to False.
- grid_search_folds (int or dict, optional) – The number of folds to use when doing the grid search, or a mapping from example IDs to folds. Defaults to 3.
- grid_jobs (int, optional) – The number of jobs to run in parallel when doing the grid search. If None or 0, the number of grid search folds will be used. Defaults to None.
- grid_objective (str, optional) – The name of the objective function to use when doing the grid search. Defaults to 'f1_score_micro'.
- output_metrics (list of str, optional) – List of additional metric names to compute in addition to the metric used for grid search. Defaults to an empty list.
- prediction_prefix (str, optional) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by "_predictions.tsv". Defaults to None.
- param_grid (list of dicts, optional) – The parameter grid to traverse. Defaults to None.
- shuffle (bool, optional) – Shuffle examples before splitting into folds for CV. Defaults to False.
- save_cv_folds (bool, optional) – Whether to save the CV fold IDs. Defaults to False.
- use_custom_folds_for_grid_search (bool, optional) – If cv_folds is a custom dictionary but grid_search_folds is not (perhaps due to user oversight), should the same custom dictionary automatically be used for the inner grid-search cross-validation? Defaults to True.
Returns:
- results (list of 6-tuples) – The confusion matrix, overall accuracy, per-label PRFs, model parameters, objective function score, and evaluation metrics (if any) for each fold.
- grid_search_scores (list of floats) – The grid search scores for each fold.
- skll_fold_ids (dict) – A dictionary containing the test-fold number for each ID if save_cv_folds is True, otherwise None.
Raises: ValueError – If labels are not encoded as strings.
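When cv_folds is given as a mapping from example IDs to folds, each distinct fold label can be thought of as one held-out test set. The following sketch shows that idea in plain Python. The helper splits_from_fold_mapping, the author-based fold labels, and the example IDs are all hypothetical, and the sketch does not reproduce how SKLL builds its folds internally.

```python
from collections import defaultdict


def splits_from_fold_mapping(example_ids, fold_mapping):
    """Build (fold, train_ids, test_ids) triples from an ID-to-fold mapping.

    Illustrative only: each distinct fold label is held out once as the
    test set while all remaining examples form the training set.
    """
    ids_by_fold = defaultdict(list)
    for example_id in example_ids:
        ids_by_fold[fold_mapping[example_id]].append(example_id)
    splits = []
    for fold in sorted(ids_by_fold):
        test_ids = ids_by_fold[fold]
        train_ids = [i for i in example_ids if fold_mapping[i] != fold]
        splits.append((fold, train_ids, test_ids))
    return splits


example_ids = ["ex1", "ex2", "ex3", "ex4", "ex5", "ex6"]
# Hypothetical mapping: keep each author's examples in the same fold.
fold_mapping = {"ex1": "author_a", "ex2": "author_a", "ex3": "author_b",
                "ex4": "author_b", "ex5": "author_c", "ex6": "author_c"}

cv_splits = splits_from_fold_mapping(example_ids, fold_mapping)
for fold, train_ids, test_ids in cv_splits:
    print(fold, train_ids, test_ids)
```

A mapping like this is useful when related examples (here, texts by the same author) must never be split across training and test data.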
-
evaluate(examples, prediction_prefix=None, append=False, grid_objective=None, output_metrics=[])
Evaluates a given model on a given dev or test FeatureSet.
Parameters:
- examples (skll.FeatureSet) – The FeatureSet instance to evaluate the performance of the model on.
- prediction_prefix (str, optional) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by "_predictions.tsv". Defaults to None.
- append (bool, optional) – Should we append the current predictions to the file if it exists? Defaults to False.
- grid_objective (function, optional) – The objective function that was used when doing the grid search. Defaults to None.
- output_metrics (list of str, optional) – List of additional metric names to compute in addition to the grid objective. Defaults to an empty list.
Returns: res – The confusion matrix, the overall accuracy, the per-label PRFs, the model parameters, the grid search objective function score, and the additional evaluation metrics, if any.
Return type: 6-tuple
-
classmethod from_file(learner_path)
Load a saved Learner instance from a file path.
Parameters: learner_path (str) – The path to a saved Learner instance file.
Returns: learner – The Learner instance loaded from the file.
Return type: Learner
Raises:
- ValueError – If the pickled object is not a Learner instance.
- ValueError – If the pickled version of the Learner instance is out of date.
-
learning_curve(examples, cv_folds=10, train_sizes=array([ 0.1, 0.325, 0.55, 0.775, 1. ]), metric='f1_score_micro')
Generates learning curves for a given model on the training examples via cross-validation. Adapted from the scikit-learn code for learning curve generation (cf. sklearn.model_selection.learning_curve).
Parameters:
- examples (skll.FeatureSet) – The FeatureSet instance to generate the learning curve on.
- cv_folds (int or dict, optional) – The number of folds to use for cross-validation, or a mapping from example IDs to folds. Defaults to 10.
- train_sizes (list of float or int, optional) – Relative or absolute numbers of training examples that will be used to generate the learning curve. If the type is float, each value is regarded as a fraction of the maximum size of the training set (which is determined by the selected validation method), i.e., it has to be within (0, 1]. Otherwise, the values are interpreted as absolute sizes of the training sets. Note that for classification, the number of samples usually has to be big enough to contain at least one sample from each class. Defaults to np.linspace(0.1, 1.0, 5).
- metric (str, optional) – The name of the metric function to use when computing the train and test scores for the learning curve. Defaults to 'f1_score_micro'.
Returns:
- train_scores (list of float) – The scores for the training set.
- test_scores (list of float) – The scores on the test set.
- num_examples (list of int) – The numbers of training examples used to generate the curve.
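The float-versus-int interpretation of train_sizes can be sketched as follows. This is an illustration, not the SKLL or scikit-learn code: the helper absolute_train_sizes is hypothetical, and truncating fractional sizes toward zero (with a floor of one example) is an assumption about the rounding rule.

```python
def absolute_train_sizes(train_sizes, n_max_training_examples):
    """Translate relative or absolute train sizes into example counts.

    Illustrative only. Fractions are truncated toward zero (with a
    floor of one example); that rounding rule is an assumption.
    """
    if all(isinstance(size, float) for size in train_sizes):
        if not all(0.0 < size <= 1.0 for size in train_sizes):
            raise ValueError("Fractional train sizes must lie in (0, 1].")
        return [max(1, int(size * n_max_training_examples))
                for size in train_sizes]
    if not all(isinstance(size, int) and 0 < size <= n_max_training_examples
               for size in train_sizes):
        raise ValueError("Absolute train sizes must be ints in (0, n_max].")
    return list(train_sizes)


# With 100 examples and 10-fold CV, each training fold has 90 examples.
default_fractions = [0.1, 0.325, 0.55, 0.775, 1.0]
relative_sizes = absolute_train_sizes(default_fractions, 90)
fixed_sizes = absolute_train_sizes([10, 45, 90], 90)
print(relative_sizes)
print(fixed_sizes)
```

The fractions are taken relative to the size of a training fold, not the full data set, which is why the maximum here is 90 rather than 100.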
-
load(learner_path)
Replace the current learner instance with a saved learner.
Parameters: learner_path (str) – The path to a saved learner object file to load.
-
model
The underlying scikit-learn model.
-
model_kwargs
A dictionary of the underlying scikit-learn model’s keyword arguments.
-
model_params
Model parameters (i.e., weights) for a LinearModel (e.g., Ridge) regression and liblinear models.
Returns:
- res (dict) – A dictionary of labeled weights.
- intercept (dict) – A dictionary of intercept(s).
Raises: ValueError – If the instance does not support model parameters.
-
model_type
The model type (i.e., the class).
-
predict(examples, prediction_prefix=None, append=False, class_labels=False)
Uses a given model to generate predictions on a given FeatureSet.
Parameters:
- examples (skll.FeatureSet) – The FeatureSet instance to predict labels for.
- prediction_prefix (str, optional) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by "_predictions.tsv". Defaults to None.
- append (bool, optional) – Should we append the current predictions to the file if it exists? Defaults to False.
- class_labels (bool, optional) – For classifiers, should we convert class indices to their (str) labels? Defaults to False.
Returns: yhat – The predictions returned by the Learner instance.
Return type: array-like
Raises: MemoryError – If process runs out of memory when converting to dense.
-
probability
Should learner return probabilities of all labels (instead of just the label with the highest probability)?
-
save(learner_path)
Save the Learner instance to a file.
Parameters: learner_path (str) – The path to save the Learner instance to.
-
train(examples, param_grid=None, grid_search_folds=3, grid_search=True, grid_objective='f1_score_micro', grid_jobs=None, shuffle=False, create_label_dict=True)
Train a classification model and return the model, score, feature vectorizer, scaler, label dictionary, and inverse label dictionary.
Parameters:
- examples (skll.FeatureSet) – The FeatureSet instance to use for training.
- param_grid (list of dicts, optional) – The parameter grid to search through for grid search. If None, a default parameter grid will be used. Defaults to None.
- grid_search_folds (int or dict, optional) – The number of folds to use when doing the grid search, or a mapping from example IDs to folds. Defaults to 3.
- grid_search (bool, optional) – Should we do grid search? Defaults to True.
- grid_objective (str, optional) – The name of the objective function to use when doing the grid search. Defaults to 'f1_score_micro'.
- grid_jobs (int, optional) – The number of jobs to run in parallel when doing the grid search. If None or 0, the number of grid search folds will be used. Defaults to None.
- shuffle (bool, optional) – Shuffle examples (e.g., for grid search CV). Defaults to False.
- create_label_dict (bool, optional) – Should we create the label dictionary? This dictionary is used to map between string labels and their corresponding numerical values. This should only be done once per experiment, so when cross_validate calls train, create_label_dict gets set to False. Defaults to True.
Returns: grid_score – The best grid search objective function score, or 0 if we’re not doing grid search.
Return type: float
Raises:
- ValueError – If grid_objective is not a valid grid objective.
- MemoryError – If process runs out of memory converting training data to dense.
- ValueError – If FeatureHasher is used with MultinomialNB.
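The label dictionary mentioned under create_label_dict maps string labels to numerical values, and its inverse maps them back. A minimal sketch of such a pair of dictionaries is shown here. The helper build_label_dicts is hypothetical, and assigning codes in sorted label order is an assumption, not documented SKLL behavior.

```python
def build_label_dicts(labels):
    """Map string labels to integer codes and back.

    Illustrative only: codes are assigned in sorted label order,
    which is an assumption rather than documented behavior.
    """
    label_dict = {label: code
                  for code, label in enumerate(sorted(set(labels)))}
    inverse_label_dict = {code: label for label, code in label_dict.items()}
    return label_dict, inverse_label_dict


train_labels = ["spam", "ham", "spam", "ham", "ham"]
label_dict, inverse_label_dict = build_label_dicts(train_labels)

encoded = [label_dict[label] for label in train_labels]
decoded = [inverse_label_dict[code] for code in encoded]
print(label_dict)
print(encoded)
print(decoded)
```

Building the mapping once per experiment keeps the codes stable, so a label encoded in one fold decodes to the same string in every other fold.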
-
class skll.learner.SelectByMinCount(min_count=1)
Bases: sklearn.feature_selection.univariate_selection.SelectKBest
Select features occurring in more than (and/or fewer than) a specified number of examples in the training data (or a CV training fold).
Parameters: min_count (int, optional) – The minimum feature count to select. Defaults to 1.
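The selection rule can be sketched in plain Python by counting, for each feature, the number of examples in which it has a nonzero value. This is not the SKLL class: the function select_by_min_count and the feature names are hypothetical, and each example is represented as a simple dict rather than a row of a feature matrix.

```python
from collections import Counter


def select_by_min_count(examples, min_count=1):
    """Return the feature names that are nonzero in at least
    `min_count` examples. Illustrative only; each example is a
    plain dict mapping feature names to values.
    """
    nonzero_counts = Counter()
    for features in examples:
        nonzero_counts.update(name for name, value in features.items()
                              if value != 0)
    return sorted(name for name, count in nonzero_counts.items()
                  if count >= min_count)


examples = [
    {"has_url": 1, "num_words": 12, "has_emoji": 0},
    {"has_url": 0, "num_words": 7, "has_emoji": 0},
    {"has_url": 1, "num_words": 3, "has_emoji": 1},
]
kept_with_one = select_by_min_count(examples, min_count=1)
kept_with_two = select_by_min_count(examples, min_count=2)
print(kept_with_one)
print(kept_with_two)
```

Raising min_count from 1 to 2 drops has_emoji, which is nonzero in only one example.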
-
skll.learner.rescaled(cls)
Decorator to create regressors that store a min and a max for the training data and make sure that predictions fall within that range. It also stores the means and SDs of the gold standard and the predictions on the training set to rescale the predictions (e.g., as in e-rater).
Parameters: cls (BaseEstimator) – An estimator class to add rescaling to.
Returns: cls – Modified version of estimator class with rescaled functions added.
Return type: BaseEstimator
Raises: ValueError – If classifier cannot be rescaled (i.e., is not a regressor).
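The two steps described for rescaled regressors, matching the gold-standard mean and SD and then constraining predictions to the training range, can be sketched with the standard library. This is not the decorator itself: fit_rescaling and rescale are hypothetical helpers, and the order of the two steps, the use of the population SD, and the requirement that training predictions are not constant are all assumptions.

```python
import statistics


def fit_rescaling(gold, train_predictions):
    """Store the training statistics needed to rescale predictions."""
    return {
        "y_min": min(gold), "y_max": max(gold),
        "y_mean": statistics.fmean(gold),
        "y_sd": statistics.pstdev(gold),
        "yhat_mean": statistics.fmean(train_predictions),
        "yhat_sd": statistics.pstdev(train_predictions),
    }


def rescale(predictions, stats):
    """Match the gold mean and SD, then clamp to the training range.

    Illustrative only: the order of the two steps and the use of the
    population SD are assumptions. Requires a nonzero prediction SD.
    """
    rescaled_predictions = []
    for prediction in predictions:
        z_score = (prediction - stats["yhat_mean"]) / stats["yhat_sd"]
        value = z_score * stats["y_sd"] + stats["y_mean"]
        rescaled_predictions.append(
            min(max(value, stats["y_min"]), stats["y_max"]))
    return rescaled_predictions


gold = [1.0, 2.0, 3.0, 4.0, 5.0]
train_predictions = [2.0, 2.5, 3.0, 3.5, 4.0]  # too compressed
stats = fit_rescaling(gold, train_predictions)
print(rescale(train_predictions, stats))
print(rescale([10.0], stats))  # clamped to the training maximum
```

In this toy case the raw predictions cover only half the gold range, so rescaling stretches them back toward the spread of the gold scores, and the clamp keeps extreme predictions inside the range seen in training.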