PREDICT.classification package

Submodules

PREDICT.classification.RankedSVM module

PREDICT.classification.RankedSVM.RankSVM_test(test_data, num_class, Weights, Bias, SVs, svm='Poly', gamma=0.05, coefficient=0.05, degree=3)
PREDICT.classification.RankedSVM.RankSVM_test_original(test_data, test_target, Weights, Bias, SVs, svm='Poly', gamma=0.05, coefficient=0.05, degree=3)
PREDICT.classification.RankedSVM.RankSVM_train(train_data, train_target, cost=1, lambda_tol=1e-06, norm_tol=0.0001, max_iter=500, svm='Poly', gamma=0.05, coefficient=0.05, degree=3)
PREDICT.classification.RankedSVM.RankSVM_train_old(train_data, train_target, cost=1, lambda_tol=1e-06, norm_tol=0.0001, max_iter=500, svm='Poly', gamma=0.05, coefficient=0.05, degree=3)
Weights,Bias,SVs = RankSVM_train(train_data,train_target,cost,lambda_tol,norm_tol,max_iter,svm,gamma,coefficient,degree)

Description

RankSVM_train takes,

train_data - An MxN array; the ith training instance is stored in train_data[i,:]
train_target - A QxM array; if the ith training instance belongs to the jth class, then train_target[j,i] equals +1, otherwise train_target[j,i] equals -1

svm - The type of kernel used in training, which can take the value of ‘RBF’, ‘Poly’ or ‘Linear’. The kernel parameters that apply depend on this choice:
  1. if svm is ‘RBF’, then gamma gives the value of gamma, where the kernel is exp(-gamma*|x[i]-x[j]|^2)
  2. if svm is ‘Poly’, then three values are used: gamma, coefficient, and degree, where the kernel is (gamma*<x[i],x[j]>+coefficient)^degree
  3. if svm is ‘Linear’, then no kernel parameters are used.

cost - The value of ‘C’ used in the SVM; default=1
lambda_tol - The tolerance value for lambda described in the appendix of [1]; default value is 1e-6
norm_tol - The tolerance value for the difference between alpha(p+1) and alpha(p) described in the appendix of [1]; default value is 1e-4
max_iter - The maximum number of iterations for RankSVM; default=500

and returns,
Weights - The value for beta[ki] as described in the appendix of [1] is stored in Weights[k,i]
Bias - The value for b[i] as described in the appendix of [1] is stored in Bias[1,i]
SVs - The ith support vector is stored in SVs[:,i]

For more details, please refer to [1] and [2].
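The input layout and the kernels described above can be illustrated with a small, library-free sketch. The helper names (build_target, linear_kernel, poly_kernel, rbf_kernel) are hypothetical and are not part of PREDICT; the formulas follow the description above, and the linear kernel is assumed to be the plain inner product <x[i],x[j]>.

```python
import math


def build_target(label_sets, num_class):
    """Return a Q x M nested list: entry [j][i] is +1 if instance i
    belongs to class j, and -1 otherwise."""
    return [[1 if j in labels else -1 for labels in label_sets]
            for j in range(num_class)]


def linear_kernel(x, y):
    """Inner product <x, y> (assumed form of the 'Linear' kernel)."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, gamma=0.05, coefficient=0.05, degree=3):
    """(gamma * <x, y> + coefficient) ** degree, as described for 'Poly'."""
    return (gamma * linear_kernel(x, y) + coefficient) ** degree


def rbf_kernel(x, y, gamma=0.05):
    """exp(-gamma * |x - y|^2), as described for 'RBF'."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical example: M = 3 instances and Q = 3 classes.
# Instance 0 belongs to classes 0 and 2, instance 1 to class 1,
# instance 2 to class 0.
label_sets = [{0, 2}, {1}, {0}]
train_target = build_target(label_sets, num_class=3)
# Rows are classes, columns are instances:
# [[1, -1, 1], [-1, 1, -1], [1, -1, -1]]
```

The nested list can be converted with numpy.asarray to obtain the array indexing (train_target[j,i]) used in the description.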

PREDICT.classification.RankedSVM.is_empty(any_structure)
PREDICT.classification.RankedSVM.neg_dual_func(Lambda, Alpha_old, Alpha_new, c_value, kernel, num_training, num_class, Label, not_Label, Label_size, size_alpha)

PREDICT.classification.construct_classifier module

PREDICT.classification.construct_classifier.construct_SVM(config, image_features, regression=False)

Constructs an SVM classifier.

Args:

config (dict): Dictionary of the required config settings
mutation_data (dict): Mutation data that should be classified
features (pandas dataframe): A pandas dataframe containing the features to be used for classification
Returns:
SVM/SVR classifier, parameter grid
PREDICT.classification.construct_classifier.construct_classifier(config, image_features)

Interface to create a classifier.

Different classifiers can be created using this common interface.

config: dict, mandatory
Contains the required config settings. See the GitHub Wiki for all available fields.
Returns:
Constructed classifier

PREDICT.classification.crossval module

PREDICT.classification.crossval.crossval(config, label_data, image_features, classifier, param_grid={}, use_fastr=False, fastr_plugin=None, tempsave=False, fixedsplits=None, ensemble={'Use': False}, outputfolder=None, modus='singlelabel')

Constructs multiple individual classifiers based on the label settings

config: dict, mandatory
Dictionary with config settings. See the GitHub Wiki for the available fields and formatting.
label_data: dict, mandatory

Should contain the following:
patient_IDs (list): IDs of the patients, used to keep track of test and training sets, and genetic data
mutation_label (list): List of lists, where each list contains the mutation status of that patient for each mutation
mutation_name (list): Contains the different mutations that are stored in the mutation_label
image_features: numpy array, mandatory
Consists of a tuple of two lists for each patient: (feature_values, feature_labels)
classifier: sklearn classifier
The untrained classifier used for training.
param_grid: dictionary, optional
Contains the parameters and their values which are used in the grid or randomized search hyperparameter optimization. See the construct_classifier function for some examples.
use_fastr: boolean, default False

If False, parallel execution through Joblib is used for fast execution of the hyperparameter optimization. Especially suited for execution on multicore (H)PCs. The settings used are specified in the config.ini file in the IOparser folder, which you can adjust to your system.

If True, fastr is used to split the hyperparameter optimization in separate jobs. Parameters for the splitting can be specified in the config file. Especially suited for clusters.

fastr_plugin: string, default None
Determines which plugin is used for fastr executions. When None, uses the default plugin from the fastr config.
tempsave: boolean, default False
If True, create a .hdf5 file after each cross validation containing the classifier and results from that split. This is written to the GSOut folder in your fastr output mount. If False, only the combined result of all cross validations is saved to a .hdf5 file. The combined result is also saved when tempsave is True.
fixedsplits: string, optional
By default, random split cross validation is used to train and evaluate the machine learning methods. Optionally, you can provide a .xlsx file containing fixed splits to be used. See the GitHub Wiki for the format.
ensemble: dictionary, optional
Contains the configuration for constructing an ensemble.
modus: string, default ‘singlelabel’
Determine whether one-vs-all classification (or regression) for each single label is used (‘singlelabel’) or if multilabel classification is performed (‘multilabel’).
Returns:
panda_data: pandas dataframe
Contains all information on the trained classifier.
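As an illustration of the default random split cross validation mentioned for fixedsplits, the following is a minimal sketch that repeatedly shuffles patient IDs into a training and a test set. The function names, the test_size and seed parameters, and the absence of stratification are assumptions made for this example; they do not describe how crossval builds its splits internally.

```python
import random


def random_split(patient_ids, test_size=0.2, seed=None):
    """Shuffle the IDs and return (train_ids, test_ids)."""
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    # Keep at least one patient in the test set.
    n_test = max(1, round(test_size * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


def random_split_crossval(patient_ids, n_iterations=5, test_size=0.2, seed=0):
    """Return one (train_ids, test_ids) pair per iteration.

    Each iteration uses its own seed, so the whole sequence is reproducible.
    """
    return [random_split(patient_ids, test_size, seed + k)
            for k in range(n_iterations)]


# Hypothetical patient IDs.
patient_ids = ["pat-%02d" % i for i in range(10)]
for train_ids, test_ids in random_split_crossval(patient_ids, n_iterations=3):
    print(len(train_ids), "train /", len(test_ids), "test:", sorted(test_ids))
```

Because the splits are keyed on patient IDs, the same IDs can be used to select the matching entries from both the label data and the image features.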
PREDICT.classification.crossval.nocrossval(config, label_data_train, label_data_test, image_features_train, image_features_test, classifier, param_grid, use_fastr=False, fastr_plugin=None, ensemble={'Use': False}, modus='singlelabel')

Constructs multiple individual classifiers based on the label settings

Arguments:

config (Dict): Dictionary with config settings
label_data (Dict): should contain:
patient_IDs (list): IDs of the patients, used to keep track of test and training sets, and genetic data
mutation_label (list): List of lists, where each list contains the mutation status of that patient for each mutation
mutation_name (list): Contains the different mutations that are stored in the mutation_label
image_features (numpy array): Consists of a tuple of two lists for each patient: (feature_values, feature_labels)
ensemble: dictionary, optional
Contains the configuration for constructing an ensemble.
modus: string, default ‘singlelabel’
Determine whether one-vs-all classification (or regression) for each single label is used (‘singlelabel’) or if multilabel classification is performed (‘multilabel’).
Returns:
classifier_data (pandas dataframe)
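Since nocrossval expects separate training and test inputs, the sketch below shows one way the label_data dictionary and the per-patient image_features tuples described above could be laid out and divided by patient ID. The helper select_patients, all patient IDs, mutation names and feature values are hypothetical, and the layout of mutation_label (one inner list per patient) is an assumption based on the wording of the description.

```python
def select_patients(label_data, image_features, keep_ids):
    """Return the label_data and image_features entries for the given IDs."""
    keep_ids = set(keep_ids)
    keep = [i for i, pid in enumerate(label_data["patient_IDs"])
            if pid in keep_ids]
    subset = {
        "patient_IDs": [label_data["patient_IDs"][i] for i in keep],
        # Assumption: mutation_label holds one inner list per patient.
        "mutation_label": [label_data["mutation_label"][i] for i in keep],
        "mutation_name": list(label_data["mutation_name"]),
    }
    return subset, [image_features[i] for i in keep]


# Hypothetical data for three patients and two mutations.
label_data = {
    "patient_IDs": ["pat-01", "pat-02", "pat-03"],
    "mutation_label": [[1, 0], [0, 0], [1, 1]],
    "mutation_name": ["mutation_A", "mutation_B"],
}
feature_labels = ["volume", "mean_intensity"]
# One (feature_values, feature_labels) tuple per patient.
image_features = [
    ([10.2, 0.31], feature_labels),
    ([8.7, 0.45], feature_labels),
    ([12.9, 0.28], feature_labels),
]

label_data_train, image_features_train = select_patients(
    label_data, image_features, ["pat-01", "pat-03"])
label_data_test, image_features_test = select_patients(
    label_data, image_features, ["pat-02"])
```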
PREDICT.classification.crossval.singleiteration(X_train, Y_train, PID_train, feature_labels, classifier, param_grid, config_hyperopt, use_SMOTE=False, SMOTE_ratio=1, SMOTE_neighbors=10, n_cores=4, N_jobs=4, random_state=None, use_fastr=False, fastr_plugin='LinearExecution', use_ensemble=False, use_oversampling=True)

Perform a single iteration of a cross validation.

PREDICT.classification.estimators module

class PREDICT.classification.estimators.RankedSVM(cost=1, lambda_tol=1e-06, norm_tol=0.0001, max_iter=500, svm='Poly', gamma=0.05, coefficient=0.05, degree=3)

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

An example classifier which implements a 1-NN algorithm.

demo_param : str, optional
A parameter used for demonstration of how to pass and store parameters.
X_ : array, shape = [n_samples, n_features]
The input passed during fit()
y_ : array, shape = [n_samples]
The labels passed during fit()
fit(X, y)

A reference implementation of a fitting function for a classifier.

X : array-like, shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values. An array of int.
self : object
Returns self.
predict(X, y=None)

A reference implementation of a prediction for a classifier.

X : array-like of shape = [n_samples, n_features]
The input samples.
y : array of int of shape = [n_samples]
The label for each sample is the label of the closest sample seen during fit.
predict_proba(X, y)

A reference implementation of a prediction for a classifier.

X : array-like of shape = [n_samples, n_features]
The input samples.
y : array of int of shape = [n_samples]
The label for each sample is the label of the closest sample seen during fit.

PREDICT.classification.metrics module

PREDICT.classification.metrics.multi_class_auc(y_truth, y_score)
PREDICT.classification.metrics.multi_class_auc_score(y_truth, y_score)
PREDICT.classification.metrics.pairwise_auc(y_truth, y_score, class_i, class_j)
PREDICT.classification.metrics.performance_multilabel(y_truth, y_prediction, y_score=None, beta=1)

Multiclass performance metrics.

y_truth and y_prediction should both be lists with the multiclass label of each object, e.g.

y_truth = [0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 2] ### Ground truth
y_prediction = [0, 0, 0, 0, 0, 0, 1, 2, 1, 2, 2] ### Predicted labels

Accuracy is calculated according to the formula suggested in the CAD Dementia Grand Challenge: http://caddementia.grand-challenge.org
Multi-class AUC is calculated according to classpy: https://bitbucket.org/bigr_erasmusmc/classpy/src/master/classpy/multi_class_auc.py
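To make the metrics concrete, the sketch below computes a confusion count and the overall accuracy for the example lists, together with a multi-class AUC averaged over class pairs. All function names are hypothetical. Two assumptions are made: accuracy is taken as the plain fraction of correctly labelled objects, and the AUC follows a pairwise (Hand and Till style) scheme, which the names pairwise_auc and multi_class_auc suggest. The exact formulas used by PREDICT and classpy are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def confusion_counts(y_truth, y_prediction):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_truth, y_prediction))


def accuracy(y_truth, y_prediction):
    """Fraction of objects whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_truth, y_prediction))
    return correct / len(y_truth)


def pairwise_auc_sketch(y_truth, y_score, class_i, class_j):
    """Probability that a class_i object gets a higher class_i score than
    a class_j object. Ties count as half a win.

    Assumption: y_score[n][k] holds the score of object n for class k.
    """
    pos = [s[class_i] for t, s in zip(y_truth, y_score) if t == class_i]
    neg = [s[class_i] for t, s in zip(y_truth, y_score) if t == class_j]
    if not pos or not neg:
        return 0.0
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def multi_class_auc_sketch(y_truth, y_score):
    """Average the symmetrised pairwise AUC over all pairs of classes."""
    pairs = list(combinations(sorted(set(y_truth)), 2))
    if not pairs:
        raise ValueError("at least two classes are required")
    total = sum(0.5 * (pairwise_auc_sketch(y_truth, y_score, i, j)
                       + pairwise_auc_sketch(y_truth, y_score, j, i))
                for i, j in pairs)
    return total / len(pairs)


y_truth = [0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 2]
y_prediction = [0, 0, 0, 0, 0, 0, 1, 2, 1, 2, 2]
# 9 of the 11 objects are labelled correctly.
print(accuracy(y_truth, y_prediction))
print(confusion_counts(y_truth, y_prediction))
```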

PREDICT.classification.metrics.performance_singlelabel(y_truth, y_prediction, y_score, regression=False)

Singleclass performance metrics

PREDICT.classification.parameter_optimization module

PREDICT.classification.parameter_optimization.random_search_parameters(features, labels, N_iter, test_size, classifier, param_grid, scoring_method, n_jobspercore=200, use_fastr=False, n_cores=1, fastr_plugin=None)

Trains a classifier and simultaneously optimizes hyperparameters using a randomized search.

Arguments:

features: numpy array containing the training features.
labels: list containing the object labels to be trained on.
N_iter: integer giving the number of iterations to be used in the hyperparameter optimization.
test_size: float giving the test size percentage used in the cross validation.
classifier: sklearn classifier to be tested.
param_grid: dictionary containing all possible hyperparameters and their values or distributions.
scoring_method: string defining the scoring method used in the optimization, e.g. f1_weighted for an SVM.
n_jobspercore: integer giving the number of jobs that are run on a single core when using the fastr randomized search.
use_fastr: Boolean determining whether fastr or joblib should be used for the optimization.
fastr_plugin: determines which plugin is used for fastr executions. When None, uses the default plugin from the fastr config.
Returns:
random_search: sklearn randomsearch object containing the results.
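The idea of a randomized search can be sketched without any library: draw N_iter parameter settings from the grid, score each one, and keep the best. The names below (sample_parameters, random_search_sketch, toy_score) are hypothetical, and the convention that a grid entry is either a list of values or a callable that draws from a distribution is an assumption made for this sketch. It is not the implementation used by random_search_parameters, which returns an sklearn search object.

```python
import math
import random


def sample_parameters(param_grid, rng):
    """Draw one setting: lists are sampled uniformly, callables are
    called with the random generator to draw from a distribution."""
    return {name: (spec(rng) if callable(spec) else rng.choice(spec))
            for name, spec in param_grid.items()}


def random_search_sketch(score_fn, param_grid, n_iter=10, seed=0):
    """Evaluate n_iter random settings and return the best one.

    score_fn stands in for training the classifier on a train split and
    scoring it on the held-out test split; higher scores are better.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_parameters(param_grid, rng)
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: a categorical kernel and a log-uniform cost C.
param_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": lambda rng: 10 ** rng.uniform(-2, 2),
}


def toy_score(params):
    """Made-up score that prefers the rbf kernel and C close to 1."""
    return (params["kernel"] == "rbf") - abs(math.log10(params["C"]))


best_params, best_score = random_search_sketch(
    toy_score, param_grid, n_iter=20, seed=3)
print(best_params, best_score)
```

In a real run, each evaluation would be an independent job, which is why the work can be distributed over cores with joblib or over separate jobs with fastr.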

Module contents