data Package

data.featureset Module

Classes related to storing/merging feature sets.

author: Dan Blanchard (dblanchard@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
author: Jeremy Biggs (jbiggs@ets.org)
organization: ETS
class skll.data.featureset.FeatureSet(name, ids, labels=None, features=None, vectorizer=None)

Bases: object

Encapsulation of all of the features, values, and metadata about a given set of data. This replaces ExamplesTuple from older versions of SKLL.
Parameters:
- name (str) – The name of this feature set.
- ids (np.array) – Example IDs for this set.
- labels (np.array, optional) – Labels for this set. Defaults to None.
- features (list of dict or array-like, optional) – The features for each instance, represented as either a list of dictionaries or an array-like (if vectorizer is also specified). Defaults to None.
- vectorizer (DictVectorizer or FeatureHasher, optional) – Vectorizer which will be used to generate the feature matrix. Defaults to None.

Warning

FeatureSets can only be equal if the order of the instances is identical because these are stored as lists/arrays. Since scikit-learn’s DictVectorizer automatically sorts the underlying feature matrix if it is sparse, we do not do any sorting before checking for equality. This is not a problem because we always use sparse matrices with DictVectorizer when creating FeatureSets.

Notes

If ids, labels, and/or features are not None, the number of rows in each array must be equal.
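To make the constructor above concrete, here is a minimal sketch (not part of the original documentation) that builds a small FeatureSet from feature dictionaries. The IDs, labels, and feature names are invented, and leaving vectorizer unset lets the class create its own sparse DictVectorizer.

```python
import numpy as np

from skll.data.featureset import FeatureSet

# Three toy examples with invented feature names and values.
ids = np.array(['EXAMPLE_0', 'EXAMPLE_1', 'EXAMPLE_2'])
labels = np.array(['cat', 'dog', 'cat'])
features = [{'length': 4.0, 'furry': 1},
            {'length': 9.0, 'furry': 1},
            {'length': 3.5, 'furry': 1}]

# No vectorizer is given, so a sparse DictVectorizer is used internally.
fs = FeatureSet('toy_animals', ids, labels=labels, features=features)
```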
filter(ids=None, labels=None, features=None, inverse=False)

Removes or keeps features and/or examples from the FeatureSet depending on the parameters. Filtering is done in-place.

Parameters:
- ids (list of str/float, optional) – Examples to keep in the FeatureSet. If None, no ID filtering takes place. Defaults to None.
- labels (list of str/float, optional) – Labels that we want to retain examples for. If None, no label filtering takes place. Defaults to None.
- features (list of str, optional) – Features to keep in the FeatureSet. To help with filtering string-valued features that were converted to sequences of boolean features when read in, any features in the FeatureSet that contain a = will be split on the first occurrence and the prefix will be checked to see if it is in features. If None, no feature filtering takes place. Cannot be used if the FeatureSet uses a FeatureHasher for vectorization. Defaults to None.
- inverse (bool, optional) – Instead of keeping features and/or examples in lists, remove them. Defaults to False.

Raises: ValueError – If attempting to use features to filter a FeatureSet that uses a FeatureHasher vectorizer.
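As a brief, hedged sketch of in-place filtering (reusing the toy fs constructed above; the labels and feature names remain invented):

```python
# Keep only the examples labeled 'cat'; the change happens in place.
fs.filter(labels=['cat'])

# Remove the 'furry' feature instead of keeping it, via inverse=True.
fs.filter(features=['furry'], inverse=True)
```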
filtered_iter(ids=None, labels=None, features=None, inverse=False)

A version of __iter__ that retains only the specified features and/or examples from the output.

Parameters:
- ids (list of str/float, optional) – Examples to keep in the FeatureSet. If None, no ID filtering takes place. Defaults to None.
- labels (list of str/float, optional) – Labels that we want to retain examples for. If None, no label filtering takes place. Defaults to None.
- features (list of str, optional) – Features to keep in the FeatureSet. To help with filtering string-valued features that were converted to sequences of boolean features when read in, any features in the FeatureSet that contain a = will be split on the first occurrence and the prefix will be checked to see if it is in features. If None, no feature filtering takes place. Cannot be used if the FeatureSet uses a FeatureHasher for vectorization. Defaults to None.
- inverse (bool, optional) – Instead of keeping features and/or examples in lists, remove them. Defaults to False.

Yields:
- id_ (str) – The ID of the example.
- label_ (str) – The label of the example.
- feat_dict (dict) – The feature dictionary, with feature name as the key and example value as the value.

Raises: ValueError – If the vectorizer is not a DictVectorizer.
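A small sketch of the generator (again assuming a toy fs backed by a DictVectorizer); unlike filter, this yields a filtered view without modifying the set:

```python
# Iterate over only the 'cat' examples, leaving fs itself untouched.
for id_, label_, feat_dict in fs.filtered_iter(labels=['cat']):
    print(id_, label_, feat_dict)
```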
static from_data_frame(df, name, labels_column=None, vectorizer=None)

Helper function to create a FeatureSet instance from a pandas.DataFrame. Will raise an Exception if pandas is not installed in your environment. The ids in the FeatureSet will be the index from the given frame.

Parameters:
- df (pd.DataFrame) – The pandas.DataFrame object to use as a FeatureSet.
- name (str) – The name of the output FeatureSet instance.
- labels_column (str, optional) – The name of the column containing the labels (data to predict). Defaults to None.
- vectorizer (DictVectorizer or FeatureHasher, optional) – Vectorizer which will be used to generate the feature matrix. Defaults to None.

Returns: feature_set – A FeatureSet instance generated from the given data frame.
Return type: skll.FeatureSet
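A hedged sketch of from_data_frame; the column names and values below are made up, and the frame's default integer index becomes the example IDs as described above:

```python
import pandas as pd

from skll.data.featureset import FeatureSet

# Hypothetical frame: two feature columns plus a label column.
df = pd.DataFrame({'length': [4.0, 9.0, 3.5],
                   'furry': [1, 1, 1],
                   'animal': ['cat', 'dog', 'cat']})

fs = FeatureSet.from_data_frame(df, 'toy_animals', labels_column='animal')
```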
has_labels

Check if the FeatureSet has finite labels.

Returns: has_labels – Whether or not this FeatureSet has any finite labels.
Return type: bool
static split_by_ids(fs, ids_for_split1, ids_for_split2=None)

Split the FeatureSet into two new FeatureSet instances based on the given IDs for the two splits.

Parameters:
- fs (skll.FeatureSet) – The FeatureSet instance to split.
- ids_for_split1 (list of int) – A list of example IDs which will be split out into the first FeatureSet instance. Note that the FeatureSet instance will respect the order of the specified IDs.
- ids_for_split2 (list of int, optional) – An optional list of example IDs which will be split out into the second FeatureSet instance. Note that the FeatureSet instance will respect the order of the specified IDs. If this is not specified, then the second FeatureSet instance will contain the complement of the first set of IDs sorted in ascending order. Defaults to None.

Returns:
- fs1 (skll.FeatureSet) – The first FeatureSet.
- fs2 (skll.FeatureSet) – The second FeatureSet.
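A short sketch of split_by_ids, assuming a FeatureSet fs whose example IDs are integers; the second split is left to default to the complement of the first set of IDs:

```python
from skll.data.featureset import FeatureSet

# Hypothetical: examples 0 and 2 go into the first split; the remaining
# IDs (sorted ascending) form the second split automatically.
train_fs, test_fs = FeatureSet.split_by_ids(fs, [0, 2])
```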
data.readers Module

Handles loading data from various types of data files.

author: Dan Blanchard (dblanchard@ets.org)
author: Michael Heilman (mheilman@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
organization: ETS
class skll.data.readers.ARFFReader(path_or_list, **kwargs)

Bases: skll.data.readers.DelimitedReader

Reader for creating a FeatureSet instance from an ARFF file.

If example/instance IDs are included in the files, they must be specified in the id column.

Also, there must be a column with the name specified by label_col if the data is labeled, and this column must be the final one (as it is in Weka).

Parameters:
- path_or_list (str) – The path to the ARFF file.
- kwargs (dict, optional) – Other arguments to the Reader object.
static split_with_quotes(s, delimiter=' ', quote_char="'", escape_char='\\')

A replacement for string.split that won’t split on delimiters enclosed in quotes.

Parameters:
- s (str) – The string with quotes to split.
- delimiter (str, optional) – The delimiter to split on. Defaults to ' '.
- quote_char (str, optional) – The quote character to ignore. Defaults to "'".
- escape_char (str, optional) – The escape character. Defaults to '\\'.
class skll.data.readers.CSVReader(path_or_list, **kwargs)

Bases: skll.data.readers.DelimitedReader

Reader for creating a FeatureSet instance from a CSV file.

If example/instance IDs are included in the files, they must be specified in the id column.

Also, there must be a column with the name specified by label_col if the data is labeled.

Parameters:
- path_or_list (str) – The path to a comma-delimited file.
- kwargs (dict, optional) – Other arguments to the Reader object.
class skll.data.readers.DelimitedReader(path_or_list, **kwargs)

Bases: skll.data.readers.Reader

Reader for creating a FeatureSet instance from a delimited (CSV/TSV) file.

If example/instance IDs are included in the files, they must be specified in the id column.

For ARFF, CSV, and TSV files, there must be a column with the name specified by label_col if the data is labeled. For ARFF files, this column must also be the final one (as it is in Weka).

Parameters:
- path_or_list (str) – The path to a delimited file.
- dialect (str) – The dialect to pass on to the underlying CSV reader. Defaults to 'excel-tab'.
- kwargs (dict, optional) – Other arguments to the Reader object.
class skll.data.readers.DictListReader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None, logger=None)

Bases: skll.data.readers.Reader

This class is to facilitate programmatic use of Learner.predict() and other methods that take FeatureSet objects as input. It iterates over examples in the same way as other Reader classes, but uses a list of example dictionaries instead of a path to a file.

read()

Read examples from a list of dictionaries.

Returns: feature_set – FeatureSet representing the list of dictionaries we read in.
Return type: skll.FeatureSet
class skll.data.readers.LibSVMReader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None, logger=None)

Bases: skll.data.readers.Reader

Reader to create a FeatureSet instance from a LibSVM/LibLinear/SVMLight file.

We use a specially formatted comment for storing example IDs, class names, and feature names, which are normally not supported by the format. The comment is not mandatory, but without it, your labels and features will not have names. The comment is structured as follows:

ExampleID | 1=FirstClass | 1=FirstFeature 2=SecondFeature
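For a concrete, made-up illustration of that comment convention (assuming the usual '#' comment marker used by this family of formats), a single labeled line might look roughly like:

```
1 1:4.0 2:1.0 # EXAMPLE_0 | 1=cat | 1=length 2=furry
```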
class skll.data.readers.MegaMReader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None, logger=None)

Bases: skll.data.readers.Reader

Reader to create a FeatureSet instance from a MegaM -fvals file.

If example/instance IDs are included in the files, they must be specified as a comment line directly preceding the line with feature values.
class skll.data.readers.NDJReader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None, logger=None)

Bases: skll.data.readers.Reader

Reader to create a FeatureSet instance from a JSONlines/NDJ file.

If example/instance IDs are included in the files, they must be specified as the “id” key in each JSON dictionary.
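As a rough illustration (values invented; the "y" and "x" keys follow SKLL's .jsonlines convention for the label and the feature dictionary), one line of such a file could look like:

```
{"id": "EXAMPLE_0", "y": "cat", "x": {"length": 4.0, "furry": 1}}
```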
class skll.data.readers.Reader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None, logger=None)

Bases: object

A helper class to make picklable iterators out of example dictionary generators.

Parameters:
- path_or_list (str or list of dict) – Path or a list of example dictionaries.
- quiet (bool, optional) – Do not print “Loading…” status message to stderr. Defaults to True.
- ids_to_floats (bool, optional) – Convert IDs to float to save memory. Will raise an error if we encounter a non-numeric ID. Defaults to False.
- label_col (str, optional) – Name of the column which contains the class labels for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the data is considered to be unlabelled. Defaults to 'y'.
- id_col (str, optional) – Name of the column which contains the instance IDs. If no column with that name exists, or None is specified, example IDs will be automatically generated. Defaults to 'id'.
- class_map (dict, optional) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple labels into a single class. Anything not in the mapping will be kept the same. Defaults to None.
- sparse (bool, optional) – Whether or not to store the features in a scipy CSR matrix when using a DictVectorizer to vectorize the features. Defaults to True.
- feature_hasher (bool, optional) – Whether or not a FeatureHasher should be used to vectorize the features. Defaults to False.
- num_features (int, optional) – If using a FeatureHasher, how many features should the resulting matrix have? You should set this to a power of 2 greater than the actual number of features to avoid collisions. Defaults to None.
- logger (logging.Logger, optional) – A logger instance to use to log messages instead of creating a new one by default. Defaults to None.
classmethod for_path(path_or_list, **kwargs)

Instantiate the appropriate Reader sub-class based on the file extension of the given path, or use a dictionary reader if the input is a list of dictionaries.

Parameters:
- path_or_list (str or list of dicts) – A path or list of example dictionaries.
- kwargs (dict, optional) – The arguments to the Reader object being instantiated.

Returns: reader – A new instance of the Reader sub-class that is appropriate for the given path.
Return type: Reader

Raises: ValueError – If the file does not have a valid extension.
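A minimal sketch of the typical loading pattern (the path and label column are hypothetical); for_path picks CSVReader here from the .csv extension, and read() then produces the FeatureSet:

```python
from skll.data.readers import Reader

# Extra keyword arguments are passed through to the chosen Reader.
reader = Reader.for_path('train/toy_animals.csv', label_col='animal')
train_fs = reader.read()
```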
read()

Loads examples in the .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, or .tsv formats.

Returns: feature_set – FeatureSet instance representing the input file.
Return type: skll.FeatureSet

Raises:
- ValueError – If ids_to_floats is True, but IDs cannot be converted.
- ValueError – If no features are found.
- ValueError – If the example IDs are not unique.
class skll.data.readers.TSVReader(path_or_list, **kwargs)

Bases: skll.data.readers.DelimitedReader

Reader for creating a FeatureSet instance from a TSV file.

If example/instance IDs are included in the files, they must be specified in the id column.

Also, there must be a column with the name specified by label_col if the data is labeled.

Parameters:
- path_or_list (str) – The path to the TSV file.
- kwargs (dict, optional) – Other arguments to the Reader object.
skll.data.readers.safe_float(text, replace_dict=None, logger=None)

Attempts to convert a string to an int, and then a float, but if neither is possible, returns the original string value.

Parameters:
- text (str) – The text to convert.
- replace_dict (dict, optional) – Mapping from text to replacement text values. This is mainly used for collapsing multiple labels into a single class. Replacing happens before conversion to floats. Anything not in the mapping will be kept the same. Defaults to None.
- logger (logging.Logger) – The Logger instance to use to log messages. Used instead of creating a new Logger instance by default. Defaults to None.

Returns: text – The text value converted to int or float, if possible.
Return type: int or float or str
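A quick sketch of the conversion behavior described above (inputs invented):

```python
from skll.data.readers import safe_float

safe_float('3')                               # -> 3 (int)
safe_float('3.5')                             # -> 3.5 (float)
safe_float('cat')                             # -> 'cat' (unchanged string)
safe_float('yes', replace_dict={'yes': '1'})  # -> 1 (replaced, then converted)
```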
data.writers Module

Handles writing data to various types of data files.

author: Dan Blanchard (dblanchard@ets.org)
author: Michael Heilman (mheilman@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
organization: ETS
class skll.data.writers.ARFFWriter(path, feature_set, **kwargs)

Bases: skll.data.writers.DelimitedFileWriter

Writer for writing out FeatureSets as ARFF files.

Parameters:
- path (str) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example, /foo/.arff.
- feature_set (skll.FeatureSet) – The FeatureSet instance to dump to the output file.
- relation (str, optional) – The name of the relation in the ARFF file. Defaults to 'skll_relation'.
- regression (bool, optional) – Is this an ARFF file to be used for regression? Defaults to False.
- kwargs (dict, optional) – The arguments to the Writer object being instantiated.
class skll.data.writers.CSVWriter(path, feature_set, **kwargs)

Bases: skll.data.writers.DelimitedFileWriter

Writer for writing out FeatureSet instances as CSV files.

Parameters:
- path (str) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example, /foo/.csv.
- feature_set (skll.FeatureSet) – The FeatureSet instance to dump to the output file.
- kwargs (dict, optional) – The arguments to the Writer object being instantiated.
class skll.data.writers.DelimitedFileWriter(path, feature_set, **kwargs)

Bases: skll.data.writers.Writer

Writer for writing out FeatureSets as TSV/CSV files.

Parameters:
- path (str) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example, /foo/.csv.
- feature_set (skll.FeatureSet) – The FeatureSet instance to dump to the output file.
- quiet (bool) – Do not print “Writing…” status message to stderr. Defaults to True.
- label_col (str) – Name of the column which contains the class labels for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the data is considered to be unlabelled. Defaults to 'y'.
- id_col (str) – Name of the column which contains the instance IDs. If no column with that name exists, or None is specified, example IDs will be automatically generated. Defaults to 'id'.
- dialect (str) – The dialect to use with the underlying CSV writer.
- logger (logging.Logger) – A logger instance to use to log messages instead of creating a new one by default. Defaults to None.
- kwargs (dict, optional) – The arguments to the Writer object being instantiated.
class skll.data.writers.LibSVMWriter(path, feature_set, **kwargs)

Bases: skll.data.writers.Writer

Writer for writing out FeatureSets as LibSVM/SVMLight files.

Parameters:
- path (str) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example, /foo/.libsvm.
- feature_set (skll.FeatureSet) – The FeatureSet instance to dump to the output file.
- kwargs (dict, optional) – The arguments to the Writer object being instantiated.
class skll.data.writers.MegaMWriter(path, feature_set, **kwargs)

Bases: skll.data.writers.Writer

Writer for writing out FeatureSets as MegaM files.
class skll.data.writers.NDJWriter(path, feature_set, **kwargs)

Bases: skll.data.writers.Writer

Writer for writing out FeatureSets as .jsonlines/.ndj files.

Parameters:
- path (str) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example, /foo/.ndj.
- feature_set (skll.FeatureSet) – The FeatureSet instance to dump to the output file.
- kwargs (dict, optional) – The arguments to the Writer object being instantiated.
class skll.data.writers.TSVWriter(path, feature_set, **kwargs)

Bases: skll.data.writers.DelimitedFileWriter

Writer for writing out FeatureSets as TSV files.

Parameters:
- path (str) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example, /foo/.tsv.
- feature_set (skll.FeatureSet) – The FeatureSet instance to dump to the output file.
- kwargs (dict, optional) – The arguments to the Writer object being instantiated.
class skll.data.writers.Writer(path, feature_set, **kwargs)

Bases: object

Helper class for writing out FeatureSets to files on disk.

Parameters:
- path (str) – A path to the feature file we would like to create. The suffix to this filename must be .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, or .tsv. If subsets is not None, when calling the write() method, path is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example, /foo/.csv.
- feature_set (skll.FeatureSet) – The FeatureSet instance to dump to the file.
- quiet (bool) – Do not print “Writing…” status message to stderr. Defaults to True.
- requires_binary (bool) – Whether or not the Writer must open the file in binary mode for writing with Python 2. Defaults to False.
- subsets (dict (str to list of str)) – A mapping from subset names to lists of feature names that are included in those sets. If given, a feature file will be written for every subset (with the name containing the subset name as suffix to path). Note, since string-valued features are automatically converted into boolean features with names of the form FEATURE_NAME=STRING_VALUE, when doing the filtering, the portion before the = is all that’s used for matching. Therefore, you do not need to enumerate all of these boolean feature names in your mapping. Defaults to None.
- logger (logging.Logger) – A logger instance to use to log messages instead of creating a new one by default. Defaults to None.
classmethod for_path(path, feature_set, **kwargs)

Retrieve an object of the Writer sub-class that is appropriate for the given path.

Parameters:
- path (str) – A path to the feature file we would like to create. The suffix to this filename must be .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, or .tsv. If subsets is not None, when calling the write() method, path is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example, /foo/.csv.
- feature_set (skll.FeatureSet) – The FeatureSet instance to dump to the output file.
- kwargs (dict) – The keyword arguments for for_path are the same as the initializer for the desired Writer subclass.

Returns: writer – New instance of the Writer sub-class that is appropriate for the given path.
Return type: Writer
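Finally, a hedged end-to-end sketch tying the readers and writers together: read a hypothetical CSV file into a FeatureSet, then write it back out as .jsonlines, letting each for_path classmethod choose the concrete sub-class from the file extension.

```python
from skll.data.readers import Reader
from skll.data.writers import Writer

# Hypothetical paths; Reader.for_path picks CSVReader from the .csv suffix.
fs = Reader.for_path('data/toy_animals.csv', label_col='animal').read()

# Writer.for_path picks NDJWriter from the .jsonlines suffix.
Writer.for_path('data/toy_animals.jsonlines', fs).write()
```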