Module pept.tracking.peptml

Source code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-


# File   : __init__.py
# License: License: GNU v3.0
# Author : Andrei Leonard Nicusan <a.l.nicusan@bham.ac.uk>
# Date   : 22.08.2019


from    .peptml     import  Cutpoints
from    .peptml     import  HDBSCANClusterer


__all__ = [
    "Cutpoints",
    "HDBSCANClusterer"
]


__author__ =        "Andrei Leonard Nicusan"
__credits__ =       ["Andrei Leonard Nicusan", "Kit Windows-Yule", "Sam Manger"]
__license__ =       "GNU v3.0"
__version__ =       "0.1"
__maintainer__ =    "Andrei Leonard Nicusan"
__email__ =         "a.l.nicusan@bham.ac.uk"
__status__ =        "Development"

Sub-modules

pept.tracking.peptml.extensions
pept.tracking.peptml.peptml

The peptml package implements a hierarchical density-based clustering algorithm for general Positron Emission Particle Tracking (PEPT) …

Classes

class Cutpoints (line_data, max_distance, cutoffs=None, verbose=True)

A class that transforms LoRs into cutpoints for clustering.

The Cutpoints class transforms LoRs (individual numpy arrays or full LineData) into cutpoints (individual numpy arrays or full PointData) that can then be passed to HDBSCANClusterer.

Under typical usage, the class is instantiated with the LoR data (required to be an instance of LineData) and transforms it into an instance of PointData that stores the found cutpoints.

For more control over the operations, the class also provides a static method (i.e. it can be used without instantiating the class) find_cutpoints_sample that receives a generic numpy array of LoRs (one 'sample') and returns a numpy array of cutpoints.

Parameters

line_data : instance of LineData
The LoRs for which the cutpoints will be computed. It is required to be an instance of LineData.
max_distance : float
The maximum distance between any two lines so that their cutpoint will be considered.
cutoffs : list-like of length 6, optional
A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which fall within these cutoff distances. The default is None, in which case they are automatically computed using get_cutoffs.
verbose : bool, optional
Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is True.

Attributes

line_data : instance of LineData
The LoRs for which the cutpoints will be computed. It is required to be an instance of LineData.
max_distance : float
The maximum distance between any two lines so that their cutpoint will be considered.
cutoffs : list-like of length 6
A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which fall within these cutoff distances.

sample_size, overlap, number_of_lines, etc. : inherited from pept.PointData Extra attributes and methods are inherited from the base class PointData.

Raises

Exception
If line_data is not an instance of LineData.
TypeError
If cutoffs is not a one-dimensional array with values formatted as [min_x, max_x, min_y, max_y, min_z, max_z].

Example usage

Compute the cutpoints for a LineData instance: >>> line_data = pept.LineData(example_data) >>> cutpts = peptml.Cutpoints(line_data, 0.1)

Compute the cutpoints for a single sample: >>> sample = line_data[0] >>> cutpts_sample = peptml.Cutpoints.find_cutpoints_sample(sample) # no class instantiation

Source code
class Cutpoints(pept.PointData):
    '''A class that transforms LoRs into *cutpoints* for clustering.

    The `Cutpoints` class transforms LoRs (individual numpy arrays or full `LineData`)
    into cutpoints (individual numpy arrays or full `PointData`) that can then be passed
    to `HDBSCANClusterer`.

    Under typical usage, the class is instantiated with the LoR data (required to be
    an instance of `LineData`) and transforms it into an instance of `PointData` that
    stores the found cutpoints.

    For more control over the operations, the class also provides a static method (i.e.
    it can be used without instantiating the class) `find_cutpoints_sample` that
    receives a generic numpy array of LoRs (one 'sample') and returns a numpy array
    of cutpoints.

    Parameters
    ----------
    line_data : instance of pept.LineData
        The LoRs for which the cutpoints will be computed. It is required to be an
        instance of `pept.LineData`.
    max_distance : float
        The maximum distance between any two lines so that their cutpoint will be
        considered.
    cutoffs : list-like of length 6, optional
        A list (or equivalent) of the cutoff distances for every axis, formatted as
        [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which
        fall within these cutoff distances. The default is None, in which case they
        are automatically computed using `get_cutoffs`.
    verbose : bool, optional
        Provide extra information when computing the cutpoints: time the operation
        and show a progress bar. The default is `True`.

    Attributes
    ----------
    line_data : instance of pept.LineData
        The LoRs for which the cutpoints will be computed. It is required to be an
        instance of `pept.LineData`.
    max_distance : float
        The maximum distance between any two lines so that their cutpoint will be
        considered.
    cutoffs : list-like of length 6
        A list (or equivalent) of the cutoff distances for every axis, formatted as
        [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which
        fall within these cutoff distances.
    sample_size, overlap, number_of_lines, etc. : inherited from pept.PointData
        Extra attributes and methods are inherited from the base class `PointData`.

    Raises
    ------
    Exception
        If `line_data` is not an instance of `pept.LineData`.
    TypeError
        If `cutoffs` is not a one-dimensional array with values formatted as
        `[min_x, max_x, min_y, max_y, min_z, max_z]`.

    Example usage
    -------------
    Compute the cutpoints for a `LineData` instance:
        >>> line_data = pept.LineData(example_data)
        >>> cutpts = peptml.Cutpoints(line_data, 0.1)

    Compute the cutpoints for a single sample:
        >>> sample = line_data[0]
        >>> cutpts_sample = peptml.Cutpoints.find_cutpoints_sample(sample) # no class instantiation

    '''

    def __init__(self,
                 line_data,
                 max_distance,
                 cutoffs = None,
                 verbose = True):

        # Find the cutpoints when instantiated. The method
        # also initialises the instance as a `PointData` subclass.
        self.find_cutpoints(
            line_data,
            max_distance,
            cutoffs = cutoffs,
            verbose = verbose
        )


    @property
    def line_data(self):
        '''The LoRs for which the cutpoints are computed.

        line_data : instance of pept.LineData

        '''

        return self._line_data


    @line_data.setter
    def line_data(self, new_line_data):
        ''' The LoRs for which the cutpoints are computed.

        Parameters
        ----------
        line_data : instance of pept.LineData
            The LoRs for which the cutpoints will be computed. It is required to be an
            instance of `pept.LineData`.

        Raises
        ------
        Exception
            If `line_data` is not an instance of `pept.LineData`.

        '''

        # Check line_data is an instance (or a subclass!) of pept.LineData
        if not isinstance(line_data, pept.LineData):
            raise Exception('[ERROR]: line_data should be an instance of pept.LineData')

        self._line_data = line_data


    @property
    def max_distance(self):
        '''The maximum distance between any pair of lines for which the cutpoint is considered.

        max_distance : float

        '''
        return self._max_distance


    @max_distance.setter
    def max_distance(self, new_max_distance):
        '''The maximum distance between any pair of lines for which the cutpoint is considered.

        max_distance : float
            The maximum distance between any two lines so that their cutpoint will be
            considered.

        '''
        self._max_distance = new_max_distance


    @property
    def cutoffs(self):
        '''Only consider the cutpoints which fall within these cutoff distances.

        A list (or equivalent) of the cutoff distances for every axis, formatted as
        [x_min, x_max, y_min, y_max, z_min, z_max].

        cutoffs : (6) list or equivalent

        '''

        return self._cutoffs


    @cutoffs.setter
    def cutoffs(self, new_cutoffs):
        '''Only consider the cutpoints which fall within these cutoff distances.

        A list (or equivalent) of the cutoff distances for every axis, formatted as
        [x_min, x_max, y_min, y_max, z_min, z_max].

        Parameters
        ----------
        new_cutoffs : list-like of length 6, optional
            A list (or equivalent) of the cutoff distances for every axis, formatted as
            [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which
            fall within these cutoff distances. The default is None, in which case they
            are automatically computed using `get_cutoffs`.

        Raises
        ------
        TypeError
            If `cutoffs` is not a one-dimensional array with values formatted as
            `[min_x, max_x, min_y, max_y, min_z, max_z]`.

        '''

        cutoffs = np.asarray(new_cutoffs, order = 'C', dtype = float)
        if cutoffs.ndim != 1 or len(cutoffs) != 6:
            raise TypeError('\n[ERROR]: new_cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')

        self._cutoffs = cutoffs


    @staticmethod
    def get_cutoffs(sample):
        '''Compute the cutoffs from a sample of LoR data.

        This is a static method, meaning it can be called without
        instantiating the `Cutpoints` class. It computes the cutoffs
        from the minimum and maximum values of the LoRs in `sample`
        in each dimension.

        Parameters
        ----------
        sample : (N, 7) numpy.ndarray
            A sample of LoRs, where each row is `[time, x1, y1, z1, x2, y2, z2]`,
            such that every line is defined by the points `[x1, y1, z1]` and
            `[x2, y2, z2]`.
        Returns
        -------
        cutoffs : (6) numpy.ndarray
            The computed cutoffs for each dimension, formatted as `[x_min, x_max,
            y_min, y_max, z_min, z_max]`.

        Raises
        ------
        TypeError
            If `sample` is not a numpy array with shape (N, 7).

        '''

        # Check sample has shape (N, 7)
        if sample.ndim != 2 or sample.shape[1] != 7:
            raise TypeError('\n[ERROR]: sample should have dimensions (N, 7). Received {}\n'.format(sample.shape))

        # Compute cutoffs for cutpoints as the (min, max) values of the lines
        # Minimum value of the two points that define a line
        min_x = min(sample[:, 1].min(),
                    sample[:, 4].min())
        # Maximum value of the two points that define a line
        max_x = max(sample[:, 1].max(),
                    sample[:, 4].max())

        # Minimum value of the two points that define a line
        min_y = min(sample[:, 2].min(),
                    sample[:, 5].min())
        # Maximum value of the two points that define a line
        max_y = max(sample[:, 2].max(),
                    sample[:, 5].max())

        # Minimum value of the two points that define a line
        min_z = min(sample[:, 3].min(),
                    sample[:, 6].min())
        # Maximum value of the two points that define a line
        max_z = max(sample[:, 3].max(),
                    sample[:, 6].max())

        cutoffs = np.array([min_x, max_x, min_y, max_y, min_z, max_z])
        return cutoffs


    @staticmethod
    def find_cutpoints_sample(sample, max_distance, cutoffs = None):
        '''Find the cutpoints in a sample of LoRs.

        This is a static method, meaning it can be called without
        instantiating the `Cutpoints` class. It computes the cutpoints
        from a given `sample` that are associated with pairs of lines
        closer than `max_distance`.

        Parameters
        ----------
        sample : (N, 7) numpy.ndarray
            A sample of LoRs, where each row is `[time, x1, y1, z1, x2, y2, z2]`,
            such that every line is defined by the points `[x1, y1, z1]` and
            `[x2, y2, z2]`.
        max_distance : float
            The maximum distance between any pair of lines so that their cutpoint
            will be considered.
        cutoffs : list, optional
            The cutoffs for each dimension, formatted as `[x_min, x_max,
            y_min, y_max, z_min, z_max]`. If not defined, they are computed
            automatically by calling `get_cutoffs`. The default is `None`.

        Returns
        -------
        sample_cutpoints : (N, 4) numpy.ndarray
            The computed cutpoints for the given LoRs, where each row is
            formatted as `[time, x, y, z]` for every cutpoint.

        Raises
        ------
        TypeError
            If `sample` is not a numpy array with shape (N, 7).
        TypeError
            If `cutoffs` is not a `one-dimensional array with values [min_x,
            max_x, min_y, max_y, min_z, max_z]`

        '''

        # Check sample has shape (N, 7)
        if sample.ndim != 2 or sample.shape[1] != 7:
            raise TypeError('\n[ERROR]: sample should have dimensions (N, 7). Received {}\n'.format(sample.shape))

        if cutoffs is None:
            cutoffs = Cutpoints.get_cutoffs(sample)
        else:
            cutoffs = np.asarray(cutoffs, order = 'C', dtype = float)
            if cutoffs.ndim != 1 or len(cutoffs) != 6:
                raise TypeError('\n[ERROR]: cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')

        sample_cutpoints = find_cutpoints_api(sample, max_distance, cutoffs)
        return sample_cutpoints


    def find_cutpoints(self,
                       line_data,
                       max_distance,
                       cutoffs = None,
                       verbose = False):
        '''Find the cutpoints of the samples in a `LineData` instance.

        Parameters
        ----------
        line_data : instance of pept.LineData
            The LoRs for which the cutpoints will be computed. It is required to be an
            instance of `pept.LineData`.
        max_distance : float
            The maximum distance between any two lines so that their cutpoint will be
            considered.
        cutoffs : list-like of length 6, optional
            A list (or equivalent) of the cutoff distances for every axis, formatted as
            [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which
            fall within these cutoff distances. The default is None, in which case they
            are automatically computed using `get_cutoffs`.
        verbose : bool, optional
            Provide extra information when computing the cutpoints: time the operation
            and show a progress bar. The default is `False`.

        Returns
        -------
        self : the PointData instance of cutpoints
            The computed cutpoints are stored in the `Cutpoints` class, as a
            subclass of `pept.PointData`.

        Raises
        ------
        Exception
            If `line_data` is not an instance of `pept.LineData`.
        TypeError
            If `cutoffs` is not a one-dimensional array with values formatted as
            `[min_x, max_x, min_y, max_y, min_z, max_z]`.

        '''

        if verbose:
            start = time.time()

        # Check line_data is an instance (or a subclass!) of pept.LineData
        if not isinstance(line_data, pept.LineData):
            raise Exception('[ERROR]: line_data should be an instance of pept.LineData')

        self._line_data = line_data
        self._max_distance = max_distance

        # If cutoffs were not supplied, compute them
        if cutoffs is None:
            cutoffs = self.get_cutoffs(line_data.line_data)
        # Otherwise make sure they are a C-contiguous numpy array
        else:
            cutoffs = np.asarray(cutoffs, order = 'C', dtype = float)
            if cutoffs.ndim != 1 or len(cutoffs) != 6:
                raise TypeError('\n[ERROR]: cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')

        self._cutoffs = cutoffs

        # Using joblib, collect the cutpoints from every sample in a list
        # of arrays. If verbose, show progress bar using tqdm.
        if verbose:
            cutpoints = Parallel(n_jobs = -1, prefer = 'threads')(delayed(self.find_cutpoints_sample)(sample, max_distance, cutoffs) for sample in tqdm(line_data))
        else:
            cutpoints = Parallel(n_jobs = -1, prefer = 'threads')(delayed(self.find_cutpoints_sample)(sample, max_distance, cutoffs) for sample in line_data)

        # cutpoints shape: (n, m, 4), where n is the number of samples, and
        # m is the number of cutpoints in the sample
        cutpoints = np.array(cutpoints)

        number_of_samples = len(cutpoints)
        cutpoints = np.vstack(np.array(cutpoints))
        number_of_cutpoints = len(cutpoints)

        # Average number of cutpoints per sample
        cutpoints_per_sample = int(number_of_cutpoints / number_of_samples)

        super().__init__(cutpoints,
                         sample_size = cutpoints_per_sample,
                         overlap = 0,
                         verbose = False)

        if verbose:
            end = time.time()
            print("\n\n\nFinding the cutpoints took {} seconds".format(end - start))

        return self

Ancestors

Static methods

def find_cutpoints_sample(sample, max_distance, cutoffs=None)

Find the cutpoints in a sample of LoRs.

This is a static method, meaning it can be called without instantiating the Cutpoints class. It computes the cutpoints from a given sample that are associated with pairs of lines closer than max_distance.

Parameters

sample : (N, 7) numpy.ndarray
A sample of LoRs, where each row is [time, x1, y1, z1, x2, y2, z2], such that every line is defined by the points [x1, y1, z1] and [x2, y2, z2].
max_distance : float
The maximum distance between any pair of lines so that their cutpoint will be considered.
cutoffs : list, optional
The cutoffs for each dimension, formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. If not defined, they are computed automatically by calling get_cutoffs. The default is None.

Returns

sample_cutpoints : (N, 4) numpy.ndarray
The computed cutpoints for the given LoRs, where each row is formatted as [time, x, y, z] for every cutpoint.

Raises

TypeError
If sample is not a numpy array with shape (N, 7).
TypeError
If cutoffs is not a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]
Source code
@staticmethod
def find_cutpoints_sample(sample, max_distance, cutoffs = None):
    '''Find the cutpoints in a sample of LoRs.

    This is a static method, meaning it can be called without
    instantiating the `Cutpoints` class. It computes the cutpoints
    from a given `sample` that are associated with pairs of lines
    closer than `max_distance`.

    Parameters
    ----------
    sample : (N, 7) numpy.ndarray
        A sample of LoRs, where each row is `[time, x1, y1, z1, x2, y2, z2]`,
        such that every line is defined by the points `[x1, y1, z1]` and
        `[x2, y2, z2]`.
    max_distance : float
        The maximum distance between any pair of lines so that their cutpoint
        will be considered.
    cutoffs : list, optional
        The cutoffs for each dimension, formatted as `[x_min, x_max,
        y_min, y_max, z_min, z_max]`. If not defined, they are computed
        automatically by calling `get_cutoffs`. The default is `None`.

    Returns
    -------
    sample_cutpoints : (N, 4) numpy.ndarray
        The computed cutpoints for the given LoRs, where each row is
        formatted as `[time, x, y, z]` for every cutpoint.

    Raises
    ------
    TypeError
        If `sample` is not a numpy array with shape (N, 7).
    TypeError
        If `cutoffs` is not a `one-dimensional array with values [min_x,
        max_x, min_y, max_y, min_z, max_z]`

    '''

    # Check sample has shape (N, 7)
    if sample.ndim != 2 or sample.shape[1] != 7:
        raise TypeError('\n[ERROR]: sample should have dimensions (N, 7). Received {}\n'.format(sample.shape))

    if cutoffs is None:
        cutoffs = Cutpoints.get_cutoffs(sample)
    else:
        cutoffs = np.asarray(cutoffs, order = 'C', dtype = float)
        if cutoffs.ndim != 1 or len(cutoffs) != 6:
            raise TypeError('\n[ERROR]: cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')

    sample_cutpoints = find_cutpoints_api(sample, max_distance, cutoffs)
    return sample_cutpoints
def get_cutoffs(sample)

Compute the cutoffs from a sample of LoR data.

This is a static method, meaning it can be called without instantiating the Cutpoints class. It computes the cutoffs from the minimum and maximum values of the LoRs in sample in each dimension.

Parameters

sample : (N, 7) numpy.ndarray
A sample of LoRs, where each row is [time, x1, y1, z1, x2, y2, z2], such that every line is defined by the points [x1, y1, z1] and [x2, y2, z2].

Returns

cutoffs : (6) numpy.ndarray
The computed cutoffs for each dimension, formatted as [x_min, x_max, y_min, y_max, z_min, z_max].

Raises

TypeError
If sample is not a numpy array with shape (N, 7).
Source code
@staticmethod
def get_cutoffs(sample):
    '''Compute the cutoffs from a sample of LoR data.

    This is a static method, meaning it can be called without
    instantiating the `Cutpoints` class. It computes the cutoffs
    from the minimum and maximum values of the LoRs in `sample`
    in each dimension.

    Parameters
    ----------
    sample : (N, 7) numpy.ndarray
        A sample of LoRs, where each row is `[time, x1, y1, z1, x2, y2, z2]`,
        such that every line is defined by the points `[x1, y1, z1]` and
        `[x2, y2, z2]`.
    Returns
    -------
    cutoffs : (6) numpy.ndarray
        The computed cutoffs for each dimension, formatted as `[x_min, x_max,
        y_min, y_max, z_min, z_max]`.

    Raises
    ------
    TypeError
        If `sample` is not a numpy array with shape (N, 7).

    '''

    # Check sample has shape (N, 7)
    if sample.ndim != 2 or sample.shape[1] != 7:
        raise TypeError('\n[ERROR]: sample should have dimensions (N, 7). Received {}\n'.format(sample.shape))

    # Compute cutoffs for cutpoints as the (min, max) values of the lines
    # Minimum value of the two points that define a line
    min_x = min(sample[:, 1].min(),
                sample[:, 4].min())
    # Maximum value of the two points that define a line
    max_x = max(sample[:, 1].max(),
                sample[:, 4].max())

    # Minimum value of the two points that define a line
    min_y = min(sample[:, 2].min(),
                sample[:, 5].min())
    # Maximum value of the two points that define a line
    max_y = max(sample[:, 2].max(),
                sample[:, 5].max())

    # Minimum value of the two points that define a line
    min_z = min(sample[:, 3].min(),
                sample[:, 6].min())
    # Maximum value of the two points that define a line
    max_z = max(sample[:, 3].max(),
                sample[:, 6].max())

    cutoffs = np.array([min_x, max_x, min_y, max_y, min_z, max_z])
    return cutoffs

Instance variables

var cutoffs

Only consider the cutpoints which fall within these cutoff distances.

A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max].

cutoffs : (6) list or equivalent

Source code
@property
def cutoffs(self):
    '''Only consider the cutpoints which fall within these cutoff distances.

    A list (or equivalent) of the cutoff distances for every axis, formatted as
    [x_min, x_max, y_min, y_max, z_min, z_max].

    cutoffs : (6) list or equivalent

    '''

    return self._cutoffs
var line_data

The LoRs for which the cutpoints are computed.

line_data : instance of pept.LineData

Source code
@property
def line_data(self):
    '''The LoRs for which the cutpoints are computed.

    line_data : instance of pept.LineData

    '''

    return self._line_data
var max_distance

The maximum distance between any pair of lines for which the cutpoint is considered.

max_distance : float

Source code
@property
def max_distance(self):
    '''The maximum distance between any pair of lines for which the cutpoint is considered.

    max_distance : float

    '''
    return self._max_distance

Methods

def find_cutpoints(self, line_data, max_distance, cutoffs=None, verbose=False)

Find the cutpoints of the samples in a LineData instance.

Parameters

line_data : instance of LineData
The LoRs for which the cutpoints will be computed. It is required to be an instance of LineData.
max_distance : float
The maximum distance between any two lines so that their cutpoint will be considered.
cutoffs : list-like of length 6, optional
A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which fall within these cutoff distances. The default is None, in which case they are automatically computed using get_cutoffs.
verbose : bool, optional
Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is False.

Returns

self : the PointData instance of cutpoints
The computed cutpoints are stored in the Cutpoints class, as a subclass of PointData.

Raises

Exception
If line_data is not an instance of LineData.
TypeError
If cutoffs is not a one-dimensional array with values formatted as [min_x, max_x, min_y, max_y, min_z, max_z].
Source code
def find_cutpoints(self,
                   line_data,
                   max_distance,
                   cutoffs = None,
                   verbose = False):
    '''Find the cutpoints of the samples in a `LineData` instance.

    Parameters
    ----------
    line_data : instance of pept.LineData
        The LoRs for which the cutpoints will be computed. It is required to be an
        instance of `pept.LineData`.
    max_distance : float
        The maximum distance between any two lines so that their cutpoint will be
        considered.
    cutoffs : list-like of length 6, optional
        A list (or equivalent) of the cutoff distances for every axis, formatted as
        [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which
        fall within these cutoff distances. The default is None, in which case they
        are automatically computed using `get_cutoffs`.
    verbose : bool, optional
        Provide extra information when computing the cutpoints: time the operation
        and show a progress bar. The default is `False`.

    Returns
    -------
    self : the PointData instance of cutpoints
        The computed cutpoints are stored in the `Cutpoints` class, as a
        subclass of `pept.PointData`.

    Raises
    ------
    Exception
        If `line_data` is not an instance of `pept.LineData`.
    TypeError
        If `cutoffs` is not a one-dimensional array with values formatted as
        `[min_x, max_x, min_y, max_y, min_z, max_z]`.

    '''

    if verbose:
        start = time.time()

    # Check line_data is an instance (or a subclass!) of pept.LineData
    if not isinstance(line_data, pept.LineData):
        raise Exception('[ERROR]: line_data should be an instance of pept.LineData')

    self._line_data = line_data
    self._max_distance = max_distance

    # If cutoffs were not supplied, compute them
    if cutoffs is None:
        cutoffs = self.get_cutoffs(line_data.line_data)
    # Otherwise make sure they are a C-contiguous numpy array
    else:
        cutoffs = np.asarray(cutoffs, order = 'C', dtype = float)
        if cutoffs.ndim != 1 or len(cutoffs) != 6:
            raise TypeError('\n[ERROR]: cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')

    self._cutoffs = cutoffs

    # Using joblib, collect the cutpoints from every sample in a list
    # of arrays. If verbose, show progress bar using tqdm.
    if verbose:
        cutpoints = Parallel(n_jobs = -1, prefer = 'threads')(delayed(self.find_cutpoints_sample)(sample, max_distance, cutoffs) for sample in tqdm(line_data))
    else:
        cutpoints = Parallel(n_jobs = -1, prefer = 'threads')(delayed(self.find_cutpoints_sample)(sample, max_distance, cutoffs) for sample in line_data)

    # cutpoints shape: (n, m, 4), where n is the number of samples, and
    # m is the number of cutpoints in the sample
    cutpoints = np.array(cutpoints)

    number_of_samples = len(cutpoints)
    cutpoints = np.vstack(np.array(cutpoints))
    number_of_cutpoints = len(cutpoints)

    # Average number of cutpoints per sample
    cutpoints_per_sample = int(number_of_cutpoints / number_of_samples)

    super().__init__(cutpoints,
                     sample_size = cutpoints_per_sample,
                     overlap = 0,
                     verbose = False)

    if verbose:
        end = time.time()
        print("\n\n\nFinding the cutpoints took {} seconds".format(end - start))

    return self

Inherited members

class HDBSCANClusterer (min_cluster_size=20, min_samples=None, allow_single_cluster=False)

HDBSCAN-based clustering for cutpoints from LoRs.

This class is a wrapper around the hdbscan package, providing tools for parallel clustering of samples of cutpoints. It can return PointData classes which can be easily manipulated or visualised.

Parameters

min_cluster_size : int, optional
    (Taken from hdbscan's documentation): The minimum size of clusters;
    single linkage splits that contain fewer points than this will be
    considered points “falling out” of a cluster rather than a cluster
    splitting into two new clusters. The default is 20.
min_samples : int, optional
    (Taken from hdbscan's documentation): The number of samples in a
    neighbourhood for a point to be considered a core point. The default
    is None, being set automatically to the `min_cluster_size`.
allow_single_cluster : bool, optional
    (Taken from hdbscan's documentation): By default HDBSCAN* will not
    produce a single cluster, setting this to True will override this and
    allow single cluster results in the case that you feel this is a valid
    result for your dataset. For PEPT, set this to True if you only have
    one tracer in the dataset. Otherwise, leave it to False, as it will
    provide higher accuracy.

Attributes

min_cluster_size : int
    (Taken from hdbscan's documentation): The minimum size of clusters;
    single linkage splits that contain fewer points than this will be
    considered points “falling out” of a cluster rather than a cluster
    splitting into two new clusters. The default is 20.
min_samples : int
    (Taken from hdbscan's documentation): The number of samples in a
    neighbourhood for a point to be considered a core point. The default
    is None, being set automatically to the `min_cluster_size`.
allow_single_cluster : bool
    (Taken from hdbscan's documentation): By default HDBSCAN* will not
    produce a single cluster, setting this to True will override this and
    allow single cluster results in the case that you feel this is a valid
    result for your dataset. For PEPT, set this to True if you only have
    one tracer in the dataset. Otherwise, leave it to False, as it will
    provide higher accuracy.
Source code
class HDBSCANClusterer:
    '''HDBSCAN-based clustering for cutpoints from LoRs.

    This class is a wrapper around the `hdbscan` package, providing tools for
    parallel clustering of samples of cutpoints. It can return `PointData`
    classes which can be easily manipulated or visualised.

    Parameters
    ----------
        min_cluster_size : int, optional
            (Taken from hdbscan's documentation): The minimum size of clusters;
            single linkage splits that contain fewer points than this will be
            considered points “falling out” of a cluster rather than a cluster
            splitting into two new clusters. The default is 20.
        min_samples : int, optional
            (Taken from hdbscan's documentation): The number of samples in a
            neighbourhood for a point to be considered a core point. The default
            is None, being set automatically to the `min_cluster_size`.
        allow_single_cluster : bool, optional
            (Taken from hdbscan's documentation): By default HDBSCAN* will not
            produce a single cluster, setting this to True will override this and
            allow single cluster results in the case that you feel this is a valid
            result for your dataset. For PEPT, set this to True if you only have
            one tracer in the dataset. Otherwise, leave it to False, as it will
            provide higher accuracy.

    Attributes
    ----------
        min_cluster_size : int
            (Taken from hdbscan's documentation): The minimum size of clusters;
            single linkage splits that contain fewer points than this will be
            considered points “falling out” of a cluster rather than a cluster
            splitting into two new clusters. The default is 20.
        min_samples : int
            (Taken from hdbscan's documentation): The number of samples in a
            neighbourhood for a point to be considered a core point. The default
            is None, being set automatically to the `min_cluster_size`.
        allow_single_cluster : bool
            (Taken from hdbscan's documentation): By default HDBSCAN* will not
            produce a single cluster, setting this to True will override this and
            allow single cluster results in the case that you feel this is a valid
            result for your dataset. For PEPT, set this to True if you only have
            one tracer in the dataset. Otherwise, leave it to False, as it will
            provide higher accuracy.

    '''

    def __init__(self,
                 min_cluster_size = 20,
                 min_samples = None,
                 allow_single_cluster = False):

        if 0 < min_cluster_size < 2:
            print("\n[WARNING]: min_cluster_size was set to 2, as it was {} < 2\n".format(min_cluster_size))
            min_cluster_size = 2

        self.clusterer = hdbscan.HDBSCAN(min_cluster_size = min_cluster_size,
                                         min_samples = min_samples,
                                         core_dist_n_jobs = -1,
                                         allow_single_cluster = allow_single_cluster)


    @property
    def min_cluster_size(self):
        return self.clusterer.min_cluster_size


    @min_cluster_size.setter
    def min_cluster_size(self, new_min_cluster_size):
        self.clusterer.min_cluster_size = new_min_cluster_size


    @property
    def min_samples(self):
        return self.clusterer.min_cluster_size


    @min_samples.setter
    def min_samples(self, new_min_samples):
        self.clusterer.min_samples = new_min_samples


    @property
    def allow_single_cluster(self):
        return self.clusterer.allow_single_cluster


    @allow_single_cluster.setter
    def allow_single_cluster(self, option):
        self.clusterer.allow_single_cluster = option


    def fit_sample(self,
                   sample,
                   store_labels = False,
                   noise = False,
                   as_array = True,
                   verbose = False):
        '''Fit one sample of cutpoints and return the cluster centres and
        (optionally) the labelled cutpoints.

        Parameters
        ----------
        sample : (N, M >= 4) numpy.ndarray
            The sample of points that will be clustered. Every point corresponds to
            a row and is formatted as `[time, x, y, z, etc]`. Only columns `[1, 2, 3]`
            are used for clustering.
        store_labels : bool, optional
            If set to True, the clustered cutpoints are returned along with the centres
            of the clusters. Setting it to False speeds up the clustering. The default
            is False.
        noise : bool, optional
            If set to True, the clustered cutpoints also include the points classified
            as noise. Only has an effect if `store_labels` is set to True. The default
            is False.
        as_array : bool, optional
            If set to True, the centres of the clusters and the clustered cutpoints are
            returned as numpy arrays. If set to False, they are returned inside
            instances of `pept.PointData`.
        verbose : bool, optional
            Provide extra information when computing the cutpoints: time the operation
            and show a progress bar. The default is `False`.

        Returns
        -------
        centres : numpy.ndarray or pept.PointData
            The centroids of every cluster found. They are computed as the average
            of every column of `[time, x, y, z, etc]` of the clustered points. Another
            column is added to the initial data in `sample`, signifying the cluster
            size - the number of points included in the cluster. If `as_array` is
            set to True, it is a numpy array, otherwise the centres are stored
            in a pept.PointData instance.
        clustered_cutpoints : numpy.ndarray or pept.PointData
            The points in `sample` that fall in every cluster. A new column is added
            to the points in `sample` that signifies the label of cluster that the
            point was associated with: all points in cluster number 3 will have the
            number 3 as the last element in their row. The points classified as noise
            have the number -1 associated. If `as_array` is set to True, it is a numpy
            array, otherwise the clustered cutpoints are stored in a pept.PointData
            instance.

        Raises
        ------
        TypeError
            If `sample` is not a numpy array of shape (N, M), where M >= 4.

        '''

        if verbose:
            start = time.time()

        # sample row: [time, x, y, z]
        if sample.ndim != 2 or sample.shape[1] < 4:
            raise TypeError('\n[ERROR]: sample should have two dimensions (M, N), where N >= 4. Received {}\n'.format(sample.shape))

        # Only cluster based on [x, y, z]
        labels = self.clusterer.fit_predict(sample[:, 1:4])
        max_label = labels.max()

        centres = []
        clustered_cutpoints = []

        # the centre of a cluster is the average of the time, x, y, z columns
        # and the number of points of that cluster
        # centres row: [time, x, y, z, ..etc.., cluster_size]
        for i in range(0, max_label + 1):
            # Average time, x, y, z of cluster of label i
            centres_row = np.mean(sample[labels == i], axis = 0)
            # Append the number of points of label i => cluster_size
            centres_row = np.append(centres_row, (labels == i).sum())
            centres.append(centres_row)

        centres = np.array(centres)

        if not as_array:
            centres = pept.PointData(centres,
                                     sample_size = 0,
                                     overlap = 0,
                                     verbose = False)

        # Return all cutpoints as a list of numpy arrays for every label
        # where the last column of an array is the label
        if store_labels:
            # Create a list of numpy arrays with rows: [t, x, y, z, ..etc.., label]
            if noise:
                cutpoints = sample[labels == -1]
                cutpoints = np.insert(cutpoints, cutpoints.shape[1], -1, axis = 1)
                clustered_cutpoints.append(cutpoints)

            for i in range(0, max_label + 1):
                cutpoints = sample[labels == i]
                cutpoints = np.insert(cutpoints, cutpoints.shape[1], i, axis = 1)
                clustered_cutpoints.append(cutpoints)

            clustered_cutpoints = np.vstack(np.array(clustered_cutpoints))

            if not as_array:
                clustered_cutpoints = pept.PointData(clustered_cutpoints,
                                                     sample_size = 0,
                                                     overlap = 0,
                                                     verbose = False)

        if verbose:
            end = time.time()
            print("Fitting one sample took {} seconds".format(end - start))

        return [centres, clustered_cutpoints]


    def fit_cutpoints(self,
                      cutpoints,
                      store_labels = False,
                      noise = False,
                      verbose = True):
        '''Fit cutpoints (an instance of `PointData`) and return the cluster
        centres and (optionally) the labelled cutpoints.

        Parameters
        ----------
        cutpoints : an instance of `pept.PointData`
            The samples of points that will be clustered. In every sample, every point
            corresponds to a row and is formatted as `[time, x, y, z, etc]`. Only
            columns `[1, 2, 3]` are used for clustering.
        store_labels : bool, optional
            If set to True, the clustered cutpoints are returned along with the centres
            of the clusters. Setting it to False speeds up the clustering. The default
            is False.
        noise : bool, optional
            If set to True, the clustered cutpoints also include the points classified
            as noise. Only has an effect if `store_labels` is set to True. The default
            is False.
        verbose : bool, optional
            Provide extra information when computing the cutpoints: time the operation
            and show a progress bar. The default is `False`.

        Returns
        -------
        centres : pept.PointData
            The centroids of every cluster found. They are computed as the average
            of every column of `[time, x, y, z, etc]` of the clustered points. Another
            column is added to the initial data in `sample`, signifying the cluster
            size - the number of points included in the cluster.
        clustered_cutpoints : numpy.ndarray or pept.PointData
            The points in `sample` that fall in every cluster. A new column is added
            to the points in `sample` that signifies the label of cluster that the
            point was associated with: all points in cluster number 3 will have the
            number 3 as the last element in their row. The points classified as noise
            have the number -1 associated.

        Raises
        ------
        Exception
            If `cutpoints` is not an instance (or a subclass) of `pept.PointData`.

        '''

        if verbose:
            start = time.time()

        if not isinstance(cutpoints, pept.PointData):
            raise Exception('[ERROR]: cutpoints should be an instance of pept.PointData (or any class inheriting from it)')

        # Fit all samples in `cutpoints` in parallel using joblib
        # Collect all outputs as a list. If verbose, show progress bar with
        # tqdm
        if verbose:
            data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample,
                                                store_labels = store_labels,
                                                noise = noise,
                                                as_array = True) for sample in tqdm(cutpoints))
        else:
            data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample,
                                                store_labels = store_labels,
                                                noise = noise,
                                                as_array = True) for sample in cutpoints)

        # Access joblib.Parallel output as list comprehensions
        centres = np.array([row[0] for row in data_list if len(row[0]) != 0])
        if len(centres) != 0:
            centres = pept.PointData(np.vstack(centres),
                                     sample_size = 0,
                                     overlap = 0,
                                     verbose = False)

        if store_labels:
            clustered_cutpoints = np.array([row[1] for row in data_list if len(row[1]) != 0])
            clustered_cutpoints = pept.PointData(np.vstack(np.array(clustered_cutpoints)),
                                                 sample_size = 0,
                                                 overlap = 0,
                                                 verbose = False)

        if verbose:
            end = time.time()
            print("Fitting cutpoints took {} seconds".format(end - start))

        if store_labels:
            return [centres, clustered_cutpoints]
        else:
            return [centres, []]

Instance variables

var allow_single_cluster
Source code
@property
def allow_single_cluster(self):
    return self.clusterer.allow_single_cluster
var min_cluster_size
Source code
@property
def min_cluster_size(self):
    return self.clusterer.min_cluster_size
var min_samples
Source code
@property
def min_samples(self):
    return self.clusterer.min_cluster_size

Methods

def fit_cutpoints(self, cutpoints, store_labels=False, noise=False, verbose=True)

Fit cutpoints (an instance of PointData) and return the cluster centres and (optionally) the labelled cutpoints.

Parameters

cutpoints : an instance of <a title="pept.PointData" href="../../index.html#pept.PointData">PointData</a>
The samples of points that will be clustered. In every sample, every point corresponds to a row and is formatted as [time, x, y, z, etc]. Only columns [1, 2, 3] are used for clustering.
store_labels : bool, optional
If set to True, the clustered cutpoints are returned along with the centres of the clusters. Setting it to False speeds up the clustering. The default is False.
noise : bool, optional
If set to True, the clustered cutpoints also include the points classified as noise. Only has an effect if store_labels is set to True. The default is False.
verbose : bool, optional
Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is False.

Returns

centres : pept.PointData
The centroids of every cluster found. They are computed as the average of every column of [time, x, y, z, etc] of the clustered points. Another column is added to the initial data in sample, signifying the cluster size - the number of points included in the cluster.
clustered_cutpoints : numpy.ndarray or pept.PointData
The points in sample that fall in every cluster. A new column is added to the points in sample that signifies the label of cluster that the point was associated with: all points in cluster number 3 will have the number 3 as the last element in their row. The points classified as noise have the number -1 associated.

Raises

Exception
If cutpoints is not an instance (or a subclass) of pept.PointData.
Source code
def fit_cutpoints(self,
                  cutpoints,
                  store_labels = False,
                  noise = False,
                  verbose = True):
    '''Fit cutpoints (an instance of `PointData`) and return the cluster
    centres and (optionally) the labelled cutpoints.

    Parameters
    ----------
    cutpoints : an instance of `pept.PointData`
        The samples of points that will be clustered. In every sample, every point
        corresponds to a row and is formatted as `[time, x, y, z, etc]`. Only
        columns `[1, 2, 3]` are used for clustering.
    store_labels : bool, optional
        If set to True, the clustered cutpoints are returned along with the centres
        of the clusters. Setting it to False speeds up the clustering. The default
        is False.
    noise : bool, optional
        If set to True, the clustered cutpoints also include the points classified
        as noise. Only has an effect if `store_labels` is set to True. The default
        is False.
    verbose : bool, optional
        Provide extra information when computing the cutpoints: time the operation
        and show a progress bar. The default is `False`.

    Returns
    -------
    centres : pept.PointData
        The centroids of every cluster found. They are computed as the average
        of every column of `[time, x, y, z, etc]` of the clustered points. Another
        column is added to the initial data in `sample`, signifying the cluster
        size - the number of points included in the cluster.
    clustered_cutpoints : numpy.ndarray or pept.PointData
        The points in `sample` that fall in every cluster. A new column is added
        to the points in `sample` that signifies the label of cluster that the
        point was associated with: all points in cluster number 3 will have the
        number 3 as the last element in their row. The points classified as noise
        have the number -1 associated.

    Raises
    ------
    Exception
        If `cutpoints` is not an instance (or a subclass) of `pept.PointData`.

    '''

    if verbose:
        start = time.time()

    if not isinstance(cutpoints, pept.PointData):
        raise Exception('[ERROR]: cutpoints should be an instance of pept.PointData (or any class inheriting from it)')

    # Fit all samples in `cutpoints` in parallel using joblib
    # Collect all outputs as a list. If verbose, show progress bar with
    # tqdm
    if verbose:
        data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample,
                                            store_labels = store_labels,
                                            noise = noise,
                                            as_array = True) for sample in tqdm(cutpoints))
    else:
        data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample,
                                            store_labels = store_labels,
                                            noise = noise,
                                            as_array = True) for sample in cutpoints)

    # Access joblib.Parallel output as list comprehensions
    centres = np.array([row[0] for row in data_list if len(row[0]) != 0])
    if len(centres) != 0:
        centres = pept.PointData(np.vstack(centres),
                                 sample_size = 0,
                                 overlap = 0,
                                 verbose = False)

    if store_labels:
        clustered_cutpoints = np.array([row[1] for row in data_list if len(row[1]) != 0])
        clustered_cutpoints = pept.PointData(np.vstack(np.array(clustered_cutpoints)),
                                             sample_size = 0,
                                             overlap = 0,
                                             verbose = False)

    if verbose:
        end = time.time()
        print("Fitting cutpoints took {} seconds".format(end - start))

    if store_labels:
        return [centres, clustered_cutpoints]
    else:
        return [centres, []]
def fit_sample(self, sample, store_labels=False, noise=False, as_array=True, verbose=False)

Fit one sample of cutpoints and return the cluster centres and (optionally) the labelled cutpoints.

Parameters

sample : (N, M >= 4) numpy.ndarray
The sample of points that will be clustered. Every point corresponds to a row and is formatted as [time, x, y, z, etc]. Only columns [1, 2, 3] are used for clustering.
store_labels : bool, optional
If set to True, the clustered cutpoints are returned along with the centres of the clusters. Setting it to False speeds up the clustering. The default is False.
noise : bool, optional
If set to True, the clustered cutpoints also include the points classified as noise. Only has an effect if store_labels is set to True. The default is False.
as_array : bool, optional
If set to True, the centres of the clusters and the clustered cutpoints are returned as numpy arrays. If set to False, they are returned inside instances of PointData.
verbose : bool, optional
Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is False.

Returns

centres : numpy.ndarray or PointData
The centroids of every cluster found. They are computed as the average of every column of [time, x, y, z, etc] of the clustered points. Another column is added to the initial data in sample, signifying the cluster size - the number of points included in the cluster. If as_array is set to True, it is a numpy array, otherwise the centres are stored in a pept.PointData instance.
clustered_cutpoints : numpy.ndarray or PointData
The points in sample that fall in every cluster. A new column is added to the points in sample that signifies the label of cluster that the point was associated with: all points in cluster number 3 will have the number 3 as the last element in their row. The points classified as noise have the number -1 associated. If as_array is set to True, it is a numpy array, otherwise the clustered cutpoints are stored in a pept.PointData instance.

Raises

TypeError
If sample is not a numpy array of shape (N, M), where M >= 4.
Source code
def fit_sample(self,
               sample,
               store_labels = False,
               noise = False,
               as_array = True,
               verbose = False):
    '''Fit one sample of cutpoints and return the cluster centres and
    (optionally) the labelled cutpoints.

    Parameters
    ----------
    sample : (N, M >= 4) numpy.ndarray
        The sample of points that will be clustered. Every point corresponds to
        a row and is formatted as `[time, x, y, z, etc]`. Only columns `[1, 2, 3]`
        are used for clustering.
    store_labels : bool, optional
        If set to True, the clustered cutpoints are returned along with the centres
        of the clusters. Setting it to False speeds up the clustering. The default
        is False.
    noise : bool, optional
        If set to True, the clustered cutpoints also include the points classified
        as noise. Only has an effect if `store_labels` is set to True. The default
        is False.
    as_array : bool, optional
        If set to True, the centres of the clusters and the clustered cutpoints are
        returned as numpy arrays. If set to False, they are returned inside
        instances of `pept.PointData`.
    verbose : bool, optional
        Provide extra information when computing the cutpoints: time the operation
        and show a progress bar. The default is `False`.

    Returns
    -------
    centres : numpy.ndarray or pept.PointData
        The centroids of every cluster found. They are computed as the average
        of every column of `[time, x, y, z, etc]` of the clustered points. Another
        column is added to the initial data in `sample`, signifying the cluster
        size - the number of points included in the cluster. If `as_array` is
        set to True, it is a numpy array, otherwise the centres are stored
        in a pept.PointData instance.
    clustered_cutpoints : numpy.ndarray or pept.PointData
        The points in `sample` that fall in every cluster. A new column is added
        to the points in `sample` that signifies the label of cluster that the
        point was associated with: all points in cluster number 3 will have the
        number 3 as the last element in their row. The points classified as noise
        have the number -1 associated. If `as_array` is set to True, it is a numpy
        array, otherwise the clustered cutpoints are stored in a pept.PointData
        instance.

    Raises
    ------
    TypeError
        If `sample` is not a numpy array of shape (N, M), where M >= 4.

    '''

    if verbose:
        start = time.time()

    # sample row: [time, x, y, z]
    if sample.ndim != 2 or sample.shape[1] < 4:
        raise TypeError('\n[ERROR]: sample should have two dimensions (M, N), where N >= 4. Received {}\n'.format(sample.shape))

    # Only cluster based on [x, y, z]
    labels = self.clusterer.fit_predict(sample[:, 1:4])
    max_label = labels.max()

    centres = []
    clustered_cutpoints = []

    # the centre of a cluster is the average of the time, x, y, z columns
    # and the number of points of that cluster
    # centres row: [time, x, y, z, ..etc.., cluster_size]
    for i in range(0, max_label + 1):
        # Average time, x, y, z of cluster of label i
        centres_row = np.mean(sample[labels == i], axis = 0)
        # Append the number of points of label i => cluster_size
        centres_row = np.append(centres_row, (labels == i).sum())
        centres.append(centres_row)

    centres = np.array(centres)

    if not as_array:
        centres = pept.PointData(centres,
                                 sample_size = 0,
                                 overlap = 0,
                                 verbose = False)

    # Return all cutpoints as a list of numpy arrays for every label
    # where the last column of an array is the label
    if store_labels:
        # Create a list of numpy arrays with rows: [t, x, y, z, ..etc.., label]
        if noise:
            cutpoints = sample[labels == -1]
            cutpoints = np.insert(cutpoints, cutpoints.shape[1], -1, axis = 1)
            clustered_cutpoints.append(cutpoints)

        for i in range(0, max_label + 1):
            cutpoints = sample[labels == i]
            cutpoints = np.insert(cutpoints, cutpoints.shape[1], i, axis = 1)
            clustered_cutpoints.append(cutpoints)

        clustered_cutpoints = np.vstack(np.array(clustered_cutpoints))

        if not as_array:
            clustered_cutpoints = pept.PointData(clustered_cutpoints,
                                                 sample_size = 0,
                                                 overlap = 0,
                                                 verbose = False)

    if verbose:
        end = time.time()
        print("Fitting one sample took {} seconds".format(end - start))

    return [centres, clustered_cutpoints]