Module pept.tracking.peptml
Source code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# File : __init__.py
# License: GNU v3.0
# Author : Andrei Leonard Nicusan <a.l.nicusan@bham.ac.uk>
# Date : 22.08.2019
from .peptml import Cutpoints
from .peptml import HDBSCANClusterer
__all__ = [
"Cutpoints",
"HDBSCANClusterer"
]
__author__ = "Andrei Leonard Nicusan"
__credits__ = ["Andrei Leonard Nicusan", "Kit Windows-Yule", "Sam Manger"]
__license__ = "GNU v3.0"
__version__ = "0.1"
__maintainer__ = "Andrei Leonard Nicusan"
__email__ = "a.l.nicusan@bham.ac.uk"
__status__ = "Development"
Sub-modules
pept.tracking.peptml.extensions
pept.tracking.peptml.peptml
-
The peptml package implements a hierarchical density-based clustering algorithm for general Positron Emission Particle Tracking (PEPT) …
Classes
class Cutpoints (line_data, max_distance, cutoffs=None, verbose=True)
-
A class that transforms LoRs into cutpoints for clustering.
The `Cutpoints` class transforms LoRs (individual numpy arrays or full `LineData`) into cutpoints (individual numpy arrays or full `PointData`) that can then be passed to `HDBSCANClusterer`.

Under typical usage, the class is instantiated with the LoR data (required to be an instance of `LineData`) and transforms it into an instance of `PointData` that stores the found cutpoints.

For more control over the operations, the class also provides a static method (i.e. it can be used without instantiating the class) `find_cutpoints_sample` that receives a generic numpy array of LoRs (one 'sample') and returns a numpy array of cutpoints.

Parameters

line_data : instance of pept.LineData
    The LoRs for which the cutpoints will be computed. It is required to be an instance of `pept.LineData`.
max_distance : float
    The maximum distance between any two lines so that their cutpoint will be considered.
cutoffs : list-like of length 6, optional
    A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which fall within these cutoff distances. The default is None, in which case they are automatically computed using `get_cutoffs`.
verbose : bool, optional
    Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is `True`.
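The `cutoffs` validation described above (coerce to a C-contiguous float array, require exactly six values) can be sketched on its own. This is a minimal sketch of the check the class performs; `check_cutoffs` is a hypothetical helper name, not part of the package:

```python
import numpy as np

def check_cutoffs(cutoffs):
    """Coerce cutoffs to a C-contiguous float array and verify it is a
    one-dimensional array of six values, mirroring the class's check."""
    cutoffs = np.asarray(cutoffs, order='C', dtype=float)
    if cutoffs.ndim != 1 or len(cutoffs) != 6:
        raise TypeError(
            "cutoffs should be a one-dimensional array with values "
            "[min_x, max_x, min_y, max_y, min_z, max_z]"
        )
    return cutoffs
```

Any six-element list-like passes; anything else raises `TypeError` before the compiled cutpoint search runs.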
Attributes
line_data : instance of pept.LineData
    The LoRs for which the cutpoints will be computed. It is required to be an instance of `pept.LineData`.
max_distance : float
    The maximum distance between any two lines so that their cutpoint will be considered.
cutoffs : list-like of length 6
    A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which fall within these cutoff distances.
sample_size, overlap, number_of_lines, etc. : inherited from pept.PointData
    Extra attributes and methods are inherited from the base class `PointData`.

Raises

Exception
    If `line_data` is not an instance of `pept.LineData`.
TypeError
    If `cutoffs` is not a one-dimensional array with values formatted as `[min_x, max_x, min_y, max_y, min_z, max_z]`.
Example usage
Compute the cutpoints for a `LineData` instance:

>>> line_data = pept.LineData(example_data)
>>> cutpts = peptml.Cutpoints(line_data, 0.1)

Compute the cutpoints for a single sample:

>>> sample = line_data[0]
>>> cutpts_sample = peptml.Cutpoints.find_cutpoints_sample(sample, 0.1)  # no class instantiation
Source code
class Cutpoints(pept.PointData):
    '''A class that transforms LoRs into *cutpoints* for clustering.

    The `Cutpoints` class transforms LoRs (individual numpy arrays or full
    `LineData`) into cutpoints (individual numpy arrays or full `PointData`)
    that can then be passed to `HDBSCANClusterer`.

    Under typical usage, the class is instantiated with the LoR data
    (required to be an instance of `LineData`) and transforms it into an
    instance of `PointData` that stores the found cutpoints.

    For more control over the operations, the class also provides a static
    method (i.e. it can be used without instantiating the class)
    `find_cutpoints_sample` that receives a generic numpy array of LoRs
    (one 'sample') and returns a numpy array of cutpoints.

    Parameters
    ----------
    line_data : instance of pept.LineData
        The LoRs for which the cutpoints will be computed. It is required
        to be an instance of `pept.LineData`.
    max_distance : float
        The maximum distance between any two lines so that their cutpoint
        will be considered.
    cutoffs : list-like of length 6, optional
        A list (or equivalent) of the cutoff distances for every axis,
        formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only
        consider the cutpoints which fall within these cutoff distances.
        The default is None, in which case they are automatically computed
        using `get_cutoffs`.
    verbose : bool, optional
        Provide extra information when computing the cutpoints: time the
        operation and show a progress bar. The default is `True`.

    Attributes
    ----------
    line_data : instance of pept.LineData
        The LoRs for which the cutpoints will be computed. It is required
        to be an instance of `pept.LineData`.
    max_distance : float
        The maximum distance between any two lines so that their cutpoint
        will be considered.
    cutoffs : list-like of length 6
        A list (or equivalent) of the cutoff distances for every axis,
        formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only
        consider the cutpoints which fall within these cutoff distances.
    sample_size, overlap, number_of_lines, etc. : inherited from pept.PointData
        Extra attributes and methods are inherited from the base class
        `PointData`.

    Raises
    ------
    Exception
        If `line_data` is not an instance of `pept.LineData`.
    TypeError
        If `cutoffs` is not a one-dimensional array with values formatted
        as `[min_x, max_x, min_y, max_y, min_z, max_z]`.

    Example usage
    -------------
    Compute the cutpoints for a `LineData` instance:
        >>> line_data = pept.LineData(example_data)
        >>> cutpts = peptml.Cutpoints(line_data, 0.1)

    Compute the cutpoints for a single sample:
        >>> sample = line_data[0]
        >>> cutpts_sample = peptml.Cutpoints.find_cutpoints_sample(sample)
    '''

    def __init__(self, line_data, max_distance, cutoffs = None, verbose = True):
        # Find the cutpoints when instantiated. The method also initialises
        # the instance as a `PointData` subclass.
        self.find_cutpoints(line_data, max_distance, cutoffs = cutoffs,
                            verbose = verbose)

    @property
    def line_data(self):
        '''The LoRs for which the cutpoints are computed.

        line_data : instance of pept.LineData
        '''
        return self._line_data

    @line_data.setter
    def line_data(self, new_line_data):
        '''The LoRs for which the cutpoints are computed.

        Parameters
        ----------
        line_data : instance of pept.LineData
            The LoRs for which the cutpoints will be computed. It is
            required to be an instance of `pept.LineData`.

        Raises
        ------
        Exception
            If `line_data` is not an instance of `pept.LineData`.
        '''
        # Check new_line_data is an instance (or a subclass!) of pept.LineData
        if not isinstance(new_line_data, pept.LineData):
            raise Exception('[ERROR]: line_data should be an instance of pept.LineData')
        self._line_data = new_line_data

    @property
    def max_distance(self):
        '''The maximum distance between any pair of lines for which the
        cutpoint is considered.

        max_distance : float
        '''
        return self._max_distance

    @max_distance.setter
    def max_distance(self, new_max_distance):
        '''The maximum distance between any pair of lines for which the
        cutpoint is considered.

        max_distance : float
            The maximum distance between any two lines so that their
            cutpoint will be considered.
        '''
        self._max_distance = new_max_distance

    @property
    def cutoffs(self):
        '''Only consider the cutpoints which fall within these cutoff
        distances.

        A list (or equivalent) of the cutoff distances for every axis,
        formatted as [x_min, x_max, y_min, y_max, z_min, z_max].

        cutoffs : (6) list or equivalent
        '''
        return self._cutoffs

    @cutoffs.setter
    def cutoffs(self, new_cutoffs):
        '''Only consider the cutpoints which fall within these cutoff
        distances.

        A list (or equivalent) of the cutoff distances for every axis,
        formatted as [x_min, x_max, y_min, y_max, z_min, z_max].

        Parameters
        ----------
        new_cutoffs : list-like of length 6, optional
            A list (or equivalent) of the cutoff distances for every axis,
            formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only
            consider the cutpoints which fall within these cutoff distances.
            The default is None, in which case they are automatically
            computed using `get_cutoffs`.

        Raises
        ------
        TypeError
            If `cutoffs` is not a one-dimensional array with values
            formatted as `[min_x, max_x, min_y, max_y, min_z, max_z]`.
        '''
        cutoffs = np.asarray(new_cutoffs, order = 'C', dtype = float)
        if cutoffs.ndim != 1 or len(cutoffs) != 6:
            raise TypeError('\n[ERROR]: new_cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')
        self._cutoffs = cutoffs

    @staticmethod
    def get_cutoffs(sample):
        '''Compute the cutoffs from a sample of LoR data.

        This is a static method, meaning it can be called without
        instantiating the `Cutpoints` class. It computes the cutoffs from
        the minimum and maximum values of the LoRs in `sample` in each
        dimension.

        Parameters
        ----------
        sample : (N, 7) numpy.ndarray
            A sample of LoRs, where each row is
            `[time, x1, y1, z1, x2, y2, z2]`, such that every line is
            defined by the points `[x1, y1, z1]` and `[x2, y2, z2]`.

        Returns
        -------
        cutoffs : (6) numpy.ndarray
            The computed cutoffs for each dimension, formatted as
            `[x_min, x_max, y_min, y_max, z_min, z_max]`.

        Raises
        ------
        TypeError
            If `sample` is not a numpy array with shape (N, 7).
        '''
        # Check sample has shape (N, 7)
        if sample.ndim != 2 or sample.shape[1] != 7:
            raise TypeError('\n[ERROR]: sample should have dimensions (N, 7). Received {}\n'.format(sample.shape))

        # Compute cutoffs for cutpoints as the (min, max) values of the
        # two points that define every line, per axis
        min_x = min(sample[:, 1].min(), sample[:, 4].min())
        max_x = max(sample[:, 1].max(), sample[:, 4].max())
        min_y = min(sample[:, 2].min(), sample[:, 5].min())
        max_y = max(sample[:, 2].max(), sample[:, 5].max())
        min_z = min(sample[:, 3].min(), sample[:, 6].min())
        max_z = max(sample[:, 3].max(), sample[:, 6].max())

        cutoffs = np.array([min_x, max_x, min_y, max_y, min_z, max_z])
        return cutoffs

    @staticmethod
    def find_cutpoints_sample(sample, max_distance, cutoffs = None):
        '''Find the cutpoints in a sample of LoRs.

        This is a static method, meaning it can be called without
        instantiating the `Cutpoints` class. It computes the cutpoints from
        a given `sample` that are associated with pairs of lines closer
        than `max_distance`.

        Parameters
        ----------
        sample : (N, 7) numpy.ndarray
            A sample of LoRs, where each row is
            `[time, x1, y1, z1, x2, y2, z2]`, such that every line is
            defined by the points `[x1, y1, z1]` and `[x2, y2, z2]`.
        max_distance : float
            The maximum distance between any pair of lines so that their
            cutpoint will be considered.
        cutoffs : list, optional
            The cutoffs for each dimension, formatted as
            `[x_min, x_max, y_min, y_max, z_min, z_max]`. If not defined,
            they are computed automatically by calling `get_cutoffs`. The
            default is `None`.

        Returns
        -------
        sample_cutpoints : (N, 4) numpy.ndarray
            The computed cutpoints for the given LoRs, where each row is
            formatted as `[time, x, y, z]` for every cutpoint.

        Raises
        ------
        TypeError
            If `sample` is not a numpy array with shape (N, 7).
        TypeError
            If `cutoffs` is not a one-dimensional array with values
            `[min_x, max_x, min_y, max_y, min_z, max_z]`.
        '''
        # Check sample has shape (N, 7)
        if sample.ndim != 2 or sample.shape[1] != 7:
            raise TypeError('\n[ERROR]: sample should have dimensions (N, 7). Received {}\n'.format(sample.shape))

        if cutoffs is None:
            cutoffs = Cutpoints.get_cutoffs(sample)
        else:
            cutoffs = np.asarray(cutoffs, order = 'C', dtype = float)
            if cutoffs.ndim != 1 or len(cutoffs) != 6:
                raise TypeError('\n[ERROR]: cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')

        sample_cutpoints = find_cutpoints_api(sample, max_distance, cutoffs)
        return sample_cutpoints

    def find_cutpoints(self, line_data, max_distance, cutoffs = None,
                       verbose = False):
        '''Find the cutpoints of the samples in a `LineData` instance.

        Parameters
        ----------
        line_data : instance of pept.LineData
            The LoRs for which the cutpoints will be computed. It is
            required to be an instance of `pept.LineData`.
        max_distance : float
            The maximum distance between any two lines so that their
            cutpoint will be considered.
        cutoffs : list-like of length 6, optional
            A list (or equivalent) of the cutoff distances for every axis,
            formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only
            consider the cutpoints which fall within these cutoff distances.
            The default is None, in which case they are automatically
            computed using `get_cutoffs`.
        verbose : bool, optional
            Provide extra information when computing the cutpoints: time
            the operation and show a progress bar. The default is `False`.

        Returns
        -------
        self : the PointData instance of cutpoints
            The computed cutpoints are stored in the `Cutpoints` class, as
            a subclass of `pept.PointData`.

        Raises
        ------
        Exception
            If `line_data` is not an instance of `pept.LineData`.
        TypeError
            If `cutoffs` is not a one-dimensional array with values
            formatted as `[min_x, max_x, min_y, max_y, min_z, max_z]`.
        '''
        if verbose:
            start = time.time()

        # Check line_data is an instance (or a subclass!) of pept.LineData
        if not isinstance(line_data, pept.LineData):
            raise Exception('[ERROR]: line_data should be an instance of pept.LineData')
        self._line_data = line_data
        self._max_distance = max_distance

        # If cutoffs were not supplied, compute them
        if cutoffs is None:
            cutoffs = self.get_cutoffs(line_data.line_data)
        # Otherwise make sure they are a C-contiguous numpy array
        else:
            cutoffs = np.asarray(cutoffs, order = 'C', dtype = float)
            if cutoffs.ndim != 1 or len(cutoffs) != 6:
                raise TypeError('\n[ERROR]: cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')
        self._cutoffs = cutoffs

        # Using joblib, collect the cutpoints from every sample in a list
        # of arrays. If verbose, show a progress bar using tqdm.
        if verbose:
            cutpoints = Parallel(n_jobs = -1, prefer = 'threads')(
                delayed(self.find_cutpoints_sample)(sample, max_distance, cutoffs)
                for sample in tqdm(line_data)
            )
        else:
            cutpoints = Parallel(n_jobs = -1, prefer = 'threads')(
                delayed(self.find_cutpoints_sample)(sample, max_distance, cutoffs)
                for sample in line_data
            )

        # cutpoints shape: (n, m, 4), where n is the number of samples, and
        # m is the number of cutpoints in the sample
        cutpoints = np.array(cutpoints)
        number_of_samples = len(cutpoints)
        cutpoints = np.vstack(cutpoints)
        number_of_cutpoints = len(cutpoints)

        # Average number of cutpoints per sample
        cutpoints_per_sample = int(number_of_cutpoints / number_of_samples)

        super().__init__(cutpoints, sample_size = cutpoints_per_sample,
                         overlap = 0, verbose = False)

        if verbose:
            end = time.time()
            print("\n\n\nFinding the cutpoints took {} seconds".format(end - start))

        return self
Ancestors

pept.PointData
Static methods
def find_cutpoints_sample(sample, max_distance, cutoffs=None)
-
Find the cutpoints in a sample of LoRs.
This is a static method, meaning it can be called without instantiating the `Cutpoints` class. It computes the cutpoints from a given `sample` that are associated with pairs of lines closer than `max_distance`.

Parameters

sample : (N, 7) numpy.ndarray
    A sample of LoRs, where each row is `[time, x1, y1, z1, x2, y2, z2]`, such that every line is defined by the points `[x1, y1, z1]` and `[x2, y2, z2]`.
max_distance : float
    The maximum distance between any pair of lines so that their cutpoint will be considered.
cutoffs : list, optional
    The cutoffs for each dimension, formatted as `[x_min, x_max, y_min, y_max, z_min, z_max]`. If not defined, they are computed automatically by calling `get_cutoffs`. The default is `None`.

Returns

sample_cutpoints : (N, 4) numpy.ndarray
    The computed cutpoints for the given LoRs, where each row is formatted as `[time, x, y, z]` for every cutpoint.

Raises

TypeError
    If `sample` is not a numpy array with shape (N, 7).
TypeError
    If `cutoffs` is not a one-dimensional array with values `[min_x, max_x, min_y, max_y, min_z, max_z]`.
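The cutpoint search itself runs in compiled code (`find_cutpoints_api`), but the geometry it relies on is the standard closest-approach construction for two 3D lines: the cutpoint is the midpoint of the shortest segment joining the two lines, accepted only if that shortest distance is below `max_distance`. A plain-numpy sketch of that construction (the `cutpoint` helper is illustrative, not a package function):

```python
import numpy as np

def cutpoint(line1, line2, max_distance):
    """Midpoint of closest approach of two LoRs, or None if the lines are
    parallel or further apart than max_distance.
    Each line is a row [time, x1, y1, z1, x2, y2, z2]."""
    p1, d1 = line1[1:4], line1[4:7] - line1[1:4]
    p2, d2 = line2[1:4], line2[4:7] - line2[1:4]

    # Minimise |(p1 + t d1) - (p2 + s d2)| over the parameters (t, s)
    r = p1 - p2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ r, d2 @ r
    denom = a * c - b * b           # zero for parallel lines
    if abs(denom) < 1e-12:
        return None
    t = (b * e - c * d) / denom
    s = (a * e - b * d) / denom

    q1 = p1 + t * d1                # closest point on line 1
    q2 = p2 + s * d2                # closest point on line 2
    if np.linalg.norm(q1 - q2) > max_distance:
        return None

    time = 0.5 * (line1[0] + line2[0])   # average the two timestamps
    return np.concatenate(([time], 0.5 * (q1 + q2)))
```

For two lines that actually intersect, the shortest distance is zero and the cutpoint is the intersection point itself; `max_distance` then controls how far from exact intersection a pair of LoRs may be while still contributing a cutpoint.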
Source code
@staticmethod
def find_cutpoints_sample(sample, max_distance, cutoffs = None):
    '''Find the cutpoints in a sample of LoRs.

    This is a static method, meaning it can be called without instantiating
    the `Cutpoints` class. It computes the cutpoints from a given `sample`
    that are associated with pairs of lines closer than `max_distance`.

    Parameters
    ----------
    sample : (N, 7) numpy.ndarray
        A sample of LoRs, where each row is
        `[time, x1, y1, z1, x2, y2, z2]`, such that every line is defined
        by the points `[x1, y1, z1]` and `[x2, y2, z2]`.
    max_distance : float
        The maximum distance between any pair of lines so that their
        cutpoint will be considered.
    cutoffs : list, optional
        The cutoffs for each dimension, formatted as
        `[x_min, x_max, y_min, y_max, z_min, z_max]`. If not defined, they
        are computed automatically by calling `get_cutoffs`. The default is
        `None`.

    Returns
    -------
    sample_cutpoints : (N, 4) numpy.ndarray
        The computed cutpoints for the given LoRs, where each row is
        formatted as `[time, x, y, z]` for every cutpoint.

    Raises
    ------
    TypeError
        If `sample` is not a numpy array with shape (N, 7).
    TypeError
        If `cutoffs` is not a one-dimensional array with values
        `[min_x, max_x, min_y, max_y, min_z, max_z]`.
    '''
    # Check sample has shape (N, 7)
    if sample.ndim != 2 or sample.shape[1] != 7:
        raise TypeError('\n[ERROR]: sample should have dimensions (N, 7). Received {}\n'.format(sample.shape))

    if cutoffs is None:
        cutoffs = Cutpoints.get_cutoffs(sample)
    else:
        cutoffs = np.asarray(cutoffs, order = 'C', dtype = float)
        if cutoffs.ndim != 1 or len(cutoffs) != 6:
            raise TypeError('\n[ERROR]: cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')

    sample_cutpoints = find_cutpoints_api(sample, max_distance, cutoffs)
    return sample_cutpoints
def get_cutoffs(sample)
-
Compute the cutoffs from a sample of LoR data.
This is a static method, meaning it can be called without instantiating the `Cutpoints` class. It computes the cutoffs from the minimum and maximum values of the LoRs in `sample` in each dimension.

Parameters

sample : (N, 7) numpy.ndarray
    A sample of LoRs, where each row is `[time, x1, y1, z1, x2, y2, z2]`, such that every line is defined by the points `[x1, y1, z1]` and `[x2, y2, z2]`.

Returns

cutoffs : (6) numpy.ndarray
    The computed cutoffs for each dimension, formatted as `[x_min, x_max, y_min, y_max, z_min, z_max]`.

Raises

TypeError
    If `sample` is not a numpy array with shape (N, 7).
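Because the cutoffs are just per-axis bounds over both endpoints of every LoR, the chain of `min`/`max` calls in the method reduces to one vectorised expression. A sketch assuming the same `(N, 7)` row layout (this standalone function mirrors, but is not, the package's `get_cutoffs`):

```python
import numpy as np

def get_cutoffs(sample):
    """Vectorised equivalent of the per-axis min/max logic: stack both
    endpoints of every LoR, then take column-wise bounds."""
    # Combine the endpoints [x1, y1, z1] and [x2, y2, z2] of every line
    points = np.vstack((sample[:, 1:4], sample[:, 4:7]))   # shape (2N, 3)
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    # Interleave as [x_min, x_max, y_min, y_max, z_min, z_max]
    return np.column_stack((lo, hi)).ravel()
```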
Source code
@staticmethod
def get_cutoffs(sample):
    '''Compute the cutoffs from a sample of LoR data.

    This is a static method, meaning it can be called without instantiating
    the `Cutpoints` class. It computes the cutoffs from the minimum and
    maximum values of the LoRs in `sample` in each dimension.

    Parameters
    ----------
    sample : (N, 7) numpy.ndarray
        A sample of LoRs, where each row is
        `[time, x1, y1, z1, x2, y2, z2]`, such that every line is defined
        by the points `[x1, y1, z1]` and `[x2, y2, z2]`.

    Returns
    -------
    cutoffs : (6) numpy.ndarray
        The computed cutoffs for each dimension, formatted as
        `[x_min, x_max, y_min, y_max, z_min, z_max]`.

    Raises
    ------
    TypeError
        If `sample` is not a numpy array with shape (N, 7).
    '''
    # Check sample has shape (N, 7)
    if sample.ndim != 2 or sample.shape[1] != 7:
        raise TypeError('\n[ERROR]: sample should have dimensions (N, 7). Received {}\n'.format(sample.shape))

    # Compute cutoffs for cutpoints as the (min, max) values of the
    # two points that define every line, per axis
    min_x = min(sample[:, 1].min(), sample[:, 4].min())
    max_x = max(sample[:, 1].max(), sample[:, 4].max())
    min_y = min(sample[:, 2].min(), sample[:, 5].min())
    max_y = max(sample[:, 2].max(), sample[:, 5].max())
    min_z = min(sample[:, 3].min(), sample[:, 6].min())
    max_z = max(sample[:, 3].max(), sample[:, 6].max())

    cutoffs = np.array([min_x, max_x, min_y, max_y, min_z, max_z])
    return cutoffs
Instance variables
var cutoffs
-
Only consider the cutpoints which fall within these cutoff distances.
A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max].
cutoffs : (6) list or equivalent
Source code
@property
def cutoffs(self):
    '''Only consider the cutpoints which fall within these cutoff distances.

    A list (or equivalent) of the cutoff distances for every axis,
    formatted as [x_min, x_max, y_min, y_max, z_min, z_max].

    cutoffs : (6) list or equivalent
    '''
    return self._cutoffs
var line_data
-
The LoRs for which the cutpoints are computed.
line_data : instance of pept.LineData
Source code
@property
def line_data(self):
    '''The LoRs for which the cutpoints are computed.

    line_data : instance of pept.LineData
    '''
    return self._line_data
var max_distance
-
The maximum distance between any pair of lines for which the cutpoint is considered.
max_distance : float
Source code
@property
def max_distance(self):
    '''The maximum distance between any pair of lines for which the
    cutpoint is considered.

    max_distance : float
    '''
    return self._max_distance
Methods
def find_cutpoints(self, line_data, max_distance, cutoffs=None, verbose=False)
-
Find the cutpoints of the samples in a `LineData` instance.

Parameters

line_data : instance of pept.LineData
    The LoRs for which the cutpoints will be computed. It is required to be an instance of `pept.LineData`.
max_distance : float
    The maximum distance between any two lines so that their cutpoint will be considered.
cutoffs : list-like of length 6, optional
    A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which fall within these cutoff distances. The default is None, in which case they are automatically computed using `get_cutoffs`.
verbose : bool, optional
    Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is `False`.

Returns

self : the PointData instance of cutpoints
    The computed cutpoints are stored in the `Cutpoints` class, as a subclass of `pept.PointData`.

Raises

Exception
    If `line_data` is not an instance of `pept.LineData`.
TypeError
    If `cutoffs` is not a one-dimensional array with values formatted as `[min_x, max_x, min_y, max_y, min_z, max_z]`.
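Internally, `find_cutpoints` fans `find_cutpoints_sample` out over every sample using joblib's threading backend, then stacks the per-sample arrays into one array of cutpoints. The same fan-out/stack pattern, sketched with only the standard library (`process_sample` is a stand-in for the real per-sample function, not a package API):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def process_sample(sample):
    # Stand-in for find_cutpoints_sample: returns an (M, 4) array per sample.
    return np.asarray(sample, dtype=float).reshape(-1, 4)

def map_samples(samples):
    """Map a per-sample function over all samples in parallel (threads),
    then stack the per-sample arrays into one (total, 4) array -- the same
    shape of computation find_cutpoints performs via joblib."""
    with ThreadPoolExecutor() as pool:
        per_sample = list(pool.map(process_sample, samples))
    return np.vstack(per_sample)
```

Threads (rather than processes) are a reasonable choice here because the heavy per-sample work happens in compiled code that releases the GIL, so the arrays need not be pickled between workers.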
Source code
def find_cutpoints(self, line_data, max_distance, cutoffs = None,
                   verbose = False):
    '''Find the cutpoints of the samples in a `LineData` instance.

    Parameters
    ----------
    line_data : instance of pept.LineData
        The LoRs for which the cutpoints will be computed. It is required
        to be an instance of `pept.LineData`.
    max_distance : float
        The maximum distance between any two lines so that their cutpoint
        will be considered.
    cutoffs : list-like of length 6, optional
        A list (or equivalent) of the cutoff distances for every axis,
        formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only
        consider the cutpoints which fall within these cutoff distances.
        The default is None, in which case they are automatically computed
        using `get_cutoffs`.
    verbose : bool, optional
        Provide extra information when computing the cutpoints: time the
        operation and show a progress bar. The default is `False`.

    Returns
    -------
    self : the PointData instance of cutpoints
        The computed cutpoints are stored in the `Cutpoints` class, as a
        subclass of `pept.PointData`.

    Raises
    ------
    Exception
        If `line_data` is not an instance of `pept.LineData`.
    TypeError
        If `cutoffs` is not a one-dimensional array with values formatted
        as `[min_x, max_x, min_y, max_y, min_z, max_z]`.
    '''
    if verbose:
        start = time.time()

    # Check line_data is an instance (or a subclass!) of pept.LineData
    if not isinstance(line_data, pept.LineData):
        raise Exception('[ERROR]: line_data should be an instance of pept.LineData')
    self._line_data = line_data
    self._max_distance = max_distance

    # If cutoffs were not supplied, compute them
    if cutoffs is None:
        cutoffs = self.get_cutoffs(line_data.line_data)
    # Otherwise make sure they are a C-contiguous numpy array
    else:
        cutoffs = np.asarray(cutoffs, order = 'C', dtype = float)
        if cutoffs.ndim != 1 or len(cutoffs) != 6:
            raise TypeError('\n[ERROR]: cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')
    self._cutoffs = cutoffs

    # Using joblib, collect the cutpoints from every sample in a list of
    # arrays. If verbose, show a progress bar using tqdm.
    if verbose:
        cutpoints = Parallel(n_jobs = -1, prefer = 'threads')(
            delayed(self.find_cutpoints_sample)(sample, max_distance, cutoffs)
            for sample in tqdm(line_data)
        )
    else:
        cutpoints = Parallel(n_jobs = -1, prefer = 'threads')(
            delayed(self.find_cutpoints_sample)(sample, max_distance, cutoffs)
            for sample in line_data
        )

    # cutpoints shape: (n, m, 4), where n is the number of samples, and
    # m is the number of cutpoints in the sample
    cutpoints = np.array(cutpoints)
    number_of_samples = len(cutpoints)
    cutpoints = np.vstack(cutpoints)
    number_of_cutpoints = len(cutpoints)

    # Average number of cutpoints per sample
    cutpoints_per_sample = int(number_of_cutpoints / number_of_samples)

    super().__init__(cutpoints, sample_size = cutpoints_per_sample,
                     overlap = 0, verbose = False)

    if verbose:
        end = time.time()
        print("\n\n\nFinding the cutpoints took {} seconds".format(end - start))

    return self
Inherited members
class HDBSCANClusterer (min_cluster_size=20, min_samples=None, allow_single_cluster=False)
-
HDBSCAN-based clustering for cutpoints from LoRs.
This class is a wrapper around the `hdbscan` package, providing tools for parallel clustering of samples of cutpoints. It can return `PointData` classes which can be easily manipulated or visualised.

Parameters
min_cluster_size : int, optional
    (Taken from hdbscan's documentation): The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters. The default is 20.
min_samples : int, optional
    (Taken from hdbscan's documentation): The number of samples in a neighbourhood for a point to be considered a core point. The default is None, being set automatically to the `min_cluster_size`.
allow_single_cluster : bool, optional
    (Taken from hdbscan's documentation): By default HDBSCAN* will not produce a single cluster; setting this to True will override this and allow single-cluster results in the case that you feel this is a valid result for your dataset. For PEPT, set this to True if you only have one tracer in the dataset. Otherwise, leave it as False, as it will provide higher accuracy.
Attributes
min_cluster_size : int
    (Taken from hdbscan's documentation): The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters. The default is 20.
min_samples : int
    (Taken from hdbscan's documentation): The number of samples in a neighbourhood for a point to be considered a core point. The default is None, being set automatically to the `min_cluster_size`.
allow_single_cluster : bool
    (Taken from hdbscan's documentation): By default HDBSCAN* will not produce a single cluster; setting this to True will override this and allow single-cluster results in the case that you feel this is a valid result for your dataset. For PEPT, set this to True if you only have one tracer in the dataset. Otherwise, leave it as False, as it will provide higher accuracy.
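Once `hdbscan` has assigned a label to every cutpoint, the centre computation in `fit_sample` reduces to a per-label column mean with the cluster size appended. A plain-numpy sketch with made-up labels (`cluster_centres` is illustrative, not a package function; real labels come from `hdbscan.HDBSCAN.fit_predict`):

```python
import numpy as np

def cluster_centres(sample, labels):
    """For each cluster label >= 0 (label -1 marks noise), average every
    column of [time, x, y, z, ...] and append the cluster size -- the same
    centre rows HDBSCANClusterer.fit_sample builds."""
    centres = []
    for i in range(labels.max() + 1):
        members = sample[labels == i]
        # Row: [mean_time, mean_x, mean_y, mean_z, ..., cluster_size]
        centres.append(np.append(members.mean(axis=0), len(members)))
    return np.array(centres)
```

Each centre row thus carries both the centroid of the cluster in time and space and how many cutpoints supported it, which downstream code can use to weight or filter tracer positions.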
Source code
class HDBSCANClusterer:
    '''HDBSCAN-based clustering for cutpoints from LoRs.

    This class is a wrapper around the `hdbscan` package, providing tools
    for parallel clustering of samples of cutpoints. It can return
    `PointData` classes which can be easily manipulated or visualised.

    Parameters
    ----------
    min_cluster_size : int, optional
        (Taken from hdbscan's documentation): The minimum size of clusters;
        single linkage splits that contain fewer points than this will be
        considered points “falling out” of a cluster rather than a cluster
        splitting into two new clusters. The default is 20.
    min_samples : int, optional
        (Taken from hdbscan's documentation): The number of samples in a
        neighbourhood for a point to be considered a core point. The
        default is None, being set automatically to the `min_cluster_size`.
    allow_single_cluster : bool, optional
        (Taken from hdbscan's documentation): By default HDBSCAN* will not
        produce a single cluster; setting this to True will override this
        and allow single-cluster results in the case that you feel this is
        a valid result for your dataset. For PEPT, set this to True if you
        only have one tracer in the dataset. Otherwise, leave it as False,
        as it will provide higher accuracy.

    Attributes
    ----------
    min_cluster_size : int
        (Taken from hdbscan's documentation): The minimum size of clusters;
        single linkage splits that contain fewer points than this will be
        considered points “falling out” of a cluster rather than a cluster
        splitting into two new clusters. The default is 20.
    min_samples : int
        (Taken from hdbscan's documentation): The number of samples in a
        neighbourhood for a point to be considered a core point. The
        default is None, being set automatically to the `min_cluster_size`.
    allow_single_cluster : bool
        (Taken from hdbscan's documentation): By default HDBSCAN* will not
        produce a single cluster; setting this to True will override this
        and allow single-cluster results in the case that you feel this is
        a valid result for your dataset. For PEPT, set this to True if you
        only have one tracer in the dataset. Otherwise, leave it as False,
        as it will provide higher accuracy.
    '''

    def __init__(self, min_cluster_size = 20, min_samples = None,
                 allow_single_cluster = False):
        if 0 < min_cluster_size < 2:
            print("\n[WARNING]: min_cluster_size was set to 2, as it was {} < 2\n".format(min_cluster_size))
            min_cluster_size = 2

        self.clusterer = hdbscan.HDBSCAN(
            min_cluster_size = min_cluster_size,
            min_samples = min_samples,
            core_dist_n_jobs = -1,
            allow_single_cluster = allow_single_cluster
        )

    @property
    def min_cluster_size(self):
        return self.clusterer.min_cluster_size

    @min_cluster_size.setter
    def min_cluster_size(self, new_min_cluster_size):
        self.clusterer.min_cluster_size = new_min_cluster_size

    @property
    def min_samples(self):
        return self.clusterer.min_samples

    @min_samples.setter
    def min_samples(self, new_min_samples):
        self.clusterer.min_samples = new_min_samples

    @property
    def allow_single_cluster(self):
        return self.clusterer.allow_single_cluster

    @allow_single_cluster.setter
    def allow_single_cluster(self, option):
        self.clusterer.allow_single_cluster = option

    def fit_sample(self, sample, store_labels = False, noise = False,
                   as_array = True, verbose = False):
        '''Fit one sample of cutpoints and return the cluster centres and
        (optionally) the labelled cutpoints.

        Parameters
        ----------
        sample : (N, M >= 4) numpy.ndarray
            The sample of points that will be clustered. Every point
            corresponds to a row and is formatted as `[time, x, y, z, etc]`.
            Only columns `[1, 2, 3]` are used for clustering.
        store_labels : bool, optional
            If set to True, the clustered cutpoints are returned along with
            the centres of the clusters. Setting it to False speeds up the
            clustering. The default is False.
        noise : bool, optional
            If set to True, the clustered cutpoints also include the points
            classified as noise. Only has an effect if `store_labels` is
            set to True. The default is False.
        as_array : bool, optional
            If set to True, the centres of the clusters and the clustered
            cutpoints are returned as numpy arrays. If set to False, they
            are returned inside instances of `pept.PointData`.
        verbose : bool, optional
            Provide extra information when computing the cutpoints: time
            the operation and show a progress bar. The default is `False`.

        Returns
        -------
        centres : numpy.ndarray or pept.PointData
            The centroids of every cluster found. They are computed as the
            average of every column of `[time, x, y, z, etc]` of the
            clustered points. Another column is added to the initial data
            in `sample`, signifying the cluster size - the number of points
            included in the cluster. If `as_array` is set to True, it is a
            numpy array, otherwise the centres are stored in a
            pept.PointData instance.
        clustered_cutpoints : numpy.ndarray or pept.PointData
            The points in `sample` that fall in every cluster. A new column
            is added to the points in `sample` that signifies the label of
            the cluster that the point was associated with: all points in
            cluster number 3 will have the number 3 as the last element in
            their row. The points classified as noise have the number -1
            associated. If `as_array` is set to True, it is a numpy array,
            otherwise the clustered cutpoints are stored in a
            pept.PointData instance.

        Raises
        ------
        TypeError
            If `sample` is not a numpy array of shape (N, M), where M >= 4.
        '''
        if verbose:
            start = time.time()

        # sample row: [time, x, y, z]
        if sample.ndim != 2 or sample.shape[1] < 4:
            raise TypeError('\n[ERROR]: sample should have two dimensions (M, N), where N >= 4. Received {}\n'.format(sample.shape))

        # Only cluster based on [x, y, z]
        labels = self.clusterer.fit_predict(sample[:, 1:4])
        max_label = labels.max()

        centres = []
        clustered_cutpoints = []

        # The centre of a cluster is the average of the time, x, y, z
        # columns and the number of points of that cluster
        # centres row: [time, x, y, z, ..etc.., cluster_size]
        for i in range(0, max_label + 1):
            # Average time, x, y, z of cluster of label i
            centres_row = np.mean(sample[labels == i], axis = 0)
            # Append the number of points of label i => cluster_size
            centres_row = np.append(centres_row, (labels == i).sum())
            centres.append(centres_row)
        centres = np.array(centres)

        if not as_array:
            centres = pept.PointData(centres, sample_size = 0, overlap = 0,
                                     verbose = False)

        # Return all cutpoints as a list of numpy arrays for every label,
        # where the last column of an array is the label
        if store_labels:
            # Create a list of numpy arrays with rows:
            # [t, x, y, z, ..etc.., label]
            if noise:
                cutpoints = sample[labels == -1]
                cutpoints = np.insert(cutpoints, cutpoints.shape[1], -1, axis = 1)
                clustered_cutpoints.append(cutpoints)

            for i in range(0, max_label + 1):
                cutpoints = sample[labels == i]
                cutpoints = np.insert(cutpoints, cutpoints.shape[1], i, axis = 1)
                clustered_cutpoints.append(cutpoints)

            clustered_cutpoints = np.vstack(clustered_cutpoints)

            if not as_array:
                clustered_cutpoints = pept.PointData(clustered_cutpoints,
                                                     sample_size = 0,
                                                     overlap = 0,
                                                     verbose = False)

        if verbose:
            end = time.time()
            print("Fitting one sample took {} seconds".format(end - start))

        return [centres, clustered_cutpoints]

    def fit_cutpoints(self, cutpoints, store_labels = False, noise = False,
                      verbose = True):
        '''Fit cutpoints (an instance of `PointData`) and return the
        cluster centres and (optionally) the labelled cutpoints.

        Parameters
        ----------
        cutpoints : an instance of `pept.PointData`
            The samples of points that will be clustered. In every sample,
            every point corresponds to a row and is formatted as
            `[time, x, y, z, etc]`. Only columns `[1, 2, 3]` are used for
            clustering.
        store_labels : bool, optional
            If set to True, the clustered cutpoints are returned along with
            the centres of the clusters. Setting it to False speeds up the
            clustering. The default is False.
        noise : bool, optional
            If set to True, the clustered cutpoints also include the points
            classified as noise. Only has an effect if `store_labels` is
            set to True. The default is False.
        verbose : bool, optional
            Provide extra information when computing the cutpoints: time
            the operation and show a progress bar. The default is `True`.

        Returns
        -------
        centres : pept.PointData
            The centroids of every cluster found. They are computed as the
            average of every column of `[time, x, y, z, etc]` of the
            clustered points. Another column is added to the initial data
            in `sample`, signifying the cluster size - the number of points
            included in the cluster.
        clustered_cutpoints : numpy.ndarray or pept.PointData
            The points in `sample` that fall in every cluster. A new column
            is added to the points in `sample` that signifies the label of
            the cluster that the point was associated with: all points in
            cluster number 3 will have the number 3 as the last element in
            their row. The points classified as noise have the number -1
            associated.

        Raises
        ------
        Exception
            If `cutpoints` is not an instance (or a subclass) of
            `pept.PointData`.
        '''
        if verbose:
            start = time.time()

        if not isinstance(cutpoints, pept.PointData):
            raise Exception('[ERROR]: cutpoints should be an instance of pept.PointData (or any class inheriting from it)')

        # Fit all samples in `cutpoints` in parallel using joblib
        # Collect all outputs as a list.
If verbose, show progress bar with # tqdm if verbose: data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample, store_labels = store_labels, noise = noise, as_array = True) for sample in tqdm(cutpoints)) else: data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample, store_labels = store_labels, noise = noise, as_array = True) for sample in cutpoints) # Access joblib.Parallel output as list comprehensions centres = np.array([row[0] for row in data_list if len(row[0]) != 0]) if len(centres) != 0: centres = pept.PointData(np.vstack(centres), sample_size = 0, overlap = 0, verbose = False) if store_labels: clustered_cutpoints = np.array([row[1] for row in data_list if len(row[1]) != 0]) clustered_cutpoints = pept.PointData(np.vstack(np.array(clustered_cutpoints)), sample_size = 0, overlap = 0, verbose = False) if verbose: end = time.time() print("Fitting cutpoints took {} seconds".format(end - start)) if store_labels: return [centres, clustered_cutpoints] else: return [centres, []]
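The `store_labels` branch of `fit_sample` tags every point with its cluster label in a new last column, with noise points labelled -1. A minimal numpy-only sketch of that step, using hypothetical sample data and labels in place of a real `hdbscan` fit:

```python
import numpy as np

# Hypothetical sample rows [time, x, y, z] and labels as hdbscan would
# return them, with -1 marking noise.
sample = np.array([
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 1.1, 0.9, 1.0],
    [2.0, 9.0, 9.0, 9.0],
])
labels = np.array([0, 0, -1])

# Mirror the store_labels path: append the label as a last column,
# stacking noise (-1) first, then each cluster in label order.
clustered = []
for label in [-1] + list(range(labels.max() + 1)):
    points = sample[labels == label]
    points = np.insert(points, points.shape[1], label, axis = 1)
    clustered.append(points)
clustered = np.vstack(clustered)
```

After stacking, `clustered` has one extra column, so points from different clusters can be told apart even in a single flat array.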
Instance variables
var allow_single_cluster
-
Source code
@property
def allow_single_cluster(self):
    return self.clusterer.allow_single_cluster
var min_cluster_size
-
Source code
@property
def min_cluster_size(self):
    return self.clusterer.min_cluster_size
var min_samples
-
Source code
@property
def min_samples(self):
    return self.clusterer.min_samples
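All three instance variables simply read through to the wrapped `hdbscan.HDBSCAN` object, so setting them reconfigures the underlying clusterer directly. The delegation pattern can be sketched with a stand-in inner class (hypothetical, so no `hdbscan` install is needed):

```python
class _InnerClusterer:
    # Stand-in for hdbscan.HDBSCAN: holds the actual parameters.
    def __init__(self):
        self.min_cluster_size = 20
        self.min_samples = None

class Wrapper:
    def __init__(self):
        self.clusterer = _InnerClusterer()

    @property
    def min_samples(self):
        # Read through to the wrapped object (note: min_samples,
        # not min_cluster_size).
        return self.clusterer.min_samples

    @min_samples.setter
    def min_samples(self, new_min_samples):
        self.clusterer.min_samples = new_min_samples

w = Wrapper()
w.min_samples = 5
```

The benefit is that users tune the wrapper's attributes without touching the inner object, and the inner object always sees the current values on the next fit.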
Methods
def fit_cutpoints(self, cutpoints, store_labels=False, noise=False, verbose=True)
-
Fit cutpoints (an instance of
PointData
) and return the cluster centres and (optionally) the labelled cutpoints.Parameters
cutpoints : an instance of pept.PointData
- The samples of points that will be clustered. In every sample, every point
corresponds to a row and is formatted as
[time, x, y, z, etc]
. Only columns[1, 2, 3]
are used for clustering. store_labels
:bool
, optional- If set to True, the clustered cutpoints are returned along with the centres of the clusters. Setting it to False speeds up the clustering. The default is False.
noise
:bool
, optional- If set to True, the clustered cutpoints also include the points classified
as noise. Only has an effect if
store_labels
is set to True. The default is False. verbose
:bool
, optional- Provide extra information when computing the cutpoints: time the operation
and show a progress bar. The default is `True`.
Returns
centres
:pept.PointData
- The centroids of every cluster found. They are computed as the average
of every column of
[time, x, y, z, etc]
of the clustered points. Another column is added to the initial data insample
, signifying the cluster size - the number of points included in the cluster. clustered_cutpoints
:numpy.ndarray
orpept.PointData
- The points in
sample
that fall in every cluster. A new column is added to the points insample
that signifies the label of cluster that the point was associated with: all points in cluster number 3 will have the number 3 as the last element in their row. The points classified as noise have the number -1 associated.
Raises
Exception
- If
cutpoints
is not an instance (or a subclass) ofpept.PointData
.
Source code
def fit_cutpoints(self, cutpoints, store_labels = False, noise = False,
                  verbose = True):
    '''Fit cutpoints (an instance of `PointData`) and return the cluster
    centres and (optionally) the labelled cutpoints.

    Parameters
    ----------
    cutpoints : an instance of `pept.PointData`
        The samples of points that will be clustered. In every sample, every
        point corresponds to a row and is formatted as
        `[time, x, y, z, etc]`. Only columns `[1, 2, 3]` are used for
        clustering.
    store_labels : bool, optional
        If set to True, the clustered cutpoints are returned along with the
        centres of the clusters. Setting it to False speeds up the
        clustering. The default is False.
    noise : bool, optional
        If set to True, the clustered cutpoints also include the points
        classified as noise. Only has an effect if `store_labels` is set to
        True. The default is False.
    verbose : bool, optional
        Provide extra information when computing the cutpoints: time the
        operation and show a progress bar. The default is `True`.

    Returns
    -------
    centres : pept.PointData
        The centroids of every cluster found. They are computed as the
        average of every column of `[time, x, y, z, etc]` of the clustered
        points. Another column is added to the initial data in `sample`,
        signifying the cluster size - the number of points included in the
        cluster.
    clustered_cutpoints : numpy.ndarray or pept.PointData
        The points in `sample` that fall in every cluster. A new column is
        added to the points in `sample` that signifies the label of the
        cluster that the point was associated with: all points in cluster
        number 3 will have the number 3 as the last element in their row.
        The points classified as noise have the number -1 associated.

    Raises
    ------
    Exception
        If `cutpoints` is not an instance (or a subclass) of
        `pept.PointData`.
    '''
    if verbose:
        start = time.time()

    if not isinstance(cutpoints, pept.PointData):
        raise Exception('[ERROR]: cutpoints should be an instance of pept.PointData (or any class inheriting from it)')

    # Fit all samples in `cutpoints` in parallel using joblib.
    # Collect all outputs as a list; if verbose, show a progress bar
    # with tqdm.
    if verbose:
        data_list = Parallel(n_jobs = -1)(
            delayed(self.fit_sample)(sample, store_labels = store_labels,
                                     noise = noise, as_array = True)
            for sample in tqdm(cutpoints)
        )
    else:
        data_list = Parallel(n_jobs = -1)(
            delayed(self.fit_sample)(sample, store_labels = store_labels,
                                     noise = noise, as_array = True)
            for sample in cutpoints
        )

    # Collect the joblib.Parallel output with list comprehensions,
    # dropping empty results.
    centres = [row[0] for row in data_list if len(row[0]) != 0]
    if len(centres) != 0:
        centres = pept.PointData(np.vstack(centres), sample_size = 0,
                                 overlap = 0, verbose = False)

    if store_labels:
        clustered_cutpoints = [row[1] for row in data_list if len(row[1]) != 0]
        clustered_cutpoints = pept.PointData(np.vstack(clustered_cutpoints),
                                             sample_size = 0, overlap = 0,
                                             verbose = False)

    if verbose:
        end = time.time()
        print("Fitting cutpoints took {} seconds".format(end - start))

    if store_labels:
        return [centres, clustered_cutpoints]
    else:
        return [centres, []]
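After the parallel fits, the per-sample results are filtered for empty outputs (samples in which no cluster was found) before being stacked into one array. A numpy-only sketch of that aggregation, using hypothetical per-sample outputs in place of real `fit_sample` results:

```python
import numpy as np

# Hypothetical per-sample outputs, shaped like fit_sample's return value:
# [centres, clustered_cutpoints], centres rows = [time, x, y, z, cluster_size].
data_list = [
    [np.array([[0.0, 1.0, 1.0, 1.0, 3.0]]), []],
    [np.array([]),                           []],   # sample with no clusters
    [np.array([[5.0, 2.0, 2.0, 2.0, 4.0]]), []],
]

# Mirror the aggregation: drop empty results, then stack the rest.
centres = [row[0] for row in data_list if len(row[0]) != 0]
centres = np.vstack(centres)
```

Filtering first matters because `np.vstack` raises on arrays whose column counts disagree, and an empty `(0,)` result from a clusterless sample would do exactly that.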
def fit_sample(self, sample, store_labels=False, noise=False, as_array=True, verbose=False)
-
Fit one sample of cutpoints and return the cluster centres and (optionally) the labelled cutpoints.
Parameters
sample : (N, M >= 4) numpy.ndarray
- The sample of points that will be clustered. Every point corresponds to
a row and is formatted as
[time, x, y, z, etc]
. Only columns[1, 2, 3]
are used for clustering. store_labels
:bool
, optional- If set to True, the clustered cutpoints are returned along with the centres of the clusters. Setting it to False speeds up the clustering. The default is False.
noise
:bool
, optional- If set to True, the clustered cutpoints also include the points classified
as noise. Only has an effect if
store_labels
is set to True. The default is False. as_array
:bool
, optional- If set to True, the centres of the clusters and the clustered cutpoints are
returned as numpy arrays. If set to False, they are returned inside
instances of
PointData
. verbose
:bool
, optional- Provide extra information when computing the cutpoints: time the operation
and show a progress bar. The default is
False
.
Returns
centres
:numpy.ndarray
orPointData
- The centroids of every cluster found. They are computed as the average
of every column of
[time, x, y, z, etc]
of the clustered points. Another column is added to the initial data insample
, signifying the cluster size - the number of points included in the cluster. Ifas_array
is set to True, it is a numpy array, otherwise the centres are stored in a pept.PointData instance. clustered_cutpoints
:numpy.ndarray
orPointData
- The points in
sample
that fall in every cluster. A new column is added to the points insample
that signifies the label of cluster that the point was associated with: all points in cluster number 3 will have the number 3 as the last element in their row. The points classified as noise have the number -1 associated. Ifas_array
is set to True, it is a numpy array, otherwise the clustered cutpoints are stored in a pept.PointData instance.
Raises
TypeError
- If
sample
is not a numpy array of shape (N, M), where M >= 4.
Source code
def fit_sample(self, sample, store_labels = False, noise = False,
               as_array = True, verbose = False):
    '''Fit one sample of cutpoints and return the cluster centres and
    (optionally) the labelled cutpoints.

    Parameters
    ----------
    sample : (N, M >= 4) numpy.ndarray
        The sample of points that will be clustered. Every point corresponds
        to a row and is formatted as `[time, x, y, z, etc]`. Only columns
        `[1, 2, 3]` are used for clustering.
    store_labels : bool, optional
        If set to True, the clustered cutpoints are returned along with the
        centres of the clusters. Setting it to False speeds up the
        clustering. The default is False.
    noise : bool, optional
        If set to True, the clustered cutpoints also include the points
        classified as noise. Only has an effect if `store_labels` is set to
        True. The default is False.
    as_array : bool, optional
        If set to True, the centres of the clusters and the clustered
        cutpoints are returned as numpy arrays. If set to False, they are
        returned inside instances of `pept.PointData`.
    verbose : bool, optional
        Provide extra information when computing the cutpoints: time the
        operation and show a progress bar. The default is `False`.

    Returns
    -------
    centres : numpy.ndarray or pept.PointData
        The centroids of every cluster found. They are computed as the
        average of every column of `[time, x, y, z, etc]` of the clustered
        points. Another column is added to the initial data in `sample`,
        signifying the cluster size - the number of points included in the
        cluster. If `as_array` is set to True, it is a numpy array,
        otherwise the centres are stored in a pept.PointData instance.
    clustered_cutpoints : numpy.ndarray or pept.PointData
        The points in `sample` that fall in every cluster. A new column is
        added to the points in `sample` that signifies the label of the
        cluster that the point was associated with: all points in cluster
        number 3 will have the number 3 as the last element in their row.
        The points classified as noise have the number -1 associated. If
        `as_array` is set to True, it is a numpy array, otherwise the
        clustered cutpoints are stored in a pept.PointData instance.

    Raises
    ------
    TypeError
        If `sample` is not a numpy array of shape (N, M), where M >= 4.
    '''
    if verbose:
        start = time.time()

    # sample row: [time, x, y, z]
    if sample.ndim != 2 or sample.shape[1] < 4:
        raise TypeError('\n[ERROR]: sample should have two dimensions (N, M), where M >= 4. Received {}\n'.format(sample.shape))

    # Only cluster based on [x, y, z]
    labels = self.clusterer.fit_predict(sample[:, 1:4])
    max_label = labels.max()

    centres = []
    clustered_cutpoints = []

    # The centre of a cluster is the average of the time, x, y, z columns
    # and the number of points of that cluster.
    # centres row: [time, x, y, z, ..etc.., cluster_size]
    for i in range(0, max_label + 1):
        # Average time, x, y, z of cluster of label i
        centres_row = np.mean(sample[labels == i], axis = 0)
        # Append the number of points of label i => cluster_size
        centres_row = np.append(centres_row, (labels == i).sum())
        centres.append(centres_row)

    centres = np.array(centres)

    if not as_array:
        centres = pept.PointData(centres, sample_size = 0,
                                 overlap = 0, verbose = False)

    # Return all cutpoints as a stacked array across labels, where the
    # last column of every row is the label.
    if store_labels:
        # Create a list of numpy arrays with rows: [t, x, y, z, ..etc.., label]
        if noise:
            cutpoints = sample[labels == -1]
            cutpoints = np.insert(cutpoints, cutpoints.shape[1], -1, axis = 1)
            clustered_cutpoints.append(cutpoints)

        for i in range(0, max_label + 1):
            cutpoints = sample[labels == i]
            cutpoints = np.insert(cutpoints, cutpoints.shape[1], i, axis = 1)
            clustered_cutpoints.append(cutpoints)

        clustered_cutpoints = np.vstack(clustered_cutpoints)

        if not as_array:
            clustered_cutpoints = pept.PointData(clustered_cutpoints,
                                                 sample_size = 0,
                                                 overlap = 0,
                                                 verbose = False)

    if verbose:
        end = time.time()
        print("Fitting one sample took {} seconds".format(end - start))

    return [centres, clustered_cutpoints]
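The centroid step above averages every column of the rows in each cluster and appends the cluster size. A self-contained numpy sketch of just that computation, with hypothetical labels standing in for `self.clusterer.fit_predict(sample[:, 1:4])`:

```python
import numpy as np

# Hypothetical sample rows [time, x, y, z] and cluster labels.
sample = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
    [4.0, 7.0, 7.0, 7.0],
])
labels = np.array([0, 0, 1])

# centres row: [time, x, y, z, cluster_size], averaged per label.
centres = []
for i in range(labels.max() + 1):
    row = np.mean(sample[labels == i], axis = 0)     # column-wise average
    row = np.append(row, (labels == i).sum())        # append cluster size
    centres.append(row)
centres = np.array(centres)
```

Note that the time column is averaged along with the coordinates, so each centre also carries a representative timestamp for the cluster.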