Module pept.tracking.peptml.peptml
The peptml package implements a hierarchical density-based clustering algorithm for general Positron Emission Particle Tracking (PEPT).
The package includes tools for PEPT data analysis and visualisation. It exports two main classes that are meant to work together: Cutpoints and HDBSCANClusterer. Cutpoints transforms the LoR data into cutpoints that can then be clustered using HDBSCANClusterer. Cutpoints accepts any instance of a class that inherits from LineData - this is a requirement due to the low-level C-API that it calls. HDBSCANClusterer can work with both PointData instances and typical numpy.ndarrays.
The module aims to provide general classes which can then be used in a script file as the user sees fit. For example scripts, look at the base of the pept library.
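Below is a minimal sketch of the typical pipeline, for orientation only: the input file, the sample_size, the max_distance of 0.1 and the import path are illustrative assumptions, while the class and method names follow this module.

import numpy as np
import pept
from pept.tracking import peptml

# Hypothetical (N, 7) array of LoRs, rows formatted as [time, x1, y1, z1, x2, y2, z2]
lors = np.loadtxt("lors.csv", delimiter = ",")
line_data = pept.LineData(lors, sample_size = 200)

# Transform the LoRs into cutpoints, keeping pairs of lines closer than 0.1
cutpoints = peptml.Cutpoints(line_data, 0.1)

# Cluster every sample of cutpoints; fit_cutpoints returns [centres, clustered_cutpoints]
clusterer = peptml.HDBSCANClusterer(min_cluster_size = 20)
centres, _ = clusterer.fit_cutpoints(cutpoints)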
peptml requires the following packages:
- numpy
- joblib for multithreaded operations (such as cutpoints-finding)
- tqdm for showing progress bars
- plotly >= 4.1.0 for plotly-based plotting
- hdbscan for clustering cutpoints and centres
- time for verbose timing of operations
It was successfully used at the University of Birmingham to analyse real Fluorine-18 tracers in air.
If you use this package, you should cite the following paper: [TODO: paper signature].
Source code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# pept is a Python library that unifies Positron Emission Particle
# Tracking (PEPT) research, including tracking, simulation, data analysis
# and visualisation tools
#
# Copyright (C) 2019 Andrei Leonard Nicusan
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
# File : peptml.py
# License: GNU v3.0
# Author : Andrei Leonard Nicusan <a.l.nicusan@bham.ac.uk>
# Date : 28.08.2019
'''The *peptml* package implements a hierarchical density-based clustering
algorithm for general Positron Emission Particle Tracking (PEPT).
The package includes tools for PEPT data analysis and visualisation. It exports
two main classes that are meant to work together: `Cutpoints`
and `HDBSCANClusterer`. `Cutpoints` transforms the LoR data into *cutpoints* that
can then be clustered using `HDBSCANClusterer`. `Cutpoints` accepts any instance
of a class that inherits from `LineData` - this is a requirement due to the
low-level C-API that it calls. `HDBSCANClusterer` can work with both `PointData`,
as well as typical `numpy.ndarray`s.
The module aims to provide general classes which can
then be used in a script file as the user sees fit. For example scripts,
look at the base of the pept library.
`peptml` requires the following packages:
* **numpy**
* **joblib** for multithreaded operations (such as cutpoints-finding)
* **tqdm** for showing progress bars
* **plotly >= 4.1.0** for plotly-based plotting
* **hdbscan** for clustering cutpoints and centres
* **time** for verbose timing of operations
It was successfully used at the University of Birmingham to analyse real
Fluorine-18 tracers in air.
If you use this package, you should cite
the following paper: [TODO: paper signature].
'''
import time
import sys
import numpy as np
from scipy.spatial import cKDTree
from joblib import Parallel, delayed
from tqdm import tqdm
from plotly.subplots import make_subplots
import plotly.graph_objects as go
# Fix a deprecation warning inside the sklearn library
try:
sys.modules['sklearn.externals.six'] = __import__('six')
sys.modules['sklearn.externals.joblib'] = __import__('joblib')
import hdbscan
except ImportError:
import hdbscan
import pept
from .extensions.find_cutpoints_api import find_cutpoints_api
class Cutpoints(pept.PointData):
'''A class that transforms LoRs into *cutpoints* for clustering.
The `Cutpoints` class transforms LoRs (individual numpy arrays or full `LineData`)
into cutpoints (individual numpy arrays or full `PointData`) that can then be passed
to `HDBSCANClusterer`.
Under typical usage, the class is instantiated with the LoR data (required to be
an instance of `LineData`) and transforms it into an instance of `PointData` that
stores the found cutpoints.
For more control over the operations, the class also provides a static method (i.e.
it can be used without instantiating the class) `find_cutpoints_sample` that
receives a generic numpy array of LoRs (one 'sample') and returns a numpy array
of cutpoints.
Parameters
----------
line_data : instance of pept.LineData
The LoRs for which the cutpoints will be computed. It is required to be an
instance of `pept.LineData`.
max_distance : float
The maximum distance between any two lines so that their cutpoint will be
considered.
cutoffs : list-like of length 6, optional
A list (or equivalent) of the cutoff distances for every axis, formatted as
[x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which
fall within these cutoff distances. The default is None, in which case they
are automatically computed using `get_cutoffs`.
verbose : bool, optional
Provide extra information when computing the cutpoints: time the operation
and show a progress bar. The default is `True`.
Attributes
----------
line_data : instance of pept.LineData
The LoRs for which the cutpoints will be computed. It is required to be an
instance of `pept.LineData`.
max_distance : float
The maximum distance between any two lines so that their cutpoint will be
considered.
cutoffs : list-like of length 6
A list (or equivalent) of the cutoff distances for every axis, formatted as
[x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which
fall within these cutoff distances.
sample_size, overlap, number_of_lines, etc. : inherited from pept.PointData
Extra attributes and methods are inherited from the base class `PointData`.
Raises
------
Exception
If `line_data` is not an instance of `pept.LineData`.
TypeError
If `cutoffs` is not a one-dimensional array with values formatted as
`[min_x, max_x, min_y, max_y, min_z, max_z]`.
Example usage
-------------
Compute the cutpoints for a `LineData` instance:
>>> line_data = pept.LineData(example_data)
>>> cutpts = peptml.Cutpoints(line_data, 0.1)
Compute the cutpoints for a single sample:
>>> sample = line_data[0]
>>> cutpts_sample = peptml.Cutpoints.find_cutpoints_sample(sample, 0.1) # no class instantiation
'''
def __init__(self,
line_data,
max_distance,
cutoffs = None,
verbose = True):
# Find the cutpoints when instantiated. The method
# also initialises the instance as a `PointData` subclass.
self.find_cutpoints(
line_data,
max_distance,
cutoffs = cutoffs,
verbose = verbose
)
@property
def line_data(self):
'''The LoRs for which the cutpoints are computed.
line_data : instance of pept.LineData
'''
return self._line_data
@line_data.setter
def line_data(self, new_line_data):
''' The LoRs for which the cutpoints are computed.
Parameters
----------
line_data : instance of pept.LineData
The LoRs for which the cutpoints will be computed. It is required to be an
instance of `pept.LineData`.
Raises
------
Exception
If `line_data` is not an instance of `pept.LineData`.
'''
# Check new_line_data is an instance (or a subclass!) of pept.LineData
if not isinstance(new_line_data, pept.LineData):
raise Exception('[ERROR]: line_data should be an instance of pept.LineData')
self._line_data = new_line_data
@property
def max_distance(self):
'''The maximum distance between any pair of lines for which the cutpoint is considered.
max_distance : float
'''
return self._max_distance
@max_distance.setter
def max_distance(self, new_max_distance):
'''The maximum distance between any pair of lines for which the cutpoint is considered.
max_distance : float
The maximum distance between any two lines so that their cutpoint will be
considered.
'''
self._max_distance = new_max_distance
@property
def cutoffs(self):
'''Only consider the cutpoints which fall within these cutoff distances.
A list (or equivalent) of the cutoff distances for every axis, formatted as
[x_min, x_max, y_min, y_max, z_min, z_max].
cutoffs : (6) list or equivalent
'''
return self._cutoffs
@cutoffs.setter
def cutoffs(self, new_cutoffs):
'''Only consider the cutpoints which fall within these cutoff distances.
A list (or equivalent) of the cutoff distances for every axis, formatted as
[x_min, x_max, y_min, y_max, z_min, z_max].
Parameters
----------
new_cutoffs : list-like of length 6, optional
A list (or equivalent) of the cutoff distances for every axis, formatted as
[x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which
fall within these cutoff distances. The default is None, in which case they
are automatically computed using `get_cutoffs`.
Raises
------
TypeError
If `cutoffs` is not a one-dimensional array with values formatted as
`[min_x, max_x, min_y, max_y, min_z, max_z]`.
'''
cutoffs = np.asarray(new_cutoffs, order = 'C', dtype = float)
if cutoffs.ndim != 1 or len(cutoffs) != 6:
raise TypeError('\n[ERROR]: new_cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')
self._cutoffs = cutoffs
@staticmethod
def get_cutoffs(sample):
'''Compute the cutoffs from a sample of LoR data.
This is a static method, meaning it can be called without
instantiating the `Cutpoints` class. It computes the cutoffs
from the minimum and maximum values of the LoRs in `sample`
in each dimension.
Parameters
----------
sample : (N, 7) numpy.ndarray
A sample of LoRs, where each row is `[time, x1, y1, z1, x2, y2, z2]`,
such that every line is defined by the points `[x1, y1, z1]` and
`[x2, y2, z2]`.
Returns
-------
cutoffs : (6) numpy.ndarray
The computed cutoffs for each dimension, formatted as `[x_min, x_max,
y_min, y_max, z_min, z_max]`.
Raises
------
TypeError
If `sample` is not a numpy array with shape (N, 7).
'''
# Check sample has shape (N, 7)
if sample.ndim != 2 or sample.shape[1] != 7:
raise TypeError('\n[ERROR]: sample should have dimensions (N, 7). Received {}\n'.format(sample.shape))
# Compute cutoffs for cutpoints as the (min, max) values of the lines
# Minimum value of the two points that define a line
min_x = min(sample[:, 1].min(),
sample[:, 4].min())
# Maximum value of the two points that define a line
max_x = max(sample[:, 1].max(),
sample[:, 4].max())
# Minimum value of the two points that define a line
min_y = min(sample[:, 2].min(),
sample[:, 5].min())
# Maximum value of the two points that define a line
max_y = max(sample[:, 2].max(),
sample[:, 5].max())
# Minimum value of the two points that define a line
min_z = min(sample[:, 3].min(),
sample[:, 6].min())
# Maximum value of the two points that define a line
max_z = max(sample[:, 3].max(),
sample[:, 6].max())
cutoffs = np.array([min_x, max_x, min_y, max_y, min_z, max_z])
return cutoffs
@staticmethod
def find_cutpoints_sample(sample, max_distance, cutoffs = None):
'''Find the cutpoints in a sample of LoRs.
This is a static method, meaning it can be called without
instantiating the `Cutpoints` class. It computes the cutpoints
from a given `sample` that are associated with pairs of lines
closer than `max_distance`.
Parameters
----------
sample : (N, 7) numpy.ndarray
A sample of LoRs, where each row is `[time, x1, y1, z1, x2, y2, z2]`,
such that every line is defined by the points `[x1, y1, z1]` and
`[x2, y2, z2]`.
max_distance : float
The maximum distance between any pair of lines so that their cutpoint
will be considered.
cutoffs : list, optional
The cutoffs for each dimension, formatted as `[x_min, x_max,
y_min, y_max, z_min, z_max]`. If not defined, they are computed
automatically by calling `get_cutoffs`. The default is `None`.
Returns
-------
sample_cutpoints : (M, 4) numpy.ndarray
The computed cutpoints for the given LoRs, where each row is
formatted as `[time, x, y, z]` for every cutpoint.
Raises
------
TypeError
If `sample` is not a numpy array with shape (N, 7).
TypeError
If `cutoffs` is not a `one-dimensional array with values [min_x,
max_x, min_y, max_y, min_z, max_z]`
'''
# Check sample has shape (N, 7)
if sample.ndim != 2 or sample.shape[1] != 7:
raise TypeError('\n[ERROR]: sample should have dimensions (N, 7). Received {}\n'.format(sample.shape))
if cutoffs is None:
cutoffs = Cutpoints.get_cutoffs(sample)
else:
cutoffs = np.asarray(cutoffs, order = 'C', dtype = float)
if cutoffs.ndim != 1 or len(cutoffs) != 6:
raise TypeError('\n[ERROR]: cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')
sample_cutpoints = find_cutpoints_api(sample, max_distance, cutoffs)
return sample_cutpoints
def find_cutpoints(self,
line_data,
max_distance,
cutoffs = None,
verbose = False):
'''Find the cutpoints of the samples in a `LineData` instance.
Parameters
----------
line_data : instance of pept.LineData
The LoRs for which the cutpoints will be computed. It is required to be an
instance of `pept.LineData`.
max_distance : float
The maximum distance between any two lines so that their cutpoint will be
considered.
cutoffs : list-like of length 6, optional
A list (or equivalent) of the cutoff distances for every axis, formatted as
[x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which
fall within these cutoff distances. The default is None, in which case they
are automatically computed using `get_cutoffs`.
verbose : bool, optional
Provide extra information when computing the cutpoints: time the operation
and show a progress bar. The default is `False`.
Returns
-------
self : the PointData instance of cutpoints
The computed cutpoints are stored in the `Cutpoints` class, as a
subclass of `pept.PointData`.
Raises
------
Exception
If `line_data` is not an instance of `pept.LineData`.
TypeError
If `cutoffs` is not a one-dimensional array with values formatted as
`[min_x, max_x, min_y, max_y, min_z, max_z]`.
'''
if verbose:
start = time.time()
# Check line_data is an instance (or a subclass!) of pept.LineData
if not isinstance(line_data, pept.LineData):
raise Exception('[ERROR]: line_data should be an instance of pept.LineData')
self._line_data = line_data
self._max_distance = max_distance
# If cutoffs were not supplied, compute them
if cutoffs is None:
cutoffs = self.get_cutoffs(line_data.line_data)
# Otherwise make sure they are a C-contiguous numpy array
else:
cutoffs = np.asarray(cutoffs, order = 'C', dtype = float)
if cutoffs.ndim != 1 or len(cutoffs) != 6:
raise TypeError('\n[ERROR]: cutoffs should be a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z]\n')
self._cutoffs = cutoffs
# Using joblib, collect the cutpoints from every sample in a list
# of arrays. If verbose, show progress bar using tqdm.
if verbose:
cutpoints = Parallel(n_jobs = -1, prefer = 'threads')(delayed(self.find_cutpoints_sample)(sample, max_distance, cutoffs) for sample in tqdm(line_data))
else:
cutpoints = Parallel(n_jobs = -1, prefer = 'threads')(delayed(self.find_cutpoints_sample)(sample, max_distance, cutoffs) for sample in line_data)
# `cutpoints` is a ragged list of (m, 4) arrays, one per sample, where m is
# the number of cutpoints found in that sample
number_of_samples = len(cutpoints)
cutpoints = np.vstack(cutpoints)
number_of_cutpoints = len(cutpoints)
# Average number of cutpoints per sample
cutpoints_per_sample = int(number_of_cutpoints / number_of_samples)
super().__init__(cutpoints,
sample_size = cutpoints_per_sample,
overlap = 0,
verbose = False)
if verbose:
end = time.time()
print("\n\n\nFinding the cutpoints took {} seconds".format(end - start))
return self
def findMeanError(truePositions, foundPositions):
'''Compute the mean Euclidean error and the per-axis mean absolute errors
between the found tracer positions and their nearest true positions.
Both inputs are (N, 3) arrays of [x, y, z] coordinates; the return value is
[meanError, meanErrorX, meanErrorY, meanErrorZ].
'''
tree = cKDTree(truePositions)
meanError = 0
meanErrorX = 0
meanErrorY = 0
meanErrorZ = 0
n = 0
for centre in foundPositions:
d, index = tree.query(centre, k = 1, n_jobs = -1)
meanError += np.linalg.norm(centre - truePositions[index])
meanErrorX += np.abs(centre[0] - truePositions[index][0])
meanErrorY += np.abs(centre[1] - truePositions[index][1])
meanErrorZ += np.abs(centre[2] - truePositions[index][2])
n += 1
meanError /= n
meanErrorX /= n
meanErrorY /= n
meanErrorZ /= n
return [meanError, meanErrorX, meanErrorY, meanErrorZ]
class HDBSCANClusterer:
'''HDBSCAN-based clustering for cutpoints from LoRs.
This class is a wrapper around the `hdbscan` package, providing tools for
parallel clustering of samples of cutpoints. It can return `PointData`
classes which can be easily manipulated or visualised.
Parameters
----------
min_cluster_size : int, optional
(Taken from hdbscan's documentation): The minimum size of clusters;
single linkage splits that contain fewer points than this will be
considered points “falling out” of a cluster rather than a cluster
splitting into two new clusters. The default is 20.
min_samples : int, optional
(Taken from hdbscan's documentation): The number of samples in a
neighbourhood for a point to be considered a core point. The default
is None, being set automatically to the `min_cluster_size`.
allow_single_cluster : bool, optional
(Taken from hdbscan's documentation): By default HDBSCAN* will not
produce a single cluster, setting this to True will override this and
allow single cluster results in the case that you feel this is a valid
result for your dataset. For PEPT, set this to True if you only have
one tracer in the dataset. Otherwise, leave it to False, as it will
provide higher accuracy.
Attributes
----------
min_cluster_size : int
(Taken from hdbscan's documentation): The minimum size of clusters;
single linkage splits that contain fewer points than this will be
considered points “falling out” of a cluster rather than a cluster
splitting into two new clusters. The default is 20.
min_samples : int
(Taken from hdbscan's documentation): The number of samples in a
neighbourhood for a point to be considered a core point. The default
is None, being set automatically to the `min_cluster_size`.
allow_single_cluster : bool
(Taken from hdbscan's documentation): By default HDBSCAN* will not
produce a single cluster, setting this to True will override this and
allow single cluster results in the case that you feel this is a valid
result for your dataset. For PEPT, set this to True if you only have
one tracer in the dataset. Otherwise, leave it to False, as it will
provide higher accuracy.
'''
def __init__(self,
min_cluster_size = 20,
min_samples = None,
allow_single_cluster = False):
if 0 < min_cluster_size < 2:
print("\n[WARNING]: min_cluster_size was set to 2, as it was {} < 2\n".format(min_cluster_size))
min_cluster_size = 2
self.clusterer = hdbscan.HDBSCAN(min_cluster_size = min_cluster_size,
min_samples = min_samples,
core_dist_n_jobs = -1,
allow_single_cluster = allow_single_cluster)
@property
def min_cluster_size(self):
return self.clusterer.min_cluster_size
@min_cluster_size.setter
def min_cluster_size(self, new_min_cluster_size):
self.clusterer.min_cluster_size = new_min_cluster_size
@property
def min_samples(self):
return self.clusterer.min_samples
@min_samples.setter
def min_samples(self, new_min_samples):
self.clusterer.min_samples = new_min_samples
@property
def allow_single_cluster(self):
return self.clusterer.allow_single_cluster
@allow_single_cluster.setter
def allow_single_cluster(self, option):
self.clusterer.allow_single_cluster = option
def fit_sample(self,
sample,
store_labels = False,
noise = False,
as_array = True,
verbose = False):
'''Fit one sample of cutpoints and return the cluster centres and
(optionally) the labelled cutpoints.
Parameters
----------
sample : (N, M >= 4) numpy.ndarray
The sample of points that will be clustered. Every point corresponds to
a row and is formatted as `[time, x, y, z, etc]`. Only columns `[1, 2, 3]`
are used for clustering.
store_labels : bool, optional
If set to True, the clustered cutpoints are returned along with the centres
of the clusters. Setting it to False speeds up the clustering. The default
is False.
noise : bool, optional
If set to True, the clustered cutpoints also include the points classified
as noise. Only has an effect if `store_labels` is set to True. The default
is False.
as_array : bool, optional
If set to True, the centres of the clusters and the clustered cutpoints are
returned as numpy arrays. If set to False, they are returned inside
instances of `pept.PointData`.
verbose : bool, optional
Provide extra information when computing the cutpoints: time the operation
and show a progress bar. The default is `False`.
Returns
-------
centres : numpy.ndarray or pept.PointData
The centroids of every cluster found. They are computed as the average
of every column of `[time, x, y, z, etc]` of the clustered points. Another
column is added to the initial data in `sample`, signifying the cluster
size - the number of points included in the cluster. If `as_array` is
set to True, it is a numpy array, otherwise the centres are stored
in a pept.PointData instance.
clustered_cutpoints : numpy.ndarray or pept.PointData
The points in `sample` that fall in every cluster. A new column is added
to the points in `sample` that signifies the label of cluster that the
point was associated with: all points in cluster number 3 will have the
number 3 as the last element in their row. The points classified as noise
have the number -1 associated. If `as_array` is set to True, it is a numpy
array, otherwise the clustered cutpoints are stored in a pept.PointData
instance.
Raises
------
TypeError
If `sample` is not a numpy array of shape (N, M), where M >= 4.
'''
if verbose:
start = time.time()
# sample row: [time, x, y, z]
if sample.ndim != 2 or sample.shape[1] < 4:
raise TypeError('\n[ERROR]: sample should have two dimensions (N, M), where M >= 4. Received {}\n'.format(sample.shape))
# Only cluster based on [x, y, z]
labels = self.clusterer.fit_predict(sample[:, 1:4])
max_label = labels.max()
centres = []
clustered_cutpoints = []
# the centre of a cluster is the average of the time, x, y, z columns
# and the number of points of that cluster
# centres row: [time, x, y, z, ..etc.., cluster_size]
for i in range(0, max_label + 1):
# Average time, x, y, z of cluster of label i
centres_row = np.mean(sample[labels == i], axis = 0)
# Append the number of points of label i => cluster_size
centres_row = np.append(centres_row, (labels == i).sum())
centres.append(centres_row)
centres = np.array(centres)
if not as_array:
centres = pept.PointData(centres,
sample_size = 0,
overlap = 0,
verbose = False)
# Return all cutpoints as a list of numpy arrays for every label
# where the last column of an array is the label
if store_labels:
# Create a list of numpy arrays with rows: [t, x, y, z, ..etc.., label]
if noise:
cutpoints = sample[labels == -1]
cutpoints = np.insert(cutpoints, cutpoints.shape[1], -1, axis = 1)
clustered_cutpoints.append(cutpoints)
for i in range(0, max_label + 1):
cutpoints = sample[labels == i]
cutpoints = np.insert(cutpoints, cutpoints.shape[1], i, axis = 1)
clustered_cutpoints.append(cutpoints)
clustered_cutpoints = np.vstack(np.array(clustered_cutpoints))
if not as_array:
clustered_cutpoints = pept.PointData(clustered_cutpoints,
sample_size = 0,
overlap = 0,
verbose = False)
if verbose:
end = time.time()
print("Fitting one sample took {} seconds".format(end - start))
return [centres, clustered_cutpoints]
def fit_cutpoints(self,
cutpoints,
store_labels = False,
noise = False,
verbose = True):
'''Fit cutpoints (an instance of `PointData`) and return the cluster
centres and (optionally) the labelled cutpoints.
Parameters
----------
cutpoints : an instance of `pept.PointData`
The samples of points that will be clustered. In every sample, every point
corresponds to a row and is formatted as `[time, x, y, z, etc]`. Only
columns `[1, 2, 3]` are used for clustering.
store_labels : bool, optional
If set to True, the clustered cutpoints are returned along with the centres
of the clusters. Setting it to False speeds up the clustering. The default
is False.
noise : bool, optional
If set to True, the clustered cutpoints also include the points classified
as noise. Only has an effect if `store_labels` is set to True. The default
is False.
verbose : bool, optional
Provide extra information when computing the cutpoints: time the operation
and show a progress bar. The default is `False`.
Returns
-------
centres : pept.PointData
The centroids of every cluster found. They are computed as the average
of every column of `[time, x, y, z, etc]` of the clustered points. Another
column is added to the initial data in `sample`, signifying the cluster
size - the number of points included in the cluster.
clustered_cutpoints : numpy.ndarray or pept.PointData
The points in `sample` that fall in every cluster. A new column is added
to the points in `sample` that signifies the label of cluster that the
point was associated with: all points in cluster number 3 will have the
number 3 as the last element in their row. The points classified as noise
have the number -1 associated.
Raises
------
Exception
If `cutpoints` is not an instance (or a subclass) of `pept.PointData`.
'''
if verbose:
start = time.time()
if not isinstance(cutpoints, pept.PointData):
raise Exception('[ERROR]: cutpoints should be an instance of pept.PointData (or any class inheriting from it)')
# Fit all samples in `cutpoints` in parallel using joblib
# Collect all outputs as a list. If verbose, show progress bar with
# tqdm
if verbose:
data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample,
store_labels = store_labels,
noise = noise,
as_array = True) for sample in tqdm(cutpoints))
else:
data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample,
store_labels = store_labels,
noise = noise,
as_array = True) for sample in cutpoints)
# Access joblib.Parallel output as list comprehensions
centres = np.array([row[0] for row in data_list if len(row[0]) != 0])
if len(centres) != 0:
centres = pept.PointData(np.vstack(centres),
sample_size = 0,
overlap = 0,
verbose = False)
if store_labels:
clustered_cutpoints = np.array([row[1] for row in data_list if len(row[1]) != 0])
clustered_cutpoints = pept.PointData(np.vstack(np.array(clustered_cutpoints)),
sample_size = 0,
overlap = 0,
verbose = False)
if verbose:
end = time.time()
print("Fitting cutpoints took {} seconds".format(end - start))
if store_labels:
return [centres, clustered_cutpoints]
else:
return [centres, []]
class TrajectorySeparation:
def __init__(self, centres, pointsToCheck = 25, maxDistance = 20, maxClusterDiff = 500):
# centres row: [time, x, y, z, clusterSize]
# Make sure the trajectory is memory-contiguous for efficient
# KDTree partitioning
self.centres = np.ascontiguousarray(centres)
self.pointsToCheck = pointsToCheck
self.maxDistance = maxDistance
self.maxClusterDiff = maxClusterDiff
# For every point in centres, save a set of the trajectory
# indices of the trajectories that they are part of
# eg. centres[2] is part of trajectories 0 and 1 =>
# trajectoryIndices[2] = {0, 1}
# Initialise a vector of empty sets of size len(centres)
self.trajectoryIndices = np.array([ set() for i in range(len(self.centres)) ])
# For every trajectory found, save a list of the indices of
# the centres that are part of that trajectory
# eg. trajectory 1 is comprised of centres 3, 5 and 8 =>
# centresIndices[1] = [3, 5, 8]
self.centresIndices = [[]]
# Maximum trajectory index
self.maxIndex = 0
def findTrajectories(self):
for i, currentPoint in enumerate(self.centres):
if i == 0:
# Add the first point to trajectory 0
self.trajectoryIndices[0].add(self.maxIndex)
self.centresIndices[self.maxIndex].append(0)
self.maxIndex += 1
continue
# Search for the closest previous pointsToCheck points
# within a given maxDistance
startIndex = i - self.pointsToCheck
endIndex = i
if startIndex < 0:
startIndex = 0
# Construct a KDTree from the x, y, z (1:4) of the
# selected points. Get the indices for all the points within
# maxDistance of the currentPoint
tree = cKDTree(self.centres[startIndex:endIndex, 1:4])
closestIndices = tree.query_ball_point(currentPoint[1:4], self.maxDistance, n_jobs=-1)
closestIndices = np.array(closestIndices) + startIndex
# If no point was found, it is a new trajectory. Continue
if len(closestIndices) == 0:
self.trajectoryIndices[i].add(self.maxIndex)
self.centresIndices.append([i])
self.maxIndex += 1
continue
# For every close point found, search for all the trajectory indices
# - If all trajectory indices sets are equal and of a single value
# then currentPoint is part of the same trajectory
# - If all trajectory indices sets are equal, but of more values,
# then currentPoint diverged from an intersection of trajectories
# and is part of a single trajectory => separate it
#
# - If every pair of trajectory indices sets is not disjoint, then
# currentPoint is only one of them
# - If there exists a pair of trajectory indices sets that is
# disjoint, then currentPoint is part of all of them
# Select the trajectories of all the points that were found
# to be the closest
closestTrajectories = self.trajectoryIndices[closestIndices]
#print("closestTrajectories:")
#print(closestTrajectories)
# If all the closest points are part of the same trajectory
# (just one!), then the currentPoint is part of it too
if (np.all(closestTrajectories == closestTrajectories[0]) and
len(closestTrajectories[0]) == 1):
self.trajectoryIndices[i] = closestTrajectories[0]
self.centresIndices[ next(iter(closestTrajectories[0])) ].append(i)
continue
# Otherwise, check the points based on their cluster size
else:
# Create a list of all the trajectories that were found to
# intersect
#print('\nIntersection:')
closestTrajIndices = list( set().union(*closestTrajectories) )
#print("ClosestTrajIndices:")
#print(closestTrajIndices)
# For each close trajectory, calculate the mean cluster size
# of the last lastPoints points
lastPoints = 50
# Keep track of the mean cluster size that is the closest to
# the currentPoint's clusterSize
currentClusterSize = currentPoint[4]
#print("currentClusterSize = {}".format(currentClusterSize))
closestTrajIndex = -1
clusterSizeDiff = self.maxClusterDiff
for trajIndex in closestTrajIndices:
#print("trajIndex = {}".format(trajIndex))
trajCentres = self.centres[ self.centresIndices[trajIndex] ]
#print("trajCentres:")
#print(trajCentres)
meanClusterSize = trajCentres[-lastPoints:][:, 4].mean()
#print("meanClusterSize = {}".format(meanClusterSize))
#print("clusterSizeDiff = {}".format(clusterSizeDiff))
#print("abs diff = {}".format(np.abs( currentClusterSize - meanClusterSize )))
if np.abs( currentClusterSize - meanClusterSize ) < clusterSizeDiff:
closestTrajIndex = trajIndex
clusterSizeDiff = np.abs( currentClusterSize - meanClusterSize )
if closestTrajIndex == -1:
#self.trajectoryIndices[i] = set(closestTrajIndices)
#for trajIndex in closestTrajIndices:
# self.centresIndices[trajIndex].append(i)
print("\n**** -1 ****\n")
break
else:
#print("ClosestTrajIndex found = {}".format(closestTrajIndex))
self.trajectoryIndices[i] = set([closestTrajIndex])
self.centresIndices[closestTrajIndex].append(i)
'''
# If the current point is not part of any trajectory, assign it
# the maxIndex and increment it
if len(self.trajectoryIndices[i]) == 0:
self.trajectoryIndices[i].append(self.maxIndex)
self.maxIndex += 1
print(self.trajectoryIndices[i])
print(self.maxIndex)
# Construct a KDTree from the numberOfPoints in front of
# the current point
tree = cKDTree(self.trajectory[(i + 1):(i + self.numberOfPoints + 2)][1:4])
# For every trajectory that the current point is part of,
# find the closest points in front of it
numberOfIntersections = len(self.trajectoryIndices[i])
dist, nextPointsIndices = tree.query(currentPoint, k=numberOfIntersections, distance_upper_bound=self.maxDistance, n_jobs=-1)
print(nextPointsIndices)
# If the current point is part of more trajectories,
# an intersection happened. Call subroutine to part
# the trajectories
if numberOfIntersections > 1:
for j in range(0, len(self.trajectoryIndices[i])):
trajIndex = self.trajectoryIndices[i][j]
self.trajectoryIndices[i + 1 + nextPointsIndices[j]].append(trajIndex)
else:
self.trajectoryIndices[i + 1 + nextPointsIndices].append(self.trajectoryIndices[i][0])
print(self.trajectoryIndices)
'''
def getTrajectories(self):
self.individualTrajectories = []
for trajCentres in self.centresIndices:
self.individualTrajectories.append(self.centres[trajCentres])
self.individualTrajectories = np.array(self.individualTrajectories)
return self.individualTrajectories
'''
self.individualTrajectories = [ [] for i in range(0, self.maxIndex + 1) ]
for i in range(0, len(self.trajectoryIndices)):
for trajIndex in self.trajectoryIndices[i]:
self.individualTrajectories[trajIndex].append(self.centres[i])
self.individualTrajectories = np.array(self.individualTrajectories)
for i in range(len(self.individualTrajectories)):
if len(self.individualTrajectories[i]) > 0:
self.individualTrajectories[i] = np.vstack(self.individualTrajectories[i])
return self.individualTrajectories
'''
def plotTrajectoriesAltAxes(self, ax):
trajectories = self.getTrajectories()
for traj in trajectories:
if len(traj) > 0:
ax.scatter(traj[:, 3], traj[:, 1], traj[:, 2], marker='D', s=10)
Functions
def findMeanError(truePositions, foundPositions)
- Compute the mean Euclidean error and the per-axis mean absolute errors between the found tracer positions and their nearest true positions, returned as [meanError, meanErrorX, meanErrorY, meanErrorZ].
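A small usage sketch for findMeanError; the positions below are made up purely for illustration, and the import path follows the module name at the top of this page.

import numpy as np
from pept.tracking.peptml import peptml

true_positions = np.array([[ 0.0,  0.0,  0.0],
                           [10.0, 10.0, 10.0]])
found_positions = np.array([[ 0.1, -0.1,  0.0],
                            [ 9.8, 10.2, 10.1]])

# Mean Euclidean error, then the mean absolute error along each of x, y, z
mean_err, err_x, err_y, err_z = peptml.findMeanError(true_positions, found_positions)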
Classes
class Cutpoints (line_data, max_distance, cutoffs=None, verbose=True)
A class that transforms LoRs into cutpoints for clustering.
The Cutpoints class transforms LoRs (individual numpy arrays or full LineData) into cutpoints (individual numpy arrays or full PointData) that can then be passed to HDBSCANClusterer.
Under typical usage, the class is instantiated with the LoR data (required to be an instance of LineData) and transforms it into an instance of PointData that stores the found cutpoints.
For more control over the operations, the class also provides a static method (i.e. it can be used without instantiating the class) find_cutpoints_sample that receives a generic numpy array of LoRs (one 'sample') and returns a numpy array of cutpoints.
Parameters
line_data : instance of pept.LineData
- The LoRs for which the cutpoints will be computed. It is required to be an instance of pept.LineData.
max_distance : float
- The maximum distance between any two lines so that their cutpoint will be considered.
cutoffs : list-like of length 6, optional
- A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which fall within these cutoff distances. The default is None, in which case they are automatically computed using get_cutoffs.
verbose : bool, optional
- Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is True.
Attributes
line_data : instance of pept.LineData
- The LoRs for which the cutpoints will be computed. It is required to be an instance of pept.LineData.
max_distance : float
- The maximum distance between any two lines so that their cutpoint will be considered.
cutoffs : list-like of length 6
- A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which fall within these cutoff distances.
sample_size, overlap, number_of_lines, etc. : inherited from pept.PointData
- Extra attributes and methods are inherited from the base class PointData.
Raises
Exception
- If line_data is not an instance of pept.LineData.
TypeError
- If cutoffs is not a one-dimensional array with values formatted as [min_x, max_x, min_y, max_y, min_z, max_z].
Example usage
Compute the cutpoints for a LineData instance:
>>> line_data = pept.LineData(example_data)
>>> cutpts = peptml.Cutpoints(line_data, 0.1)
Compute the cutpoints for a single sample:
>>> sample = line_data[0]
>>> cutpts_sample = peptml.Cutpoints.find_cutpoints_sample(sample, 0.1)  # no class instantiation
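Because Cutpoints subclasses pept.PointData, the computed cutpoints can also be iterated sample by sample, which is what the clusterer does internally; the max_distance of 0.1 is illustrative.

cutpts = peptml.Cutpoints(line_data, 0.1)

# Every sample is a numpy array of cutpoints, one [time, x, y, z] row per point
for sample in cutpts:
    print(sample.shape)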
Ancestors
- pept.PointData
Static methods
def find_cutpoints_sample(sample, max_distance, cutoffs=None)
Find the cutpoints in a sample of LoRs.
This is a static method, meaning it can be called without instantiating the Cutpoints class. It computes the cutpoints from a given sample that are associated with pairs of lines closer than max_distance.
Parameters
sample : (N, 7) numpy.ndarray
- A sample of LoRs, where each row is [time, x1, y1, z1, x2, y2, z2], such that every line is defined by the points [x1, y1, z1] and [x2, y2, z2].
max_distance : float
- The maximum distance between any pair of lines so that their cutpoint will be considered.
cutoffs : list, optional
- The cutoffs for each dimension, formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. If not defined, they are computed automatically by calling get_cutoffs. The default is None.
Returns
sample_cutpoints : (M, 4) numpy.ndarray
- The computed cutpoints for the given LoRs, where each row is formatted as [time, x, y, z] for every cutpoint.
Raises
TypeError
- If sample is not a numpy array with shape (N, 7).
TypeError
- If cutoffs is not a one-dimensional array with values [min_x, max_x, min_y, max_y, min_z, max_z].
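As a toy illustration (the coordinates are invented and the exact output depends on the compiled find_cutpoints_api routine), two nearly-intersecting synthetic LoRs should yield a single cutpoint roughly midway between them:

import numpy as np
from pept.tracking.peptml import peptml

# Two synthetic LoRs: one along the x axis at z = 0, one along the y axis at z = 0.05
sample = np.array([
    [0.0, -1.0,  0.0, 0.0 ,  1.0, 0.0, 0.0 ],
    [1.0,  0.0, -1.0, 0.05,  0.0, 1.0, 0.05],
])

cutpts = peptml.Cutpoints.find_cutpoints_sample(sample, max_distance = 0.2)
# Expected: one row [time, x, y, z], close to [0.5, 0.0, 0.0, 0.025]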
def get_cutoffs(sample)
Compute the cutoffs from a sample of LoR data.
This is a static method, meaning it can be called without instantiating the Cutpoints class. It computes the cutoffs from the minimum and maximum values of the LoRs in sample in each dimension.
Parameters
sample : (N, 7) numpy.ndarray
- A sample of LoRs, where each row is [time, x1, y1, z1, x2, y2, z2], such that every line is defined by the points [x1, y1, z1] and [x2, y2, z2].
Returns
cutoffs : (6) numpy.ndarray
- The computed cutoffs for each dimension, formatted as [x_min, x_max, y_min, y_max, z_min, z_max].
Raises
TypeError
- If sample is not a numpy array with shape (N, 7).
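For the two synthetic LoRs from the find_cutpoints_sample sketch above, the cutoffs are simply the per-axis extremes of the line end points:

cutoffs = peptml.Cutpoints.get_cutoffs(sample)
# -> array([-1.  ,  1.  , -1.  ,  1.  ,  0.  ,  0.05]), i.e. [x_min, x_max, y_min, y_max, z_min, z_max]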
Instance variables
var cutoffs
- Only consider the cutpoints which fall within these cutoff distances. A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max].
cutoffs : (6) list or equivalent
var line_data
- The LoRs for which the cutpoints are computed.
line_data : instance of pept.LineData
var max_distance
- The maximum distance between any pair of lines for which the cutpoint is considered.
max_distance : float
Methods
def find_cutpoints(self, line_data, max_distance, cutoffs=None, verbose=False)
Find the cutpoints of the samples in a LineData instance.
Parameters
line_data : instance of pept.LineData
- The LoRs for which the cutpoints will be computed. It is required to be an instance of pept.LineData.
max_distance : float
- The maximum distance between any two lines so that their cutpoint will be considered.
cutoffs : list-like of length 6, optional
- A list (or equivalent) of the cutoff distances for every axis, formatted as [x_min, x_max, y_min, y_max, z_min, z_max]. Only consider the cutpoints which fall within these cutoff distances. The default is None, in which case they are automatically computed using get_cutoffs.
verbose : bool, optional
- Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is False.
Returns
self : the PointData instance of cutpoints
- The computed cutpoints are stored in the Cutpoints class, as a subclass of pept.PointData.
Raises
Exception
- If line_data is not an instance of pept.LineData.
TypeError
- If cutoffs is not a one-dimensional array with values formatted as [min_x, max_x, min_y, max_y, min_z, max_z].
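Since find_cutpoints re-initialises the underlying PointData, it can also be called again on an existing Cutpoints instance to recompute the cutpoints with different settings; the values below are illustrative.

cutpts = peptml.Cutpoints(line_data, 0.1)

# Recompute with a larger max_distance and explicit cutoffs
cutpts.find_cutpoints(line_data, 0.2, cutoffs = [-100, 100, -100, 100, -100, 100])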
Inherited members
class HDBSCANClusterer (min_cluster_size=20, min_samples=None, allow_single_cluster=False)
-
HDBSCAN-based clustering for cutpoints from LoRs.
This class is a wrapper around the `hdbscan` package, providing tools for
parallel clustering of samples of cutpoints. It can return `PointData`
classes which can be easily manipulated or visualised.
Parameters
min_cluster_size : int, optional
    (Taken from hdbscan's documentation): The minimum size of clusters; single
    linkage splits that contain fewer points than this will be considered
    points “falling out” of a cluster rather than a cluster splitting into two
    new clusters. The default is 20.
min_samples : int, optional
    (Taken from hdbscan's documentation): The number of samples in a
    neighbourhood for a point to be considered a core point. The default is
    None, being set automatically to the `min_cluster_size`.
allow_single_cluster : bool, optional
    (Taken from hdbscan's documentation): By default HDBSCAN* will not produce
    a single cluster; setting this to True will override this and allow single
    cluster results in the case that you feel this is a valid result for your
    dataset. For PEPT, set this to True if you only have one tracer in the
    dataset. Otherwise, leave it to False, as it will provide higher accuracy.
Attributes
min_cluster_size : int
    (Taken from hdbscan's documentation): The minimum size of clusters; single
    linkage splits that contain fewer points than this will be considered
    points “falling out” of a cluster rather than a cluster splitting into two
    new clusters. The default is 20.
min_samples : int
    (Taken from hdbscan's documentation): The number of samples in a
    neighbourhood for a point to be considered a core point. The default is
    None, being set automatically to the `min_cluster_size`.
allow_single_cluster : bool
    (Taken from hdbscan's documentation): By default HDBSCAN* will not produce
    a single cluster; setting this to True will override this and allow single
    cluster results in the case that you feel this is a valid result for your
    dataset. For PEPT, set this to True if you only have one tracer in the
    dataset. Otherwise, leave it to False, as it will provide higher accuracy.
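Usage sketch (illustrative): constructing the clusterer with the parameters documented above; the numeric values are placeholders to be tuned to the density of your cutpoints.

from pept.tracking import peptml

# Single-tracer configuration: allow_single_cluster = True lets HDBSCAN*
# return a single cluster per sample of cutpoints
clusterer = peptml.HDBSCANClusterer(
    min_cluster_size = 30,
    allow_single_cluster = True
)

# The underlying hdbscan settings remain adjustable through the
# properties listed under "Instance variables" below
clusterer.min_cluster_size = 40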
Source code
class HDBSCANClusterer: '''HDBSCAN-based clustering for cutpoints from LoRs. This class is a wrapper around the `hdbscan` package, providing tools for parallel clustering of samples of cutpoints. It can return `PointData` classes which can be easily manipulated or visualised. Parameters ---------- min_cluster_size : int, optional (Taken from hdbscan's documentation): The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters. The default is 20. min_samples : int, optional (Taken from hdbscan's documentation): The number of samples in a neighbourhood for a point to be considered a core point. The default is None, being set automatically to the `min_cluster_size`. allow_single_cluster : bool, optional (Taken from hdbscan's documentation): By default HDBSCAN* will not produce a single cluster, setting this to True will override this and allow single cluster results in the case that you feel this is a valid result for your dataset. For PEPT, set this to True if you only have one tracer in the dataset. Otherwise, leave it to False, as it will provide higher accuracy. Attributes ---------- min_cluster_size : int (Taken from hdbscan's documentation): The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters. The default is 20. min_samples : int (Taken from hdbscan's documentation): The number of samples in a neighbourhood for a point to be considered a core point. The default is None, being set automatically to the `min_cluster_size`. allow_single_cluster : bool (Taken from hdbscan's documentation): By default HDBSCAN* will not produce a single cluster, setting this to True will override this and allow single cluster results in the case that you feel this is a valid result for your dataset. For PEPT, set this to True if you only have one tracer in the dataset. Otherwise, leave it to False, as it will provide higher accuracy. ''' def __init__(self, min_cluster_size = 20, min_samples = None, allow_single_cluster = False): if 0 < min_cluster_size < 2: print("\n[WARNING]: min_cluster_size was set to 2, as it was {} < 2\n".format(min_cluster_size)) min_cluster_size = 2 self.clusterer = hdbscan.HDBSCAN(min_cluster_size = min_cluster_size, min_samples = min_samples, core_dist_n_jobs = -1, allow_single_cluster = allow_single_cluster) @property def min_cluster_size(self): return self.clusterer.min_cluster_size @min_cluster_size.setter def min_cluster_size(self, new_min_cluster_size): self.clusterer.min_cluster_size = new_min_cluster_size @property def min_samples(self): return self.clusterer.min_cluster_size @min_samples.setter def min_samples(self, new_min_samples): self.clusterer.min_samples = new_min_samples @property def allow_single_cluster(self): return self.clusterer.allow_single_cluster @allow_single_cluster.setter def allow_single_cluster(self, option): self.clusterer.allow_single_cluster = option def fit_sample(self, sample, store_labels = False, noise = False, as_array = True, verbose = False): '''Fit one sample of cutpoints and return the cluster centres and (optionally) the labelled cutpoints. Parameters ---------- sample : (N, M >= 4) numpy.ndarray The sample of points that will be clustered. Every point corresponds to a row and is formatted as `[time, x, y, z, etc]`. Only columns `[1, 2, 3]` are used for clustering. 
store_labels : bool, optional If set to True, the clustered cutpoints are returned along with the centres of the clusters. Setting it to False speeds up the clustering. The default is False. noise : bool, optional If set to True, the clustered cutpoints also include the points classified as noise. Only has an effect if `store_labels` is set to True. The default is False. as_array : bool, optional If set to True, the centres of the clusters and the clustered cutpoints are returned as numpy arrays. If set to False, they are returned inside instances of `pept.PointData`. verbose : bool, optional Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is `False`. Returns ------- centres : numpy.ndarray or pept.PointData The centroids of every cluster found. They are computed as the average of every column of `[time, x, y, z, etc]` of the clustered points. Another column is added to the initial data in `sample`, signifying the cluster size - the number of points included in the cluster. If `as_array` is set to True, it is a numpy array, otherwise the centres are stored in a pept.PointData instance. clustered_cutpoints : numpy.ndarray or pept.PointData The points in `sample` that fall in every cluster. A new column is added to the points in `sample` that signifies the label of cluster that the point was associated with: all points in cluster number 3 will have the number 3 as the last element in their row. The points classified as noise have the number -1 associated. If `as_array` is set to True, it is a numpy array, otherwise the clustered cutpoints are stored in a pept.PointData instance. Raises ------ TypeError If `sample` is not a numpy array of shape (N, M), where M >= 4. ''' if verbose: start = time.time() # sample row: [time, x, y, z] if sample.ndim != 2 or sample.shape[1] < 4: raise TypeError('\n[ERROR]: sample should have two dimensions (M, N), where N >= 4. 
Received {}\n'.format(sample.shape)) # Only cluster based on [x, y, z] labels = self.clusterer.fit_predict(sample[:, 1:4]) max_label = labels.max() centres = [] clustered_cutpoints = [] # the centre of a cluster is the average of the time, x, y, z columns # and the number of points of that cluster # centres row: [time, x, y, z, ..etc.., cluster_size] for i in range(0, max_label + 1): # Average time, x, y, z of cluster of label i centres_row = np.mean(sample[labels == i], axis = 0) # Append the number of points of label i => cluster_size centres_row = np.append(centres_row, (labels == i).sum()) centres.append(centres_row) centres = np.array(centres) if not as_array: centres = pept.PointData(centres, sample_size = 0, overlap = 0, verbose = False) # Return all cutpoints as a list of numpy arrays for every label # where the last column of an array is the label if store_labels: # Create a list of numpy arrays with rows: [t, x, y, z, ..etc.., label] if noise: cutpoints = sample[labels == -1] cutpoints = np.insert(cutpoints, cutpoints.shape[1], -1, axis = 1) clustered_cutpoints.append(cutpoints) for i in range(0, max_label + 1): cutpoints = sample[labels == i] cutpoints = np.insert(cutpoints, cutpoints.shape[1], i, axis = 1) clustered_cutpoints.append(cutpoints) clustered_cutpoints = np.vstack(np.array(clustered_cutpoints)) if not as_array: clustered_cutpoints = pept.PointData(clustered_cutpoints, sample_size = 0, overlap = 0, verbose = False) if verbose: end = time.time() print("Fitting one sample took {} seconds".format(end - start)) return [centres, clustered_cutpoints] def fit_cutpoints(self, cutpoints, store_labels = False, noise = False, verbose = True): '''Fit cutpoints (an instance of `PointData`) and return the cluster centres and (optionally) the labelled cutpoints. Parameters ---------- cutpoints : an instance of `pept.PointData` The samples of points that will be clustered. In every sample, every point corresponds to a row and is formatted as `[time, x, y, z, etc]`. Only columns `[1, 2, 3]` are used for clustering. store_labels : bool, optional If set to True, the clustered cutpoints are returned along with the centres of the clusters. Setting it to False speeds up the clustering. The default is False. noise : bool, optional If set to True, the clustered cutpoints also include the points classified as noise. Only has an effect if `store_labels` is set to True. The default is False. verbose : bool, optional Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is `False`. Returns ------- centres : pept.PointData The centroids of every cluster found. They are computed as the average of every column of `[time, x, y, z, etc]` of the clustered points. Another column is added to the initial data in `sample`, signifying the cluster size - the number of points included in the cluster. clustered_cutpoints : numpy.ndarray or pept.PointData The points in `sample` that fall in every cluster. A new column is added to the points in `sample` that signifies the label of cluster that the point was associated with: all points in cluster number 3 will have the number 3 as the last element in their row. The points classified as noise have the number -1 associated. Raises ------ Exception If `cutpoints` is not an instance (or a subclass) of `pept.PointData`. 
''' if verbose: start = time.time() if not isinstance(cutpoints, pept.PointData): raise Exception('[ERROR]: cutpoints should be an instance of pept.PointData (or any class inheriting from it)') # Fit all samples in `cutpoints` in parallel using joblib # Collect all outputs as a list. If verbose, show progress bar with # tqdm if verbose: data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample, store_labels = store_labels, noise = noise, as_array = True) for sample in tqdm(cutpoints)) else: data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample, store_labels = store_labels, noise = noise, as_array = True) for sample in cutpoints) # Access joblib.Parallel output as list comprehensions centres = np.array([row[0] for row in data_list if len(row[0]) != 0]) if len(centres) != 0: centres = pept.PointData(np.vstack(centres), sample_size = 0, overlap = 0, verbose = False) if store_labels: clustered_cutpoints = np.array([row[1] for row in data_list if len(row[1]) != 0]) clustered_cutpoints = pept.PointData(np.vstack(np.array(clustered_cutpoints)), sample_size = 0, overlap = 0, verbose = False) if verbose: end = time.time() print("Fitting cutpoints took {} seconds".format(end - start)) if store_labels: return [centres, clustered_cutpoints] else: return [centres, []]
Instance variables
var allow_single_cluster
-
Source code
@property
def allow_single_cluster(self):
    return self.clusterer.allow_single_cluster
var min_cluster_size
-
Source code
@property
def min_cluster_size(self):
    return self.clusterer.min_cluster_size
var min_samples
-
Source code
@property
def min_samples(self):
    return self.clusterer.min_samples
Methods
def fit_cutpoints(self, cutpoints, store_labels=False, noise=False, verbose=True)
-
Fit cutpoints (an instance of `PointData`) and return the cluster centres and
(optionally) the labelled cutpoints.
Parameters
cutpoints : an instance of pept.PointData
    The samples of points that will be clustered. In every sample, every point
    corresponds to a row and is formatted as `[time, x, y, z, etc]`. Only
    columns `[1, 2, 3]` are used for clustering.
store_labels : bool, optional
    If set to True, the clustered cutpoints are returned along with the
    centres of the clusters. Setting it to False speeds up the clustering.
    The default is False.
noise : bool, optional
    If set to True, the clustered cutpoints also include the points classified
    as noise. Only has an effect if `store_labels` is set to True. The default
    is False.
verbose : bool, optional
    Provide extra information when fitting the cutpoints: time the operation
    and show a progress bar. The default is `True`.
Returns
centres : pept.PointData
    The centroids of every cluster found. They are computed as the average of
    every column of `[time, x, y, z, etc]` of the clustered points. Another
    column is added to the initial data in `sample`, signifying the cluster
    size - the number of points included in the cluster.
clustered_cutpoints : numpy.ndarray or pept.PointData
    The points in `sample` that fall in every cluster. A new column is added
    to the points in `sample` that signifies the label of the cluster that the
    point was associated with: all points in cluster number 3 will have the
    number 3 as the last element in their row. The points classified as noise
    have the number -1 associated.
Raises
Exception
    If `cutpoints` is not an instance (or a subclass) of `pept.PointData`.
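Usage sketch, continuing the illustrative examples above (`clusterer` and `cutpoints` are assumed to have been created as in the earlier sketches):

# Cluster every sample of cutpoints and also return the labelled points
centres, clustered = clusterer.fit_cutpoints(
    cutpoints,
    store_labels = True,
    noise = False,
    verbose = True
)

# centres rows:   [time, x, y, z, ..., cluster_size]
# clustered rows: [time, x, y, z, ..., cluster_label]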
Source code
def fit_cutpoints(self, cutpoints, store_labels = False, noise = False, verbose = True): '''Fit cutpoints (an instance of `PointData`) and return the cluster centres and (optionally) the labelled cutpoints. Parameters ---------- cutpoints : an instance of `pept.PointData` The samples of points that will be clustered. In every sample, every point corresponds to a row and is formatted as `[time, x, y, z, etc]`. Only columns `[1, 2, 3]` are used for clustering. store_labels : bool, optional If set to True, the clustered cutpoints are returned along with the centres of the clusters. Setting it to False speeds up the clustering. The default is False. noise : bool, optional If set to True, the clustered cutpoints also include the points classified as noise. Only has an effect if `store_labels` is set to True. The default is False. verbose : bool, optional Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is `False`. Returns ------- centres : pept.PointData The centroids of every cluster found. They are computed as the average of every column of `[time, x, y, z, etc]` of the clustered points. Another column is added to the initial data in `sample`, signifying the cluster size - the number of points included in the cluster. clustered_cutpoints : numpy.ndarray or pept.PointData The points in `sample` that fall in every cluster. A new column is added to the points in `sample` that signifies the label of cluster that the point was associated with: all points in cluster number 3 will have the number 3 as the last element in their row. The points classified as noise have the number -1 associated. Raises ------ Exception If `cutpoints` is not an instance (or a subclass) of `pept.PointData`. ''' if verbose: start = time.time() if not isinstance(cutpoints, pept.PointData): raise Exception('[ERROR]: cutpoints should be an instance of pept.PointData (or any class inheriting from it)') # Fit all samples in `cutpoints` in parallel using joblib # Collect all outputs as a list. If verbose, show progress bar with # tqdm if verbose: data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample, store_labels = store_labels, noise = noise, as_array = True) for sample in tqdm(cutpoints)) else: data_list = Parallel(n_jobs = -1)(delayed(self.fit_sample)(sample, store_labels = store_labels, noise = noise, as_array = True) for sample in cutpoints) # Access joblib.Parallel output as list comprehensions centres = np.array([row[0] for row in data_list if len(row[0]) != 0]) if len(centres) != 0: centres = pept.PointData(np.vstack(centres), sample_size = 0, overlap = 0, verbose = False) if store_labels: clustered_cutpoints = np.array([row[1] for row in data_list if len(row[1]) != 0]) clustered_cutpoints = pept.PointData(np.vstack(np.array(clustered_cutpoints)), sample_size = 0, overlap = 0, verbose = False) if verbose: end = time.time() print("Fitting cutpoints took {} seconds".format(end - start)) if store_labels: return [centres, clustered_cutpoints] else: return [centres, []]
def fit_sample(self, sample, store_labels=False, noise=False, as_array=True, verbose=False)
-
Fit one sample of cutpoints and return the cluster centres and (optionally) the labelled cutpoints.
Parameters
sample : (N, M >= 4) numpy.ndarray
    The sample of points that will be clustered. Every point corresponds to
    a row and is formatted as `[time, x, y, z, etc]`. Only columns `[1, 2, 3]`
    are used for clustering.
store_labels : bool, optional
    If set to True, the clustered cutpoints are returned along with the
    centres of the clusters. Setting it to False speeds up the clustering.
    The default is False.
noise : bool, optional
    If set to True, the clustered cutpoints also include the points classified
    as noise. Only has an effect if `store_labels` is set to True. The default
    is False.
as_array : bool, optional
    If set to True, the centres of the clusters and the clustered cutpoints
    are returned as numpy arrays. If set to False, they are returned inside
    instances of `PointData`.
verbose : bool, optional
    Provide extra information when fitting the sample: time the operation.
    The default is `False`.
Returns
centres : numpy.ndarray or PointData
    The centroids of every cluster found. They are computed as the average of
    every column of `[time, x, y, z, etc]` of the clustered points. Another
    column is added to the initial data in `sample`, signifying the cluster
    size - the number of points included in the cluster. If `as_array` is set
    to True, it is a numpy array, otherwise the centres are stored in a
    pept.PointData instance.
clustered_cutpoints : numpy.ndarray or PointData
    The points in `sample` that fall in every cluster. A new column is added
    to the points in `sample` that signifies the label of the cluster that the
    point was associated with: all points in cluster number 3 will have the
    number 3 as the last element in their row. The points classified as noise
    have the number -1 associated. If `as_array` is set to True, it is a numpy
    array, otherwise the clustered cutpoints are stored in a pept.PointData
    instance.
Raises
TypeError
    If `sample` is not a numpy array of shape (N, M), where M >= 4.
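Usage sketch (illustrative): clustering a single raw sample of cutpoints; the synthetic array and the clusterer settings are assumptions.

import numpy as np
from pept.tracking import peptml

# One sample of cutpoints, rows formatted as [time, x, y, z]
sample = np.random.uniform(0, 500, (500, 4))

clusterer = peptml.HDBSCANClusterer(min_cluster_size = 20)
centres, labelled = clusterer.fit_sample(
    sample,
    store_labels = True,
    noise = True,       # also keep the points labelled -1 (noise)
    as_array = True     # return plain numpy arrays instead of PointData
)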
Source code
def fit_sample(self, sample, store_labels = False, noise = False, as_array = True, verbose = False): '''Fit one sample of cutpoints and return the cluster centres and (optionally) the labelled cutpoints. Parameters ---------- sample : (N, M >= 4) numpy.ndarray The sample of points that will be clustered. Every point corresponds to a row and is formatted as `[time, x, y, z, etc]`. Only columns `[1, 2, 3]` are used for clustering. store_labels : bool, optional If set to True, the clustered cutpoints are returned along with the centres of the clusters. Setting it to False speeds up the clustering. The default is False. noise : bool, optional If set to True, the clustered cutpoints also include the points classified as noise. Only has an effect if `store_labels` is set to True. The default is False. as_array : bool, optional If set to True, the centres of the clusters and the clustered cutpoints are returned as numpy arrays. If set to False, they are returned inside instances of `pept.PointData`. verbose : bool, optional Provide extra information when computing the cutpoints: time the operation and show a progress bar. The default is `False`. Returns ------- centres : numpy.ndarray or pept.PointData The centroids of every cluster found. They are computed as the average of every column of `[time, x, y, z, etc]` of the clustered points. Another column is added to the initial data in `sample`, signifying the cluster size - the number of points included in the cluster. If `as_array` is set to True, it is a numpy array, otherwise the centres are stored in a pept.PointData instance. clustered_cutpoints : numpy.ndarray or pept.PointData The points in `sample` that fall in every cluster. A new column is added to the points in `sample` that signifies the label of cluster that the point was associated with: all points in cluster number 3 will have the number 3 as the last element in their row. The points classified as noise have the number -1 associated. If `as_array` is set to True, it is a numpy array, otherwise the clustered cutpoints are stored in a pept.PointData instance. Raises ------ TypeError If `sample` is not a numpy array of shape (N, M), where M >= 4. ''' if verbose: start = time.time() # sample row: [time, x, y, z] if sample.ndim != 2 or sample.shape[1] < 4: raise TypeError('\n[ERROR]: sample should have two dimensions (M, N), where N >= 4. 
Received {}\n'.format(sample.shape)) # Only cluster based on [x, y, z] labels = self.clusterer.fit_predict(sample[:, 1:4]) max_label = labels.max() centres = [] clustered_cutpoints = [] # the centre of a cluster is the average of the time, x, y, z columns # and the number of points of that cluster # centres row: [time, x, y, z, ..etc.., cluster_size] for i in range(0, max_label + 1): # Average time, x, y, z of cluster of label i centres_row = np.mean(sample[labels == i], axis = 0) # Append the number of points of label i => cluster_size centres_row = np.append(centres_row, (labels == i).sum()) centres.append(centres_row) centres = np.array(centres) if not as_array: centres = pept.PointData(centres, sample_size = 0, overlap = 0, verbose = False) # Return all cutpoints as a list of numpy arrays for every label # where the last column of an array is the label if store_labels: # Create a list of numpy arrays with rows: [t, x, y, z, ..etc.., label] if noise: cutpoints = sample[labels == -1] cutpoints = np.insert(cutpoints, cutpoints.shape[1], -1, axis = 1) clustered_cutpoints.append(cutpoints) for i in range(0, max_label + 1): cutpoints = sample[labels == i] cutpoints = np.insert(cutpoints, cutpoints.shape[1], i, axis = 1) clustered_cutpoints.append(cutpoints) clustered_cutpoints = np.vstack(np.array(clustered_cutpoints)) if not as_array: clustered_cutpoints = pept.PointData(clustered_cutpoints, sample_size = 0, overlap = 0, verbose = False) if verbose: end = time.time() print("Fitting one sample took {} seconds".format(end - start)) return [centres, clustered_cutpoints]
class TrajectorySeparation (centres, pointsToCheck=25, maxDistance=20, maxClusterDiff=500)
-
Source code
class TrajectorySeparation: def __init__(self, centres, pointsToCheck = 25, maxDistance = 20, maxClusterDiff = 500): # centres row: [time, x, y, z, clusterSize] # Make sure the trajectory is memory-contiguous for efficient # KDTree partitioning self.centres = np.ascontiguousarray(centres) self.pointsToCheck = pointsToCheck self.maxDistance = maxDistance self.maxClusterDiff = maxClusterDiff # For every point in centres, save a set of the trajectory # indices of the trajectories that they are part of # eg. centres[2] is part of trajectories 0 and 1 => # trajectoryIndices[2] = {0, 1} # Initialise a vector of empty sets of size len(centres) self.trajectoryIndices = np.array([ set() for i in range(len(self.centres)) ]) # For every trajectory found, save a list of the indices of # the centres that are part of that trajectory # eg. trajectory 1 is comprised of centres 3, 5 and 8 => # centresIndices[1] = [3, 5, 8] self.centresIndices = [[]] # Maximum trajectory index self.maxIndex = 0 def findTrajectories(self): for i, currentPoint in enumerate(self.centres): if i == 0: # Add the first point to trajectory 0 self.trajectoryIndices[0].add(self.maxIndex) self.centresIndices[self.maxIndex].append(0) self.maxIndex += 1 continue # Search for the closest previous pointsToCheck points # within a given maxDistance startIndex = i - self.pointsToCheck endIndex = i if startIndex < 0: startIndex = 0 # Construct a KDTree from the x, y, z (1:4) of the # selected points. Get the indices for all the points within # maxDistance of the currentPoint tree = cKDTree(self.centres[startIndex:endIndex, 1:4]) closestIndices = tree.query_ball_point(currentPoint[1:4], self.maxDistance, n_jobs=-1) closestIndices = np.array(closestIndices) + startIndex # If no point was found, it is a new trajectory. 
Continue if len(closestIndices) == 0: self.trajectoryIndices[i].add(self.maxIndex) self.centresIndices.append([i]) self.maxIndex += 1 continue # For every close point found, search for all the trajectory indices # - If all trajectory indices sets are equal and of a single value # then currentPoint is part of the same trajectory # - If all trajectory indices sets are equal, but of more values, # then currentPoint diverged from an intersection of trajectories # and is part of a single trajectory => separate it # # - If every pair of trajectory indices sets is not disjoint, then # currentPoint is only one of them # - If there exists a pair of trajectory indices sets that is # disjoint, then currentPoint is part of all of them # Select the trajectories of all the points that were found # to be the closest closestTrajectories = self.trajectoryIndices[closestIndices] #print("closestTrajectories:") #print(closestTrajectories) # If all the closest points are part of the same trajectory # (just one!), then the currentPoint is part of it too if (np.all(closestTrajectories == closestTrajectories[0]) and len(closestTrajectories[0]) == 1): self.trajectoryIndices[i] = closestTrajectories[0] self.centresIndices[ next(iter(closestTrajectories[0])) ].append(i) continue # Otherwise, check the points based on their cluster size else: # Create a list of all the trajectories that were found to # intersect #print('\nIntersection:') closestTrajIndices = list( set().union(*closestTrajectories) ) #print("ClosestTrajIndices:") #print(closestTrajIndices) # For each close trajectory, calculate the mean cluster size # of the last lastPoints points lastPoints = 50 # Keep track of the mean cluster size that is the closest to # the currentPoint's clusterSize currentClusterSize = currentPoint[4] #print("currentClusterSize = {}".format(currentClusterSize)) closestTrajIndex = -1 clusterSizeDiff = self.maxClusterDiff for trajIndex in closestTrajIndices: #print("trajIndex = {}".format(trajIndex)) trajCentres = self.centres[ self.centresIndices[trajIndex] ] #print("trajCentres:") #print(trajCentres) meanClusterSize = trajCentres[-lastPoints:][:, 4].mean() #print("meanClusterSize = {}".format(meanClusterSize)) #print("clusterSizeDiff = {}".format(clusterSizeDiff)) #print("abs diff = {}".format(np.abs( currentClusterSize - meanClusterSize ))) if np.abs( currentClusterSize - meanClusterSize ) < clusterSizeDiff: closestTrajIndex = trajIndex clusterSizeDiff = np.abs( currentClusterSize - meanClusterSize ) if closestTrajIndex == -1: #self.trajectoryIndices[i] = set(closestTrajIndices) #for trajIndex in closestTrajIndices: # self.centresIndices[trajIndex].append(i) print("\n**** -1 ****\n") break else: #print("ClosestTrajIndex found = {}".format(closestTrajIndex)) self.trajectoryIndices[i] = set([closestTrajIndex]) self.centresIndices[closestTrajIndex].append(i) ''' # If the current point is not part of any trajectory, assign it # the maxIndex and increment it if len(self.trajectoryIndices[i]) == 0: self.trajectoryIndices[i].append(self.maxIndex) self.maxIndex += 1 print(self.trajectoryIndices[i]) print(self.maxIndex) # Construct a KDTree from the numberOfPoints in front of # the current point tree = cKDTree(self.trajectory[(i + 1):(i + self.numberOfPoints + 2)][1:4]) # For every trajectory that the current point is part of, # find the closest points in front of it numberOfIntersections = len(self.trajectoryIndices[i]) dist, nextPointsIndices = tree.query(currentPoint, k=numberOfIntersections, distance_upper_bound=self.maxDistance, 
n_jobs=-1) print(nextPointsIndices) # If the current point is part of more trajectories, # an intersection happened. Call subroutine to part # the trajectories if numberOfIntersections > 1: for j in range(0, len(self.trajectoryIndices[i])): trajIndex = self.trajectoryIndices[i][j] self.trajectoryIndices[i + 1 + nextPointsIndices[j]].append(trajIndex) else: self.trajectoryIndices[i + 1 + nextPointsIndices].append(self.trajectoryIndices[i][0]) print(self.trajectoryIndices) ''' def getTrajectories(self): self.individualTrajectories = [] for trajCentres in self.centresIndices: self.individualTrajectories.append(self.centres[trajCentres]) self.individualTrajectories = np.array(self.individualTrajectories) return self.individualTrajectories ''' self.individualTrajectories = [ [] for i in range(0, self.maxIndex + 1) ] for i in range(0, len(self.trajectoryIndices)): for trajIndex in self.trajectoryIndices[i]: self.individualTrajectories[trajIndex].append(self.centres[i]) self.individualTrajectories = np.array(self.individualTrajectories) for i in range(len(self.individualTrajectories)): if len(self.individualTrajectories[i]) > 0: self.individualTrajectories[i] = np.vstack(self.individualTrajectories[i]) return self.individualTrajectories ''' def plotTrajectoriesAltAxes(self, ax): trajectories = self.getTrajectories() for traj in trajectories: if len(traj) > 0: ax.scatter(traj[:, 3], traj[:, 1], traj[:, 2], marker='D', s=10)
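TrajectorySeparation is currently undocumented; based on its source above, a hedged usage sketch follows. `centres_array` is a hypothetical (N, 5) numpy array with rows [time, x, y, z, cluster_size], for example the centres returned by HDBSCANClusterer.fit_sample with as_array = True.

from pept.tracking.peptml.peptml import TrajectorySeparation

# Assumed input: centres_array with rows [time, x, y, z, cluster_size]
separator = TrajectorySeparation(
    centres_array,
    pointsToCheck = 25,
    maxDistance = 20,
    maxClusterDiff = 500
)
separator.findTrajectories()

# List-like of numpy arrays, one per separated trajectory
trajectories = separator.getTrajectories()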
Methods
def findTrajectories(self)
-
Source code
def findTrajectories(self): for i, currentPoint in enumerate(self.centres): if i == 0: # Add the first point to trajectory 0 self.trajectoryIndices[0].add(self.maxIndex) self.centresIndices[self.maxIndex].append(0) self.maxIndex += 1 continue # Search for the closest previous pointsToCheck points # within a given maxDistance startIndex = i - self.pointsToCheck endIndex = i if startIndex < 0: startIndex = 0 # Construct a KDTree from the x, y, z (1:4) of the # selected points. Get the indices for all the points within # maxDistance of the currentPoint tree = cKDTree(self.centres[startIndex:endIndex, 1:4]) closestIndices = tree.query_ball_point(currentPoint[1:4], self.maxDistance, n_jobs=-1) closestIndices = np.array(closestIndices) + startIndex # If no point was found, it is a new trajectory. Continue if len(closestIndices) == 0: self.trajectoryIndices[i].add(self.maxIndex) self.centresIndices.append([i]) self.maxIndex += 1 continue # For every close point found, search for all the trajectory indices # - If all trajectory indices sets are equal and of a single value # then currentPoint is part of the same trajectory # - If all trajectory indices sets are equal, but of more values, # then currentPoint diverged from an intersection of trajectories # and is part of a single trajectory => separate it # # - If every pair of trajectory indices sets is not disjoint, then # currentPoint is only one of them # - If there exists a pair of trajectory indices sets that is # disjoint, then currentPoint is part of all of them # Select the trajectories of all the points that were found # to be the closest closestTrajectories = self.trajectoryIndices[closestIndices] #print("closestTrajectories:") #print(closestTrajectories) # If all the closest points are part of the same trajectory # (just one!), then the currentPoint is part of it too if (np.all(closestTrajectories == closestTrajectories[0]) and len(closestTrajectories[0]) == 1): self.trajectoryIndices[i] = closestTrajectories[0] self.centresIndices[ next(iter(closestTrajectories[0])) ].append(i) continue # Otherwise, check the points based on their cluster size else: # Create a list of all the trajectories that were found to # intersect #print('\nIntersection:') closestTrajIndices = list( set().union(*closestTrajectories) ) #print("ClosestTrajIndices:") #print(closestTrajIndices) # For each close trajectory, calculate the mean cluster size # of the last lastPoints points lastPoints = 50 # Keep track of the mean cluster size that is the closest to # the currentPoint's clusterSize currentClusterSize = currentPoint[4] #print("currentClusterSize = {}".format(currentClusterSize)) closestTrajIndex = -1 clusterSizeDiff = self.maxClusterDiff for trajIndex in closestTrajIndices: #print("trajIndex = {}".format(trajIndex)) trajCentres = self.centres[ self.centresIndices[trajIndex] ] #print("trajCentres:") #print(trajCentres) meanClusterSize = trajCentres[-lastPoints:][:, 4].mean() #print("meanClusterSize = {}".format(meanClusterSize)) #print("clusterSizeDiff = {}".format(clusterSizeDiff)) #print("abs diff = {}".format(np.abs( currentClusterSize - meanClusterSize ))) if np.abs( currentClusterSize - meanClusterSize ) < clusterSizeDiff: closestTrajIndex = trajIndex clusterSizeDiff = np.abs( currentClusterSize - meanClusterSize ) if closestTrajIndex == -1: #self.trajectoryIndices[i] = set(closestTrajIndices) #for trajIndex in closestTrajIndices: # self.centresIndices[trajIndex].append(i) print("\n**** -1 ****\n") break else: #print("ClosestTrajIndex found = 
{}".format(closestTrajIndex)) self.trajectoryIndices[i] = set([closestTrajIndex]) self.centresIndices[closestTrajIndex].append(i) ''' # If the current point is not part of any trajectory, assign it # the maxIndex and increment it if len(self.trajectoryIndices[i]) == 0: self.trajectoryIndices[i].append(self.maxIndex) self.maxIndex += 1 print(self.trajectoryIndices[i]) print(self.maxIndex) # Construct a KDTree from the numberOfPoints in front of # the current point tree = cKDTree(self.trajectory[(i + 1):(i + self.numberOfPoints + 2)][1:4]) # For every trajectory that the current point is part of, # find the closest points in front of it numberOfIntersections = len(self.trajectoryIndices[i]) dist, nextPointsIndices = tree.query(currentPoint, k=numberOfIntersections, distance_upper_bound=self.maxDistance, n_jobs=-1) print(nextPointsIndices) # If the current point is part of more trajectories, # an intersection happened. Call subroutine to part # the trajectories if numberOfIntersections > 1: for j in range(0, len(self.trajectoryIndices[i])): trajIndex = self.trajectoryIndices[i][j] self.trajectoryIndices[i + 1 + nextPointsIndices[j]].append(trajIndex) else: self.trajectoryIndices[i + 1 + nextPointsIndices].append(self.trajectoryIndices[i][0]) print(self.trajectoryIndices) '''
def getTrajectories(self)
-
Source code
def getTrajectories(self): self.individualTrajectories = [] for trajCentres in self.centresIndices: self.individualTrajectories.append(self.centres[trajCentres]) self.individualTrajectories = np.array(self.individualTrajectories) return self.individualTrajectories ''' self.individualTrajectories = [ [] for i in range(0, self.maxIndex + 1) ] for i in range(0, len(self.trajectoryIndices)): for trajIndex in self.trajectoryIndices[i]: self.individualTrajectories[trajIndex].append(self.centres[i]) self.individualTrajectories = np.array(self.individualTrajectories) for i in range(len(self.individualTrajectories)): if len(self.individualTrajectories[i]) > 0: self.individualTrajectories[i] = np.vstack(self.individualTrajectories[i]) return self.individualTrajectories '''
def plotTrajectoriesAltAxes(self, ax)
-
Source code
def plotTrajectoriesAltAxes(self, ax):
    trajectories = self.getTrajectories()

    # Scatter each separated trajectory with the axes reordered as
    # (z, x, y), matching the method's "AltAxes" naming
    for traj in trajectories:
        if len(traj) > 0:
            ax.scatter(traj[:, 3], traj[:, 1], traj[:, 2], marker = 'D', s = 10)
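plotTrajectoriesAltAxes scatters the (z, x, y) columns of every trajectory, so it expects a three-dimensional matplotlib axes. A minimal sketch, continuing the TrajectorySeparation example above; matplotlib is an assumption here, as it is not listed among the module's required packages.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 - registers the '3d' projection

fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')

separator.plotTrajectoriesAltAxes(ax)
plt.show()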