Getting Started with Context¶
from PdmContext.ContextGeneration import ContextGenerator
from PdmContext.utils.structure import Context
Quick start ¶
Context is used to encapsulate all available multimodal information towards better Anomaly Detection solutions.
Essentially, a Context represents the data available in a time window (context_horizon), mapped from their multimodal representations to multivariate time-series (CD), along with their relationships (CR), where the relationships are extracted using causal discovery between the time-series of CD.
Context is implemented as an online process, where data are collected as they arrive. The arrival of a new value from the target series triggers the creation of a new context, which includes: A) the mapping of the available multimodal sources to a multivariate time-series (CD), and B) the application of causal discovery to extract cause-effect relationships between the available data sources in the form of a graph (CR).
pip install PdMContext
Instantiating a context generator ¶
For instantiation we need to pass:
- target: a string representing the name of the target source.
- context_horizon: the length of the time window of available data used to build the context.
- Causalityfunct: a function that implements a causal discovery method (see Causality Discovery).
- mapping_functions: a dictionary of the available types of monitored data sources and their corresponding mappers (see Mapping Functions).
con_gen = ContextGenerator(target="anomaly scores", context_horizon="8 hours", Causalityfunct=causality_function, mapping_functions=mapping_functions)
To feed a new sample from a data source (e.g. the temperature measured from Machine_A) to the context generator, we need:
- timestamp: a timestamp representing the time when the sample was collected (using pandas timestamps).
- source: a string that denotes the name of the source the sample derives from.
- name: a string that represents the name of the time series related to the sample.
- type: annotates the type (or nature) of the sample (the type has to be one of the keys of the mapping_functions dictionary).
con_gen.collect_data(timestamp=t1, source="Machine_A", name="temperature", type="univariate", value=temperature_value)
Passing a sample from the target source triggers the creation of a new context.
context=con_gen.collect_data(timestamp=t1, source="Machine_A", name="anomaly scores", type="univariate")
PdmContext.utils.structure.Context is used to represent such a context.
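For instance, we can inspect a returned context as follows (a minimal sketch; the attribute names CD and CR are assumptions based on the description above, so check the Context class for the exact interface):
# Sketch only: CD/CR attribute names are assumed, not a confirmed API
if context is not None:
    print(context.CD.keys())  # the mapped multivariate time-series
    print(context.CR)         # the extracted cause-effect relationships (graph edges)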
Mapping Functions ¶
Data sources can be any time series (producing information in time) of events, continuous values, images, etc.
To formulate a context from the available data sources we have to define mapping functions:
mapping_functions: a dictionary where the keys are the names of the different types of data sources and the values are Python mapping classes. Each class transforms a time window of values observed from a particular source (e.g. the events observed from a particular data source) into a time series of the same length as the target one (the target is usually the anomaly scores).
For example, the default mapping_function used by the library is:
mapping_functions = {
    "Univariate": map_univariate_to_continuous(),       # represents continuous sources
    "isolated": map_isolated_to_continuous(),           # represents sources of events with instant impact
    "configuration": map_configuration_to_continuous(), # represents sources of events with constant impact
    "categorical": map_categorical_to_continuous()      # represents sources of categorical values
}
The values are classes from PdmContext.utils.mapping_functions and have the following form:
class map_name:
    def __init__(self):
        # possible initialization code
        pass

    def map(self, target_series, occurrences, name):
        """
        target_series: the time series used as a guide to build a continuous representation for the particular type.
        occurrences: the time series of observed values to be mapped.
        name: the name of the time series to be mapped.
        Both target_series and occurrences have values in the form (sample, timestamp), where the sample contains the value (any data type).
        Returns the resulting vectors (usually just one vector) and their names from mapping occurrences to time-series format.
        """
        # possible mapping code
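For illustration, a minimal sketch of a custom mapper following this form (hypothetical, not part of the library): it marks the target timestamp nearest to each occurrence, assuming the (sample, timestamp) convention described in the docstring above.
class map_nearest_to_continuous:
    def __init__(self):
        pass

    def map(self, target_series, occurrences, name):
        # Start from a zero vector aligned with the target series
        values = [0.0 for _ in target_series]
        for sample, timestamp in occurrences:
            # Mark the target timestamp closest to this occurrence
            closest = min(range(len(target_series)),
                          key=lambda i: abs(target_series[i][1] - timestamp))
            values[closest] = 1.0
        # Return the vector(s) and name(s), following the convention above
        return [values], [name]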
The way they transform the different sources into time-series data can be seen in the following example:
- target : Detector score
- isolated: isolated event
- configuration: speed change
- Univariate: temperature, some_KPI
More details regarding the default mapping_functions:¶
Continuous (analog, real, Univariate series ...):¶
The map_univariate_to_continuous class handles the difference between the sample rate of an observed time-series and that of the target time-series (referred to in the code and documentation as the target series):
- For series with a sample rate higher than that of the target, the samples between two timestamps of the target series are aggregated (mean).
- For series with lower sample rates, their values are repeated.
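To illustrate the idea independently of the library's internals, a small pandas sketch of the two cases:
import pandas as pd

# A target sampled every 2 minutes, one faster and one slower observed series
target_index = pd.date_range("2023-01-01", periods=4, freq="2min")
high_rate = pd.Series(range(8), index=pd.date_range("2023-01-01", periods=8, freq="1min"))
low_rate = pd.Series([10, 20], index=pd.date_range("2023-01-01", periods=2, freq="4min"))

# Higher sample rate than the target: aggregate (mean) between target timestamps
aggregated = high_rate.resample("2min").mean().reindex(target_index)
# Lower sample rate than the target: repeat the last observed value
repeated = low_rate.reindex(target_index, method="ffill")
print(aggregated.values)  # [0.5 2.5 4.5 6.5]
print(repeated.values)    # [10 10 20 20]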
Discrete Events:¶
The default mappers also support data which are not numeric but are related to some kind of event (events that occur in time). These are often referred to as discrete data. To this end, Context supports three types of such events:
- isolated: Events that have an instant impact when they occur.
- configuration: Events that refer to a configuration change that has an impact after its occurrence.
- categorical: Categorical events that refer to some kind of state (e.g. day of the week: Monday, ...). Each category is treated like a configuration event, and a state variable is created per category whose signal is one after the category's last change (i.e. while that category is active).
The map_isolated_to_continuous, map_configuration_to_continuous, and map_categorical_to_continuous classes are used to transform the occurrences of events into continuous time-series and add them to CD.
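As an illustration of the state expansion described for categorical events (a sketch, independent of the library's internals):
# Each category becomes a binary state series that is one while that category is active
categories = ["one", "one", "two", "two", "three"]
names = sorted(set(categories))
state_vectors = {n: [1 if c == n else 0 for c in categories] for n in names}
print(state_vectors)
# {'one': [1, 1, 0, 0, 0], 'three': [0, 0, 0, 0, 1], 'two': [0, 0, 1, 1, 0]}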
Generating some dummy data to test ContextGeneration ¶
import pandas as pd
from random import random
# Create artificially timestamps
start = pd.to_datetime("2023-01-01 00:00:00")
timestamps = [start + pd.Timedelta(minutes=i) for i in range(20)]
# Create a real value time series
data1 = [random() for i in range(20)]
# create a variable of anomaly scores
anomaly1 = [0.2, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.4, 0.8, 0.7, 0.7, 0.8, 0.7, 0.8, 1.4, 0.6, 0.7,2,2.3,2]
# Create a configuration Event series
# (for example, in the series below there was a configuration event at the 9th timestamp)
configur = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Create an isolated Event time series (with occurrences at the 1st and 13th timestamps)
spikes = [0 for i in range(len(data1))]
spikes[1] = 1
spikes[13] = 1
categorical=["one", "one", "two", "two", "two", "two", "two", "two", "two", "two", "two", "two", "two", "two", "two", "two", "two","three", "three", "three"]
Creating a Context Generation object ¶
Provide the name of the target series, the time window length using the context_horizon parameter, and the causality function used to calculate CR (we leave its details for later).
from PdmContext.utils.causal_discovery_functions import calculate_with_pc
con_gen = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculate_with_pc, debug=False)
Iteratively, we pass the data to the Context Generator by calling the collect_data() method. Each time we pass a single data sample or event from a single source. The method returns a Context object when we pass data whose name matches the specified target.
source = "press"
context_list=[]
for d1, an1, t1, sp1, con1, categ in zip(data1, anomaly1, timestamps, spikes, configur, categorical):
    con_gen.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        con_gen.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        con_gen.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    con_gen.collect_data(timestamp=t1, source=source, name="catg", value=categ, type="categorical")
    contextTemp = con_gen.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)
    context_list.append(contextTemp)
We can plot the last context object that was returned (the one corresponding to the last timestamp). The first plot shows the CD part of the Context and the second plot the CR (in the form of a graph).
contextTemp.plot()
Moreover, we can plot all the contexts of the ContextGenerator (at a more abstract level) using the plot method.
con_gen.plot(context_list,[["","anomaly1",""]])
The values on this plot refer to the target value used to build each context sample, and the colors refer to the relationships that exist in the CR of each context object. In this example, we can see that the anomaly1 series increases due to the config event (as seen from the CR).
Causality Discovery ¶
The user can implement and provide their own causal discovery method to the Context Generator.
To do this, one simply needs to implement a Python function that takes as parameters:
- A list with names of time series data
- The time series data in the form of a 2D array
Example: PdmContext.utils.causal_discovery_functions.calculatewithPc
import networkx as nx
# pip install gcastle  (the castle package is distributed on PyPI as gcastle)
from castle.algorithms import PC

def calculatewithPc(names, data):
    try:
        pc = PC(variant='parallel')
        pc.learn(data)
    except Exception as e:
        print(e)
        return None
    learned_graph = nx.DiGraph(pc.causal_matrix)
    # Relabel the nodes with the series names
    MAPPING = {k: n for k, n in zip(range(len(names)), names)}
    learned_graph = nx.relabel_nodes(learned_graph, MAPPING, copy=True)
    edges = learned_graph.edges
    return edges
Database Connections ¶
The current implementation supports connections to two databases (SQLite3 and InfluxDB) using PdmContext.utils.dbconnector.SQLiteHandler and PdmContext.utils.dbconnector.InfluxDBHandler.
Using SQLite will create a database file in the location of the main file.
Using InfluxDB requires the InfluxDB service to be started first (for example, on Linux: sudo service influxdb start).
Both database connections can be used with the implemented pipeline PdmContext.Pipelines.ContextAndDatabase.
Let's generate the same example as before, but this time store it in the database.
from PdmContext.utils.dbconnector import SQLiteHandler
from PdmContext.Pipelines import ContextAndDatabase
con_gen = ContextGenerator(target="anomaly1", context_horizon="8", Causalityfunct=calculatewithPc, debug=False)
database = SQLiteHandler(db_name="ContextDatabase.db")
contextpipeline = ContextAndDatabase(context_generator_object=con_gen, databaseStore_object=database)
configur = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
data1 = [random() for i in range(len(configur))]
start = pd.to_datetime("2023-01-01 00:00:00")
timestamps = [start + pd.Timedelta(minutes=i) for i in range(len(data1))]
anomaly1 = [0.2, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.4, 0.8, 0.7, 0.7, 0.8, 0.7, 0.8, 1, 0.6, 0.7]
spikes = [0 for i in range(len(data1))]
spikes[1] = 1
spikes[13] = 1
source = "press"
context_list=[]
for d1, an1, t1, sp1, con1 in zip(data1, anomaly1, timestamps, spikes, configur):
    contextpipeline.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        contextpipeline.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        contextpipeline.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    contextTemp = contextpipeline.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)
    context_list.append(contextTemp)
contextpipeline.Contexter.plot(context_list)
Now we can access the Context objects stored in the database:
database = SQLiteHandler(db_name="ContextDatabase.db")
target_name = "anomaly1"
contextlist = database.get_all_context_by_target(target_name)
print(len(contextlist))
17
We can also plot the contexts using the helper function PdmContext.utils.showcontext.show_context_list:
from PdmContext.utils.showcontext import show_context_list
show_context_list(contextlist, target_text=target_name)
We can use a filter to exclude some relationships from the plot (this is quite useful when many data series are involved). Although there is no practical use in our example, we will now exclude config->anomaly1 relationships and keep only anomaly1->config, which concern the same samples (this is done here only for presentation purposes):
show_context_list(contextlist, target_text=target_name, filteredges=[["config", "anomaly1", ""]])
Distance Function ¶
To compare two Context objects we need a similarity (or distance measure).
The user can implement their own distance function, which must accept exactly two parameters (two Context objects).
Below is an example which uses distance_cc (a combination of the SBD distance between the CD parts and the Jaccard similarity between the CR parts of the two contexts, weighted by a factor a and b = 1 - a).
In this example we build our own distance function my_distance(c1: Context, c2: Context) by fixing specific values of a and b:
from PdmContext.utils.distances import distance_cc
def my_distance(c1: Context, c2: Context):
    # a=0 gives all the weight to the Jaccard similarity between the CR components
    return distance_cc(c1, c2, a=0)
print(my_distance(contextlist[0],contextlist[-1]))
print(my_distance(contextlist[-2],contextlist[-1]))
0
1.0
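A distance can also be written from scratch. Below is a minimal sketch that compares only the relationship parts (the CR edge-list attribute name is an assumption based on the structure described earlier, not a confirmed API):
def my_edge_similarity(c1: Context, c2: Context):
    # Jaccard similarity over the CR edge sets (the CR attribute name is assumed)
    e1, e2 = set(map(tuple, c1.CR)), set(map(tuple, c2.CR))
    if not e1 and not e2:
        return 1.0
    return len(e1 & e2) / len(e1 | e2)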
Clustering Context ¶
Using PdmContext.ContextClustering.DBscanContextStream we can cluster Context objects.
Clustering over Context objects has two main requirements:
1. a streaming interface (when we want to cluster the context objects as they arrive from the Context Generator)
2. an appropriate distance measure
Regarding 1), a simplified DBscan algorithm for streaming data has been implemented in PdmContext.ContextClustering.DBscanContextStream (we can iteratively feed it using the add_sample_to_cluster method).
For 2), sample distance functions exist in PdmContext.utils.distances, and users can define their own as shown previously.
Creating a PdmContext.ContextClustering.DBscanContextStream object:
from PdmContext.ContextClustering import DBscanContextStream
# use the distance function from before
clustering=DBscanContextStream(cluster_similarity_limit=0.7,distancefunc=my_distance)
for context_object in contextlist:
    clustering.add_sample_to_cluster(context_object)
print(clustering.clusters_sets)
clustering.plot()
[[0, 1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16]]
The combination of the clustering and the ContextGenerator can also be implemented using the pipeline PdmContext.Pipelines.ContextAndClustering:
from PdmContext.Pipelines import ContextAndClustering
con_gen_2 = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc, debug=False)
clustering_2=DBscanContextStream(cluster_similarity_limit=0.7,min_points=2,distancefunc=my_distance)
contextpipeline2 = ContextAndClustering(context_generator_object=con_gen_2,Clustring_object=clustering_2)
source = "press"
context_list=[]
for d1, an1, t1, sp1, con1 in zip(data1, anomaly1, timestamps, spikes, configur):
    contextpipeline2.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        contextpipeline2.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        contextpipeline2.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    contextTemp = contextpipeline2.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)
    context_list.append(contextTemp)
contextpipeline2.clustering.plot()
contextpipeline2.Contexter.plot(context_list)
Pipelines ¶
There are three pipelines that wrap the Database Connector, Context Generator, and Clustering (all using the same collect_data API as the Context Generator):
- PdmContext.Pipelines.ContextAndClustering (wraps the Context Generator and feeds its results to clustering)
- PdmContext.Pipelines.ContextAndDatabase (wraps the Context Generator and feeds its results to the database)
- PdmContext.Pipelines.ContextAndClusteringAndDatabase (wraps the Context Generator and feeds its results to both the database and clustering; see the sketch below)
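For the combined pipeline, a minimal sketch (the constructor parameter names are assumed by analogy with ContextAndDatabase and ContextAndClustering shown earlier; check the Pipelines module for the exact signature):
from PdmContext.Pipelines import ContextAndClusteringAndDatabase

# Sketch only: parameter names are assumed by analogy with the other two pipelines
contextpipeline_full = ContextAndClusteringAndDatabase(
    context_generator_object=con_gen_2,
    Clustring_object=clustering_2,
    databaseStore_object=database,
)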
Simulators ¶
Because the Context Generator works in a streaming fashion, simulators are implemented that can be used as helpers (PdmContext.utils.simulate_stream).
Simulator stream ¶
Example (PdmContext.utils.simulate_stream.simulate_stream):
This simulator needs three (optional) lists and the target name:
- Time series data: tuples of the shape (name: str, timestamps: list, values: list)
- Event series data: tuples of the shape (name: str, occurrences: list of dates, type: str)
- Categorical event series data: tuples of the shape (name: str, occurrences: list of dates, categories: list, type: str)
- the target name
from PdmContext.utils.simulate_stream import simulate_stream
from PdmContext.ContextGeneration import ContextGenerator
import pandas as pd
from random import random
start2 = pd.to_datetime("2023-01-01 00:00:00")
timestamps2 = [start2 + pd.Timedelta(minutes=i) for i in range(17)]
eventconf2=("config",[pd.to_datetime("2023-01-01 00:09:00")],"configuration")
spiketuples2=("spikes",[pd.to_datetime("2023-01-01 00:01:00"),pd.to_datetime("2023-01-01 00:13:00")],"isolated")
categories=("Categ",timestamps2, ["one", "one","one","one","one","two","two","two","two","two","two","three", "three", "three", "two", "two","two"],"categorical")
anomaly1tuples2=("anomaly1", [0.2, 0.3, 0.2, 0.1, 0.1, 0.8, 0.8, 0.8, 1.8, 1.7, 1.7, 1.8, 1.7, 1.8, 2, 1.6, 1.7],timestamps2)
stream=simulate_stream([anomaly1tuples2],[eventconf2,spiketuples2],[categories],"anomaly1")
contextpipeline3 = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc)
source="press"
context_list=[]
for record in stream:
    # print(record)
    context_res = contextpipeline3.collect_data(timestamp=record["timestamp"], source=source, name=record["name"], type=record["type"], value=record["value"])
    if context_res is not None:
        context_list.append(context_res)
contextpipeline3.plot(context_list,[["", "anomaly1", ""]])
contextpipeline3.contexts[-1].plot()
Simulator using pandas DataFrame ¶
This covers the case of an existing DataFrame that contains all the data.
Example (PdmContext.utils.simulate_stream.simulate_from_df):
This simulator needs:
- a DataFrame
- a list of which columns represent events and of what type, e.g. [("column1","isolated"),("column3","configuration")]
- the target name (existing in the DataFrame's columns)
from PdmContext.utils.simulate_stream import simulate_from_df
df = pd.read_csv("dummy_data.csv",index_col=0)
df.index=pd.to_datetime(df.index)
print(df.head())
target_name="anomaly1"
stream = simulate_from_df(df,[("configur","configuration"),("spikes","isolated")], target_name)
contextpipeline4 = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc)
source = "press"
context_list=[]
for record in stream:
    context_res = contextpipeline4.collect_data(timestamp=record["timestamp"], source=source, name=record["name"], type=record["type"], value=record["value"])
    if context_res is not None:
        context_list.append(context_res)
contextpipeline4.plot(context_list)
Interpretation ¶
Based on the edges in a Context's CR (generated by causal discovery), we can try to interpret the behavior of the target series.
For example, consider the case below of two configuration events and one isolated event, together with an anomaly score.
from PdmContext.ContextGeneration import ContextGenerator
from PdmContext.utils.causal_discovery_functions import calculate_with_pc
from PdmContext.utils.simulate_stream import simulate_from_df
from random import random
import matplotlib.pyplot as plt
import pandas as pd
size=100
isoEv1=[0 for i in range(size)]
confevent1=[0 for i in range(size)]
confevent2=[0 for i in range(size)]
noise=random()/10
start = pd.to_datetime("2023-01-01 00:00:00")
timestamps = [start + pd.Timedelta(hours=i) for i in range(size)]
confevent1[31]=1
confevent2[33]=1
isoEv1[69]=1
score=[1+random()/10 for i in range(30)]+ [1+(i/5)+random()/10 for i in range(5)] +[2+random()/10 for i in range(65)]
score[70]+=1
contextgenerator=ContextGenerator(target="score",context_horizon="100",Causalityfunct=calculate_with_pc)
dfdata={
"score":score,
"confEv1":confevent1,
"confEv2":confevent2,
"isoEv1":isoEv1,
}
df=pd.DataFrame(dfdata,index=timestamps)
df.plot()
plt.show()
stream = simulate_from_df(df,eventTypes=[("isoEv1","isolated"),("confEv1","configuration"),("confEv2","configuration")],target_name="score")
source="press"
listcontexts=[]
for record in stream:
    context_res = contextgenerator.collect_data(timestamp=record["timestamp"], source=source, name=record["name"], type=record["type"], value=record["value"])
    if context_res is not None:
        listcontexts.append(context_res)
contextgenerator.plot(listcontexts)
In the two plots we can observe the raw data (upper) and the interpretation (lower). Regarding the interpretation plot in the lower part, due to space limitations only a part of the interpretations is shown.
Looking closer at the time where the increase starts and at the time when a spike occurs, we can get a better understanding of the interpretation.
Below we plot the CD and CR parts of the context. For the CD part the data series are shown, and for the CR part the graph structure from causal discovery (left) is depicted along with the interpretation (right).
listcontexts[35].plot()
1) confEv1@press : 2023-01-02 07:00:00
2) confEv2@press : 2023-01-02 09:00:00
We observe two interpretations, confEv1 and confEv2, where confEv1 started causing the target series (score) before confEv2 (based on the timestamps).
Using these timestamps we can better reason about what may have caused the increase in the score.
The same holds in the next example, where the spike occurred. Clearly the spike is due to the isoEv1 occurrence, but the general rise in the score is still caused by confEv1 and confEv2. So the interpretation again contains all three causes, with timestamps indicating when each cause started.
listcontexts[70].plot()
1) confEv1@press : 2023-01-02 07:00:00
2) confEv2@press : 2023-01-02 09:00:00
3) isoEv1@press : 2023-01-03 22:00:00