Note that SPOT works only on stationary data: the data distribution must not change over time.

In many real-world use cases (finance, cybersecurity, weather...) the data does drift. In these cases, SPOT can be applied downstream of a model that captures the drift.
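
To make the notion of drift concrete, here is a minimal sketch on synthetic data (the series, the drift slope and the `half_mean_gap` helper are illustrative assumptions, not part of libspot). It compares the mean of the first and second halves of a series: for stationary data the two means should be close, whereas a drifting series shows a clear gap.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n = 2_000
# Stationary series: same distribution at every time step.
stationary = [random.gauss(0.0, 1.0) for _ in range(n)]
# Drifting series: the mean grows linearly with time (hypothetical drift).
drifting = [0.005 * t + random.gauss(0.0, 1.0) for t in range(n)]


def half_mean_gap(values):
    """Absolute difference between the means of the two halves."""
    mid = len(values) // 2
    return abs(statistics.fmean(values[:mid]) - statistics.fmean(values[mid:]))


gap_stationary = half_mean_gap(stationary)
gap_drifting = half_mean_gap(drifting)
print(f"stationary gap: {gap_stationary:.3f}")
print(f"drifting gap:   {gap_drifting:.3f}")
```

This is only a rough visual-style check, not a formal stationarity test; it is meant to show why a threshold fitted on early data may become meaningless once the distribution moves.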

In this guide we show how we can use SPOT on the household power consumption dataset.

import pandas as pd

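# Missing measurements are encoded as "?" in the raw file
P = pd.read_csv("./data/household_power_consumption.txt", sep=";", na_values="?")
# Build a datetime index from the separate Date and Time columns
P.index = pd.to_datetime(P["Date"] + " " + P["Time"], format="%d/%m/%Y %H:%M:%S")
P = P.drop(columns=["Date", "Time"]).dropna()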

Let us consider the Voltage measurement. If we plot it, we see that the distribution is not stationary.

P["Voltage"].plot();

Diffing

A first simple trick is to difference the data (replace each value by its change from the previous one) so as to remove the trend. In this example, the resulting distribution looks much closer to stationary.
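
Before applying it to the real data, the effect of differencing can be seen on a toy series (the values below are made up for illustration). The series climbs steadily, but its successive differences stay within a narrow band.

```python
import pandas as pd

# Hypothetical toy series: an upward trend with a small wiggle.
s = pd.Series([10.0, 12.5, 14.0, 16.5, 18.0, 20.5])

# d[t] = s[t] - s[t-1]; the first value has no predecessor, so it is NaN.
d = s.diff()

print(d.tolist())
print("range of s:", s.max() - s.min())
print("range of d:", d.max() - d.min())
```

Note the leading NaN produced by `diff()`: it has to be dropped before the values are handed to a detector.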

Q = pd.DataFrame(P["Voltage"]).diff()
Q.plot();

Now let us take the first 500,000 values to fit a Spot instance. Since we have plenty of data, we can afford demanding parameters (a low q, a high level and a large max_excess).
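
A quick back-of-the-envelope computation shows why the amount of data matters. The sketch below assumes that `level` is the quantile level above which observations count as tail excesses, and that `max_excess` caps how many of those excesses are kept; the helper `expected_excesses` is hypothetical and not part of libspot.

```python
def expected_excesses(n_training: int, level: float) -> int:
    """Approximate number of training points above the `level` quantile."""
    return round(n_training * (1.0 - level))


# Setting discussed above: 500,000 training points, level = 0.995.
large = expected_excesses(500_000, 0.995)
# Same level with a much smaller (hypothetical) training set.
small = expected_excesses(1_000, 0.995)

print("excesses with 500,000 points:", large)
print("excesses with 1,000 points:  ", small)
```

Under these assumptions, roughly 2,500 points fall in the tail with 500,000 training values, against only about 5 with 1,000 values, which would be far too few to describe the tail reliably.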

from libspot import Spot, ANOMALY

training_size = 500_000
training = Q.iloc[:training_size]["Voltage"].dropna().values

spot = Spot(q=1e-7, level=0.995, max_excess=5000)
spot.fit(training)

Finally, we can run SPOT on the remaining data, storing the anomaly threshold at each step.

thresholds = {}
anomalies = {}
for row in Q.iloc[training_size:].itertuples():
    status = spot.step(row.Voltage)
    thresholds[row.Index] = spot.anomaly_threshold
    if status == ANOMALY:
        anomalies[row.Index] = row.Voltage

Q["z"] = pd.Series(thresholds)
Q["anomaly"] = pd.Series(anomalies)
ax = Q[["Voltage", "z"]].plot()
Q["anomaly"].plot(ax=ax, style='o', color="#f54900", label="Anomalies");

Moving average

Another way to capture the trend is to compute a moving average, for instance over 30 measurements (30 minutes), and subtract it from the data.
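
The mechanics can be checked on a toy series first (the values and the window of 3 are illustrative assumptions; the guide itself uses 30). The residual is the series minus its trailing moving average, and the first `window - 1` values are lost because the average is not defined there.

```python
import pandas as pd

# Hypothetical toy series: slow upward trend with one spike at position 5.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 16.0, 7.0, 8.0])

window = 3  # small window to keep the toy example readable
trend = s.rolling(window).mean()  # NaN for the first window - 1 values
residual = (s - trend).dropna()   # detrended values

print(residual.round(3).tolist())
print("position of largest residual:", residual.idxmax())
```

Because the trailing window includes the current value, a spike is partly absorbed by the average and pulls the residuals of the following points downward. This is a side effect to keep in mind when choosing the window size.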

R = pd.DataFrame(P["Voltage"] - P["Voltage"].rolling(30).mean()).dropna()
R.plot();

Then we can analyze the data with SPOT as before.

training_size = 500_000
training = R.iloc[:training_size]["Voltage"].dropna().values

spot = Spot(q=1e-6, level=0.99, max_excess=10000)
spot.fit(training)

thresholds = {}
anomalies = {}
for row in R.iloc[training_size:].itertuples():
    status = spot.step(row.Voltage)
    thresholds[row.Index] = spot.anomaly_threshold
    if status == ANOMALY:
        anomalies[row.Index] = row.Voltage

R["z"] = pd.Series(thresholds)
R["anomaly"] = pd.Series(anomalies)

ax = R[["Voltage", "z"]].plot()
R["anomaly"].plot(ax=ax, style='o', color="#f54900", label="Anomalies");

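
Both examples repeat the same detection loop, so it could be factored into a small helper. The sketch below is one possible way to do it: `run_detector` is a hypothetical function, and it only assumes that the detector exposes `step(value)` and an `anomaly_threshold` attribute, as the Spot object is used in this guide. `FixedThresholdDetector` is a stand-in that is NOT part of libspot; it exists only so the helper can be exercised without the library.

```python
from typing import Any, Dict, Iterable, Tuple


def run_detector(detector: Any, samples: Iterable[Tuple[Any, float]], anomaly_flag: Any):
    """Feed (index, value) pairs to `detector`; collect thresholds and anomalies."""
    thresholds: Dict[Any, float] = {}
    anomalies: Dict[Any, float] = {}
    for index, value in samples:
        status = detector.step(value)
        # Record the threshold in force after this step.
        thresholds[index] = detector.anomaly_threshold
        if status == anomaly_flag:
            anomalies[index] = value
    return thresholds, anomalies


class FixedThresholdDetector:
    """Hypothetical stand-in (not libspot): flags values above a fixed threshold."""

    def __init__(self, threshold: float):
        self.anomaly_threshold = threshold

    def step(self, value: float) -> str:
        return "anomaly" if value > self.anomaly_threshold else "normal"


thresholds, anomalies = run_detector(
    FixedThresholdDetector(3.0),
    [("t0", 0.5), ("t1", 4.2), ("t2", 1.1)],
    anomaly_flag="anomaly",
)
print(thresholds)
print(anomalies)
```

With the objects of this guide, the call would presumably look like `run_detector(spot, R["Voltage"].iloc[training_size:].items(), ANOMALY)`, with the two returned dictionaries converted to `pd.Series` as before.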