Conciseness is as necessary as automation for anomaly detection at scale


At its core, anomaly detection is the discovery of unexpected deviations in data. In the business world, anomalies often take the form of unanticipated spikes or dips in the time series data from key performance indicators (KPIs). Anomalies can be detected manually by visually reviewing the normal patterns and trends in each metric, then setting static thresholds for each of them, and finally generating alerts when those thresholds are crossed. This is the common approach taken by traditional business intelligence (BI) tools.
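To make the manual approach concrete, here is a minimal Python sketch of static-threshold alerting. The metric names, the bounds, and the function name are hypothetical and chosen only for illustration; they are not taken from any particular BI tool.

```python
from typing import Dict, List, Tuple

# Hypothetical hand-picked (low, high) bounds per KPI.
# In the manual approach, an analyst sets these after eyeballing each metric.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "checkout.revenue_per_min": (800.0, 2500.0),
    "api.error_rate": (0.0, 0.02),
}


def static_threshold_alerts(samples: Dict[str, float]) -> List[Tuple[str, float, str]]:
    """Return (metric, value, direction) for every sample outside its fixed bounds."""
    alerts = []
    for name, value in samples.items():
        bounds = THRESHOLDS.get(name)
        if bounds is None:
            continue  # no threshold was ever defined for this metric
        low, high = bounds
        if value < low:
            alerts.append((name, value, "below"))
        elif value > high:
            alerts.append((name, value, "above"))
    return alerts
```

Every metric needs its own hand-maintained entry in the table, which is the part that becomes unmanageable as the number of metrics grows.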

This manual approach, however, is time-intensive and doesn’t scale to hundreds of metrics, let alone thousands or millions. Another drawback is the fact that the normal behavior for a metric can change over time, rendering the original thresholds obsolete and requiring another large investment of time to determine new threshold values.

The necessity of concise reporting

Automated detection of anomalies is clearly the way to go. However, detecting the anomalies is only the first step: all those separate alerts need to be condensed before you can discover the single event behind the multiple signals. This distillation is necessary because, even without a single false positive, thousands of alerts can be generated at once at the scale of millions of metrics. Thus, practical anomaly detection systems must provide concise reporting of anomalies.

Achieving conciseness manually is impractical for the same reason manual anomaly detection is: the inability to scale. Conciseness requires knowing which anomalies are related. Yet to get that knowledge, you must first know which metrics are related, and that requires building and maintaining a model of the relationships between the individual metrics. Even counting only pairs, 500 KPIs have 124,750 possible relationships (500 × 499 / 2). That’s a combinatorial explosion producing a model far bigger than any team of data analysts can create, update and refer to as the torrent of alerts comes pouring in.

How is conciseness achieved?

At large scale, concise anomaly reporting requires automation just as much as anomaly detection does. In fact, conciseness requires smart automation, because the combinatorial explosion mentioned above will detonate with a naive automated approach too. This is why machine learning algorithms are now being used not only for detecting anomalies, but also for reporting them concisely. Sophisticated algorithms borrowed from data science, statistics and AI are being used to understand detected anomalies in the context of the metrics they are found in.

One example is real-time anomaly detection and analytics vendor Anodot, which uses a clever combination of univariate and multivariate approaches to achieve concise reporting of the detected anomalies. This hybrid approach defuses the combinatorial explosion by using several anomaly detection techniques (including simple ones like name-based grouping of metrics) to pre-group related metrics together, lessening the computational load on the system.
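As an illustration of the simplest technique named above, name-based grouping, here is a minimal Python sketch. It is not Anodot’s implementation; it assumes metric names are dot-separated hierarchies (for example service.component.measure.region), and the function names and the depth parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_key(metric_name: str, depth: int = 2) -> str:
    """Use the first `depth` dot-separated components of the name as the group key."""
    return ".".join(metric_name.split(".")[:depth])


def condense_alerts(anomalous_metrics: Iterable[str], depth: int = 2) -> Dict[str, List[str]]:
    """Collapse per-metric alerts into one entry per name-based group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in anomalous_metrics:
        groups[group_key(name, depth)].append(name)
    # Sort members so the condensed report is stable from run to run.
    return {key: sorted(members) for key, members in groups.items()}
```

With this kind of pre-grouping, any more expensive relationship analysis only has to compare metrics within a group rather than every metric against every other one.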

Here’s why conciseness should matter to you

Conciseness is critical in any anomaly detection system because it enables you to infer the real-world business event (be it a price glitch, an API breakage due to an OS upgrade, etc.) from the disparate signals in your time series data. Conciseness is the bridge from what’s happening to the business metrics to what’s happening to the business itself. Once that bridge is crossed, your team can tackle the question of what to do about it.

Machine learning has already enabled the automated detection of anomalies in KPIs at the scale of today’s data-driven businesses. Now it’s delivering even more value, putting the intelligence into business intelligence by helping enterprises see the bigger picture hidden in their data. Conciseness is the lens that brings that picture into sharp focus.