Anomaly Detection

LogPulse Anomaly Detection automatically learns normal behavior for each of your services and alerts you when metrics deviate from expected baselines. Powered by statistical analysis and AI-driven investigation, it catches issues before they become incidents.

Overview

The anomaly detection system continuously monitors your services by comparing real-time metrics against learned baselines. It uses a modified Z-score algorithm with time-of-day and day-of-week seasonality to minimize false positives while catching genuine issues early.

Automatic Baselines

Learns normal behavior from 7 days of historical data, updated hourly.

Smart Detection

Modified Z-score with IQR-based thresholds adapts to each service's patterns.

AI Investigation

Automated root cause analysis queries your logs to explain detected anomalies.

Adaptive Learning

Sensitivity auto-tunes based on your feedback to reduce noise over time.

How It Works

The detection pipeline runs continuously in the background with no manual setup required beyond activating your services.

1. Service Discovery: LogPulse automatically discovers services from your log data based on the source field.

2. Baseline Learning: Statistical baselines are computed hourly from a 7-day rolling window, grouped by day-of-week and hour-of-day.

3. Anomaly Detection: Current metrics are compared against baselines using the modified Z-score. Deviations above the threshold trigger anomalies.

4. Correlation & Impact: Multiple anomalous metrics are correlated, severity is escalated, and downstream impact is analyzed via the dependency graph.

5. Notification & Response: Alerts are sent to configured channels (Slack, email, webhook, PagerDuty), filtered by minimum severity.

Service Discovery

LogPulse automatically discovers services by analyzing the source field in your logs. Any service with more than 100 log entries in the last 7 days is detected and added to your monitored services list.

Discovered services start in the discovered state, awaiting activation. You can activate, pause, or ignore them from the Detect & Response dashboard.
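
The discovery rule above can be sketched in a few lines. This is an illustrative reimplementation, not LogPulse's code; in production the counts come from ClickHouse rather than an in-memory list.

```python
# Sketch of the discovery rule: any source with MORE than 100 log entries
# in the trailing 7-day window becomes a "discovered" service.
from collections import Counter

DISCOVERY_THRESHOLD = 100  # minimum log entries in the last 7 days

def discover_services(log_sources):
    """Return {source: count} for sources that qualify as discovered services."""
    counts = Counter(log_sources)
    return {src: n for src, n in counts.items() if n > DISCOVERY_THRESHOLD}

# "api" crosses the threshold, "cron" does not.
logs = ["api"] * 150 + ["cron"] * 40
print(discover_services(logs))  # {'api': 150}
```

Note that the threshold is strict: a source with exactly 100 entries is not yet discovered.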

| Status | Description |
| --- | --- |
| discovered | Newly found service, not yet monitored. Awaiting activation. |
| active | Actively monitored. Baselines are computed and anomalies are detected. |
| paused | Monitoring temporarily suspended. Baselines are retained. |
| ignored | Service is excluded from monitoring entirely. |

Tip: Discovery runs hourly. New services typically appear within an hour of first sending logs.

Baselines

Baselines represent the normal behavior of a service. They are computed from historical metrics and used as the reference point for anomaly detection.

Baseline Computation

Baselines are recalculated every hour using the last 7 days of metrics from the service_metrics_5m materialized view in ClickHouse. The following statistics are computed for each metric:

| Statistic | Description |
| --- | --- |
| Median | The middle value — robust against outliers. |
| Q1 (25th percentile) | Lower quartile boundary of the normal range. |
| Q3 (75th percentile) | Upper quartile boundary of the normal range. |
| IQR | Interquartile range (Q3 - Q1); measures the spread of normal values. |
| Mean | Average value over the baseline window. |
| Standard Deviation | Measures overall variability across the baseline window. |
| Sample Count | Number of data points used to compute the baseline. |
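
These statistics can all be computed with the Python standard library. The sketch below is illustrative; the production implementation's quartile convention may differ slightly from `statistics.quantiles`' default (exclusive) method.

```python
# Compute the per-metric baseline statistics listed above.
import statistics

def compute_baseline(values):
    """Return the baseline statistics for one metric's sample window."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                      # spread of the normal range
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sensitive to outliers, unlike IQR
        "sample_count": len(values),
    }

print(compute_baseline([1, 2, 3, 4, 5]))
```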

Time-Slot Baselines

Baselines are computed per day-of-week (0-6) and hour-of-day (0-23), giving 168 unique time slots per metric per service. This captures daily and weekly seasonality patterns.

For example, a service that sees higher traffic on weekday mornings will have different baselines for Monday 9 AM vs. Sunday 3 AM, preventing normal traffic fluctuations from triggering false alerts.

Note: Baselines are cached in Redis with a 2-hour TTL for fast lookups during detection. They are recomputed hourly from ClickHouse.
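
The 168-slot keying scheme can be sketched as a simple timestamp-to-slot mapping. Two assumptions here: Monday is day 0 (the docs only say 0-6), and timestamps are interpreted in UTC.

```python
# Derive the (day-of-week, hour-of-day) baseline slot from a timestamp.
from datetime import datetime, timezone

def baseline_slot(ts: datetime) -> tuple[int, int]:
    """Return the (day_of_week 0-6, hour_of_day 0-23) slot for a timestamp."""
    return ts.weekday(), ts.hour  # Monday = 0, an assumed convention

# Monday 09:00 and Sunday 03:00 map to different slots, so each gets
# its own learned baseline.
mon = datetime(2024, 1, 1, 9, tzinfo=timezone.utc)   # a Monday
sun = datetime(2024, 1, 7, 3, tzinfo=timezone.utc)   # a Sunday
print(baseline_slot(mon), baseline_slot(sun))  # (0, 9) (6, 3)
```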

Detection Algorithm

LogPulse uses a modified Z-score algorithm combined with multi-metric correlation to detect anomalies with high accuracy.

Modified Z-Score

The modified Z-score measures how far a value deviates from the median, normalized by the IQR. It is more robust than standard Z-scores because it uses the median and IQR instead of mean and standard deviation, making it resistant to outlier contamination.

```python
# Modified Z-score formula
modified_z_score = 0.6745 * (value - median) / iqr
```

A score of 0 means the value equals the median. Higher absolute scores indicate larger deviations from normal behavior. The 0.6745 constant normalizes the IQR to be comparable with standard deviations.
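
As a small function, the formula looks like the sketch below. The guard against a zero IQR is an assumption on our part: a perfectly flat baseline would otherwise divide by zero.

```python
# Modified Z-score as a function, with a guard for flat baselines.
def modified_z_score(value, median, iqr):
    if iqr == 0:
        return 0.0  # no spread in the baseline; treat as non-anomalous
    return 0.6745 * (value - median) / iqr

print(modified_z_score(100, 100, 20))  # 0.0 — value equals the median
print(modified_z_score(250, 100, 20))  # 5.05875 — well into critical range
```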

Metric Correlation

When multiple metrics for the same service are anomalous simultaneously, the system escalates the severity. If 2 or more metrics exceed their thresholds at the same time, the overall severity is upgraded by one level.

Note: Correlation helps distinguish isolated metric spikes from systemic service problems. A simultaneous rise in error_count and drop in log_count is more concerning than either alone.
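
The escalation rule can be sketched as follows. The severity ordering comes from the Severity Levels table below; the exact combination logic is an illustrative assumption.

```python
# Escalate severity by one level when 2+ metrics are anomalous at once.
LEVELS = ["low", "medium", "high", "critical"]

def correlate(severities: list[str]) -> str:
    """Combine per-metric severities into an overall service severity."""
    worst = max(severities, key=LEVELS.index)
    if len(severities) >= 2 and worst != "critical":
        return LEVELS[LEVELS.index(worst) + 1]  # bump one level
    return worst

print(correlate(["medium"]))         # medium — single metric, no bump
print(correlate(["medium", "low"]))  # high — two metrics, escalated
```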

Severity Levels

Anomaly severity is determined by the modified Z-score magnitude. Higher scores indicate more significant deviations from normal behavior.

| Severity | Z-Score Threshold | Description |
| --- | --- | --- |
| Low | > 1.5 | Minor deviation. Might indicate an emerging trend worth monitoring. |
| Medium | > 2.5 | Notable deviation. Likely requires attention within a reasonable timeframe. |
| High | > 3.5 | Significant deviation. Strongly suggests a real problem requiring prompt investigation. |
| Critical | > 5.0 | Extreme deviation. Almost certainly a real incident requiring immediate action. |

Configuring Services

Each monitored service can be individually configured to tune detection behavior to your needs.

Sensitivity

Sensitivity controls how easily anomalies are triggered. It acts as a multiplier on the Z-score thresholds.

| Level | Behavior |
| --- | --- |
| Low | Fewer alerts. Only very significant deviations trigger anomalies. Good for noisy services. |
| Medium (default) | Balanced detection. Suitable for most services. |
| High | More alerts. Catches smaller deviations. Good for critical services where early warning matters. |

You can also set per-metric sensitivity multipliers for fine-grained control. For example, increase sensitivity on error_rate while keeping log_count at default.
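
One way to picture the multiplier mechanics is below. The specific multiplier values (0.7 / 1.0 / 1.4) are illustrative assumptions, not LogPulse's actual constants; the point is that high sensitivity lowers every threshold so smaller deviations alert.

```python
# Sensitivity and per-metric multipliers applied to the base Z-score thresholds.
BASE_THRESHOLDS = {"low": 1.5, "medium": 2.5, "high": 3.5, "critical": 5.0}
MULTIPLIERS = {"low": 1.4, "medium": 1.0, "high": 0.7}  # high sensitivity = lower bar

def effective_thresholds(sensitivity: str, per_metric: float = 1.0) -> dict:
    m = MULTIPLIERS[sensitivity] * per_metric
    return {level: t * m for level, t in BASE_THRESHOLDS.items()}

# High sensitivity scales every threshold down.
print(effective_thresholds("high"))
```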

Detection Intervals

Control how frequently detection runs and how much data it considers for each check.

| Setting | Options | Default |
| --- | --- | --- |
| Detection Interval | 30s, 1m, 2m, 5m | 1m |
| Detection Lookback | 5m, 10m, 15m, 30m | 10m |

Tip: Shorter intervals provide faster detection but increase compute load. For most services, a 1-minute interval with a 10-minute lookback is a good balance.

Urgent Detection

Urgent detection provides a fast-path alert when error rates spike dramatically, bypassing the normal baseline learning phase. This is useful for catching catastrophic failures immediately.

| Setting | Description | Default |
| --- | --- | --- |
| Enabled | Toggle urgent detection on or off for this service. | Off |
| Error Rate Threshold | Error-rate percentage that triggers an urgent alert. | 50% |
| Min Batch Size | Minimum number of recent logs required before evaluating the threshold. | 100 |
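
The fast path reduces to a raw error-rate check once enough recent logs exist, with no baseline involved. Whether the comparison is inclusive (>=) is an assumption in this sketch.

```python
# Urgent detection fast path: raw error rate over a minimum batch of logs.
def urgent_check(error_count: int, total_count: int,
                 threshold_pct: float = 50.0, min_batch: int = 100) -> bool:
    if total_count < min_batch:
        return False  # not enough recent logs to evaluate
    return (error_count / total_count) * 100 >= threshold_pct

print(urgent_check(60, 100))  # True — 60% errors over a full batch
print(urgent_check(60, 80))   # False — batch below the 100-log minimum
```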

Tracked Metrics

LogPulse tracks the following metrics per service, aggregated in 5-minute windows from the ClickHouse materialized view:

| Metric | Description | Unit |
| --- | --- | --- |
| log_count | Total number of log entries received. | count / 5m |
| error_count | Number of logs with level=error or level=fatal. | count / 5m |
| error_rate | Percentage of error logs out of total log count. | % |
| warn_count | Number of logs with level=warn or level=warning. | count / 5m |
| warn_rate | Percentage of warning logs out of total log count. | % |
| info_count | Number of logs with level=info. | count / 5m |
| info_rate | Percentage of info logs out of total log count. | % |

Note: Metrics are sourced from the service_metrics_5m materialized view, which is automatically maintained by ClickHouse.

Managing Anomalies

The Detect & Response dashboard provides a central view of all detected anomalies with filtering, acknowledgment, and investigation tools.

Anomaly Lifecycle

Each anomaly moves through a defined lifecycle from detection to resolution.

| Status | Description |
| --- | --- |
| Active | Newly detected anomaly. Requires attention. |
| Acknowledged | An operator has seen the anomaly and is investigating. |
| Resolved | The anomaly has been resolved — metrics returned to normal. |
| Dismissed | Marked as a false positive or not actionable. |

Feedback & Learning

Providing feedback on anomalies helps LogPulse learn and improve over time. Each anomaly can be marked as a true positive or false positive.

| Feedback | Effect |
| --- | --- |
| True Positive | Confirms the anomaly was a real issue. Sensitivity may increase slightly for the affected metrics. |
| False Positive | Indicates the alert was not actionable. Sensitivity decreases to reduce similar alerts in the future. |

Tip: Consistent feedback is the most effective way to reduce alert noise. Even a few feedback entries per week significantly improve detection accuracy.
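
As a mental model, the feedback loop nudges a per-metric sensitivity multiplier up or down. The 10% step and the 0.5-2.0 clamp range below are purely illustrative assumptions about how such tuning could work.

```python
# Illustrative feedback-driven tuning of a per-metric sensitivity multiplier.
def adjust_multiplier(current: float, feedback: str) -> float:
    step = 0.1
    if feedback == "false_positive":
        current -= step  # less sensitive: fewer similar alerts
    elif feedback == "true_positive":
        current += step  # slightly more sensitive for this metric
    return min(2.0, max(0.5, round(current, 2)))  # keep within sane bounds

print(adjust_multiplier(1.0, "false_positive"))  # 0.9
```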

AI Investigation

LogPulse can automatically investigate anomalies using AI. When triggered (manually or via auto-investigate settings), the AI agent queries your logs using LPQL to understand what happened.

The AI investigator can:

  • Query logs around the anomaly time window to identify error patterns
  • Correlate events across multiple services
  • Identify root causes and contributing factors
  • Provide actionable recommendations for resolution
  • Learn patterns from confirmed investigations to suppress future false positives

Known Patterns

When AI investigations are confirmed with high confidence, LogPulse automatically learns patterns to suppress expected anomalies in the future.

| Pattern Type | Description | Example |
| --- | --- | --- |
| time_based | Recurring deviations at specific times. | Nightly batch jobs causing log volume spikes at 2 AM. |
| event_based | Deviations caused by known events. | Deployment rollouts causing brief error rate increases. |
| metric_correlation | Expected metric relationships. | High log volume correlating with high info_count during health checks. |

Note: Patterns are learned automatically from AI investigations with confidence > 0.7. They include a match counter and are continuously validated.
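
A time_based pattern match might look like the sketch below: if an anomaly falls inside the slots a pattern covers, it is suppressed and the pattern's match counter incremented. The pattern's data shape here is an assumption for illustration.

```python
# Suppress anomalies that match a learned time_based pattern (hypothetical shape).
from datetime import datetime

def matches_time_pattern(pattern: dict, ts: datetime) -> bool:
    """True if the anomaly timestamp falls in a slot the pattern covers."""
    return ts.weekday() in pattern["days"] and ts.hour in pattern["hours"]

# Example: nightly batch job causing a 2 AM spike, every day of the week.
nightly_batch = {"days": {0, 1, 2, 3, 4, 5, 6}, "hours": {2}, "matches": 0}

ts = datetime(2024, 1, 2, 2, 15)  # an anomaly detected at 2:15 AM
if matches_time_pattern(nightly_batch, ts):
    nightly_batch["matches"] += 1  # suppress the anomaly, record the match
print(nightly_batch["matches"])  # 1
```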

Notifications

Configure response channels to receive alerts when anomalies are detected. Each channel can be filtered by minimum severity to control alert volume.

Response Channels

LogPulse supports four notification channel types:

| Channel | Configuration |
| --- | --- |
| Email | Send alerts to one or more email addresses. Supports custom subject templates. |
| Slack | Post alerts to a Slack channel via incoming webhook URL. |
| Webhook | Send JSON payloads to any HTTP endpoint. Useful for custom integrations. |
| PagerDuty | Create incidents via PagerDuty Events API using a routing key. |

Severity Filtering

Each channel has a minimum severity setting. Only anomalies at or above that severity level trigger notifications through that channel. This lets you route critical alerts to PagerDuty while sending informational alerts to Slack.

Tip: Set up multiple channels with different severity thresholds for tiered alerting. For example: Slack for medium+, PagerDuty for critical only.
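
The tiered routing described above reduces to a severity comparison per channel. Channel names below are examples, not defaults.

```python
# Route an anomaly to every channel whose minimum severity it meets.
ORDER = ["low", "medium", "high", "critical"]

def channels_for(severity: str, channels: dict[str, str]) -> list[str]:
    """Return the channels that should be notified for this severity."""
    return [name for name, min_sev in channels.items()
            if ORDER.index(severity) >= ORDER.index(min_sev)]

channels = {"slack-alerts": "medium", "pagerduty": "critical"}
print(channels_for("high", channels))      # ['slack-alerts']
print(channels_for("critical", channels))  # ['slack-alerts', 'pagerduty']
```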

Service Dependencies

Map relationships between your services to enable impact analysis. Dependencies can be manually created or auto-discovered from log patterns.

| Dependency Type | Description |
| --- | --- |
| api | REST or gRPC service-to-service communication. |
| database | Database connections (PostgreSQL, MySQL, ClickHouse, etc.). |
| queue | Message queue consumers/producers (RabbitMQ, Kafka, SQS, etc.). |
| cache | Cache layer dependencies (Redis, Memcached, etc.). |
| storage | Object storage dependencies (S3, GCS, etc.). |
| auth | Authentication/authorization service dependencies. |

Impact Analysis

When an anomaly is detected, LogPulse analyzes the dependency graph to determine the blast radius. It identifies downstream services that may be affected and upstream services that may be the root cause.

Note: The impact badge on anomaly cards shows the number of potentially affected downstream services. This helps prioritize response by focusing on anomalies with the widest blast radius.
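
Blast-radius analysis is essentially a graph traversal. In this sketch, edges point from a service to the services that depend on it (its downstream consumers); the example graph is illustrative.

```python
# Breadth-first walk of the dependency graph to find affected downstream services.
from collections import deque

def downstream_impact(graph: dict[str, list[str]], root: str) -> set[str]:
    """Collect every service reachable downstream from the anomalous root."""
    seen, queue = set(), deque([root])
    while queue:
        svc = queue.popleft()
        for dep in graph.get(svc, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

graph = {"auth": ["api"], "api": ["web", "mobile"], "web": []}
print(downstream_impact(graph, "auth"))  # blast radius: api, web, mobile
```

Running the same walk with edges reversed yields the upstream candidates for root cause.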

API Reference

All anomaly detection features are accessible via the REST API. All endpoints require authentication and are scoped to the current organization.

Service Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/v1/detect/services | List all monitored services with health status and active anomaly counts. |
| POST | /api/v1/detect/services | Register a new service for monitoring. |
| PUT | /api/v1/detect/services/:id | Update service configuration (sensitivity, intervals, thresholds). |
| POST | /api/v1/detect/services/:id/activate | Activate monitoring for a discovered service. |
| GET | /api/v1/detect/services/health | Get overall service health summary (healthy, warning, critical counts). |
| GET | /api/v1/detect/services/:id/baseline-status | Get visual baseline gauges with current vs. expected metrics. |

Anomaly Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/v1/detect/anomalies | Query anomalies with filters (status, service, severity, time range). |
| GET | /api/v1/detect/anomalies/:id | Get detailed anomaly information including metrics and AI investigation results. |
| PUT | /api/v1/detect/anomalies/:id/acknowledge | Mark an anomaly as acknowledged. |
| PUT | /api/v1/detect/anomalies/:id/dismiss | Dismiss an anomaly as a false positive. |
| PUT | /api/v1/detect/anomalies/:id/feedback | Provide feedback (true_positive or false_positive). |
| POST | /api/v1/detect/anomalies/:id/investigate | Trigger an AI-powered investigation for this anomaly. |
| GET | /api/v1/detect/anomalies/timeline | Get anomaly timeline data for visualization. |
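
As an example of calling the feedback endpoint, the snippet below constructs (but does not send) the request. The base URL, the anomaly id, and the bearer-token scheme are placeholder assumptions; substitute your deployment's host and credentials.

```python
# Build a PUT request for the anomaly feedback endpoint (not sent here).
import json
from urllib.request import Request

anomaly_id = "abc123"  # hypothetical anomaly id
req = Request(
    url=f"https://logpulse.example.com/api/v1/detect/anomalies/{anomaly_id}/feedback",
    data=json.dumps({"feedback": "true_positive"}).encode(),
    headers={"Authorization": "Bearer <token>", "Content-Type": "application/json"},
    method="PUT",
)
print(req.get_method(), req.full_url)
```

Pass the request to `urllib.request.urlopen` (or use any HTTP client) to actually submit the feedback.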

Response Channel Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/v1/detect/response-channels | List all configured notification channels. |
| POST | /api/v1/detect/response-channels | Create a new notification channel (email, Slack, webhook, PagerDuty). |
| PUT | /api/v1/detect/response-channels/:id | Update channel configuration. |
| DELETE | /api/v1/detect/response-channels/:id | Delete a notification channel. |

Dependency Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/v1/detect/dependencies | Get the service dependency graph. |
| POST | /api/v1/detect/dependencies | Create a dependency relationship between two services. |
| DELETE | /api/v1/detect/dependencies/:id | Remove a dependency relationship. |
| POST | /api/v1/detect/dependencies/:id/confirm | Confirm an auto-discovered dependency. |
POST/api/v1/detect/dependencies/:id/confirmConfirm an auto-discovered dependency.