Anomaly Detection

LogPulse Anomaly Detection automatically learns normal behavior for each of your services and alerts you when metrics deviate from expected baselines. Powered by statistical analysis and AI-driven investigation, it catches issues before they become incidents.

Overview

The anomaly detection system continuously monitors your services by comparing real-time metrics against learned baselines. It uses a modified Z-score algorithm with time-of-day and day-of-week seasonality to minimize false positives while catching genuine issues early.

Automatic Baselines

Learns normal behavior from 7 days of historical data, updated hourly.

Smart Detection

Modified Z-score with IQR-based thresholds adapts to each service's patterns.

AI Investigation

Automated root cause analysis queries your logs to explain detected anomalies.

Adaptive Learning

Sensitivity auto-tunes based on your feedback to reduce noise over time.

How It Works

The detection pipeline runs continuously in the background with no manual setup required beyond activating your services.

1. Service Discovery: LogPulse automatically discovers services from your log data based on the source field.

2. Baseline Learning: Statistical baselines are computed hourly from a 7-day rolling window, grouped by day-of-week and hour-of-day.

3. Anomaly Detection: Current metrics are compared against baselines using the modified Z-score. Deviations above the threshold trigger anomalies.

4. Correlation & Impact: Multiple anomalous metrics are correlated, severity is escalated, and downstream impact is analyzed via the dependency graph.

5. Notification & Response: Alerts are sent to configured channels (Slack, email, webhook, PagerDuty), filtered by minimum severity.

Service Discovery

LogPulse automatically discovers services by analyzing the source field in your logs. Any service with more than 100 log entries in the last 7 days is detected and added to your monitored services list.

Discovered services start in the discovered state, awaiting activation. You can activate, pause, or ignore them from the Detect & Response dashboard.
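
The discovery rule above can be sketched in a few lines. This is an illustrative reimplementation, not LogPulse's code; in production the counts come from ClickHouse rather than an in-memory list.

```python
# Sketch of the discovery rule: any source with MORE than 100 log entries
# in the trailing 7-day window becomes a "discovered" service.
from collections import Counter

DISCOVERY_THRESHOLD = 100  # minimum log entries in the last 7 days

def discover_services(log_sources):
    """Return {source: count} for sources that qualify as discovered services."""
    counts = Counter(log_sources)
    return {src: n for src, n in counts.items() if n > DISCOVERY_THRESHOLD}

# "api" crosses the threshold, "cron" does not.
logs = ["api"] * 150 + ["cron"] * 40
print(discover_services(logs))  # {'api': 150}
```

Note that the threshold is strict: a source with exactly 100 entries is not yet discovered.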

| Status | Description |
| --- | --- |
| discovered | Newly found service, not yet monitored. Awaiting activation. |
| active | Actively monitored. Baselines are computed and anomalies are detected. |
| paused | Monitoring temporarily suspended. Baselines are retained. |
| ignored | Service is excluded from monitoring entirely. |

Tip: Discovery runs hourly. New services typically appear within an hour of first sending logs.

Baselines

Baselines represent the normal behavior of a service. They are computed from historical metrics and used as the reference point for anomaly detection.

Baseline Computation

Baselines are recalculated every hour using the last 7 days of metrics from the service_metrics_5m materialized view in ClickHouse. The following statistics are computed for each metric:

| Statistic | Description |
| --- | --- |
| Median | The middle value — robust against outliers. |
| Q1 (25th percentile) | Lower quartile boundary of the normal range. |
| Q3 (75th percentile) | Upper quartile boundary of the normal range. |
| IQR | Interquartile range (Q3 - Q1); measures the spread of normal values. |
| Mean | Average value over the baseline window. |
| Standard Deviation | Measures overall variability across the baseline window. |
| Sample Count | Number of data points used to compute the baseline. |
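
These statistics can all be computed with the Python standard library. The sketch below is illustrative; the production implementation's quartile convention may differ slightly from `statistics.quantiles`' default (exclusive) method.

```python
# Compute the per-metric baseline statistics listed above.
import statistics

def compute_baseline(values):
    """Return the baseline statistics for one metric's sample window."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                      # spread of the normal range
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sensitive to outliers, unlike IQR
        "sample_count": len(values),
    }

print(compute_baseline([1, 2, 3, 4, 5]))
```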

Time-Slot Baselines

Baselines are computed per day-of-week (0-6) and hour-of-day (0-23), giving 168 unique time slots per metric per service. This captures daily and weekly seasonality patterns.

For example, a service that sees higher traffic on weekday mornings will have different baselines for Monday 9 AM vs. Sunday 3 AM, preventing normal traffic fluctuations from triggering false alerts.

Note: Baselines are cached in Redis with a 2-hour TTL for fast lookups during detection. They are recomputed hourly from ClickHouse.
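
The 168-slot keying scheme can be sketched as a simple timestamp-to-slot mapping. Two assumptions here: Monday is day 0 (the docs only say 0-6), and timestamps are interpreted in UTC.

```python
# Derive the (day-of-week, hour-of-day) baseline slot from a timestamp.
from datetime import datetime, timezone

def baseline_slot(ts: datetime) -> tuple[int, int]:
    """Return the (day_of_week 0-6, hour_of_day 0-23) slot for a timestamp."""
    return ts.weekday(), ts.hour  # Monday = 0, an assumed convention

# Monday 09:00 and Sunday 03:00 map to different slots, so each gets
# its own learned baseline.
mon = datetime(2024, 1, 1, 9, tzinfo=timezone.utc)   # a Monday
sun = datetime(2024, 1, 7, 3, tzinfo=timezone.utc)   # a Sunday
print(baseline_slot(mon), baseline_slot(sun))  # (0, 9) (6, 3)
```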

Detection Algorithm

LogPulse uses a modified Z-score algorithm combined with multi-metric correlation to detect anomalies with high accuracy.

Modified Z-Score

The modified Z-score measures how far a value deviates from the median, normalized by the IQR. It is more robust than standard Z-scores because it uses the median and IQR instead of mean and standard deviation, making it resistant to outlier contamination.

```python
# Modified Z-score formula
modified_z_score = 0.6745 * (value - median) / iqr
```

A score of 0 means the value equals the median. Higher absolute scores indicate larger deviations from normal behavior. The 0.6745 constant normalizes the IQR to be comparable with standard deviations.
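
As a small function, the formula looks like the sketch below. The guard against a zero IQR is an assumption on our part: a perfectly flat baseline would otherwise divide by zero.

```python
# Modified Z-score as a function, with a guard for flat baselines.
def modified_z_score(value, median, iqr):
    if iqr == 0:
        return 0.0  # no spread in the baseline; treat as non-anomalous
    return 0.6745 * (value - median) / iqr

print(modified_z_score(100, 100, 20))  # 0.0 — value equals the median
print(modified_z_score(250, 100, 20))  # 5.05875 — well into critical range
```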

Metric Correlation

When multiple metrics for the same service are anomalous simultaneously, the system escalates the severity. If 2 or more metrics exceed their thresholds at the same time, the overall severity is upgraded by one level.

Note: Correlation helps distinguish isolated metric spikes from systemic service problems. A simultaneous rise in error_count and drop in log_count is more concerning than either alone.
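
The escalation rule can be sketched as follows. The severity ordering comes from the Severity Levels table below; the exact combination logic is an illustrative assumption.

```python
# Escalate severity by one level when 2+ metrics are anomalous at once.
LEVELS = ["low", "medium", "high", "critical"]

def correlate(severities: list[str]) -> str:
    """Combine per-metric severities into an overall service severity."""
    worst = max(severities, key=LEVELS.index)
    if len(severities) >= 2 and worst != "critical":
        return LEVELS[LEVELS.index(worst) + 1]  # bump one level
    return worst

print(correlate(["medium"]))         # medium — single metric, no bump
print(correlate(["medium", "low"]))  # high — two metrics, escalated
```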

Severity Levels

Anomaly severity is determined by the modified Z-score magnitude. Higher scores indicate more significant deviations from normal behavior.

| Severity | Z-Score Threshold | Description |
| --- | --- | --- |
| Low | > 1.5 | Minor deviation. Might indicate an emerging trend worth monitoring. |
| Medium | > 2.5 | Notable deviation. Likely requires attention within a reasonable timeframe. |
| High | > 3.5 | Significant deviation. Strongly suggests a real problem requiring prompt investigation. |
| Critical | > 5.0 | Extreme deviation. Almost certainly a real incident requiring immediate action. |

Configuring Services

Each monitored service can be individually configured to tune detection behavior to your needs.

Sensitivity

Sensitivity controls how easily anomalies are triggered. It acts as a multiplier on the Z-score thresholds.

| Level | Behavior |
| --- | --- |
| Low | Fewer alerts. Only very significant deviations trigger anomalies. Good for noisy services. |
| Medium (default) | Balanced detection. Suitable for most services. |
| High | More alerts. Catches smaller deviations. Good for critical services where early warning matters. |

You can also set per-metric sensitivity multipliers for fine-grained control. For example, increase sensitivity on error_rate while keeping log_count at default.
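
One way to picture the multiplier mechanics is below. The specific multiplier values (0.7 / 1.0 / 1.4) are illustrative assumptions, not LogPulse's actual constants; the point is that high sensitivity lowers every threshold so smaller deviations alert.

```python
# Sensitivity and per-metric multipliers applied to the base Z-score thresholds.
BASE_THRESHOLDS = {"low": 1.5, "medium": 2.5, "high": 3.5, "critical": 5.0}
MULTIPLIERS = {"low": 1.4, "medium": 1.0, "high": 0.7}  # high sensitivity = lower bar

def effective_thresholds(sensitivity: str, per_metric: float = 1.0) -> dict:
    m = MULTIPLIERS[sensitivity] * per_metric
    return {level: t * m for level, t in BASE_THRESHOLDS.items()}

# High sensitivity scales every threshold down.
print(effective_thresholds("high"))
```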

Detection Intervals

Control how frequently detection runs and how much data it considers for each check.

| Setting | Options | Default |
| --- | --- | --- |
| Detection Interval | 30s, 1m, 2m, 5m | 1m |
| Detection Lookback | 5m, 10m, 15m, 30m | 10m |

Tip: Shorter intervals provide faster detection but increase compute load. For most services, a 1-minute interval with a 10-minute lookback is a good balance.

Urgent Detection

Urgent detection provides a fast-path alert when error rates spike dramatically, bypassing the normal baseline learning phase. This is useful for catching catastrophic failures immediately.

| Setting | Description | Default |
| --- | --- | --- |
| Enabled | Toggle urgent detection on or off for this service. | Off |
| Error Rate Threshold | Error-rate percentage that triggers an urgent alert. | 50% |
| Min Batch Size | Minimum number of recent logs required before evaluating the threshold. | 100 |
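
The fast path reduces to a raw error-rate check once enough recent logs exist, with no baseline involved. Whether the comparison is inclusive (>=) is an assumption in this sketch.

```python
# Urgent detection fast path: raw error rate over a minimum batch of logs.
def urgent_check(error_count: int, total_count: int,
                 threshold_pct: float = 50.0, min_batch: int = 100) -> bool:
    if total_count < min_batch:
        return False  # not enough recent logs to evaluate
    return (error_count / total_count) * 100 >= threshold_pct

print(urgent_check(60, 100))  # True — 60% errors over a full batch
print(urgent_check(60, 80))   # False — batch below the 100-log minimum
```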

Tracked Metrics

LogPulse tracks the following metrics per service, aggregated in 5-minute windows from the ClickHouse materialized view:

| Metric | Description | Unit |
| --- | --- | --- |
| log_count | Total number of log entries received. | count / 5m |
| error_count | Number of logs with level=error or level=fatal. | count / 5m |
| error_rate | Percentage of error logs out of total log count. | % |
| warn_count | Number of logs with level=warn or level=warning. | count / 5m |
| warn_rate | Percentage of warning logs out of total log count. | % |
| info_count | Number of logs with level=info. | count / 5m |
| info_rate | Percentage of info logs out of total log count. | % |

Note: Metrics are sourced from the service_metrics_5m materialized view, which is automatically maintained by ClickHouse.

Managing Anomalies

The Detect & Response dashboard provides a central view of all detected anomalies with filtering, acknowledgment, and investigation tools.

Anomaly Lifecycle

Each anomaly moves through a defined lifecycle from detection to resolution.

| Status | Description |
| --- | --- |
| Active | Newly detected anomaly. Requires attention. |
| Acknowledged | An operator has seen the anomaly and is investigating. |
| Resolved | The anomaly has been resolved — metrics returned to normal. |
| Dismissed | Marked as a false positive or not actionable. |

Feedback & Learning

Providing feedback on anomalies helps LogPulse learn and improve over time. Each anomaly can be marked as a true positive or false positive.

| Feedback | Effect |
| --- | --- |
| True Positive | Confirms the anomaly was a real issue. Sensitivity may increase slightly for the affected metrics. |
| False Positive | Indicates the alert was not actionable. Sensitivity decreases to reduce similar alerts in the future. |

Tip: Consistent feedback is the most effective way to reduce alert noise. Even a few feedback entries per week significantly improve detection accuracy.
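
As a mental model, the feedback loop nudges a per-metric sensitivity multiplier up or down. The 10% step and the 0.5-2.0 clamp range below are purely illustrative assumptions about how such tuning could work.

```python
# Illustrative feedback-driven tuning of a per-metric sensitivity multiplier.
def adjust_multiplier(current: float, feedback: str) -> float:
    step = 0.1
    if feedback == "false_positive":
        current -= step  # less sensitive: fewer similar alerts
    elif feedback == "true_positive":
        current += step  # slightly more sensitive for this metric
    return min(2.0, max(0.5, round(current, 2)))  # keep within sane bounds

print(adjust_multiplier(1.0, "false_positive"))  # 0.9
```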

AI Investigation

LogPulse can automatically investigate anomalies using AI. When triggered (manually or via auto-investigate settings), the AI agent queries your logs using LPQL to understand what happened.

The AI investigator can:

  • Query logs around the anomaly time window to identify error patterns
  • Correlate events across multiple services
  • Identify root causes and contributing factors
  • Provide actionable recommendations for resolution
  • Learn patterns from confirmed investigations to suppress future false positives

Known Patterns

When AI investigations are confirmed with high confidence, LogPulse automatically learns patterns to suppress expected anomalies in the future.

| Pattern Type | Description | Example |
| --- | --- | --- |
| time_based | Recurring deviations at specific times. | Nightly batch jobs causing log volume spikes at 2 AM. |
| event_based | Deviations caused by known events. | Deployment rollouts causing brief error rate increases. |
| metric_correlation | Expected metric relationships. | High log volume correlating with high info_count during health checks. |

Note: Patterns are learned automatically from AI investigations with confidence > 0.7. They include a match counter and are continuously validated.
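
A time_based pattern match might look like the sketch below: if an anomaly falls inside the slots a pattern covers, it is suppressed and the pattern's match counter incremented. The pattern's data shape here is an assumption for illustration.

```python
# Suppress anomalies that match a learned time_based pattern (hypothetical shape).
from datetime import datetime

def matches_time_pattern(pattern: dict, ts: datetime) -> bool:
    """True if the anomaly timestamp falls in a slot the pattern covers."""
    return ts.weekday() in pattern["days"] and ts.hour in pattern["hours"]

# Example: nightly batch job causing a 2 AM spike, every day of the week.
nightly_batch = {"days": {0, 1, 2, 3, 4, 5, 6}, "hours": {2}, "matches": 0}

ts = datetime(2024, 1, 2, 2, 15)  # an anomaly detected at 2:15 AM
if matches_time_pattern(nightly_batch, ts):
    nightly_batch["matches"] += 1  # suppress the anomaly, record the match
print(nightly_batch["matches"])  # 1
```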

Notifications

Configure response channels to receive alerts when anomalies are detected. Each channel can be filtered by minimum severity to control alert volume.

Response Channels

LogPulse supports four notification channel types:

| Channel | Configuration |
| --- | --- |
| Email | Send alerts to one or more email addresses. Supports custom subject templates. |
| Slack | Post alerts to a Slack channel via incoming webhook URL. |
| Webhook | Send JSON payloads to any HTTP endpoint. Useful for custom integrations. |
| PagerDuty | Create incidents via PagerDuty Events API using a routing key. |

Severity Filtering

Each channel has a minimum severity setting. Only anomalies at or above that severity level trigger notifications through that channel. This lets you route critical alerts to PagerDuty while sending informational alerts to Slack.

Tip: Set up multiple channels with different severity thresholds for tiered alerting. For example: Slack for medium+, PagerDuty for critical only.
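
The tiered routing described above reduces to a severity comparison per channel. Channel names below are examples, not defaults.

```python
# Route an anomaly to every channel whose minimum severity it meets.
ORDER = ["low", "medium", "high", "critical"]

def channels_for(severity: str, channels: dict[str, str]) -> list[str]:
    """Return the channels that should be notified for this severity."""
    return [name for name, min_sev in channels.items()
            if ORDER.index(severity) >= ORDER.index(min_sev)]

channels = {"slack-alerts": "medium", "pagerduty": "critical"}
print(channels_for("high", channels))      # ['slack-alerts']
print(channels_for("critical", channels))  # ['slack-alerts', 'pagerduty']
```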

Service Dependencies

Map relationships between your services to enable impact analysis. Dependencies can be manually created or auto-discovered from log patterns.

| Dependency Type | Description |
| --- | --- |
| api | REST or gRPC service-to-service communication. |
| database | Database connections (PostgreSQL, MySQL, ClickHouse, etc.). |
| queue | Message queue consumers/producers (RabbitMQ, Kafka, SQS, etc.). |
| cache | Cache layer dependencies (Redis, Memcached, etc.). |
| storage | Object storage dependencies (S3, GCS, etc.). |
| auth | Authentication/authorization service dependencies. |

Impact Analysis

When an anomaly is detected, LogPulse analyzes the dependency graph to determine the blast radius. It identifies downstream services that may be affected and upstream services that may be the root cause.

Note: The impact badge on anomaly cards shows the number of potentially affected downstream services. This helps prioritize response by focusing on anomalies with the widest blast radius.
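
Blast-radius analysis is essentially a graph traversal. In this sketch, edges point from a service to the services that depend on it (its downstream consumers); the example graph is illustrative.

```python
# Breadth-first walk of the dependency graph to find affected downstream services.
from collections import deque

def downstream_impact(graph: dict[str, list[str]], root: str) -> set[str]:
    """Collect every service reachable downstream from the anomalous root."""
    seen, queue = set(), deque([root])
    while queue:
        svc = queue.popleft()
        for dep in graph.get(svc, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

graph = {"auth": ["api"], "api": ["web", "mobile"], "web": []}
print(downstream_impact(graph, "auth"))  # blast radius: api, web, mobile
```

Running the same walk with edges reversed yields the upstream candidates for root cause.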

API Reference

All anomaly detection features are accessible via the REST API. All endpoints require authentication and are scoped to the current organization.

Service Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/v1/detect/services | List all monitored services with health status and active anomaly counts. |
| POST | /api/v1/detect/services | Register a new service for monitoring. |
| PUT | /api/v1/detect/services/:id | Update service configuration (sensitivity, intervals, thresholds). |
| POST | /api/v1/detect/services/:id/activate | Activate monitoring for a discovered service. |
| GET | /api/v1/detect/services/health | Get overall service health summary (healthy, warning, critical counts). |
| GET | /api/v1/detect/services/:id/baseline-status | Get visual baseline gauges with current vs. expected metrics. |

Anomaly Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/v1/detect/anomalies | Query anomalies with filters (status, service, severity, time range). |
| GET | /api/v1/detect/anomalies/:id | Get detailed anomaly information including metrics and AI investigation results. |
| PUT | /api/v1/detect/anomalies/:id/acknowledge | Mark an anomaly as acknowledged. |
| PUT | /api/v1/detect/anomalies/:id/dismiss | Dismiss an anomaly as a false positive. |
| PUT | /api/v1/detect/anomalies/:id/feedback | Provide feedback (true_positive or false_positive). |
| POST | /api/v1/detect/anomalies/:id/investigate | Trigger an AI-powered investigation for this anomaly. |
| GET | /api/v1/detect/anomalies/timeline | Get anomaly timeline data for visualization. |
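
As an example of calling the feedback endpoint, the snippet below constructs (but does not send) the request. The base URL, the anomaly id, and the bearer-token scheme are placeholder assumptions; substitute your deployment's host and credentials.

```python
# Build a PUT request for the anomaly feedback endpoint (not sent here).
import json
from urllib.request import Request

anomaly_id = "abc123"  # hypothetical anomaly id
req = Request(
    url=f"https://logpulse.example.com/api/v1/detect/anomalies/{anomaly_id}/feedback",
    data=json.dumps({"feedback": "true_positive"}).encode(),
    headers={"Authorization": "Bearer <token>", "Content-Type": "application/json"},
    method="PUT",
)
print(req.get_method(), req.full_url)
```

Pass the request to `urllib.request.urlopen` (or use any HTTP client) to actually submit the feedback.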

Response Channel Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/v1/detect/response-channels | List all configured notification channels. |
| POST | /api/v1/detect/response-channels | Create a new notification channel (email, Slack, webhook, PagerDuty). |
| PUT | /api/v1/detect/response-channels/:id | Update channel configuration. |
| DELETE | /api/v1/detect/response-channels/:id | Delete a notification channel. |

Dependency Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/v1/detect/dependencies | Get the service dependency graph. |
| POST | /api/v1/detect/dependencies | Create a dependency relationship between two services. |
| DELETE | /api/v1/detect/dependencies/:id | Remove a dependency relationship. |
| POST | /api/v1/detect/dependencies/:id/confirm | Confirm an auto-discovered dependency. |
POST/api/v1/detect/dependencies/:id/confirmConfirm an auto-discovered dependency.