Anomaly Detection
LogPulse Anomaly Detection automatically learns normal behavior for each of your services and alerts you when metrics deviate from expected baselines. Powered by statistical analysis and AI-driven investigation, it catches issues before they become incidents.
Overview
The anomaly detection system continuously monitors your services by comparing real-time metrics against learned baselines. It uses a modified Z-score algorithm with time-of-day and day-of-week seasonality to minimize false positives while catching genuine issues early.
Automatic Baselines
Learns normal behavior from 7 days of historical data, updated hourly.
Smart Detection
Modified Z-score with IQR-based thresholds adapts to each service's patterns.
AI Investigation
Automated root cause analysis queries your logs to explain detected anomalies.
Adaptive Learning
Sensitivity auto-tunes based on your feedback to reduce noise over time.
How It Works
The detection pipeline runs continuously in the background with no manual setup required beyond activating your services.
Service Discovery
LogPulse automatically discovers services from your log data based on the source field.
Baseline Learning
Statistical baselines are computed hourly from a 7-day rolling window, grouped by day-of-week and hour-of-day.
Anomaly Detection
Current metrics are compared against baselines using the modified Z-score. Deviations above the threshold trigger anomalies.
Correlation & Impact
Multiple anomalous metrics are correlated, severity is escalated, and downstream impact is analyzed via the dependency graph.
Notification & Response
Alerts are sent to configured channels (Slack, email, webhook, PagerDuty) filtered by minimum severity.
Service Discovery
LogPulse automatically discovers services by analyzing the source field in your logs. Any service with more than 100 log entries in the last 7 days is detected and added to your monitored services list.
Discovered services start in a pending state. You can activate, pause, or ignore them from the Detect & Response dashboard.
| Status | Description |
|---|---|
| discovered | Newly found service, not yet monitoring. Awaiting activation. |
| active | Actively monitored. Baselines are computed and anomalies are detected. |
| paused | Monitoring temporarily suspended. Baselines are retained. |
| ignored | Service is excluded from monitoring entirely. |
Baselines
Baselines represent the normal behavior of a service. They are computed from historical metrics and used as the reference point for anomaly detection.
Baseline Computation
Baselines are recalculated every hour using the last 7 days of metrics from the service_metrics_5m materialized view in ClickHouse. The following statistics are computed for each metric:
| Statistic | Description |
|---|---|
| Median | The middle value — robust against outliers. |
| Q1 (25th percentile) | Lower quartile boundary of normal range. |
| Q3 (75th percentile) | Upper quartile boundary of normal range. |
| IQR | Interquartile range (Q3 - Q1), measures spread of normal values. |
| Mean | Average value over the baseline window. |
| Standard Deviation | Measures overall variability around the mean. |
| Sample Count | Number of data points used to compute the baseline. |
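The statistics in the table above can be sketched in a few lines. This is illustrative only: the real computation runs inside ClickHouse against the service_metrics_5m view, and the function name and return shape are assumptions.

```python
import statistics

def compute_baseline(samples: list[float]) -> dict:
    """Compute the baseline statistics for one metric in one time slot."""
    q1, median, q3 = statistics.quantiles(samples, n=4)  # quartile boundaries
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                       # spread of the normal range
        "mean": statistics.fmean(samples),
        "stddev": statistics.stdev(samples),
        "sample_count": len(samples),
    }
```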
Time-Slot Baselines
Baselines are computed per day-of-week (0-6) and hour-of-day (0-23), giving 168 unique time slots per metric per service. This captures daily and weekly seasonality patterns.
For example, a service that sees higher traffic on weekday mornings will have different baselines for Monday 9 AM vs. Sunday 3 AM, preventing normal traffic fluctuations from triggering false alerts.
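The 7 x 24 slot lookup can be sketched with a simple key function. Names are illustrative, not LogPulse internals, and this sketch uses Python's weekday numbering (0 = Monday).

```python
from datetime import datetime

def time_slot(ts: datetime) -> tuple[int, int]:
    """Return the (day_of_week, hour_of_day) baseline key for a timestamp."""
    return (ts.weekday(), ts.hour)  # weekday(): 0 = Monday ... 6 = Sunday

# Monday 9 AM and Sunday 3 AM fall in different slots, so each is
# compared against its own learned baseline.
monday_9am = time_slot(datetime(2024, 1, 1, 9))   # 2024-01-01 is a Monday
sunday_3am = time_slot(datetime(2024, 1, 7, 3))
```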
Detection Algorithm
LogPulse uses a modified Z-score algorithm combined with multi-metric correlation to detect anomalies with high accuracy.
Modified Z-Score
The modified Z-score measures how far a value deviates from the median, normalized by the IQR. It is more robust than standard Z-scores because it uses the median and IQR instead of mean and standard deviation, making it resistant to outlier contamination.
# Modified Z-Score formula
modified_z_score = 0.6745 * (value - median) / iqr

A score of 0 means the value equals the median. Higher absolute scores indicate larger deviations from normal behavior. The 0.6745 constant normalizes the IQR-based score to be comparable with standard deviations.
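The formula above translates directly into a runnable check. This is a sketch: the zero-IQR handling and function name are assumptions, not LogPulse internals.

```python
def modified_z_score(value: float, median: float, iqr: float) -> float:
    """Distance from the baseline median, normalized by the IQR."""
    if iqr == 0:
        return 0.0  # flat baseline: no meaningful spread to normalize against
    return 0.6745 * (value - median) / iqr

# A value three IQRs above the median scores 0.6745 * 3 = 2.0235
score = modified_z_score(value=250.0, median=100.0, iqr=50.0)
```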
Metric Correlation
When multiple metrics for the same service are anomalous simultaneously, the system escalates the severity. If 2 or more metrics exceed their thresholds at the same time, the overall severity is upgraded by one level.
Severity Levels
Anomaly severity is determined by the modified Z-score magnitude. Higher scores indicate more significant deviations from normal behavior.
| Severity | Z-Score Threshold | Description |
|---|---|---|
| Low | > 1.5 | Minor deviation. Might indicate an emerging trend worth monitoring. |
| Medium | > 2.5 | Notable deviation. Likely requires attention within a reasonable timeframe. |
| High | > 3.5 | Significant deviation. Strongly suggests a real problem requiring prompt investigation. |
| Critical | > 5.0 | Extreme deviation. Almost certainly a real incident requiring immediate action. |
Configuring Services
Each monitored service can be individually configured to tune detection behavior to your needs.
Sensitivity
Sensitivity controls how easily anomalies are triggered. It acts as a multiplier on the Z-score thresholds.
| Level | Behavior |
|---|---|
| Low | Fewer alerts. Only very significant deviations trigger anomalies. Good for noisy services. |
| Medium (default) | Balanced detection. Suitable for most services. |
| High | More alerts. Catches smaller deviations. Good for critical services where early warning matters. |
You can also set per-metric sensitivity multipliers for fine-grained control. For example, increase sensitivity on error_rate while keeping log_count at default.
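One way to picture sensitivity as a threshold multiplier is sketched below. The specific factor values and the division by the per-metric multiplier are assumptions for illustration; LogPulse's actual factors may differ.

```python
BASE_THRESHOLDS = {"low": 1.5, "medium": 2.5, "high": 3.5, "critical": 5.0}

# Assumed factors: lower sensitivity raises the effective thresholds
# (fewer alerts), higher sensitivity lowers them (more alerts).
SENSITIVITY_FACTOR = {"low": 1.5, "medium": 1.0, "high": 0.75}

def effective_thresholds(sensitivity: str, metric_multiplier: float = 1.0) -> dict:
    """Scale base thresholds by service sensitivity and a per-metric multiplier."""
    factor = SENSITIVITY_FACTOR[sensitivity] / metric_multiplier
    return {sev: t * factor for sev, t in BASE_THRESHOLDS.items()}
```

Under this sketch, raising the per-metric multiplier on error_rate lowers its thresholds without touching log_count.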
Detection Intervals
Control how frequently detection runs and how much data it considers for each check.
| Setting | Options | Default |
|---|---|---|
| Detection Interval | 30s, 1m, 2m, 5m | 1m |
| Detection Lookback | 5m, 10m, 15m, 30m | 10m |
Urgent Detection
Urgent detection provides a fast-path alert when error rates spike dramatically, bypassing the normal baseline learning phase. This is useful for catching catastrophic failures immediately.
| Setting | Description | Default |
|---|---|---|
| Enabled | Toggle urgent detection on or off for this service. | Off |
| Error Rate Threshold | Error rate percentage that triggers an urgent alert. | 50% |
| Min Batch Size | Minimum number of recent logs required before evaluating the threshold. | 100 |
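The urgent fast path is simple enough to sketch end to end: no baselines, just a raw error-rate check over a recent batch. Defaults match the table above; the function and field names are illustrative.

```python
def urgent_check(recent_logs: list[dict],
                 error_rate_threshold: float = 50.0,
                 min_batch_size: int = 100) -> bool:
    """Return True when an urgent alert should fire for this batch."""
    if len(recent_logs) < min_batch_size:
        return False  # too few logs to evaluate the threshold reliably
    errors = sum(1 for log in recent_logs if log["level"] in ("error", "fatal"))
    error_rate = 100.0 * errors / len(recent_logs)
    return error_rate >= error_rate_threshold
```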
Tracked Metrics
LogPulse tracks the following metrics per service, aggregated in 5-minute windows from the ClickHouse materialized view:
| Metric | Description | Unit |
|---|---|---|
| log_count | Total number of log entries received. | count / 5m |
| error_count | Number of logs with level=error or level=fatal. | count / 5m |
| error_rate | Percentage of error logs out of total log count. | % |
| warn_count | Number of logs with level=warn or level=warning. | count / 5m |
| warn_rate | Percentage of warning logs out of total log count. | % |
| info_count | Number of logs with level=info. | count / 5m |
| info_rate | Percentage of info logs out of total log count. | % |
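The aggregation behind the table above can be sketched as a single pass over one 5-minute window of logs. The level groupings follow the table; everything else is illustrative.

```python
def window_metrics(logs: list[dict]) -> dict:
    """Aggregate one 5-minute window of logs into the tracked metrics."""
    total = len(logs)

    def count(*levels: str) -> int:
        return sum(1 for log in logs if log["level"] in levels)

    def pct(n: int) -> float:
        return 100.0 * n / total if total else 0.0

    errors = count("error", "fatal")
    warns = count("warn", "warning")
    infos = count("info")
    return {
        "log_count": total,
        "error_count": errors, "error_rate": pct(errors),
        "warn_count": warns,   "warn_rate": pct(warns),
        "info_count": infos,   "info_rate": pct(infos),
    }
```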
Managing Anomalies
The Detect & Response dashboard provides a central view of all detected anomalies with filtering, acknowledgment, and investigation tools.
Anomaly Lifecycle
Each anomaly moves through a defined lifecycle from detection to resolution.
| Status | Description |
|---|---|
| Active | Newly detected anomaly. Requires attention. |
| Acknowledged | An operator has seen the anomaly and is investigating. |
| Resolved | The anomaly has been resolved — metrics returned to normal. |
| Dismissed | Marked as a false positive or not actionable. |
Feedback & Learning
Providing feedback on anomalies helps LogPulse learn and improve over time. Each anomaly can be marked as a true positive or false positive.
| Feedback | Effect |
|---|---|
| True Positive | Confirms the anomaly was a real issue. Sensitivity may increase slightly for the affected metrics. |
| False Positive | Indicates the alert was not actionable. Sensitivity decreases to reduce similar alerts in the future. |
AI Investigation
LogPulse can automatically investigate anomalies using AI. When triggered (manually or via auto-investigate settings), the AI agent queries your logs using LPQL to understand what happened.
The AI investigator can:
- Query logs around the anomaly time window to identify error patterns
- Correlate events across multiple services
- Identify root causes and contributing factors
- Provide actionable recommendations for resolution
- Learn patterns from confirmed investigations to suppress future false positives
Known Patterns
When AI investigations are confirmed with high confidence, LogPulse automatically learns patterns to suppress expected anomalies in the future.
| Pattern Type | Description | Example |
|---|---|---|
| time_based | Recurring deviations at specific times. | Nightly batch jobs causing log volume spikes at 2 AM. |
| event_based | Deviations caused by known events. | Deployment rollouts causing brief error rate increases. |
| metric_correlation | Expected metric relationships. | High log volume correlating with high info_count during health checks. |
Notifications
Configure response channels to receive alerts when anomalies are detected. Each channel can be filtered by minimum severity to control alert volume.
Response Channels
LogPulse supports four notification channel types:
| Channel | Configuration |
|---|---|
| Email | Send alerts to one or more email addresses. Supports custom subject templates. |
| Slack | Post alerts to a Slack channel via incoming webhook URL. |
| Webhook | Send JSON payloads to any HTTP endpoint. Useful for custom integrations. |
| PagerDuty | Create incidents via PagerDuty Events API using a routing key. |
Severity Filtering
Each channel has a minimum severity setting. Only anomalies at or above that severity level trigger notifications through that channel. This lets you route critical alerts to PagerDuty while sending informational alerts to Slack.
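Routing by minimum severity amounts to a rank comparison, sketched below with an assumed channel shape (`name`, `min_severity`); field names are illustrative.

```python
SEVERITY_ORDER = ["low", "medium", "high", "critical"]

def channels_for(anomaly_severity: str, channels: list[dict]) -> list[str]:
    """Return the channels whose minimum severity admits this anomaly."""
    rank = SEVERITY_ORDER.index(anomaly_severity)
    return [c["name"] for c in channels
            if SEVERITY_ORDER.index(c["min_severity"]) <= rank]

channels = [
    {"name": "pagerduty", "min_severity": "critical"},
    {"name": "slack", "min_severity": "low"},
]
```

With this configuration, a high-severity anomaly reaches only Slack, while a critical one also pages PagerDuty.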
Service Dependencies
Map relationships between your services to enable impact analysis. Dependencies can be manually created or auto-discovered from log patterns.
| Dependency Type | Description |
|---|---|
| api | REST or gRPC service-to-service communication. |
| database | Database connections (PostgreSQL, MySQL, ClickHouse, etc.). |
| queue | Message queue consumers/producers (RabbitMQ, Kafka, SQS, etc.). |
| cache | Cache layer dependencies (Redis, Memcached, etc.). |
| storage | Object storage dependencies (S3, GCS, etc.). |
| auth | Authentication/authorization service dependencies. |
Impact Analysis
When an anomaly is detected, LogPulse analyzes the dependency graph to determine the blast radius. It identifies downstream services that may be affected and upstream services that may be the root cause.
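The blast-radius walk can be sketched as a breadth-first traversal over inverted dependency edges. The graph representation (service -> list of services it depends on) is an assumption for illustration.

```python
from collections import deque

def downstream_impact(depends_on: dict[str, list[str]], failed: str) -> set[str]:
    """Find services that transitively depend on the failing service."""
    # Invert the edges: for each service, who depends on it.
    dependents: dict[str, list[str]] = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)

    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        svc = queue.popleft()
        for dependent in dependents.get(svc, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted
```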
API Reference
All anomaly detection features are accessible via the REST API. All endpoints require authentication and are scoped to the current organization.
Service Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/detect/services | List all monitored services with health status and active anomaly counts. |
| POST | /api/v1/detect/services | Register a new service for monitoring. |
| PUT | /api/v1/detect/services/:id | Update service configuration (sensitivity, intervals, thresholds). |
| POST | /api/v1/detect/services/:id/activate | Activate monitoring for a discovered service. |
| GET | /api/v1/detect/services/health | Get overall service health summary (healthy, warning, critical counts). |
| GET | /api/v1/detect/services/:id/baseline-status | Get visual baseline gauges with current vs. expected metrics. |
Anomaly Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/detect/anomalies | Query anomalies with filters (status, service, severity, time range). |
| GET | /api/v1/detect/anomalies/:id | Get detailed anomaly information including metrics and AI investigation results. |
| PUT | /api/v1/detect/anomalies/:id/acknowledge | Mark an anomaly as acknowledged. |
| PUT | /api/v1/detect/anomalies/:id/dismiss | Dismiss an anomaly as a false positive. |
| PUT | /api/v1/detect/anomalies/:id/feedback | Provide feedback (true_positive or false_positive). |
| POST | /api/v1/detect/anomalies/:id/investigate | Trigger an AI-powered investigation for this anomaly. |
| GET | /api/v1/detect/anomalies/timeline | Get anomaly timeline data for visualization. |
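As a sketch of calling these endpoints, the snippet below builds (but does not send) a feedback request with the standard library. The endpoint path and feedback values come from the tables above; the base URL, token, payload field name, and header scheme are assumptions for illustration.

```python
import json
import urllib.request

def build_feedback_request(base_url: str, anomaly_id: str,
                           feedback: str, token: str) -> urllib.request.Request:
    """Build a PUT request recording true_positive / false_positive feedback."""
    body = json.dumps({"feedback": feedback}).encode()  # assumed payload shape
    return urllib.request.Request(
        url=f"{base_url}/api/v1/detect/anomalies/{anomaly_id}/feedback",
        data=body,
        method="PUT",
        headers={"Authorization": f"Bearer {token}",  # assumed auth scheme
                 "Content-Type": "application/json"},
    )

req = build_feedback_request("https://logpulse.example.com", "a-123",
                             "false_positive", "MY_TOKEN")
# urllib.request.urlopen(req) would send it; omitted here.
```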
Response Channel Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/detect/response-channels | List all configured notification channels. |
| POST | /api/v1/detect/response-channels | Create a new notification channel (email, Slack, webhook, PagerDuty). |
| PUT | /api/v1/detect/response-channels/:id | Update channel configuration. |
| DELETE | /api/v1/detect/response-channels/:id | Delete a notification channel. |
Dependency Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/detect/dependencies | Get the service dependency graph. |
| POST | /api/v1/detect/dependencies | Create a dependency relationship between two services. |
| DELETE | /api/v1/detect/dependencies/:id | Remove a dependency relationship. |
| POST | /api/v1/detect/dependencies/:id/confirm | Confirm an auto-discovered dependency. |