AI Evaluations
The AI Evaluations API lets you evaluate log data against natural-language criteria and get back a structured, machine-readable verdict. You supply log rows and describe what you want to know — "did every step of this deployment succeed?", "which of these expected detections actually fired?" — and LogPulse returns one finding per evaluated item, each with a status, an explanation grounded in the supplied logs, and optional metadata such as confidence and entity references.
The verdict is schema-enforced: the response always has the same structure, so you can assert on it in CI, feed it into reporting, or gate a pipeline on it — no free-text parsing required.
Use Cases
| Use case | How |
|---|---|
| Detection & alert testing | Replay attack-simulation logs, list the detections you expect as expectedItems, and assert every item comes back passed |
| Deployment verification | Feed the logs of a release run and ask whether each rollout step completed without errors |
| Log quality checks | Evaluate a sample of ingested logs against criteria like "every entry carries a request_id and a severity" |
| Ad-hoc triage | Send a suspicious batch of logs with free-form criteria and get a structured second opinion |
How It Works
An evaluation is a single synchronous API call. The request carries the log rows, the criteria, and optionally a list of expected items. LogPulse builds a constrained evaluation prompt, runs it against an EU-hosted Claude model, and forces the model to answer through a strict verdict schema. The result is validated server-side before it is returned — a malformed verdict is never passed through to you.
The evaluator only draws conclusions that are grounded in the rows you supply. If the logs contain no evidence for an expected item, its status is not_found — the model is explicitly instructed never to guess.
Evaluation Modes
| Mode | Behavior |
|---|---|
| test_classification (default) | Evaluates the logs against your expectedItems list. Returns exactly one finding per expected item, keyed by its id — guaranteed, even when the logs contain no evidence for an item. |
| generic_verdict | Free-form evaluation. Returns one finding per distinct conclusion your criteria ask for, with ids f1, f2, and so on. |
Use test_classification whenever you know in advance what you are checking for: the one-finding-per-item guarantee makes assertions in CI trivial. Use generic_verdict for open-ended questions.
Findings & Statuses
Every finding carries a status describing what the supplied logs prove:
| Status | Meaning |
|---|---|
| passed | The logs prove the item succeeded / the condition holds |
| failed | The logs prove the item failed / the condition is violated |
| not_found | The supplied logs contain no relevant evidence for this item |
| skipped | The logs show the item was deliberately skipped |
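As a rough illustration of how a test harness might act on these statuses, the sketch below groups finding ids by status. The function name `bucket_findings` and the `unknown` bucket (for findings that carry no recognised status) are assumptions of this example, not part of the API.

```python
from collections import defaultdict

# The four statuses listed in the table above.
KNOWN_STATUSES = ("passed", "failed", "not_found", "skipped")


def bucket_findings(verdict):
    """Group finding ids by status.

    Findings without a recognised status go into an "unknown" bucket,
    which is a convention of this sketch and not an API status.
    """
    buckets = defaultdict(list)
    for finding in verdict:
        status = finding.get("status")
        key = status if status in KNOWN_STATUSES else "unknown"
        buckets[key].append(finding["id"])
    return dict(buckets)


sample = [
    {"id": "brute-force", "status": "passed", "explanation": "..."},
    {"id": "priv-esc", "status": "failed", "explanation": "..."},
    {"id": "exfil", "status": "not_found", "explanation": "..."},
]
buckets = bucket_findings(sample)
print(buckets)
```

Keeping the buckets separate lets a pipeline report missing evidence differently from a proven failure.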
not_found is not the same as failed. In detection testing, not_found usually means the expected alert never made it into the logs at all, which is often a pipeline or coverage problem rather than a detection-logic problem.
Authentication
The Evaluations API is authenticated with a Personal Access Token (PAT) carrying the evaluations:run scope, sent as a Bearer token. PATs start with lpat_ and are created in the dashboard under Settings → Access Tokens.
Authorization: Bearer lpat_your_token_here
See Personal Access Tokens for how to create, scope, rotate, and revoke tokens.
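A minimal sketch of building the request headers in Python. The helper name `auth_headers` and the client-side prefix check are assumptions added for convenience; the server remains the authority on whether a token is valid.

```python
def auth_headers(token):
    """Build request headers for a PAT.

    The prefix check is only an early sanity check in this sketch;
    it does not prove the token is valid or correctly scoped.
    """
    if not token.startswith("lpat_"):
        raise ValueError("expected a Personal Access Token starting with 'lpat_'")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


headers = auth_headers("lpat_your_token_here")
print(headers["Authorization"])
```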
Ingest API keys (lp_ prefix) do not work on this endpoint; they are rejected with HTTP 401. The two credential types are deliberately separate so that an ingest key on a server can never start AI runs.
Endpoint Reference
POST https://api.logpulse.io/api/v1/evaluations/run-adhoc
Runs a stateless, synchronous evaluation: you supply the rows inline and receive the verdict in the response. Nothing is stored, so rerunning the same request later evaluates it fresh.
Request Fields
| Field | Type | Required | Description |
|---|---|---|---|
| rows | array | one of rows / log | Log rows to evaluate. Strings or JSON objects, 1 to 10,000 entries (200 are evaluated, see Limits) |
| log | any | one of rows / log | Convenience field for evaluating a single log entry |
| criteria | string | yes | Natural-language evaluation criteria, max 4,000 characters |
| expectedItems | array | no | Items to check for, max 200. Each: { id, description } |
| mode | string | no | 'test_classification' (default) or 'generic_verdict' |
| model | string | no | Model ID; defaults to the fast utility model (see Model Selection) |
Provide either rows or log; a request with neither is rejected with a validation error. Rows can be raw strings or structured JSON — objects are serialized before evaluation.
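The field rules can be checked client-side before spending a request. The following is a sketch only: `build_request` is a hypothetical helper, and the limits it checks are the ones documented on this page. The server performs its own validation regardless.

```python
MAX_ROWS = 10_000
MAX_CRITERIA_CHARS = 4_000
MAX_EXPECTED_ITEMS = 200
MAX_ITEM_ID_CHARS = 100
MAX_ITEM_DESCRIPTION_CHARS = 500
MODES = ("test_classification", "generic_verdict")


def build_request(criteria, rows=None, log=None, expected_items=None,
                  mode="test_classification", model=None):
    """Assemble a request body after checking the documented limits."""
    if rows is None and log is None:
        raise ValueError("provide rows or log")
    if rows is not None and not 1 <= len(rows) <= MAX_ROWS:
        raise ValueError(f"rows must contain 1 to {MAX_ROWS} entries")
    if len(criteria) > MAX_CRITERIA_CHARS:
        raise ValueError(f"criteria exceeds {MAX_CRITERIA_CHARS} characters")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    items = expected_items or []
    if len(items) > MAX_EXPECTED_ITEMS:
        raise ValueError(f"at most {MAX_EXPECTED_ITEMS} expected items")
    for item in items:
        if len(item["id"]) > MAX_ITEM_ID_CHARS:
            raise ValueError(f"expected item id too long: {item['id'][:20]}")
        if len(item["description"]) > MAX_ITEM_DESCRIPTION_CHARS:
            raise ValueError(f"description too long for item {item['id']}")

    # Only include optional fields that were actually supplied.
    body = {"criteria": criteria, "mode": mode}
    if rows is not None:
        body["rows"] = rows
    if log is not None:
        body["log"] = log
    if items:
        body["expectedItems"] = items
    if model is not None:
        body["model"] = model
    return body


body = build_request("Did every rollout step complete?", rows=["step 1 ok"],
                     mode="generic_verdict")
print(body)
```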
Expected Items
In test_classification mode, expectedItems defines the checklist. Each item has an id (max 100 characters; this is your own identifier and is echoed back on the matching finding) and a description (max 500 characters) telling the evaluator what evidence to look for.
"expectedItems": [
  { "id": "brute-force", "description": "An alert fired for repeated failed SSH logins from a single source IP" },
  { "id": "priv-esc", "description": "An alert fired for a sudo invocation by a non-admin service account" },
  { "id": "exfil", "description": "An alert fired for an outbound transfer larger than 100 MB to an unknown host" }
]
The response is guaranteed to contain exactly one finding for every expected item id. Items the evaluator found no evidence for come back as not_found.
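Although the API guarantees one finding per expected item, a test suite may still want to assert that contract explicitly. The sketch below shows one way to do so; `coverage_problems` is a hypothetical helper name.

```python
from collections import Counter


def coverage_problems(expected_items, verdict):
    """Compare expected item ids with the ids found in a verdict.

    Returns three sorted lists; all three are empty when the verdict
    holds exactly one finding per expected item.
    """
    expected = {item["id"] for item in expected_items}
    counts = Counter(finding["id"] for finding in verdict)
    returned = set(counts)
    return {
        "missing": sorted(expected - returned),
        "unexpected": sorted(returned - expected),
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
    }


expected_items = [
    {"id": "brute-force", "description": "..."},
    {"id": "priv-esc", "description": "..."},
]
verdict = [
    {"id": "brute-force", "status": "passed", "explanation": "..."},
    {"id": "priv-esc", "status": "not_found", "explanation": "..."},
]
problems = coverage_problems(expected_items, verdict)
print(problems)
```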
Model Selection
By default, evaluations run on the fast utility model (claude-haiku-4-5), which handles classification against clear criteria well and keeps credit consumption low. For nuanced criteria or noisy logs you can request a more capable model via the model field.
| Model | Character | Relative cost |
|---|---|---|
| claude-haiku-4-5 (default) | Fast, best for clear-cut classification | 1x |
| claude-sonnet-4-6 | Stronger reasoning over ambiguous evidence | ~3x |
| claude-opus-4-7 | Maximum depth for complex investigative criteria | ~15x |
Response Reference
A successful run returns HTTP 200 with the verdict and run metadata:
{
  "data": {
    "verdict": [
      {
        "id": "brute-force",
        "status": "passed",
        "confidence": 0.95,
        "explanation": "Alert 'SSH brute force detected' fired at 09:14:02 after 23 failed logins from 203.0.113.7."
      },
      {
        "id": "priv-esc",
        "status": "failed",
        "confidence": 0.88,
        "explanation": "The sudo invocation by svc-backup appears in the logs, but no corresponding alert event is present."
      },
      {
        "id": "exfil",
        "status": "not_found",
        "explanation": "No outbound-transfer events or related alerts are present in the supplied logs."
      }
    ],
    "summary": "1 of 3 expected detections fired. Privilege-escalation triggered the underlying event but no alert; exfiltration left no trace in the supplied window.",
    "model": "claude-haiku-4-5",
    "provider": "bedrock",
    "region": "eu-north-1",
    "inputRowCount": 142,
    "truncated": false,
    "usage": { "tokensIn": 18234, "tokensOut": 412 }
  }
}
| Field | Type | Description |
|---|---|---|
| verdict | Finding[] | One finding per evaluated item (see below) |
| summary | string or null | Optional overall summary, max 500 characters |
| model | string | Model that produced the verdict |
| provider | string | AI provider the run executed on |
| region | string | Region the run executed in (always an EU region) |
| inputRowCount | number | Rows actually evaluated after caps were applied |
| truncated | boolean | true when input was clipped to fit the limits |
| usage.tokensIn | number | Prompt tokens consumed |
| usage.tokensOut | number | Completion tokens consumed |
The Finding Object
| Field | Type | Always present | Description |
|---|---|---|---|
| id | string | yes | The expectedItems id, or f1, f2, … in generic mode |
| explanation | string | yes | Short factual justification grounded in the logs, max 300 characters |
| status | string | no | One of 'passed', 'failed', 'not_found', 'skipped' |
| confidence | number | no | Evaluator confidence between 0 and 1 |
| entity | object | no | Identity or asset the finding concerns: { type, key }, e.g. { "type": "user", "key": "svc-backup" } |
| score | number | no | Numeric risk score, only when the criteria ask for scoring |
| tags | string[] | no | Framework tags such as MITRE technique IDs, only when the criteria ask for them |
Optional fields appear only when they are relevant: ask for risk scores or MITRE mappings in your criteria and the corresponding fields are populated.
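Because most finding fields are optional, client code benefits from reading them defensively. The dataclass below is one possible typed wrapper, a sketch under the field list above. The class itself is an illustration and not an official SDK type.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    # Always present according to the table above.
    id: str
    explanation: str
    # Optional fields default to None (or an empty list for tags).
    status: Optional[str] = None
    confidence: Optional[float] = None
    entity: Optional[dict] = None
    score: Optional[float] = None
    tags: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw):
        """Build a Finding from one element of the verdict array."""
        return cls(
            id=raw["id"],
            explanation=raw["explanation"],
            status=raw.get("status"),
            confidence=raw.get("confidence"),
            entity=raw.get("entity"),
            score=raw.get("score"),
            tags=list(raw.get("tags", [])),
        )


finding = Finding.from_dict({
    "id": "priv-esc",
    "status": "failed",
    "confidence": 0.88,
    "explanation": "No corresponding alert event is present.",
    "entity": {"type": "user", "key": "svc-backup"},
})
print(finding.status, finding.entity["key"], finding.tags)
```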
Limits & Truncation
| Limit | Value |
|---|---|
| Rows accepted per request | 10,000 |
| Rows evaluated per request | 200 |
| Characters per serialized row | 2,000 |
| Total input budget | ~50,000 tokens |
| Criteria length | 4,000 characters |
| Expected items | 200 |
| Findings per verdict | 200 |
| Explanation length per finding | 300 characters |
| Summary length | 500 characters |
Truncation is never silent. When the row count, per-row size, or total input budget clips your data, the response sets truncated: true and inputRowCount tells you how many rows were actually evaluated.
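If clipping is unacceptable for a given test, one option is to pre-check and split the input on the client. The sketch below assumes that `json.dumps` approximates the server's serialization of object rows, which this page does not specify. Note that the total input budget can still clip a batch that passes these checks, so `truncated` should still be inspected on every response.

```python
import json

MAX_EVALUATED_ROWS = 200
MAX_ROW_CHARS = 2_000


def serialized_length(row):
    """Approximate a row's serialized size (assumption: JSON for objects)."""
    return len(row) if isinstance(row, str) else len(json.dumps(row))


def oversized_rows(rows):
    """Return the indexes of rows likely to exceed the per-row limit."""
    return [i for i, row in enumerate(rows)
            if serialized_length(row) > MAX_ROW_CHARS]


def batches(rows, size=MAX_EVALUATED_ROWS):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


rows = [{"n": i, "event": "ok"} for i in range(450)]
sizes = [len(batch) for batch in batches(rows)]
print(sizes, oversized_rows(rows))
```

Splitting changes what each run can see: an expected item whose evidence sits in another batch comes back as not_found for that batch, so results need to be combined with care.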
Rate Limits & AI Credits
The endpoint is rate-limited to 10 requests per minute per caller. Beyond that, evaluations consume your organization's monthly AI credit bundle — the same bundle every AI feature in LogPulse draws from. Consumption is proportional to actual model cost (1 credit is roughly $0.01 of model cost), so a small Haiku run costs a fraction of a credit while a large Opus run costs several.
| Plan | AI credits / month |
|---|---|
| Free | 100 |
| Starter | 1,000 |
| Pro | 5,000 |
| Business | 20,000 |
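The credit arithmetic can be sketched as follows, taking the approximate rate of $0.01 of model cost per credit from the text above. The per-run dollar cost used in the example is a made-up input for illustration; this page does not publish per-token prices.

```python
USD_PER_CREDIT = 0.01  # approximate rate stated above


def credits_for(model_cost_usd):
    """Convert a model cost in USD to (approximate) AI credits."""
    return model_cost_usd / USD_PER_CREDIT


def runs_within_bundle(bundle_credits, cost_per_run_usd):
    """Estimate how many runs of a given cost fit into a monthly bundle."""
    return int(bundle_credits // credits_for(cost_per_run_usd))


# Hypothetical example: a run costing $0.02 on the Starter bundle.
estimated_runs = runs_within_bundle(1000, 0.02)
print(estimated_runs)
```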
Bundles reset on the first of each calendar month (UTC). Current usage is visible in Settings → Billing. When the bundle is exhausted, AI endpoints return HTTP 429 with code AI_CREDITS_EXHAUSTED:
{
  "error": "AI credits exhausted",
  "code": "AI_CREDITS_EXHAUSTED",
  "remaining": 0,
  "limit": 1000,
  "resetsAt": "2026-07-01T00:00:00.000Z"
}
Error Reference
Errors use the standard LogPulse error envelope. Match on the stable code field, not on the message text.
| HTTP status | Code | Meaning |
|---|---|---|
| 400 | VALIDATION_ERROR | Request body failed validation; details lists the issues |
| 401 | UNAUTHORIZED / INVALID_PAT | Missing, malformed, or unknown token |
| 401 | PAT_REVOKED / PAT_EXPIRED | Token revoked or past its expiration date |
| 403 | INSUFFICIENT_SCOPE | Token lacks the evaluations:run scope |
| 429 | (rate limit) | More than 10 requests in a minute — retry with backoff |
| 429 | AI_CREDITS_EXHAUSTED | Organization's monthly AI credit bundle is used up |
| 502 | EVALUATION_SCHEMA_ERROR | The model produced an invalid verdict; safe to retry |
| 503 | AI_PROVIDER_NOT_EU | EU evaluation capacity unavailable; retry later |
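The table suggests a simple retry policy, sketched below. The function names and the backoff parameters (2-second base, 60-second cap) are assumptions of this example; only the retryable codes come from the table.

```python
# Codes the table above describes as worth retrying.
RETRYABLE_CODES = {"EVALUATION_SCHEMA_ERROR", "AI_PROVIDER_NOT_EU"}


def should_retry(status, code=None):
    """Decide whether a failed request is worth retrying."""
    if status == 429:
        # A plain rate limit clears within a minute; an exhausted credit
        # bundle does not clear until the monthly reset.
        return code != "AI_CREDITS_EXHAUSTED"
    return code in RETRYABLE_CODES


def backoff_delays(attempts, base=2.0, cap=60.0):
    """Exponential backoff delays in seconds: base, 2*base, 4*base, capped."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


print(should_retry(502, "EVALUATION_SCHEMA_ERROR"), backoff_delays(4))
```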
Examples
Detection Testing (cURL)
The canonical workflow: replay simulated attack traffic, pull the resulting logs, and verify that every expected detection fired.
curl -X POST https://api.logpulse.io/api/v1/evaluations/run-adhoc \
  -H "Authorization: Bearer $LOGPULSE_PAT" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "test_classification",
    "criteria": "These are SIEM logs from a purple-team exercise. For each expected item, determine whether the corresponding detection alert fired.",
    "expectedItems": [
      { "id": "brute-force", "description": "An alert fired for repeated failed SSH logins from a single source IP" },
      { "id": "priv-esc", "description": "An alert fired for a sudo invocation by a non-admin service account" }
    ],
    "rows": [
      {"timestamp": "2026-06-12T09:13:55Z", "source": "sshd", "event": "Failed password for root from 203.0.113.7", "count": 23},
      {"timestamp": "2026-06-12T09:14:02Z", "source": "siem", "alert": "SSH brute force detected", "src_ip": "203.0.113.7"},
      {"timestamp": "2026-06-12T09:20:11Z", "source": "auditd", "event": "sudo invoked", "user": "svc-backup"}
    ]
  }'
Generic Verdict (cURL)
For one-off questions, use generic_verdict — and the log field when you only have a single entry:
curl -X POST https://api.logpulse.io/api/v1/evaluations/run-adhoc \
  -H "Authorization: Bearer $LOGPULSE_PAT" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "generic_verdict",
    "criteria": "Does this log entry indicate a security-relevant event? If so, classify the severity and name the affected account.",
    "log": {
      "timestamp": "2026-06-12T03:12:44Z",
      "source": "auth-service",
      "event": "Password changed",
      "user": "[email protected]",
      "ip": "198.51.100.23",
      "geo": "country=BR",
      "mfa": false
    }
  }'
Python
import os

import requests

PAT = os.environ["LOGPULSE_PAT"]
BASE_URL = "https://api.logpulse.io"


def run_evaluation(rows, criteria, expected_items):
    """POST an ad-hoc test_classification run and return the response's data object."""
    response = requests.post(
        f"{BASE_URL}/api/v1/evaluations/run-adhoc",
        headers={"Authorization": f"Bearer {PAT}"},
        json={
            "mode": "test_classification",
            "rows": rows,
            "criteria": criteria,
            "expectedItems": expected_items,
        },
        timeout=120,  # synchronous AI call; allow time for large inputs
    )
    response.raise_for_status()
    return response.json()["data"]


result = run_evaluation(
    rows=load_exercise_logs(),  # your own loader; not defined in this snippet
    criteria="For each expected item, determine whether the detection alert fired.",
    expected_items=[
        {"id": "brute-force", "description": "Alert for repeated failed SSH logins from one IP"},
        {"id": "priv-esc", "description": "Alert for sudo by a non-admin service account"},
    ],
)

if result["truncated"]:
    print(f"warning: input clipped, {result['inputRowCount']} rows evaluated")

for finding in result["verdict"]:
    print(f"{finding['id']}: {finding.get('status')} - {finding['explanation']}")

# Anything other than "passed" (failed, not_found, skipped, or no status) fails the run.
not_passed = [f for f in result["verdict"] if f.get("status") != "passed"]
if not_passed:
    raise SystemExit(f"{len(not_passed)} expected detection(s) did not pass")
Node.js
// Uses top-level await, so run this as an ES module (for example a .mjs file).
const PAT = process.env.LOGPULSE_PAT;

async function runEvaluation({ rows, criteria, expectedItems }) {
  const response = await fetch(
    'https://api.logpulse.io/api/v1/evaluations/run-adhoc',
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${PAT}`,
      },
      body: JSON.stringify({
        mode: 'test_classification',
        rows,
        criteria,
        expectedItems,
      }),
    },
  );

  // Both rate limiting and credit exhaustion arrive as 429; tell them apart by code.
  if (response.status === 429) {
    const body = await response.json();
    if (body.code === 'AI_CREDITS_EXHAUSTED') {
      throw new Error(`AI credits exhausted, resets at ${body.resetsAt}`);
    }
    throw new Error('Rate limited, retry with backoff');
  }
  if (!response.ok) {
    throw new Error(`Evaluation failed: ${response.status}`);
  }

  const { data } = await response.json();
  return data;
}

const data = await runEvaluation({
  rows: exerciseLogs, // your own array of log rows; not defined in this snippet
  criteria: 'For each expected item, determine whether the detection alert fired.',
  expectedItems: [
    { id: 'brute-force', description: 'Alert for repeated failed SSH logins from one IP' },
    { id: 'priv-esc', description: 'Alert for sudo by a non-admin service account' },
  ],
});

const notPassed = data.verdict.filter((f) => f.status !== 'passed');
console.log(`${data.verdict.length - notPassed.length}/${data.verdict.length} passed`);
if (notPassed.length > 0) process.exit(1);
CI Pipeline Integration
A GitHub Actions job that runs an attack simulation, evaluates the resulting logs, and fails the build when an expected detection did not fire. Store the PAT as a repository secret.
jobs:
  detection-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run attack simulation
        run: ./scripts/run-simulation.sh > exercise-logs.json
      - name: Evaluate detections
        env:
          LOGPULSE_PAT: ${{ secrets.LOGPULSE_PAT }}
        run: |
          RESULT=$(curl -sf -X POST \
            https://api.logpulse.io/api/v1/evaluations/run-adhoc \
            -H "Authorization: Bearer $LOGPULSE_PAT" \
            -H "Content-Type: application/json" \
            -d "$(jq -n --slurpfile rows exercise-logs.json \
                  --slurpfile expected detection-expectations.json '{
                    mode: "test_classification",
                    criteria: "For each expected item, determine whether the detection alert fired.",
                    rows: $rows[0],
                    expectedItems: $expected[0]
                  }')")
          echo "$RESULT" | jq '.data.verdict'
          echo "$RESULT" | jq -e \
            '[.data.verdict[] | select(.status != "passed")] | length == 0'
Data Residency
Evaluations run exclusively on EU-hosted AI infrastructure (Amazon Bedrock in an EU region). This is enforced server-side as a hard requirement: if EU capacity is unavailable, the API returns HTTP 503 with code AI_PROVIDER_NOT_EU rather than silently routing your log data to another region or provider. The provider and region fields in every response tell you exactly where the run executed.
Log rows submitted for evaluation are processed for the duration of the request only; the ad-hoc endpoint stores neither your rows nor the verdict.
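For compliance pipelines that want to record or assert where each run executed, a check along these lines may be useful. It is a sketch that assumes AWS-style region names, where EU regions begin with "eu-"; the helper name `ran_in_eu` is hypothetical.

```python
def ran_in_eu(data):
    """True when the response's region field names an EU region.

    Assumes AWS-style region identifiers such as "eu-north-1".
    """
    return str(data.get("region", "")).startswith("eu-")


data = {"provider": "bedrock", "region": "eu-north-1"}
print(data["provider"], data["region"], ran_in_eu(data))
```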