AI Evaluations

The AI Evaluations API lets you evaluate log data against natural-language criteria and get back a structured, machine-readable verdict. You supply log rows and describe what you want to know (for example, "did every step of this deployment succeed?" or "which of these expected detections actually fired?"), and LogPulse returns one finding per evaluated item, each with a status, an explanation grounded in the supplied logs, and optional metadata such as confidence and entity references.

The verdict is schema-enforced: the response always has the same structure, so you can assert on it in CI, feed it into reporting, or gate a pipeline on it without parsing free text.

Use Cases

Use case	How
Detection & alert testing	Replay attack-simulation logs, list the detections you expect as expectedItems, and assert every item comes back passed
Deployment verification	Feed the logs of a release run and ask whether each rollout step completed without errors
Log quality checks	Evaluate a sample of ingested logs against criteria like "every entry carries a request_id and a severity"
Ad-hoc triage	Send a suspicious batch of logs with free-form criteria and get a structured second opinion

How It Works

An evaluation is a single synchronous API call. The request carries the log rows, the criteria, and optionally a list of expected items. LogPulse builds a constrained evaluation prompt, runs it against an EU-hosted Claude model, and forces the model to answer through a strict verdict schema. The result is validated server-side before it is returned. A malformed verdict is never passed through to you.

The evaluator only draws conclusions that are grounded in the rows you supply. If the logs contain no evidence for an expected item, its status is not_found. The model is explicitly instructed never to guess.

Evaluation Modes

Mode	Behavior
test_classification (default)	Evaluates the logs against your expectedItems list. Always returns exactly one finding per expected item, keyed by its id, even when the logs contain no evidence for an item.
generic_verdict	Free-form evaluation. Returns one finding per distinct conclusion your criteria ask for, with ids f1, f2, and so on.

Tip

Use test_classification whenever you know in advance what you are checking for. The one-finding-per-item guarantee makes assertions in CI trivial. Use generic_verdict for open-ended questions.

Findings & Statuses

Every finding carries a status describing what the supplied logs prove:

Status	Meaning
passed	The logs prove the item succeeded / the condition holds
failed	The logs prove the item failed / the condition is violated
not_found	The supplied logs contain no relevant evidence for this item
skipped	The logs show the item was deliberately skipped

Note

not_found is not the same as failed. In detection testing, not_found usually means the expected alert never reached the logs at all, which often points to a pipeline or coverage problem rather than a detection-logic problem.
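
The distinction between failed and not_found can be kept visible in client code. The following is a minimal sketch that assumes findings shaped like the documented finding object; the helper names summarize_statuses and split_failures are hypothetical and not part of any LogPulse SDK.

```python
from collections import Counter


def summarize_statuses(verdict):
    """Count findings per status; findings without a status count as 'unknown'."""
    return Counter(f.get("status", "unknown") for f in verdict)


def split_failures(verdict):
    """Separate proven failures from missing evidence so they can be triaged differently."""
    failed = [f["id"] for f in verdict if f.get("status") == "failed"]
    not_found = [f["id"] for f in verdict if f.get("status") == "not_found"]
    return failed, not_found


# Illustrative findings; explanations are shortened for the example.
sample_verdict = [
    {"id": "brute-force", "status": "passed", "explanation": "..."},
    {"id": "priv-esc", "status": "failed", "explanation": "..."},
    {"id": "exfil", "status": "not_found", "explanation": "..."},
]

failed, not_found = split_failures(sample_verdict)
print("review detection logic for:", failed)
print("review pipeline or coverage for:", not_found)
```

Reporting the two lists separately lets a CI job route detection-logic failures and log-coverage gaps to different owners.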

Authentication

The Evaluations API is authenticated with a Personal Access Token (PAT) carrying the evaluations:run scope, sent as a Bearer token. PATs start with lpat_ and are created in the dashboard under Settings → Access Tokens.

Authorization header

Authorization: Bearer lpat_your_token_here

See Personal Access Tokens for how to create, scope, rotate, and revoke tokens.

Warning

Ingest API keys (lp_ prefix) do not work on this endpoint. They are rejected with HTTP 401. The two credential types are deliberately separate so that an ingest key on a server can never start AI runs.
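
Because the two credential types differ by prefix, a client can catch the wrong one before sending a request. This is a minimal sketch; build_auth_headers is a hypothetical helper name, and the prefix checks rely only on the lpat_ and lp_ prefixes described above.

```python
def build_auth_headers(token: str) -> dict:
    """Return request headers for the Evaluations API, rejecting obviously wrong credentials."""
    if token.startswith("lp_"):
        # Ingest keys are rejected by the endpoint with HTTP 401, so fail early.
        raise ValueError("ingest API keys (lp_ prefix) do not work here; use a PAT")
    if not token.startswith("lpat_"):
        raise ValueError("expected a Personal Access Token starting with lpat_")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

The check only inspects the prefix. Scope, revocation, and expiry are still decided by the server.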

Endpoint Reference

Run an ad-hoc evaluation

POST https://api.logpulse.io/api/v1/evaluations/run-adhoc

Runs a stateless, synchronous evaluation: you supply the rows inline and receive the verdict in the response. Nothing is stored. Rerunning the same request later evaluates it fresh.

Request Fields

Field	Type	Required	Description
rows	array	one of rows / log	Log rows to evaluate. Strings or JSON objects, 1 to 10,000 entries (see Limits for how many are considered)
log	any	one of rows / log	Convenience field for evaluating a single log entry
criteria	string	yes*	Natural-language evaluation criteria, max 4,000 characters. Not required when preview is true
expectedItems	array	no	Items to check for, max 200. Each: { id, description }
mode	string	no	'test_classification' (default) or 'generic_verdict'
model	string	no	Model ID; defaults to the fast utility model (see Model Selection)
logCompression	boolean	no	Default true. Compresses all rows into patterns + counts so the full set is considered. Set false to evaluate only the first 200 rows verbatim (see Limits)
compressionOptions	object	no	Optional tuning for the compression pass. See Compression & Preview
preview	boolean	no	Default false. When true, returns the compressed view + an AI-credit estimate without calling the model or spending credits. criteria is not required. See Compression & Preview

Provide either rows or log; a request with neither is rejected with a validation error. Rows can be raw strings or structured JSON. Objects are serialized before evaluation.
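
The field rules above can be checked client-side before spending a request against the rate limit. The sketch below is an assumption-laden pre-flight check, not the server's validator: validate_request is a hypothetical helper, and it encodes only the limits stated in the table. It deliberately does not judge a body that contains both rows and log, since that case is not specified here.

```python
VALID_MODES = ("test_classification", "generic_verdict")


def validate_request(body: dict) -> list:
    """Return human-readable problems found in a request body (empty list if none)."""
    problems = []
    if "rows" not in body and "log" not in body:
        problems.append("provide either rows or log")
    rows = body.get("rows")
    if rows is not None and not (isinstance(rows, list) and 1 <= len(rows) <= 10_000):
        problems.append("rows must be an array of 1 to 10,000 entries")
    criteria = body.get("criteria")
    # criteria may be omitted only for preview requests.
    if not body.get("preview", False) and not criteria:
        problems.append("criteria is required unless preview is true")
    if criteria and len(criteria) > 4_000:
        problems.append("criteria exceeds 4,000 characters")
    if len(body.get("expectedItems", [])) > 200:
        problems.append("expectedItems exceeds 200 entries")
    if body.get("mode", "test_classification") not in VALID_MODES:
        problems.append("mode must be test_classification or generic_verdict")
    return problems


print(validate_request({"rows": ["GET /health 200"], "preview": True}))
```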

Expected Items

In test_classification mode, expectedItems defines the checklist. Each item has an id (max 100 characters), which is your own identifier and is echoed back on the matching finding, and a description (max 500 characters) that tells the evaluator what evidence to look for.

expectedItems

"expectedItems": [
  { "id": "brute-force", "description": "An alert fired for repeated failed SSH logins from a single source IP" },
  { "id": "priv-esc",    "description": "An alert fired for a sudo invocation by a non-admin service account" },
  { "id": "exfil",       "description": "An alert fired for an outbound transfer larger than 100 MB to an unknown host" }
]

The response is guaranteed to contain exactly one finding for every expected item id. Items the evaluator found no evidence for come back as not_found.
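
A test harness may still want to assert this guarantee explicitly, and to check item lengths before sending. The following sketch assumes the id and description limits given above; check_expected_items and check_coverage are hypothetical helper names.

```python
def check_expected_items(items):
    """Return problems for items that exceed the documented id/description lengths."""
    problems = []
    for item in items:
        if len(item["id"]) > 100:
            problems.append(f"id longer than 100 characters: {item['id'][:20]}...")
        if len(item["description"]) > 500:
            problems.append(f"description longer than 500 characters for id {item['id']}")
    return problems


def check_coverage(expected_items, verdict):
    """Compare expected ids with finding ids; return (missing, unexpected, duplicated)."""
    expected_ids = [item["id"] for item in expected_items]
    finding_ids = [finding["id"] for finding in verdict]
    missing = sorted(set(expected_ids) - set(finding_ids))
    unexpected = sorted(set(finding_ids) - set(expected_ids))
    duplicated = sorted({i for i in finding_ids if finding_ids.count(i) > 1})
    return missing, unexpected, duplicated
```

For a well-formed test_classification response, all three lists returned by check_coverage should be empty.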

Model Selection

By default, evaluations run on the fast utility model (claude-haiku-4-5), which handles classification against clear criteria well and keeps credit consumption low. For nuanced criteria or noisy logs you can request a more capable model via the model field.

Model	Character	Relative cost
claude-haiku-4-5 (default)	Fast, best for clear-cut classification	1x
claude-sonnet-4-6	Stronger reasoning over ambiguous evidence	~3x
claude-opus-4-7	Maximum depth for complex investigative criteria	~15x

Note

Unknown model values fall back to the default model rather than failing the request. The model actually used is always returned in the response's model field.
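
Since a mistyped model ID does not fail the request, a caller that cares about which model ran can compare the request with the response. This is a small sketch; model_fell_back is a hypothetical helper, and it assumes the response's model field uses the same identifier format as the request, as the sample response suggests.

```python
def model_fell_back(requested, response_data):
    """Return True when a model was requested but the response reports a different one."""
    if requested is None:
        # No explicit request means the default model was expected anyway.
        return False
    return response_data["model"] != requested


# A typo in the requested model ID silently falls back to the default.
print(model_fell_back("claude-sonet-4-6", {"model": "claude-haiku-4-5"}))
```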

Response Reference

A successful run returns HTTP 200 with the verdict and run metadata:

200 OK

{
  "data": {
    "verdict": [
      {
        "id": "brute-force",
        "status": "passed",
        "confidence": 0.95,
        "explanation": "Alert 'SSH brute force detected' fired at 09:14:02 after 23 failed logins from 203.0.113.7."
      },
      {
        "id": "priv-esc",
        "status": "failed",
        "confidence": 0.88,
        "explanation": "The sudo invocation by svc-backup appears in the logs, but no corresponding alert event is present."
      },
      {
        "id": "exfil",
        "status": "not_found",
        "explanation": "No outbound-transfer events or related alerts are present in the supplied logs."
      }
    ],
    "summary": "1 of 3 expected detections fired. Privilege-escalation triggered the underlying event but no alert; exfiltration left no trace in the supplied window.",
    "model": "claude-haiku-4-5",
    "provider": "bedrock",
    "region": "eu-north-1",
    "inputRowCount": 142,
    "truncated": false,
    "compressed": true,
    "usage": { "tokensIn": 18234, "tokensOut": 412 },
    "creditsSpent": 0.0871
  }
}

Field	Type	Description
verdict	Finding[]	One finding per evaluated item (see below)
summary	string | null	Optional overall summary, max 500 characters
model	string	Model that produced the verdict
provider	string	AI provider the run executed on
region	string	Region the run executed in (always an EU region)
inputRowCount	number	Rows considered: all supplied rows when compressed, else the first 200
truncated	boolean	true when input was clipped to fit the limits
compressed	boolean	true when rows were compressed into patterns before evaluation (logCompression)
usage.tokensIn	number	Prompt tokens consumed
usage.tokensOut	number	Completion tokens consumed
creditsSpent	number	AI credits debited for this run (1 credit ≈ $0.01 model cost)

The Finding Object

Field	Type	Always present	Description
id	string	yes	The expectedItems id, or f1, f2, … in generic_verdict mode
explanation	string	yes	Short factual justification grounded in the logs, max 300 characters
status	string	no	'passed' | 'failed' | 'not_found' | 'skipped'
confidence	number	no	Evaluator confidence between 0 and 1
entity	object	no	Identity or asset the finding concerns: { type, key }, e.g. { "type": "user", "key": "svc-backup" }
score	number	no	Numeric risk score, only when the criteria ask for scoring
tags	string[]	no	Framework tags such as MITRE technique IDs, only when the criteria ask for them

Optional fields appear only when they are relevant: ask for risk scores or MITRE mappings in your criteria and the corresponding fields are populated.
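
Because only id and explanation are always present, client code should read every other field defensively. One possible shape is sketched below; the Finding dataclass and its from_dict constructor are illustrative names, not part of an official client.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    id: str
    explanation: str
    status: Optional[str] = None        # 'passed', 'failed', 'not_found' or 'skipped'
    confidence: Optional[float] = None  # between 0 and 1 when present
    entity: Optional[dict] = None       # {"type": ..., "key": ...} when present
    score: Optional[float] = None
    tags: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict) -> "Finding":
        """Build a Finding from one verdict entry, tolerating absent optional fields."""
        return cls(
            id=raw["id"],
            explanation=raw["explanation"],
            status=raw.get("status"),
            confidence=raw.get("confidence"),
            entity=raw.get("entity"),
            score=raw.get("score"),
            tags=list(raw.get("tags", [])),
        )


finding = Finding.from_dict({"id": "exfil", "status": "not_found", "explanation": "No evidence."})
print(finding.status, finding.confidence)
```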

Limits & Truncation

Limit	Value
Request body size	10 MB
Rows accepted per request	10,000
Rows considered (logCompression: true, default)	All supplied rows, compressed into patterns within the input budget
Rows considered (logCompression: false)	200 (the first 200 rows, in request order)
Characters per serialized row	2,000
Total input budget	~50,000 tokens
Criteria length	4,000 characters
Expected items	200
Findings per verdict	200
Explanation length per finding	300 characters
Summary length	500 characters

By default (logCompression: true) all supplied rows are compressed into patterns with occurrence counts before evaluation, so the evaluator sees the distribution of the whole row set instead of a sample. Rare one-off lines are kept verbatim. This fits far more signal within the input budget. Set logCompression: false to skip compression and evaluate only the first 200 rows in request order; the remaining rows are dropped, so put the rows that matter first.

Truncation is never silent. When the row count, per-row size, or total input budget clips your data, the response sets truncated: true and inputRowCount tells you how many rows were considered. The compressed field reports whether compression was applied.
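
These fields are easy to surface in a test log. The sketch below pairs a response-side report with a request-side check for oversized rows. Both helper names are hypothetical, and the per-row check assumes that object rows serialize roughly like json.dumps output, which may differ from the server's own serialization.

```python
import json

MAX_ROW_CHARS = 2_000  # documented per-row limit on the serialized form


def oversized_rows(rows):
    """Return indexes of rows whose (approximate) serialized form exceeds the per-row limit."""
    indexes = []
    for i, row in enumerate(rows):
        text = row if isinstance(row, str) else json.dumps(row)
        if len(text) > MAX_ROW_CHARS:
            indexes.append(i)
    return indexes


def input_report(data: dict, rows_sent: int) -> str:
    """Summarize how much of the submitted input the evaluator considered."""
    parts = [
        f"{data['inputRowCount']}/{rows_sent} rows considered",
        "compressed" if data["compressed"] else "verbatim",
    ]
    if data["truncated"]:
        parts.append("TRUNCATED")
    return ", ".join(parts)


print(input_report({"inputRowCount": 142, "truncated": False, "compressed": True}, 142))
```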

Tip

Pre-filter in LPQL before evaluating: narrow your logs to the relevant window and fields, then send that result set. With compression on you can safely send the whole window; with it off, well-chosen rows give better and cheaper evaluations than arbitrary ones.

Compression & Preview

When logCompression is on (the default), rows are grouped into patterns using occurrence counting. Near-identical lines collapse into a single template such as [412x] ERROR connection to <IP> failed, where placeholders like <IP>, <NUM> and <*> mark the parts that vary between rows. Rare one-off lines are kept verbatim. You can tune this with the optional compressionOptions object.
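
To build intuition for what pattern compression does, here is a toy illustration. It is not LogPulse's algorithm: it masks IPv4 addresses and standalone numbers with regular expressions instead of using a similarity threshold, and every name in it is hypothetical.

```python
import re

IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
NUM_RE = re.compile(r"\b\d+\b")


def to_template(line: str) -> str:
    """Replace the variable parts of a line with placeholders."""
    line = IP_RE.sub("<IP>", line)
    return NUM_RE.sub("<NUM>", line)


def compress(lines, rare_threshold=1):
    """Group lines by template; keep groups at or below rare_threshold verbatim."""
    groups = {}
    for line in lines:
        groups.setdefault(to_template(line), []).append(line)
    output = []
    # Highest-count patterns first; ties keep their first-seen order.
    for template, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        if len(members) <= rare_threshold:
            output.extend(f"[1x] {member}" for member in members)
        else:
            output.append(f"[{len(members)}x] {template}")
    return output


logs = [
    "ERROR connection to 10.0.3.12 failed",
    "FATAL OOMKilled pod worker-7c",
    "ERROR connection to 10.0.3.13 failed",
]
for line in compress(logs):
    print(line)
```

The two connection errors collapse into one counted pattern, while the single FATAL line is kept as-is.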

Compression Options

Option	Type	Default	Description
maxTemplates	number	200	Maximum distinct patterns to surface, highest-count first (1 to 500)
maxSamples	number	2	Example rows kept per pattern, shown after "e.g." (0 to 5)
rareThreshold	number	1	Patterns at or below this count are kept verbatim rather than collapsed (0 to 50)
simThreshold	number	0.4	How similar two lines must be to share a pattern, 0 to 1. Higher values merge fewer lines
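
The ranges in the table can be checked before a request is sent. This is a minimal sketch using only the ranges listed above; check_compression_options is a hypothetical helper, and the server remains the authority on what it accepts.

```python
# Inclusive (min, max) ranges taken from the Compression Options table.
COMPRESSION_RANGES = {
    "maxTemplates": (1, 500),
    "maxSamples": (0, 5),
    "rareThreshold": (0, 50),
    "simThreshold": (0.0, 1.0),
}


def check_compression_options(options: dict) -> list:
    """Return problems for unknown option names or values outside the documented ranges."""
    problems = []
    for name, value in options.items():
        if name not in COMPRESSION_RANGES:
            problems.append(f"unknown option: {name}")
            continue
        low, high = COMPRESSION_RANGES[name]
        if not low <= value <= high:
            problems.append(f"{name} must be between {low} and {high}, got {value}")
    return problems


print(check_compression_options({"maxTemplates": 30, "simThreshold": 0.4}))
```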

Previewing Compression

To see exactly how your logs compress before running an evaluation, send preview: true on the same run-adhoc request. It applies the same compression a real run would, but never calls the model and never consumes AI credits, so criteria is not required.

Preview request

POST https://api.logpulse.io/api/v1/evaluations/run-adhoc
{
  "rows": [ ... ],
  "preview": true,
  "compressionOptions": { "maxTemplates": 30 }
}

The response returns the exact lines the evaluator would feed the model, plus an AI-credit estimate so you can see the saving in the same unit you are billed in:

200 OK

{
  "data": {
    "preview": true,
    "lines": [
      "[412x] ERROR connection to <IP> failed  e.g. connection to 10.0.3.12 failed",
      "[388x] INFO request <*> completed in <NUM>  e.g. request GET /orders completed in 42ms",
      "[1x] FATAL OOMKilled pod worker-7c"
    ],
    "inputRowCount": 1000,
    "uniqueTemplates": 18,
    "droppedTemplates": 0,
    "rareLineCount": 1,
    "truncated": false,
    "creditEstimate": {
      "model": "claude-haiku-4-5",
      "uncompressed": 1.0606,
      "compressed": 0.0164
    }
  }
}

Field	Type	Description
lines	string[]	The exact compressed view the evaluator would send to the model
inputRowCount	number	Number of rows considered
uniqueTemplates	number	Distinct patterns found before the maxTemplates cap
droppedTemplates	number	Patterns dropped by the maxTemplates cap
rareLineCount	number	One-off lines kept verbatim
truncated	boolean	true when the input-token budget clipped the output
creditEstimate.model	string	Model the estimate is priced at (defaults to the utility model, or your model field)
creditEstimate.uncompressed	number	AI credits for the log-data input if sent uncompressed
creditEstimate.compressed	number	AI credits for the log-data input with compression. The saving is the difference

The estimate covers only the log-data input (the part that compression changes), priced at the model's input rate. It does not include the verdict output or the criteria, which cost roughly the same whether or not you compress, so the real saving is the difference between the two numbers.
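
The saving can be computed directly from the creditEstimate object. The sketch below uses the values from the preview response shown earlier; compression_saving is a hypothetical helper name.

```python
def compression_saving(credit_estimate: dict):
    """Return (credits saved, percent saved) for the log-data input."""
    uncompressed = credit_estimate["uncompressed"]
    compressed = credit_estimate["compressed"]
    saved = uncompressed - compressed
    percent = 100 * saved / uncompressed if uncompressed else 0.0
    return round(saved, 4), round(percent, 1)


estimate = {"model": "claude-haiku-4-5", "uncompressed": 1.0606, "compressed": 0.0164}
print(compression_saving(estimate))  # (1.0442, 98.5)
```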

Note

Preview uses the same run-adhoc endpoint, rate limit, and evaluations:run token scope as a normal run. It just skips the model call and the credit charge.

Rate Limits & AI Credits

The endpoint is rate-limited to 10 requests per minute per caller. In addition, evaluations consume your organization's monthly AI credit bundle, which is shared with every other AI feature in LogPulse. Consumption is proportional to actual model cost (1 credit is roughly $0.01 of model cost), so a small Haiku run costs a fraction of a credit while a large Opus run costs several.

Plan	AI credits / month
Free	100
Pro	2,500
Business	10,000
Enterprise	Custom

Bundles reset on the first of each calendar month (UTC). Current usage is visible in Settings → Billing. When the bundle is exhausted, AI endpoints return HTTP 429 with code AI_CREDITS_EXHAUSTED:

429: credits exhausted

{
  "error": "AI credits exhausted",
  "code": "AI_CREDITS_EXHAUSTED",
  "remaining": 0,
  "limit": 1000,
  "resetsAt": "2026-07-01T00:00:00.000Z"
}
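
A client can use resetsAt to decide whether waiting is realistic or whether the job should simply be skipped. The sketch below assumes the error body shape shown above; is_credits_exhausted and seconds_until_reset are hypothetical helper names.

```python
from datetime import datetime, timezone


def is_credits_exhausted(status: int, body: dict) -> bool:
    """Distinguish an exhausted credit bundle from an ordinary rate-limit 429."""
    return status == 429 and body.get("code") == "AI_CREDITS_EXHAUSTED"


def seconds_until_reset(body: dict, now=None) -> float:
    """Return the seconds remaining until the bundle resets (never negative)."""
    # Older Python versions do not parse a trailing "Z", so normalize it first.
    resets_at = datetime.fromisoformat(body["resetsAt"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max(0.0, (resets_at - now).total_seconds())


error_body = {
    "error": "AI credits exhausted",
    "code": "AI_CREDITS_EXHAUSTED",
    "remaining": 0,
    "limit": 1000,
    "resetsAt": "2026-07-01T00:00:00.000Z",
}
print(is_credits_exhausted(429, error_body))
```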

Error Reference

Errors use the standard LogPulse error envelope. Match on the stable code field, not on the message text.

HTTP status	Code	Meaning
400	VALIDATION_ERROR	Request body failed validation; details lists the issues
401	UNAUTHORIZED / INVALID_PAT	Missing, malformed, or unknown token
401	PAT_REVOKED / PAT_EXPIRED	Token revoked or past its expiration date
403	INSUFFICIENT_SCOPE	Token lacks the evaluations:run scope
413	FST_ERR_CTP_BODY_TOO_LARGE	Request body exceeds 10 MB. Pre-filter or split the payload
429	(rate limit)	More than 10 requests in a minute. Retry with backoff
429	AI_CREDITS_EXHAUSTED	Organization's monthly AI credit bundle is used up
502	EVALUATION_SCHEMA_ERROR	The model produced an invalid verdict; safe to retry
503	AI_PROVIDER_NOT_EU	EU evaluation capacity unavailable; retry later

Note

502 and 503 are transient. Retry with exponential backoff. 4xx errors will not succeed on retry without changing the request or token.
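
The retry guidance can be sketched as a small wrapper. This is one possible approach under stated assumptions: send is a caller-supplied function returning (status, body), the delays and attempt count are arbitrary choices, and a 429 is retried only when it is not AI_CREDITS_EXHAUSTED, since an exhausted bundle will not recover within a backoff window.

```python
import time

TRANSIENT_STATUSES = {502, 503}


def should_retry(status: int, code=None) -> bool:
    """Retry transient server errors and plain rate limits; never other 4xx errors."""
    if status in TRANSIENT_STATUSES:
        return True
    return status == 429 and code != "AI_CREDITS_EXHAUSTED"


def call_with_backoff(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds, fails permanently, or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        code = body.get("code") if isinstance(body, dict) else None
        if not should_retry(status, code) or attempt == max_attempts - 1:
            return status, body
        # Exponential backoff: base_delay, 2x, 4x, ...
        sleep(base_delay * 2 ** attempt)
```

Passing sleep as a parameter keeps the wrapper testable without real delays. In production, adding random jitter to each delay is a common refinement.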

Examples

Detection Testing (cURL)

The canonical workflow: replay simulated attack traffic, pull the resulting logs, and verify that every expected detection fired.

cURL: test_classification

curl -X POST https://api.logpulse.io/api/v1/evaluations/run-adhoc \
  -H "Authorization: Bearer $LOGPULSE_PAT" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "test_classification",
    "criteria": "These are SIEM logs from a purple-team exercise. For each expected item, determine whether the corresponding detection alert fired.",
    "expectedItems": [
      { "id": "brute-force", "description": "An alert fired for repeated failed SSH logins from a single source IP" },
      { "id": "priv-esc", "description": "An alert fired for a sudo invocation by a non-admin service account" }
    ],
    "rows": [
      {"timestamp": "2026-06-12T09:13:55Z", "source": "sshd", "event": "Failed password for root from 203.0.113.7", "count": 23},
      {"timestamp": "2026-06-12T09:14:02Z", "source": "siem", "alert": "SSH brute force detected", "src_ip": "203.0.113.7"},
      {"timestamp": "2026-06-12T09:20:11Z", "source": "auditd", "event": "sudo invoked", "user": "svc-backup"}
    ]
  }'

Generic Verdict (cURL)

For one-off questions, use generic_verdict. When you only have a single entry, pass it in the log field instead of rows:

cURL: generic_verdict, single log

curl -X POST https://api.logpulse.io/api/v1/evaluations/run-adhoc \
  -H "Authorization: Bearer $LOGPULSE_PAT" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "generic_verdict",
    "criteria": "Does this log entry indicate a security-relevant event? If so, classify the severity and name the affected account.",
    "log": {
      "timestamp": "2026-06-12T03:12:44Z",
      "source": "auth-service",
      "event": "Password changed",
      "user": "[email protected]",
      "ip": "198.51.100.23",
      "geo": "country=BR",
      "mfa": false
    }
  }'

Python

python: requests

import os
import requests

PAT = os.environ["LOGPULSE_PAT"]
BASE_URL = "https://api.logpulse.io"

def run_evaluation(rows, criteria, expected_items):
    response = requests.post(
        f"{BASE_URL}/api/v1/evaluations/run-adhoc",
        headers={"Authorization": f"Bearer {PAT}"},
        json={
            "mode": "test_classification",
            "rows": rows,
            "criteria": criteria,
            "expectedItems": expected_items,
        },
        timeout=120,  # synchronous AI call; allow time for large inputs
    )
    response.raise_for_status()
    return response.json()["data"]

result = run_evaluation(
    rows=load_exercise_logs(),  # your own loader
    criteria="For each expected item, determine whether the detection alert fired.",
    expected_items=[
        {"id": "brute-force", "description": "Alert for repeated failed SSH logins from one IP"},
        {"id": "priv-esc", "description": "Alert for sudo by a non-admin service account"},
    ],
)

if result["truncated"]:
    print(f"warning: input clipped, {result['inputRowCount']} rows evaluated")

for finding in result["verdict"]:
    print(f"{finding['id']}: {finding.get('status')}, {finding['explanation']}")

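# Anything other than "passed" (failed, not_found, skipped, or no status) fails the check.
not_passed = [f for f in result["verdict"] if f.get("status") != "passed"]
if not_passed:
    raise SystemExit(f"{len(not_passed)} expected detection(s) did not pass")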

Node.js

node.js: fetch

const PAT = process.env.LOGPULSE_PAT;

async function runEvaluation({ rows, criteria, expectedItems }) {
  const response = await fetch(
    'https://api.logpulse.io/api/v1/evaluations/run-adhoc',
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${PAT}`,
      },
      body: JSON.stringify({
        mode: 'test_classification',
        rows,
        criteria,
        expectedItems,
      }),
    },
  );

  if (response.status === 429) {
    const body = await response.json();
    if (body.code === 'AI_CREDITS_EXHAUSTED') {
      throw new Error(`AI credits exhausted, resets at ${body.resetsAt}`);
    }
    throw new Error('Rate limited, retry with backoff');
  }
  if (!response.ok) {
    throw new Error(`Evaluation failed: ${response.status}`);
  }

  const { data } = await response.json();
  return data;
}

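// Top-level await requires an ES module (.mjs or "type": "module" in package.json).
const data = await runEvaluation({
  rows: exerciseLogs, // your own array of log rows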
  criteria: 'For each expected item, determine whether the detection alert fired.',
  expectedItems: [
    { id: 'brute-force', description: 'Alert for repeated failed SSH logins from one IP' },
    { id: 'priv-esc', description: 'Alert for sudo by a non-admin service account' },
  ],
});

const notPassed = data.verdict.filter((f) => f.status !== 'passed');
console.log(`${data.verdict.length - notPassed.length}/${data.verdict.length} passed`);
if (notPassed.length > 0) process.exit(1);

CI Pipeline Integration

A GitHub Actions job that runs an attack simulation, evaluates the resulting logs, and fails the build when an expected detection did not fire. Store the PAT as a repository secret.

.github/workflows/detection-tests.yml

jobs:
  detection-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run attack simulation
        run: ./scripts/run-simulation.sh > exercise-logs.json

      - name: Evaluate detections
        env:
          LOGPULSE_PAT: ${{ secrets.LOGPULSE_PAT }}
        run: |
          RESULT=$(curl -sf -X POST \
            https://api.logpulse.io/api/v1/evaluations/run-adhoc \
            -H "Authorization: Bearer $LOGPULSE_PAT" \
            -H "Content-Type: application/json" \
            -d "$(jq -n --slurpfile rows exercise-logs.json \
              --slurpfile expected detection-expectations.json '{
                mode: "test_classification",
                criteria: "For each expected item, determine whether the detection alert fired.",
                rows: $rows[0],
                expectedItems: $expected[0]
              }')")
          echo "$RESULT" | jq '.data.verdict'
          echo "$RESULT" | jq -e \
            '[.data.verdict[] | select(.status != "passed")] | length == 0'

Data Residency

Evaluations run exclusively on EU-hosted AI infrastructure (Amazon Bedrock in an EU region). This is enforced server-side as a hard requirement: if EU capacity is unavailable, the API returns HTTP 503 with code AI_PROVIDER_NOT_EU rather than silently routing your log data to another region or provider. The provider and region fields in every response tell you exactly where the run executed.

Log rows submitted for evaluation are processed for the duration of the request only; the ad-hoc endpoint stores neither your rows nor the verdict.
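
Teams with compliance requirements may want to record where each run executed. The sketch below is an audit-side check only, since enforcement already happens server-side. It assumes EU region identifiers begin with "eu-", as in the sample response's eu-north-1, and assert_eu_residency is a hypothetical helper name.

```python
def assert_eu_residency(data: dict):
    """Return (provider, region) for audit logging, raising if the region looks non-EU."""
    region = data.get("region", "")
    # Assumption: EU regions use an "eu-" prefix, as in the documented sample response.
    if not region.startswith("eu-"):
        raise RuntimeError(f"unexpected region in response: {region!r}")
    return data["provider"], region


print(assert_eu_residency({"provider": "bedrock", "region": "eu-north-1"}))
```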