AI Evaluations

The AI Evaluations API lets you evaluate log data against natural-language criteria and get back a structured, machine-readable verdict. You supply log rows and describe what you want to know — "did every step of this deployment succeed?", "which of these expected detections actually fired?" — and LogPulse returns one finding per evaluated item, each with a status, an explanation grounded in the supplied logs, and optional metadata such as confidence and entity references.

The verdict is schema-enforced: the response always has the same structure, so you can assert on it in CI, feed it into reporting, or gate a pipeline on it — no free-text parsing required.

Use Cases

| Use case | How |
| --- | --- |
| Detection & alert testing | Replay attack-simulation logs, list the detections you expect as expectedItems, and assert every item comes back passed |
| Deployment verification | Feed the logs of a release run and ask whether each rollout step completed without errors |
| Log quality checks | Evaluate a sample of ingested logs against criteria like "every entry carries a request_id and a severity" |
| Ad-hoc triage | Send a suspicious batch of logs with free-form criteria and get a structured second opinion |

How It Works

An evaluation is a single synchronous API call. The request carries the log rows, the criteria, and optionally a list of expected items. LogPulse builds a constrained evaluation prompt, runs it against an EU-hosted Claude model, and forces the model to answer through a strict verdict schema. The result is validated server-side before it is returned — a malformed verdict is never passed through to you.

The evaluator only draws conclusions that are grounded in the rows you supply. If the logs contain no evidence for an expected item, its status is not_found — the model is explicitly instructed never to guess.

Evaluation Modes

| Mode | Behavior |
| --- | --- |
| test_classification (default) | Evaluates the logs against your expectedItems list. Returns exactly one finding per expected item, keyed by its id. This is guaranteed, even when the logs contain no evidence for an item. |
| generic_verdict | Free-form evaluation. Returns one finding per distinct conclusion your criteria ask for, with ids f1, f2, and so on. |

Tip: Use test_classification whenever you know in advance what you are checking for; the one-finding-per-item guarantee makes assertions in CI trivial. Use generic_verdict for open-ended questions.

Findings & Statuses

Every finding carries a status describing what the supplied logs prove:

| Status | Meaning |
| --- | --- |
| passed | The logs prove the item succeeded / the condition holds |
| failed | The logs prove the item failed / the condition is violated |
| not_found | The supplied logs contain no relevant evidence for this item |
| skipped | The logs show the item was deliberately skipped |

Note: not_found is not the same as failed. In detection testing, not_found usually means the expected alert never made it into the logs at all, which is often a pipeline or coverage problem rather than a detection-logic problem.
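
The distinction between failed and not_found can be kept visible in client code. The following is a minimal sketch, not part of the LogPulse API: the helper name `summarize_statuses` and the extra "unknown" bucket for findings without a recognized status are assumptions made for illustration.

```python
from collections import defaultdict

# The four statuses documented above; anything else lands in "unknown".
KNOWN_STATUSES = ("passed", "failed", "not_found", "skipped")


def summarize_statuses(findings):
    """Group finding ids by status (hypothetical helper, not a LogPulse API)."""
    buckets = defaultdict(list)
    for finding in findings:
        status = finding.get("status")  # status is optional on a finding
        key = status if status in KNOWN_STATUSES else "unknown"
        buckets[key].append(finding["id"])
    return dict(buckets)


if __name__ == "__main__":
    verdict = [
        {"id": "brute-force", "status": "passed", "explanation": "..."},
        {"id": "priv-esc", "status": "failed", "explanation": "..."},
        {"id": "exfil", "status": "not_found", "explanation": "..."},
    ]
    summary = summarize_statuses(verdict)
    # Report detection-logic problems and coverage problems separately.
    for item_id in summary.get("failed", []):
        print(f"{item_id}: logs show the detection did not fire")
    for item_id in summary.get("not_found", []):
        print(f"{item_id}: no evidence in the logs, check pipeline coverage")
```

Keeping the buckets separate lets a test report route not_found items to whoever owns log collection instead of the detection authors.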

Authentication

The Evaluations API is authenticated with a Personal Access Token (PAT) carrying the evaluations:run scope, sent as a Bearer token. PATs start with lpat_ and are created in the dashboard under Settings → Access Tokens.

Authorization header
Authorization: Bearer lpat_your_token_here

See Personal Access Tokens for how to create, scope, rotate, and revoke tokens.

Warning: Ingest API keys (lp_ prefix) do not work on this endpoint; they are rejected with HTTP 401. The two credential types are deliberately separate so that an ingest key on a server can never start AI runs.
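
A client can catch a mixed-up credential before any request is sent. This is a sketch under the prefix rules stated above (lpat_ for PATs, lp_ for ingest keys); the function `build_auth_headers` is a hypothetical helper, and the server remains the authority on whether a token is valid.

```python
def build_auth_headers(token: str) -> dict:
    """Return request headers for a PAT, rejecting obviously wrong credentials.

    Hypothetical client-side check based only on the documented prefixes.
    """
    if token.startswith("lp_"):
        raise ValueError("ingest API keys (lp_) are rejected by the Evaluations API")
    if not token.startswith("lpat_"):
        raise ValueError("expected a Personal Access Token starting with lpat_")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


if __name__ == "__main__":
    print(build_auth_headers("lpat_your_token_here")["Authorization"])
```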

Endpoint Reference

Run an ad-hoc evaluation
POST https://api.logpulse.io/api/v1/evaluations/run-adhoc

Runs a stateless, synchronous evaluation: you supply the rows inline and receive the verdict in the response. Nothing is stored — rerunning the same request later evaluates it fresh.

Request Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| rows | array | one of rows / log | Log rows to evaluate. Strings or JSON objects, 1 to 10,000 entries (200 are evaluated, see Limits) |
| log | any | one of rows / log | Convenience field for evaluating a single log entry |
| criteria | string | yes | Natural-language evaluation criteria, max 4,000 characters |
| expectedItems | array | no | Items to check for, max 200. Each: { id, description } |
| mode | string | no | 'test_classification' (default) or 'generic_verdict' |
| model | string | no | Model ID; defaults to the fast utility model (see Model Selection) |

Provide either rows or log; a request with neither is rejected with a validation error. Rows can be raw strings or structured JSON — objects are serialized before evaluation.
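
The field rules above can be mirrored locally so that obviously invalid requests never leave the client. The sketch below is an illustration only: `build_request` and its constants are hypothetical names, the limits are copied from the table, and the server still performs the authoritative validation.

```python
MAX_ROWS = 10_000
MAX_CRITERIA_CHARS = 4_000
MAX_EXPECTED_ITEMS = 200
MODES = ("test_classification", "generic_verdict")


def build_request(criteria, rows=None, log=None, expected_items=None,
                  mode="test_classification", model=None):
    """Assemble a request body, checking documented limits before sending."""
    problems = []
    if rows is None and log is None:
        problems.append("provide rows or log")
    if rows is not None and not 1 <= len(rows) <= MAX_ROWS:
        problems.append(f"rows must hold 1 to {MAX_ROWS} entries")
    if not criteria or len(criteria) > MAX_CRITERIA_CHARS:
        problems.append(f"criteria is required, max {MAX_CRITERIA_CHARS} characters")
    if mode not in MODES:
        problems.append(f"mode must be one of {MODES}")
    if expected_items is not None and len(expected_items) > MAX_EXPECTED_ITEMS:
        problems.append(f"at most {MAX_EXPECTED_ITEMS} expected items")
    if problems:
        raise ValueError("; ".join(problems))

    body = {"criteria": criteria, "mode": mode}
    # Only include optional fields that were actually supplied.
    for key, value in (("rows", rows), ("log", log),
                       ("expectedItems", expected_items), ("model", model)):
        if value is not None:
            body[key] = value
    return body


if __name__ == "__main__":
    print(build_request("Did the deploy succeed?",
                        log={"event": "deploy finished", "exit_code": 0},
                        mode="generic_verdict"))
```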

Expected Items

In test_classification mode, expectedItems defines the checklist. Each item has an id (max 100 characters, your own identifier — it is echoed back on the matching finding) and a description (max 500 characters) telling the evaluator what evidence to look for.

expectedItems
"expectedItems": [
  { "id": "brute-force", "description": "An alert fired for repeated failed SSH logins from a single source IP" },
  { "id": "priv-esc",    "description": "An alert fired for a sudo invocation by a non-admin service account" },
  { "id": "exfil",       "description": "An alert fired for an outbound transfer larger than 100 MB to an unknown host" }
]

The response is guaranteed to contain exactly one finding for every expected item id. Items the evaluator found no evidence for come back as not_found.
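
Even with this guarantee, a CI assertion may want to verify it defensively before trusting the result. A minimal sketch, assuming the verdict is the list of finding dictionaries shown in the Response Reference; `check_coverage` is a hypothetical helper name.

```python
def check_coverage(expected_items, verdict):
    """Compare expected item ids with finding ids (hypothetical helper)."""
    expected_ids = [item["id"] for item in expected_items]
    finding_ids = [finding["id"] for finding in verdict]
    return {
        # Expected ids with no finding at all.
        "missing": sorted(set(expected_ids) - set(finding_ids)),
        # Finding ids that were never requested.
        "unexpected": sorted(set(finding_ids) - set(expected_ids)),
        # Ids that appear on more than one finding.
        "duplicated": sorted({i for i in finding_ids if finding_ids.count(i) > 1}),
    }


if __name__ == "__main__":
    expected = [
        {"id": "brute-force", "description": "Alert for repeated failed SSH logins"},
        {"id": "exfil", "description": "Alert for a large outbound transfer"},
    ]
    verdict = [
        {"id": "brute-force", "status": "passed", "explanation": "..."},
        {"id": "exfil", "status": "not_found", "explanation": "..."},
    ]
    print(check_coverage(expected, verdict))
```

If all three lists are empty, the response matches the documented one-finding-per-item contract.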

Model Selection

By default, evaluations run on the fast utility model (claude-haiku-4-5), which handles classification against clear criteria well and keeps credit consumption low. For nuanced criteria or noisy logs you can request a more capable model via the model field.

| Model | Character | Relative cost |
| --- | --- | --- |
| claude-haiku-4-5 (default) | Fast, best for clear-cut classification | 1x |
| claude-sonnet-4-6 | Stronger reasoning over ambiguous evidence | ~3x |
| claude-opus-4-7 | Maximum depth for complex investigative criteria | ~15x |

Note: Unknown model values fall back to the default model rather than failing the request. The model actually used is always returned in the response's model field.
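
Because a mistyped model ID does not fail the request, a caller may want to compare the requested model with the one reported back. The sketch below assumes the response echoes the model in the same identifier format that was requested, as in the sample response; `detect_model_fallback` is a hypothetical helper.

```python
def detect_model_fallback(requested_model, response_data):
    """Return a warning string when the run used a different model, else None."""
    used = response_data["model"]
    if requested_model is not None and used != requested_model:
        return f"requested {requested_model!r} but the run used {used!r}"
    return None


if __name__ == "__main__":
    # A typo in the model ID silently falls back to the default model.
    warning = detect_model_fallback("claude-sonet-4-6", {"model": "claude-haiku-4-5"})
    if warning:
        print(f"warning: {warning}")
```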

Response Reference

A successful run returns HTTP 200 with the verdict and run metadata:

200 OK
{
  "data": {
    "verdict": [
      {
        "id": "brute-force",
        "status": "passed",
        "confidence": 0.95,
        "explanation": "Alert 'SSH brute force detected' fired at 09:14:02 after 23 failed logins from 203.0.113.7."
      },
      {
        "id": "priv-esc",
        "status": "failed",
        "confidence": 0.88,
        "explanation": "The sudo invocation by svc-backup appears in the logs, but no corresponding alert event is present."
      },
      {
        "id": "exfil",
        "status": "not_found",
        "explanation": "No outbound-transfer events or related alerts are present in the supplied logs."
      }
    ],
    "summary": "1 of 3 expected detections fired. Privilege-escalation triggered the underlying event but no alert; exfiltration left no trace in the supplied window.",
    "model": "claude-haiku-4-5",
    "provider": "bedrock",
    "region": "eu-north-1",
    "inputRowCount": 142,
    "truncated": false,
    "usage": { "tokensIn": 18234, "tokensOut": 412 }
  }
}

| Field | Type | Description |
| --- | --- | --- |
| verdict | Finding[] | One finding per evaluated item (see below) |
| summary | string or null | Optional overall summary, max 500 characters |
| model | string | Model that produced the verdict |
| provider | string | AI provider the run executed on |
| region | string | Region the run executed in (always an EU region) |
| inputRowCount | number | Rows actually evaluated after caps were applied |
| truncated | boolean | true when input was clipped to fit the limits |
| usage.tokensIn | number | Prompt tokens consumed |
| usage.tokensOut | number | Completion tokens consumed |

The Finding Object

| Field | Type | Always present | Description |
| --- | --- | --- | --- |
| id | string | yes | The expectedItems id, or f1, f2, … in generic mode |
| explanation | string | yes | Short factual justification grounded in the logs, max 300 characters |
| status | string | no | One of 'passed', 'failed', 'not_found', 'skipped' |
| confidence | number | no | Evaluator confidence between 0 and 1 |
| entity | object | no | Identity or asset the finding concerns: { type, key }, e.g. { "type": "user", "key": "svc-backup" } |
| score | number | no | Numeric risk score, only when the criteria ask for scoring |
| tags | string[] | no | Framework tags such as MITRE technique IDs, only when the criteria ask for them |

Optional fields appear only when they are relevant: ask for risk scores or MITRE mappings in your criteria and the corresponding fields are populated.
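
Since only id and explanation are always present, client code should treat every other field as optional. One way to model that is sketched below; the `Finding` dataclass is an illustrative client-side type rather than an official SDK class, and defaulting tags to an empty list is a convenience choice made here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    """Client-side view of one finding (illustrative, not an official type)."""
    id: str
    explanation: str
    status: Optional[str] = None
    confidence: Optional[float] = None
    entity: Optional[dict] = None
    score: Optional[float] = None
    tags: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict) -> "Finding":
        # id and explanation are required; everything else may be absent.
        return cls(
            id=raw["id"],
            explanation=raw["explanation"],
            status=raw.get("status"),
            confidence=raw.get("confidence"),
            entity=raw.get("entity"),
            score=raw.get("score"),
            tags=list(raw.get("tags") or []),
        )


if __name__ == "__main__":
    raw = {"id": "exfil", "status": "not_found",
           "explanation": "No outbound-transfer events are present."}
    print(Finding.from_dict(raw))
```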

Limits & Truncation

| Limit | Value |
| --- | --- |
| Rows accepted per request | 10,000 |
| Rows evaluated per request | 200 |
| Characters per serialized row | 2,000 |
| Total input budget | ~50,000 tokens |
| Criteria length | 4,000 characters |
| Expected items | 200 |
| Findings per verdict | 200 |
| Explanation length per finding | 300 characters |
| Summary length | 500 characters |

Truncation is never silent. When the row count, per-row size, or total input budget clips your data, the response sets truncated: true and inputRowCount tells you how many rows were actually evaluated.
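
The row-count and per-row caps can be estimated before sending, so truncation does not come as a surprise. The sketch below makes two assumptions that the documentation does not confirm: that json.dumps approximates the server-side serialization of object rows, and that splitting a large input into batches of 200 is acceptable for your criteria. The helper names are hypothetical, and the ~50,000-token input budget can still clip a batch that passes these checks.

```python
import json

MAX_ROWS_EVALUATED = 200
MAX_ROW_CHARS = 2_000


def serialized_length(row):
    # Assumption: json.dumps approximates how the server serializes objects.
    return len(row) if isinstance(row, str) else len(json.dumps(row))


def preflight(rows):
    """Report which documented caps a row list would likely hit."""
    oversized = [i for i, row in enumerate(rows)
                 if serialized_length(row) > MAX_ROW_CHARS]
    return {
        "rows_over_cap": max(0, len(rows) - MAX_ROWS_EVALUATED),
        "oversized_row_indexes": oversized,
    }


def batches(rows, size=MAX_ROWS_EVALUATED):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


if __name__ == "__main__":
    rows = [{"event": "login", "n": i} for i in range(450)]
    print(preflight(rows))
    print([len(batch) for batch in batches(rows)])
```

Batching multiplies the number of requests, so it interacts with the per-minute rate limit, and evidence for one expected item may end up split across batches.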

Tip: Pre-filter in LPQL before evaluating: narrow your logs to the relevant window and fields, then send that result set. 200 well-chosen rows evaluate better, and cheaper, than 200 arbitrary ones.

Rate Limits & AI Credits

The endpoint is rate-limited to 10 requests per minute per caller. Beyond that, evaluations consume your organization's monthly AI credit bundle — the same bundle every AI feature in LogPulse draws from. Consumption is proportional to actual model cost (1 credit is roughly $0.01 of model cost), so a small Haiku run costs a fraction of a credit while a large Opus run costs several.

| Plan | AI credits / month |
| --- | --- |
| Free | 100 |
| Starter | 1,000 |
| Pro | 5,000 |
| Business | 20,000 |

Bundles reset on the first of each calendar month (UTC). Current usage is visible in Settings → Billing. When the bundle is exhausted, AI endpoints return HTTP 429 with code AI_CREDITS_EXHAUSTED:

429 — credits exhausted
{
  "error": "AI credits exhausted",
  "code": "AI_CREDITS_EXHAUSTED",
  "remaining": 0,
  "limit": 1000,
  "resetsAt": "2026-07-01T00:00:00.000Z"
}
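
Both the rate limit and an exhausted bundle surface as HTTP 429, so a client has to look at the body to tell them apart. A minimal sketch, assuming the rate-limit response carries no AI_CREDITS_EXHAUSTED code and that resetsAt is an ISO 8601 UTC timestamp as in the sample above; `classify_429` is a hypothetical helper.

```python
from datetime import datetime, timezone


def classify_429(body, now=None):
    """Return ("credits", seconds_until_reset) or ("rate_limit", None)."""
    if body.get("code") == "AI_CREDITS_EXHAUSTED":
        now = now or datetime.now(timezone.utc)
        # Python versions before 3.11 do not parse a trailing "Z" directly.
        resets_at = datetime.fromisoformat(body["resetsAt"].replace("Z", "+00:00"))
        return "credits", max(0.0, (resets_at - now).total_seconds())
    return "rate_limit", None


if __name__ == "__main__":
    body = {
        "error": "AI credits exhausted",
        "code": "AI_CREDITS_EXHAUSTED",
        "remaining": 0,
        "limit": 1000,
        "resetsAt": "2026-07-01T00:00:00.000Z",
    }
    kind, wait_seconds = classify_429(body)
    print(kind, wait_seconds)
```

A short backoff is reasonable for the rate-limit case, whereas an exhausted bundle is better surfaced to a human than retried in a loop.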

Error Reference

Errors use the standard LogPulse error envelope. Match on the stable code field, not on the message text.

| HTTP status | Code | Meaning |
| --- | --- | --- |
| 400 | VALIDATION_ERROR | Request body failed validation; details lists the issues |
| 401 | UNAUTHORIZED / INVALID_PAT | Missing, malformed, or unknown token |
| 401 | PAT_REVOKED / PAT_EXPIRED | Token revoked or past its expiration date |
| 403 | INSUFFICIENT_SCOPE | Token lacks the evaluations:run scope |
| 429 | (rate limit) | More than 10 requests in a minute; retry with backoff |
| 429 | AI_CREDITS_EXHAUSTED | Organization's monthly AI credit bundle is used up |
| 502 | EVALUATION_SCHEMA_ERROR | The model produced an invalid verdict; safe to retry |
| 503 | AI_PROVIDER_NOT_EU | EU evaluation capacity unavailable; retry later |

Note: 502 and 503 are transient, as is the rate-limit 429. Retry these with exponential backoff. All other 4xx errors will not succeed on retry without changing the request or token, and AI_CREDITS_EXHAUSTED clears only when the bundle resets.
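
The retry guidance can be sketched as a small wrapper around whatever function sends the request. Everything here is illustrative: `call_with_retry`, the attempt count, the base delay, and the 25% jitter are assumptions, and `send` stands for any zero-argument callable that returns a (status_code, body) pair.

```python
import random
import time

RETRYABLE_STATUSES = {502, 503}


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until the status is not retryable or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        # A 429 is retryable only when it is the rate limit, not exhausted credits.
        retryable = status in RETRYABLE_STATUSES or (
            status == 429 and (body or {}).get("code") != "AI_CREDITS_EXHAUSTED"
        )
        if not retryable or attempt == max_attempts:
            return status, body
        # Exponential backoff (1s, 2s, 4s, ...) plus up to 25% random jitter.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay * 0.25))


if __name__ == "__main__":
    # Simulated responses: two transient errors, then success.
    responses = iter([(503, {}), (502, {}), (200, {"data": {"verdict": []}})])
    print(call_with_retry(lambda: next(responses), sleep=lambda seconds: None))
```

Injecting `sleep` keeps the wrapper testable without real waiting; in production the default time.sleep applies.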

Examples

Detection Testing (cURL)

The canonical workflow: replay simulated attack traffic, pull the resulting logs, and verify that every expected detection fired.

cURL — test_classification
curl -X POST https://api.logpulse.io/api/v1/evaluations/run-adhoc \
  -H "Authorization: Bearer $LOGPULSE_PAT" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "test_classification",
    "criteria": "These are SIEM logs from a purple-team exercise. For each expected item, determine whether the corresponding detection alert fired.",
    "expectedItems": [
      { "id": "brute-force", "description": "An alert fired for repeated failed SSH logins from a single source IP" },
      { "id": "priv-esc", "description": "An alert fired for a sudo invocation by a non-admin service account" }
    ],
    "rows": [
      {"timestamp": "2026-06-12T09:13:55Z", "source": "sshd", "event": "Failed password for root from 203.0.113.7", "count": 23},
      {"timestamp": "2026-06-12T09:14:02Z", "source": "siem", "alert": "SSH brute force detected", "src_ip": "203.0.113.7"},
      {"timestamp": "2026-06-12T09:20:11Z", "source": "auditd", "event": "sudo invoked", "user": "svc-backup"}
    ]
  }'

Generic Verdict (cURL)

For one-off questions, use generic_verdict — and the log field when you only have a single entry:

cURL — generic_verdict, single log
curl -X POST https://api.logpulse.io/api/v1/evaluations/run-adhoc \
  -H "Authorization: Bearer $LOGPULSE_PAT" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "generic_verdict",
    "criteria": "Does this log entry indicate a security-relevant event? If so, classify the severity and name the affected account.",
    "log": {
      "timestamp": "2026-06-12T03:12:44Z",
      "source": "auth-service",
      "event": "Password changed",
      "user": "user@example.com",
      "ip": "198.51.100.23",
      "geo": "country=BR",
      "mfa": false
    }
  }'

Python

python — requests
import os
import requests

PAT = os.environ["LOGPULSE_PAT"]
BASE_URL = "https://api.logpulse.io"

def run_evaluation(rows, criteria, expected_items):
    response = requests.post(
        f"{BASE_URL}/api/v1/evaluations/run-adhoc",
        headers={"Authorization": f"Bearer {PAT}"},
        json={
            "mode": "test_classification",
            "rows": rows,
            "criteria": criteria,
            "expectedItems": expected_items,
        },
        timeout=120,  # synchronous AI call; allow time for large inputs
    )
    response.raise_for_status()
    return response.json()["data"]

result = run_evaluation(
    rows=load_exercise_logs(),  # your own loader
    criteria="For each expected item, determine whether the detection alert fired.",
    expected_items=[
        {"id": "brute-force", "description": "Alert for repeated failed SSH logins from one IP"},
        {"id": "priv-esc", "description": "Alert for sudo by a non-admin service account"},
    ],
)

if result["truncated"]:
    print(f"warning: input clipped, {result['inputRowCount']} rows evaluated")

for finding in result["verdict"]:
    print(f"{finding['id']}: {finding.get('status')} — {finding['explanation']}")

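# Anything other than "passed" (failed, not_found, skipped, or no status) blocks the run.
not_passed = [f for f in result["verdict"] if f.get("status") != "passed"]
if not_passed:
    raise SystemExit(f"{len(not_passed)} expected detection(s) did not pass")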

Node.js

node.js — fetch
const PAT = process.env.LOGPULSE_PAT;

async function runEvaluation({ rows, criteria, expectedItems }) {
  const response = await fetch(
    'https://api.logpulse.io/api/v1/evaluations/run-adhoc',
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${PAT}`,
      },
      body: JSON.stringify({
        mode: 'test_classification',
        rows,
        criteria,
        expectedItems,
      }),
    },
  );

  if (response.status === 429) {
    const body = await response.json();
    if (body.code === 'AI_CREDITS_EXHAUSTED') {
      throw new Error(`AI credits exhausted, resets at ${body.resetsAt}`);
    }
    throw new Error('Rate limited — retry with backoff');
  }
  if (!response.ok) {
    throw new Error(`Evaluation failed: ${response.status}`);
  }

  const { data } = await response.json();
  return data;
}

const data = await runEvaluation({
  rows: exerciseLogs,
  criteria: 'For each expected item, determine whether the detection alert fired.',
  expectedItems: [
    { id: 'brute-force', description: 'Alert for repeated failed SSH logins from one IP' },
    { id: 'priv-esc', description: 'Alert for sudo by a non-admin service account' },
  ],
});

const notPassed = data.verdict.filter((f) => f.status !== 'passed');
console.log(`${data.verdict.length - notPassed.length}/${data.verdict.length} passed`);
if (notPassed.length > 0) process.exit(1);

CI Pipeline Integration

A GitHub Actions job that runs an attack simulation, evaluates the resulting logs, and fails the build when an expected detection did not fire. Store the PAT as a repository secret.

.github/workflows/detection-tests.yml
jobs:
  detection-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run attack simulation
        run: ./scripts/run-simulation.sh > exercise-logs.json

      - name: Evaluate detections
        env:
          LOGPULSE_PAT: ${{ secrets.LOGPULSE_PAT }}
        run: |
          RESULT=$(curl -sf -X POST \
            https://api.logpulse.io/api/v1/evaluations/run-adhoc \
            -H "Authorization: Bearer $LOGPULSE_PAT" \
            -H "Content-Type: application/json" \
            -d "$(jq -n --slurpfile rows exercise-logs.json \
              --slurpfile expected detection-expectations.json '{
                mode: "test_classification",
                criteria: "For each expected item, determine whether the detection alert fired.",
                rows: $rows[0],
                expectedItems: $expected[0]
              }')")
          echo "$RESULT" | jq '.data.verdict'
          echo "$RESULT" | jq -e \
            '[.data.verdict[] | select(.status != "passed")] | length == 0'

Data Residency

Evaluations run exclusively on EU-hosted AI infrastructure (Amazon Bedrock in an EU region). This is enforced server-side as a hard requirement: if EU capacity is unavailable, the API returns HTTP 503 with code AI_PROVIDER_NOT_EU rather than silently routing your log data to another region or provider. The provider and region fields in every response tell you exactly where the run executed.

Log rows submitted for evaluation are processed for the duration of the request only; the ad-hoc endpoint stores neither your rows nor the verdict.
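
For audit purposes, a caller can record and check the provider and region fields on every response. The sketch below rests on assumptions: that EU regions use the AWS-style "eu-" prefix seen in the sample response ("eu-north-1"), and that "bedrock" is the provider value to expect. `assert_eu_run` is a hypothetical helper, and the server-side enforcement described above remains the actual guarantee.

```python
def assert_eu_run(response_data, allowed_providers=("bedrock",)):
    """Raise if the response metadata does not describe an EU-region run."""
    provider = response_data.get("provider")
    region = response_data.get("region") or ""
    if provider not in allowed_providers:
        raise RuntimeError(f"unexpected provider: {provider!r}")
    # Assumption: EU regions carry the AWS-style "eu-" prefix.
    if not region.startswith("eu-"):
        raise RuntimeError(f"unexpected region: {region!r}")
    return provider, region


if __name__ == "__main__":
    data = {"provider": "bedrock", "region": "eu-north-1", "verdict": []}
    print(assert_eu_run(data))
```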
