Agent Evals

Agent Evals helps you measure how well AI agents and workflows perform in production. You attach scores to function runs, group related work into sessions, preserve execution details in traces, and compare changes with experiments.

It is useful when you want to:

Score an agent run based on guardrails, model confidence, or output quality.
Wait for a real product signal, such as user feedback, ticket resolution, conversion, or human review.
Compare prompts, models, providers, tools, or workflow changes against production traffic.
Debug a bad result by tracing the run, session, model calls, tool calls, and outcome signal together.

Agent Evals is not a separate SDK package. It is the production evaluation workflow you build with Inngest functions, scoring, deferred scoring, sessions, traces, step experiments, and Insights.

Scoring and deferred scoring require the TypeScript SDK v4 and currently use beta APIs.

Basic example

Add scoreMiddleware() to your Inngest client, then score the current run from inside a function.

import { Inngest } from "inngest";
import { scoreMiddleware } from "inngest/experimental";

export const inngest = new Inngest({
  id: "support-agent",
  middleware: [scoreMiddleware()],
});

export default inngest.createFunction(
  {
    id: "answer-support-ticket",
    triggers: { event: "support/ticket.created" },
  },
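  async ({ event, step }) => {
    // generateAnswer and validateAnswer stand in for your own application helpers.
    const answer = await step.run("generate-answer", async () => {
      return generateAnswer(event.data.ticket);
    });

    const passed = await step.run("check-answer", async () => {
      return validateAnswer(answer);
    });

    // Record the boolean check result as a score on the current run.
    await step.score("score-answer-quality", {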
      name: "answer-quality",
      value: passed,
    });

    return { answer, passed };
  }
);

step.score() is durable and memoized. If the function retries or replays, Inngest does not record the same score twice.
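
To make "memoized" concrete, here is a toy model of score recording keyed by step ID. It is a sketch for intuition only and is not how Inngest stores scores; `ScoreLog` and `recordOnce` are hypothetical names invented for this example.

```typescript
// Toy model of memoized score recording. NOT Inngest's implementation:
// ScoreLog and recordOnce are hypothetical names used only for illustration.
type ScoreValue = number | boolean;

interface ScoreEntry {
  name: string;
  value: ScoreValue;
}

class ScoreLog {
  // Keyed by step ID, so a replayed step finds its earlier entry.
  private readonly entries = new Map<string, ScoreEntry>();

  // Stores the score the first time a step ID is seen; later calls return
  // the stored entry and write nothing.
  recordOnce(stepId: string, entry: ScoreEntry): ScoreEntry {
    const existing = this.entries.get(stepId);
    if (existing !== undefined) {
      return existing;
    }
    this.entries.set(stepId, entry);
    return entry;
  }

  get size(): number {
    return this.entries.size;
  }
}

const log = new ScoreLog();
// First execution of the function body.
log.recordOnce("score-answer-quality", { name: "answer-quality", value: true });
// A retry replays the same step ID; the log still holds a single entry.
log.recordOnce("score-answer-quality", { name: "answer-quality", value: true });
```

The step ID is what makes the write idempotent in this model, which is why each `step.score()` call in a function should use its own ID.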

How Agent Evals works

A production eval usually has four parts:

Run the workflow. Put model calls, tool calls, waits, and side effects inside Inngest functions and steps.
Preserve context. Use traces, AI metadata, OpenTelemetry, and sessions so the run can be inspected later.
Attach scores. Use direct scoring when the result is known during the run, or deferred scoring when the outcome arrives later.
Compare changes. Use step experiments when you need to compare prompts, models, tools, providers, or workflow variants.

Add session context

Sessions group related function runs by an ID from your product, such as a conversation, support ticket, import, or agent task.

Add meta.sessions when sending the event that starts the workflow:

await inngest.send({
  name: "support/ticket.created",
  data: {
    ticketId: "tk_123",
    message: "I can't sign in.",
  },
  meta: {
    sessions: {
      ticket_id: "tk_123",
    },
  },
});

The session does not change which functions run. It makes every related run easier to find and inspect in the dashboard.
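
If several code paths send events for the same ticket, a small builder can keep the session key consistent. The sketch below only constructs the payload shown above; `buildTicketCreatedEvent` and the `TicketCreatedEvent` type are hypothetical names, not part of the SDK.

```typescript
// Hypothetical helper: builds the event payload so every ticket event
// carries the same session key. The shape mirrors the example above.
interface TicketCreatedEvent {
  name: "support/ticket.created";
  data: { ticketId: string; message: string };
  meta: { sessions: { ticket_id: string } };
}

function buildTicketCreatedEvent(
  ticketId: string,
  message: string
): TicketCreatedEvent {
  return {
    name: "support/ticket.created",
    data: { ticketId, message },
    // Reuse the product's own ID so related runs land in one session.
    meta: { sessions: { ticket_id: ticketId } },
  };
}

const ticketEvent = buildTicketCreatedEvent("tk_123", "I can't sign in.");
```

The resulting object can be passed to `inngest.send()` in place of the inline literal.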

Score during the run

Use direct scoring when the outcome is known before the function finishes.

Good direct scores include:

guardrail pass or fail
JSON validity
retrieval confidence
tool success
model confidence
result from an LLM-as-a-judge check that runs inline

// `results` is the output of an earlier retrieval step (not shown).
const confidence = await step.run("score-retrieval", async () => {
  return calculateRetrievalConfidence(results);
});

await step.score("score-retrieval-confidence", {
  name: "retrieval-confidence",
  value: confidence,
});

Scores can be numbers or booleans. Use consistent score names across runs so results can be aggregated.
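
The effect of score names on aggregation can be sketched with a small local calculation. This is an illustration, not how Insights computes aggregates; `meanByName` and `ScoreRecord` are hypothetical names. Booleans are converted to `1` and `0` before averaging.

```typescript
// Illustrative only: averages scores per name, treating true as 1 and false as 0.
type ScoreValue = number | boolean;

interface ScoreRecord {
  name: string;
  value: ScoreValue;
}

function meanByName(scores: ScoreRecord[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const { name, value } of scores) {
    const numeric = typeof value === "boolean" ? (value ? 1 : 0) : value;
    const bucket = totals.get(name) ?? { sum: 0, count: 0 };
    bucket.sum += numeric;
    bucket.count += 1;
    totals.set(name, bucket);
  }

  const means = new Map<string, number>();
  totals.forEach(({ sum, count }, name) => {
    means.set(name, sum / count);
  });
  return means;
}

const means = meanByName([
  { name: "answer-quality", value: true },
  { name: "answer-quality", value: false },
  { name: "retrieval-confidence", value: 0.5 },
  { name: "retrieval-confidence", value: 1 },
  // A differently spelled name is tracked as its own metric.
  { name: "answer_quality", value: true },
]);
```

Here `answer-quality` and `answer_quality` end up as two separate entries, which is the fragmentation that consistent naming avoids.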

Score after the run

Use deferred scoring when the signal arrives after the workflow finishes.

For example, a support agent may answer a ticket now, but the useful score may arrive later when the user clicks "helpful" or the ticket reopens.

import { createScorer } from "inngest/experimental";
import { z } from "zod";

export const feedbackScorer = createScorer(
  inngest,
  {
    id: "support-feedback-scorer",
    schema: z.object({ ticketId: z.string() }),
  },
  async ({ event, step }) => {
    const feedback = await step.waitForEvent("wait-for-feedback", {
      event: "support/feedback.received",
      timeout: "7d",
      if: `async.data.ticketId == '${event.data.ticketId}'`,
    });

    // waitForEvent resolves to null on timeout, so missing feedback scores 0.
    return {
      name: "user-feedback",
      value: feedback?.data.helpful ? 1 : 0,
    };
  }
);

Trigger the scorer from the function that produced the result:

export default inngest.createFunction(
  {
    id: "answer-support-ticket",
    triggers: { event: "support/ticket.created" },
  },
  async ({ event, step, defer }) => {
    const answer = await step.run("generate-answer", async () => {
      return generateAnswer(event.data.ticket);
    });

    defer("score-feedback", {
      function: feedbackScorer,
      data: { ticketId: event.data.ticketId },
    });

    return { answer };
  }
);

The scorer runs separately in the background. Its returned score is attributed to the parent run that deferred it.
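
In the scorer above, a timeout leaves `feedback` as `null`, so "no feedback within 7 days" and "marked unhelpful" both produce `0`. One way to keep that decision explicit and unit-testable is to move the mapping into a pure function. The sketch below assumes the feedback event carries `data.helpful` as in the example; `classifyFeedback` and `toUserFeedbackScore` are hypothetical names.

```typescript
// Hypothetical helpers that isolate the feedback-to-score mapping.
interface FeedbackEvent {
  data: { ticketId: string; helpful: boolean };
}

type FeedbackOutcome = "helpful" | "unhelpful" | "no-feedback";

// null models a waitForEvent timeout.
function classifyFeedback(feedback: FeedbackEvent | null): FeedbackOutcome {
  if (feedback === null) {
    return "no-feedback";
  }
  return feedback.data.helpful ? "helpful" : "unhelpful";
}

// Mirrors the scorer above: only explicit "helpful" feedback scores 1.
function toUserFeedbackScore(
  feedback: FeedbackEvent | null
): { name: string; value: number } {
  const outcome = classifyFeedback(feedback);
  return { name: "user-feedback", value: outcome === "helpful" ? 1 : 0 };
}

const helpfulScore = toUserFeedbackScore({
  data: { ticketId: "tk_123", helpful: true },
});
const timeoutScore = toUserFeedbackScore(null);
```

If timeouts should not count against the agent, the classification gives you one place to change that policy.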

Compare variants

Use step experiments when you need to compare more than one version of an AI workflow.

For example, you can split traffic between two prompt strategies, then score the selected variant when feedback arrives:

import { experiment } from "inngest";

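// Runs inside a function handler. `step`, `event`, and `defer` come from the handler
// arguments as in the previous example; `group` is assumed to come from there too.
const { result, experimentRef } = await group.experiment("answer-style", {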
  variants: {
    concise: () => step.run("answer-concise", () => answerConcise(event.data)),
    detailed: () => step.run("answer-detailed", () => answerDetailed(event.data)),
  },
  select: experiment.bucket(event.data.accountId, {
    weights: { concise: 50, detailed: 50 },
  }),
});

defer("score-answer-feedback", {
  function: feedbackScorer,
  data: { ticketId: event.data.ticketId },
  experiment: experimentRef,
});

return result;

experimentRef identifies the experiment and variant that served the result. Passing it to defer() lets the scorer attribute the later score to the selected variant.
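
The idea behind keyed bucketing is that the same key always maps to the same variant. The sketch below shows one common way to do that, hashing the key and walking cumulative weights. It is not the algorithm behind `experiment.bucket()`, which this page does not describe; `fnv1a` and `pickVariant` are illustrative names.

```typescript
// Illustrative deterministic bucketing. NOT Inngest's algorithm.

// 32-bit FNV-1a hash over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Maps a key to a variant name in proportion to the given weights.
function pickVariant(key: string, weights: Record<string, number>): string {
  // Sort by name so the result does not depend on object key order.
  const entries = Object.entries(weights).sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0
  );
  const total = entries.reduce((sum, [, weight]) => sum + weight, 0);
  if (total <= 0) {
    throw new Error("weights must sum to a positive number");
  }

  // Scale the hash into the range [0, total).
  const point = (fnv1a(key) / 0x100000000) * total;
  let cumulative = 0;
  let chosen = "";
  for (const [name, weight] of entries) {
    chosen = name;
    cumulative += weight;
    if (point < cumulative) {
      break;
    }
  }
  return chosen;
}

const firstPick = pickVariant("acct_42", { concise: 50, detailed: 50 });
const secondPick = pickVariant("acct_42", { concise: 50, detailed: 50 });
```

Because the choice depends only on the key and the weights, an account keeps its variant across runs as long as the weights stay the same.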

Attribute scores from later runs

When a score is written outside the run that produced the result, pass the run ID you want to score.

await inngest.score({
  name: "user-feedback",
  value: 1,
  runId: "01ABC123...",
});

For experiment variants, pass both the original runId and the experimentRef returned by group.experiment().

await inngest.score.experiment({
  name: "clickthrough",
  value: 1,
  experiment: experimentRef,
  runId: "01ABC123...",
});

See the scoring reference for the full attribution rules.
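
Scoring from outside the run means your application has to remember which run produced which result. The sketch below assumes you stored the run ID next to the ticket when the workflow ran; `TicketRunIndex` is a hypothetical in-memory stand-in for that storage, and the run ID used here is made up for the demo.

```typescript
// Hypothetical lookup from product ID to run ID. In a real system this
// would be a database column or cache entry, not an in-memory Map.
interface RunScorePayload {
  name: string;
  value: number;
  runId: string;
}

class TicketRunIndex {
  private readonly runIdByTicket = new Map<string, string>();

  // Call this when the workflow run that answers the ticket is known.
  remember(ticketId: string, runId: string): void {
    this.runIdByTicket.set(ticketId, runId);
  }

  // Builds the score payload, or returns null if no run is known for the ticket.
  buildFeedbackScore(ticketId: string, helpful: boolean): RunScorePayload | null {
    const runId = this.runIdByTicket.get(ticketId);
    if (runId === undefined) {
      return null;
    }
    return { name: "user-feedback", value: helpful ? 1 : 0, runId };
  }
}

const runIndex = new TicketRunIndex();
runIndex.remember("tk_123", "run-id-for-tk_123"); // made-up run ID
const knownPayload = runIndex.buildFeedbackScore("tk_123", true);
const unknownPayload = runIndex.buildFeedbackScore("tk_999", true);
```

A non-null payload has the same fields as the `inngest.score()` call above. Returning `null` for unknown tickets avoids attributing a score to the wrong run.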

Inspect results

Agent Evals connects several docs and dashboard surfaces:

Surface	Use it to
Traces	Inspect the function run, steps, model calls, tool calls, errors, and scores.
Sessions	Find all runs related to a conversation, ticket, account, job, or agent task.
Scoring	Attach numeric or boolean quality signals to runs and steps.
Deferred scoring	Evaluate outcomes that arrive after the original workflow finishes.
Step experiments	Split traffic across workflow variants and compare scored outcomes.
Insights	Query historical run, event, step, and trace data.

Choose the right eval tool

Goal	Use
Score a guardrail, validation check, model confidence, or inline judge result	`step.score()` or `inngest.score()`
Wait for user feedback, ticket resolution, conversion, or review before scoring	Deferred scoring
Compare prompts, models, providers, tools, or workflow rewrites	Step experiments
Keep a user, account, or tenant on the same variant while testing	`experiment.bucket()`
Find all runs related to one conversation, ticket, import, or agent task	Sessions
Inspect the model call, tool call, database query, or HTTP request inside a run	OpenTelemetry and Extended Traces
Query historical events, runs, steps, traces, and scores	Insights

Notes and best practices

Start with the product outcome you care about, then decide where that signal appears.
Use direct scoring for signals known during the run.
Use deferred scoring for signals that arrive later or take time to compute.
Use sessions when many runs belong to the same conversation, ticket, user task, or job.
Use stable score names. Changing score names creates separate metrics.
Score experiments with experimentRef so outcomes are attributed to the variant that produced them.
Keep model calls, tool calls, waits, and side effects inside steps so traces explain the score.
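
One way to keep score names stable is to define them once and let the type checker reject anything else. This is a plain TypeScript pattern rather than an SDK feature; `SCORE_NAMES`, `ScoreName`, and `makeScore` are hypothetical names.

```typescript
// Single source of truth for score names used across functions and scorers.
const SCORE_NAMES = {
  answerQuality: "answer-quality",
  retrievalConfidence: "retrieval-confidence",
  userFeedback: "user-feedback",
} as const;

// Union of the allowed name strings.
type ScoreName = (typeof SCORE_NAMES)[keyof typeof SCORE_NAMES];

// Builds a score object; a misspelled name fails to compile.
function makeScore(
  name: ScoreName,
  value: number | boolean
): { name: ScoreName; value: number | boolean } {
  return { name, value };
}

const qualityScore = makeScore(SCORE_NAMES.answerQuality, true);
const retrievalScore = makeScore(SCORE_NAMES.retrievalConfidence, 0.82);
```

The returned objects have the `name` and `value` fields used in the scoring examples on this page.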

Troubleshooting

Issue	Solution
Setup
Scoring is not available.	Scoring requires the TypeScript SDK v4. Upgrade to the latest v4 release.
`step.score()` is missing or throws.	Register `scoreMiddleware()` on the Inngest client: `middleware: [scoreMiddleware()]`.
Attribution
A score does not appear on the run you expected.	When scoring outside the current run, pass the target `runId`. See "Attribute scores from later runs".
A deferred score does not attach to an experiment variant.	Pass the `experimentRef` returned by `group.experiment()` when calling `defer()`. See "Compare variants" and "Attribute scores from later runs".
Related runs are hard to find.	Add `meta.sessions` to the events that start or connect the workflow.
Interpreting results
Boolean scores look like numbers in aggregates.	Boolean scores aggregate numerically. Treat `true` as `1` and `false` as `0`.
A model call succeeded but the score is bad.	Inspect the trace and session. Production evals measure the outcome, not only whether the model request completed.