
Building a Multi-Model AI Support Agent
Lauren Craigie · 10/24/2025 · 16 min read
Most AI support agents use a single model for every query. Simple password resets get the same treatment as complex technical debugging, which means you're either burning money on expensive models for trivial questions or giving customers inadequate answers because your fast model can't reason through nuanced problems.
Thoughtful routing helps: route simple queries to fast models like GPT-5-nano, escalate complex ones to reasoning models like GPT-5-pro. But production systems need more than just routing—you need flow control to prevent cost explosions, durable execution so failures don't waste money, and streaming infrastructure for real-time responses.
In this post I’ll show you how to build all of that, with Inngest and the stack of your choice.
Why it’s hard to build multi-model agents
Before getting into the code, let's talk about why this is hard to get right:
Problem 1: Cost Explosions
Let’s say a customer sends 500 queries in 10 minutes. Without throttling, your system processes all 500. If 20% need GPT-5-pro, that's 100 queries at $0.50 each. You just spent $50 on one customer. With proper throttling: they get 100 queries per hour. The rest queue.
Problem 2: Context Loss on Failures
Your reasoning model fails after spending 30 seconds and $0.40 fetching customer history. Traditional retry systems re-run everything. You waste another $0.40 on the same data fetch. With durable execution: only the failed step retries. Context from successful steps persists. No wasted API calls.
Problem 3: Heavyweight Infrastructure
Most teams spend 2-3 weeks building WebSocket servers, managing connections, handling reconnects, and scaling real-time infrastructure. Then they maintain it forever.
The stack we’ll use to avoid these pitfalls
We're using:
- Next.js API Routes - TypeScript backend (where most AI code lives)
- OpenAI API - GPT-5-nano for fast queries, GPT-5-pro for complex reasoning
- PostgreSQL - Conversation storage
- Inngest - Streaming AI orchestration layer with flow control
- React - Frontend with TypeScript
Of course, you can swap any of these layers for your preferred tooling. The orchestration patterns we’ll discuss work with Claude, Gemini, or really any LLM. The key point to remember when choosing your stack is that production AI systems need infrastructure that understands variable costs, unpredictable timing, and context preservation.
Database Schema
Before writing any agent code, we need to decide how to store conversations. The key decision here: do we store every message separately, or batch them per conversation? Separate messages wins—it gives you granular cost tracking per response and makes it easy to analyze which queries triggered expensive models.
We need three tables. Customers to track who's using the system and their tier (for priority routing). Conversations to group related messages. Messages to store the actual back-and-forth along with metadata about which model ran, how many tokens it used, and what it cost.
This structure makes it easy to answer questions like "how much did this customer cost us this month?" or "what percentage of queries needed the reasoning model?"
CREATE TABLE customers (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT UNIQUE NOT NULL,
tier TEXT DEFAULT 'free',
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE conversations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID REFERENCES customers(id),
status TEXT DEFAULT 'open',
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE messages (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
conversation_id UUID REFERENCES conversations(id),
role TEXT NOT NULL,
content TEXT NOT NULL,
agent_type TEXT,
model TEXT,
tokens_used INTEGER,
cost_usd DECIMAL(10, 6),
created_at TIMESTAMPTZ DEFAULT NOW()
);
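With cost_usd stored per message, that first question is one query. Here's a sketch (the pg Pool setup mirrors what the agent code uses later; table and column names come straight from the schema above):
// lib/costReport.ts (illustrative sketch)
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Total spend and response count per model for one customer, current month
export async function monthlyCostByCustomer(customerId: string) {
  const result = await pool.query(
    `SELECT m.model, COUNT(*) AS responses, SUM(m.cost_usd) AS total_cost_usd
     FROM messages m
     JOIN conversations c ON c.id = m.conversation_id
     WHERE c.customer_id = $1
       AND m.created_at >= date_trunc('month', NOW())
     GROUP BY m.model`,
    [customerId]
  );
  return result.rows;
}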
The Fast Agent with Flow Control
Now we get to the core orchestration. This is where most teams reach for a job queue like Celery or BullMQ and then spend weeks adding flow control, retry logic, and streaming on top.
The key architectural decision: how do you prevent one customer from consuming all your resources? Traditional queues give you a single global queue. If Customer A sends 1,000 queries, they fill the queue. Customer B waits. Inngest solves this with concurrency keys—you can create virtual queues per customer. Each customer gets their own throughput limit. Customer A's 1,000 queries don't block Customer B's 5 queries.
The second decision: how do you prevent cost explosions? You need throttling at the customer level. Process the first 100 queries immediately, queue the rest for the next hour. This turns unpredictable costs into predictable ones. Here's what that looks like in code. Pay attention to the function configuration at the top—that's where flow control happens:
// inngest/client.ts
import { Inngest } from "inngest";
import { realtimeMiddleware, channel, topic } from "@inngest/realtime";
export const inngest = new Inngest({
id: "support-agent",
middleware: [realtimeMiddleware()],
});
// Define the user channel for streaming messages
export const userChannel = channel((userId: string) => `user:${userId}`).addTopic(
topic("messages").type<{ chunk: string; conversation_id: string }>()
);
// inngest/functions/fastAgent.ts
import { inngest, userChannel } from "../client";
import OpenAI from "openai";
import { Pool } from "pg";
const openai = new OpenAI();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
export const fastAgent = inngest.createFunction(
{
id: "fast-support-agent",
concurrency: {
limit: 50,
key: "event.data.customer_id"
},
throttle: {
limit: 100,
period: "1h",
key: "event.data.customer_id"
}
},
{ event: "support/query.received" },
async ({ event, step, publish }) => {
const { customer_id, conversation_id, message, user_id } = event.data;
// Step 1: Get customer context (cached if this step retries)
const context = await step.run("get-context", async () => {
const client = await pool.connect();
try {
const result = await client.query(
"SELECT tier FROM customers WHERE id = $1",
[customer_id]
);
return { tier: result.rows[0]?.tier || "free" };
} finally {
client.release();
}
});
// Step 2: Stream AI response with built-in streaming
const response = await step.run("generate-response", async () => {
const stream = await openai.chat.completions.create({
model: "gpt-5-nano",
messages: [
{
role: "system",
content: `Support agent. Customer tier: ${context.tier}`
},
{ role: "user", content: message }
],
stream: true
});
let fullContent = "";
// Approximate token count: streamed chunks are roughly one token each
let tokenCount = 0;
// Inngest makes streaming trivial - just publish chunks
for await (const chunk of stream) {
if (chunk.choices[0]?.delta?.content) {
const text = chunk.choices[0].delta.content;
fullContent += text;
tokenCount += 1;
// Publish to Inngest Realtime - subscribers get it instantly
await publish(userChannel(user_id).messages({ chunk: text, conversation_id }));
}
}
return {
content: fullContent,
tokens: tokenCount,
model: "gpt-5-nano"
};
});
// Step 3: Save to database
await step.run("save-response", async () => {
const client = await pool.connect();
try {
await client.query(
`INSERT INTO messages
(conversation_id, role, content, agent_type, model, tokens_used, cost_usd)
VALUES ($1, $2, $3, $4, $5, $6, $7)`,
[
conversation_id,
"assistant",
response.content,
"fast",
"gpt-5-nano",
response.tokens,
(response.tokens / 1_000_000) * 2.50
]
);
} finally {
client.release();
}
});
// Decide if we need to escalate to reasoning model
const needsReasoning = (
response.content.length > 800 ||
message.toLowerCase().includes("api") ||
message.toLowerCase().includes("integration") ||
message.toLowerCase().includes("webhook")
);
if (needsReasoning) {
await step.sendEvent("escalate", {
name: "support/query.escalate",
data: {
customer_id,
conversation_id,
message,
user_id,
tier: context.tier
}
});
}
return {
escalated: needsReasoning,
cost: response.tokens * 2.50 / 1_000_000
};
}
);
Let's break down what's happening:
Flow Control at Function Level
concurrency: {
limit: 50,
key: "event.data.customer_id"
}
This creates virtual queues per customer. Each customer can have max 50 concurrent executions. Customer A's 100 queries won't block Customer B's 5 queries.
throttle: {
limit: 100,
period: "1h",
key: "event.data.customer_id"
}
Each customer can trigger max 100 queries per hour. After that, events queue until the window resets. This prevents the $500 overnight surprise.
Durable Execution
Each step.run() is cached. If "generate-response" fails, "get-context" doesn't re-run. You don't pay for the database query twice. This matters when steps involve expensive API calls or slow database operations.
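A minimal sketch of that behavior, using a made-up demo/run event and step names rather than anything from the support agent:
// Illustrative only: shows step-level caching, not part of the support agent
import { inngest } from "../client";

export const durableExample = inngest.createFunction(
  { id: "durable-example" },
  { event: "demo/run" },
  async ({ step }) => {
    // Runs once; Inngest persists the return value
    const context = await step.run("expensive-fetch", async () => {
      // imagine a slow, costly lookup here
      return { history: ["...previous messages..."] };
    });

    // If this step throws, only it retries. "expensive-fetch" is not
    // re-executed; its cached result is replayed into `context`.
    const summary = await step.run("summarize", async () => {
      return `Summarizing ${context.history.length} prior messages`;
    });

    return { summary };
  }
);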
Built-in Streaming
await publish(userChannel(user_id).messages({ chunk: text, conversation_id }));
That's it. No WebSocket server. No connection management. Inngest handles the infrastructure. Your frontend subscribes to the channel and receives chunks in real-time.
The Reasoning Agent
The fast agent decides when to escalate by sending an event. Now we need a second agent that listens for that event and handles complex queries with a reasoning model.
The constraints are different here. Reasoning models are 20x more expensive and take 10x longer. You need stricter flow control—lower concurrency, tighter throttling. You also want priority routing: enterprise customers should jump ahead of free-tier customers when the queue backs up.
This is where event-driven architecture shines. The fast agent doesn't directly call the reasoning agent. It sends an event. That event triggers the reasoning agent function, which has its own independent flow control configuration. They scale separately.
// inngest/functions/reasoningAgent.ts
import { inngest } from "../client";
import OpenAI from "openai";
import { Pool } from "pg";
const openai = new OpenAI();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
export const reasoningAgent = inngest.createFunction(
{
id: "reasoning-support-agent",
concurrency: {
limit: 5,
key: "event.data.customer_id"
},
throttle: {
limit: 20,
period: "1h",
key: "event.data.customer_id"
},
priority: {
run: "event.data.tier == 'enterprise' ? 200 : event.data.tier == 'pro' ? 100 : 50"
}
},
{ event: "support/query.escalate" },
async ({ event, step }) => {
const data = event.data;
// Get full conversation history
const history = await step.run("get-history", async () => {
const client = await pool.connect();
try {
const result = await client.query(
`SELECT role, content FROM messages
WHERE conversation_id = $1
ORDER BY created_at ASC`,
[data.conversation_id]
);
return result.rows.map(row => ({
role: row.role as "user" | "assistant",
content: row.content
}));
} finally {
client.release();
}
});
// Use reasoning model
const response = await step.run("generate-reasoning", async () => {
const completion = await openai.chat.completions.create({
model: "gpt-5-pro",
messages: [
...history,
{ role: "user", content: data.message }
]
});
return {
content: completion.choices[0].message.content || "",
tokens: completion.usage?.total_tokens || 0,
model: "gpt-5-pro"
};
});
// Save response
await step.run("save-response", async () => {
const client = await pool.connect();
try {
await client.query(
`INSERT INTO messages
(conversation_id, role, content, agent_type, model, tokens_used, cost_usd)
VALUES ($1, $2, $3, $4, $5, $6, $7)`,
[
data.conversation_id,
"assistant",
response.content,
"reasoning",
"gpt-5-pro",
response.tokens,
(response.tokens / 1_000_000) * 60.00
]
);
} finally {
client.release();
}
});
return { cost: response.tokens * 60.00 / 1_000_000 };
}
);
Key differences:
- Lower concurrency: 5 instead of 50. GPT-5-pro is expensive and slow.
- Stricter throttle: 20/hour instead of 100.
- Priority routing: Enterprise customers process first during peak load.
Next.js API Routes
Ok, now we need an HTTP interface between the frontend and Inngest. The API layer handles two jobs: receiving support queries from users and generating subscription tokens for the streaming connection.
The pattern here is important. Your API doesn't run the AI logic—it sends an event to Inngest and returns immediately. This keeps your API fast and lets Inngest handle the orchestration, retries, and flow control.
For streaming, you need a way to authenticate users to Inngest Realtime. That's what the subscription token endpoint does. The frontend requests a token, then uses it to subscribe to the customer's channel.
// app/api/support/route.ts
import { NextRequest, NextResponse } from "next/server";
import { inngest } from "@/inngest/client";
import { Pool } from "pg";
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
interface SupportQuery {
customer_id: string;
user_id: string;
conversation_id: string;
message: string;
}
export async function POST(req: NextRequest) {
const query: SupportQuery = await req.json();
// Validate customer exists
const client = await pool.connect();
try {
const result = await client.query(
"SELECT id FROM customers WHERE id = $1",
[query.customer_id]
);
if (result.rows.length === 0) {
return NextResponse.json(
{ error: "Customer not found" },
{ status: 404 }
);
}
} finally {
client.release();
}
// Send event to Inngest
await inngest.send({
name: "support/query.received",
data: {
customer_id: query.customer_id,
user_id: query.user_id,
conversation_id: query.conversation_id,
message: query.message
}
});
return NextResponse.json({ status: "processing" });
}
// app/api/subscription-token/route.ts
import { NextRequest, NextResponse } from "next/server";
import { getSubscriptionToken } from "@inngest/realtime";
import { inngest, userChannel } from "@/inngest/client";
export async function GET(req: NextRequest) {
const searchParams = req.nextUrl.searchParams;
const userId = searchParams.get("user_id");
if (!userId) {
return NextResponse.json(
{ error: "user_id is required" },
{ status: 400 }
);
}
const token = await getSubscriptionToken(inngest, {
channel: userChannel(userId),
ttl: "1h"
});
return NextResponse.json({ token });
}
React Frontend with Streaming
The frontend needs to handle two things: sending queries to the API and receiving streamed responses in real-time. The streaming piece is where most teams build custom infrastructure—WebSocket servers, connection pooling, reconnect logic.
With Inngest Realtime, you subscribe to a channel and get messages. That's it.
The pattern: when the component mounts, get a subscription token from your API. Use that token to subscribe to Inngest Realtime. When chunks arrive, append them to your streaming state. When streaming completes, move the full message into your message history.
// app/components/SupportChat.tsx
"use client";
import { useState, useEffect } from 'react';
import { subscribe } from '@inngest/realtime';
interface Message {
role: 'user' | 'assistant';
content: string;
}
export default function SupportChat() {
const [messages, setMessages] = useState<Message[]>([]);
const [input, setInput] = useState('');
const [streamingChunk, setStreamingChunk] = useState('');
const userId = 'user-123'; // From your auth system
useEffect(() => {
async function setupRealtimeSubscription() {
// Get subscription token from your API
const response = await fetch(`/api/subscription-token?user_id=${userId}`);
const { token } = await response.json();
// Subscribe to Inngest Realtime
const stream = await subscribe({
channel: `user:${userId}`,
topics: ['messages'],
token
});
// Handle incoming messages
for await (const message of stream) {
setStreamingChunk(prev => prev + message.data.chunk);
}
}
setupRealtimeSubscription();
}, [userId]);
// When streaming completes, add to messages
useEffect(() => {
if (streamingChunk && streamingChunk.length > 0) {
const timer = setTimeout(() => {
setMessages(prev => [...prev, {
role: 'assistant',
content: streamingChunk
}]);
setStreamingChunk('');
}, 500); // Wait 500ms after last chunk
return () => clearTimeout(timer);
}
}, [streamingChunk]);
const sendMessage = async () => {
if (!input.trim()) return;
const userMessage: Message = { role: 'user', content: input };
setMessages(prev => [...prev, userMessage]);
setInput('');
await fetch('/api/support', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
customer_id: userId,
user_id: userId,
conversation_id: 'conv-123',
message: input
})
});
};
return (
<div className="flex flex-col h-screen max-w-3xl mx-auto p-6">
<div className="flex-1 overflow-y-auto space-y-4">
{messages.map((msg, i) => (
<div
key={i}
className={`p-4 rounded-lg ${
msg.role === 'user'
? 'bg-blue-50 ml-12'
: 'bg-gray-50 mr-12'
}`}
>
{msg.content}
</div>
))}
{streamingChunk && (
<div className="p-4 rounded-lg bg-gray-50 mr-12">
{streamingChunk}
<span className="inline-block w-2 h-4 ml-1 bg-gray-400 animate-pulse" />
</div>
)}
</div>
<div className="flex gap-2 mt-4">
<input
type="text"
value={input}
onChange={(e) => setInput(e.target.value)}
onKeyDown={(e) => e.key === 'Enter' && sendMessage()}
placeholder="Describe your issue..."
className="flex-1 px-4 py-2 border rounded-lg"
/>
<button
onClick={sendMessage}
className="px-6 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700"
>
Send
</button>
</div>
</div>
);
}
What This Actually Prevents
Let's walk through production scenarios.
Scenario: The Spam Attack
Customer sends 500 queries in 10 minutes. With throttle at 100/hour:
- First 100 queries process immediately
- Remaining 400 queue for the next hour
- Max cost: $10 instead of $50
Scenario: The Failure Loop
Reasoning agent fails at step 2 after spending $0.40 on step 1 (getting history).
Traditional approach: Retry whole function. Waste $0.40 on step 1 again.
Inngest approach: Step 1 result is cached. Only step 2 retries. Zero wasted cost.
At 100 failures per day, this saves $40 daily. $1,200 monthly.
Scenario: The Enterprise Customer
Enterprise customer triggers 50 concurrent complex queries. Without concurrency keys, they monopolize all workers. Other customers wait 5+ minutes.
With concurrency keys: the enterprise customer gets at most 5 concurrent reasoning queries (the per-customer limit we configured). Other customers continue processing normally.
Scenario: Building Streaming
Without Inngest: 2-3 weeks building WebSocket infrastructure. Connection management. Reconnect logic. Scaling strategy. Monitoring.
With Inngest: call publish() from your function. Done. Your frontend subscribes and receives chunks. Zero infrastructure work.
Extensions Worth Building
Add Human Handoff
// In your agent function
const approval = await step.waitForEvent("wait-for-approval", {
  event: "support/agent.approved",
  match: "data.conversation_id",
  timeout: "30m"
});
This pauses the workflow and waits for a support agent to approve. The function resumes automatically when the event arrives; if the timeout passes first, waitForEvent returns null so you can handle the fallback. No polling. No state management. Just works.
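For the other side of that handoff, here's a minimal sketch of an internal endpoint a support dashboard might call to emit the approval event. The route path and payload fields are assumptions for illustration; only inngest.send() and the event name come from the pattern above.
// app/api/internal/approve/route.ts (hypothetical internal endpoint)
import { NextRequest, NextResponse } from "next/server";
import { inngest } from "@/inngest/client";

export async function POST(req: NextRequest) {
  const { conversation_id, agent_id } = await req.json();

  // Sending this event resumes any run paused on waitForEvent with a
  // matching data.conversation_id
  await inngest.send({
    name: "support/agent.approved",
    data: { conversation_id, approved_by: agent_id },
  });

  return NextResponse.json({ status: "approved" });
}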
Track Accuracy with Product Events
Store thumbs up/down feedback:
// inngest/functions/trackFeedback.ts
import { inngest } from "../client";
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Note: assumes a `feedback` table alongside the schema above
export const trackFeedback = inngest.createFunction(
{ id: "track-feedback" },
{ event: "support/feedback.received" },
async ({ event, step }) => {
// Link feedback to the original AI workflow run
// Use this data to tune your confidence thresholds
await step.run("save-feedback", async () => {
const client = await pool.connect();
try {
await client.query(
`INSERT INTO feedback (conversation_id, run_id, rating, comment)
VALUES ($1, $2, $3, $4)`,
[
event.data.conversation_id,
event.data.run_id,
event.data.rating,
event.data.comment
]
);
} finally {
client.release();
}
});
}
);
Add Specialized Agents
Create agents for billing, technical issues, account management. Route based on initial classification. Each agent can have different flow control settings.
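A rough triage sketch, assuming a support/query.triage event and per-domain event names that aren't part of the build above: classify with the fast model, then fan out to a specialized function that carries its own concurrency and throttle configuration.
// inngest/functions/triageAgent.ts (illustrative sketch)
import { inngest } from "../client";
import OpenAI from "openai";

const openai = new OpenAI();

export const triageAgent = inngest.createFunction(
  { id: "triage-agent" },
  { event: "support/query.triage" },
  async ({ event, step }) => {
    // Cheap classification with the fast model
    const category = await step.run("classify", async () => {
      const completion = await openai.chat.completions.create({
        model: "gpt-5-nano",
        messages: [
          {
            role: "system",
            content: "Classify the query as one of: billing, technical, account. Reply with one word."
          },
          { role: "user", content: event.data.message }
        ]
      });
      return completion.choices[0].message.content?.trim().toLowerCase() || "technical";
    });

    // Fan out to a specialized agent; each listener has its own flow control
    await step.sendEvent("route-to-specialist", {
      name: `support/query.${category}`, // e.g. support/query.billing
      data: event.data
    });

    return { category };
  }
);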
Serving Your Functions
Finally, you need to serve your Inngest functions so the platform can invoke them:
// app/api/inngest/route.ts
import { serve } from "inngest/next";
import { inngest } from "@/inngest/client";
import { fastAgent, reasoningAgent, trackFeedback } from "@/inngest/functions";
export const { GET, POST, PUT } = serve({
client: inngest,
functions: [
fastAgent,
reasoningAgent,
trackFeedback,
],
});
Bringing It All Together
Let’s recap! We built a support system that:
- Routes queries between GPT-5-nano and GPT-5-pro
- Prevents cost explosions through throttling
- Isolates customers with concurrency keys, and
- Streams responses in real-time.
The core challenge we tackled: traditional job queues weren't designed for AI's variable costs (one query costs $0.005, another costs $0.50) and unpredictable duration (most finish in 2 seconds, some take 2 minutes).
Inngest handles this because flow control is configuration, not code you maintain. Durable execution means failed steps don't re-run expensive operations. Streaming is calling publish() rather than building WebSocket infrastructure. Priority routing lets enterprise customers jump the queue.
So let’s just answer what you might be thinking: Could you build this with traditional backend workers and message queues? Partially, but then you'd burn 6 months building flow control, retry logic, and streaming infrastructure that already exists in Inngest.
The companies moving fastest in AI aren't the ones with the best models—everyone has access to the same models. They're the ones who stopped rebuilding infrastructure and started building product.
Get started: app.inngest.com/sign-up