# AGNT Traces

Every LLM call your agents make through AGNT Studio produces a trace — the exact prompt that was sent, the exact response that came back, tokens consumed, cost incurred, latency measured, and the full compilation context (variables, conditions, model config). AGNT Traces captures all of it and gives you a built-in edit-and-replay loop to fix problems on the spot.

For fleet-level operational metrics (message volumes, task completion rates, active users, assistant performance), see AGNT Analytics.

## Why AGNT Traces

Observability tools show you what happened. AGNT Traces lets you fix it.

Most LLM observability platforms (LangSmith, Langfuse, Helicone) give you a read-only trace viewer. You can see the prompt, the response, the tokens, the cost. Great. Now what? You copy the prompt into a playground, tweak it, re-run it manually, copy the changes back to your codebase, open a PR, get it reviewed, deploy it, and hope it works.

AGNT Studio hosts the prompts. So when you open a trace in the playground, edit a block, and re-run it — you're editing the actual prompt. Save your changes and they go directly to the draft. Publish and they're live. The distance from "this response was bad" to "it's fixed in production" is four API calls, not four days.

## Quick Start

### List recent traces

```bash
curl "https://studio.agnt.ai/api/v1/tenants/$TENANT_ID/traces" \
  -H "Authorization: Bearer $TOKEN"
```

### Open a trace in the playground, edit, and save

```bash
# 1. Create a playground session from a trace
curl -X POST "https://studio.agnt.ai/api/v1/tenants/$TENANT_ID/traces/$TRACE_ID/playground/sessions" \
  -H "Authorization: Bearer $TOKEN"

# 2. Edit a block in the session
curl -X PATCH "https://studio.agnt.ai/api/v1/tenants/$TENANT_ID/playground/sessions/$SESSION_ID/blocks/$BLOCK_ID" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"content": "Updated instruction text with better wording."}'

# 3. Re-run to test the change (real LLM call)
curl -X POST "https://studio.agnt.ai/api/v1/tenants/$TENANT_ID/playground/sessions/$SESSION_ID/run" \
  -H "Authorization: Bearer $TOKEN"

# 4. Save changes back to the prompt draft
curl -X POST "https://studio.agnt.ai/api/v1/tenants/$TENANT_ID/playground/sessions/$SESSION_ID/save" \
  -H "Authorization: Bearer $TOKEN"
```

That's the closed loop. Trace to fix in four calls.

## Core Concepts

### Studio Traces

Studio Traces capture the full context of every LLM call made through AGNT Studio-managed prompts: the compiled prompt (after variable resolution and condition evaluation), the model response, token counts, cost, latency, and status.

Every trace records:

| Field | What it captures |
| --- | --- |
| `promptName` | Which prompt was compiled |
| `manifest` | The full compiled manifest (system message, tools, model config) |
| `etag` | Version fingerprint of the prompt at call time |
| `variables` | Variable values used for compilation |
| `messages` | The message array sent to the model |
| `output` | The model's response |
| `inputTokens` | Tokens in the prompt |
| `outputTokens` | Tokens in the response |
| `totalTokens` | Total token consumption |
| `cost` | Dollar cost of the call |
| `duration` | Latency in seconds |
| `model` | Provider, model name, and metadata |
| `status` | `success` or `error` |
| `tags` | Custom tags for filtering |
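For agents processing traces in code, the same schema can be sketched as a Python `TypedDict` for lightweight type-checking. This is an illustrative model, not an official SDK type; field names follow the table above:

```python
from typing import Literal, TypedDict

class ModelInfo(TypedDict):
    """The model block of a trace: provider, model name, metadata."""
    provider: str
    name: str
    metadata: dict

class Trace(TypedDict):
    """One recorded LLM call, mirroring the trace fields above."""
    promptName: str
    manifest: dict
    etag: str
    variables: dict
    messages: list[dict]
    output: str
    inputTokens: int
    outputTokens: int
    totalTokens: int
    cost: float          # dollars
    duration: float      # seconds
    model: ModelInfo
    status: Literal["success", "error"]
    tags: list[str]
```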

### The Playground (Trace-to-Edit Loop)

This is what separates AGNT from every other observability tool. The playground is not a separate sandbox — it's the prompt editor loaded with the trace's resolved state. Same blocks, same variables, same model config, but editable.

The workflow:

  1. Open a trace in the playground. Creates a session with the trace's state.
  2. Edit blocks. Change wording, reorder content, add or remove blocks.
  3. Update variables. Try different variable values.
  4. Switch models. Test the same prompt on a different model.
  5. Compile and run. Real LLM call with your changes.
  6. Diff. See exactly what changed between the original trace and your edits.
  7. Save. Push your edits back to the prompt's draft.
  8. Publish. Deploy the fix to production.

Every step is an API call. An agent can do this entire loop programmatically — find underperforming traces, open playground sessions, iterate on the prompt, and deploy fixes without human involvement.

Note: The playground API lives in AGNT Studio's namespace. During prompt authoring, the playground is a Studio feature for testing as you build. Post-run, the same playground becomes the trace investigation tool documented here. Same API, different entry point.

### Trace Diffing

Compare a trace against the current state of its prompt:

```bash
curl "https://studio.agnt.ai/api/v1/tenants/$TENANT_ID/traces/$TRACE_ID/diff" \
  -H "Authorization: Bearer $TOKEN"
```

This answers: "The prompt has changed since this trace was recorded — what's different?" Useful for understanding whether a regression was caused by a prompt change.
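Because every trace stores the prompt's `etag` at call time, a quick local check can flag drifted traces before reaching for the diff endpoint. A hypothetical helper, assuming the trace field names documented above:

```python
def find_stale_traces(traces: list[dict], current_etag: str) -> list[dict]:
    """Traces whose recorded etag no longer matches the prompt's current
    etag were generated by an outdated prompt version."""
    return [t for t in traces if t.get("etag") != current_etag]
```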

You can also diff within a playground session to see what you've changed before saving:

```bash
curl "https://studio.agnt.ai/api/v1/tenants/$TENANT_ID/playground/sessions/$SESSION_ID/diff" \
  -H "Authorization: Bearer $TOKEN"
```

## API Reference

### Studio Traces (studio.agnt.ai/api/v1)

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| GET | `/tenants/:tenantId/traces` | List traces | Management |
| POST | `/tenants/:tenantId/traces` | Ingest a trace | Management |
| GET | `/tenants/:tenantId/traces/:traceId` | Get trace detail | Management |
| GET | `/tenants/:tenantId/traces/:traceId/diff` | Diff trace vs current prompt | Management |

#### POST /tenants/:tenantId/traces

Ingest a trace from the studio-node SDK or directly.

```json
{
  "promptName": "customer-support",
  "manifest": {},
  "etag": "abc123",
  "variables": { "company_name": "Acme Corp" },
  "messages": [
    { "role": "user", "content": "I need help with my order" }
  ],
  "output": "I'd be happy to help you with your order...",
  "inputTokens": 150,
  "outputTokens": 89,
  "totalTokens": 239,
  "cost": 0.0012,
  "duration": 1.2,
  "model": {
    "provider": "anthropic",
    "name": "claude-sonnet-4-6",
    "metadata": {}
  },
  "status": "success",
  "metadata": {},
  "tags": ["production", "support"]
}
```
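Note that `totalTokens` should equal `inputTokens + outputTokens`. If you build ingest payloads yourself rather than through the SDK, a small builder (a hypothetical helper, not part of any AGNT library) can derive the total so the two never disagree:

```python
def make_trace_payload(prompt_name: str, messages: list[dict], output: str,
                       input_tokens: int, output_tokens: int,
                       cost: float, duration: float, model: dict,
                       status: str = "success", **extra) -> dict:
    """Assemble a trace ingest payload. totalTokens is derived from the
    two token counts; extra keys (manifest, etag, variables, tags, ...)
    pass through unchanged."""
    payload = {
        "promptName": prompt_name,
        "messages": messages,
        "output": output,
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "totalTokens": input_tokens + output_tokens,  # always consistent
        "cost": cost,
        "duration": duration,
        "model": model,
        "status": status,
    }
    payload.update(extra)
    return payload
```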

### Studio Playground (studio.agnt.ai/api/v1)

| Method | Path | Description | Auth |
| --- | --- | --- | --- |
| POST | `/tenants/:t/traces/:traceId/playground/sessions` | Create session from trace | Management |
| GET | `/tenants/:t/playground/sessions/:sessionId` | Get session | Management |
| PATCH | `/tenants/:t/playground/sessions/:sessionId/blocks/:blockId` | Edit block | Management |
| PATCH | `/tenants/:t/playground/sessions/:sessionId/variables` | Update variables | Management |
| PATCH | `/tenants/:t/playground/sessions/:sessionId/models` | Update models | Management |
| POST | `/tenants/:t/playground/sessions/:sessionId/compile` | Compile session | Management |
| POST | `/tenants/:t/playground/sessions/:sessionId/run` | Run (real LLM call) | Management |
| GET | `/tenants/:t/playground/sessions/:sessionId/diff` | Diff changes | Management |
| POST | `/tenants/:t/playground/sessions/:sessionId/save` | Save back to draft | Management |
| DELETE | `/tenants/:t/playground/sessions/:sessionId` | Delete session | Management |

## For Coding Agents

Traces are your feedback loop. If you're a coding agent managing prompts through AGNT Studio, here's the workflow:

### The closed loop

1. `GET /traces` → find the underperforming call
2. `POST /playground/sessions` → open the trace in a playground session
3. `PATCH /blocks` → tweak the prompt
4. `POST /run` → re-run with the change
5. `POST /save` → save edits to the draft
6. `POST /publish` → deploy to production

This is the entire debug-iterate-deploy cycle, all via API. No human needed. No codebase to modify.

### Pattern: Trace-driven debugging

  1. List traces with GET /tenants/:t/traces to find specific failing calls.
  2. Diff the trace against the current prompt with GET /traces/:traceId/diff to see if a prompt change caused the regression.
  3. Open a playground session, iterate on the prompt, and deploy the fix.

For fleet-level regression detection (tracking completion rates over time, spotting degradation trends), use AGNT Analytics to identify the scope first, then drill into individual traces here.

### Pattern: Cost optimization

  1. Pull traces for expensive calls (sort by cost or totalTokens).
  2. Open playground sessions and test with cheaper models or shorter prompts.
  3. Compare token counts between the original trace and your playground run.
  4. Save and publish when you find a configuration that maintains quality at lower cost.
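A hypothetical helper for step 1, assuming the `cost` and `totalTokens` trace fields from the schema above:

```python
def most_expensive(traces: list[dict], n: int = 10) -> list[dict]:
    """Return the n costliest traces, ranked by dollar cost, with
    total token count as a tiebreaker."""
    return sorted(
        traces,
        key=lambda t: (t.get("cost", 0.0), t.get("totalTokens", 0)),
        reverse=True,
    )[:n]
```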

### What to track

- `status: "error"` traces — these are failed LLM calls. Investigate immediately.
- High-`cost` traces — look for prompts that are longer than they need to be.
- High-`duration` traces — could indicate model congestion or overly complex prompts.
- Trace-to-prompt drift — use the diff endpoint to detect when production traces were generated by an outdated prompt version.

## For Product Teams

- **Quality assurance.** Every LLM response your product generates is recorded with its full context. When a customer reports a bad response, you can pull the exact trace, see the exact prompt, and understand exactly what happened.
- **The playground closes the feedback loop.** Product managers can open a trace, see the problematic response, tweak the prompt in the playground, re-run it, and save the fix — without involving engineering. The distance from "this response was bad" to "it's fixed" is minutes, not sprint cycles.
- **Trace diffing answers "what changed?"** When response quality shifts, diff the trace against the current prompt version. Did someone update the prompt? Did variable values change? The diff tells you.
- **Operational metrics live in Analytics.** For questions like "how many messages did we handle this week?" or "which assistants are most active?", see AGNT Analytics. Traces answer "what happened in this specific call." Analytics answers "how is our fleet performing overall."