# Concepts (/docs/concepts)
The event log is the source of truth. The runtime, the inspector, and
replay all read the same shape. Every other design choice falls out of
that.
## The event log [#the-event-log]
Every run is an append-only sequence of events:
```text
seq=1 RunStarted (model, tools, system prompt, params hash)
seq=2 TurnStarted (turn id, prompt hash)
seq=3 AssistantMessageCompleted (full text, tool plans, raw response hash)
seq=4 ToolCallScheduled
seq=5 ToolCallCompleted
seq=6 TurnStarted
…
seq=N RunCompleted | RunFailed | RunCancelled (Merkle root of all priors)
```
Every event carries a `PrevHash` field equal to BLAKE3 over the canonical
CBOR of the previous event. The terminal event commits a Merkle root over
all priors. Tampering with any earlier event breaks both the chain and the
root; `eventlog.Validate` returns `ErrLogCorrupt`.
The full schema lives on the [Event schema](/docs/events) page.
## The determinism contract [#the-determinism-contract]
Starling borrows Temporal's determinism model. The agent loop is allowed
to do exactly two things:
1. Read events from the log.
2. Emit commands that the runtime reifies as new events.
Anything else: wall clock, RNG, HTTP, filesystem: must go through the
`step` package:
```go
now := step.Now(ctx) // recorded once, returned on replay
n := step.Random(ctx) // same idea, deterministic
val, err := step.SideEffect(ctx, "name", func() (T, error) {
// any non-deterministic effect goes here
})
```
Live: append a `SideEffectRecorded` event. Replay: read the recorded
value, skip the closure. The `name` argument is the lookup key — reuse
it for the same logical effect, change it when the effect changes.
## Replay [#replay]
`starling.Replay(ctx, log, runID, agent)` re-executes a recorded run
against the same agent wiring. Every event the loop attempts to emit is
compared to the recording at the matching seq:
* `Kind` mismatch: the loop produced a different event type.
* `Payload` mismatch: same kind, different bytes.
* `Exhausted`: the loop ran past the end of the recording.
Mismatches surface as a typed `*replay.Divergence` carrying `RunID`,
`Seq`, `Kind`, `ExpectedKind`, `Class`, and `Reason`. Wrap with
`errors.Is(err, starling.ErrNonDeterminism)` to detect; use `errors.As`
to get the structured fields.
Replay never calls the provider. Tool execution still runs but reads
recorded results out of the log when wrapped in `step.SideEffect`.
## Resume [#resume]
When a run crashes mid-flight, `(*Agent).Resume(ctx, runID, extra)`
reconstructs the conversation state from the log and re-enters the agent
loop in a new process. Pending tool calls are re-issued under fresh
`CallID`s; the orphaned `ToolCallScheduled` from the prior process stays
in the log for audit. A `RunResumed` seam event marks the boundary.
## Tools [#tools]
A tool is anything implementing `tool.Tool`:
```go
type Tool interface {
Name() string
Description() string
Schema() json.RawMessage // JSON Schema for input
Execute(ctx context.Context, in json.RawMessage) (json.RawMessage, error)
}
```
`tool.Typed[In, Out](name, description, fn)` is the convenience wrapper
that derives the JSON Schema from your input type via reflection. For
non-deterministic tools (HTTP, filesystem, anything beyond pure compute),
wrap the work in `step.SideEffect` so replay returns the recorded result.
## Budgets [#budgets]
Four axes:
| Axis | Where enforced |
| ----------------- | --------------------------------------- |
| `MaxInputTokens` | Pre-call, before every LLM call |
| `MaxOutputTokens` | Mid-stream, on every usage chunk |
| `MaxUSD` | Mid-stream, using per-model prices |
| `MaxWallClock` | `context.WithDeadline` wrapping the run |
A trip emits a `BudgetExceeded` event with `(limit, cap, actual, where)`
and unwinds the run with `RunFailed{ErrorType:"budget"}`. Budgets are
inline runtime checks, not after-the-fact dashboards.
## Backends [#backends]
Three event-log implementations:
* `eventlog.NewInMemory()`: tests, demos, ephemeral CLI tools.
* `eventlog.NewSQLite(path)`: single-host services. WAL mode + per-run
`_txlock=immediate` makes one-writer-many-readers correct.
* `eventlog.NewPostgres(db)`: multi-host services. Per-run advisory
locks serialize appenders by run.
All three satisfy `EventLog` and share the migration contract.
# Event schema (/docs/events)
Wire-format reference for the event log.
## Envelope [#envelope]
Every event is wrapped in the same struct:
```go
type Event struct {
RunID string `cbor:"run_id"` // ULID, per-run identifier
Seq uint64 `cbor:"seq"` // monotonic per run, starts at 1
PrevHash []byte `cbor:"prev_hash"` // BLAKE3 of canonical CBOR of prev event
Timestamp int64 `cbor:"ts"` // unix nanoseconds (from step.Now)
Kind Kind `cbor:"kind"` // discriminator
Payload cborenc.RawMessage `cbor:"payload"` // kind-specific struct, CBOR-encoded
}
```
Hash chain: `ev.PrevHash = BLAKE3(CanonicalCBOR(prevEvent))`. The first
event has empty `PrevHash`. Canonical CBOR follows RFC 8949 §4.2:
shortest integer form, sorted map keys, no indefinite-length items,
shortest float that round-trips.
## Event kinds [#event-kinds]
Closed set. Adding a kind is a schema-version bump.
### Emitted by the runtime today [#emitted-by-the-runtime-today]
| # | Kind | Emitted by |
| -- | --------------------------- | ------------------------------------------- |
| 1 | `RunStarted` | First event of every run. |
| 2 | `UserMessageAppended` | `Resume` injecting an extra message. |
| 3 | `TurnStarted` | `step.LLMCall` pre-call. |
| 4 | `ReasoningEmitted` | Provider reasoning blocks (optional). |
| 5 | `AssistantMessageCompleted` | `step.LLMCall` on `ChunkEnd`. |
| 6 | `ToolCallScheduled` | `step.CallTool` / `CallTools` pre-dispatch. |
| 7 | `ToolCallCompleted` | Tool returned successfully. |
| 8 | `ToolCallFailed` | Tool returned an error. |
| 9 | `SideEffectRecorded` | `step.Now`, `Random`, `SideEffect`. |
| 10 | `BudgetExceeded` | Budget axis tripped. |
| 12 | `RunCompleted` | Terminal: successful run. |
| 13 | `RunFailed` | Terminal: error path. |
| 14 | `RunCancelled` | Terminal: ctx cancelled. |
| 15 | `RunResumed` | Non-terminal seam from `Resume`. |
### Reserved (defined in schema, not emitted by core) [#reserved-defined-in-schema-not-emitted-by-core]
| # | Kind | Notes |
| -- | ------------------ | ----------------------------------------------------------------------------------------- |
| 11 | `ContextTruncated` | Reserved for context-window trim strategies. Validate accepts; core does not emit. |
| 16 | `TurnFailed` | Reserved for mid-turn streaming failure with retry. Validate accepts; core does not emit. |
## Payloads [#payloads]
Payloads are CBOR-encoded structs. The highlights:
### `RunStarted` (kind 1) [#runstarted-kind-1]
Pins the entire deterministic surface of the run.
```go
type RunStarted struct {
SchemaVersion uint32
Goal string
ProviderID string
ModelID string
APIVersion string
ParamsHash []byte
Params cborenc.RawMessage
SystemPromptHash []byte
SystemPrompt string
ToolRegistryHash []byte
ToolSchemas []ToolSchemaRef
Budget *BudgetLimits
StarlingVersion string // linked module version
AppVersion string // caller-supplied
}
```
### `TurnStarted` / `AssistantMessageCompleted` (kinds 3, 5) [#turnstarted--assistantmessagecompleted-kinds-3-5]
```go
type TurnStarted struct {
TurnID string
PromptHash []byte
InputTokens int64
}
type AssistantMessageCompleted struct {
TurnID string
Text string
ToolUses []PlannedToolUse
StopReason string
InputTokens int64
OutputTokens int64
CacheReadTokens int64
CacheCreateTokens int64
CostUSD float64
RawResponseHash []byte // BLAKE3 of canonicalized provider response
ProviderRequestID string
}
```
### `ReasoningEmitted` (kind 4) [#reasoningemitted-kind-4]
Optional. Anthropic-only fields are populated only when the provider
returned them; OpenAI reasoning summaries arrive without a signature.
```go
type ReasoningEmitted struct {
TurnID string
Content string
Sensitive bool
Signature []byte // Anthropic per-block integrity signature
Redacted bool // true when Content is the opaque redacted_thinking payload
}
```
### `ToolCallScheduled` / `Completed` / `Failed` (kinds 6, 7, 8) [#toolcallscheduled--completed--failed-kinds-6-7-8]
```go
type ToolCallScheduled struct {
CallID string
TurnID string
ToolName string
Args cborenc.RawMessage
Attempt uint32
IdempKey string
}
type ToolCallCompleted struct {
CallID string
Result cborenc.RawMessage
DurationMs int64
Attempt uint32
}
type ToolCallFailed struct {
CallID string
Error string
ErrorType string // "timeout" | "panic" | "tool" | "cancelled"
DurationMs int64
Attempt uint32
}
```
Retries share `CallID` with incrementing `Attempt`.
### `SideEffectRecorded` (kind 9) [#sideeffectrecorded-kind-9]
Captures any non-deterministic value the agent loop consumed. `step.Now`
emits one with `Name: "now"`; `step.Random` with `"rand"`; user calls
to `step.SideEffect` set their own name.
```go
type SideEffectRecorded struct {
Name string
Value cborenc.RawMessage
}
```
### `BudgetExceeded` (kind 10) [#budgetexceeded-kind-10]
```go
type BudgetExceeded struct {
Limit string // "input_tokens" | "output_tokens" | "usd" | "wall_clock"
Cap float64
Actual float64
Where string // "pre_call" | "mid_stream" | "post_call"
TurnID string
CallID string
PartialText string
PartialTokens int64
}
```
### `RunCompleted` / `Failed` / `Cancelled` (kinds 12, 13, 14) [#runcompleted--failed--cancelled-kinds-12-13-14]
All three terminals carry a `MerkleRoot []byte` over the BLAKE3 hashes
of every prior event. `RunCompleted` adds totals (turn count, tool-call
count, USD, tokens, duration). `RunFailed` adds an error string and
classification. `RunCancelled` adds a reason.
### `RunResumed` (kind 15) [#runresumed-kind-15]
```go
type RunResumed struct {
AtSeq uint64 // last seq from the prior process
ExtraMessage string
ReissueTools bool
PendingCalls int
}
```
A seam, not a terminal. Marks the boundary between a crashed run's
prior process and its resuming process. Validation pairing rules treat
it as a reset point: orphaned tool-call schedules from before the
seam don't need outcomes after it.
orphan"):::orphan
end
SCH --> X("crash"):::crash
subgraph p2["Process 2 · resume"]
direction LR
SR("RunResumed · seam"):::accent --> SCH2("ToolCallScheduled
fresh CallID")
SCH2 --> CC("ToolCallCompleted")
CC --> RC("RunCompleted")
end
X -.-> SR
classDef accent fill:#cffafe,stroke:#06b6d4,color:#0f172a,stroke-width:1.5px;
classDef crash fill:#fee2e2,stroke:#f43f5e,color:#0f172a,stroke-width:1.5px;
classDef orphan fill:#fef3c7,stroke:#f59e0b,color:#0f172a,stroke-dasharray:4 3;
`"
/>
## Invariants [#invariants]
`eventlog.Validate` enforces:
1. Slice non-empty, `events[0].Seq == 1`, monotonic seq with no gaps.
2. `RunID` consistent and non-empty across all events.
3. Hash chain unbroken: each `PrevHash` equals BLAKE3 of canonical
CBOR of the previous event.
4. Exactly one terminal event, and it's the last event.
5. First event is `RunStarted` with `SchemaVersion ∈ [1, current]`.
6. **Turn pairing.** Every `TurnStarted` is closed by a same-`TurnID`
`AssistantMessageCompleted` or `BudgetExceeded` before the next
`TurnStarted`. An open turn at the terminal is allowed only when
the terminal is `RunFailed` or `RunCancelled`.
7. **Call pairing.** Every `ToolCallScheduled` has exactly one matching
`ToolCallCompleted` or `ToolCallFailed` with the same `(CallID,
Attempt)`. Outcomes without a prior schedule, or duplicate outcomes
for the same key, are rejected. A `RunResumed` seam clears pending
pairing state.
8. Merkle root in the terminal payload matches the recomputed BLAKE3
Merkle root over every pre-terminal event.
Invalid log → `ErrLogCorrupt` with a diagnostic locating the violation.
## Worked example [#worked-example]
A run with one LLM turn, two parallel tool calls, then a final summary
turn:
```text
seq=1 RunStarted prev=nil
seq=2 TurnStarted prev=H(seq1) turn=T1
seq=3 AssistantMessageCompleted prev=H(seq2) turn=T1, tool_uses=[C1, C2]
seq=4 ToolCallScheduled prev=H(seq3) call=C1, attempt=1
seq=5 ToolCallScheduled prev=H(seq4) call=C2, attempt=1
seq=6 ToolCallCompleted prev=H(seq5) call=C2, attempt=1
seq=7 ToolCallCompleted prev=H(seq6) call=C1, attempt=1
seq=8 TurnStarted prev=H(seq7) turn=T2
seq=9 AssistantMessageCompleted prev=H(seq8) turn=T2, no tool_uses
seq=10 RunCompleted prev=H(seq9) merkle_root=M(seq1..seq9)
```
Parallel tool completions land in arrival order: replay reproduces
that ordering deterministically because results are read from the log,
not re-executed.
## Size expectations [#size-expectations]
Order of magnitude per event:
| Kind | Typical size |
| ----------------------------- | ------------ |
| `RunStarted` | 1–10 KB |
| `TurnStarted` | under 1 KB |
| `AssistantMessageCompleted` | 2–50 KB |
| `ToolCallScheduled/Completed` | 1–20 KB each |
| Terminals | under 1 KB |
A typical 5-turn run with 10 tool calls is \~100–500 KB. The retention
patterns are documented under [Operations](/docs/operations).
## Schema evolution [#schema-evolution]
Pre-1.0: any change permitted, schema-version bump on breaks.
After 1.0:
* Additive (new optional fields, new kinds) → minor bump.
* Breaking → major bump.
* Replayer refuses logs whose schema version is newer than its own major.
* Old logs remain replayable forever: pin the binary version, or use the
migration tools shipped at the major bump.
# Welcome (/docs)
A Go runtime for building LLM agents where every run is an append-only,
BLAKE3-chained, Merkle-rooted event log. When an agent fails in production,
inspect the log, replay it against the same agent wiring, and see the exact
step where today's behavior diverges.
## For AI assistants [#for-ai-assistants]
[**/llms-full.txt**](/llms-full.txt) packs the whole docs into one file.
Paste it into your assistant's context and it will know every type,
signature, and convention in the runtime.
## Get started [#get-started]
## Build [#build]
## Operate [#operate]
## Reference [#reference]
# Inspector (/docs/inspector)
`starling-inspect` opens a SQLite event log read-only and serves a
self-contained UI on loopback. No CDN, no JS build, never writes
back to the log.
```bash title="terminal"
go run github.com/jerkeyray/starling/cmd/starling-inspect runs.db
```
`--help` lists every flag. For non-local access, see
[Operations → Inspector auth and TLS](/docs/operations#inspector-auth-and-tls).
## Runs list [#runs-list]
Per-run totals up top, status tabs and preset chips for narrowing,
substring search on run id (`/` focuses). The runs list is paged on
the server: 50 rows by default, up to 200 with `?per_page=200`.
Filters and search terms are preserved when paging. Totals reflect the
currently visible page, while the pager shows the full matching count.
## Run detail [#run-detail]
Timeline on the left (color-coded by event family, inline
cost/token chips). Detail pane on the right with a sticky meta
header - hash, prev hash, call id are click-to-copy - and a
syntax-highlighted JSON body that wraps long strings by default.
Press `?` in the timeline header for the keyboard shortcuts.
## Diff [#diff]
`/diff` aligns two runs by sequence. Pick A and B from the latest 100
runs in the dropdowns, hit Compare. Older runs can still be compared by
opening `/diff?a={runIDA}&b={runIDB}` directly. Diverging rows get a red
left rail, a summary strip up top counts matches/diffs and points at
the first divergence.
## Replaying a run [#replaying-a-run]
A **Replay** button shows up on the run page when the inspector is
built into a dual-mode binary that wires
`starling.InspectCommand(factory)`. It opens a side-by-side
recorded-vs-reproduced timeline; divergence surfaces as a toast
plus a click-through dialog. The source log is never written to.
```go title="cmd/myagent/main.go"
if len(os.Args) > 1 && os.Args[1] == "inspect" {
return starling.InspectCommand(myFactory).Run(os.Args[2:])
}
```
## Embedding the inspector [#embedding-the-inspector]
If you want to mount the inspector in a larger HTTP server, build it
directly:
```go
import "github.com/jerkeyray/starling/inspect"
srv, err := inspect.New(log,
inspect.WithAuth(inspect.BearerAuth(os.Getenv("STARLING_INSPECT_TOKEN"))),
inspect.WithReplayer(myFactory), // optional: enables Replay button
inspect.WithDBPath("/var/lib/runs.db"), // optional: shows the DB basename in the topbar context chip
)
if err != nil { return err }
http.Handle("/runs/", http.StripPrefix("/runs", srv))
```
`WithDBPath` populates a small chip in the inspector topbar with
the database's basename (full path on hover) so operators can see
which log the UI is pointing at when several inspectors run side
by side. The standalone CLI passes it automatically.
- [Operations](/docs/operations): Deploy the inspector behind TLS, bearer auth, retention.
- [Replay tests](/docs/replay-tests): Drive the divergence machinery from Go tests.
# MCP server (inbound) (/docs/mcp-server)
This page is the inbound MCP **server** - AI assistants querying
your event log. For the outbound **client** that lets agents call
external MCP servers as tools, see [MCP tools](/docs/mcp-tools).
`starling-mcp` exposes a recorded event log to AI assistants over the
Model Context Protocol. Once installed, your AI client (Claude
Desktop, Cursor, Claude Code) can answer natural-language questions
about your agent's runs without you copy-pasting JSON.
The server is read-only by construction: the SQLite handle is opened
with `eventlog.WithReadOnly`, and only read tools are registered. It
runs as a stdio subprocess of the AI client; nothing opens a network
port.
## Install [#install]
```bash
go install github.com/jerkeyray/starling/cmd/starling-mcp@latest
```
Or build from source:
```bash
go build -o ~/bin/starling-mcp ./cmd/starling-mcp
```
The `starling` CLI also bundles it as a subcommand:
`starling mcp ` is identical.
## Wire into an MCP client [#wire-into-an-mcp-client]
**Claude Desktop** - `~/Library/Application Support/Claude/claude_desktop_config.json`
on macOS:
```json
{
"mcpServers": {
"starling": {
"command": "starling-mcp",
"args": ["/Users/me/.hearsay/saves/playbot.db"]
}
}
}
```
**Cursor** - `~/.cursor/mcp.json` (same shape).
**Claude Code** - add via `claude mcp add starling starling-mcp /path/to/runs.db`.
Restart the client. The server starts when the client first
connects, exits when the client disconnects, and re-reads the DB
file on every tool call so new events are visible immediately.
## Tools [#tools]
Seven read-only tools. All return JSON; arguments are typed.
| Tool | What it does |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `list_runs` | Enumerate recorded runs newest-first. Filters: `status`, `query` (substring match on run id + status), `since` (RFC3339), `with_tool_calls`. Page with `limit` (default 50, max 200) + `offset`. |
| `get_run` | Return the run summary plus every event with hashes and decoded payloads. Caps at 1000 events; use `offset` to page when truncated. |
| `get_event` | One event by `(run_id, seq)`, with hash, prev\_hash, kind, decoded payload. |
| `summarize_run` | Turn count, tool-call count, in/out tokens, USD cost, duration, terminal kind, final assistant text. |
| `validate_run` | Verify the BLAKE3 hash chain. Returns `{ok: true}` on a clean chain, otherwise `{ok: false, reason: "..."}`. |
| `diff_runs` | Align two runs by sequence and classify each row as `match`, `diff`, `only-a`, or `only-b`. Includes the first-divergence sequence number. |
| `search_runs` | Naive substring/kind scan across a bounded run page. Requires at least one of `query` or `kind`; cap hits with `limit` (default 50, max 500), scanned runs with `run_limit` (default 200, max 1000), and inspected events with `max_examined_events` (default 10000, max 50000). Returns `{run_id, seq, kind, summary}` per hit plus scan counters. |
## Sample interactions [#sample-interactions]
> **You:** What runs are in the database?
>
> **Claude:** *(calls `list_runs`)*
> Five runs, newest first. Three completed, one failed
> (`max_turns` at 14:32:07), one cancelled. Total cost $0.0089.
> **You:** Summarize the failed one.
>
> **Claude:** *(calls `summarize_run` with the failed run's id)*
> Run `01HZ...` failed after 5 turns and 2 tool calls,
> $0.0021, 1247 input / 38 output tokens, ran for 8.4 seconds.
> The final assistant text was empty - terminal kind is RunFailed
> with error\_type `max_turns`.
> **You:** Was there a tool that errored?
>
> **Claude:** *(calls `search_runs` with `kind=ToolCallFailed`)*
> Two `ToolCallFailed` events. Both in run `01HZ...`, both for the
> `fetch` tool with `error_type=tool` and message
> `upstream 503: service unavailable`. Same call id `c4` retried
> twice before final failure.
> **You:** Show me event 5 of that run.
>
> **Claude:** *(calls `get_event` with `seq=5`)*
> AssistantMessageCompleted at 14:32:05.882, payload says it
> planned to call `fetch` with `{url: "https://..."}`, stop\_reason
> `tool_use`, 421 input + 18 output tokens, hash
> `9038a81bea67…`.
The model picks which tool to call. You ask a normal question.
## Read-only by construction [#read-only-by-construction]
Three layers of defence:
1. **Storage layer.** `eventlog.NewSQLite(path,
eventlog.WithReadOnly())` - `Append` returns `ErrReadOnly`.
2. **Tool surface.** The server registers no write tools. There is
no `add_event`, no `prune`, no `migrate`. Even a malicious
client can't ask for what isn't there.
3. **Construction.** `mcpsrv.New` accepts an `eventlog.EventLog`,
not a path; the binary opens the read-only handle before
handing it off.
## Limits and gotchas [#limits-and-gotchas]
* **`search_runs` is naive.** It walks every event of every run
matching the run page; cost is `O(runs × events)`. It requires
either `query` or `kind`, pages runs with `run_limit` /
`run_offset`, caps hits with `limit`, and stops once
`max_examined_events` is reached. Responses include
`runs_examined`, `total_matching_runs`, `runs_capped`, and
`scan_capped`. Real indexes are a future pass.
* **`get_run` event cap.** A single run can have thousands of
events with long assistant text. Default cap is 200 events per
call; the response includes `truncated: true` and a
`total_events` count so the model knows to page. Built-in log
backends page events in storage; custom logs fall back to `Read`
plus in-memory slicing.
* **No streaming.** MCP tools are request/response. To watch a run
unfold turn-by-turn, use the [inspector](/docs/inspector) - its
SSE timeline is the right surface for that.
* **One DB per server.** Configure multiple `mcpServers` entries in
your client to query multiple databases.
## See also [#see-also]
- [MCP tools (outbound)](/docs/mcp-tools): The other half: agents calling external MCP servers as tools.
- [Inspector](/docs/inspector): Web UI for the same event log. SSE-streamed live timeline.
- [Events](/docs/events): The event Kinds whose payloads get_event decodes.
# MCP tools (outbound client) (/docs/mcp-tools)
This page is the outbound MCP **client** - agents calling external
MCP servers. For the inbound **server** that lets AI assistants
query your recorded event log, see [MCP server](/docs/mcp-server).
`tool/mcp` adapts [Model Context Protocol](https://modelcontextprotocol.io)
servers onto `tool.Tool`. The core runtime stays MCP-agnostic.
## Three transports [#three-transports]
```go
import (
"os/exec"
mcptool "github.com/jerkeyray/starling/tool/mcp"
)
// 1. stdio subprocess server.
client, err := mcptool.NewCommand(ctx,
exec.Command("uvx", "mcp-server-filesystem", "/tmp"),
mcptool.WithToolNamePrefix("fs_"),
)
// 2. streamable HTTP server.
client, err = mcptool.NewHTTP(ctx, "https://mcp.example.com/sse", nil)
// 3. any custom mcp.Transport.
client, err = mcptool.New(ctx, transport)
```
Each constructor lists the server's tools at connect time and caches them.
Call `client.Tools(ctx)` to retrieve `[]tool.Tool` for use with `Agent`,
or `client.RefreshTools(ctx)` to re-list.
## Wire into an agent [#wire-into-an-agent]
```go
client, err := mcptool.NewCommand(ctx,
exec.Command("uvx", "mcp-server-filesystem", "/tmp"),
mcptool.WithToolNamePrefix("fs_"),
mcptool.WithCallTimeout(10*time.Second),
)
if err != nil {
panic(err)
}
defer client.Close()
mcpTools, err := client.Tools(ctx)
if err != nil {
panic(err)
}
a := &starling.Agent{
Provider: prov,
Tools: append(localTools, mcpTools...),
Log: log,
Config: starling.Config{Model: "gpt-4o-mini", MaxTurns: 8},
}
```
## Options [#options]
| Option | Purpose |
| ---------------------------------- | ---------------------------------------------------------------- |
| `WithClientInfo(name, version)` | Override the client identity sent on `initialize`. |
| `WithToolNamePrefix(p)` | Namespace remote tools: useful when mounting multiple servers. |
| `WithIncludeTools(...)` | Restrict to the named remote tools. |
| `WithExcludeTools(...)` | Drop the named remote tools. |
| `WithCallTimeout(d)` | Per-call deadline. Zero leaves cancellation to the caller's ctx. |
| `WithMaxOutputBytes(n)` | Cap the JSON-encoded result. Defaults to 1 MiB. |
| `WithTextOnly(true)` | Reject non-text content rather than forwarding it. |
| `WithTransientErrorClassifier(fn)` | Classify transport errors as `tool.ErrTransient` for retries. |
## Replay safety [#replay-safety]
Each MCP tool call goes through `step.SideEffect` keyed on
`mcp/`. The first live invocation contacts the server,
records the result as a `SideEffectRecorded` event, and returns. Replay
reads the recorded value out of the log and never re-contacts the server.
The same applies under `(*Agent).Resume`. An orphaned MCP call from a
crashed run is reissued under a fresh `CallID`; the new live invocation is
recorded as its own SideEffect.
This means your recorded runs are portable. You can replay them on a
machine that has no network access to the MCP server.
## Tool errors vs transport errors [#tool-errors-vs-transport-errors]
The MCP protocol distinguishes two failure modes; Starling preserves both.
**Server returned `IsError: true`.** The remote tool ran and decided the
call failed (bad input, business-rule rejection). `Execute` returns a
typed `*mcptool.ToolError` carrying the server's content. Use `errors.As`
to inspect.
**Transport or protocol error.** Connection refused, timeout, malformed
JSON-RPC, etc. `Execute` returns the underlying error wrapped with the
tool name. If `WithTransientErrorClassifier` reports the error retryable,
it's also wrapped with `tool.ErrTransient` so the caller's
`step.ToolCall{Idempotent: true, MaxAttempts: N}` retries kick in.
## What's intentionally not adapted [#whats-intentionally-not-adapted]
`tool/mcp` ships with a deliberately narrow surface:
* **MCP resources**: file-tree-style reads; not yet wired.
* **MCP prompts**: server-provided prompt templates; deferred.
* **MCP sampling**: server asking the client to run a model; deferred.
Tools cover the common ADK-parity case. If you hit a wall on resources or
prompts, file an issue with the use case before building it locally.
# Operations (/docs/operations)
## Process model [#process-model]
Starling is a Go library, not a server. Two common shapes:
1. **Embedded.** The agent runs inside your existing Go service. The
event log is a file (SQLite) or connection (Postgres) the service
already manages. The inspector ships as a separate binary or a
subcommand of your binary, pointed at the same log read-only.
2. **Sidecar inspector.** Agent in service A, inspector in service B
with a read-only handle to the same log. Use this when operators
need debugging access without redeploying the primary service.
There is no required scheduler. `Agent.Run` is a blocking Go call.
## Picking a backend [#picking-a-backend]
| Backend | Use when | Avoid when |
| ---------------------- | ------------------------------------------------------------------------------- | --------------------------------------------------------- |
| `eventlog.NewInMemory` | Tests, ephemeral CLIs. | Anything you might want to replay later. |
| `eventlog.NewSQLite` | Single-host services, edge nodes. WAL + per-run `_txlock=immediate`. | Multi-host writers: SQLite has no cross-host locking. |
| `eventlog.NewPostgres` | Multi-host services, regulated workloads, anything needing PITR or replication. | Workloads where the DB is unavailable for long stretches. |
## Schema migrations [#schema-migrations]
Every `eventlog.NewSQLite` call auto-migrates on open. Postgres callers
run migrations explicitly:
```bash
# CLI (SQLite)
starling migrate /var/lib/starling/log.db
starling schema-version /var/lib/starling/log.db
```
```go
// In-process (any backend)
log, err := eventlog.NewPostgres(db)
if err != nil { return err }
if _, err := eventlog.Migrate(ctx, log); err != nil { return err }
```
`Agent.Run`, `Agent.Resume`, and the inspector run `eventlog.Preflight`
on startup and refuse to operate against a stale or too-new schema.
Disable with `Config.SkipSchemaCheck = true` only in tests.
## SQLite [#sqlite]
```go
log, err := eventlog.NewSQLite("/var/lib/starling/log.db")
```
* WAL mode is on (`PRAGMA journal_mode=WAL`); fsync on commit is
`synchronous=NORMAL`. Set `=FULL` if you need stronger guarantees and
can pay the latency.
* File permissions: `0600`, owned by the agent user.
* One process per file. Multiple processes can read concurrently
(`WithReadOnly`), but only one process should ever write: use
Postgres for multi-writer.
* Backup: `sqlite3 log.db ".backup /tmp/log-backup.db"` while the agent
is running. Restore by stopping the agent, swapping the file, and
restarting.
## Postgres [#postgres]
```go
db, _ := sql.Open("postgres", os.Getenv("DATABASE_URL"))
db.SetMaxOpenConns(8)
log, err := eventlog.NewPostgres(db, eventlog.WithAutoMigratePG())
```
* Postgres ≥ 11 (uses `hashtextextended` for advisory locks).
* Connection pool: size to expected concurrent runs plus headroom for
the inspector.
* Per-run advisory locks (`pg_advisory_xact_lock`) serialize appends
for a given `run_id`. Different runs are independent.
* Backup: standard `pg_dump --table=eventlog_events` for logical
exports; PITR via WAL archiving for hot recovery.
* Restore: `pg_restore` into an empty schema, then run `eventlog.Migrate`.
## Security [#security]
### Threat model [#threat-model]
| Actor | What they can do | What we defend |
| -------- | ------------------------------------------------ | ------------------------------------------------------------------------ |
| Operator | Runs the process, owns the DB and provider keys. | Trusted; the runtime assumes operator code is benign. |
| End user | Supplies goals and conversation messages. | Tool inputs, event payloads, replay determinism. |
| Provider | LLM API the agent talks to. | Stream chunk validation, raw-response hash via `RequireRawResponseHash`. |
### What the hash chain does and does not prove [#what-the-hash-chain-does-and-does-not-prove]
**Proves:** events were appended in a specific order; no event was
modified after append; replays match recorded byte-exact behavior.
**Does not prove:** that the operator wrote the truth into the log
(an operator can construct any valid run); that the provider returned
a specific response (`RawResponseHash` is a BLAKE3 digest computed
by the adapter over the SDK-level response, not a vendor signature);
that the agent ran on the claimed wall-clock time (`Timestamp` comes
from `step.Now`).
For cross-process attestation, sign the terminal `RunCompleted`
payload externally: the Merkle root is the natural signing target.
### Inspector auth and TLS [#inspector-auth-and-tls]
For the user-facing tour (UI, keyboard shortcuts, diff page,
theme), see [Inspector](/docs/inspector). This section covers the
deployment posture only.
Bearer auth via `inspect.WithAuth(inspect.BearerAuth(token))` or the
`STARLING_INSPECT_TOKEN` env var. CSRF protection is always on for
state-changing routes (replay POSTs); the inspector plants an
`X-CSRF-Token` cookie on safe responses.
The inspector's HTTP server is plain HTTP. **Always front it with a
TLS-terminating reverse proxy** for non-loopback access:
```nginx
server {
listen 443 ssl;
ssl_certificate /etc/ssl/starling.pem;
ssl_certificate_key /etc/ssl/starling.key;
location / {
proxy_pass http://127.0.0.1:8080;
proxy_http_version 1.1;
proxy_buffering off; # required for SSE
proxy_read_timeout 1h;
}
}
```
For mTLS, use `ssl_verify_client on` at the proxy. The inspector itself
does not consume client certs: the proxy decision is authoritative.
### Secrets [#secrets]
| Secret | Where | Notes |
| ------------------------ | ----------------------- | -------------------------------------------------------------------------------------------------------------------- |
| Provider API keys | `Agent.Provider` config | Pass via env, not source. Never log. |
| `STARLING_INSPECT_TOKEN` | Env var | Rotate on operator changes. |
| Postgres DSN | Env var | Use a role with minimum privileges (writer: `SELECT, INSERT`; retention job: `SELECT, DELETE`; inspector: `SELECT`). |
### Sensitive event payloads [#sensitive-event-payloads]
Payloads carry full conversations, reasoning, and tool I/O. Treat the
DB as PII: encrypt at rest, lock down OS access, never expose the
inspector outside a trusted audience.
No built-in field-level redaction. Redact at the tool boundary.
## Retention [#retention]
The log is append-only. Mutating any event breaks the hash chain for
every later event in the same run.
**Cannot do:** mutate events, delete a single event from a run, reuse
a `run_id`. The unit of deletion is the whole run.
**Can do:**
* Delete whole runs (`DELETE ... WHERE run_id IN (...)`).
* Archive runs to NDJSON via `starling export`, then delete.
* Use `starling prune` for dry-run-first rolling retention on SQLite
logs.
* Partition by time (Postgres `DROP PARTITION`, `pg_partman` or a cron
job rolling monthly partitions).
* Filter the inspector view by `RunSummary.StartedAt` instead of
deleting.
### Rolling-window deletion [#rolling-window-deletion]
```bash
starling prune --older-than 2160h /var/lib/starling/log.db
starling prune --older-than 2160h --confirm /var/lib/starling/log.db
```
`prune` deletes whole terminal runs only by default. Use `--status` to
target one status and `--limit` to break large cleanup jobs into
smaller batches. Run `VACUUM` during low traffic if you need SQLite to
return freed pages to the filesystem.
For Postgres, call `eventlog.RunPruner` in your maintenance job with a
role that has `SELECT` and `DELETE` on `eventlog_events`, or use time
partitions when you need instant archival drops. Keep inspector roles
read-only.
### PII deletion (right-to-erasure) [#pii-deletion-right-to-erasure]
Maintain an external `user_id → []run_id` index: Starling does not
keep one. On request, prune those whole runs and remove any archived
NDJSON. Don't try to selectively redact within a run; the chain depends
on every event.
## Metrics [#metrics]
`Agent.Metrics = starling.NewMetrics(reg)` registers a Prometheus
collection set on the supplied `prometheus.Registerer`. Highlights:
| Metric | Type | Labels |
| ------------------------------------------- | --------- | ------------------------------ |
| `starling_runs_started_total` | Counter | - |
| `starling_runs_in_flight` | Gauge | - |
| `starling_run_duration_seconds` | Histogram | `status` |
| `starling_run_terminal_total` | Counter | `status`, `error_type` |
| `starling_provider_calls_total` | Counter | `model`, `status` |
| `starling_provider_call_duration_seconds` | Histogram | `model` |
| `starling_provider_tokens_total` | Counter | `model`, `type` |
| `starling_tool_calls_total` | Counter | `tool`, `status`, `error_type` |
| `starling_tool_call_duration_seconds` | Histogram | `tool` |
| `starling_eventlog_appends_total` | Counter | `kind`, `status` |
| `starling_eventlog_append_duration_seconds` | Histogram | `kind` |
| `starling_budget_exceeded_total` | Counter | `axis` |
Wire the `/metrics` handler into your existing HTTP mux:
```go
metrics := starling.NewMetrics(prometheus.DefaultRegisterer)
http.Handle("/metrics", starling.MetricsHandler(prometheus.DefaultGatherer))
```
For direct `Append` callers outside `step.emit`, wrap the log with
`eventlog.WithMetrics(log, observer)` so latency histograms cover
that path too.
## Tracing [#tracing]
OpenTelemetry spans are emitted under the `starling` instrumentation
name. Wire any OTLP exporter:
```go
exp, _ := otlptracegrpc.New(ctx)
provider := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
otel.SetTracerProvider(provider)
```
Expected span tree per run:
```text
agent.run
└── agent.turn × N
├── provider.stream
└── step.tool × M
```
## Failure recovery [#failure-recovery]
A crashed run leaves an open hash chain. Restart the same `runID` with
`Agent.Resume(ctx, runID, "")`:
* If the crash happened mid-turn before `AssistantMessageCompleted`,
Resume reissues pending tool calls under fresh `CallID`s.
* If the assistant turn completed but tools were pending, same path.
* Pass `WithReissueTools(false)` to refuse and inspect manually.
Resume goes through the same `eventlog.Preflight` check as `Run`, so a
stale schema fails fast.
## Operational checklist [#operational-checklist]
* [ ] Backups verified by restoring into a staging DB monthly.
* [ ] Inspector behind TLS + bearer token.
* [ ] Provider API keys in env, not source.
* [ ] DB file/connection user has minimum privileges.
* [ ] `eventlog.Migrate` in your release script (Postgres) or trusted
to run on `NewSQLite`.
* [ ] Metrics scraped; dashboards alert on
`starling_eventlog_append_duration_seconds` p99 and
`starling_provider_call_duration_seconds` p99.
* [ ] Retention policy implemented (rolling window, archive-to-NDJSON,
or partitioning).
* [ ] Security review for tool-side network/filesystem access.
# Quickstart (/docs/quickstart)
Build a single-tool agent, persist it to SQLite, verify replay.
## Install [#install]
```bash title="terminal"
go get github.com/jerkeyray/starling
```
Go 1.26+. SQLite is pure-Go via `modernc.org/sqlite`; no CGo.
## Hello agent [#hello-agent]
```go title="main.go"
package main
import (
"context"
"fmt"
"os"
"time"
starling "github.com/jerkeyray/starling"
"github.com/jerkeyray/starling/eventlog"
"github.com/jerkeyray/starling/provider/openai"
"github.com/jerkeyray/starling/step"
"github.com/jerkeyray/starling/tool"
)
type clockOut struct {
UTC string `json:"utc"`
}
func main() {
prov, err := openai.New(openai.WithAPIKey(os.Getenv("OPENAI_API_KEY")))
if err != nil {
panic(err)
}
// step.Now records the timestamp on the live run and returns the
// recorded value on replay, so the tool stays deterministic.
clock := tool.Typed(
"current_time",
"Return the current UTC time in RFC3339.",
func(ctx context.Context, _ struct{}) (clockOut, error) {
return clockOut{UTC: step.Now(ctx).UTC().Format(time.RFC3339)}, nil
},
)
a := &starling.Agent{
Provider: prov,
Tools: []tool.Tool{clock},
Log: eventlog.NewInMemory(),
Config: starling.Config{Model: "gpt-4o-mini", MaxTurns: 4},
}
res, err := a.Run(context.Background(), "What is the current UTC time?")
if err != nil {
panic(err)
}
fmt.Println(res.FinalText)
}
```
`tool.Typed` derives the JSON schema from the input type. The agent
records `ToolCallScheduled` + `ToolCallCompleted` around every dispatch.
## Persist the run [#persist-the-run]
Swap the in-memory log for SQLite to keep events on disk:
```go title="main.go"
log, err := eventlog.NewSQLite("starling.db")
if err != nil {
panic(err)
}
defer log.Close()
a := &starling.Agent{
Provider: prov,
Tools: []tool.Tool{clock},
Log: log,
Config: starling.Config{Model: "gpt-4o-mini", MaxTurns: 4},
}
```
`NewSQLite` opens in WAL mode and auto-migrates. One writer, many readers.
## Replay the run [#replay-the-run]
Once a run is persisted you can re-execute it against the same agent and
verify every emitted event matches the recording byte-for-byte:
```go title="main.go"
import "errors"
if err := starling.Replay(ctx, log, runID, a); err != nil {
if errors.Is(err, starling.ErrNonDeterminism) {
// A tool output, prompt, or model changed since the original run.
// err carries a *replay.Divergence with seq + class fields.
}
panic(err)
}
```
Replay reads recorded LLM and tool outputs from the log and never re-invokes
the provider. If your code path hits `step.Now`, `step.Random`, or
`step.SideEffect`, the recorded values are returned instead. Anything else
is non-deterministic and should be moved behind a `step` helper.
## Inspect a run [#inspect-a-run]
```bash title="terminal"
go run github.com/jerkeyray/starling/cmd/starling-inspect starling.db
```
Embedded read-only web UI: runs list, timeline, payload detail, replay
controls, divergence rendering. Dark by default, click-to-pick diff
page, syntax-highlighted JSON. Full tour in
[Inspector](/docs/inspector).
## Where to next [#where-to-next]
- [Concepts](/docs/concepts): The determinism contract, side effects, and the event log as source of truth.
- [MCP tools (outbound)](/docs/mcp-tools): Add remote MCP server tools as ordinary Starling tools.
- [MCP server (inbound)](/docs/mcp-server): Query your event log from Claude Desktop, Cursor, or Claude Code.
# Reference (/docs/reference)
Per-package types, signatures, and short examples.
## starling (root) [#starling-root]
The agent loop, run lifecycle, and replay surface.
### Agent [#agent]
```go
type Agent struct {
Provider provider.Provider
Tools []tool.Tool
Log eventlog.EventLog
Config Config
Budget *Budget
Metrics *Metrics
Namespace string // optional run-id prefix
}
```
`Agent` holds no per-run state. Two instances pointing at the same log
are interchangeable.
### Config [#config]
| Field | Default | Notes |
| ------------------------ | ---------------- | ------------------------------------------------------------------------- |
| `Model` | required | Provider-specific model id, e.g. `"gpt-4o-mini"`. |
| `MaxTurns` | `0` = ∞ | Caps the ReAct loop. `0` is allowed but not recommended. |
| `SystemPrompt` | `""` | Prepended to every conversation. Captured into `RunStarted`. |
| `Params` | `nil` | Provider-specific param blob (CBOR). Hashed into `RunStarted.ParamsHash`. |
| `RequireRawResponseHash` | `false` | Fail any turn whose `ChunkEnd` lacks a 32-byte raw-response digest. |
| `AppVersion` | `""` | Stamped into `RunStarted` alongside the Starling library version. |
| `EmitTimeout` | `0` = ∞ | Bounds each event-log Append under `context.WithoutCancel`. |
| `SkipSchemaCheck` | `false` | Disables `eventlog.Preflight` on `Run` / `Resume`. Tests only. |
| `Logger` | `slog.Default()` | Structured slog records for run lifecycle. |
### Run / Resume / Replay [#run--resume--replay]
* **`Run(ctx, goal) (*RunResult, error)`** — live entry. Mints a fresh
run id (namespaced when `Namespace != ""`), emits `RunStarted`, runs
the loop, returns the terminal `*RunResult`.
* **`Resume(ctx, runID, extraMessage) (*RunResult, error)`** — re-enters
a run from its last seq. Pending tool calls reissue under fresh
`CallID`s; the orphan stays for audit. `ResumeWith(...opts)` adds
`WithReissueTools(false)` for manual recovery.
* **`Replay(ctx, log, runID, agent, opts...) error`** — re-executes
against the same wiring. Returns `nil` on a clean replay, wraps a
`*replay.Divergence` with `ErrNonDeterminism` on the first
mismatch, or `ErrProviderModelMismatch` when the agent's
`Provider.ID` / `APIVersion` / `Config.Model` disagree with the
recording. `WithForceProvider()` disables the identity check.
* **`RunStream(ctx, goal) (string, <-chan AgentEvent, error)`** —
typed event stream layered over `Stream`. Variants: `TextDelta`,
`ToolCallStarted`, `ToolCallEnded`, `Done`. Channel closes after
a single `Done`.
`RunResult` carries `RunID`, `FinalText`, totals (`TurnCount`,
`ToolCallCount`, `TotalCostUSD`, `InputTokens`, `OutputTokens`,
`Duration`), `TerminalKind`, `MerkleRoot`, and `CacheStats`
(`Hits`, `Misses`, `ReadTokens`, `CreateTokens`). All recoverable
from the log; the struct is a convenience.
### Sentinel errors [#sentinel-errors]
| Error | Meaning |
| -------------------------- | ----------------------------------------------------------------------------------- |
| `ErrNonDeterminism` | Replay diverged from the recording. Wraps `*replay.Divergence`. |
| `ErrPartialToolCall` | `Resume` saw pending tool calls and `WithReissueTools(false)` was set. |
| `ErrRunNotFound` | Resume target run id is absent from the log. |
| `ErrRunAlreadyTerminal` | Resume target ended with a terminal event. |
| `ErrRunInUse` | Another writer already advanced the chain. |
| `ErrSchemaVersionMismatch` | The recording's schema version is unsupported by this binary. |
| `ErrProviderModelMismatch` | Replay agent's Provider.ID / APIVersion / Config.Model disagrees with `RunStarted`. |
## budget [#budget]
`Budget` has four axes; zero on any field disables it. A trip emits
`BudgetExceeded{Limit, Cap, Actual, Where}` and unwinds with
`RunFailed{ErrorType:"budget"}`.
| Axis (field) | Type | When |
| ----------------- | --------------- | ---------------------------------------------------- |
| `MaxInputTokens` | `int64` | Pre-call, before every `step.LLMCall`. |
| `MaxOutputTokens` | `int64` | Mid-stream on every `ChunkUsage`. |
| `MaxUSD` | `float64` | Mid-stream using `budget/prices.go` per-model rates. |
| `MaxWallClock` | `time.Duration` | `context.WithDeadline` wrapping the run. |
`budget.RegisterPricing(model, inPerMtok, outPerMtok)` registers
or overrides per-model USD pricing at runtime; resets the
unknown-model warn-once memo so a stale warning doesn't outlive
the call. Built-in rates ship for major-vendor models in
`budget/prices.go`.
## event [#event]
The wire format. Every event carries:
```go
type Event struct {
RunID string
Seq uint64
PrevHash []byte // BLAKE3 of canonical CBOR of prev event
Timestamp int64 // Unix nanoseconds
Kind Kind
Payload cborenc.RawMessage // kind-specific struct, CBOR-encoded
}
```
The full schema with payload definitions, the kinds the runtime emits,
the reserved kinds, and the invariants live on the [Events](/docs/events) page.
Encoding helpers: `Marshal`, `Unmarshal`, `Hash`, `ToJSON`.
`event.HashSize` is 32. Each typed payload has an `EncodePayload[T]`
helper; each kind has a matching accessor (`AsRunStarted`,
`AsToolCallCompleted`, …).
## eventlog [#eventlog]
```go
type EventLog interface {
Append(ctx, runID, ev) error
Read(ctx, runID) ([]Event, error)
Stream(ctx, runID) (<-chan Event, error)
Close() error
}
```
`RunLister` adds `ListRuns(ctx) ([]RunSummary, error)`. `RunPageLister`
adds `ListRunsPage(ctx, opts) (RunPage, error)` for filtered,
server-side pagination. `RunPruner` adds explicit whole-run retention
cleanup with `PruneRuns(ctx, opts) (PruneReport, error)`. All three
built-in backends implement these optional interfaces.
`RunSummary` carries per-run aggregates (`TurnCount`,
`ToolCallCount`, `InputTokens`, `OutputTokens`, `CostUSD`,
`DurationMs`) so dashboards don't have to re-aggregate event streams.
Helpers: `eventlog.AggregateRun(events)` returns the same totals
over a chained event slice (single source of truth for the
inspector and the MCP server). `eventlog.ForkSQLite(ctx, src,
dst, runID, beforeSeq)` is a WAL-safe SQLite branch via
`VACUUM INTO`, truncating one run's events at a sequence
boundary. The BLAKE3 chain helpers used by `Agent.Run` are public
at [`github.com/jerkeyray/starling/merkle`](https://pkg.go.dev/github.com/jerkeyray/starling/merkle).
### Backends [#backends]
| Constructor | Use when |
| -------------------------- | ---------------------------------------------------------------- |
| `NewInMemory()` | Tests, demos, ephemeral CLI tools. |
| `NewSQLite(path, opts...)` | Single-host services, edge nodes. |
| `NewPostgres(db, opts...)` | Multi-host services. Per-run advisory locks serialize appenders. |
Options: `WithReadOnly()` / `WithReadOnlyPG()` for inspector mode,
`WithAutoMigratePG()` to run migrations on connect.
### Validation, migrations, preflight [#validation-migrations-preflight]
* **`Validate(events)`** — seq monotonicity, hash chain, terminal
placement, Merkle root, and the semantic pairing rules from the
[Event schema](/docs/events).
* **`SchemaVersion(ctx, log)` / `Migrate(ctx, log, opts...)`** —
forward-only migration API. `Migrate` returns a `MigrationReport`.
* **`Preflight(ctx, log)`** — fails fast with `ErrSchemaOutdated` or
`ErrSchemaTooNew`. `Agent.Run`, `Agent.Resume`, and the inspector
all call it unless `Config.SkipSchemaCheck = true`.
* **`WithMetrics(log, obs)`** — wraps any `EventLog` so direct
`Append` callers see the same latency histograms `step.emit` records.
Sentinel errors: `ErrLogClosed`, `ErrLogCorrupt`, `ErrInvalidAppend`,
`ErrReadOnly`, `ErrSchemaOutdated`, `ErrSchemaTooNew`.
## step [#step]
The determinism layer. Anything non-deterministic in the agent loop must
go through `step` so replay can reproduce it byte-for-byte.
### Helpers [#helpers]
```go
func Now(ctx context.Context) time.Time
func Random(ctx context.Context) int64
func SideEffect[T any](ctx context.Context, name string, fn func() (T, error)) (T, error)
```
Live mode runs `fn` and emits a `SideEffectRecorded` event. Replay reads
the recorded value back without invoking `fn`. `T` must be CBOR-serializable.
### LLM calls [#llm-calls]
`LLMCall(ctx, req)` drives a streaming completion through the configured
provider. Emits `TurnStarted`, optional `ReasoningEmitted`, and
`AssistantMessageCompleted`. Enforces input/output/USD budgets inline.
Validates the chunk state machine (no EOF before `ChunkEnd`, no
duplicate `ChunkToolUseStart`, no chunks after `ChunkEnd`).
### Tool dispatch [#tool-dispatch]
```go
type ToolCall struct {
CallID, TurnID, Name string
Args json.RawMessage
Idempotent bool
MaxAttempts int
Backoff func(attempt int) time.Duration
}
```
* **`CallTool(ctx, c)`** — sequential dispatch.
* **`CallTools(ctx, calls)`** — fan-out with a semaphore (cap is
`step.DefaultMaxParallelTools`, 8).
* Retries kick in on `tool.ErrTransient` when `Idempotent` and
`MaxAttempts > 1`. `NewCallID()` mints fresh IDs.
### Replay errors [#replay-errors]
`MismatchError` carries `Seq`, `Kind`, `ExpectedKind`, `Class`
(`"exhausted" | "kind" | "payload" | "turn_id"`), and `Reason`. It
satisfies `errors.Is(ErrReplayMismatch)`. Use `errors.As` for the
structured fields. Other sentinels: `ErrInvalidStream`,
`ErrMissingRawResponseHash`. The replay package lifts these into
`replay.Divergence` (next section).
## tool [#tool]
```go
type Tool interface {
Name() string
Description() string
Schema() json.RawMessage // JSON Schema for input
Execute(ctx, in) (json.RawMessage, error)
}
```
`tool.Typed[In, Out](name, description, fn)` derives the JSON Schema
from `In` via reflection. Errors wrapping `tool.ErrTransient` opt the
call into retry under `step.ToolCall{Idempotent: true, MaxAttempts: N}`.
`tool.Wrap(t Tool, mw ...Middleware) Tool` composes middleware
around `Execute` while passing `Name`, `Description`, and `Schema`
through unchanged. Last middleware passed runs first
(net/http.Handler ordering); short-circuiting middleware can skip
the inner call entirely. Useful for logging, timing, span
injection, request authentication, output redaction.
### Test scaffolding (`starlingtest/`) [#test-scaffolding-starlingtest]
`ScriptedProvider` is a deterministic `provider.Provider` driven
by a slice of canned chunks per turn. Helpers `NewStream`,
`AppendRunStarted`, `AssertReplayMatches`, and
`AssertReplayDiverges` cover the common test shapes without
contacting an LLM.
### MCP adapter (`tool/mcp`) [#mcp-adapter-toolmcp]
Three constructors mount remote MCP tools as ordinary Starling tools:
* **`New(ctx, transport, opts...)`** — any `mcp.Transport`.
* **`NewCommand(ctx, exec.Cmd, opts...)`** — stdio subprocess.
* **`NewHTTP(ctx, endpoint, client, opts...)`** — streamable HTTP.
Each connects, lists remote tools, and exposes them via
`client.Tools(ctx)`. Calls route through `step.SideEffect` so replay
never re-contacts the server. Full options table on the
[MCP tools](/docs/mcp-tools) page. The inbound counterpart - a
read-only MCP **server** that exposes a recorded log to AI clients -
lives at [MCP server](/docs/mcp-server).
### HTTP daemon (`starlingd`) [#http-daemon-starlingd]
`starlingd.Command(factory)` builds a CLI entrypoint for serving your
own agent over HTTP. `starlingd.New(config)` returns an `http.Handler`
for apps that already own server setup. The daemon exposes async run
creation, bounded in-process queueing, SSE streams, read APIs,
Prometheus metrics, bearer auth, and an optional inspector mount. Full
reference lives at [HTTP daemon](/docs/starlingd).
### Built-in tools [#built-in-tools]
`tool/builtin/` ships `Fetch()` (public `http`/`https` only, 15s
timeout, 1 MiB cap, local/private-address and unsafe redirect
rejection) and `ReadFile(baseDir)` (path-escape rejection). Use
directly or as templates.
## provider [#provider]
The streaming-completion abstraction.
```go
type Provider interface {
Info() Info
Stream(ctx, req) (EventStream, error)
}
```
Optional `Capabler` exposes `Capabilities()` so the conformance suite
can skip what the adapter doesn't support. A `Request` carries `Model`,
`SystemPrompt`, `Messages`, `Tools`, `ToolChoice` (`""` | `"auto"` |
`"any"` | `"none"` | tool name), `StopSequences`, `TopK`,
`MaxOutputTokens`, and a vendor-specific `Params` blob.
`EventStream` yields `StreamChunk` values: `ChunkText`,
`ChunkReasoning`, `ChunkRedactedThinking`,
`ChunkToolUseStart/Delta/End`, `ChunkUsage`, `ChunkEnd`. The state
machine is enforced by `step.LLMCall`.
### Adapters [#adapters]
| Package | Use when |
| ---------------------- | ------------------------------------------------------------------------------------------------------------ |
| `provider/openai` | OpenAI, Groq, Together, Ollama, vLLM, LM Studio, Azure, anything else OpenAI-compatible (set `WithBaseURL`). |
| `provider/anthropic` | Messages API. Tool use, extended thinking with signature, prompt caching. |
| `provider/gemini` | Native Google Gemini. |
| `provider/bedrock` | Amazon Bedrock via native `ConverseStream` (AWS SDK v2). |
| `provider/openrouter` | OpenRouter: thin wrapper over the OpenAI adapter with attribution headers. |
| `provider/conformance` | The contract test every adapter passes. |
Each adapter advertises its support set via
`provider.Capabler.Capabilities()`. The conformance suite skips
capability-gated assertions when the adapter reports `false`.
### Error classification [#error-classification]
Adapters wrap underlying SDK / HTTP errors with one of four
sentinels for retry policy via `errors.Is`:
| Sentinel | When |
| ----------------------- | -------------------------------- |
| `provider.ErrRateLimit` | 429 / quota |
| `provider.ErrAuth` | 401 / 403 |
| `provider.ErrServer` | 5xx |
| `provider.ErrNetwork` | DNS / dial / TLS / broken stream |
Helpers: `provider.WrapHTTPStatus(err, status)` annotates by HTTP
status (delegates to `ClassifyTransport` when `status == 0`);
`provider.ClassifyTransport(err)` wraps `net.Error` and
`*url.Error` with `ErrNetwork`. 4xx errors that are neither auth
nor rate-limit pass through unwrapped on purpose - they reflect
caller bugs, not transient conditions.
## replay [#replay]
* **`Verify(ctx, log, runID, agent)`** — headless check. Returns `nil`
on a clean replay or wraps `*Divergence` with `ErrNonDeterminism` on
the first mismatch. `starling.Replay` is a thin wrapper that takes
`*Agent` directly.
* **`Stream(ctx, factory, log, runID)`** — inspector path. Yields a
`ReplayStep` per emitted event so the UI can render recorded vs
produced side-by-side. The final step has `Diverged: true` when the
replay didn't reach the recorded terminal.
`Divergence` carries `RunID`, `Seq`, `Kind`, `ExpectedKind`, `Class`,
`Reason`. `Factory` is `func(ctx) (Agent, error)`.
## inspect [#inspect]
Embedded HTTP handler. Serves the runs list, per-run timeline, event
detail, live tail (SSE), and replay. The standalone binary at
`cmd/starling-inspect` opens any SQLite log read-only.
`inspect.New(log, opts...) (*Server, error)`. Options:
* **`WithAuth(authenticator)`** — protect every endpoint.
* **`BearerAuth(token)`** — convenience `Authenticator`.
* **`WithReplayer(factory)`** — enable replay re-execution.
* **`WithDBPath(path)`** — show the DB basename in the topbar
context chip (full path on hover).
Read-only by construction, CSRF-protected on the replay POST endpoints.
Front it with TLS in production: see [Operations](/docs/operations).
## CLI (`cmd/starling`) [#cli-cmdstarling]
```bash
starling validate [] # hash chain + Merkle check
starling export # NDJSON event dump
starling prune [flags] # dry-run-first retention deletion
starling inspect [flags] # local web inspector (read-only)
starling mcp # read-only MCP server over stdio for AI clients
starling replay # headless replay (dual-mode binary only)
starling migrate [-dry-run] # apply pending schema migrations
starling schema-version # print the current schema version
starling doctor # quick health check: version, env vars, schema, validation
starling version # print the binary's Starling version (also -v / --version)
```
The stock binary is SQLite-only. Building a dual-mode binary that
links your agent factory enables `starling replay` and
`starling inspect` with replay re-execution.
## Examples [#examples]
| Path | What it shows |
| -------------------------- | -------------------------------------------------------------------------------------- |
| `examples/m1_hello` | Minimal hello agent, dual-mode inspector, OTel stdout exporter. |
| `examples/incident_triage` | Multi-tool agent, budgets, Resume, replay regression test, Postgres, Prometheus, OTel. |
# Replay-driven tests (/docs/replay-tests)
The replay model isn't just for production debugging. Treat a recorded
run as a test fixture, and `starling.Replay` becomes a regression test
that fails the moment your agent's logic shifts: refactored a tool,
changed a prompt, swapped a model, upgraded a dependency.
## The shape [#the-shape]
1. Record a real run once, against a real provider.
2. Commit the resulting event log as a test fixture.
3. In CI, replay the fixture against the same agent wiring (no provider
network call). Replay returns `nil` on byte-identical re-execution
and a typed `*replay.Divergence` otherwise.
That's it. No mocks, no recorded HTTP fixtures, no snapshot dance.
## Capture a fixture [#capture-a-fixture]
Run the agent once with a SQLite log under your test data dir:
```go title="capture/main.go"
package main
import (
"context"
"fmt"
"os"
starling "github.com/jerkeyray/starling"
"github.com/jerkeyray/starling/eventlog"
"github.com/jerkeyray/starling/provider/openai"
"github.com/jerkeyray/starling/tool"
)
func main() {
log, err := eventlog.NewSQLite("testdata/golden.db")
if err != nil { panic(err) }
defer log.Close()
prov, err := openai.New(openai.WithAPIKey(os.Getenv("OPENAI_API_KEY")))
if err != nil { panic(err) }
a := newAgent(prov, log) // your Agent constructor
res, err := a.Run(context.Background(), "What is the current UTC time?")
if err != nil { panic(err) }
fmt.Println("captured", res.RunID)
}
```
Run it once, commit `testdata/golden.db` and the captured `runID`. From
this point on you never touch the network for this test.
## Replay in CI [#replay-in-ci]
```go title="agent_test.go"
package agent_test
import (
"context"
"errors"
"testing"
starling "github.com/jerkeyray/starling"
"github.com/jerkeyray/starling/eventlog"
"github.com/jerkeyray/starling/replay"
)
const goldenRunID = "01HZ8PQ5...XKJ3"
func TestAgentMatchesRecording(t *testing.T) {
log, err := eventlog.NewSQLite("testdata/golden.db", eventlog.WithReadOnly())
if err != nil { t.Fatal(err) }
t.Cleanup(func() { _ = log.Close() })
a := newAgent(stubProvider(), log) // any non-nil provider; Replay swaps it
err = starling.Replay(context.Background(), log, goldenRunID, a)
if err == nil {
return // byte-identical
}
var d *replay.Divergence
if errors.As(err, &d) {
t.Fatalf("agent diverged from recording at seq=%d kind=%s class=%s reason=%s",
d.Seq, d.Kind, d.Class, d.Reason)
}
t.Fatal(err)
}
```
`starling.Replay` shallow-clones the Agent and overrides `Provider`
with a synthetic replay provider that yields chunks from the recording.
Your real provider is never contacted. The original `Agent.Provider`
must still be non-nil because `validate()` runs before the swap; any
stub will do.
Tools, on the other hand, **re-execute live**. The tool's `Execute`
method runs and its output bytes are compared to the recorded
`ToolCallCompleted` payload. A deterministic tool replays cleanly. A
tool that reads `time.Now()` directly produces a new timestamp on
replay and diverges. Wrap non-deterministic reads in `step.Now`,
`step.Random`, or `step.SideEffect` so replay returns the recorded
value.
## What "byte-identical" means [#what-byte-identical-means]
Replay compares each event the live loop emits to the recording at the
same `seq`. A divergence falls into one of four classes:
| `Class` | What it means |
| ----------- | --------------------------------------------------------- |
| `kind` | The loop produced a different event type at this `seq`. |
| `payload` | Same kind, different bytes. |
| `turn_id` | A turn started under a different `TurnID` than recorded. |
| `exhausted` | The loop ran past the end of the recording (extra event). |
The first divergence is reported. The recording is the source of truth.
## Make tools replay-safe before you record [#make-tools-replay-safe-before-you-record]
Recording captures the values the loop consumed at `step` boundaries.
Anything outside `step` will diverge on every replay because it isn't
recorded. Common culprits:
* Reading `time.Now()` directly inside a tool. Use `step.Now(ctx)`.
* Calling `rand.Intn` inside a tool or middleware. Use `step.Random(ctx)`.
* Hitting an HTTP endpoint inside a tool without `step.SideEffect`.
* Using `os.Getenv` mid-run. Read once at construction or wrap in a
`step.SideEffect` keyed on the variable name.
If your test passes once and fails on the next run with no code change,
something non-deterministic leaked past `step`.
## What divergence catches [#what-divergence-catches]
Real things that flip the test from green to red:
* A model upgrade that changes tool-plan order.
* A prompt edit that changes `RunStarted.SystemPromptHash`.
* A tool that started returning slightly different JSON (whitespace,
field reordering, added field).
* A schema migration that changed the canonical CBOR shape of a
payload.
* A library upgrade that changed RNG seeding or a cost-table value.
Each is a real signal that today's build does something different from
the day you recorded the fixture.
## Provider / model mismatch [#provider--model-mismatch]
Before any turn replays, `Replay` cross-checks the agent's current
`Provider.ID` / `APIVersion` / `Config.Model` against the values
recorded in the log's `RunStarted` event. If any of the three
disagree, replay fails fast with `starling.ErrProviderModelMismatch`
* typically the "I edited the agent factory and forgot the
fixture is older" mistake, surfaced before the test produces a
confusing turn-1 byte diff.
```go
err := starling.Replay(ctx, log, runID, agent)
if errors.Is(err, starling.ErrProviderModelMismatch) {
t.Fatalf("agent wiring drifted from fixture: %v", err)
}
```
Override when the divergence is intentional (e.g. you re-recorded
the fixture and want the same test file to replay both shapes):
```go
err := starling.Replay(ctx, log, runID, agent, starling.WithForceProvider())
```
The CLI equivalent is `starling replay --force `. With
`--force`, all other replay invariants still apply - chunks, tool
output bytes, step-name lookups - so the only thing relaxed is the
provider/model identity check itself.
## Multiple fixtures [#multiple-fixtures]
Capture more than one run, one per behavior you care about: happy
path, tool-error path, budget trip, multi-turn refinement. Each gets a
test. The fixtures live under `testdata/`; the test file picks a
`runID` per case.
For sprawling fixtures, the `starling export` CLI dumps a run to NDJSON
so you can review what's recorded and trim noise before committing.
## CI shape [#ci-shape]
A typical CI job looks like:
```yaml title=".github/workflows/test.yml"
- name: Replay golden runs
run: go test ./... -run TestAgentMatches -count=1
```
No `OPENAI_API_KEY`, no Anthropic key, no network. The fixture is the
contract. If your test suite fails on a PR that should be a no-op,
replay points at the exact `seq` where today's behavior diverged from
the day the fixture was recorded.
## Limits [#limits]
Replay verifies *behavior under the same wiring*. It does not catch:
* Bugs in code paths your fixture didn't exercise. Coverage still
matters; pick fixtures deliberately.
* Provider regressions on the *real* network. That's still on you to
monitor with the metrics from [Operations](/docs/operations).
* Issues that only manifest under concurrency or scale. Replay is
single-run.
What it does catch is the class of change most agent codebases miss
entirely: *"my code looks the same and the model still works, but
the agent quietly does something different now."*
## Where to next [#where-to-next]
- [Concepts](/docs/concepts): The determinism contract behind why this works.
- [Reference · replay](/docs/reference#replay): Verify, Stream, Divergence, ErrNonDeterminism.
# HTTP daemon (/docs/starlingd)
`package starlingd` turns your own `*starling.Agent` wiring into a
small private HTTP service. Use it when another app needs to enqueue
runs, stream progress, read recorded events, scrape Prometheus
metrics, and open the inspector next to the API.
`starlingd` is intentionally not a distributed job system. The current
queue is in-process FIFO. Use one daemon process per queue, and put
durable orchestration above it if you need cross-process failover.
## Minimal binary [#minimal-binary]
```go
package main
import (
"context"
"os"
starling "github.com/jerkeyray/starling"
"github.com/jerkeyray/starling/provider/openai"
"github.com/jerkeyray/starling/starlingd"
)
func buildAgent(ctx context.Context) (*starling.Agent, error) {
prov, err := openai.New(openai.WithAPIKey(os.Getenv("OPENAI_API_KEY")))
if err != nil {
return nil, err
}
return &starling.Agent{
Provider: prov,
Config: starling.Config{Model: "gpt-4o-mini", MaxTurns: 4},
// starlingd overwrites Log and Metrics for every run.
}, nil
}
func main() {
if len(os.Args) > 1 && os.Args[1] == "serve" {
if err := starlingd.Command(buildAgent).Run(os.Args[2:]); err != nil {
panic(err)
}
return
}
panic("usage: my-agent serve [flags] ")
}
```
Run it:
```bash
STARLINGD_TOKEN=secret my-agent serve --addr 127.0.0.1:8080 starling.db
```
## Flags [#flags]
| Flag | Default | Meaning |
| ----------------- | ---------------- | ------------------------------------------------------------------------------------------------------------------------- |
| `--addr` | `127.0.0.1:8080` | HTTP bind address. |
| `--token` | empty | Bearer token. `STARLINGD_TOKEN` is also read. Empty disables auth. |
| `--workers` | `4` | Number of in-process run workers. |
| `--queue` | `100` | Maximum queued runs. |
| `--job-retention` | `5m` | How long terminal in-memory job status is retained after completion, cancellation, or failure. Negative disables cleanup. |
| `--no-inspect` | `false` | Disable the inspector mount. |
## HTTP API [#http-api]
All endpoints require `Authorization: Bearer ` when auth is
configured, including `/metrics` and `/inspect/`.
| Method | Path | Meaning |
| ------ | -------------------------------------------------------- | ------------------------------------------------------------------------- |
| `GET` | `/api/v1/healthz` | Process liveness. |
| `GET` | `/api/v1/readyz` | Event log preflight and queue capacity. |
| `POST` | `/api/v1/runs` | Enqueue a run. Body: `{"goal":"..."}`. Returns `202` with `run_id`. |
| `GET` | `/api/v1/runs?limit=50&offset=0&status=completed&q=text` | List runs from the event log. |
| `GET` | `/api/v1/runs/{runID}` | Run summary/detail. Queued runs return daemon status before events exist. |
| `GET` | `/api/v1/runs/{runID}/events?limit=200&offset=0` | Raw event page. |
| `GET` | `/api/v1/runs/{runID}/stream` | Server-sent events for history plus live updates. |
| `POST` | `/api/v1/runs/{runID}/cancel` | Cancel a queued or running in-process job. |
| `GET` | `/metrics` | Prometheus metrics. |
| `GET` | `/inspect/` | Inspector, when enabled. |
Create and stream a run:
```bash
curl -sS -H "Authorization: Bearer secret" \
-H "Content-Type: application/json" \
-d '{"goal":"summarize this incident"}' \
http://127.0.0.1:8080/api/v1/runs
curl -N -H "Authorization: Bearer secret" \
http://127.0.0.1:8080/api/v1/runs//stream
```
The stream emits SSE events named `status`, `event`, `done`, and
`error`. `event` payloads include `seq`, `kind`, `timestamp`, hash
fields, and the decoded event payload.
The `/events` endpoint pages at the event-log backend when supported.
Default `limit` is 200 and the maximum accepted limit is 1000; use
`offset` to walk long runs without loading the whole event stream.
`daemon_status` is best-effort process memory for recently queued,
running, or terminal jobs. It is retained for `--job-retention`, then
discarded. Use the log-derived `status` field as the authoritative run
state.
## Programmatic server [#programmatic-server]
Use `starlingd.New` when you already own HTTP server setup:
```go
log, _ := eventlog.NewSQLite("starling.db")
inspectorLog, _ := eventlog.NewSQLite("starling.db", eventlog.WithReadOnly())
reg := prometheus.NewRegistry()
metrics := starling.NewMetrics(reg)
srv, err := starlingd.New(starlingd.Config{
Factory: func(context.Context) (*starling.Agent, error) {
return &starling.Agent{Provider: prov, Config: cfg}, nil
},
Log: log,
InspectorLog: inspectorLog,
Metrics: metrics,
Gatherer: reg,
Workers: 8,
QueueSize: 500,
Auth: starlingd.BearerAuth(os.Getenv("STARLINGD_TOKEN")),
Inspector: true,
DBPath: "starling.db",
})
```
The factory must return a fresh agent per run. `starlingd` assigns the
shared event log and metrics sink before execution, then uses
`Agent.RunWithID` so the HTTP response can expose the run id before
the worker starts.
When using `starlingd.Command`, the run log is opened writable and the
inspector is mounted with a separate SQLite `eventlog.WithReadOnly()`
handle. When using `starlingd.New` directly, pass `InspectorLog` if
you want that same hard guard; otherwise the inspector uses `Log`.
`starlingd.Command` does not expose inspector replay wiring. If your
dual-mode binary needs inspector replay, construct the server with
`starlingd.New` and pass `Config.ReplayFactory`.
## Production notes [#production-notes]
* Put TLS, rate limits, request size limits, and user auth at your
edge or reverse proxy. The built-in auth is a single bearer-token
guard for private services.
* Keep `--queue` bounded. A full queue returns `503` instead of hiding
backpressure.
* Keep `--job-retention` finite. The event log remains the source of
truth for completed runs; daemon memory only needs short-lived job
status for recently queued/running requests.
* Use a durable log backend. SQLite is fine for one daemon node;
Postgres is the better fit when other services need to query runs.
* Cancellation only covers jobs in the current process. If the daemon
dies, the event log remains the source of truth, but queued jobs are
not durable.
* Mounting `/inspect` is useful for internal operations. Keep it
behind the same private boundary as the API.
# Build your first agent (/docs/build/agent)
`starling.Agent` is configuration plus dependencies. It holds no
per-run state; all state lives in the event log. Two `Agent` instances
pointing at the same log are interchangeable.
## The struct [#the-struct]
```go
type Agent struct {
Provider provider.Provider // required
Tools []tool.Tool
Log eventlog.EventLog // required
Budget *Budget
Config Config
Namespace string // optional run-id prefix
Metrics *Metrics
}
```
`Run` validates these before starting:
| Field | Required | Failure |
| -------------- | --------------- | --------------------------------------------------- |
| `Provider` | yes | `"starling: Agent.Provider is nil"` |
| `Log` | yes | `"starling: Agent.Log is nil"` |
| `Config.Model` | yes (non-empty) | `"starling: Agent.Config.Model is empty"` |
| `Namespace` | no, but format | must not contain `/` (reserved separator) |
| `Tools` | no | tool `Name()`s must be unique within a single agent |
## Config [#config]
```go
type Config struct {
Model string
SystemPrompt string
Params cborenc.RawMessage
MaxTurns int // 0 = unlimited
RequireRawResponseHash bool
AppVersion string
EmitTimeout time.Duration // 0 = no timeout
SkipSchemaCheck bool
Logger *slog.Logger // nil → slog.Default()
}
```
| Field | Default | Notes |
| ------------------------ | ---------------- | ------------------------------------------------------------------------------ |
| `Model` | required | Provider-specific id, e.g. `"gpt-4o-mini"`, `"claude-haiku-4-5-20251001"`. |
| `SystemPrompt` | `""` | Hashed into `RunStarted.SystemPromptHash`. Changing it makes old runs diverge. |
| `Params` | `nil` | Vendor-specific param blob. Hashed into `RunStarted.ParamsHash`. |
| `MaxTurns` | `0` (unlimited) | Cap the ReAct loop. `0` is allowed but not recommended for production. |
| `RequireRawResponseHash` | `false` | Fail any turn whose `ChunkEnd` lacks a 32-byte response digest. Audit-grade. |
| `AppVersion` | `""` | Stamped into `RunStarted` alongside the Starling version. |
| `EmitTimeout` | `0` (no timeout) | Bounds each terminal-event append under `context.WithoutCancel`. |
| `SkipSchemaCheck` | `false` | Disables `eventlog.Preflight` on `Run` and `Resume`. Tests only. |
| `Logger` | `slog.Default()` | Structured slog records for run lifecycle. |
## Minimal agent [#minimal-agent]
```go
package main
import (
"context"
"os"
starling "github.com/jerkeyray/starling"
"github.com/jerkeyray/starling/eventlog"
"github.com/jerkeyray/starling/provider/openai"
)
func main() {
prov, err := openai.New(openai.WithAPIKey(os.Getenv("OPENAI_API_KEY")))
if err != nil { panic(err) }
log, err := eventlog.NewSQLite("starling.db")
if err != nil { panic(err) }
defer log.Close()
a := &starling.Agent{
Provider: prov,
Log: log,
Config: starling.Config{Model: "gpt-4o-mini", MaxTurns: 8},
}
res, err := a.Run(context.Background(), "What is 2+2?")
if err != nil { panic(err) }
println(res.FinalText)
}
```
## RunResult [#runresult]
```go
type RunResult struct {
RunID string
FinalText string
TurnCount int
ToolCallCount int
TotalCostUSD float64
InputTokens int64
OutputTokens int64
Duration time.Duration
TerminalKind event.Kind // RunCompleted | RunFailed | RunCancelled
MerkleRoot []byte
CacheStats CacheStats
}
type CacheStats struct {
Hits int // turns whose CacheReadTokens > 0
Misses int // turns that consumed input but read 0 cached prefix
ReadTokens int64 // sum of CacheReadTokens across turns
CreateTokens int64 // sum of CacheCreateTokens across turns
}
```
All values recoverable from the log; `RunResult` is a convenience.
`CacheStats` aggregates prompt-cache activity from per-turn
`AssistantMessageCompleted` events. Anthropic and other providers
that surface cache token counts populate non-zero values; for
others the field is the zero value.
## Streaming with `RunStream` [#streaming-with-runstream]
For chat-style frontends that want typed events instead of raw
log entries, `Agent.RunStream` projects the lower-level `Stream` onto
four typed variants:
```go
runID, ch, err := a.RunStream(ctx, "Look up customer 42 ...")
if err != nil { return err }
for ev := range ch {
switch e := ev.(type) {
case starling.TextDelta: // emitted on AssistantMessageCompleted
case starling.ToolCallStarted: // ToolCallScheduled
case starling.ToolCallEnded: // ToolCallCompleted/Failed; e.Err set on failure
case starling.Done: // always last; e.TerminalKind, e.FinalText, e.Err
}
}
```
The channel always closes after a single `Done`. Use `Stream`
directly if you need every event with the full envelope (sequence
numbers, every Kind).
## Resume after crash [#resume-after-crash]
```go
res, err := a.Resume(ctx, runID, "" /* extra message */)
// Refuse to re-fire pending tools (use when tools are mutating):
res, err := a.ResumeWith(ctx, runID, "", starling.WithReissueTools(false))
```
Sentinel errors you'll catch:
| Error | Meaning |
| -------------------------- | --------------------------------------------------------- |
| `ErrRunNotFound` | Resume target run id is absent from the log. |
| `ErrRunAlreadyTerminal` | Resume target ended in a terminal event. |
| `ErrPartialToolCall` | `Resume` saw pending tools and `WithReissueTools(false)`. |
| `ErrRunInUse` | Another writer advanced the chain mid-resume. |
| `ErrSchemaVersionMismatch` | Recorded schema version is unsupported by this binary. |
## Namespace [#namespace]
Run IDs are ULIDs. `Namespace = "support-agent"` produces ids like
`support-agent/01HZ8…`. Useful when one `EventLog` holds runs from
multiple agents. The separator is `/`; the namespace must not contain
one.
## Metrics [#metrics]
```go
import "github.com/prometheus/client_golang/prometheus"
reg := prometheus.NewRegistry()
metrics := starling.NewMetrics(reg)
a := &starling.Agent{
Provider: prov,
Log: log,
Metrics: metrics, // nil = no metrics, runtime is no-op
Config: starling.Config{Model: "gpt-4o-mini"},
}
```
Metric names and labels are documented under [Operations](/docs/operations).
## Dev vs prod defaults [#dev-vs-prod-defaults]
| Field | Dev | Prod |
| ------------------------ | -------------------- | -------------------------------------- |
| `Log` | `NewInMemory()` | `NewSQLite(path)` or `NewPostgres(db)` |
| `MaxTurns` | small (3-4) | bounded (8-16) — never `0` |
| `Budget` | optional | always set, especially `MaxUSD` |
| `Metrics` | nil | non-nil, scraped |
| `RequireRawResponseHash` | false | true for audit-critical workloads |
| `EmitTimeout` | `0` | `5 * time.Second` to bound shutdown |
| `SkipSchemaCheck` | true (in tests only) | false (default) |
## Where to next [#where-to-next]
- [Write a tool](/docs/build/tools): Typed tools, side effects, retries, idempotency.
- [Wire providers](/docs/build/providers): OpenAI, Anthropic, Gemini, Bedrock, OpenRouter — pick one or swap mid-fleet.
# Bound your costs (/docs/build/budgets)
`Budget` is a four-axis cap enforced at runtime. Trips emit a
`BudgetExceeded` event with exact context and unwind the run with
`RunFailed{ErrorType:"budget"}`. Inline runtime checks, not after-the-fact
dashboards.
## The struct [#the-struct]
```go
type Budget struct {
MaxInputTokens int64
MaxOutputTokens int64
MaxUSD float64
MaxWallClock time.Duration
}
```
Zero on any field disables that axis.
## Where each axis trips [#where-each-axis-trips]
| Axis | Type | Where enforced | `BudgetExceeded.Where` |
| ----------------- | --------------- | ---------------------------------------------------- | ---------------------- |
| `MaxInputTokens` | `int64` | Pre-call, before every `step.LLMCall` | `"pre_call"` |
| `MaxOutputTokens` | `int64` | Mid-stream, on every `ChunkUsage` | `"mid_stream"` |
| `MaxUSD` | `float64` | Mid-stream, using `budget/prices.go` per-model rates | `"mid_stream"` |
| `MaxWallClock` | `time.Duration` | `context.WithDeadline` wrapping the run | `"mid_stream"` |
The token and USD axes count cumulatively across the whole run, not per
turn. `MaxWallClock` is wall-clock from `Run` entry to terminal event.
## Wiring a Budget [#wiring-a-budget]
```go
a := &starling.Agent{
Provider: prov,
Log: log,
Tools: tools,
Budget: &starling.Budget{
MaxInputTokens: 100_000,
MaxOutputTokens: 8_000,
MaxUSD: 1.50,
MaxWallClock: 2 * time.Minute,
},
Config: starling.Config{Model: "gpt-4o-mini", MaxTurns: 12},
}
```
`nil` Budget disables every axis.
## The BudgetExceeded event [#the-budgetexceeded-event]
```go
type BudgetExceeded struct {
Limit string // "input_tokens" | "output_tokens" | "usd" | "wall_clock"
Cap float64
Actual float64
Where string // "pre_call" | "mid_stream"
TurnID string // omitempty
CallID string // omitempty
PartialText string // omitempty
PartialTokens int64 // omitempty
}
```
Mid-stream trips include `PartialText` and `PartialTokens` so you can
recover what the model produced before the cap was hit. Pre-call trips
don't — the call never started.
After `BudgetExceeded`, the run unwinds with:
```go
// RunFailed payload (truncated):
{
ErrorType: "budget",
Limit: "usd", // matches BudgetExceeded.Limit
// ...
}
```
## Reading a trip in CI [#reading-a-trip-in-ci]
```go
events, err := log.Read(ctx, runID)
if err != nil { return err }
for _, ev := range events {
if ev.Kind != event.KindBudgetExceeded { continue }
var be event.BudgetExceeded
if err := event.AsBudgetExceeded(ev, &be); err != nil { return err }
log.Printf("budget %s tripped at %s: cap=%.2f actual=%.2f",
be.Limit, be.Where, be.Cap, be.Actual)
if be.PartialText != "" {
log.Printf("partial output (%d tokens): %s", be.PartialTokens, be.PartialText)
}
}
```
## USD pricing [#usd-pricing]
The built-in price table lives in `budget/prices.go` and ships
rates for the major-vendor models (OpenAI, Anthropic, Gemini,
Bedrock foundation models). Rates are USD per million input /
output tokens.
For custom in-house models or vendor models the table doesn't yet
cover, register pricing at runtime:
```go
import "github.com/jerkeyray/starling/budget"
budget.RegisterPricing("my-finetune", inPerMtok, outPerMtok)
```
`RegisterPricing` clears the unknown-model warn-once memo so a
stale "no price entry for ..." warning doesn't outlive the
registration. Negative or zero rates are accepted (they multiply
through unmodified) - the intended use is custom models, not
overriding shipped rates.
For models still not in the table, `MaxUSD` enforcement skips that
axis (the runtime can't price what it doesn't know). Always set
`MaxInputTokens` and `MaxOutputTokens` as defense-in-depth.
## Picking values [#picking-values]
| Workload | Sane starting caps |
| ------------------------------- | ---------------------------------------------------------------------------------- |
| Small QA bot, 1-3 turns | `MaxInputTokens: 8k`, `MaxOutputTokens: 2k`, `MaxUSD: 0.05`, `MaxWallClock: 30s` |
| Multi-tool research, 8-12 turns | `MaxInputTokens: 100k`, `MaxOutputTokens: 8k`, `MaxUSD: 1.50`, `MaxWallClock: 2m` |
| Long-running incident triage | `MaxInputTokens: 250k`, `MaxOutputTokens: 16k`, `MaxUSD: 5.00`, `MaxWallClock: 5m` |
| Replay only (no provider call) | All axes 0; replay never pays cost. |
Production agents should always set `MaxUSD`, even if generous. It's
the only axis that scales 1:1 with billing surprises.
## Recovering after a trip [#recovering-after-a-trip]
The terminal `RunFailed{ErrorType:"budget"}` is final — the run cannot
continue under the same id. If you want a follow-up turn under a fresh
budget:
1. Read the recorded run, find the `BudgetExceeded.PartialText` if
any.
2. Construct a new run with a new goal that incorporates the partial
output as context.
3. The old run stays in the log for audit.
## Budgets vs `MaxTurns` [#budgets-vs-maxturns]
`Config.MaxTurns` caps the ReAct loop count. It's not a budget axis;
it does not emit `BudgetExceeded`. A turn cap trip terminates with
`RunFailed{ErrorType:"max_turns"}`. Use both: budgets for cost and
time, `MaxTurns` for runaway tool-use loops.
## Anti-patterns [#anti-patterns]
* **Setting only `MaxUSD`.** Models not in the price table aren't
enforced. Add token caps as defense-in-depth.
* **`MaxWallClock` shorter than your slowest tool.** A long
`step.SideEffect` HTTP call counts against wall-clock. Pick a value
that accounts for tool latency, not just LLM latency.
* **Reading `Cap` and `Actual` as integers.** Both are `float64` for
USD compatibility. Cast explicitly when comparing token counts.
* **Treating budget trips as exceptional.** They're a normal signal in
prod — instrument the `starling_budget_exceeded_total{axis=...}`
metric and alert when an axis trips at unexpected frequency.
## Where to next [#where-to-next]
- [Replay-driven tests](/docs/replay-tests): Capture a budget-tripping run as a regression fixture.
- [Operations](/docs/operations): Metrics, alerting, backups, deployment shapes.
# Persist runs (/docs/build/persistence)
`eventlog.EventLog` is the persistence interface. Three built-in
backends share it; your code only sees the interface, so swapping is a
constructor change.
## The interface [#the-interface]
```go
type EventLog interface {
Append(ctx context.Context, runID string, ev event.Event) error
Read(ctx context.Context, runID string) ([]event.Event, error)
Stream(ctx context.Context, runID string) (<-chan event.Event, error)
Close() error
}
type RunLister interface {
ListRuns(ctx context.Context) ([]RunSummary, error)
}
type RunPageLister interface {
ListRunsPage(ctx context.Context, opts RunPageOptions) (RunPage, error)
}
type RunPruner interface {
PruneRuns(ctx context.Context, opts PruneOptions) (PruneReport, error)
}
type RunPageOptions struct {
Limit int
Offset int
Status string
Query string
StartedAfter time.Time
RequireToolCalls bool
}
type RunPage struct {
Runs []RunSummary
TotalMatching int
Limit int
Offset int
}
type RunSummary struct {
RunID string
StartedAt time.Time
LastSeq uint64
TerminalKind event.Kind
// Aggregates over the run's events. Computed by every backend's
// ListRuns implementation; zero values are valid for runs that
// haven't produced an AssistantMessageCompleted yet.
TurnCount int
ToolCallCount int
InputTokens int64
OutputTokens int64
CostUSD float64
DurationMs int64 // wall time from RunStarted to last event
}
```
All three built-in backends satisfy `RunLister`, `RunPageLister`, and
`RunPruner`. The inspector uses `RunPageLister` when available, and
falls back to `RunLister` for custom backends that only implement the
older listing interface. The aggregate fields on `RunSummary` are
computed at list time so dashboards don't have to re-aggregate event
streams.
## Picking a backend [#picking-a-backend]
| Backend | Use when | Avoid when |
| ---------------------- | --------------------------------------------------------------------- | ---------------------------------------------------- |
| `NewInMemory()` | Tests, demos, ephemeral CLIs. | Anything you want to replay later. |
| `NewSQLite(path, ...)` | Single-host services, edge nodes. WAL mode, one writer, many readers. | Multi-host writers (no cross-host locking). |
| `NewPostgres(db, ...)` | Multi-host services, regulated workloads, anything wanting PITR. | Workloads where the DB is unavailable for stretches. |
## In-memory [#in-memory]
```go
log := eventlog.NewInMemory()
```
No migration, no schema check, no persistence. The whole log is gone
when the process exits. Useful for `go test` and one-shot CLIs.
## SQLite [#sqlite]
```go
log, err := eventlog.NewSQLite("starling.db")
if err != nil { return err }
defer log.Close()
```
What you get:
* **WAL mode + `synchronous=NORMAL`** — fast appends, fsync on commit.
* **Auto-migration on open** — first open installs the schema; later
opens migrate forward to the binary's schema version.
* **Per-run `_txlock=immediate`** — one writer, many readers.
* **File permissions** — `0600`, owned by the agent user.
Options:
| Option | Purpose |
| ---------------- | -------------------------------------------------------------------- |
| `WithReadOnly()` | Open with `mode=ro`. `Append` returns `ErrReadOnly`. Inspector mode. |
Read-only example (e.g., a separate inspector binary against the same
file):
```go
log, err := eventlog.NewSQLite("starling.db", eventlog.WithReadOnly())
```
You can backup a live SQLite log without stopping the agent:
```bash
sqlite3 starling.db ".backup /tmp/starling-backup.db"
```
## Postgres [#postgres]
```go
import (
_ "github.com/jackc/pgx/v5/stdlib"
"github.com/jerkeyray/starling/eventlog"
)
db, err := sql.Open("pgx", os.Getenv("DATABASE_URL"))
if err != nil { return err }
db.SetMaxOpenConns(8)
log, err := eventlog.NewPostgres(db, eventlog.WithAutoMigratePG())
if err != nil { return err }
defer log.Close()
```
What you get:
* **Per-run `pg_advisory_xact_lock`** on the run id hash — appenders to
the same run serialize; different runs are independent.
* **Multi-host safe** — any number of writers across hosts.
* **PITR / replication** — standard Postgres tooling works.
* **Postgres ≥ 11** required (uses `hashtextextended`).
Options:
| Option | Purpose |
| --------------------- | ------------------------------------------------------------------- |
| `WithAutoMigratePG()` | Run `InstallSchema` at open. Without it, run migrations explicitly. |
| `WithReadOnlyPG()` | `Append` returns `ErrReadOnly`. Inspector mode. |
Use a Postgres role with the minimum privileges you need:
```sql
-- writer
GRANT SELECT, INSERT ON eventlog_events TO starling_writer;
-- reader (inspector)
GRANT SELECT ON eventlog_events TO starling_reader;
```
## Migrations [#migrations]
```go
import "github.com/jerkeyray/starling/eventlog"
// Print current version.
v, err := eventlog.SchemaVersion(ctx, log)
// Apply pending migrations (forward-only).
report, err := eventlog.Migrate(ctx, log)
// Dry-run for CI.
report, err := eventlog.Migrate(ctx, log, eventlog.WithDryRun())
```
CLI equivalents:
```bash
starling schema-version /var/lib/starling/log.db
starling migrate /var/lib/starling/log.db
starling migrate --dry-run /var/lib/starling/log.db
```
## Preflight [#preflight]
`Agent.Run` and `Agent.Resume` call `eventlog.Preflight(ctx, log)` on
startup. It returns:
* `nil` if the schema matches.
* `ErrSchemaOutdated` if the database is older than the binary
(run `Migrate`).
* `ErrSchemaTooNew` if the database is newer than the binary
(deploy a newer binary or rollback the schema).
In-memory backends skip the check (return `nil`). Disable with
`Config.SkipSchemaCheck = true` in tests only.
## Validation [#validation]
`eventlog.Validate(events)` re-checks an entire run end to end. Use it
in CI to verify a recorded fixture hasn't drifted:
```go
events, err := log.Read(ctx, runID)
if err != nil { return err }
if err := eventlog.Validate(events); err != nil {
// wraps ErrLogCorrupt with a diagnostic.
}
```
`Validate` checks:
1. Slice non-empty, `events[0].Seq == 1`, monotonic seq with no gaps.
2. `RunID` consistent across all events.
3. Hash chain unbroken.
4. Exactly one terminal event, last in the slice.
5. First event is `RunStarted` with a supported `SchemaVersion`.
6. `TurnStarted` paired with a same-turn terminal.
7. `ToolCallScheduled` paired with `ToolCallCompleted` or
`ToolCallFailed` under the same `(CallID, Attempt)`.
8. Merkle root matches over every pre-terminal event.
## Reading and streaming [#reading-and-streaming]
```go
// One-shot read of a finished run.
events, err := log.Read(ctx, runID)
// Stream as the run unfolds (historical + live).
ch, err := log.Stream(ctx, runID)
for ev := range ch {
// ...
}
```
`Stream` delivers historical events first, then live events. The
channel closes on context cancel, log close, or buffer overflow
(internal buffer is 256 events).
## Paged listings [#paged-listings]
Use `ListRunsPage` for UI or API surfaces that browse many runs:
```go
page, err := log.ListRunsPage(ctx, eventlog.RunPageOptions{
Limit: 50,
Offset: 0,
Status: "completed",
Query: "support-ticket",
})
```
`Limit <= 0` uses the backend default. SQLite and Postgres apply
filters, ordering, and pagination in SQL before materializing run
summaries, so large logs do not need to load every run just to render
the first page. No schema migration is required.
## Retention pruning [#retention-pruning]
Pruning is an explicit operator action outside the append-only
`EventLog` contract. It deletes whole runs only; it never removes a
single event or suffix from a run.
```bash
starling prune --older-than 720h /var/lib/starling/log.db # dry run
starling prune --older-than 720h --confirm /var/lib/starling/log.db
starling prune --before 2026-01-01T00:00:00Z --status completed /var/lib/starling/log.db
```
The default selection is terminal runs (`completed`, `failed`, and
`cancelled`) older than the cutoff. In-progress runs are kept unless
you pass `--status "in progress"` or `--include-in-progress`.
For Postgres, wire the same retention policy as a maintenance job with
a role that has `SELECT` and `DELETE` on `eventlog_events`:
```go
report, err := log.(eventlog.RunPruner).PruneRuns(ctx, eventlog.PruneOptions{
Before: time.Now().Add(-90 * 24 * time.Hour),
DryRun: true,
})
if err != nil { return err }
fmt.Printf("would delete %d runs\n", report.MatchedRuns)
_, err = log.(eventlog.RunPruner).PruneRuns(ctx, eventlog.PruneOptions{
Before: time.Now().Add(-90 * 24 * time.Hour),
})
```
Keep inspector roles read-only (`SELECT` only).
## Helpers [#helpers]
```go
turns, tools, inTok, outTok, cost, durNs :=
eventlog.AggregateRun(events)
```
`AggregateRun` is the single source of truth for per-run totals
across the runtime: the inspector's totals strip, the MCP server's
`summarize_run` tool, and `RunSummary`'s aggregate fields all share
this implementation. An event whose payload fails to decode is
skipped rather than failing the whole aggregation, since callers
are typically presentation surfaces where one broken row should not
blank the dashboard.
```go
err := eventlog.ForkSQLite(ctx, srcPath, dstPath, runID, beforeSeq)
```
WAL-safe SQLite branching. Copies the source via `VACUUM INTO` (the
only way to copy a live WAL-mode database without leaking the
`.db-wal` and `.db-shm` sidecars) and truncates `runID`'s events to
those with `seq < beforeSeq`. Other runs are preserved verbatim.
`beforeSeq=0` keeps every event for `runID` (forks the run as-is);
returns `ErrForkNotFound` when nothing matches in the source. See
[`docs/cookbook/branching.md`](https://github.com/jerkeyray/starling/blob/main/docs/cookbook/branching.md)
in the runtime repo for a worked example.
## Public `merkle` package [#public-merkle-package]
```go
import "github.com/jerkeyray/starling/merkle"
```
The BLAKE3 hash-chain helpers used by `Agent.Run` are exposed as a
public package. Third parties writing their own event producers can
reuse the chain implementation rather than copying it - useful for
non-`Agent.Run` recorders that need to write into an `EventLog` and
maintain compatible chain output (e.g. importers, replay
harnesses, or custom dashboards that want to rebuild a Merkle root).
## Sentinel errors [#sentinel-errors]
```go
var (
ErrLogClosed = errors.New("eventlog: log is closed")
ErrLogCorrupt = errors.New("eventlog: log is corrupt")
ErrInvalidAppend = errors.New("eventlog: invalid append")
ErrReadOnly = errors.New("eventlog: log is read-only")
ErrSchemaOutdated = errors.New("eventlog: schema outdated; run migrate")
ErrSchemaTooNew = errors.New("eventlog: schema too new for this binary")
)
```
## Wrapping a backend with metrics [#wrapping-a-backend-with-metrics]
If you call `Append` directly outside `step.emit`, wrap the log to
capture the same latency histograms:
```go
import "github.com/jerkeyray/starling/eventlog"
obs := starling.NewMetrics(reg).EventLogObserver()
log = eventlog.WithMetrics(log, obs)
```
## Anti-patterns [#anti-patterns]
* **Multiple processes writing to one SQLite file.** Use Postgres.
* **`SkipSchemaCheck: true` in production.** Hides migrations you
forgot to run.
* **Calling `Migrate` on every process start without coordination.**
It's idempotent but wastes a transaction. Run it from your release
pipeline; let the binary preflight on startup.
* **Reusing a `runID` across runs.** Once recorded, ids are retired.
The agent mints fresh ULIDs; don't pass synthetic ids.
## Where to next [#where-to-next]
- [Bound your costs](/docs/build/budgets): Token, USD, wall-clock budgets enforced inside the runtime.
- [Operations](/docs/operations): Backups, retention, metrics, deployment shapes.
# Wire providers (/docs/build/providers)
`provider.Provider` is the streaming-completion abstraction. Adapters
share a request shape, a chunk state machine, and a conformance suite.
Pick one — switching is a one-line change.
## The interface [#the-interface]
```go
type Provider interface {
Info() Info
Stream(ctx context.Context, req *Request) (EventStream, error)
}
type Capabler interface {
Capabilities() Capabilities
}
type Capabilities struct {
Tools bool
ToolChoice bool
Reasoning bool
StopSequences bool
CacheControl bool
RequestID bool
}
```
Adapters that don't support a feature report `false` from
`Capabilities()`. The conformance suite skips capability-gated
assertions when an adapter reports `false`, so adapter authors don't
have to fake support.
## In-tree adapters [#in-tree-adapters]
| Package | Constructor | Default API version |
| --------------------- | ------------------------- | ------------------- |
| `provider/openai` | `openai.New(opts...)` | `"v1"` |
| `provider/anthropic` | `anthropic.New(opts...)` | `"2023-06-01"` |
| `provider/gemini` | `gemini.New(opts...)` | `"v1beta"` |
| `provider/bedrock` | `bedrock.New(opts...)` | AWS SDK default |
| `provider/openrouter` | `openrouter.New(opts...)` | OpenRouter default |
Each `New` returns `(provider.Provider, error)` — the interface,
not a struct pointer. Use the error path: missing API keys fail at
construction, not Run.
## OpenAI [#openai]
```go
import "github.com/jerkeyray/starling/provider/openai"
prov, err := openai.New(
openai.WithAPIKey(os.Getenv("OPENAI_API_KEY")),
)
```
| Option | Purpose |
| ----------------------- | -------------------------------------------------------------------- |
| `WithAPIKey(key)` | API key. Required unless your endpoint is unauthenticated. |
| `WithBaseURL(url)` | OpenAI-compatible endpoints (Groq, Together, Ollama, vLLM, Azure). |
| `WithOrganization(org)` | OpenAI org id; sets `OpenAI-Organization` header. |
| `WithAPIVersion(v)` | Override the URL prefix (default `"v1"`). |
| `WithProviderID(id)` | Override `Info().ID` (useful when the same code talks to many APIs). |
| `WithHTTPClient(c)` | Custom `*http.Client` (timeouts, proxies, custom transport). |
OpenAI-compatible endpoint examples:
```go
// Groq
prov, _ := openai.New(
openai.WithBaseURL("https://api.groq.com/openai/v1"),
openai.WithAPIKey(os.Getenv("GROQ_API_KEY")),
openai.WithProviderID("groq"),
)
// Local Ollama
prov, _ := openai.New(
openai.WithBaseURL("http://localhost:11434/v1"),
openai.WithProviderID("ollama"),
)
// Azure OpenAI
prov, _ := openai.New(
openai.WithBaseURL("https://YOUR-RESOURCE.openai.azure.com/openai"),
openai.WithAPIKey(os.Getenv("AZURE_OPENAI_KEY")),
openai.WithAPIVersion("2024-08-01-preview"),
)
```
## Anthropic [#anthropic]
```go
import "github.com/jerkeyray/starling/provider/anthropic"
prov, err := anthropic.New(
anthropic.WithAPIKey(os.Getenv("ANTHROPIC_API_KEY")),
)
```
| Option | Purpose |
| -------------------- | ------------------------------------------------------ |
| `WithAPIKey(key)` | API key. |
| `WithBaseURL(url)` | Custom base (proxies, gateways). |
| `WithAPIVersion(v)` | `"anthropic-version"` header (default `"2023-06-01"`). |
| `WithProviderID(id)` | Override `Info().ID`. |
| `WithHTTPClient(c)` | Custom HTTP client. |
Capabilities reported by Anthropic: tool use, extended thinking with
per-block signatures, prompt caching metadata. Cache-control hints are
attached via the `Params` field on the `Request` (CBOR blob).
## Gemini [#gemini]
```go
import "github.com/jerkeyray/starling/provider/gemini"
prov, err := gemini.New(
gemini.WithAPIKey(os.Getenv("GEMINI_API_KEY")),
)
```
| Option | Purpose |
| -------------------- | -------------------------------- |
| `WithAPIKey(key)` | API key. |
| `WithBaseURL(url)` | Custom base. |
| `WithAPIVersion(v)` | URL prefix (default `"v1beta"`). |
| `WithProviderID(id)` | Override `Info().ID`. |
| `WithHTTPClient(c)` | Custom HTTP client. |
## Amazon Bedrock [#amazon-bedrock]
```go
import (
"github.com/aws/aws-sdk-go-v2/config"
"github.com/jerkeyray/starling/provider/bedrock"
)
awsCfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
if err != nil {
return err
}
prov, err := bedrock.New(bedrock.WithAWSConfig(awsCfg))
```
| Option | Purpose |
| ----------------------- | ---------------------------------------------------------------- |
| `WithAWSConfig(cfg)` | Pre-built `aws.Config`. Standard AWS credential chain applies. |
| `WithRegion(region)` | Region override (e.g. `"us-east-1"`). |
| `WithBaseEndpoint(url)` | Custom Bedrock runtime endpoint (VPC endpoints, FIPS, gateways). |
| `WithHTTPClient(c)` | Custom `bedrockruntime.HTTPClient` (timeouts, proxies). |
| `WithProviderID(id)` | Override `Info().ID`. |
| `WithAPIVersion(v)` | Override the recorded API version label. |
The adapter calls native `ConverseStream`, so it accepts every model id
Bedrock Converse accepts: foundation model ids, inference profiles,
prompt ARNs, and provisioned-throughput ARNs. Bedrock-specific request
fields — `additionalModelRequestFields`,
`additionalModelResponseFieldPaths`, `requestMetadata`,
`performanceConfig`, `serviceTier`, `promptVariables`, `outputConfig`,
`guardrailConfig` — are passed through `Request.Params`. Unknown keys
are rejected, so misspelled fields don't silently no-op.
Capabilities reported by Bedrock: tool use (no `none` tool-choice),
reasoning content with signatures, redacted thinking, cache-aware usage
counters, `top_k` via `additionalModelRequestFields`.
## OpenRouter [#openrouter]
```go
import "github.com/jerkeyray/starling/provider/openrouter"
prov, err := openrouter.New(
openrouter.WithAPIKey(os.Getenv("OPENROUTER_API_KEY")),
openrouter.WithHTTPReferer("https://your-app.com"),
openrouter.WithXTitle("Your App"),
)
```
| Option | Purpose |
| ---------------------- | ----------------------------------------------- |
| `WithAPIKey(key)` | OpenRouter key. |
| `WithBaseURL(url)` | Override default endpoint. |
| `WithHTTPReferer(url)` | OpenRouter attribution header (`HTTP-Referer`). |
| `WithXTitle(title)` | OpenRouter attribution header (`X-Title`). |
| `WithProviderID(id)` | Override `Info().ID`. |
| `WithHTTPClient(c)` | Custom HTTP client. |
OpenRouter is a thin wrapper over the OpenAI adapter - same chunk
contract, plus attribution headers. Use OpenAI-style model strings.
## Error classification [#error-classification]
Adapters wrap their underlying SDK / HTTP errors with one of four
sentinels so retry policy lives in caller code, not in
vendor-string parsing. Categories are non-overlapping; an error
wraps at most one.
```go
var (
provider.ErrRateLimit // 429 / quota
provider.ErrAuth // 401 / 403
provider.ErrServer // 5xx
provider.ErrNetwork // DNS / dial / TLS / broken stream
)
```
The two helpers used by every in-tree adapter:
```go
provider.WrapHTTPStatus(err, status int) error
provider.ClassifyTransport(err error) error
```
`WrapHTTPStatus` annotates by HTTP status code; `status == 0`
delegates to `ClassifyTransport`, which wraps `net.Error` and
`*url.Error` with `ErrNetwork`. Statuses outside the four
categories pass through unmodified - 4xx errors that are neither
auth nor rate-limit (invalid request, model not found) reflect a
caller bug, not a transient condition, so they stay unwrapped on
purpose.
Caller code:
```go
_, stream, err := prov.Stream(ctx, req)
switch {
case errors.Is(err, provider.ErrRateLimit):
backoff(); retry()
case errors.Is(err, provider.ErrAuth):
return fmt.Errorf("starling: bad credentials: %w", err)
case errors.Is(err, provider.ErrServer), errors.Is(err, provider.ErrNetwork):
backoff(); retry()
case err != nil:
return err // 4xx caller bug; surface unmodified
}
```
If you write your own provider adapter, call `WrapHTTPStatus` from
the entry point that receives the upstream response and your
callers automatically get the same retry contract.
## The streaming chunk contract [#the-streaming-chunk-contract]
Adapters yield `StreamChunk` values. Kinds:
```text
ChunkText
ChunkReasoning
ChunkRedactedThinking
ChunkToolUseStart
ChunkToolUseDelta
ChunkToolUseEnd
ChunkUsage
ChunkEnd
```
`step.LLMCall` enforces the state machine:
* No EOF before `ChunkEnd`.
* No duplicate `ChunkToolUseStart` for the same call id.
* No chunks after `ChunkEnd`.
* `ChunkUsage` is optional; budgets enforce mid-stream when it arrives.
If an adapter violates the contract, `step.ErrInvalidStream` is the
typed error.
## Capability-gated features [#capability-gated-features]
Check what an adapter supports before enabling a feature:
```go
if c, ok := prov.(provider.Capabler); ok {
caps := c.Capabilities()
if caps.CacheControl {
// attach cache markers via Params
}
if !caps.Reasoning {
// don't ask for chain-of-thought
}
}
```
The conformance suite (`provider/conformance/`) is the contract test
adapter authors run against fixtures. If you write your own adapter,
plug it into the suite — it's the cheapest way to get the chunk
ordering, tool id stability, and cancellation right.
## Switching providers [#switching-providers]
Mid-fleet swap is one line:
```go
// Before
prov, _ := openai.New(openai.WithAPIKey(os.Getenv("OPENAI_API_KEY")))
a.Config.Model = "gpt-4o-mini"
// After
prov, _ := anthropic.New(anthropic.WithAPIKey(os.Getenv("ANTHROPIC_API_KEY")))
a.Config.Model = "claude-haiku-4-5-20251001"
```
Note: changing the provider or model **invalidates replay fixtures**.
`RunStarted` records `ProviderID`, `ModelID`, and a hash of the
`Params` blob. Replay against an old fixture with a new provider
fails fast on `RunStarted` payload divergence.
## Anti-patterns [#anti-patterns]
* **Hard-coding the API key in source.** Always read from env.
* **Ignoring the error from `New`.** Construction validates the
config; treating it as infallible hides typos.
* **Using one provider for production but a different one for replay
fixtures.** Replay only works against the recorded provider; pick
one and stick to it per fixture.
* **Setting `WithProviderID` to a value that varies per process.**
The id is hashed into `RunStarted`. Make it stable.
## Where to next [#where-to-next]
- [Persist runs](/docs/build/persistence): In-memory, SQLite, Postgres — pick a backend.
- [Bound your costs](/docs/build/budgets): Token, USD, wall-clock, per-turn caps enforced inside the runtime.
# Write a tool (/docs/build/tools)
A tool is anything implementing `tool.Tool`. The convenience wrapper
`tool.Typed` derives the JSON Schema from your input type via Go
reflection. Most tools should use it.
## The interface [#the-interface]
```go
type Tool interface {
Name() string
Description() string
Schema() json.RawMessage // JSON Schema for input
Execute(ctx context.Context, input json.RawMessage) (json.RawMessage, error)
}
```
## tool.Typed [#tooltyped]
```go
func Typed[In, Out any](
name, description string,
fn func(context.Context, In) (Out, error),
) Tool
```
`In` must be a struct (LLM tool inputs are objects at the top level).
The reflection layer panics at construction on:
* `In` not a struct (use `struct{}` for parameter-less tools)
* maps, interfaces, or recursive struct types in `In`
* duplicate JSON tag names within `In`
`Out` is JSON-marshalled. Empty results become `null`.
`Execute` recovers panics inside `fn` and returns them wrapped with
`tool.ErrPanicked` so the agent loop emits a `ToolCallFailed` instead
of crashing the process.
## A real tool [#a-real-tool]
```go
import (
"context"
"fmt"
"time"
"github.com/jerkeyray/starling/step"
"github.com/jerkeyray/starling/tool"
)
type lookupIn struct {
ID string `json:"id" jsonschema:"description=Customer id"`
}
type lookupOut struct {
Name string `json:"name"`
Plan string `json:"plan"`
LookedUp string `json:"looked_up_at"`
}
var customerLookup = tool.Typed(
"customer_lookup",
"Fetch customer name and plan by id.",
func(ctx context.Context, in lookupIn) (lookupOut, error) {
// step.SideEffect makes the HTTP call replay-safe: live runs hit
// the network, replay reads the recorded value out of the log.
out, err := step.SideEffect(ctx, "customer/"+in.ID, func() (lookupOut, error) {
return fetchCustomer(in.ID) // your real HTTP call
})
if err != nil { return lookupOut{}, err }
out.LookedUp = step.Now(ctx).UTC().Format(time.RFC3339)
return out, nil
},
)
```
Three things this gets right:
1. The HTTP call is wrapped in `step.SideEffect`. On replay, the
recorded result comes back without re-contacting your customer API.
2. The timestamp uses `step.Now(ctx)`, not `time.Now()`. Replay returns
the recorded time, so the tool's output bytes match the recording.
3. The `step.SideEffect` name (`"customer/"+id`) is stable per logical
call. Replay looks up by name; reusing the same name for the same
logical effect is the contract.
## Replay safety: what to wrap, what to not [#replay-safety-what-to-wrap-what-to-not]
| Inside a tool, you wrote… | Replay-safe? | Fix |
| ------------------------- | ------------ | ------------------------------------- |
| `time.Now()` | No | `step.Now(ctx)` |
| `rand.Intn(...)` | No | `step.Random(ctx)` (returns `uint64`) |
| `http.Get(...)` | No | `step.SideEffect(ctx, "name", ...)` |
| `os.ReadFile(...)` | No | `step.SideEffect(...)` |
| pure compute, no I/O | Yes | nothing |
| reading a constant | Yes | nothing |
Calling `step.Now`, `step.Random`, or `step.SideEffect` outside of an
active agent run **panics** — the helpers require a `ctx` derived from
`Agent.Run`. This is the contract; don't call them from background
goroutines you fork inside a tool without propagating ctx.
## Retries on transient errors [#retries-on-transient-errors]
Tools that hit flaky services should mark their errors retryable:
```go
import "github.com/jerkeyray/starling/tool"
func fetchCustomer(id string) (lookupOut, error) {
resp, err := http.Get("https://api.example.com/customers/" + id)
if err != nil {
return lookupOut{}, fmt.Errorf("customer lookup: %w", tool.ErrTransient)
}
if resp.StatusCode >= 500 {
return lookupOut{}, fmt.Errorf("upstream %d: %w", resp.StatusCode, tool.ErrTransient)
}
// ...
}
```
Then declare the tool idempotent so the runtime retries:
```go
import "github.com/jerkeyray/starling/step"
call := step.ToolCall{
Name: "customer_lookup",
Args: argsJSON,
Idempotent: true,
MaxAttempts: 3,
// Backoff defaults to 100ms × 2 with 25% jitter, capped at 10s.
}
result, err := step.CallTool(ctx, call)
```
`step.ToolCall` fields:
| Field | Type | Default if zero |
| ------------- | --------------------------------- | ---------------------------------- |
| `CallID` | string | minted at execution |
| `TurnID` | string | required |
| `Name` | string | required |
| `Args` | `json.RawMessage` | required |
| `Idempotent` | bool | false (no retries) |
| `MaxAttempts` | int | `1` (no retries) when zero |
| `Backoff` | `func(attempt int) time.Duration` | `100ms × 2^n`, jitter 25%, cap 10s |
Each retry emits its own `ToolCallScheduled`+`ToolCallCompleted/Failed`
pair under the same `CallID` with incrementing `Attempt`.
## Parallel tool calls [#parallel-tool-calls]
When the model schedules multiple tools in one turn, the agent
fans them out:
```go
results, err := step.CallTools(ctx, []step.ToolCall{a, b, c})
```
Concurrency cap is `step.DefaultMaxParallelTools = 8`. Replay
re-executes tools in the recorded completion order so byte comparison
is deterministic.
## Middleware: `tool.Wrap` [#middleware-toolwrap]
Compose cross-cutting behavior around `Execute` without
re-implementing the `Tool` interface. `Name`, `Description`, and
`Schema` pass through unchanged so the model sees the same
contract; only the runtime call path is layered.
```go
type Middleware func(
inner func(context.Context, json.RawMessage) (json.RawMessage, error),
) func(context.Context, json.RawMessage) (json.RawMessage, error)
func tool.Wrap(t Tool, mw ...Middleware) Tool
```
Composition matches `net/http.Handler`: the **last middleware
passed runs first**. The first one wraps the inner-most call,
closest to the original `Execute`.
```go
withTiming := func(inner ...) ... {
return func(ctx context.Context, in json.RawMessage) (json.RawMessage, error) {
start := time.Now()
out, err := inner(ctx, in)
slog.Info("tool", "dur", time.Since(start), "err", err)
return out, err
}
}
withAuth := func(inner ...) ... {
return func(ctx context.Context, in json.RawMessage) (json.RawMessage, error) {
if !authorized(ctx) {
return nil, errors.New("unauthorized") // short-circuits inner
}
return inner(ctx, in)
}
}
audited := tool.Wrap(myTool, withTiming, withAuth)
// withAuth runs first; if it short-circuits, withTiming and myTool.Execute are skipped.
```
Common uses: logging, timing, span injection, request
authentication, input validation that runs before the tool, output
redaction.
## Built-in tools [#built-in-tools]
`tool/builtin/` ships two reference implementations:
```go
import "github.com/jerkeyray/starling/tool/builtin"
httpFetch := builtin.Fetch() // HTTP GET; 15s timeout, 1 MiB cap
readFile, err := builtin.ReadFile("./data") // path-escape rejected
```
`Fetch()` takes no options. It only allows public `http` and `https`
URLs, caps responses at 1 MiB, times out after 15 seconds, and rejects
localhost, private, link-local, multicast, unspecified addresses, and
redirects to those addresses. It is still a small reference tool, not a
browser or crawler; wrap or replace it when you need allowlists,
authentication, custom headers, or richer HTTP policy.
`ReadFile(baseDir)` rejects `..`, absolute paths, and symlinks that
escape the base directory. Both tools are good templates for your own
tools.
## When to skip tool.Typed [#when-to-skip-tooltyped]
Reach for the raw `tool.Tool` interface when you need:
* A schema you generate yourself (e.g., dynamic enums from a database
fetched at agent construction).
* A tool whose input doesn't fit a Go struct (extremely rare).
* Tight control over error formatting in `Execute`.
Otherwise stay with `tool.Typed`. It catches more at compile time and
keeps the schema honest.
## Anti-patterns [#anti-patterns]
* **Reading `time.Now()` directly.** Replay diverges every time.
Use `step.Now(ctx)`.
* **Forking a goroutine without propagating ctx.** `step.*` helpers
panic if ctx is detached. Pass `ctx` into `errgroup.WithContext` or
similar.
* **Naming a `step.SideEffect` with a value that varies between runs**
(e.g., the current timestamp). The name *is* the lookup key. Use a
stable per-logical-call key.
* **Returning a tool error wrapping `tool.ErrTransient` for non-retryable
failures.** Wrap only when the runtime should try again. Auth errors,
bad input, and 4xx responses are not transient.
* **Mutating tool arguments inside `Execute`.** The agent records `Args`
before dispatch; mutations don't appear in the log.
## Where to next [#where-to-next]
- [Wire providers](/docs/build/providers): Pick a model, swap endpoints, OpenAI-compatible base URLs.
- [MCP tools](/docs/mcp): Mount remote MCP servers as Starling tools.