Operations
Deploying, securing, observing, and retaining Starling event logs in production.
Process model
Starling is a Go library, not a server. Two common shapes:
- Embedded. The agent runs inside your existing Go service. The event log is a file (SQLite) or connection (Postgres) the service already manages. The inspector ships as a separate binary or a subcommand of your binary, pointed at the same log read-only.
- Sidecar inspector. Agent in service A, inspector in service B with a read-only handle to the same log. Use this when operators need debugging access without redeploying the primary service.
There is no required scheduler. Agent.Run is a blocking Go call.
Picking a backend
| Backend | Use when | Avoid when |
|---|---|---|
| `eventlog.NewInMemory` | Tests, ephemeral CLIs. | Anything you might want to replay later. |
| `eventlog.NewSQLite` | Single-host services, edge nodes. WAL + per-run `_txlock=immediate`. | Multi-host writers: SQLite has no cross-host locking. |
| `eventlog.NewPostgres` | Multi-host services, regulated workloads, anything needing PITR or replication. | Workloads where the DB is unavailable for long stretches. |
Schema migrations
Every `eventlog.NewSQLite` call auto-migrates on open. Postgres callers run migrations explicitly:

```sh
# CLI (SQLite)
starling migrate /var/lib/starling/log.db
starling schema-version /var/lib/starling/log.db
```

```go
// In-process (any backend)
log, err := eventlog.NewPostgres(db)
if err != nil { return err }
if _, err := eventlog.Migrate(ctx, log); err != nil { return err }
```

`Agent.Run`, `Agent.Resume`, and the inspector run `eventlog.Preflight` on startup and refuse to operate against a stale or too-new schema. Disable with `Config.SkipSchemaCheck = true` only in tests.
SQLite
```go
log, err := eventlog.NewSQLite("/var/lib/starling/log.db")
```

- WAL mode is on (`PRAGMA journal_mode=WAL`); fsync on commit is `synchronous=NORMAL`. Set `synchronous=FULL` if you need stronger guarantees and can pay the latency.
- File permissions: `0600`, owned by the agent user.
- One process per file. Multiple processes can read concurrently (`WithReadOnly`), but only one process should ever write; use Postgres for multi-writer.
- Backup: `sqlite3 log.db ".backup /tmp/log-backup.db"` while the agent is running. Restore by stopping the agent, swapping the file, and restarting.
Postgres
```go
db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
if err != nil { return err }
db.SetMaxOpenConns(8)
log, err := eventlog.NewPostgres(db, eventlog.WithAutoMigratePG())
```

- Postgres ≥ 11 (uses `hashtextextended` for advisory locks).
- Connection pool: size to expected concurrent runs plus headroom for the inspector.
- Per-run advisory locks (`pg_advisory_xact_lock`) serialize appends for a given `run_id`. Different runs are independent.
- Backup: standard `pg_dump --table=eventlog_events` for logical exports; PITR via WAL archiving for hot recovery.
- Restore: `pg_restore` into an empty schema, then run `eventlog.Migrate`.
Security
Threat model
| Actor | What they can do | What we defend |
|---|---|---|
| Operator | Runs the process, owns the DB and provider keys. | Trusted; the runtime assumes operator code is benign. |
| End user | Supplies goals and conversation messages. | Tool inputs, event payloads, replay determinism. |
| Provider | LLM API the agent talks to. | Stream chunk validation, raw-response hash via RequireRawResponseHash. |
What the hash chain does and does not prove
Proves: events were appended in a specific order; no event was modified after append; replays reproduce the recorded behavior byte for byte.
Does not prove: that the operator wrote the truth into the log
(an operator can construct any valid run); that the provider returned
a specific response (RawResponseHash is a BLAKE3 digest computed
by the adapter over the SDK-level response, not a vendor signature);
that the agent ran on the claimed wall-clock time (Timestamp comes
from step.Now).
For cross-process attestation, sign the terminal RunCompleted
payload externally: the Merkle root is the natural signing target.
Inspector auth and TLS
For the user-facing tour (UI, keyboard shortcuts, diff page, theme), see Inspector. This section covers the deployment posture only.
Bearer auth via inspect.WithAuth(inspect.BearerAuth(token)) or the
STARLING_INSPECT_TOKEN env var. CSRF protection is always on for
state-changing routes (replay POSTs); the inspector plants an
X-CSRF-Token cookie on safe responses.
The inspector's HTTP server is plain HTTP. Always front it with a TLS-terminating reverse proxy for non-loopback access:

```nginx
server {
    listen 443 ssl;
    ssl_certificate /etc/ssl/starling.pem;
    ssl_certificate_key /etc/ssl/starling.key;
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_buffering off;  # required for SSE
        proxy_read_timeout 1h;
    }
}
```

For mTLS, use `ssl_verify_client on` at the proxy. The inspector itself does not consume client certs: the proxy decision is authoritative.
Secrets
| Secret | Where | Notes |
|---|---|---|
| Provider API keys | Agent.Provider config | Pass via env, not source. Never log. |
| `STARLING_INSPECT_TOKEN` | Env var | Rotate on operator changes. |
| Postgres DSN | Env var | Use a role with minimum privileges (writer: SELECT, INSERT; retention job: SELECT, DELETE; inspector: SELECT). |
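An illustrative version of the role split in the last row, assuming the `eventlog_events` table named in the backup notes; the role names are placeholders:

```sql
-- Writer: the agent process appends events, never updates or deletes.
GRANT SELECT, INSERT ON eventlog_events TO starling_writer;
-- Retention job: reads to find expired runs, deletes whole runs.
GRANT SELECT, DELETE ON eventlog_events TO starling_retention;
-- Inspector: strictly read-only.
GRANT SELECT ON eventlog_events TO starling_inspector;
```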
Sensitive event payloads
Payloads carry full conversations, reasoning, and tool I/O. Treat the DB as PII: encrypt at rest, lock down OS access, never expose the inspector outside a trusted audience.
No built-in field-level redaction. Redact at the tool boundary.
Retention
The log is append-only. Mutating any event breaks the hash chain for every later event in the same run.
Cannot do: mutate events, delete a single event from a run, reuse a `run_id`. The unit of deletion is the whole run.

Can do:

- Delete whole runs (`DELETE ... WHERE run_id IN (...)`).
- Archive runs to NDJSON via `starling export`, then delete.
- Use `starling prune` for dry-run-first rolling retention on SQLite logs.
- Partition by time (Postgres `DROP PARTITION`, `pg_partman`, or a cron job rolling monthly partitions).
- Filter the inspector view by `RunSummary.StartedAt` instead of deleting.
Rolling-window deletion
```sh
starling prune --older-than 2160h /var/lib/starling/log.db
starling prune --older-than 2160h --confirm /var/lib/starling/log.db
```

`prune` deletes only whole terminal runs by default. Use `--status` to target one status and `--limit` to break large cleanup jobs into smaller batches. Run `VACUUM` during low traffic if you need SQLite to return freed pages to the filesystem.
For Postgres, call `eventlog.RunPruner` in your maintenance job with a role that has SELECT and DELETE on `eventlog_events`, or use time partitions when you need instant archival drops. Keep inspector roles read-only.
PII deletion (right-to-erasure)
Maintain an external user_id → []run_id index: Starling does not
keep one. On request, prune those whole runs and remove any archived
NDJSON. Don't try to selectively redact within a run; the chain depends
on every event.
Metrics
`Agent.Metrics = starling.NewMetrics(reg)` registers a set of Prometheus collectors on the supplied `prometheus.Registerer`. Highlights:
| Metric | Type | Labels |
|---|---|---|
| `starling_runs_started_total` | Counter | - |
| `starling_runs_in_flight` | Gauge | - |
| `starling_run_duration_seconds` | Histogram | status |
| `starling_run_terminal_total` | Counter | status, error_type |
| `starling_provider_calls_total` | Counter | model, status |
| `starling_provider_call_duration_seconds` | Histogram | model |
| `starling_provider_tokens_total` | Counter | model, type |
| `starling_tool_calls_total` | Counter | tool, status, error_type |
| `starling_tool_call_duration_seconds` | Histogram | tool |
| `starling_eventlog_appends_total` | Counter | kind, status |
| `starling_eventlog_append_duration_seconds` | Histogram | kind |
| `starling_budget_exceeded_total` | Counter | axis |
Wire the /metrics handler into your existing HTTP mux:
```go
metrics := starling.NewMetrics(prometheus.DefaultRegisterer)
http.Handle("/metrics", starling.MetricsHandler(prometheus.DefaultGatherer))
```

For direct Append callers outside `step.emit`, wrap the log with `eventlog.WithMetrics(log, observer)` so latency histograms cover that path too.
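For the append-latency p99 alert the checklist calls for, a PromQL sketch over the histogram (the `_bucket` suffix is the standard Prometheus histogram convention; the 5m window and threshold are yours to pick):

```promql
# p99 event-append latency over the last 5 minutes;
# alert when it stays above your SLO.
histogram_quantile(
  0.99,
  sum by (le) (rate(starling_eventlog_append_duration_seconds_bucket[5m]))
)
```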
Tracing
OpenTelemetry spans are emitted under the starling instrumentation
name. Wire any OTLP exporter:
```go
exp, err := otlptracegrpc.New(ctx)
if err != nil { return err }
provider := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
otel.SetTracerProvider(provider)
```

Expected span tree per run:

```
agent.run
└── agent.turn × N
    ├── provider.stream
    └── step.tool × M
```

Failure recovery
A crashed run leaves an open hash chain. Restart the same `runID` with `Agent.Resume(ctx, runID, "")`:

- If the crash happened mid-turn before `AssistantMessageCompleted`, Resume reissues pending tool calls under fresh `CallID`s.
- If the assistant turn completed but tools were pending, same path.
- Pass `WithReissueTools(false)` to refuse reissue and inspect manually.

Resume goes through the same `eventlog.Preflight` check as Run, so a stale schema fails fast.
Operational checklist
- Backups verified by restoring into a staging DB monthly.
- Inspector behind TLS + bearer token.
- Provider API keys in env, not source.
- DB file/connection user has minimum privileges.
- `eventlog.Migrate` in your release script (Postgres) or trusted to run on `NewSQLite`.
- Metrics scraped; dashboards alert on `starling_eventlog_append_duration_seconds` p99 and `starling_provider_call_duration_seconds` p99.
- Retention policy implemented (rolling window, archive-to-NDJSON, or partitioning).
- Security review for tool-side network/filesystem access.