Operations

Deploying, securing, observing, and retaining Starling event logs in production.

Process model

Starling is a Go library, not a server. Two common shapes:

  1. Embedded. The agent runs inside your existing Go service. The event log is a file (SQLite) or connection (Postgres) the service already manages. The inspector ships as a separate binary or a subcommand of your binary, pointed at the same log read-only.
  2. Sidecar inspector. Agent in service A, inspector in service B with a read-only handle to the same log. Use this when operators need debugging access without redeploying the primary service.

There is no required scheduler. Agent.Run is a blocking Go call.
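
A minimal embedded sketch; the Agent literal's fields and Run's exact signature are assumptions here, not documented API:

// Embed the agent in an existing service. Field names and the Run
// signature below are illustrative.
log, err := eventlog.NewSQLite("/var/lib/starling/log.db")
if err != nil { return err }
agent := &starling.Agent{Log: log /* plus Provider, tools, ... */}
// Run blocks until the run terminates, so drive it from its own goroutine
// when the surrounding service has other work to do.
go func() {
    if err := agent.Run(ctx); err != nil {
        slog.Error("starling run failed", "err", err)
    }
}()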

Picking a backend

| Backend | Use when | Avoid when |
| --- | --- | --- |
| eventlog.NewInMemory | Tests, ephemeral CLIs. | Anything you might want to replay later. |
| eventlog.NewSQLite | Single-host services, edge nodes. WAL + per-run _txlock=immediate. | Multi-host writers: SQLite has no cross-host locking. |
| eventlog.NewPostgres | Multi-host services, regulated workloads, anything needing PITR or replication. | Workloads where the DB is unavailable for long stretches. |

Schema migrations

Every eventlog.NewSQLite call auto-migrates on open. Postgres callers run migrations explicitly:

# CLI (SQLite)
starling migrate /var/lib/starling/log.db
starling schema-version /var/lib/starling/log.db
// In-process (any backend)
log, err := eventlog.NewPostgres(db)
if err != nil { return err }
if _, err := eventlog.Migrate(ctx, log); err != nil { return err }

Agent.Run, Agent.Resume, and the inspector run eventlog.Preflight on startup and refuse to operate against a stale or too-new schema. Disable with Config.SkipSchemaCheck = true only in tests.
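
To run the same guard from your own startup or readiness path, a sketch (assuming Preflight mirrors Migrate's (ctx, log) shape):

if err := eventlog.Preflight(ctx, log); err != nil {
    // Stale or too-new schema: fail fast instead of serving.
    return fmt.Errorf("starling schema preflight: %w", err)
}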

SQLite

log, err := eventlog.NewSQLite("/var/lib/starling/log.db")
  • WAL mode is on (PRAGMA journal_mode=WAL); commit durability is synchronous=NORMAL. Set synchronous=FULL if you need stronger guarantees and can pay the latency.
  • File permissions: 0600, owned by the agent user.
  • One process per file. Multiple processes can read concurrently (WithReadOnly; see the sketch after this list), but only one process should ever write: use Postgres for multi-writer setups.
  • Backup: sqlite3 log.db ".backup /tmp/log-backup.db" while the agent is running. Restore by stopping the agent, swapping the file, and restarting.
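
A sketch of the second-process read path, assuming WithReadOnly is an option passed to NewSQLite:

// Read-only handle for the inspector or ad-hoc tooling; only one writer
// ever touches the file. The option-style call is an assumed API shape.
roLog, err := eventlog.NewSQLite("/var/lib/starling/log.db", eventlog.WithReadOnly())
if err != nil { return err }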

Postgres

db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
if err != nil { return err }
db.SetMaxOpenConns(8)
log, err := eventlog.NewPostgres(db, eventlog.WithAutoMigratePG())
  • Postgres ≥ 11 (uses hashtextextended for advisory locks).
  • Connection pool: size to expected concurrent runs plus headroom for the inspector.
  • Per-run advisory locks (pg_advisory_xact_lock) serialize appends for a given run_id. Different runs are independent.
  • Backup: standard pg_dump --table=eventlog_events for logical exports; PITR via WAL archiving for hot recovery.
  • Restore: pg_restore into an empty schema, then run eventlog.Migrate.

Security

Threat model

| Actor | What they can do | What we defend |
| --- | --- | --- |
| Operator | Runs the process, owns the DB and provider keys. | Trusted; the runtime assumes operator code is benign. |
| End user | Supplies goals and conversation messages. | Tool inputs, event payloads, replay determinism. |
| Provider | LLM API the agent talks to. | Stream chunk validation, raw-response hash via RequireRawResponseHash. |

What the hash chain does and does not prove

Proves:

  • events were appended in a specific order;
  • no event was modified after append;
  • replays reproduce the recorded behavior byte-for-byte.

Does not prove:

  • that the operator wrote the truth into the log (an operator can construct any valid run);
  • that the provider returned a specific response (RawResponseHash is a BLAKE3 digest computed by the adapter over the SDK-level response, not a vendor signature);
  • that the agent ran at the claimed wall-clock time (Timestamp comes from step.Now).

For cross-process attestation, sign the terminal RunCompleted payload externally: the Merkle root is the natural signing target.
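
A hedged sketch with ed25519; the MerkleRoot accessor on the completed payload is hypothetical, so read the root from wherever your RunCompleted payload carries it:

// Sign the Merkle root of a finished run with an operator-held key.
root := completed.MerkleRoot // hypothetical accessor; 32-byte digest assumed
sig := ed25519.Sign(priv, root)
// Store (runID, root, sig) outside the event log so a tampered log
// cannot rewrite its own attestation.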

Inspector auth and TLS

For the user-facing tour (UI, keyboard shortcuts, diff page, theme), see Inspector. This section covers the deployment posture only.

Bearer auth is configured via inspect.WithAuth(inspect.BearerAuth(token)) or the STARLING_INSPECT_TOKEN env var. CSRF protection is always on for state-changing routes (replay POSTs); the inspector sets an X-CSRF-Token cookie on safe responses.
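
In code, that might look like the following; inspect.NewServer is a hypothetical constructor (only WithAuth and BearerAuth appear in this doc), and the handler binding is assumed:

// Hypothetical constructor; WithAuth/BearerAuth are the documented calls.
srv := inspect.NewServer(log,
    inspect.WithAuth(inspect.BearerAuth(os.Getenv("STARLING_INSPECT_TOKEN"))))
// Bind to loopback only; TLS belongs to the reverse proxy (config below).
if err := http.ListenAndServe("127.0.0.1:8080", srv); err != nil { return err }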

The inspector's HTTP server is plain HTTP. Always front it with a TLS-terminating reverse proxy for non-loopback access:

server {
    listen 443 ssl;
    ssl_certificate     /etc/ssl/starling.pem;
    ssl_certificate_key /etc/ssl/starling.key;

    location / {
        proxy_pass         http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_buffering    off;          # required for SSE
        proxy_read_timeout 1h;
    }
}

For mTLS, use ssl_verify_client on at the proxy. The inspector itself does not consume client certs; the proxy's decision is authoritative.

Secrets

| Secret | Where | Notes |
| --- | --- | --- |
| Provider API keys | Agent.Provider config | Pass via env, not source. Never log. |
| STARLING_INSPECT_TOKEN | Env var | Rotate on operator changes. |
| Postgres DSN | Env var | Use a role with minimum privileges (writer: SELECT, INSERT; retention job: SELECT, DELETE; inspector: SELECT). |

Sensitive event payloads

Payloads carry full conversations, reasoning, and tool I/O. Treat the DB as a PII store: encrypt it at rest, lock down OS access, and never expose the inspector beyond a trusted audience.

No built-in field-level redaction. Redact at the tool boundary.
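
One way to do that is to wrap tools before registering them; the func(ctx, input) (string, error) shape below is an illustration of the idea, not Starling's tool interface:

// Scrub secrets from tool output before it becomes an event payload.
var apiKeyPattern = regexp.MustCompile(`(?i)api[_-]?key=\S+`)

func redacting(tool func(context.Context, string) (string, error)) func(context.Context, string) (string, error) {
    return func(ctx context.Context, input string) (string, error) {
        out, err := tool(ctx, input)
        return apiKeyPattern.ReplaceAllString(out, "api_key=[REDACTED]"), err
    }
}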

Retention

The log is append-only. Mutating any event breaks the hash chain for every later event in the same run.

Cannot do: mutate events, delete a single event from a run, reuse a run_id. The unit of deletion is the whole run.

Can do:

  • Delete whole runs (DELETE ... WHERE run_id IN (...)).
  • Archive runs to NDJSON via starling export, then delete.
  • Use starling prune for dry-run-first rolling retention on SQLite logs.
  • Partition by time (Postgres DROP PARTITION, pg_partman or a cron job rolling monthly partitions).
  • Filter the inspector view by RunSummary.StartedAt instead of deleting.

Rolling-window deletion

# Dry-run first: report what would be deleted (2160h ≈ 90 days)
starling prune --older-than 2160h /var/lib/starling/log.db
# Then delete for real
starling prune --older-than 2160h --confirm /var/lib/starling/log.db

By default, prune deletes only whole runs that have reached a terminal status. Use --status to target a single status and --limit to break large cleanup jobs into smaller batches. Run VACUUM during low traffic if you need SQLite to return freed pages to the filesystem.

For Postgres, call eventlog.RunPruner in your maintenance job with a role that has SELECT and DELETE on eventlog_events, or use time partitions when you need instant archival drops. Keep inspector roles read-only.
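
A maintenance-job sketch; only eventlog.RunPruner is named above, so the options struct and return value here are assumptions:

// Nightly retention job. PruneOptions and the returned count are
// illustrative; adapt to the real RunPruner signature.
deleted, err := eventlog.RunPruner(ctx, log, eventlog.PruneOptions{
    OlderThan: 90 * 24 * time.Hour, // matches the 2160h CLI window above
})
if err != nil { return err }
slog.Info("pruned runs", "count", deleted)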

PII deletion (right-to-erasure)

Maintain an external user_id → []run_id index: Starling does not keep one. On request, prune those whole runs and remove any archived NDJSON. Don't try to selectively redact within a run; the chain depends on every event.
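
A sketch of that index as a side table you own; the schema and query are illustrative:

// Starling does not maintain this mapping; your app writes a row per run:
//   CREATE TABLE run_owner (user_id TEXT NOT NULL, run_id TEXT NOT NULL);
rows, err := db.QueryContext(ctx, `SELECT run_id FROM run_owner WHERE user_id = $1`, userID)
if err != nil { return err }
defer rows.Close()
var runIDs []string
for rows.Next() {
    var id string
    if err := rows.Scan(&id); err != nil { return err }
    runIDs = append(runIDs, id)
}
if err := rows.Err(); err != nil { return err }
// Feed runIDs to your whole-run deletion path and delete any NDJSON
// archives that contain them; per-event redaction would break the chain.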

Metrics

Agent.Metrics = starling.NewMetrics(reg) registers a set of Prometheus collectors on the supplied prometheus.Registerer. Highlights:

| Metric | Type | Labels |
| --- | --- | --- |
| starling_runs_started_total | Counter | - |
| starling_runs_in_flight | Gauge | - |
| starling_run_duration_seconds | Histogram | status |
| starling_run_terminal_total | Counter | status, error_type |
| starling_provider_calls_total | Counter | model, status |
| starling_provider_call_duration_seconds | Histogram | model |
| starling_provider_tokens_total | Counter | model, type |
| starling_tool_calls_total | Counter | tool, status, error_type |
| starling_tool_call_duration_seconds | Histogram | tool |
| starling_eventlog_appends_total | Counter | kind, status |
| starling_eventlog_append_duration_seconds | Histogram | kind |
| starling_budget_exceeded_total | Counter | axis |

Wire the /metrics handler into your existing HTTP mux:

metrics := starling.NewMetrics(prometheus.DefaultRegisterer)
http.Handle("/metrics", starling.MetricsHandler(prometheus.DefaultGatherer))

For direct Append callers outside step.emit, wrap the log with eventlog.WithMetrics(log, observer) so latency histograms cover that path too.
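
For example (whether the *Metrics value from NewMetrics satisfies the observer parameter is an assumption):

// Wrap the log so direct Append calls hit the same latency histograms.
instrumented := eventlog.WithMetrics(log, metrics) // metrics-as-observer assumed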

Tracing

OpenTelemetry spans are emitted under the starling instrumentation name. Wire any OTLP exporter:

exp, err := otlptracegrpc.New(ctx)
if err != nil { return err }
provider := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
otel.SetTracerProvider(provider)

Expected span tree per run:

agent.run
└── agent.turn × N
    ├── provider.stream
    └── step.tool × M

Failure recovery

A crashed run leaves an open hash chain. Restart the same runID with Agent.Resume(ctx, runID, ""):

  • If the crash happened mid-turn before AssistantMessageCompleted, Resume reissues pending tool calls under fresh CallIDs.
  • If the assistant turn completed but tools were pending, same path.
  • Pass WithReissueTools(false) to refuse and inspect manually.

Resume goes through the same eventlog.Preflight check as Run, so a stale schema fails fast.
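
A restart-path sketch; Agent.Resume's signature is documented above, while the surrounding supervisor wiring (how runID is remembered) is assumed:

// On restart, resume the interrupted run instead of starting a new one.
// Add WithReissueTools(false) if you'd rather stop and inspect manually.
if err := agent.Resume(ctx, runID, ""); err != nil {
    slog.Error("resume failed", "run_id", runID, "err", err)
}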

Operational checklist

  • Backups verified by restoring into a staging DB monthly.
  • Inspector behind TLS + bearer token.
  • Provider API keys in env, not source.
  • DB file/connection user has minimum privileges.
  • eventlog.Migrate runs in your release script (Postgres), or you rely on the auto-migration in eventlog.NewSQLite.
  • Metrics scraped; dashboards alert on starling_eventlog_append_duration_seconds p99 and starling_provider_call_duration_seconds p99.
  • Retention policy implemented (rolling window, archive-to-NDJSON, or partitioning).
  • Security review for tool-side network/filesystem access.
