
Collector Operations Guide

This guide covers how to run the reference collector in production. It is the day-2 operations story — how the process is deployed, kept running, and observed. It does not duplicate the wire contract (POST /receipts semantics, status codes, validation scope) — that lives in ADR-0020 and the collector README. Cross-reference those for anything about the HTTP API itself.

The companion guide for the adopter side — how your agent code emits receipts to the collector from ephemeral compute — is Ephemeral Compute Deployment.

The collector is a single stateless binary. All state lives in the backing store you configure; the process itself holds nothing between requests. This means:

  • You can run any number of collector instances behind a load balancer with no sticky routing. Receipt uniqueness is enforced by the backing store’s unique constraint on id, not by routing every sender to the same instance.
  • Horizontal scaling is a store choice, not a collector choice. See Scaling and durability.
  • Rolling restarts and zero-downtime redeploys work out of the box — there is no in-memory state to drain (the drain window, --drain-timeout, only covers in-flight HTTP requests).
                        ┌──────────────────────────┐
  SDK / HttpEmitter     │  Load balancer / proxy   │
  POST /receipts  ───▶  │ (TLS termination, auth)  │
                        └────────────┬─────────────┘
                          ┌──────────▼──────────┐    ┌──────────────────┐
                          │ collector instance  │    │ collector inst.  │
                          │ (stateless binary)  │    │ (stateless bin)  │
                          └──────────┬──────────┘    └────────┬─────────┘
                                     │                        │
                                     └──────────┬─────────────┘
                                        ┌───────▼───────┐
                                        │ backing store │
                                        │ (SQLite / PG) │
                                        └───────────────┘

Build the binary. From the repo root, build the collector’s main package by its module-qualified path and name the output binary explicitly:

go build -o collector github.com/agent-receipts/ar/collector/cmd/collector

(The bare go build ./cmd/collector only resolves from inside the collector/ module directory.)

Run it:

./collector --addr 0.0.0.0:8787 --db /data/collector.db

The default --addr binds loopback only (127.0.0.1:8787) — opt in explicitly when exposing beyond localhost. See Configuration for the full flag reference.

| Flag | Env var | Default | Notes |
| --- | --- | --- | --- |
| --addr | AGENTRECEIPTS_COLLECTOR_ADDR | 127.0.0.1:8787 | HTTP listen address |
| --db | AGENTRECEIPTS_COLLECTOR_DB | collector.db | SQLite path; use :memory: for non-durable |
| --max-body-bytes | AGENTRECEIPTS_COLLECTOR_MAX_BODY_BYTES | 1048576 (1 MiB) | Per-request body cap |
| --drain-timeout | AGENTRECEIPTS_COLLECTOR_DRAIN_TIMEOUT | 10s | Graceful shutdown window |
| --version | | | Print version and exit |

SQLite is the default store: a single file on local disk, with no external dependencies and no server to manage. The store opens with PRAGMA journal_mode=WAL, which improves concurrent read performance and reduces lock contention but does not by itself change fsync behaviour. Durability on crash depends on SQLite’s synchronous setting and on the OS and filesystem defaults.

When it fits: low-to-moderate volume, single-node deployments, development, single-agent pipelines. SQLite handles thousands of receipts per second on commodity hardware without tuning.

Limits: single writer (enforced by the database file lock), so horizontal scaling requires a shared network filesystem or a different store. SQLite’s own documentation cautions that file locking is unreliable on many network filesystems, so treat the shared-filesystem route with care. All query patterns must run against one file. GDPR erasure requires direct file-level tooling or a custom query, since the collector is append-only by design and has no deletion endpoint.

Operationally: back up the SQLite file with the sqlite3 shell’s .backup dot-command (sqlite3 /data/collector.db ".backup '/backups/collector.db'"), with VACUUM INTO, or with a filesystem snapshot. Rotate the file on a schedule if you need bounded retention.

A Postgres backing store is on the roadmap for deployments that need horizontal write scaling or richer query patterns (filtering by chain_id, time range, agent DID). The uniqueness constraint on receipt id maps directly to a UNIQUE index; the append-only requirement means no UPDATE or DELETE statements on the receipts table.

When it fits: multi-node collector fleets, deployments that run SQL-based audit queries directly against the store, or when your organisation already operates Postgres and wants receipts in the same retention and backup pipeline.

Trade-offs: Postgres adds infrastructure complexity and a network hop. For most v0 deployments SQLite is sufficient. Postgres becomes relevant when you need to scale past a single machine or want direct SQL analytics without exporting from SQLite first.

GDPR erasure: Postgres’s row-level operations make targeted deletion easier to implement, but the collector schema is intentionally append-only and has no deletion endpoint. If your data-protection requirements mandate erasure, plan for a separate out-of-band erasure process that operates directly on the store. See ADR-0019 §S3 (tracked in issue #478) for the payload-strategy design, which affects what is stored in receipts versus referenced off-chain and therefore how much data needs erasing.

Object storage (S3, GCS, R2, Azure Blob) is an append-only archive target — each receipt stored as an individual object keyed by id. Suitable for long-term retention and audit archival where receipts are written once and read rarely.

When it fits: regulatory archive requirements; organisations that already use object storage for audit logs; cross-region replication; very high volume where storage cost matters.

Trade-offs: object storage is not suitable for interactive queries (no SQL, no index). Use it alongside a queryable store (SQLite, Postgres), or fan out receipts to both using a CompositeEmitter. Alternatively, periodically bulk-export from SQLite to S3 for archival.

Object-lock / WORM: Object Lock (S3) or equivalent WORM flags on other platforms enforce immutability at the storage layer — a useful operational control on top of the protocol’s tamper-evidence properties. See ADR-0019 §O2 (tracked in issue #484) for the store-completeness design and rationale.

v0 ships without authentication. This is a deliberate starting point, not an oversight. The client side — HttpEmitter — already supports api-key, bearer, and mTLS via HttpEmitterAuth (ADR-0020), so the authentication vocabulary exists. Server-side enforcement is tracked as future work.

The v0 stopgap is network-level controls:

  • Run the collector inside a private VPC or VNet, accessible only to the subnets your agents run in.
  • Place it behind a reverse proxy (nginx, Caddy, Envoy, a cloud load balancer) that terminates TLS and enforces authentication. The proxy can validate API keys or bearer tokens before forwarding to the collector.
  • Use a service mesh (Linkerd, Istio) for mTLS between the agent’s compute and the collector, without modifying the collector binary.

When server-side HttpEmitterAuth enforcement lands, it will be a native option on the collector binary — you will be able to move auth enforcement in-process without the proxy tier if you prefer.
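
As a rough illustration of the proxy stopgap described above, the following Go sketch puts an API-key check in front of a collector using the standard library's reverse proxy. It is not part of the collector and not a substitute for a hardened proxy: the X-API-Key header name, the key value, and the helper names are all assumptions made for the example, and TLS termination is left out. The upstream is a stub that always answers 201, so the program runs without a real collector.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// requireAPIKey rejects requests whose X-API-Key header does not match want.
// The header name is an assumption made for this sketch.
func requireAPIKey(want string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := r.Header.Get("X-API-Key")
		// Constant-time comparison avoids leaking key contents through timing.
		if got == "" || subtle.ConstantTimeCompare([]byte(got), []byte(want)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newAuthProxy authenticates each request and then forwards it to the
// collector listening at upstream, for example http://127.0.0.1:8787.
func newAuthProxy(upstream, apiKey string) (http.Handler, error) {
	target, err := url.Parse(upstream)
	if err != nil {
		return nil, err
	}
	return requireAPIKey(apiKey, httputil.NewSingleHostReverseProxy(target)), nil
}

// statusThroughProxy puts the proxy in front of a stub collector that always
// answers 201 and returns the status seen by a client sending sentKey.
func statusThroughProxy(sentKey string) int {
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated)
	}))
	defer stub.Close()

	handler, err := newAuthProxy(stub.URL, "example-key")
	if err != nil {
		panic(err)
	}
	proxy := httptest.NewServer(handler)
	defer proxy.Close()

	req, err := http.NewRequest(http.MethodPost, proxy.URL+"/receipts", strings.NewReader("{}"))
	if err != nil {
		panic(err)
	}
	if sentKey != "" {
		req.Header.Set("X-API-Key", sentKey)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	return resp.StatusCode
}

func main() {
	fmt.Println("no key:   ", statusThroughProxy(""))
	fmt.Println("wrong key:", statusThroughProxy("wrong"))
	fmt.Println("right key:", statusThroughProxy("example-key"))
}
```

Requests without a valid key never reach the collector, which matches the intent of keeping the v0 binary on a loopback or private address behind the proxy.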

The collector layer scales horizontally because it is stateless. Add instances behind the load balancer freely — no session affinity required. Uniqueness is enforced by the backing store’s UNIQUE constraint on receipt id; a duplicate receipt arriving at any instance returns 409 Conflict, which the SDK treats as a successful delivery.
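
The interaction between stateless instances and the store's uniqueness constraint can be sketched in a few lines of Go. This is a toy model, not the collector's code: the in-memory store stands in for the UNIQUE constraint, and reading the id from a top-level "id" JSON field is an assumption about the receipt body made only for this example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// store models the backing store's unique constraint on receipt id.
type store struct {
	mu  sync.Mutex
	ids map[string]struct{}
}

// insert reports false when id is already present, mirroring a
// unique-constraint violation.
func (s *store) insert(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.ids[id]; exists {
		return false
	}
	s.ids[id] = struct{}{}
	return true
}

// instance returns a handler that holds no state of its own; every decision
// is delegated to the shared store.
func instance(s *store) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var receipt struct {
			ID string `json:"id"` // assumed location of the id
		}
		if err := json.NewDecoder(r.Body).Decode(&receipt); err != nil || receipt.ID == "" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if !s.insert(receipt.ID) {
			w.WriteHeader(http.StatusConflict)
			return
		}
		w.WriteHeader(http.StatusCreated)
	})
}

// sameReceiptToBoth posts one receipt to instance A and then to instance B.
func sameReceiptToBoth() (statusA, statusB int) {
	shared := &store{ids: map[string]struct{}{}}
	a := httptest.NewServer(instance(shared))
	defer a.Close()
	b := httptest.NewServer(instance(shared))
	defer b.Close()

	// Placeholder id in the urn:receipt:<uuid-v4> form.
	body := []byte(`{"id":"urn:receipt:00000000-0000-4000-8000-000000000000"}`)
	post := func(url string) int {
		resp, err := http.Post(url+"/receipts", "application/json", bytes.NewReader(body))
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		return resp.StatusCode
	}
	statusA = post(a.URL)
	statusB = post(b.URL)
	return statusA, statusB
}

func main() {
	a, b := sameReceiptToBoth()
	fmt.Println("instance A:", a)
	fmt.Println("instance B:", b)
}
```

The second instance returns 409 even though it never saw the first request, because the decision lives in the store rather than in either process.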

For SQLite, horizontal write scaling is limited by the file lock — multiple writers on the same SQLite file are serialised. If you need multiple concurrent writers, use Postgres when it lands.

“Append-only” means receipts are never modified or deleted after insertion. Operationally this means:

  • Backups are straightforward. The store only grows, so a backup taken at any point in time is a valid and complete snapshot of everything received up to that moment. You do not need to coordinate backups with the collector process: SQLite’s WAL mode allows hot backups without locking out writers, provided you use .backup, VACUUM INTO, or an atomic filesystem snapshot rather than a plain copy of the live file.
  • Immutability flags. For regulated workloads, pair append-only semantics with storage-level immutability: object lock on S3, WORM volumes, or a Postgres row-level security policy that prevents DELETE. The protocol’s tamper-evidence (hash chains) detects post-hoc alteration of a chain, but it cannot show that an entire session was deleted; storage-level immutability closes that gap by preventing the deletion. See ADR-0019 §O2 / issue #484 for the store-completeness rationale.
  • Retention. The collector has no built-in retention policy. Implement retention at the storage layer: S3 lifecycle rules, filesystem rotation with archival, or — with care and documentation — a Postgres cleanup job as a deliberate exception to the append-only rule.

Receipt id values are URNs of the form urn:receipt:<uuid-v4> (a UUID v4 in a urn:receipt: namespace), generated by the SDK before delivery. The store’s unique constraint on the full id string means:

  • Delivering the same receipt twice returns 201 on the first attempt and 409 on subsequent ones. The SDK treats 409 as success.
  • This makes retries safe. You do not need exactly-once delivery guarantees between the SDK and the collector — at-least-once is sufficient.
  • Load balancers can freely retry failed requests without risk of duplicate data.
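
The retry behaviour described in the list above can be sketched as a small at-least-once delivery loop. This is not the SDK's HttpEmitter; the function names, the application/json content type, the backoff value, and the placeholder receipt body are assumptions for illustration. The stub collector fails once with a 500 and then behaves like a store with a unique id constraint.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// deliver POSTs body until the collector answers 201 (stored) or 409
// (already stored). It returns the number of attempts made.
func deliver(client *http.Client, url string, body []byte, maxAttempts int, backoff time.Duration) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := client.Post(url, "application/json", bytes.NewReader(body))
		switch {
		case err != nil:
			lastErr = err // network failure: worth retrying
		case resp.StatusCode == http.StatusCreated || resp.StatusCode == http.StatusConflict:
			resp.Body.Close()
			return attempt, nil
		case resp.StatusCode >= 500:
			resp.Body.Close()
			lastErr = fmt.Errorf("collector returned %s", resp.Status)
		default:
			// Any other status means the receipt itself was refused;
			// retrying the same bytes cannot help.
			resp.Body.Close()
			return attempt, fmt.Errorf("receipt refused: %s", resp.Status)
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

// simulate delivers the same receipt twice to a stub collector. The stub
// fails once with a 500 and afterwards enforces uniqueness like the store.
func simulate() (first, second int, err error) {
	var mu sync.Mutex
	calls, stored := 0, false
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		calls++
		switch {
		case calls == 1:
			w.WriteHeader(http.StatusInternalServerError)
		case !stored:
			stored = true
			w.WriteHeader(http.StatusCreated)
		default:
			w.WriteHeader(http.StatusConflict)
		}
	}))
	defer stub.Close()

	// Placeholder body; the stub does not inspect it.
	body := []byte(`{"id":"urn:receipt:00000000-0000-4000-8000-000000000000"}`)
	client := &http.Client{Timeout: 5 * time.Second}
	if first, err = deliver(client, stub.URL+"/receipts", body, 3, 10*time.Millisecond); err != nil {
		return first, 0, err
	}
	second, err = deliver(client, stub.URL+"/receipts", body, 3, 10*time.Millisecond)
	return first, second, err
}

func main() {
	first, second, err := simulate()
	if err != nil {
		panic(err)
	}
	fmt.Println("attempts for first delivery: ", first)
	fmt.Println("attempts for repeat delivery:", second)
}
```

The first delivery succeeds on its second attempt (500, then 201), and the repeat delivery succeeds immediately on a 409, so neither the caller nor the store ends up with duplicate data.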

GET /healthz
  200  store is reachable
  503  store is unreachable (database connection lost)

Wire /healthz to your load balancer’s health check. An instance that returns 503 should be taken out of rotation — it will reject all writes until the store comes back.

/healthz probes store reachability only; it is not a write-safety probe. A full disk, for example, can still allow reads while insert operations fail — the health check may return 200 while POST /receipts begins returning 500. Monitor 5xx rates on the ingest path as a separate signal.
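
For setups where the health check is scripted rather than configured in a load balancer, a probe along these lines may be enough. It is a minimal sketch: the function names and the two-second timeout are assumptions, and the stub server only imitates the 200 and 503 responses listed above. As the preceding paragraph notes, a passing probe says nothing about write safety.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// inRotation reports whether the instance at baseURL should keep receiving
// traffic: true only when GET /healthz answers 200.
func inRotation(client *http.Client, baseURL string) bool {
	resp, err := client.Get(baseURL + "/healthz")
	if err != nil {
		return false // unreachable instances are treated as unhealthy
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
	return resp.StatusCode == http.StatusOK
}

// probeStub runs inRotation against a stub that answers every request,
// including /healthz, with the given status.
func probeStub(status int) bool {
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
	defer stub.Close()
	return inRotation(&http.Client{Timeout: 2 * time.Second}, stub.URL)
}

func main() {
	fmt.Println("store reachable:  ", probeStub(http.StatusOK))
	fmt.Println("store unreachable:", probeStub(http.StatusServiceUnavailable))
}
```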

The collector emits structured JSON logs (log/slog) to stdout. Each record carries the standard slog level, time, and msg fields. Records for successfully parsed receipts add id; early-rejection paths such as body-too-large or malformed JSON log without it. The accept path also adds chain_id and sequence. The collector does not emit an HTTP status field, so derive status-code signals from the msg value (or from your proxy’s access logs, see Metrics).

Some early-rejection paths (empty body, trailing-data) return 400 without emitting a structured log record; when the body cannot be parsed, the log line will also lack an id field. Use proxy access logs as the authoritative source for error-rate counts, because collector log counts will undercount 400 traffic.

Key things to filter on for audit-relevant events:

| Filter | What to watch |
| --- | --- |
| level=ERROR | Store write failures, connection errors (e.g. msg="receipt insert failed") |
| msg="receipt rejected: …" | Malformed receipts (logged at WARN); investigate the SDK version or emitter config |
| msg="receipt already exists, returning 409" | Duplicate receipts; expected on retry, but a high rate may indicate a retry loop |
| msg="receipt accepted" | A receipt was persisted; the primary audit record of what arrived and when |
| id | Correlate a specific receipt (a urn:receipt:<uuid> value) across SDK logs and collector logs |
| chain_id | Group all receipts belonging to a single agent chain |
| sequence | The receipt’s position within its chain |

For audit purposes, msg="receipt accepted" log lines — with their id, chain_id, and sequence — are the primary record of what arrived and when.
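
One way to pull that audit record out of the log stream is sketched below. The field names (level, time, msg, id, chain_id, sequence) come from this guide; everything else is an assumption, including the integer type of sequence, the chain_id value format, and the hand-written sample lines, which are not captured collector output.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// logRecord holds the fields this guide says the collector logs.
type logRecord struct {
	Level    string `json:"level"`
	Time     string `json:"time"`
	Msg      string `json:"msg"`
	ID       string `json:"id"`
	ChainID  string `json:"chain_id"`
	Sequence int64  `json:"sequence"` // integer type is an assumption
}

// acceptedReceipts scans newline-delimited JSON and keeps only the
// "receipt accepted" records. Lines that fail to decode are skipped.
func acceptedReceipts(r io.Reader) ([]logRecord, error) {
	var out []logRecord
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		var rec logRecord
		if err := json.Unmarshal(sc.Bytes(), &rec); err != nil {
			continue
		}
		if rec.Msg == "receipt accepted" {
			out = append(out, rec)
		}
	}
	return out, sc.Err()
}

// sampleLog is hand-written for illustration; only the field names and msg
// values come from this guide.
const sampleLog = `{"time":"2024-01-01T00:00:00Z","level":"INFO","msg":"receipt accepted","id":"urn:receipt:00000000-0000-4000-8000-000000000001","chain_id":"chain-a","sequence":1}
{"time":"2024-01-01T00:00:01Z","level":"INFO","msg":"receipt already exists, returning 409","id":"urn:receipt:00000000-0000-4000-8000-000000000001"}
{"time":"2024-01-01T00:00:02Z","level":"INFO","msg":"receipt accepted","id":"urn:receipt:00000000-0000-4000-8000-000000000002","chain_id":"chain-a","sequence":2}
not json
`

// fromSample runs the extraction over the sample above.
func fromSample() []logRecord {
	recs, err := acceptedReceipts(strings.NewReader(sampleLog))
	if err != nil {
		panic(err)
	}
	return recs
}

func main() {
	for _, rec := range fromSample() {
		fmt.Println(rec.Time, rec.ChainID, rec.Sequence, rec.ID)
	}
}
```

In practice the reader would be the collector's stdout or a log file rather than a string constant.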

There is no built-in Prometheus metrics endpoint in v0. The collector logs no HTTP status field, so derive status-code metrics from your proxy/load-balancer access logs (which record the response status directly), or by counting the specific collector msg values that map to each outcome:

| Metric | How to derive |
| --- | --- |
| Ingest rate (receipts/s) | Count proxy 2xx on POST /receipts, or msg="receipt accepted" log lines, per second |
| Conflict rate | Count proxy 409 on POST /receipts, or msg="receipt already exists, returning 409" log lines, per second; an elevated rate signals retry loops |
| Client error rate | Count proxy 400 on POST /receipts, or msg="receipt rejected: …" log lines, per second; unexpected spikes indicate SDK misconfiguration. Some early-rejection 400s are not logged (see above) |
| Server error rate | Count proxy 5xx on POST /receipts, or level=ERROR log lines (e.g. msg="receipt insert failed"), per second; correlate with store health |
| Store write latency | Instrument at the proxy layer or derive from response-time fields in access logs |
| Queue depth | Only applicable if you add a message queue in front of the collector; monitor at the queue layer |

If you add a message queue (SQS, Pub/Sub, Kafka) in front of the collector as a buffer against traffic spikes, monitor queue depth and consumer lag as the primary backpressure signal.
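
Where only collector logs are available, the msg-to-outcome mapping in the metrics table could be turned into counters roughly as follows. The outcome labels, type names, and sample batch are made up for the example, and the level shown for the 409 line is an assumption. Matching on the "receipt rejected:" prefix reflects how the message is written in this guide, not a confirmed log format. Unlogged early-rejection 400s will not appear in these counts.

```go
package main

import (
	"fmt"
	"strings"
)

// classify maps one log record to the outcome it represents.
func classify(level, msg string) string {
	switch {
	case msg == "receipt accepted":
		return "accepted"
	case msg == "receipt already exists, returning 409":
		return "conflict"
	case strings.HasPrefix(msg, "receipt rejected:"):
		return "client_error"
	case level == "ERROR":
		return "server_error"
	default:
		return "other"
	}
}

// entry is the subset of a log record needed for classification.
type entry struct{ level, msg string }

// tally counts outcomes over a batch of records, for example the records
// emitted during one second.
func tally(batch []entry) map[string]int {
	counts := make(map[string]int)
	for _, e := range batch {
		counts[classify(e.level, e.msg)]++
	}
	return counts
}

// sampleTally runs tally over a hand-written batch.
func sampleTally() map[string]int {
	return tally([]entry{
		{"INFO", "receipt accepted"},
		{"INFO", "receipt accepted"},
		{"INFO", "receipt already exists, returning 409"},
		{"WARN", "receipt rejected: example reason"},
		{"ERROR", "receipt insert failed"},
	})
}

func main() {
	fmt.Println(sampleTally())
}
```

Dividing each counter by the batch's time window gives the per-second rates listed in the table.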

The collector is not a trusted component for chain construction. Every receipt arrives already signed and chained client-side, before it leaves the SDK. The collector stores wire bytes verbatim — it does not re-sign, reorder, recompute chain linkage (previous_receipt_hash), or verify signatures. (It does compute a receipt_hash of the raw body for storage and indexing, but this is independent of chain construction.)

This has two important operational consequences:

  1. Chain verification is the auditor’s job, not the collector’s. Auditors verify chains using only the agent’s public key. They never need to trust the collector operator. This is what makes multi-tenant collector infrastructure safe — tenants can share a collector without trusting each other or the operator. Use the verifier tooling (agent-receipts verify, documented in CLI Commands) to verify chains independently of the collector.

  2. A compromised collector cannot forge receipts or alter them without detection, but it can drop them. If a collector is compromised or selectively drops receipts, the resulting chain will have gaps. The SDK’s WALEmitter surfaces undelivered receipts; the verifier will flag a chain with missing sequence numbers. Receipts where a tool call occurred but no tool_result was delivered are classified as incomplete_tool_roundtrip by the verifier (see ADR-0019 §O3), distinguishing deliberate omission from a normal chain gap.
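
To make "missing sequence numbers" concrete, here is a small sketch of a gap check over the sequence values observed for one chain. It is not the verifier's algorithm: it assumes sequences are consecutive integers, and it only finds gaps between the lowest and highest values seen, so receipts dropped from the very start or end of a chain would need other evidence (such as the WALEmitter's record of undelivered receipts).

```go
package main

import (
	"fmt"
	"sort"
)

// missingSequences returns the sequence numbers absent between the lowest
// and highest values in seen. The input may be unordered.
func missingSequences(seen []int64) []int64 {
	if len(seen) == 0 {
		return nil
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]int64(nil), seen...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	var missing []int64
	for i := 1; i < len(sorted); i++ {
		for n := sorted[i-1] + 1; n < sorted[i]; n++ {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	// Sequences 4 and 6 never arrived.
	fmt.Println(missingSequences([]int64{1, 2, 5, 3, 7})) // prints [4 6]
}
```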

In short: the collector’s role is delivery and storage. Correctness — whether a chain is complete, unforged, and attributable to the right agent — is enforced cryptographically by the agent’s signing key and verified by auditors offline.