Collector Operations Guide
This guide covers how to run the reference collector in production. It is the day-2 operations story — how the process is deployed, kept running, and observed. It does not duplicate the wire contract (POST /receipts semantics, status codes, validation scope) — that lives in ADR-0020 and the collector README. Cross-reference those for anything about the HTTP API itself.
The companion guide for the adopter side — how your agent code emits receipts to the collector from ephemeral compute — is Ephemeral Compute Deployment.
Deployment shape
The collector is a single stateless binary. All state lives in the backing store you configure; the process itself holds nothing between requests. This means:
- You can run any number of collector instances behind a load balancer with no sticky routing. Receipt uniqueness is enforced by the backing store’s unique constraint on id, not by routing every sender to the same instance.
- Horizontal scaling is a store choice, not a collector choice. See Scaling and durability.
- Rolling restarts and zero-downtime redeploys work out of the box: there is no in-memory state to drain (the drain window, --drain-timeout, only covers in-flight HTTP requests).
```
                      ┌──────────────────────────┐
SDK / HttpEmitter     │  Load balancer / proxy   │
POST /receipts  ───▶  │ (TLS termination, auth)  │
                      └────────────┬─────────────┘
                                   │
                        ┌──────────▼──────────┐   ┌──────────────────┐
                        │ collector instance  │   │ collector inst.  │
                        │ (stateless binary)  │   │ (stateless bin)  │
                        └──────────┬──────────┘   └────────┬─────────┘
                                   │                       │
                                   └───────────┬───────────┘
                                               │
                                       ┌───────▼───────┐
                                       │ backing store │
                                       │ (SQLite / PG) │
                                       └───────────────┘
```
Build the binary. From the repo root, build the collector’s main package by its module-qualified path and name the output binary explicitly:
```
go build -o collector github.com/agent-receipts/ar/collector/cmd/collector
```
(The bare go build ./cmd/collector only resolves from inside the collector/ module directory.)
Run it:
```
./collector --addr 0.0.0.0:8787 --db /data/collector.db
```
The default --addr binds loopback only (127.0.0.1:8787); opt in explicitly when exposing beyond localhost. See Configuration for the full flag reference.
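The drain window described above follows a common Go pattern: stop accepting new connections, then wait a bounded time for in-flight requests. The sketch below shows that general pattern with the standard library's http.Server.Shutdown. It is not the collector's source; drain and runDemo are hypothetical names, and the 10-second value simply mirrors the documented default.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// drain stops accepting new connections and waits up to timeout for
// in-flight requests to finish. No other state needs flushing.
func drain(srv *http.Server, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return srv.Shutdown(ctx)
}

// runDemo starts a throwaway server on a random loopback port, sends one
// request, then drains it. It returns nil when shutdown completes in time.
func runDemo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})}
	served := make(chan error, 1)
	go func() { served <- srv.Serve(ln) }()

	resp, err := http.Get("http://" + ln.Addr().String() + "/healthz")
	if err != nil {
		return err
	}
	resp.Body.Close()

	if err := drain(srv, 10*time.Second); err != nil {
		return err
	}
	// Serve reports http.ErrServerClosed after a clean Shutdown.
	if err := <-served; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

func main() {
	fmt.Println("clean shutdown:", runDemo() == nil)
}
```

If Shutdown returns context.DeadlineExceeded, some requests were still running when the window closed. Because senders can retry safely, a short window mostly costs a few retried deliveries.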
Configuration
| Flag | Env var | Default | Notes |
|---|---|---|---|
| --addr | AGENTRECEIPTS_COLLECTOR_ADDR | 127.0.0.1:8787 | HTTP listen address |
| --db | AGENTRECEIPTS_COLLECTOR_DB | collector.db | SQLite path; use :memory: for non-durable |
| --max-body-bytes | AGENTRECEIPTS_COLLECTOR_MAX_BODY_BYTES | 1048576 (1 MiB) | Per-request body cap |
| --drain-timeout | AGENTRECEIPTS_COLLECTOR_DRAIN_TIMEOUT | 10s | Graceful shutdown window |
| --version | - | - | Print version and exit |
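A deployment script may want to validate these settings before launching the binary. The following is a minimal pre-flight sketch, not the collector's own parsing code. It reads the environment variables from the table and falls back to the listed defaults. It assumes the drain timeout uses Go duration syntax (the default 10s suggests so), and it says nothing about how the collector resolves a flag against an env var when both are set. Config and loadConfig are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config holds the four runtime settings from the table above.
type Config struct {
	Addr         string
	DB           string
	MaxBodyBytes int64
	DrainTimeout time.Duration
}

// loadConfig reads the documented environment variables through getenv and
// keeps the documented default whenever a variable is unset or empty.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		Addr:         "127.0.0.1:8787",
		DB:           "collector.db",
		MaxBodyBytes: 1048576, // 1 MiB
		DrainTimeout: 10 * time.Second,
	}
	if v := getenv("AGENTRECEIPTS_COLLECTOR_ADDR"); v != "" {
		cfg.Addr = v
	}
	if v := getenv("AGENTRECEIPTS_COLLECTOR_DB"); v != "" {
		cfg.DB = v
	}
	if v := getenv("AGENTRECEIPTS_COLLECTOR_MAX_BODY_BYTES"); v != "" {
		n, err := strconv.ParseInt(v, 10, 64)
		if err != nil || n <= 0 {
			return Config{}, fmt.Errorf("max body bytes %q: want a positive integer", v)
		}
		cfg.MaxBodyBytes = n
	}
	if v := getenv("AGENTRECEIPTS_COLLECTOR_DRAIN_TIMEOUT"); v != "" {
		// Assumption: Go duration syntax such as "10s" or "1m".
		d, err := time.ParseDuration(v)
		if err != nil || d < 0 {
			return Config{}, fmt.Errorf("drain timeout %q: want a non-negative duration", v)
		}
		cfg.DrainTimeout = d
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid configuration:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```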
Backing store choices
SQLite (default, v0)
The default store: a single file on local disk, with no external dependencies and no server to manage. The store opens with PRAGMA journal_mode=WAL, which improves concurrent read performance and reduces lock contention but does not by itself change fsync behaviour. Durability on crash depends on SQLite’s synchronous setting and on OS and filesystem behaviour.
When it fits: low-to-moderate volume, single-node deployments, development, single-agent pipelines. SQLite handles thousands of receipts per second on commodity hardware without tuning.
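Limits: single writer (enforced by the database file lock), so every collector instance has to share one database file. Scaling writes across machines needs a different store: SQLite’s WAL mode does not work over a network filesystem, so a shared network mount is not a safe way to share the file. All query patterns must run against one file. GDPR erasure requires direct file-level tooling or a custom query, since the collector has no deletion endpoint by design (append-only).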
Operationally: back up the SQLite file with the sqlite3 shell’s .backup dot-command (sqlite3 /data/collector.db ".backup '/backups/collector.db'"), with VACUUM INTO, or with a filesystem snapshot. Rotate the file on a schedule if you need bounded retention.
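A scheduled job can wrap the .backup invocation above. The sketch below only builds and prints the command as a dry run, so it works on machines without the sqlite3 shell. backupArgs is a hypothetical helper and the paths are the example paths from this guide.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// backupArgs builds the argument list for the sqlite3 shell's .backup
// dot-command: the source database, then the dot-command itself.
func backupArgs(src, dst string) ([]string, error) {
	if src == "" || dst == "" {
		return nil, errors.New("source and destination paths are required")
	}
	// The destination is wrapped in single quotes, so reject characters
	// that would break out of the quoting.
	if strings.ContainsAny(dst, "'\n") {
		return nil, errors.New("destination path must not contain quotes or newlines")
	}
	return []string{src, ".backup '" + dst + "'"}, nil
}

func main() {
	args, err := backupArgs("/data/collector.db", "/backups/collector.db")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	cmd := exec.Command("sqlite3", args...)
	// Dry run: print the command. Call cmd.Run() instead to execute it
	// on a host where the sqlite3 shell is installed.
	fmt.Println(cmd.String())
}
```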
Postgres (multi-node, planned)
A Postgres backing store is on the roadmap for deployments that need horizontal write scaling or richer query patterns (filtering by chain_id, time range, agent DID). The uniqueness constraint on receipt id maps directly to a UNIQUE index; the append-only requirement means no UPDATE or DELETE statements on the receipts table.
When it fits: multi-node collector fleets, deployments that run SQL-based audit queries directly against the store, or when your organisation already operates Postgres and wants receipts in the same retention and backup pipeline.
Trade-offs: Postgres adds infrastructure complexity and a network hop. For most v0 deployments SQLite is sufficient. Postgres becomes relevant when you need to scale past a single machine or want direct SQL analytics without exporting from SQLite first.
GDPR erasure: Postgres’s row-level operations make targeted deletion easier to implement, but the collector schema is intentionally append-only and has no deletion endpoint. If your data-protection requirements mandate erasure, plan for a separate out-of-band erasure process that operates directly on the store. See ADR-0019 §S3 (tracked in issue #478) for the payload-strategy design, which affects what is stored in receipts versus referenced off-chain and therefore how much data needs erasing.
S3 / object storage (archival, planned)
Object storage (S3, GCS, R2, Azure Blob) is an append-only archive target, with each receipt stored as an individual object keyed by id. Suitable for long-term retention and audit archival where receipts are written once and read rarely.
When it fits: regulatory archive requirements; organisations that already use object storage for audit logs; cross-region replication; very high volume where storage cost matters.
Trade-offs: object storage is not suitable for interactive queries (no SQL, no index). Use it alongside a queryable store (SQLite, Postgres), or fan out receipts to both using a CompositeEmitter. Alternatively, periodically bulk-export from SQLite to S3 for archival.
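The fan-out idea can be illustrated with a small composite pattern. The sketch below is not the SDK's CompositeEmitter: Emitter, memoryEmitter, and fanOut are hypothetical types invented for illustration. Continuing after one sink fails is a design choice of this sketch, not documented SDK behaviour. It needs Go 1.20 or later for errors.Join.

```go
package main

import (
	"errors"
	"fmt"
)

// Emitter is a hypothetical interface for this sketch, not the SDK's API.
type Emitter interface {
	Emit(id string, body []byte) error
}

// memoryEmitter stands in for a real sink such as a collector or a bucket.
type memoryEmitter struct {
	name string
	fail bool
	got  map[string][]byte
}

func (m *memoryEmitter) Emit(id string, body []byte) error {
	if m.fail {
		return fmt.Errorf("%s: unavailable", m.name)
	}
	m.got[id] = body
	return nil
}

// fanOut delivers each receipt to every sink and joins the errors, so one
// failing sink does not stop delivery to the others.
type fanOut []Emitter

func (f fanOut) Emit(id string, body []byte) error {
	var errs []error
	for _, e := range f {
		if err := e.Emit(id, body); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when every sink succeeded
}

func main() {
	queryable := &memoryEmitter{name: "queryable", got: map[string][]byte{}}
	archive := &memoryEmitter{name: "archive", fail: true, got: map[string][]byte{}}

	err := fanOut{archive, queryable}.Emit("urn:receipt:example", []byte("{}"))
	fmt.Println("stored in queryable sink:", len(queryable.got) == 1)
	fmt.Println("error:", err)
}
```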
Object-lock / WORM: Object Lock (S3) or equivalent WORM flags on other platforms enforce immutability at the storage layer — a useful operational control on top of the protocol’s tamper-evidence properties. See ADR-0019 §O2 (tracked in issue #484) for the store-completeness design and rationale.
Authentication
v0 ships without authentication. This is a deliberate starting point, not an oversight. The client side (HttpEmitter) already supports api-key, bearer, and mTLS via HttpEmitterAuth (ADR-0020), so the authentication vocabulary exists. Server-side enforcement is tracked as future work.
The v0 stopgap is network-level controls:
- Run the collector inside a private VPC or VNet, accessible only to the subnets your agents run in.
- Place it behind a reverse proxy (nginx, Caddy, Envoy, a cloud load balancer) that terminates TLS and enforces authentication. The proxy can validate API keys or bearer tokens before forwarding to the collector.
- Use a service mesh (Linkerd, Istio) for mTLS between the agent’s compute and the collector, without modifying the collector binary.
When server-side HttpEmitterAuth enforcement lands, it will be a native option on the collector binary — you will be able to move auth enforcement in-process without the proxy tier if you prefer.
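If no off-the-shelf proxy is at hand, the reverse-proxy option can be approximated in a few lines of Go. This is a sketch under stated assumptions: the X-API-Key header name and the example-key value are hypothetical (check what header your HttpEmitter configuration actually sends), the upstream is a stub rather than a real collector, and TLS termination is omitted.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// validKey compares the presented key with the expected one in constant
// time. An empty expected key never matches.
func validKey(got, want string) bool {
	return want != "" && subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

// requireAPIKey rejects requests whose X-API-Key header (a hypothetical
// header name) does not carry the expected key.
func requireAPIKey(want string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !validKey(r.Header.Get("X-API-Key"), want) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// demo puts the auth proxy in front of a stub upstream and returns the
// status codes seen without a key and with the correct key.
func demo() (int, int, error) {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated) // stub standing in for the collector
	}))
	defer upstream.Close()

	target, err := url.Parse(upstream.URL)
	if err != nil {
		return 0, 0, err
	}
	proxy := httptest.NewServer(requireAPIKey("example-key", httputil.NewSingleHostReverseProxy(target)))
	defer proxy.Close()

	post := func(key string) (int, error) {
		req, err := http.NewRequest(http.MethodPost, proxy.URL+"/receipts", strings.NewReader("{}"))
		if err != nil {
			return 0, err
		}
		if key != "" {
			req.Header.Set("X-API-Key", key)
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		return resp.StatusCode, nil
	}

	noKey, err := post("")
	if err != nil {
		return 0, 0, err
	}
	withKey, err := post("example-key")
	return noKey, withKey, err
}

func main() {
	noKey, withKey, err := demo()
	fmt.Println(noKey, withKey, err)
}
```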
Scaling and durability
Horizontal scaling
The collector layer scales horizontally because it is stateless. Add instances behind the load balancer freely; no session affinity is required. Uniqueness is enforced by the backing store’s UNIQUE constraint on receipt id; a duplicate receipt arriving at any instance returns 409 Conflict, which the SDK treats as a successful delivery.
For SQLite, horizontal write scaling is limited by the file lock — multiple writers on the same SQLite file are serialised. If you need multiple concurrent writers, use Postgres when it lands.
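The reason routing does not matter can be shown with a toy model: two instances that hold no state of their own and share one store with a uniqueness rule on id. The store here is an in-memory stand-in, not the collector's SQLite layer, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// store is an in-memory stand-in for a backing store with UNIQUE(id).
type store struct {
	mu   sync.Mutex
	rows map[string][]byte
}

// insert adds the receipt and reports false when the id already exists.
func (s *store) insert(id string, body []byte) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.rows[id]; exists {
		return false
	}
	s.rows[id] = body
	return true
}

// instance models one stateless collector process: it holds only a
// reference to the shared store.
type instance struct{ st *store }

// post returns the status code the guide describes for each outcome.
func (i instance) post(id string, body []byte) int {
	if i.st.insert(id, body) {
		return http.StatusCreated
	}
	return http.StatusConflict
}

func main() {
	shared := &store{rows: map[string][]byte{}}
	a, b := instance{shared}, instance{shared}

	fmt.Println(a.post("urn:receipt:example", []byte("{}"))) // first delivery
	fmt.Println(b.post("urn:receipt:example", []byte("{}"))) // retry lands elsewhere
}
```

The retry reaches a different instance and still yields 409, because the decision is made by the store, not by the process that happened to receive the request.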
Append-only and durability
“Append-only” means receipts are never modified or deleted after insertion. Operationally this means:
- Backups are straightforward. The store only grows, so a backup taken at any point in time is a valid and complete snapshot of everything received up to that moment. You do not need to coordinate backups with the collector process (SQLite’s WAL mode allows hot backups without locking out writers).
- Immutability flags. For regulated workloads, pair append-only semantics with storage-level immutability: object lock on S3, WORM volumes, or a Postgres row-level security policy that prevents DELETE. The protocol’s tamper-evidence (hash chains) detects post-hoc alteration, but it cannot reveal that an entire session was deleted; storage-level immutability prevents that deletion in the first place. See ADR-0019 §O2 / issue #484 for the store-completeness rationale.
- Retention. The collector has no built-in retention policy. Implement retention at the storage layer: S3 lifecycle rules, filesystem rotation with archival, or (with care and documentation) a Postgres cleanup job as a deliberate exception to the append-only rule.
Idempotency and safe retry
Receipt id values are URNs of the form urn:receipt:<uuid-v4> (a UUID v4 in a urn:receipt: namespace), generated by the SDK before delivery. The store’s unique constraint on the full id string means:
- Delivering the same receipt twice returns 201 on the first attempt and 409 on subsequent ones. The SDK treats 409 as success.
- This makes retries safe. You do not need exactly-once delivery guarantees between the SDK and the collector; at-least-once is sufficient.
- Load balancers can freely retry failed requests without risk of duplicate data.
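The retry rules above can be sketched as a small at-least-once delivery loop. This is not the SDK's HttpEmitter. It assumes that 5xx responses and network errors are worth retrying and that other non-success codes such as 400 are not, and the application/json content type and the {} body are placeholders. delivered, retryable, deliver, and demo are hypothetical names. It needs Go 1.19 or later for atomic.Int32.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// delivered reports whether a status means the receipt is in the store:
// 201 for a first delivery, 409 for a duplicate of an earlier one.
func delivered(status int) bool {
	return status == http.StatusCreated || status == http.StatusConflict
}

// retryable is an assumption of this sketch: only 5xx responses are retried.
func retryable(status int) bool { return status >= 500 }

// deliver posts body until the receipt is stored, a non-retryable status
// comes back, or the attempts run out.
func deliver(client *http.Client, url string, body []byte, attempts int, backoff time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var last error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(backoff)
		}
		resp, err := client.Post(url, "application/json", bytes.NewReader(body))
		if err != nil {
			last = err // network failure: retrying is safe because ids are unique
			continue
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		switch {
		case delivered(resp.StatusCode):
			return nil
		case retryable(resp.StatusCode):
			last = fmt.Errorf("status %d", resp.StatusCode)
		default:
			return fmt.Errorf("not retrying: status %d", resp.StatusCode)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, last)
}

// demo runs deliver against a stub that fails once with 500, then returns
// 201, then 409 for anything later. It returns the number of requests made.
func demo() (int, error) {
	var calls atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch calls.Add(1) {
		case 1:
			w.WriteHeader(http.StatusInternalServerError)
		case 2:
			w.WriteHeader(http.StatusCreated)
		default:
			w.WriteHeader(http.StatusConflict)
		}
	}))
	defer srv.Close()
	err := deliver(srv.Client(), srv.URL+"/receipts", []byte(`{}`), 3, 10*time.Millisecond)
	return int(calls.Load()), err
}

func main() {
	calls, err := demo()
	fmt.Println("requests:", calls, "error:", err)
}
```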
Observability
Health check
```
GET /healthz
→ 200  store is reachable
→ 503  store is unreachable (database connection lost)
```
Wire /healthz to your load balancer’s health check. An instance that returns 503 should be taken out of rotation: it will reject all writes until the store comes back.
/healthz probes store reachability only; it is not a write-safety probe. A full disk, for example, can still allow reads while insert operations fail — the health check may return 200 while POST /receipts begins returning 500. Monitor 5xx rates on the ingest path as a separate signal.
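The two signals can be combined in an external monitor: a /healthz probe plus a 5xx ratio taken from the ingest path. The sketch below assumes a hypothetical error-ratio threshold and uses stub servers in place of real collectors. probe, ingestHealthy, and demo are illustrative names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probe reports whether GET /healthz answered 200, meaning the store is
// reachable. Any other status means the instance should leave rotation.
func probe(client *http.Client, baseURL string) (bool, error) {
	resp, err := client.Get(baseURL + "/healthz")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

// ingestHealthy combines the probe with the 5xx ratio on POST /receipts,
// because a passing probe does not prove that writes succeed.
// maxRatio is a hypothetical alerting threshold.
func ingestHealthy(probeOK bool, serverErrors, total int, maxRatio float64) bool {
	if !probeOK {
		return false
	}
	if total == 0 {
		return true
	}
	return float64(serverErrors)/float64(total) <= maxRatio
}

// demo probes one stub that answers 200 and one that answers 503.
func demo() (bool, bool, error) {
	stub := func(status int) *httptest.Server {
		return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.WriteHeader(status)
		}))
	}
	up, down := stub(http.StatusOK), stub(http.StatusServiceUnavailable)
	defer up.Close()
	defer down.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	upOK, err := probe(client, up.URL)
	if err != nil {
		return false, false, err
	}
	downOK, err := probe(client, down.URL)
	return upOK, downOK, err
}

func main() {
	upOK, downOK, err := demo()
	fmt.Println(upOK, downOK, err)
	// Probe passes, yet half of the writes fail: still unhealthy.
	fmt.Println(ingestHealthy(true, 50, 100, 0.01))
}
```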
Structured logging
The collector emits structured JSON logs (log/slog) to stdout. Each record carries the standard slog level, time, and msg fields. Records for successfully parsed receipts add id when it is available (early-rejection paths such as body-too-large or malformed JSON log without it); the accept path also adds chain_id and sequence. The collector does not emit an HTTP status field; derive status-code signals from the msg value (or from your proxy’s access logs, see Metrics).
Some early-rejection paths (empty body, trailing-data) return 400 without emitting a structured log record, and when the body cannot be parsed the log line lacks an id field. Collector log counts therefore undercount 400 traffic; use proxy access logs as the authoritative source for error-rate counts.
Key things to filter on for audit-relevant events:
| Filter | What to watch |
|---|---|
| level=ERROR | Store write failures, connection errors (e.g. msg="receipt insert failed") |
| msg="receipt rejected: …" | Malformed receipts (logged at WARN); investigate the SDK version or emitter config |
| msg="receipt already exists, returning 409" | Duplicate receipts, expected on retry; a high rate may indicate a retry loop |
| msg="receipt accepted" | A receipt was persisted; the primary audit record of what arrived and when |
| id | Correlate a specific receipt (a urn:receipt:<uuid> value) across SDK logs and collector logs |
| chain_id | Group all receipts belonging to a single agent chain |
| sequence | The receipt’s position within its chain |
For audit purposes, msg="receipt accepted" log lines — with their id, chain_id, and sequence — are the primary record of what arrived and when.
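The filters in the table can be applied by a small log post-processor. The sketch below classifies JSON log lines using only the level and msg values named in this guide. The sample lines in main are hypothetical (including their level values for non-error records), and the rejected message is matched by prefix because its suffix is elided here.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// record captures the slog fields this sketch needs.
type record struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
}

// classify maps one JSON log line to an outcome named in the table above.
func classify(line []byte) string {
	var r record
	if err := json.Unmarshal(line, &r); err != nil {
		return "unparsed"
	}
	switch {
	case r.Msg == "receipt accepted":
		return "accepted"
	case r.Msg == "receipt already exists, returning 409":
		return "conflict"
	case strings.HasPrefix(r.Msg, "receipt rejected"):
		return "rejected"
	case r.Level == "ERROR":
		return "error"
	default:
		return "other"
	}
}

// count tallies outcomes across a stream of newline-delimited records.
func count(r io.Reader) (map[string]int, error) {
	counts := map[string]int{}
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if len(sc.Bytes()) == 0 {
			continue
		}
		counts[classify(sc.Bytes())]++
	}
	return counts, sc.Err()
}

func main() {
	// Hypothetical sample records, built only from field names and msg
	// values that appear in this guide.
	sample := `{"level":"INFO","msg":"receipt accepted","id":"urn:receipt:example"}
{"level":"INFO","msg":"receipt already exists, returning 409","id":"urn:receipt:example"}
{"level":"WARN","msg":"receipt rejected: …"}
{"level":"ERROR","msg":"receipt insert failed"}`
	counts, err := count(strings.NewReader(sample))
	fmt.Println(counts, err)
}
```

In practice the reader would be os.Stdin or a log file, and the counts would feed whatever time-series system you already run.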
Metrics
There is no built-in Prometheus metrics endpoint in v0. The collector logs no HTTP status field, so derive status-code metrics from your proxy/load-balancer access logs (which record the response status directly), or by counting the specific collector msg values that map to each outcome:
| Metric | How to derive |
|---|---|
| Ingest rate (receipts/s) | Count proxy 2xx on POST /receipts, or msg="receipt accepted" log lines, per second |
| Conflict rate | Count proxy 409 on POST /receipts, or msg="receipt already exists, returning 409" log lines, per second — elevated rate signals retry loops |
| Client error rate | Count proxy 400 on POST /receipts, or msg="receipt rejected: …" log lines, per second — unexpected spikes indicate SDK misconfiguration; note that some early-rejection 400s are not logged (see above) |
| Server error rate | Count proxy 5xx on POST /receipts, or level=ERROR log lines (e.g. msg="receipt insert failed"), per second — correlate with store health |
| Store write latency | Instrument at the proxy layer or derive from response-time fields in access logs |
| Queue depth | Only applicable if you add a message queue in front of the collector; monitor at the queue layer |
If you add a message queue (SQS, Pub/Sub, Kafka) in front of the collector as a buffer against traffic spikes, monitor queue depth and consumer lag as the primary backpressure signal.
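Deriving the rates in the table from proxy access logs comes down to bucketing status codes and dividing by the observation window. A minimal sketch, assuming the statuses for POST /receipts have already been extracted from the access log; bucket and rates are hypothetical helpers.

```go
package main

import "fmt"

// bucket maps a proxy status code on POST /receipts to a metric name
// from the table above.
func bucket(status int) string {
	switch {
	case status == 409:
		return "conflict"
	case status == 400:
		return "client_error"
	case status >= 200 && status <= 299:
		return "ingest"
	case status >= 500 && status <= 599:
		return "server_error"
	default:
		return "other"
	}
}

// rates converts the statuses observed during a window into per-second
// rates. It returns an empty map for a non-positive window.
func rates(statuses []int, windowSeconds float64) map[string]float64 {
	out := map[string]float64{}
	if windowSeconds <= 0 {
		return out
	}
	counts := map[string]int{}
	for _, s := range statuses {
		counts[bucket(s)]++
	}
	for name, n := range counts {
		out[name] = float64(n) / windowSeconds
	}
	return out
}

func main() {
	// Four responses observed over a two-second window.
	fmt.Println(rates([]int{201, 201, 409, 500}, 2))
}
```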
Trust boundary
The collector is not a trusted component for chain construction. Every receipt arrives already signed and chained client-side, before it leaves the SDK. The collector stores wire bytes verbatim: it does not re-sign, reorder, recompute chain linkage (previous_receipt_hash), or verify signatures. (It does compute a receipt_hash of the raw body for storage and indexing, but this is independent of chain construction.)
This has two important operational consequences:
- Chain verification is the auditor’s job, not the collector’s. Auditors verify chains using only the agent’s public key. They never need to trust the collector operator. This is what makes multi-tenant collector infrastructure safe: tenants can share a collector without trusting each other or the operator. Use the verifier tooling (agent-receipts verify, documented in CLI Commands) to verify chains independently of the collector.
- A compromised collector cannot forge or alter receipts, but it can drop them. If a collector is compromised or selectively drops receipts, the resulting chain will have gaps. The SDK’s WALEmitter surfaces undelivered receipts; the verifier will flag a chain with missing sequence numbers. Receipts where a tool call occurred but no tool_result was delivered are classified as incomplete_tool_roundtrip by the verifier (see ADR-0019 §O3), distinguishing deliberate omission from a normal chain gap.
In short: the collector’s role is delivery and storage. Correctness — whether a chain is complete, unforged, and attributable to the right agent — is enforced cryptographically by the agent’s signing key and verified by auditors offline.
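The gap check mentioned above is simple to picture. The sketch below is not the agent-receipts verifier: it only lists sequence numbers missing between the lowest and highest values seen for one chain, and it assumes sequences are consecutive integers. It cannot notice receipts dropped from the very start or end of a chain, and it does no signature or hash verification.

```go
package main

import "fmt"

// missing returns the sequence numbers absent between the smallest and
// largest values in seqs, in ascending order. Input order does not matter.
func missing(seqs []int) []int {
	if len(seqs) == 0 {
		return nil
	}
	seen := make(map[int]bool, len(seqs))
	lo, hi := seqs[0], seqs[0]
	for _, n := range seqs {
		seen[n] = true
		if n < lo {
			lo = n
		}
		if n > hi {
			hi = n
		}
	}
	var gaps []int
	for n := lo; n <= hi; n++ {
		if !seen[n] {
			gaps = append(gaps, n)
		}
	}
	return gaps
}

func main() {
	// Sequences stored for one chain_id, with two receipts dropped.
	fmt.Println(missing([]int{1, 2, 4, 5, 7})) // prints [3 6]
}
```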
References
- ADR-0020: Emitter abstraction and remote receipt delivery
- ADR-0019: Protocol integrity gaps and mitigations
- Collector README: wire contract, validation scope, configuration flags
- Ephemeral Compute Deployment: adopter-side guide for SDK emitter configuration
- CLI Commands: verifier tooling (agent-receipts verify)