Monitoring + Troubleshooting

This page covers how to observe the correctness and liveness of the Sorcery data plane (metad, feedd, orderd) in production trading environments.

Primary signals

Read these directly from ring messages:

Required checks:

Critical events:

Operational rule:

On any of the above, mark affected books invalid and recover via snapshot before reuse.
Recovery contract: Ordering + Sequencing
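
The operational rule above can be sketched as a small validity gate. This is a minimal illustration, not Sorcery's API: the `BookGate` class, the string event kinds, and the `SNAPSHOT` event name are assumptions made for the example, while `GAP`, `RESET`, and `DROP` are the triggers named in the recovery table on this page. It also assumes that applying a snapshot includes reconciling any buffered deltas.

```python
from enum import Enum, auto


class BookState(Enum):
    VALID = auto()
    INVALID = auto()  # integrity event seen, or no snapshot applied yet


class BookGate:
    """Tracks whether each book may be used for trading decisions."""

    # Trigger names from the recovery table; their representation as
    # strings is an assumption of this sketch.
    INTEGRITY_EVENTS = {"GAP", "RESET", "DROP"}

    def __init__(self):
        self._state = {}

    def on_event(self, book_id, kind):
        if kind in self.INTEGRITY_EVENTS:
            self._state[book_id] = BookState.INVALID
        elif kind == "SNAPSHOT":  # hypothetical event name
            self._state[book_id] = BookState.VALID
        # Any other event (e.g. a delta) leaves validity unchanged.

    def usable(self, book_id):
        # Books never seen are treated as invalid until a snapshot arrives.
        return self._state.get(book_id, BookState.INVALID) is BookState.VALID
```

The consumer keeps draining the ring while a book is invalid; only decision-making on that book is gated by `usable()`.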

Critical events:

Operational rule:

On response gap, run QUERY_ORDERS and QUERY_BALANCES reconciliation before resuming order flow.
On epoch change, treat non-terminal orders as uncertain until reconciliation completes.
Recovery contract: Order Routing integration
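
One way to enforce both rules is a per-venue submission gate that stays closed until every reconciliation step has been applied. The sketch below is illustrative only: `VenueGate`, its method names, and the `RECON_STREAM` marker are hypothetical, and only `QUERY_ORDERS` and `QUERY_BALANCES` are names taken from this page.

```python
class VenueGate:
    """Blocks order submission on a venue until reconciliation completes."""

    QUERIES = {"QUERY_ORDERS", "QUERY_BALANCES"}

    def __init__(self):
        self.paused = False
        self.pending = set()    # reconciliation steps still outstanding
        self.uncertain = set()  # non-terminal orders awaiting reconciliation

    def on_response_gap(self):
        self.paused = True
        self.pending |= self.QUERIES

    def on_epoch_change(self, non_terminal_order_ids):
        self.paused = True
        self.uncertain |= set(non_terminal_order_ids)
        # "RECON_STREAM" is a hypothetical marker for the end of the
        # reconciliation stream that follows an epoch change.
        self.pending |= self.QUERIES | {"RECON_STREAM"}

    def on_step_applied(self, step):
        self.pending.discard(step)
        if not self.pending:
            self.uncertain.clear()
            self.paused = False

    def may_submit(self):
        return not self.paused
```

Using a set union for `pending` means a response gap that arrives during an epoch-change recovery does not drop the outstanding reconciliation-stream step.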

Monitor metadata generation changes and reload behavior:

Region spec: Metadata
If reload fails or stalls, treat price/qty conversions as unsafe for affected instruments.
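
A reload watchdog along these lines could flag failed or stalled reloads. It is a sketch under stated assumptions: the `MetadataWatch` class, its callbacks, and the grace window `stall_after_s` are hypothetical, and whether conversions remain usable while a reload is in flight is a policy choice this page does not specify.

```python
class MetadataWatch:
    """Flags conversions as unsafe when a metadata reload fails or stalls."""

    def __init__(self, stall_after_s):
        self.stall_after_s = stall_after_s  # assumed grace window, seconds
        self.loaded_gen = None  # generation the consumer has loaded
        self.seen_gen = None    # newest generation observed in the region
        self.seen_at = None     # time seen_gen was first observed
        self.failed = False

    def on_generation_seen(self, gen, now):
        if gen != self.seen_gen:
            self.seen_gen = gen
            self.seen_at = now

    def on_reload_ok(self, gen):
        self.loaded_gen = gen
        self.failed = False

    def on_reload_failed(self):
        self.failed = True

    def conversions_safe(self, now):
        if self.failed or self.loaded_gen is None:
            return False
        if self.seen_gen is None or self.loaded_gen == self.seen_gen:
            return True
        # A newer generation exists: tolerate it only within the window.
        return (now - self.seen_at) < self.stall_after_s
```

Passing `now` explicitly keeps the class clock-agnostic, so it can be driven by a monotonic clock in production and by fixed timestamps in tests.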

Tune thresholds to venue and strategy profile, but alert on:

| Trigger | Required action | Resume gate |
| --- | --- | --- |
| Market-data gap / `GAP` / `RESET` / `DROP` | Mark affected book INVALID, request snapshot, continue draining | Resume decisions only after valid snapshot applied and deltas reconciled |
| Order-routing response gap | Pause submissions on affected venue(s), run `QUERY_ORDERS` + `QUERY_BALANCES` | Resume only after reconciliation query responses are applied |
| Order-routing epoch change | Mark non-terminal orders uncertain, wait for reconciliation stream, run venue queries | Resume only after reconciliation completes and query pass converges |
| `order_id = 0` reconciliation records | Route to reconciliation handler (not strategy callback path), run venue queries as needed | Resume normal routing only after orphan state is resolved |
| Unknown non-zero `order_id` | Fail closed: pause venue submissions and rebuild ownership map | Resume only after ownership map validation succeeds |
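
The two `order_id` rows amount to a three-way dispatch on each order response. The function below is a minimal sketch of that dispatch, assuming a hypothetical `owned_ids` set that holds the order IDs this process submitted; the returned labels are illustrative, not Sorcery identifiers.

```python
def route_response(order_id, owned_ids):
    """Pick a handling path for an order response by its order_id."""
    if order_id == 0:
        # Reconciliation record: keep it off the strategy callback path.
        return "reconcile"
    if order_id in owned_ids:
        return "strategy"
    # Unknown non-zero ID: fail closed, pause the venue, and rebuild
    # the ownership map before resuming.
    return "fail_closed"
```

Checking for `0` first keeps reconciliation records out of the ownership lookup, so an empty or partially rebuilt ownership map cannot misroute them.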

| Symptom | Likely cause | Action |
| --- | --- | --- |
| No ring traffic | Stack mismatch or producer down | Verify process and `stack` config on producer and consumer |
| Frequent md gaps | Consumer throughput below ingress rate | Reduce handler work, increase drain batch/ring size |
| Frequent ord gaps | Response consumer stalled | Prioritize response loop, reconcile before trading |
| Persistent disconnects | Venue/API/network instability | Disable affected venue routing and monitor reconnect |
| Unknown orders after restart (`order_id = 0`) | Exchange-side state outside `orderd` | Route to reconciliation stream and reconcile via venue queries before resuming |

This documentation does not prescribe a logging format for HFT deployments.

Log only events needed to debug Sorcery integration correctness: