Monitoring + Troubleshooting¶
This page covers how to observe the correctness and liveness of the Sorcery data plane (metad, feedd, orderd) in production trading environments.
Primary signals¶
Read these directly from ring messages:
- Market data `STATUS`: Market Data messages
- Order routing `STATUS`: Order Routing messages
Required checks:
- `conn_state == CONNECTED` on venues you trade
- `last_rx_age_ns` remains bounded for each venue
- `reconnect_count` does not trend upward continuously
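The three required checks can be evaluated per venue on every STATUS message. The following is a minimal sketch, not Sorcery API: the `VenueHealth` class, the string encoding of `conn_state`, the `max_rx_age_ns` bound, and the definition of "trending upward" as `trend_window` consecutive increases are all assumptions made for illustration. Only the field names `conn_state`, `last_rx_age_ns`, and `reconnect_count` come from this page.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class VenueHealth:
    """Hypothetical per-venue evaluator for the three required STATUS checks."""
    max_rx_age_ns: int          # assumed bound; tune per venue
    trend_window: int = 3       # assumed: this many consecutive increases = trend
    _counts: deque = field(default_factory=deque)

    def check(self, conn_state, last_rx_age_ns, reconnect_count):
        """Return the list of violated checks (empty list means healthy)."""
        problems = []
        if conn_state != "CONNECTED":
            problems.append("not_connected")
        if last_rx_age_ns > self.max_rx_age_ns:
            problems.append("rx_stale")

        # Keep the last trend_window + 1 samples of reconnect_count.
        self._counts.append(reconnect_count)
        if len(self._counts) > self.trend_window + 1:
            self._counts.popleft()
        samples = list(self._counts)
        rising = (
            len(samples) == self.trend_window + 1
            and all(b > a for a, b in zip(samples, samples[1:]))
        )
        if rising:
            problems.append("reconnect_trend")
        return problems
```

Keeping one `VenueHealth` instance per traded venue lets thresholds differ by venue, which matches the advice elsewhere on this page to tune thresholds to venue and strategy profile.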
Market-data correctness¶
Critical events:
- Ring gap frame (`frame.is_gap()`)
- Header flags: `GAP`, `RESET`, `DROP`
- Seq discontinuity in `(venue, msg_type, inst_id)` domain
Operational rule:
- On any of the above, mark affected books invalid and recover via snapshot before reuse.
- Recovery contract: Ordering + Sequencing
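The operational rule above can be sketched as a small tracker that keeps one sequence counter per `(venue, msg_type, inst_id)` domain and a set of invalid books. This is an illustrative sketch under stated assumptions: `SeqTracker` and its method names are hypothetical, header flags are represented as a set of strings, and sequence numbers are assumed to increase by exactly 1 within a domain (consult the Ordering + Sequencing contract for the actual rule). Ring gap frames are not modelled here; a consumer would typically invalidate every book fed by that ring.

```python
class SeqTracker:
    """Hypothetical tracker: invalidates a book on flag or seq discontinuity."""

    BAD_FLAGS = frozenset({"GAP", "RESET", "DROP"})

    def __init__(self):
        self._last = {}       # (venue, msg_type, inst_id) -> last seq seen
        self.invalid = set()  # (venue, inst_id) books awaiting a snapshot

    def on_message(self, venue, msg_type, inst_id, seq, flags=frozenset()):
        """Record one message; return True only if the book may be used."""
        key = (venue, msg_type, inst_id)
        last = self._last.get(key)
        self._last[key] = seq
        # Assumption: seq advances by exactly 1 per message in a domain.
        discontinuity = last is not None and seq != last + 1
        if discontinuity or (self.BAD_FLAGS & set(flags)):
            self.invalid.add((venue, inst_id))
        return (venue, inst_id) not in self.invalid

    def on_snapshot_applied(self, venue, inst_id):
        """Call after a valid snapshot has been applied to the book."""
        self.invalid.discard((venue, inst_id))
```

Note that the tracker keeps consuming messages while a book is invalid, so the ring continues to drain during recovery.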
Order-routing correctness¶
Critical events:
- Response ring gap
- Epoch change per venue
- `ORDER_REJECT` with `DISCONNECTED`
- Unknown reconciliation records (`order_id = 0`)
Operational rule:
- On response gap, run `QUERY_ORDERS` and `QUERY_BALANCES` reconciliation before resuming order flow.
- On epoch change, treat non-terminal orders as uncertain until reconciliation completes.
- Recovery contract: Order Routing integration
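One way to enforce these rules is a per-venue submission gate that closes on a response gap or epoch change and reopens only once both reconciliation queries have been applied. The sketch below is an assumption-laden illustration: `VenueGate` and its methods are hypothetical names, and "reconciliation completes" is modelled simply as both `QUERY_ORDERS` and `QUERY_BALANCES` responses having been applied. The Order Routing integration contract defines the actual completion condition.

```python
class VenueGate:
    """Hypothetical per-venue gate for order submissions."""

    QUERIES = ("QUERY_ORDERS", "QUERY_BALANCES")

    def __init__(self):
        self.epoch = None       # last epoch observed for this venue
        self.pending = set()    # reconciliation queries not yet applied
        self.uncertain = set()  # non-terminal order ids with unknown state

    def on_response_gap(self):
        # A gap means responses were lost: require a full reconciliation.
        self.pending = set(self.QUERIES)

    def on_epoch(self, epoch, open_order_ids):
        # The first epoch seen is a baseline; any later change closes the gate.
        if self.epoch is not None and epoch != self.epoch:
            self.uncertain |= set(open_order_ids)
            self.pending = set(self.QUERIES)
        self.epoch = epoch

    def on_query_applied(self, query):
        self.pending.discard(query)
        if not self.pending:
            # Assumption: applying both query results resolves uncertainty.
            self.uncertain.clear()

    def can_submit(self):
        return not self.pending
```

The strategy would call `can_submit()` before every order and drop or queue the order while the gate is closed.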
Metadata freshness¶
Monitor metadata generation changes and reload behavior:
- Region spec: Metadata
- If reload fails or stalls, treat price/qty conversions as unsafe for affected instruments.
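A freshness watch for this rule might compare the latest metadata generation observed with the generation the consumer has actually loaded. The sketch below is hypothetical throughout: `MetadataWatch`, its callbacks, and the `reload_deadline_ns` stall threshold are illustrative assumptions, and it tracks a single generation counter rather than per-instrument state.

```python
class MetadataWatch:
    """Hypothetical watch: are price/qty conversions currently safe?"""

    def __init__(self, reload_deadline_ns):
        self.reload_deadline_ns = reload_deadline_ns  # assumed stall threshold
        self.seen_gen = None     # latest generation observed
        self.seen_at_ns = None   # when that generation was first observed
        self.loaded_gen = None   # generation successfully loaded
        self.failed = False      # last reload attempt failed

    def on_generation(self, gen, now_ns):
        if gen != self.seen_gen:
            self.seen_gen = gen
            self.seen_at_ns = now_ns
            self.failed = False

    def on_reload_ok(self, gen):
        self.loaded_gen = gen
        self.failed = False

    def on_reload_failed(self):
        self.failed = True

    def conversions_safe(self, now_ns):
        if self.loaded_gen is None or self.failed:
            return False                      # nothing loaded, or reload failed
        if self.loaded_gen == self.seen_gen:
            return True                       # up to date
        # A reload is in flight: unsafe once it exceeds the deadline (stalled).
        return now_ns - self.seen_at_ns <= self.reload_deadline_ns
```

A stricter deployment could treat any mismatch between `loaded_gen` and `seen_gen` as unsafe by setting `reload_deadline_ns` to 0.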
Alert conditions¶
Tune thresholds to venue and strategy profile, but alert on:
- Loss of `CONNECTED` on any active venue
- No `STATUS` for greater than three expected intervals
- Reconnect storm (monotone reconnect-count increase)
- Any market-data gap/reset/drop event
- Any order-routing response ring gap event
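The STATUS-silence condition reduces to a single comparison against a monotonic clock. In this sketch the function name, the `expected_interval_ns` parameter, and the use of local receive timestamps are assumptions; the page does not define how the expected STATUS interval is obtained.

```python
import time


def status_silent(now_ns, last_status_ns, expected_interval_ns, max_missed=3):
    """True if no STATUS arrived for more than max_missed expected intervals."""
    return now_ns - last_status_ns > max_missed * expected_interval_ns


# Usage: stamp each STATUS with time.monotonic_ns() on receipt, then poll.
last_status_ns = time.monotonic_ns()
alert = status_silent(time.monotonic_ns(), last_status_ns,
                      expected_interval_ns=1_000_000_000)
```

A monotonic clock is used so that wall-clock adjustments cannot trigger or mask the alert.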
Recovery invariants¶
| Trigger | Required action | Resume gate |
|---|---|---|
| Market-data gap / `GAP` / `RESET` / `DROP` | Mark affected book `INVALID`, request snapshot, continue draining | Resume decisions only after valid snapshot applied and deltas reconciled |
| Order-routing response gap | Pause submissions on affected venue(s), run `QUERY_ORDERS` + `QUERY_BALANCES` | Resume only after reconciliation query responses are applied |
| Order-routing epoch change | Mark non-terminal orders uncertain, wait for reconciliation stream, run venue queries | Resume only after reconciliation completes and query pass converges |
| `order_id = 0` reconciliation records | Route to reconciliation handler (not strategy callback path), run venue queries as needed | Resume normal routing only after orphan state is resolved |
| Unknown non-zero `order_id` | Fail closed: pause venue submissions and rebuild ownership map | Resume only after ownership map validation succeeds |
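The last two rows of the table amount to a dispatch decision on each incoming order record. A minimal sketch follows, assuming the consumer keeps a set of order ids it owns; `classify_record`, `owned_ids`, and the returned labels are hypothetical names used only for illustration.

```python
def classify_record(order_id, owned_ids):
    """Decide where an order record goes, per the recovery invariants."""
    if order_id == 0:
        # Orphan record: reconciliation handler, never the strategy callback.
        return "reconcile"
    if order_id not in owned_ids:
        # Unknown non-zero id: pause the venue and rebuild the ownership map.
        return "fail_closed"
    return "strategy"
```

For example, with `owned_ids = {101, 102}`, a record for `101` goes to the strategy, a record with `order_id = 0` goes to the reconciliation handler, and a record for `999` causes the venue to fail closed.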
Troubleshooting map¶
| Symptom | Likely cause | Action |
|---|---|---|
| No ring traffic | Stack mismatch or producer down | Verify process and stack config on producer and consumer |
| Frequent md gaps | Consumer throughput below ingress rate | Reduce handler work, increase drain batch/ring size |
| Frequent ord gaps | Response consumer stalled | Prioritize response loop, reconcile before trading |
| Persistent disconnects | Venue/API/network instability | Disable affected venue routing and monitor reconnect |
| Unknown orders after restart (`order_id = 0`) | Exchange-side state outside `orderd` | Route to reconciliation stream and reconcile via venue queries before resuming |
Logging policy¶
This documentation does not prescribe a logging format for HFT deployments.
Log only events needed to debug Sorcery integration correctness:
- Connection/epoch transitions
- Gap detection and recovery start/end
- Reject reason distributions
- Reconciliation outcomes