Monitoring + Troubleshooting

This page covers how to observe the correctness and liveness of the Sorcery data plane (metad, feedd, orderd) in production trading environments.

Primary signals

Read these directly from ring messages:

Required checks:

  • conn_state == CONNECTED on venues you trade
  • last_rx_age_ns remains bounded for each venue
  • reconnect_count does not trend upward continuously
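The three checks above can be sketched as a single function. This is a minimal illustration, not the Sorcery API: the `VenueStatus` container, the string encoding of `conn_state`, and the `max_rx_age_ns` threshold are assumptions; only the field names `conn_state`, `last_rx_age_ns`, and `reconnect_count` come from this page.

```python
from dataclasses import dataclass

CONNECTED = "CONNECTED"  # assumed string encoding of the connection state


@dataclass
class VenueStatus:
    """Hypothetical container for the status fields read from a ring message."""
    venue: str
    conn_state: str
    last_rx_age_ns: int
    reconnect_count: int


def check_status(status, max_rx_age_ns, prev_reconnect_count=None):
    """Return the list of failed checks for one venue (empty means healthy)."""
    problems = []
    if status.conn_state != CONNECTED:
        problems.append("not_connected")
    if status.last_rx_age_ns > max_rx_age_ns:
        problems.append("rx_stale")
    # A single increase is not a storm; it is reported so callers can track the trend.
    if prev_reconnect_count is not None and status.reconnect_count > prev_reconnect_count:
        problems.append("reconnected")
    return problems
```

The threshold for `last_rx_age_ns` depends on how quiet a venue can legitimately be, so it is passed in rather than fixed.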

Market-data correctness

Critical events:

  • Ring gap frame (frame.is_gap())
  • Header flags: GAP, RESET, DROP
  • Seq discontinuity in (venue, msg_type, inst_id) domain

Operational rule:

  • On any of the above, mark affected books invalid and recover via snapshot before reuse.
  • Recovery contract: Ordering + Sequencing
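A consumer-side sketch of the rule above, tracking one sequence counter per (venue, msg_type, inst_id) domain. It assumes sequence numbers increase by exactly 1 within a domain and that header flags arrive as a collection of names; both are assumptions for illustration, and `SeqTracker` and its methods are hypothetical. The Ordering + Sequencing contract remains the authority on the actual rules.

```python
CRITICAL_FLAGS = frozenset({"GAP", "RESET", "DROP"})


class SeqTracker:
    """Marks books invalid on gap/reset/drop flags or sequence discontinuity."""

    def __init__(self):
        self._last_seq = {}         # (venue, msg_type, inst_id) -> last seq seen
        self.invalid_books = set()  # (venue, inst_id) pairs awaiting a snapshot

    def on_message(self, venue, msg_type, inst_id, seq, flags=()):
        """Return True if the book for (venue, inst_id) is usable after this message."""
        key = (venue, msg_type, inst_id)
        last = self._last_seq.get(key)
        discontinuity = last is not None and seq != last + 1
        self._last_seq[key] = seq
        if discontinuity or CRITICAL_FLAGS.intersection(flags):
            self.invalid_books.add((venue, inst_id))
        return (venue, inst_id) not in self.invalid_books

    def on_ring_gap(self):
        """A ring gap frame may have hidden anything: invalidate every known book."""
        for venue, _msg_type, inst_id in self._last_seq:
            self.invalid_books.add((venue, inst_id))
        self._last_seq.clear()  # restart sequence baselines after the gap

    def snapshot_applied(self, venue, inst_id):
        """Call once a valid snapshot has been applied to the book."""
        self.invalid_books.discard((venue, inst_id))
```

A book stays invalid across later in-order messages until `snapshot_applied` is called, which mirrors the "recover via snapshot before reuse" requirement.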

Order-routing correctness

Critical events:

  • Response ring gap
  • Epoch change per venue
  • ORDER_REJECT with DISCONNECTED
  • Unknown reconciliation records (order_id = 0)

Operational rule:

  • On response gap, run QUERY_ORDERS and QUERY_BALANCES reconciliation before resuming order flow.
  • On epoch change, treat non-terminal orders as uncertain until reconciliation completes.
  • Recovery contract: Order Routing integration
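One way to enforce these rules is a per-venue gate that blocks submissions until both reconciliation queries have been applied. The sketch below is illustrative only: `VenueGate` and its methods are hypothetical, and it reduces "reconciliation completes" to the two query responses being applied. A real integration would also wait on whatever completion signal the Order Routing integration contract defines.

```python
class VenueGate:
    """Per-venue submission gate driven by response gaps and epoch changes."""

    REQUIRED_QUERIES = ("QUERY_ORDERS", "QUERY_BALANCES")

    def __init__(self):
        self.epoch = None
        self.paused = False
        self.pending = set()           # queries still awaiting an applied response
        self.uncertain_orders = set()  # non-terminal orders with unknown state

    def on_response_gap(self):
        self._start_reconciliation()

    def on_epoch(self, epoch, non_terminal_order_ids):
        # The first epoch observed is a baseline, not a change.
        if self.epoch is not None and epoch != self.epoch:
            self.uncertain_orders.update(non_terminal_order_ids)
            self._start_reconciliation()
        self.epoch = epoch

    def _start_reconciliation(self):
        self.paused = True
        self.pending = set(self.REQUIRED_QUERIES)

    def on_query_applied(self, query):
        """Call after the response to a reconciliation query has been applied."""
        self.pending.discard(query)
        if self.paused and not self.pending:
            self.uncertain_orders.clear()
            self.paused = False

    def can_submit(self):
        return not self.paused
```

A gap that arrives mid-reconciliation resets `pending`, so both queries must be rerun before the gate reopens.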

Metadata freshness

Monitor metadata generation changes and reload behavior:

  • Region spec: Metadata
  • If reload fails or stalls, treat price/qty conversions as unsafe for affected instruments.
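A possible shape for the freshness check, assuming the consumer can observe a metadata generation number and learns whether each reload succeeded. The class, its method names, and the `reload_timeout_ns` parameter are hypothetical; this page does not specify how generations or reload results are surfaced, and the sketch tracks a single global generation rather than per-instrument state.

```python
class MetadataWatch:
    """Tracks observed vs. loaded metadata generation (hypothetical interface)."""

    def __init__(self, reload_timeout_ns):
        self.reload_timeout_ns = reload_timeout_ns
        self.loaded_generation = None
        self.observed_generation = None
        self.observed_at_ns = None
        self.reload_failed = False

    def on_generation_observed(self, generation, now_ns):
        if generation != self.observed_generation:
            self.observed_generation = generation
            self.observed_at_ns = now_ns
            self.reload_failed = False

    def on_reload_result(self, generation, ok):
        if ok:
            self.loaded_generation = generation
        elif generation == self.observed_generation:
            self.reload_failed = True

    def conversions_safe(self, now_ns):
        """False once a reload has failed or has stalled past the timeout."""
        if self.loaded_generation is None or self.observed_generation is None:
            return False
        if self.loaded_generation == self.observed_generation:
            return True
        if self.reload_failed:
            return False
        return now_ns - self.observed_at_ns <= self.reload_timeout_ns
```

Whether conversions may continue on the previous generation while a reload is in flight is a policy choice; the sketch allows it until the timeout, and a stricter deployment could return False as soon as the generations differ.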

Alert conditions

Tune thresholds to venue and strategy profile, but alert on:

  • Loss of CONNECTED on any active venue
  • No STATUS message for more than three expected intervals
  • Reconnect storm (reconnect_count increasing on every consecutive observation)
  • Any market-data gap/reset/drop event
  • Any order-routing response ring gap event
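The two time-based conditions in this list (missing STATUS and reconnect storm) can be evaluated with small pure functions. This is a sketch under assumptions: the function names, the nanosecond timestamps, and the `min_increases` window are illustrative choices, not part of Sorcery.

```python
def status_overdue(now_ns, last_status_ns, expected_interval_ns, max_missed=3):
    """True when no STATUS has arrived for more than `max_missed` expected intervals."""
    return now_ns - last_status_ns > max_missed * expected_interval_ns


def reconnect_storm(samples, min_increases=3):
    """True when reconnect_count rose on each of the last `min_increases` observations.

    `samples` holds reconnect_count values in observation order.
    """
    if len(samples) <= min_increases:
        return False
    recent = samples[-(min_increases + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

For example, with a 1-second expected interval, `status_overdue` fires once more than 3 seconds have passed since the last STATUS; `reconnect_storm([1, 1, 2, 3, 4])` is true because the last three observations each increased.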

Recovery invariants

| Trigger | Required action | Resume gate |
| --- | --- | --- |
| Market-data gap / GAP / RESET / DROP | Mark affected book INVALID, request snapshot, continue draining | Resume decisions only after valid snapshot applied and deltas reconciled |
| Order-routing response gap | Pause submissions on affected venue(s), run QUERY_ORDERS + QUERY_BALANCES | Resume only after reconciliation query responses are applied |
| Order-routing epoch change | Mark non-terminal orders uncertain, wait for reconciliation stream, run venue queries | Resume only after reconciliation completes and query pass converges |
| order_id = 0 reconciliation records | Route to reconciliation handler (not strategy callback path), run venue queries as needed | Resume normal routing only after orphan state is resolved |
| Unknown non-zero order_id | Fail closed: pause venue submissions and rebuild ownership map | Resume only after ownership map validation succeeds |
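The last two rows distinguish records by `order_id`. A minimal routing sketch follows, assuming the consumer keeps a set of order IDs it owns; the `Route` enum and `classify_record` function are hypothetical names introduced for illustration.

```python
from enum import Enum


class Route(Enum):
    STRATEGY = "strategy"        # known order: normal strategy callback path
    RECONCILE = "reconcile"      # order_id = 0: reconciliation handler
    FAIL_CLOSED = "fail_closed"  # unknown non-zero id: pause venue, rebuild ownership map


def classify_record(order_id, owned_order_ids):
    """Decide where a response record goes based on its order_id."""
    if order_id == 0:
        return Route.RECONCILE
    if order_id in owned_order_ids:
        return Route.STRATEGY
    return Route.FAIL_CLOSED
```

Checking for `order_id == 0` first keeps reconciliation records off the strategy callback path even if 0 were ever added to the ownership set by mistake.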

Troubleshooting map

| Symptom | Likely cause | Action |
| --- | --- | --- |
| No ring traffic | Stack mismatch or producer down | Verify process and stack config on producer and consumer |
| Frequent md gaps | Consumer throughput below ingress rate | Reduce handler work, increase drain batch/ring size |
| Frequent ord gaps | Response consumer stalled | Prioritize response loop, reconcile before trading |
| Persistent disconnects | Venue/API/network instability | Disable affected venue routing and monitor reconnect |
| Unknown orders after restart (order_id = 0) | Exchange-side state outside orderd | Route to reconciliation stream and reconcile via venue queries before resuming |

Logging policy

This documentation does not prescribe a logging format for HFT deployments.

Log only events needed to debug Sorcery integration correctness:

  • Connection/epoch transitions
  • Gap detection and recovery start/end
  • Reject reason distributions
  • Reconciliation outcomes