Reliability Posture

This document describes Vorantiq’s reliability architecture, status discipline, and incident communication doctrine. The pre-launch posture is stated explicitly so that this document is not mistaken for an active commercial SLA.

Pre-launch posture

Vorantiq has not yet reached general availability (GA). The paying tenant count is small (design partners plus the platform owner’s organization). The on-call rotation is a single engineer. Status-page communications are authored manually. No paying-customer SLA is contractually enforced today; the targets below describe what we will commit to as we move to GA.

Service Level Objectives — target for GA

Production API availability: ≥ 99.9% over 30 days

Auth flow latency: p95 ≤ 500 ms

Agent execution latency: p95 ≤ 8 s (single-shot)

Provider call success: ≥ 99% over 5 min

Webhook ingestion (Stripe): ≥ 99.95%

These are targets, not contractually enforced commitments. Until the OpenTelemetry (OTel) rollout (B.5.2) lands and 30 days of data exist, no SLO can be computed.
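
To make the availability target concrete, the sketch below converts a target percentage into an error budget and refuses to evaluate an SLO until a full window of data exists. It is an illustration only: the `AvailabilitySlo` class, its method names, and the request counts are hypothetical and do not describe Vorantiq's actual tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AvailabilitySlo:
    """Hypothetical model of one availability target from the list above."""
    name: str
    target: float      # e.g. 0.999 for 99.9%
    window_days: int   # evaluation window in days

    def error_budget_minutes(self) -> float:
        """Minutes of unavailability the target permits over the window."""
        window_minutes = self.window_days * 24 * 60
        return window_minutes * (1.0 - self.target)

    def evaluate(self, good: int, total: int, days_of_data: int) -> Optional[bool]:
        """Return True/False if the SLO is met, or None if it is not yet computable."""
        if days_of_data < self.window_days or total == 0:
            return None  # not enough telemetry to compute the SLO
        return good / total >= self.target


api = AvailabilitySlo("Production API availability", target=0.999, window_days=30)

# 30 days = 43,200 minutes; 0.1% of that is about 43.2 minutes of budget.
print(round(api.error_budget_minutes(), 1))

# With only 12 days of data the result is None rather than a misleading pass.
print(api.evaluate(good=999_100, total=1_000_000, days_of_data=12))
print(api.evaluate(good=999_100, total=1_000_000, days_of_data=30))
```

Returning `None` instead of a boolean keeps "not yet computable" distinct from "met" or "missed", which matches the pre-launch posture described here.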

Incident communication

P0 — production-wide outage / data exposure / billing wrong / audit broken. Status-page first update within 15 min; cadence every 15 min until mitigated.

P1 — single plane degraded for >5% of paying tenants. First update within 60 min; cadence every 30 min.

P2 — sustained anomaly with localized impact. Status-page entry optional.

Tone discipline: factual, infrastructure-grade, no marketing softening. State what happened, what is happening now, what to do, the next update time.
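
The severity rules above could be encoded as data so that the next status-page deadline is computed rather than remembered. This is a minimal sketch under stated assumptions: `CommsPolicy`, `POLICIES`, and `next_update_due` are hypothetical names, and since status-page updates are currently authored manually, nothing here reflects an existing automation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class CommsPolicy:
    first_update: Optional[timedelta]  # deadline for the first status-page update
    cadence: Optional[timedelta]       # interval between subsequent updates


# Values taken from the P0/P1/P2 definitions above.
POLICIES = {
    "P0": CommsPolicy(timedelta(minutes=15), timedelta(minutes=15)),
    "P1": CommsPolicy(timedelta(minutes=60), timedelta(minutes=30)),
    "P2": CommsPolicy(None, None),  # status-page entry is optional
}


def next_update_due(
    severity: str,
    declared_at: datetime,
    last_update_at: Optional[datetime] = None,
) -> Optional[datetime]:
    """Return when the next status-page update is due, or None if none is required."""
    policy = POLICIES[severity]
    if policy.first_update is None or policy.cadence is None:
        return None
    if last_update_at is None:
        return declared_at + policy.first_update
    return last_update_at + policy.cadence


declared = datetime(2025, 1, 1, 12, 0)
print(next_update_due("P0", declared))                                  # first update deadline
print(next_update_due("P1", declared, datetime(2025, 1, 1, 12, 40)))    # follow-up deadline
print(next_update_due("P2", declared))                                  # no update required
```

Each update posted this way would state the value returned by `next_update_due` as its "next update time".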

Disaster Recovery

Targets: RPO 1 hr, RTO 1 hr. Runbooks are documented for forward-correction, Neon point-in-time recovery, pg_dump cold-archive restore, and tenant-scoped restore.

A DR rehearsal cadence is planned but not yet scheduled.
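
Once rehearsals are scheduled, each drill could be scored against the RPO and RTO targets with a check like the one below. It is a sketch only: `DrillResult` and its field names are hypothetical, and the timestamps are made-up inputs, not results from any real rehearsal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RPO_TARGET = timedelta(hours=1)  # maximum tolerated data loss
RTO_TARGET = timedelta(hours=1)  # maximum tolerated time to restore service


@dataclass(frozen=True)
class DrillResult:
    """Hypothetical record of one DR rehearsal."""
    failure_at: datetime           # when the simulated failure occurred
    recovered_to: datetime         # latest point in time the restored data reflects
    service_restored_at: datetime  # when the service was usable again

    @property
    def data_loss(self) -> timedelta:
        """Window of writes lost, compared against the RPO."""
        return self.failure_at - self.recovered_to

    @property
    def downtime(self) -> timedelta:
        """Time from failure to restored service, compared against the RTO."""
        return self.service_restored_at - self.failure_at

    def meets_targets(self) -> bool:
        return self.data_loss <= RPO_TARGET and self.downtime <= RTO_TARGET


drill = DrillResult(
    failure_at=datetime(2025, 1, 1, 12, 0),
    recovered_to=datetime(2025, 1, 1, 11, 50),         # 10 minutes of writes lost
    service_restored_at=datetime(2025, 1, 1, 12, 45),  # 45 minutes of downtime
)
print(drill.data_loss, drill.downtime, drill.meets_targets())
```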
