Product Update
Seaboss now watches itself: 13 self-monitoring signals
azmat • 2 min read
A managed platform that doesn't watch itself isn't really managed. As of
RASF-718 Phase 2, the Seaboss management server runs an ops-watchdog
daemon with 13 active monitoring signals, each one looking for a specific
failure mode that's historically caused us pain.
What it watches
The signals split into three rough categories.
Tenant-fleet health:
- Tenant containers in the unhealthy state for >10 minutes.
- Sidecar sync staleness (no usage events from a tenant in >2 hours).
- Disk pressure on the management server itself.
- TLS cert expiry approaching (Let's Encrypt < 14 days).
Billing + provider integrity:
- Stripe webhook lag (events delayed >5 minutes).
- Stripe failed payments not retried.
- Hetzner orphan labels (VPSes labeled for tenants that no longer exist in our database).
Release lifecycle:
- Gold-release lifecycle stuck states (a release sitting in candidate or compat-passed longer than expected without progress to gold).
- Tenant instance errors trending up.
- OpenClaw upstream release watcher (new versions worth evaluating).
Plus a usage-rollup staleness check and a self-health monitor that fires if any other signal stops checking in.
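To make the threshold-style signals and the self-health monitor concrete, here is a minimal sketch. It is not the daemon's actual code: every type, function name, signal name, and severity choice below is a hypothetical stand-in, and only the thresholds (10 minutes, 2 hours, 14 days) come from the list above.

```typescript
// Illustrative only: every name here is hypothetical, not the ops-watchdog API.
type Severity = "info" | "warning" | "critical";

interface Finding {
  signal: string;
  severity: Severity; // the severities chosen below are assumptions
  detail: string;
}

const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;
const DAY_MS = 24 * HOUR_MS;

// Thresholds quoted in the signal list.
const UNHEALTHY_LIMIT_MS = 10 * MINUTE_MS;
const SIDECAR_STALE_LIMIT_MS = 2 * HOUR_MS;
const CERT_WARN_WINDOW_MS = 14 * DAY_MS;

interface TenantSnapshot {
  tenantId: string;
  unhealthySince: number | null; // epoch ms, or null while the container is healthy
  lastUsageEventAt: number; // epoch ms of the last usage event from the sidecar
}

// Evaluates the two per-tenant threshold checks against a snapshot.
function checkTenant(t: TenantSnapshot, now: number): Finding[] {
  const findings: Finding[] = [];
  if (t.unhealthySince !== null && now - t.unhealthySince > UNHEALTHY_LIMIT_MS) {
    findings.push({
      signal: "tenant-unhealthy",
      severity: "warning",
      detail: t.tenantId + " has been unhealthy for more than 10 minutes",
    });
  }
  if (now - t.lastUsageEventAt > SIDECAR_STALE_LIMIT_MS) {
    findings.push({
      signal: "sidecar-sync-stale",
      severity: "warning",
      detail: t.tenantId + " has sent no usage events for more than 2 hours",
    });
  }
  return findings;
}

// Fires when the certificate expires in fewer than 14 days.
function checkCertExpiry(expiresAt: number, now: number): Finding[] {
  if (expiresAt - now < CERT_WARN_WINDOW_MS) {
    return [{ signal: "tls-cert-expiry", severity: "warning", detail: "certificate expires in under 14 days" }];
  }
  return [];
}

// Self-health: returns the names of signals whose last check-in is older than maxGapMs.
function findSilentSignals(
  lastCheckIn: Record<string, number>,
  maxGapMs: number,
  now: number
): string[] {
  return Object.keys(lastCheckIn).filter((name) => {
    const at = lastCheckIn[name];
    return at !== undefined && now - at > maxGapMs;
  });
}
```

Passing `now` in explicitly, rather than calling `Date.now()` inside each check, keeps the checks deterministic and easy to unit test.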
How it routes
When a signal fires, it goes through the platform-shared
@riseandshinefutures/ops-watchdog library, which:
- Posts to a Discord webhook (#seaboss-ops).
- Files a Linear ticket via the auto-filer (dedup by signal fingerprint).
- Logs to the audit DB.
The severity scale runs info → warning → critical, with mention
scoping that only escalates to @here or @everyone for true
fleet-affecting events.
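The routing steps above can be sketched roughly as follows. This is a guess at the shape, not the library's real interface: the `Alert` fields, the `Sinks` callbacks, the fingerprint format (signal name plus subject), and the mapping of severities to @here and @everyone are all assumptions made for illustration. The sinks are injected as plain functions so the sketch needs no network access.

```typescript
// Illustrative only: hypothetical names, not the @riseandshinefutures/ops-watchdog API.
type Severity = "info" | "warning" | "critical";

interface Alert {
  signal: string;
  severity: Severity;
  subject: string; // e.g. a tenant or release identifier
  fleetAffecting: boolean;
}

interface Sinks {
  postToDiscord(message: string): void;
  fileTicket(fingerprint: string, title: string): void;
  writeAudit(alert: Alert): void;
}

// Assumed fingerprint scheme: the post does not say how fingerprints are built.
function fingerprint(a: Alert): string {
  return a.signal + ":" + a.subject;
}

// Mention scoping: only fleet-affecting alerts ping anyone.
// Which severity maps to which mention is an assumption.
function mentionFor(a: Alert): string {
  if (!a.fleetAffecting) return "";
  if (a.severity === "critical") return "@everyone";
  if (a.severity === "warning") return "@here";
  return "";
}

// Returns a route function that posts, files a deduplicated ticket, and audits.
function createRouter(sinks: Sinks): (a: Alert) => void {
  const ticketed: Record<string, boolean> = {};
  return (a: Alert): void => {
    const text = "[" + a.severity + "] " + a.signal + ": " + a.subject;
    const mention = mentionFor(a);
    sinks.postToDiscord(mention ? mention + " " + text : text);

    // Dedup: file at most one ticket per fingerprint.
    const fp = fingerprint(a);
    if (!ticketed[fp]) {
      ticketed[fp] = true;
      sinks.fileTicket(fp, text);
    }

    sinks.writeAudit(a);
  };
}
```

In this sketch a repeated firing still posts to the channel and writes an audit row, but only the first firing per fingerprint opens a ticket. A real auto-filer would presumably also need to clear a fingerprint once its ticket is closed.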
What we caught early
In the first 24 hours after Phase 2 deployed, the daemon caught two real issues that unit tests didn't:
- A SQL bug in the usage-rollup-staleness query (wrong column name, fingerprinted as a flap, but the underlying query was nonsense).
- A genuine deadlock in the Gold-upgrade flow that would have hit the next Gold release.
Both were fixed same-session. The daemon paid for its build cost on day one.
What's next
Phase 2.B will integrate the daemon with a chat interface for interactive investigation — operator-driven for now, autonomous after a soak period.