OPERATIONS

Observability

The **Health** console (App → Health) aggregates metrics and SLA figures for each provisioned Project. Health endpoints are exposed at three levels (Project, Organization, and platform), all with the same semantics.

Health endpoints

  • Project · /api/projetos/{id}/saude — Project metrics and SLA.
  • Organization · /api/tenants/atual/saude — aggregate of the Organization's Projects.
  • Platform · /api/admin/saude — consolidated view. Restricted to Tesserra operators.
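
The three paths above can be built from one small helper. This is a minimal sketch, not an official client: `health_url` and `BASE_URL` are hypothetical names, and the host is assumed to be the one used in this page's curl example.

```python
from typing import Optional
from urllib.parse import quote

# Assumption: all three endpoints live on the host shown in the curl example.
BASE_URL = "https://tesserra.io"


def health_url(scope: str, project_id: Optional[str] = None) -> str:
    """Return the health endpoint URL for a scope (hypothetical helper)."""
    if scope == "project":
        if not project_id:
            raise ValueError("the project scope requires a project_id")
        # Percent-encode the id so it is safe as a single path segment.
        return f"{BASE_URL}/api/projetos/{quote(project_id, safe='')}/saude"
    if scope == "organization":
        # "atual" resolves to the Organization of the authenticated caller.
        return f"{BASE_URL}/api/tenants/atual/saude"
    if scope == "platform":
        # Restricted to Tesserra operators.
        return f"{BASE_URL}/api/admin/saude"
    raise ValueError(f"unknown scope: {scope!r}")
```

For example, `health_url("organization")` yields the URL used in the reading example on this page.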

Exposed metrics

Available metrics (typical 60 min window unless noted): cpu (%), ram (%), rps (req/s), p50/p95/p99 (ms), erros (%), sla_mes (%), sla_24h (%), uptime (ratio).
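
To make the units concrete, the sketch below converts an SLA percentage into a downtime allowance and an uptime ratio into a percentage. It assumes that sla_mes and sla_24h are time-based availability percentages and that a month has 30 days; neither is confirmed on this page, and the function names are illustrative.

```python
MONTH_MINUTES = 30 * 24 * 60  # assumption: a 30-day month
DAY_MINUTES = 24 * 60


def downtime_budget_minutes(sla_percent: float, period_minutes: float) -> float:
    """Minutes of downtime that an SLA percentage allows over a period."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    return (100.0 - sla_percent) / 100.0 * period_minutes


def uptime_percent(uptime_ratio: float) -> float:
    """Convert the uptime ratio (0.0 to 1.0) into a percentage."""
    return uptime_ratio * 100.0


# 99.97% over a 30-day month leaves roughly 13 minutes of downtime.
print(round(downtime_budget_minutes(99.97, MONTH_MINUTES), 2))  # 12.96
print(uptime_percent(1.0))  # 100.0
```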

Current telemetry is derived from the provisioning plan and external probes. Integration with Prometheus and OpenTelemetry, to collect metrics directly from Compute, is planned.

Reading example

curl https://tesserra.io/api/tenants/atual/saude \
  -H "Authorization: Bearer $TESSERRA_TOKEN"
Response (JSON):
{
  "tenant": "acme",
  "projetos": [
    {
      "id": "...",
      "nome": "Loja Acme",
      "estado": "provisionado",
      "saude": {
        "cpu": 32.4,
        "ram": 51.2,
        "rps": 142,
        "p50": 28, "p95": 96, "p99": 220,
        "erros": 0.12,
        "sla_mes": 99.97,
        "uptime": 1.0
      },
      "zonas": ["uk-wlv-1", "de-ham-1"]
    }
  ],
  "agregado": {
    "incidentes_abertos": 0,
    "ultima_atualizacao": "2026-05-28T20:31:14Z"
  }
}
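
A response like the one above can be reduced to one summary line per Project. The sketch below works on a trimmed copy of the example payload; `summarize` is a hypothetical helper, and the field names are taken from the example rather than from a published schema.

```python
import json
from typing import List

# Copy of the example response above (whitespace condensed).
SAMPLE = """
{
  "tenant": "acme",
  "projetos": [
    {
      "id": "...",
      "nome": "Loja Acme",
      "estado": "provisionado",
      "saude": {"cpu": 32.4, "ram": 51.2, "rps": 142,
                "p50": 28, "p95": 96, "p99": 220,
                "erros": 0.12, "sla_mes": 99.97, "uptime": 1.0},
      "zonas": ["uk-wlv-1", "de-ham-1"]
    }
  ],
  "agregado": {"incidentes_abertos": 0,
               "ultima_atualizacao": "2026-05-28T20:31:14Z"}
}
"""


def summarize(payload: dict) -> List[str]:
    """Build one human-readable line per Project in the payload."""
    lines = []
    for projeto in payload.get("projetos", []):
        # Use .get so a missing metric prints as None instead of raising.
        saude = projeto.get("saude", {})
        lines.append(
            f'{projeto["nome"]} ({projeto["estado"]}): '
            f'p95={saude.get("p95")} ms, '
            f'erros={saude.get("erros")}%, '
            f'sla_mes={saude.get("sla_mes")}%'
        )
    return lines


payload = json.loads(SAMPLE)
for line in summarize(payload):
    print(line)
print("open incidents:", payload["agregado"]["incidentes_abertos"])
```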

Alerts

Insight resources configured with alertas: true send an email to the Organization's managers when any of the following occurs:

  • the error rate (erros) stays above 1% for more than 5 minutes;
  • p95 above 2× the week's baseline, sustained for 10 minutes;
  • a Compute is unavailable for 60 consecutive seconds;
  • monthly SLA falls below the plan commitment.
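
The first two conditions are sustained-threshold rules, which can be sketched as follows. This is an illustration of the idea, not the platform's actual evaluator: `breached_for` is a hypothetical function, samples are assumed to arrive in time order, and how the exact boundary (for instance, precisely 5 minutes) is treated is an assumption.

```python
from typing import Iterable, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def breached_for(samples: Iterable[Sample], threshold: float,
                 min_seconds: float) -> bool:
    """True if consecutive samples stay above threshold for min_seconds.

    Assumes samples are ordered by timestamp. A single sample at or below
    the threshold resets the run.
    """
    run_start: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts  # first sample of a new breach run
            if ts - run_start >= min_seconds:
                return True
        else:
            run_start = None  # breach interrupted, start over
    return False


# Error rate sampled once a minute, above 1% for the whole 5 minutes.
erros = [(t, 1.5) for t in range(0, 301, 60)]
print(breached_for(erros, threshold=1.0, min_seconds=5 * 60))  # True

# p95 compared against 2x an assumed weekly baseline of 96 ms.
baseline_p95 = 96.0
p95 = [(t, 200.0) for t in range(0, 601, 60)]
print(breached_for(p95, threshold=2 * baseline_p95, min_seconds=10 * 60))  # True
```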

Custom webhooks (Slack, Discord, Opsgenie, PagerDuty) are planned.

Audit

Sensitive actions are recorded in auth_eventos (timestamp, user, IP, detail). Managers can export the records as CSV at App → Audit; platform administrators can do the same at Admin → Audit.
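
An exported file can be processed with the standard csv module. The sketch below counts events per user. The header names (timestamp, user, ip, detail) mirror the fields listed above but are an assumption about the export format, and the rows are made-up illustrative data, not real output.

```python
import csv
import io
from collections import Counter

# Illustrative rows only; header names and values are assumptions.
SAMPLE_CSV = """timestamp,user,ip,detail
2026-05-28T20:31:14Z,ana@example.com,203.0.113.10,login
2026-05-28T20:35:02Z,ana@example.com,203.0.113.10,token_created
2026-05-28T21:02:47Z,rui@example.com,198.51.100.7,login
"""


def events_by_user(csv_text: str) -> Counter:
    """Count audit events per user in an exported CSV (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["user"] for row in reader)


for user, count in events_by_user(SAMPLE_CSV).most_common():
    print(user, count)
```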
