
Observability

AionDB starts an HTTP observability server for local health and metrics.

New in v0.3: SQL, graph, and vector retrieval share one engine. See What's New in v0.3.

Defaults

AIONDB_OBSERVABILITY_BIND=127.0.0.1
AIONDB_OBSERVABILITY_PORT=9187
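
A client or test harness that needs the same address can resolve it from the two variables above. The following Rust sketch is only an illustration: the function names are hypothetical, and how AionDB itself parses these variables is not shown on this page.

```rust
use std::env;

// Defaults as documented above.
const DEFAULT_BIND: &str = "127.0.0.1";
const DEFAULT_PORT: u16 = 9187;

/// Builds "host:port" from optional overrides, falling back to the
/// documented defaults when a value is missing or empty.
/// Note: an IPv6 host would need brackets; this sketch does not add them.
fn observability_addr(bind: Option<&str>, port: Option<&str>) -> Result<String, String> {
    let host = bind
        .map(str::trim)
        .filter(|h| !h.is_empty())
        .unwrap_or(DEFAULT_BIND);
    let port = match port.map(str::trim).filter(|p| !p.is_empty()) {
        Some(raw) => raw
            .parse::<u16>()
            .map_err(|e| format!("invalid port {:?}: {}", raw, e))?,
        None => DEFAULT_PORT,
    };
    Ok(format!("{}:{}", host, port))
}

/// Reads AIONDB_OBSERVABILITY_BIND and AIONDB_OBSERVABILITY_PORT from the
/// environment and resolves the address a local client should contact.
fn observability_addr_from_env() -> Result<String, String> {
    let bind = env::var("AIONDB_OBSERVABILITY_BIND").ok();
    let port = env::var("AIONDB_OBSERVABILITY_PORT").ok();
    observability_addr(bind.as_deref(), port.as_deref())
}
```

With neither variable set, the helper resolves to 127.0.0.1:9187, which matches the curl examples on this page.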

Endpoints: /livez, /healthz, /readyz, /info, and /metrics.

Local checks

curl http://127.0.0.1:9187/livez
curl http://127.0.0.1:9187/healthz
curl http://127.0.0.1:9187/readyz
curl http://127.0.0.1:9187/info
curl http://127.0.0.1:9187/metrics

Startup behavior

By default, the server can continue in degraded mode if observability startup fails. To fail startup instead:

AIONDB_OBSERVABILITY_FAIL_FAST=true

Security posture

The server treats public observability exposure as unsafe in v0.1. Keep observability on loopback unless the environment is explicitly secured.
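
A deployment script can check the configured bind address before starting the server. The Rust sketch below is a hypothetical pre-flight helper, not part of AionDB; it treats the literal name localhost as loopback, which is an assumption about the host's resolver.

```rust
use std::net::IpAddr;

/// Returns true when the bind value is a loopback IP address
/// (127.0.0.0/8 or ::1). The name "localhost" is accepted as well,
/// on the assumption that it resolves to loopback on this host.
fn is_loopback_bind(bind: &str) -> bool {
    let bind = bind.trim();
    match bind.parse::<IpAddr>() {
        Ok(ip) => ip.is_loopback(),
        Err(_) => bind.eq_ignore_ascii_case("localhost"),
    }
}
```

A wrapper could refuse to launch, or at least log a warning, when this returns false for AIONDB_OBSERVABILITY_BIND.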

What to check first

Use /livez to check whether the observability HTTP process is alive. Use /readyz for supervisors and load balancers that need a pgwire readiness gate. /healthz is kept as a compatibility alias for readiness. Use /info to inspect basic runtime identity and configuration. Use /metrics for counters and gauges that are useful during local evaluation.
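
A supervisor-side readiness check along these lines might look like the following Rust sketch. It uses only the standard library. The helper names (parse_status_code, probe, is_ready) are hypothetical, and the sketch assumes, without confirmation from this page, that /readyz answers 200 when pgwire is ready and some other status when it is not.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Extracts the numeric status code from the first line of an HTTP/1.x response.
fn parse_status_code(response: &str) -> Option<u16> {
    let status_line = response.lines().next()?;
    let mut parts = status_line.split_whitespace();
    let version = parts.next()?;
    if !version.starts_with("HTTP/") {
        return None;
    }
    parts.next()?.parse().ok()
}

/// Sends a minimal GET request to addr (for example "127.0.0.1:9187")
/// and returns the status code, if the reply could be parsed.
fn probe(addr: &str, path: &str) -> std::io::Result<Option<u16>> {
    let mut stream = TcpStream::connect(addr)?;
    stream.set_read_timeout(Some(Duration::from_secs(2)))?;
    write!(
        stream,
        "GET {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n",
        path, addr
    )?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(parse_status_code(&response))
}

/// Assumption: HTTP 200 on /readyz means the pgwire readiness gate is open.
fn is_ready(addr: &str) -> bool {
    matches!(probe(addr, "/readyz"), Ok(Some(200)))
}
```

Calling is_ready("127.0.0.1:9187") in a retry loop gives a simple startup gate; the same probe function works for /livez when only process liveness matters.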

Exposed metric families

/metrics emits Prometheus-compatible plain-text counters and gauges. The metric names below are stable enough to be used by local evaluation dashboards, but the exact set may grow between releases.

| Family | Names |
| --- | --- |
| Query lifecycle | aiondb_queries_total, aiondb_queries_failed_total, aiondb_rows_returned_total, aiondb_rows_affected_total |
| Query latency | aiondb_query_duration_micros_total, aiondb_query_duration_micros_bucket{le="..."}, aiondb_query_duration_micros_sum, aiondb_query_duration_micros_count, aiondb_query_duration_micros_p50, aiondb_query_duration_micros_p95, aiondb_query_duration_micros_p99 |
| Concurrency | aiondb_query_queue_depth_current, aiondb_query_queue_depth_peak, aiondb_session_lock_wait_total, aiondb_session_lock_wait_micros_total, aiondb_session_lock_wait_micros_max |
| Graph DDL | aiondb_graph_ddl_operations_total |
| Distributed execution | aiondb_distributed_fragments_total, aiondb_distributed_fragment_errors_total |
| pgwire listener | aiondb_pgwire_connections_total, aiondb_pgwire_connections_active, aiondb_pgwire_queries_total, aiondb_pgwire_successful_startups_total, aiondb_pgwire_failed_startups_total, aiondb_pgwire_authentication_failures_total |
| Product contract | aiondb_product_single_node_mode, aiondb_product_clustering_supported, aiondb_product_encryption_at_rest_supported, aiondb_product_backup_restore_supported |
| Distributed topology | aiondb_distributed_remote_nodes_total, aiondb_distributed_remote_nodes_available, aiondb_distributed_remote_circuits_open, aiondb_distributed_remote_circuits_half_open, aiondb_distributed_remote_node_available{node=...}, aiondb_distributed_remote_node_circuit_state{node=...}, aiondb_distributed_remote_node_consecutive_failures{node=...} |
| Control plane | aiondb_distributed_control_plane_nodes_total, aiondb_distributed_control_plane_nodes_live, aiondb_distributed_control_plane_node_live{node=...}, aiondb_distributed_control_plane_shards_total, aiondb_distributed_control_plane_placement_epoch |
| Distributed replication | aiondb_distributed_replication_shards_total, aiondb_distributed_replication_shards_with_live_quorum, aiondb_distributed_replication_shards_without_live_quorum, aiondb_distributed_replication_under_replicated_shards, aiondb_distributed_replication_shards_with_down_voters, aiondb_distributed_replication_shards_with_learners, aiondb_distributed_replication_learner_replicas, aiondb_distributed_replication_shard_live_quorum{shard_id=...}, aiondb_distributed_replication_node_leaders{node_id=...}, aiondb_distributed_replication_node_voters{node_id=...}, aiondb_distributed_replication_node_learners{node_id=...} |
| Replica runtime | aiondb_replica_runtime_sessions_started, aiondb_replica_runtime_sessions_succeeded, aiondb_replica_runtime_sessions_failed, aiondb_replica_runtime_reconnects, aiondb_replica_runtime_wal_bytes_received, aiondb_replica_runtime_standby_status_updates_sent, aiondb_replica_runtime_last_session_started_at_us |
| Replica WAL receiver | aiondb_replica_wal_receiver_write_lsn, aiondb_replica_wal_receiver_flush_lsn, aiondb_replica_wal_receiver_apply_lsn, aiondb_replica_wal_receiver_write_apply_lag_lsn, aiondb_replica_wal_receiver_flush_apply_lag_lsn |

The aiondb_product_* gauges are dimensional booleans that describe what the running binary actually supports. They are useful for dashboards that need to refuse production-readiness claims a build cannot back.
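
A local dashboard or evaluation script can read these values straight from the /metrics text. The Rust sketch below parses unlabelled samples from Prometheus text exposition and derives two things: a query failure ratio and a product-flag check. The function names are hypothetical. It assumes that aiondb_queries_total counts failed queries as well as successful ones, and that a product gauge value of 1 means "supported"; neither is confirmed on this page.

```rust
use std::collections::HashMap;

/// Parses Prometheus text exposition into name -> value for samples
/// without labels. Comment lines (# HELP, # TYPE), blank lines, and
/// labelled samples such as histogram buckets are skipped.
fn parse_unlabelled_samples(body: &str) -> HashMap<String, f64> {
    let mut samples = HashMap::new();
    for line in body.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') || line.contains('{') {
            continue;
        }
        let mut parts = line.split_whitespace();
        if let (Some(name), Some(value)) = (parts.next(), parts.next()) {
            if let Ok(value) = value.parse::<f64>() {
                samples.insert(name.to_string(), value);
            }
        }
    }
    samples
}

/// Fraction of queries that failed, or None when a counter is missing
/// or no query has run yet.
fn query_failure_ratio(samples: &HashMap<String, f64>) -> Option<f64> {
    let total = *samples.get("aiondb_queries_total")?;
    let failed = *samples.get("aiondb_queries_failed_total")?;
    if total > 0.0 {
        Some(failed / total)
    } else {
        None
    }
}

/// True when the named product gauge is present and non-zero
/// (assumption: 1 = supported, 0 = not supported).
fn product_flag(samples: &HashMap<String, f64>, name: &str) -> bool {
    samples.get(name).map_or(false, |v| *v != 0.0)
}
```

A dashboard could, for example, hide any clustering panel unless product_flag(&samples, "aiondb_product_clustering_supported") is true.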

During a benchmark or compatibility run, record:

That information makes a performance or reliability report useful after the machine has been shut down.

Local debugging pattern

Start the server in one terminal:

AIONDB_BOOTSTRAP_USER=dev \
AIONDB_BOOTSTRAP_PASSWORD='ReplaceWithLongUniquePassword42!' \
cargo run -p aiondb-server --bin aiondb -- --ephemeral

Check health from another terminal:

curl -s http://127.0.0.1:9187/livez
curl -s http://127.0.0.1:9187/healthz
curl -s http://127.0.0.1:9187/readyz
curl -s http://127.0.0.1:9187/info

If the database accepts client connections but observability does not respond, check the bind address, port, and whether another process already owns the port.
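
The port-ownership check can be scripted. The Rust sketch below is a hypothetical helper that tries to bind the observability address itself: if the bind succeeds, nothing is listening there, so the observability server either failed to start or is bound elsewhere; if the bind fails with "address in use", some process already owns the port.

```rust
use std::io::ErrorKind;
use std::net::TcpListener;

/// Outcome of trying to bind the observability address from this process.
#[derive(Debug, PartialEq)]
enum BindCheck {
    /// Nothing is listening on the address right now.
    Free,
    /// Another listener (AionDB or an unrelated process) owns the address.
    InUse,
    /// Any other failure, for example an address not assigned to this host.
    Other(String),
}

/// Attempts a short-lived bind; the listener is dropped immediately on success.
fn check_bind(addr: &str) -> BindCheck {
    match TcpListener::bind(addr) {
        Ok(_listener) => BindCheck::Free,
        Err(e) if e.kind() == ErrorKind::AddrInUse => BindCheck::InUse,
        Err(e) => BindCheck::Other(e.to_string()),
    }
}
```

For the defaults on this page the call would be check_bind("127.0.0.1:9187"). An InUse result while curl gets no valid reply suggests a different process holds the port.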

Graph EXPLAIN JSON

AionDB also exposes a structured graph observability payload through SQL EXPLAIN.

For the full JSON contract, field list, examples, and engine helper API, use Explain JSON.

Supported forms:

EXPLAIN (FORMAT JSON)
MATCH (a)-[:KNOWS]->(b)
RETURN b.id;

EXPLAIN (ANALYZE, FORMAT JSON)
MATCH (a)-[:KNOWS]->(b)
RETURN b.id;

FORMAT JSON returns a single-row JSON payload instead of the usual multi-line text plan. ANALYZE keeps the same JSON shape and adds runtime fields such as actual rows, actual selectivity, clause input/output rows, and lightweight timings.
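
Tooling that issues both forms can build the statement from the graph query and a flag. The Rust helper below is a hypothetical convenience, not an AionDB API; it only assembles the two statement shapes shown above.

```rust
/// Wraps a graph query in EXPLAIN (FORMAT JSON), optionally with ANALYZE.
/// A trailing semicolon on the input is normalised so the result always
/// ends with exactly one.
fn explain_json_statement(query: &str, analyze: bool) -> String {
    let options = if analyze {
        "ANALYZE, FORMAT JSON"
    } else {
        "FORMAT JSON"
    };
    let body = query.trim().trim_end_matches(';').trim_end();
    format!("EXPLAIN ({})\n{};", options, body)
}
```

Because ANALYZE keeps the same JSON shape, the same downstream parser can handle the result of either statement.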

Contract

The payload is versioned through its schema_version and format_kind fields.

Top-level fields:

| Field | Meaning |
| --- | --- |
| query_plan_lines | Full text EXPLAIN output preserved as an array of lines. |
| plan_lines | Non-graph EXPLAIN lines. |
| structural_plan_lines | plan_lines without runtime summary lines such as Execution: or Rows Returned:. |
| graph_lines | Human-readable graph observability lines. |
| plan_overview | Stable summary of the non-graph plan root and primary operator. |
| graph_summary | Stable machine-readable summary of graph risk, pivots, joins, and drift. |
| graph_detail | Clause-level and pattern-level graph details. |
| execution_summary | Runtime summary when ANALYZE is used. |

plan_overview

plan_overview is meant to be a small stable entry point for UI and automation.

Fields:

Current plan_category values include:

Current plan_subcategory values include:

graph_summary

graph_summary is the compact machine-readable graph health block.

Important fields include:

Current severity values are:

graph_detail

graph_detail contains:

Each clause can expose:

join_risk can expose:

Each pattern detail can expose:

Provenance and trust

The graph payload distinguishes between values that were observed at runtime and values that were inferred from static plan shape.

Typical provenance fields include:

Current values are:

Under plain EXPLAIN (FORMAT JSON), most runtime-facing fields are inferred or unavailable.

Under EXPLAIN (ANALYZE, FORMAT JSON), the payload can carry observed or mixed values when the engine has real runtime evidence.
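
A client can model this distinction with a small type. The Rust sketch below is hypothetical: the four variants follow the words used in the prose above (observed, mixed, inferred, unavailable), but the exact string spellings in the payload are an assumption and should be checked against the Explain JSON reference.

```rust
/// Client-side view of a provenance value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Provenance {
    Observed,
    Mixed,
    Inferred,
    Unavailable,
}

impl Provenance {
    /// Maps a payload string to a variant. The spellings are assumed;
    /// unknown strings return None so that newer values do not break
    /// older clients.
    fn parse(raw: &str) -> Option<Self> {
        match raw {
            "observed" => Some(Provenance::Observed),
            "mixed" => Some(Provenance::Mixed),
            "inferred" => Some(Provenance::Inferred),
            "unavailable" => Some(Provenance::Unavailable),
            _ => None,
        }
    }

    /// True when at least part of the value comes from real runtime evidence.
    fn has_runtime_evidence(self) -> bool {
        matches!(self, Provenance::Observed | Provenance::Mixed)
    }
}
```

A UI could, for instance, render values without runtime evidence in a muted style so that static estimates are not mistaken for measurements.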

Text EXPLAIN lines

The plain text graph lines also carry provenance on the main summaries and warnings.

Examples:

This is mainly for operator readability. Product logic should still prefer the JSON payload.

execution_summary

execution_summary is present in both modes, but runtime values are only populated under ANALYZE.

Fields:

Under plain EXPLAIN (FORMAT JSON), these runtime fields can be null.

Example

Abbreviated payload:

{
  "schema_version": 1,
  "format_kind": "aiondb.explain_json",
  "plan_overview": {
    "root_kind": "Cypher Query",
    "primary_operator_kind": "Nested Loop",
    "plan_category": "join",
    "plan_subcategory": "nested_loop"
  },
  "graph_summary": {
    "severity": "watch",
    "fragile_pivots": 1,
    "risky_join_clauses": 0,
    "max_fanout": null
  },
  "graph_detail": {
    "summary": {
      "severity": "watch"
    },
    "clauses": [
      {
        "kind": "PipelineMatch",
        "pattern_details": [
          {
            "seed_mode": "label_scan",
            "pivot_decision": "retained_leftmost"
          }
        ]
      }
    ]
  },
  "execution_summary": {
    "kind": "Query",
    "rows_returned": 1,
    "memory_used_bytes": 5283
  }
}

The contract is intended for local tooling, UI work, and future planner feedback loops. Keep clients tolerant to additive fields and reject only on incompatible schema_version or format_kind.
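
The acceptance rule can be sketched as a small gate that looks only at the two version fields and ignores everything else. The names below are hypothetical, and "incompatible" is interpreted here as "schema_version differs from 1", which is an assumption; a real client might accept a range of versions.

```rust
// Values taken from the abbreviated payload above.
const SUPPORTED_SCHEMA_VERSION: u64 = 1;
const EXPECTED_FORMAT_KIND: &str = "aiondb.explain_json";

/// Reasons a client refuses a payload.
#[derive(Debug, PartialEq)]
enum ContractError {
    UnexpectedFormatKind(String),
    UnsupportedSchemaVersion(u64),
}

/// Accepts a payload based on format_kind and schema_version only.
/// No other field is inspected, so additive fields never cause a rejection.
fn check_contract(schema_version: u64, format_kind: &str) -> Result<(), ContractError> {
    if format_kind != EXPECTED_FORMAT_KIND {
        return Err(ContractError::UnexpectedFormatKind(format_kind.to_string()));
    }
    if schema_version != SUPPORTED_SCHEMA_VERSION {
        return Err(ContractError::UnsupportedSchemaVersion(schema_version));
    }
    Ok(())
}
```

The two arguments would typically be read from the parsed JSON document before any other field is touched.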

Consuming the payload

From a SQL client such as psql, FORMAT JSON returns a single text cell that contains the JSON document:

EXPLAIN (FORMAT JSON)
MATCH (a)-[:KNOWS]->(b)
RETURN b.id;

That is the right path for ad hoc inspection, shell tooling, and compatibility with existing SQL clients.

Inside the engine, prefer the structured helpers instead of reparsing text output:

Those helpers:

Minimal Rust sketch:

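use aiondb_engine::engine::api::QueryEngine;

/// Runs EXPLAIN for a fixed graph query through the engine helper and
/// returns the graph summary as a JSON value instead of text lines.
fn load_graph_summary(
    engine: &dyn QueryEngine,
    session: &aiondb_engine::session::SessionHandle,
) -> aiondb_core::error::DbResult<serde_json::Value> {
    engine.execute_explain_graph_summary_json(
        session,
        "MATCH (a)-[:KNOWS]->(b) RETURN b.id",
        // Flag passed through to the helper; see Explain JSON for its exact meaning.
        true,
    )
}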

For UI or telemetry work:

Production-style guidance

For v0.1, do not expose observability directly to the public internet. Put it behind local networking, firewall rules, or a trusted collection agent. Treat metrics as operational data: they may reveal database names, runtime shape, workload volume, or error patterns.

What is not covered yet

The v0.1 observability story is intentionally small. A mature deployment story would also need structured tracing, stable metric names, documented alert thresholds, log redaction policy, dashboard examples, and integration tests for degraded observability startup.