Observability
AionDB starts an HTTP observability server for local health and metrics.
New in v0.3: SQL, graph, and vector retrieval share one engine. See What's New in v0.3.
Defaults
AIONDB_OBSERVABILITY_BIND=127.0.0.1
AIONDB_OBSERVABILITY_PORT=9187
Endpoints:
/livez/healthz/readyz/metrics/info
Local checks
curl http://127.0.0.1:9187/livez
curl http://127.0.0.1:9187/healthz
curl http://127.0.0.1:9187/readyz
curl http://127.0.0.1:9187/info
curl http://127.0.0.1:9187/metrics
Startup behavior
By default, the server can continue in degraded mode if observability startup fails. To fail startup instead:
AIONDB_OBSERVABILITY_FAIL_FAST=true
Security posture
The server treats public observability exposure as unsafe in v0.1. Keep observability on loopback unless the environment is explicitly secured.
What to check first
Use /livez to check whether the observability HTTP process is alive. Use /readyz for supervisors and load balancers that need a pgwire readiness gate. /healthz is kept as a compatibility alias for readiness. Use /info to inspect basic runtime identity and configuration. Use /metrics for counters and gauges that are useful during local evaluation.
Exposed metric families
/metrics emits Prometheus-compatible plain-text counters and gauges. The metric names below are stable enough to be used by local evaluation dashboards, but the exact set may grow between releases.
| Family | Names |
|---|---|
| Query lifecycle | aiondb_queries_total, aiondb_queries_failed_total, aiondb_rows_returned_total, aiondb_rows_affected_total |
| Query latency | aiondb_query_duration_micros_total, aiondb_query_duration_micros_bucket{le="..."}, aiondb_query_duration_micros_sum, aiondb_query_duration_micros_count, aiondb_query_duration_micros_p50, aiondb_query_duration_micros_p95, aiondb_query_duration_micros_p99 |
| Concurrency | aiondb_query_queue_depth_current, aiondb_query_queue_depth_peak, aiondb_session_lock_wait_total, aiondb_session_lock_wait_micros_total, aiondb_session_lock_wait_micros_max |
| Graph DDL | aiondb_graph_ddl_operations_total |
| Distributed execution | aiondb_distributed_fragments_total, aiondb_distributed_fragment_errors_total |
| pgwire listener | aiondb_pgwire_connections_total, aiondb_pgwire_connections_active, aiondb_pgwire_queries_total, aiondb_pgwire_successful_startups_total, aiondb_pgwire_failed_startups_total, aiondb_pgwire_authentication_failures_total |
| Product contract | aiondb_product_single_node_mode, aiondb_product_clustering_supported, aiondb_product_encryption_at_rest_supported, aiondb_product_backup_restore_supported |
| Distributed topology | aiondb_distributed_remote_nodes_total, aiondb_distributed_remote_nodes_available, aiondb_distributed_remote_circuits_open, aiondb_distributed_remote_circuits_half_open, aiondb_distributed_remote_node_available{node=...}, aiondb_distributed_remote_node_circuit_state{node=...}, aiondb_distributed_remote_node_consecutive_failures{node=...} |
| Control plane | aiondb_distributed_control_plane_nodes_total, aiondb_distributed_control_plane_nodes_live, aiondb_distributed_control_plane_node_live{node=...}, aiondb_distributed_control_plane_shards_total, aiondb_distributed_control_plane_placement_epoch |
| Distributed replication | aiondb_distributed_replication_shards_total, aiondb_distributed_replication_shards_with_live_quorum, aiondb_distributed_replication_shards_without_live_quorum, aiondb_distributed_replication_under_replicated_shards, aiondb_distributed_replication_shards_with_down_voters, aiondb_distributed_replication_shards_with_learners, aiondb_distributed_replication_learner_replicas, aiondb_distributed_replication_shard_live_quorum{shard_id=...}, aiondb_distributed_replication_node_leaders{node_id=...}, aiondb_distributed_replication_node_voters{node_id=...}, aiondb_distributed_replication_node_learners{node_id=...} |
| Replica runtime | aiondb_replica_runtime_sessions_started, aiondb_replica_runtime_sessions_succeeded, aiondb_replica_runtime_sessions_failed, aiondb_replica_runtime_reconnects, aiondb_replica_runtime_wal_bytes_received, aiondb_replica_runtime_standby_status_updates_sent, aiondb_replica_runtime_last_session_started_at_us |
| Replica WAL receiver | aiondb_replica_wal_receiver_write_lsn, aiondb_replica_wal_receiver_flush_lsn, aiondb_replica_wal_receiver_apply_lsn, aiondb_replica_wal_receiver_write_apply_lag_lsn, aiondb_replica_wal_receiver_flush_apply_lag_lsn |
The aiondb_product_* gauges are dimensional booleans that describe what the running binary actually supports. They are useful for dashboards that need to refuse production-readiness claims a build cannot back.
During a benchmark or compatibility run, record:
- AionDB commit hash;
- server command and environment variables;
- observability bind address;
- start time;
- benchmark command;
- raw benchmark output.
That information makes a performance or reliability report useful after the machine has been shut down.
Local debugging pattern
Start the server in one terminal:
AIONDB_BOOTSTRAP_USER=dev \
AIONDB_BOOTSTRAP_PASSWORD='ReplaceWithLongUniquePassword42!' \
cargo run -p aiondb-server --bin aiondb -- --ephemeral
Check health from another terminal:
curl -s http://127.0.0.1:9187/livez
curl -s http://127.0.0.1:9187/healthz
curl -s http://127.0.0.1:9187/readyz
curl -s http://127.0.0.1:9187/info
If the database accepts client connections but observability does not respond, check the bind address, port, and whether another process already owns the port.
Graph EXPLAIN JSON
AionDB also exposes a structured graph observability payload through SQL EXPLAIN.
For the full JSON contract, field list, examples, and engine helper API, use Explain JSON.
Supported forms:
EXPLAIN (FORMAT JSON)
MATCH (a)-[:KNOWS]->(b)
RETURN b.id;
EXPLAIN (ANALYZE, FORMAT JSON)
MATCH (a)-[:KNOWS]->(b)
RETURN b.id;
FORMAT JSON returns a single-row JSON payload instead of the usual multi-line text plan. ANALYZE keeps the same JSON shape and adds runtime fields such as actual rows, actual selectivity, clause input/output rows, and lightweight timings.
Contract
The payload is versioned:
schema_version = 1format_kind = "aiondb.explain_json"
Top-level fields:
| Field | Meaning |
|---|---|
query_plan_lines | Full text EXPLAIN output preserved as an array of lines. |
plan_lines | Non-graph EXPLAIN lines. |
structural_plan_lines | plan_lines without runtime summary lines such as Execution: or Rows Returned:. |
graph_lines | Human-readable graph observability lines. |
plan_overview | Stable summary of the non-graph plan root and primary operator. |
graph_summary | Stable machine-readable summary of graph risk, pivots, joins, and drift. |
graph_detail | Clause-level and pattern-level graph details. |
execution_summary | Runtime summary when ANALYZE is used. |
plan_overview
plan_overview is meant to be a small stable entry point for UI and automation.
Fields:
root_lineroot_kindprimary_operator_lineprimary_operator_kindplan_categoryplan_subcategoryline_countstructural_line_countgraph_line_count
Current plan_category values include:
joinscansortaggregatelimitprojectother
Current plan_subcategory values include:
nested_loophash_joinmerge_joinindex_scanseq_scansortaggregatelimitprojectquery_wrapperother
graph_summary
graph_summary is the compact machine-readable graph health block.
Important fields include:
severitypivotable_patternsfragile_pivotsblocked_pivotsselected_non_leftmostselected_non_leftmost_sourcepivot_driver_metrics_sourcemulti_pattern_clausescorrelated_clausesshared_anchor_clausescorrelated_shared_anchorcorrelated_non_sharedshared_anchor_uncorrelatedindependent_multi_scandrift_patternshigh_drift_patternsdrift_metrics_sourcerisky_join_clauseshigh_risk_join_clausesjoin_risk_metrics_sourcemax_fanout
Current severity values are:
okwatchrisk
graph_detail
graph_detail contains:
summaryclauses[]
Each clause can expose:
kindclause_indexoptionalpatternsactual_input_rowsactual_output_rowsactual_selectivityactual_time_msjoin_riskpattern_details[]
join_risk can expose:
severityfanoutbasisjoin_risk_sourcecorrelatedcorrelated_sourceshared_anchorshared_anchor_sourcejoin_shapejoin_shape_sourcepatterns
Each pattern detail can expose:
estimated_rowsactual_rowsestimate_error_ratioestimated_selectivityactual_selectivityactual_time_msseedseed_modeseed_mode_sourceseed_binding_stateseed_binding_state_sourcecorrelated_varscorrelated_vars_sourceseed_constraintsseed_constraints_sourcepattern_runtime_strategypattern_runtime_strategy_sourcepattern_runtime_reasonpattern_runtime_reason_sourcepivot_driverpivot_driver_sourcepivot_reasonpivot_reason_sourcepivot_decisionpivot_decision_sourcepivot_marginpivot_competitionpivot_scoresfirst_relfirst_rel_sourcefirst_rel_modefirst_rel_mode_sourcefirst_rel_constraintsfirst_rel_constraints_sourcebound_varsbound_vars_sourceshapeshape_sourceflagsflags_sourcewarning_severity
Provenance and trust
The graph payload distinguishes between values that were observed at runtime and values that were inferred from static plan shape.
Typical provenance fields include:
query_runtime_sourceselected_non_leftmost_sourcepivot_driver_metrics_sourcedrift_metrics_sourcejoin_risk_metrics_sourcegraph_detail.clauses[*].join_risk.join_risk_sourcegraph_detail.clauses[*].join_risk.correlated_sourcegraph_detail.clauses[*].join_risk.shared_anchor_sourcegraph_detail.clauses[*].join_risk.join_shape_sourceruntime_strategy_sourcepattern_runtime_strategy_sourcepattern_runtime_reason_sourceseed_mode_sourceseed_binding_state_sourcecorrelated_vars_sourceseed_constraints_sourcepivot_driver_sourcepivot_reason_sourcepivot_decision_sourcefirst_rel_sourcefirst_rel_mode_sourcefirst_rel_constraints_sourcebound_vars_sourceflags_sourceshape_source
Current values are:
observedinferredmixedunavailable
Under plain EXPLAIN (FORMAT JSON), most runtime-facing fields are inferred or unavailable.
Under EXPLAIN (ANALYZE, FORMAT JSON), the payload can carry observed or mixed values when the engine has real runtime evidence.
Text EXPLAIN lines
The plain text graph lines also carry provenance on the main summaries and warnings.
Examples:
Graph Summary Severity: ... source=inferred|observed|mixedGraph Planner Warning: ... source=inferred|observedGraph Pivot Hint: ... source=inferred|observedGraph Join Hint: ... source=inferredGraph Access Summary: ... source=inferredGraph Drift Summary: ... source=observedGraph Join Fanout Summary: ... source=observed
This is mainly for operator readability. Product logic should still prefer the JSON payload.
execution_summary
execution_summary is present in both modes, but runtime values are only populated under ANALYZE.
Fields:
kindrows_returnedmemory_used_bytes
Under plain EXPLAIN (FORMAT JSON), these runtime fields can be null.
Example
Abbreviated payload:
{
"schema_version": 1,
"format_kind": "aiondb.explain_json",
"plan_overview": {
"root_kind": "Cypher Query",
"primary_operator_kind": "Nested Loop",
"plan_category": "join",
"plan_subcategory": "nested_loop"
},
"graph_summary": {
"severity": "watch",
"fragile_pivots": 1,
"risky_join_clauses": 0,
"max_fanout": null
},
"graph_detail": {
"summary": {
"severity": "watch"
},
"clauses": [
{
"kind": "PipelineMatch",
"pattern_details": [
{
"seed_mode": "label_scan",
"pivot_decision": "retained_leftmost"
}
]
}
]
},
"execution_summary": {
"kind": "Query",
"rows_returned": 1,
"memory_used_bytes": 5283
}
}
The contract is intended for local tooling, UI work, and future planner feedback loops. Keep clients tolerant to additive fields and reject only on incompatible schema_version or format_kind.
Consuming the payload
From a SQL client such as psql, FORMAT JSON returns a single text cell that contains the JSON document:
EXPLAIN (FORMAT JSON)
MATCH (a)-[:KNOWS]->(b)
RETURN b.id;
That is the right path for ad hoc inspection, shell tooling, and compatibility with existing SQL clients.
Inside the engine, prefer the structured helpers instead of reparsing text output:
QueryEngine::execute_explain_graph_summary_json(session, sql, analyze)QueryEngine::execute_explain_graph_detail_json(session, sql, analyze)
Those helpers:
- prepend
EXPLAINorEXPLAIN ANALYZE; - execute the statement;
- extract the structured graph payload;
- return
serde_json::Value.
Minimal Rust sketch:
use aiondb_engine::engine::api::QueryEngine;
fn load_graph_summary(
engine: &dyn QueryEngine,
session: &aiondb_engine::session::SessionHandle,
) -> aiondb_core::error::DbResult<serde_json::Value> {
engine.execute_explain_graph_summary_json(
session,
"MATCH (a)-[:KNOWS]->(b) RETURN b.id",
true,
)
}
For UI or telemetry work:
- use
graph_summaryfor badges, coarse severity, and top-level warnings; - use
graph_detailfor clause and pattern drill-down; - use
plan_overviewfor quick SQL plan labeling; - keep
query_plan_linesonly for raw rendering or debugging.
Production-style guidance
For v0.1, do not expose observability directly to the public internet. Put it behind local networking, firewall rules, or a trusted collection agent. Treat metrics as operational data: they may reveal database names, runtime shape, workload volume, or error patterns.
What is not covered yet
The v0.1 observability story is intentionally small. A mature deployment story would also need structured tracing, stable metric names, documented alert thresholds, log redaction policy, dashboard examples, and integration tests for degraded observability startup.