# aiondb-cluster

Multi-catalog and multi-storage cluster contracts (ADR-0014). Defines the cluster-level identity types, descriptors, and traits the rest of the engine uses to scope catalog and storage operations to a database, plus the distributed control-plane contracts shared by frontends, planners, shard engines, and the fragment transport. The traits in `distributed` are interface-only; the only concrete control-plane implementation shipped here is `InMemoryControlPlane`.

## cargo

```toml
[dependencies]
aiondb-cluster = { path = "../aiondb-cluster" }
```

## modules

| module | purpose |
| --- | --- |
| `id` | `DatabaseId`, `TablespaceId`. |
| `descriptor` | `DatabaseDescriptor`, `CreateDatabaseRequest`. |
| `role` | `ClusterRoleDescriptor`. |
| `scope` | `DatabaseCatalog`, `DatabaseStorage`, `DatabaseHandle`, `ClusterCatalog`, `InMemoryClusterCatalog`. |
| `distributed` | Cluster control-plane and txn-coordination traits with in-memory implementations. |
| `replication` | Pure replica-placement planners plus maintenance helpers for repair and leadership balancing. |

## key types

### identifiers and descriptors

| type | role |
| --- | --- |
| `DatabaseId` | `u32` newtype, with `CLUSTER = 0` and `DEFAULT = 1`. |
| `TablespaceId` | `u32` newtype, with `PG_DEFAULT = 1663` and `PG_GLOBAL = 1664`. |
| `DatabaseDescriptor` | persisted metadata for a database, modeled on `pg_database`. |
| `CreateDatabaseRequest` | input to `ClusterCatalog::create_database`. |
| `ClusterRoleDescriptor` | cluster-wide role (mirrors `pg_authid` / `pg_roles`). |
| `DatabaseHandle` | `(DatabaseDescriptor, Arc<dyn DatabaseCatalog>, Arc<dyn DatabaseStorage>)` tuple held by the engine per active database. |

### scope traits

| trait | role |
| --- | --- |
| `DatabaseCatalog` | marker for a catalog scoped to one database. |
| `DatabaseStorage` | marker for storage scoped to one database. |
| `ClusterCatalog` | source of truth for which databases exist, plus `ALTER DATABASE` operations. |
| `InMemoryClusterCatalog` | default in-memory `ClusterCatalog` implementation. |

### distributed control-plane

| type | role |
| --- | --- |
| `NodeId` | string-backed cluster node identity. |
| `ShardId` | `u32` shard identifier. |
| `QueryId` | `u128` query identifier. |
| `FragmentId` | `u64` fragment identifier. |
| `CatalogVersion`, `PlacementEpoch`, `SnapshotTimestamp` | monotonic counters. |
| `ReplicaRole` | `Leader`, `Follower`, or `Learner`; leaders and followers are voting replicas. |
| `NodeDescriptor`, `ControlPlaneNodeSnapshot`, `ControlPlaneSnapshot` | membership snapshots. |
| `ShardDescriptor`, `ShardPlacement` | shard placement metadata. |
| `NodeAttributeConstraint`, `ReplicaPlacementPolicy` | placement filters, lease preferences, and failure-domain spread controls. |
| `EpochLease` | leadership lease at a `PlacementEpoch`. |
| `TxnScope`, `TxnParticipant`, `TxnDecision`, `TxnRecord`, `TxnRecordStatus` | distributed transaction record. |
| `FragmentRuntimeOptions` | per-fragment execution caps. |
| `MetadataReader`, `MetadataWriter` | catalog metadata access. |
| `NodeMembership`, `ShardResolver` | membership and placement lookup. |
| `DataPlaneLocalExecutor`, `RemoteExecutor` | local and remote fragment execution. |
| `TxnCoordinator`, `ReplicaController` | distributed txn and replica control. |
| `ControlPlane` | super-trait composing the above. |
| `InMemoryControlPlane`, `InMemoryTxnCoordinator` | in-memory test/default impls. |
| `validate_txn_scope_fragment_metadata` | shared validation helper. |

## replication maintenance

The control plane now enforces Cockroach-style majority safety for read/write shard resolution, lease lookup, and explicit leadership transfer: the current leader must be live and a majority of voting placements must be live. Leadership transfer only targets live voting replicas and preserves learner roles.
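
To make the quorum rule concrete, here is a standalone illustration of the check described above; it is plain arithmetic, not the crate's resolver code:

```rust
/// Illustration only, not the crate's resolver: a shard stays routable for
/// reads/writes when its leader is live and live voting replicas (leader +
/// followers, never learners) form a strict majority of all voters.
fn shard_is_routable(leader_live: bool, live_voters: usize, total_voters: usize) -> bool {
    leader_live && live_voters * 2 > total_voters
}

fn main() {
    // 3 voters, leader live, one follower down: 2/3 live, still routable.
    assert!(shard_is_routable(true, 2, 3));
    // 3 voters, leader live, both followers down: 1/3 live, not routable.
    assert!(!shard_is_routable(true, 1, 3));
    // A down leader blocks resolution even with a live majority of followers.
    assert!(!shard_is_routable(false, 2, 3));
}
```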

The replication module exposes `ReplicationMaintenanceOptions`, `ReplicaRepairOptions`, `LeadershipBalanceOptions`, and `ReplicationStatusSnapshot`. `maintain_replication` first repairs safe replica topology drift, then balances hot leaders across live voters. Engine-level `maintain_distributed_replication_from_config` derives the repair factor and learner throttles from `distributed.sharding`, and `mark_distributed_node_live_and_maintain` runs that maintenance automatically when `distributed.sharding.enabled` and `distributed.sharding.auto_rebalance` are both enabled. The repair planner is load-aware: new replicas are assigned to the least-loaded live candidates so replacement work does not pile onto the first available node.
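
The load-aware assignment can be pictured without the crate's placement types; a minimal standalone illustration (not the real planner) of choosing the least-loaded live candidate for each replacement:

```rust
use std::collections::BTreeMap;

/// Illustration only, not the crate's planner: pick each replacement's target
/// as the least-loaded live candidate and charge the planned replica against
/// it, so consecutive repairs spread out instead of piling onto one node.
fn pick_repair_target(live_candidate_load: &mut BTreeMap<String, usize>) -> Option<String> {
    let target = live_candidate_load
        .iter()
        .min_by_key(|(_, load)| **load)? // BTreeMap iteration order breaks ties deterministically
        .0
        .clone();
    *live_candidate_load.get_mut(&target).expect("target came from the map") += 1;
    Some(target)
}

fn main() {
    let mut load = BTreeMap::from([
        ("node-a".to_owned(), 4), // already hosts four replicas
        ("node-b".to_owned(), 1),
        ("node-c".to_owned(), 1),
    ]);
    // Two replacements go to the two least-loaded nodes, not both to node-b.
    assert_eq!(pick_repair_target(&mut load).as_deref(), Some("node-b"));
    assert_eq!(pick_repair_target(&mut load).as_deref(), Some("node-c"));
}
```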

Leadership balancing is rate-limited from runtime config. `AIONDB_SHARDING_LEADERSHIP_MAX_TRANSFERS_PER_MAINTENANCE` caps transfers planned per maintenance pass, while `AIONDB_SHARDING_LEADERSHIP_MIN_LOAD_DELTA` controls how much hotter a live leader host must be before moving leases for balance. Down-leader failover can still transfer leadership when a live voting quorum remains.
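
A hedged sketch of reading those two knobs, assuming they are plain environment variables; the fallback defaults and integer types below are illustrative, not the engine's real parsing:

```rust
use std::env;

/// Hypothetical helper: the variable names come from this README, but the
/// parsing, defaults, and numeric types are assumptions for illustration.
fn leadership_balance_knobs() -> (u64, u64) {
    let max_transfers = env::var("AIONDB_SHARDING_LEADERSHIP_MAX_TRANSFERS_PER_MAINTENANCE")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(1); // assumed default: at most one lease transfer per pass
    let min_load_delta = env::var("AIONDB_SHARDING_LEADERSHIP_MIN_LOAD_DELTA")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(1); // assumed default: move only if a leader host is hotter by at least 1
    (max_transfers, min_load_delta)
}
```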

`ReplicaRepairMode::LearnerFirst` stages replacements as learners. `maintain_replication_with_caught_up_learners` applies the Cockroach-style second phase by promoting only learners explicitly reported as caught up, then continuing normal repair and leader balance. `caught_up_learner_keys_for_live_nodes` bridges coarser external catch-up signals into shard-specific `ReplicaCatchupKey`s by keeping only live registered learner placements on the reported nodes. The engine wraps this as `maintain_distributed_replication_from_config_with_caught_up_nodes`, so a replica runtime or HA supervisor can report "node X is caught up" without knowing table and shard ids.
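
The bridging step amounts to filtering; an illustration (not the crate's `caught_up_learner_keys_for_live_nodes` implementation) of which staged learners qualify when a supervisor only reports node names:

```rust
/// Illustration only: a staged learner qualifies for promotion when it is a
/// registered learner placement, its node is live, and that node was
/// explicitly reported as caught up by an external signal.
#[derive(Clone, Copy, PartialEq)]
enum Role {
    Leader,
    Follower,
    Learner,
}

struct Placement {
    node: &'static str,
    role: Role,
    node_live: bool,
}

fn promotable<'a>(placements: &'a [Placement], caught_up_nodes: &[&str]) -> Vec<&'a Placement> {
    placements
        .iter()
        .filter(|p| p.role == Role::Learner) // voters are untouched
        .filter(|p| p.node_live) // never promote onto a down node
        .filter(|p| caught_up_nodes.contains(&p.node)) // explicit catch-up reports only
        .collect()
}

fn main() {
    let placements = [
        Placement { node: "node-a", role: Role::Leader, node_live: true },
        Placement { node: "node-b", role: Role::Follower, node_live: true },
        Placement { node: "node-c", role: Role::Learner, node_live: true },
        Placement { node: "node-d", role: Role::Learner, node_live: false },
    ];
    // Only the live learner on a reported node (node-c) is promotable.
    assert_eq!(promotable(&placements, &["node-c", "node-d"]).len(), 1);
}
```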

`ReplicaPlacementPolicy` adds Cockroach-style placement controls to the pure planner. `required_attributes` filter voter candidates, `lease_preferences` select preferred initial leaders/leaseholders, and `spread_attributes` avoid co-locating voting replicas on the same failure-domain values when alternatives exist. The same policy is used by `plan_initial_shard_replica_placements_with_policy`, `plan_replica_repairs_with_policy`, `plan_leadership_balance_preferences_with_policy`, and the engine's configured replication maintenance path.
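
A hedged sketch of what such a policy might look like: the three knob names come from this crate, but the field types, the `NodeAttributeConstraint` shape, and the `Default` fallback below are assumptions rather than the crate's exact API.

```rust
use aiondb_cluster::{NodeAttributeConstraint, ReplicaPlacementPolicy};

// Hypothetical construction; consult the crate docs for the real field shapes.
fn analytics_placement_policy() -> ReplicaPlacementPolicy {
    ReplicaPlacementPolicy {
        // Voters may only land on nodes advertising this attribute.
        required_attributes: vec![NodeAttributeConstraint {
            key: "tier".to_owned(),  // assumed field name
            value: "ssd".to_owned(), // assumed field name
        }],
        // Prefer leaseholders (initial leaders) in the local region.
        lease_preferences: vec![NodeAttributeConstraint {
            key: "region".to_owned(),
            value: "eu-west-1".to_owned(),
        }],
        // Spread voters across distinct values of this failure-domain attribute
        // whenever alternatives exist.
        spread_attributes: vec!["zone".to_owned()],
        ..Default::default()
    }
}
```

The resulting value is what the `_with_policy` planners listed above would consume.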

Primary-side WAL progress can also feed that bridge through `maintain_distributed_replication_from_config_with_primary_progress`. The pgwire replication handler records the startup `application_name` on each connected replica state; operators should set it to the same value as the distributed `NodeId` so caught-up WAL receivers can promote their matching staged learner placements. When HA and distributed auto-rebalance are both enabled, the server's HA tick invokes that primary-progress bridge on the current primary.
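
The name-matching convention can be shown with a toy version of the report the bridge consumes (illustrative only; the real handler tracks richer per-replica state):

```rust
/// Illustration only: the primary-progress bridge keys caught-up reports by
/// distributed NodeId, and the pgwire handler only knows each replica's
/// startup application_name. If the two differ, the matching staged learner
/// is never reported and never promoted.
fn caught_up_node_names(replicas: &[(&str, bool)]) -> Vec<String> {
    // (application_name, wal_caught_up) pairs for connected WAL receivers.
    replicas
        .iter()
        .filter(|(_, wal_caught_up)| *wal_caught_up)
        .map(|(application_name, _)| application_name.to_string()) // must equal the NodeId
        .collect()
}

fn main() {
    let replicas = [("node-b", true), ("replica-7", true), ("node-c", false)];
    // "replica-7" is caught up, but unless a node with that id exists in the
    // control plane, its staged learner placement is never matched.
    assert_eq!(caught_up_node_names(&replicas), vec!["node-b", "replica-7"]);
}
```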

The server /metrics endpoint exports replication health gauges for live quorum, live voters, down voters, total learners, shards with learners, under-replicated shards, and per-node leader/voter/learner load. Replica processes additionally export runtime counters for streaming sessions, reconnects, received WAL bytes, and standby status updates.

## example

```rust
use aiondb_cluster::{
    CreateDatabaseRequest, DatabaseDescriptor, DatabaseId, InMemoryClusterCatalog,
};

let cluster = InMemoryClusterCatalog::new();
cluster.bootstrap_default("postgres").expect("bootstrap default db");
let _all: Vec<DatabaseDescriptor> = cluster
    .list_databases()
    .expect("in-memory catalog never fails on read");

assert_eq!(DatabaseId::DEFAULT.get(), 1);
assert!(DatabaseId::CLUSTER.is_cluster());

let _req = CreateDatabaseRequest {
    name: "analytics".to_owned(),
    owner: "postgres".to_owned(),
    template: None,
    encoding: Some("UTF8".to_owned()),
    collate: None,
    ctype: None,
    tablespace_id: None,
    connection_limit: None,
    is_template: false,
    allow_connections: true,
};
```