# distributed database design

## introduction

this document outlines the design principles and implementation strategies for a distributed database system capable of automatically utilizing the combined storage space and processing power of multiple hosts. the primary goals of this system are:

* reliability through redundancy: achieving fault tolerance by replicating data across multiple hosts.
* parallel query processing: enhancing performance through synchronized parallel execution of queries.
* scalability: expanding storage capacity seamlessly by distributing datasets across hosts.
* automated networking: reducing maintenance efforts through self-managed network configurations.

the intended setup involves installing the software package on each host, specifying the address of at least one seed host, and starting the network daemon application.

## concepts

### objectives

* automatic resource utilization: efficiently leverage the collective resources of all participating hosts.
* unlimited scalability: support an unrestricted number of reader and writer hosts.
* minimal maintenance: facilitate low-effort administration through automation.
* high availability: ensure continuous data accessibility even in the event of host failures.

### core features

* commit log: a robust mechanism for tracking changes and facilitating synchronization.
* sharding and replication: distributing and duplicating data for load balancing and redundancy.
* automated networking: self-configuring network that requires minimal manual intervention.

## commit log

the commit log serves as a critical component for maintaining data consistency across the distributed system.

* atomic operations: data writes and commit log entries occur within the same transaction to ensure atomicity.
* synchronization points: each host stores metadata such as the last reset modification time (`mtime`) and the last commit id.
* structure:

  ~~~
  timestamp | commit_id | change_type | element_id | data
  mtime -> last_commit_id
  ~~~

* change types:
  * `00`: create
  * `01`: delete
  * `10`: update

this structure enables efficient change tracking and synchronization between hosts.

## host identifiers

managing host identifiers is crucial for coordinating operations and maintaining system integrity.

* identifier ranges: each host is allocated a unique identifier range, which can be subdivided for descendant hosts.
* host lifecycle management:
  * offline hosts: hosts that remain offline beyond a certain threshold lose their identifier range.
  * rejoining hosts: reappearing hosts must advertise any changes, such as new ip addresses, and gain network consensus for reacceptance.

## data distribution

effective data distribution is essential for performance optimization and resource utilization.

* partitioning:
  * dataset splitting: the dataset is divided into partitions to be allocated across hosts.
  * partition sizing: partitions are sized appropriately to accommodate hosts with varying storage capacities.
* replication:
  * redundancy: hosts maintain duplicates of certain partitions for data redundancy.
  * responsibility lists: each host keeps track of the partitions it is responsible for duplicating.
* deletion protocol:
  * consensus deletion: deleting an item requires agreement from all hosts responsible for the relevant partitions to ensure data integrity.

## query processing

### routing mechanisms

* universal entry points: any host can accept queries, serving as a gateway to the distributed system.
* host selection: the receiving host determines the optimal data host(s) based on the query.
* proxying and forwarding: queries may be proxied or forwarded to the appropriate host if direct access is unavailable.
* direct connections: while preferable for efficiency, direct connections depend on network accessibility and may not always be feasible.
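the routing flow above can be sketched in a few lines: any host accepts a query, maps the target element to a partition, and either serves the query locally or forwards it to the partition's owner. this is a minimal illustration only; the function and parameter names (`route_query`, `partition_owners`) and the modulo partitioning scheme are assumptions, not part of the design.

```python
# Illustrative sketch of universal-entry-point query routing.
# NUM_PARTITIONS and the modulo scheme are assumed for the example;
# the real system would derive partitions from the ni data structure.

NUM_PARTITIONS = 64  # assumed fixed partition count


def partition_for(element_id: int) -> int:
    """Map an element id to a partition (simple modulo scheme)."""
    return element_id % NUM_PARTITIONS


def route_query(element_id: int, local_host: str, partition_owners: dict) -> tuple:
    """Decide whether the receiving host serves the query itself or
    must proxy/forward it to the host owning the target partition.
    partition_owners maps partition id -> host address."""
    partition = partition_for(element_id)
    owner = partition_owners.get(partition)
    if owner is None:
        return ("error", f"no host owns partition {partition}")
    if owner == local_host:
        return ("local", owner)    # serve the query directly
    return ("forward", owner)      # proxy or forward to the data host
```

in practice the "forward" branch would open (or reuse) a connection to the owner, which is where the direct-connection caveat above applies: if the owner is not directly reachable, the query is proxied through a host that is.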
### execution strategies

* serialization: queries and transactions are serialized into a standardized rpc format for network transmission.
* timeout management: remote transactions are equipped with timeouts to prevent indefinite execution hangs.
* background processing:
  * concurrent queries: the system may initiate background queries to multiple hosts to gather necessary data.
  * transaction synchronization: write transactions spanning multiple hosts require coordinated commits to maintain atomicity.
* state management:
  * remote states: handling of remote transaction states and cursors is necessary to track query progress and results.

## required tools and technologies

* socket management:
  * persistent connections: implementing socket connections with automatic reconnection capabilities to handle network fluctuations.
* network messaging format:
  * standardization: defining a consistent messaging protocol for inter-host communication, possibly using formats like protocol buffers or grpc.

## network architecture

### network information data structure (ni)

the ni is a shared data structure that maintains network topology and host status information.

#### host status

* states:
  * `write`: host is available for write operations.
  * `read`: host is available for read operations.
  * `inactive`: host is temporarily unavailable.
  * `proposal`: host is proposing changes or updates to the network configuration.

#### data attributes

* size and checksum: metadata for data integrity and synchronization.
* voting mechanism:
  * host id: unique identifier for each host in the voting process.
  * source: the originator of a network update or proposal.
* messaging attributes:
  * decay: a counter that decreases over time to manage the lifespan of messages in the network.

#### host metrics

* distance and direction: metrics derived from network latency measurements to optimize routing and data placement.
* status and timestamps: operational state and last update time for each host.
* resource information:
  * partition ids: identifiers of data partitions managed by the host.
  * data ranges: ranges of data identifiers allocated to the host.
  * free space and cpu rating: metrics for load balancing and task allocation.

### host initialization process

* ni retrieval: new hosts request the ni from seed addresses upon startup.
* role determination:
  * analysis: hosts analyze the ni to determine their role (e.g. data storage, replication, query processing).
  * acceptance: hosts request network acceptance, which may involve a consensus or voting mechanism.
* identifier allocation: hosts receive partition identifier ranges and update the ni accordingly.

### synchronization protocols

* commit log alignment:
  * status advertising: hosts with out-of-sync commit logs advertise their `inactive` status.
  * full resynchronization: hosts request a complete data resync to realign with the network state.

### network topology optimization

* directional metrics:
  * latency-based calculations: using response times between hosts to infer network topology.
  * formula (law of cosines, with `ab`, `ac`, `bc` as latency-derived distances between hosts):

    ~~~
    direction(a, c) = arccos((ac^2 + ab^2 - bc^2) / (2 * ac * ab))
    ~~~

  * application: optimizing data distribution and query routing based on calculated distances and directions.

## reliability and assessment

### unreliability scoring

* metrics:
  * timeout tracking: incrementing unreliability scores upon communication timeouts.
  * thresholds: setting limits that, when exceeded, mark a host as `unavailable`.
* dynamic adjustments:
  * score reduction: decreasing unreliability scores after successful communications.
  * status updates: regularly updating host statuses based on current unreliability scores.

### network status evaluation

* retry logic:
  * counts and timeouts: implementing retry mechanisms with defined counts and timeout intervals.
* historical data:
  * since-time tracking: monitoring unreliability scores over time to detect patterns.
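the unreliability-scoring rules above (timeouts increment the score, successes decrement it, a threshold flips the host to `unavailable`) can be sketched as a small state holder. the class name, field names, and threshold value are illustrative assumptions.

```python
# Minimal sketch of per-host unreliability scoring.
# The threshold value of 5 is an arbitrary assumption for illustration.

class HostReliability:
    def __init__(self, threshold: int = 5):
        self.score = 0              # current unreliability score
        self.threshold = threshold  # score at which the host is marked unavailable

    def record_timeout(self) -> None:
        """A communication timeout increases the unreliability score."""
        self.score += 1

    def record_success(self) -> None:
        """A successful exchange reduces the score, but never below zero."""
        self.score = max(0, self.score - 1)

    @property
    def status(self) -> str:
        """Hosts at or past the threshold are marked unavailable."""
        return "unavailable" if self.score >= self.threshold else "available"
```

a periodic status-update task (the "status updates" bullet above) would then publish `status` into the ni, and the since-time tracking would log `(timestamp, score)` pairs to detect flapping hosts.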
## challenges and current issues

### global element identification

* key issue: establishing a method for globally unique element identifiers without incurring significant storage overhead.
* considerations:
  * inter-host relations: elements may need to reference others stored on different hosts.
  * identifier size: balancing uniqueness with efficiency; large identifiers like uuids may be impractical.
  * record type uniqueness: determining whether record type ids need to be globally unique or can be localized.

### potential solutions

* identifier mapping:
  * short ids: mapping long, globally unique identifiers to shorter, locally unique ones within each host.
* partitioned sequences:
  * host-based ranges: allocating specific identifier ranges to hosts to prevent collisions.
* type-specific b-trees:
  * separate structures: using individual b-trees for each record type to improve organization and search efficiency.

## architectural patterns

### multi-process databases

* parallel processing:
  * independent databases: running separate database instances or processes for different data segments.
  * performance gains: leveraging multi-core systems for concurrent processing.
* data isolation:
  * separate lmdb instances: using multiple lightning memory-mapped database (lmdb) instances to isolate workloads.

### cluster abstraction

* hierarchical structure:
  * leaders and followers: implementing a hierarchy where leaders manage clusters of followers.
  * role flexibility: leaders can also act as followers in higher-level clusters.
* operational dynamics:
  * task delegation: leaders assign tasks and manage data distribution among followers.
  * fault tolerance: hierarchical clustering enhances resilience against host failures.

### linked list of hosts

* sequential resource utilization:
  * write operations: if one host reaches capacity, operations proceed to the next host in the sequence.
  * read operations: hosts forward requests to the next in line if the data is unavailable locally.
* simplified management:
  * ease of implementation: linear structure simplifies routing logic and host management.

## notes and considerations

* b-tree limitations:
  * prefix deduplication: standard b-trees may not efficiently handle common key prefixes, leading to space inefficiencies.
* data prefixing strategies:
  * separate b-trees: utilizing multiple b-trees to prefix data without significant space overhead.
  * process isolation: employing multiple processes or databases to manage distinct data partitions.
* application-level partitioning:
  * custom solutions: applications may implement their own partitioning schemes based on specific data requirements.
* client-server architecture:
  * necessity in distribution: a distributed database inherently requires a client-server model due to network communication.
* data locality:
  * optimal distribution: while keeping related data together is beneficial, distribution is sometimes necessary for scalability.

## unsorted thoughts

* partition assignments:
  * single partition ids: assigning individual partition ids to hosts, expanding to ranges as they become primary writers.
* data clustering:
  * logical grouping: keeping similar records together, such as those of the same type or creation time.
* mandatory distribution:
  * scalability requirement: distribution is essential when local storage is insufficient or to enhance performance.
* network mesh formation:
  * neighbor awareness: hosts need to be aware of neighboring hosts to form a resilient network mesh.

## tentatively rejected ideas

### global id registry / global sequence

* concept:
  * centralized allocation: a set of hosts or a cluster responsible for issuing global identifier ranges.
* drawbacks:
  * single point of failure: the unavailability of the registry cluster halts identifier allocation.
  * state loss: if the cluster is lost, the entire identifier range and allocation state may be unrecoverable.
  * scalability issues: centralization conflicts with the distributed nature of the system.

### use of uuids

* reasons for rejection:
  * size overhead: uuids are 128 bits, significantly increasing storage and transmission costs.
  * collision handling: while uuid collisions are rare, the system must account for potential duplicates.
  * performance impact: larger identifiers can degrade indexing and query performance.
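the "partitioned sequences" alternative mentioned under potential solutions avoids both rejected approaches: each host allocates short ids from a range it owns, and hands half of its unused range to a descendant host, so no central registry and no 128-bit identifiers are needed. a minimal sketch, assuming half-open ranges and the illustrative names `IdRange`, `allocate`, and `split`:

```python
# Sketch of host-based identifier ranges ("partitioned sequences").
# Range bounds and names are illustrative assumptions, not a spec.

class IdRange:
    def __init__(self, start: int, end: int):
        self.start, self.end = start, end  # half-open range [start, end)
        self.next_id = start               # next id this host will hand out

    def allocate(self) -> int:
        """Hand out the next id from this host's range; ids are unique
        network-wide because ranges never overlap."""
        if self.next_id >= self.end:
            raise RuntimeError("range exhausted; request a new range")
        nid = self.next_id
        self.next_id += 1
        return nid

    def split(self) -> "IdRange":
        """Give the upper half of the unused range to a descendant host,
        so new hosts get collision-free ids without central coordination."""
        mid = (self.next_id + self.end) // 2
        child = IdRange(mid, self.end)
        self.end = mid
        return child
```

a host that exhausts its range would then have to request a fresh range through network consensus, which ties this scheme back to the host-identifier lifecycle described earlier (offline hosts losing their range, rejoining hosts renegotiating one).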