2024-10-13

distributed database design

introduction

this document outlines the design principles and implementation strategies for a distributed database system capable of automatically utilizing the combined storage space and processing power of multiple hosts. the primary goals of this system are:

  • reliability through redundancy: achieving fault tolerance by replicating data across multiple hosts.
  • parallel query processing: enhancing performance through synchronized parallel execution of queries.
  • scalability: expanding storage capacity seamlessly by distributing datasets across hosts.
  • automated networking: reducing maintenance efforts through self-managed network configurations.

the intended setup involves installing the software package on each host, specifying the address of at least one seed host, and starting the network daemon application.

concepts

objectives

  • automatic resource utilization: efficiently leverage the collective resources of all participating hosts.
  • unlimited scalability: support an unrestricted number of reader and writer hosts.
  • minimal maintenance: facilitate low-effort administration through automation.
  • high availability: ensure continuous data accessibility even in the event of host failures.

core features

  • commit log: a robust mechanism for tracking changes and facilitating synchronization.
  • sharding and replication: distributing and duplicating data for load balancing and redundancy.
  • automated networking: self-configuring network that requires minimal manual intervention.

commit log

the commit log serves as a critical component for maintaining data consistency across the distributed system.

  • atomic operations: data writes and commit log entries occur within the same transaction to ensure atomicity.
  • synchronization points: each host stores metadata such as the last reset modification time (mtime) and the last commit id.
  • structure:
timestamp | commit_id | change_type | element_id | data
mtime -> last_commit_id
  • change types:

    • 00: create

    • 01: delete

    • 10: update

this structure enables efficient change tracking and synchronization between hosts.
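the atomicity requirement above can be sketched with sqlite as a stand-in storage engine: the data write and the commit log entry share one transaction, so either both persist or neither does. table and function names here are illustrative, not part of the design.

```python
import sqlite3

# change type codes from the table above
CREATE, DELETE, UPDATE = "00", "01", "10"

def apply_change(conn, commit_id, change_type, element_id, data, timestamp):
    """Write the data row and its commit-log entry in one transaction."""
    with conn:  # sqlite3 context manager commits, or rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO elements (id, data) VALUES (?, ?)",
            (element_id, data),
        )
        conn.execute(
            "INSERT INTO commit_log (ts, commit_id, change_type, element_id, data)"
            " VALUES (?, ?, ?, ?, ?)",
            (timestamp, commit_id, change_type, element_id, data),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elements (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute(
    "CREATE TABLE commit_log"
    " (ts REAL, commit_id INTEGER, change_type TEXT, element_id INTEGER, data TEXT)"
)
apply_change(conn, 1, CREATE, 42, "hello", 1728800000.0)
```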

host identifiers

managing host identifiers is crucial for coordinating operations and maintaining system integrity.

  • identifier ranges: each host is allocated a unique identifier range, which can be subdivided for descendant hosts.
  • host lifecycle management:

    • offline hosts: hosts that remain offline beyond a certain threshold lose their identifier range.

    • rejoining hosts: reappearing hosts must advertise any changes, such as new ip addresses, and gain network consensus for reacceptance.
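range subdivision for descendant hosts could look like the following sketch, which splits a half-open identifier range into contiguous sub-ranges whose sizes differ by at most one. the function name and signature are assumptions for illustration.

```python
def subdivide(lo, hi, parts):
    """Split the identifier range [lo, hi) into `parts` contiguous
    sub-ranges for descendant hosts; sizes differ by at most one."""
    if parts < 1 or hi <= lo:
        raise ValueError("range must be non-empty and parts >= 1")
    span, extra = divmod(hi - lo, parts)
    out, start = [], lo
    for i in range(parts):
        end = start + span + (1 if i < extra else 0)
        out.append((start, end))
        start = end
    return out
```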

data distribution

effective data distribution is essential for performance optimization and resource utilization.

  • partitioning:

    • dataset splitting: the dataset is divided into partitions to be allocated across hosts.
    • partition sizing: partitions are sized appropriately to accommodate hosts with varying storage capacities.
  • replication:

    • redundancy: hosts maintain duplicates of certain partitions for data redundancy.
    • responsibility lists: each host keeps track of the partitions it is responsible for duplicating.
  • deletion protocol:

    • consensus deletion: deleting an item requires agreement from all hosts responsible for the relevant partitions to ensure data integrity.
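one minimal way to build the responsibility lists described above is a round-robin placement with a fixed replication factor; real placement would also weigh free space and cpu rating. this is a sketch under that simplification.

```python
def assign_partitions(partitions, hosts, replicas=2):
    """Build each host's responsibility list: every partition is stored
    on `replicas` distinct hosts, spread round-robin for load balancing."""
    if replicas > len(hosts):
        raise ValueError("cannot place more replicas than hosts")
    responsibility = {h: [] for h in hosts}
    for i, p in enumerate(partitions):
        for r in range(replicas):
            responsibility[hosts[(i + r) % len(hosts)]].append(p)
    return responsibility
```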

query processing

routing mechanisms

  • universal entry points: any host can accept queries, serving as a gateway to the distributed system.
  • host selection: the receiving host determines the optimal data host(s) based on the query.
  • proxying and forwarding: queries may be proxied or forwarded to the appropriate host if direct access is unavailable.
  • direct connections: while preferable for efficiency, direct connections depend on network accessibility and may not always be feasible.
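host selection at the entry point could be a range lookup: given the sorted start of each host's identifier range, the receiving host binary-searches for the range containing the element id. the parameter names are assumptions.

```python
import bisect

def route(element_id, range_starts, range_hosts):
    """Pick the data host for an element id; `range_starts` is the sorted
    list of range start ids, `range_hosts` the host owning each range."""
    idx = bisect.bisect_right(range_starts, element_id) - 1
    if idx < 0:
        raise KeyError("id below the lowest allocated range")
    return range_hosts[idx]
```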

execution strategies

  • serialization: queries and transactions are serialized into a standardized rpc format for network transmission.
  • timeout management: remote transactions are equipped with timeouts to prevent indefinite execution hangs.
  • background processing:

    • concurrent queries: the system may initiate background queries to multiple hosts to gather necessary data.
    • transaction synchronization: write transactions spanning multiple hosts require coordinated commits to maintain atomicity.
  • state management:

    • remote states: handling of remote transaction states and cursors is necessary to track query progress and results.
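the timeout requirement can be sketched by running a (stand-in) remote call under a deadline, so a hung transaction raises instead of blocking the coordinator indefinitely; a real implementation would cancel the remote transaction as well.

```python
import concurrent.futures

def run_remote(fn, timeout_s):
    """Run a stand-in remote call under a deadline; raises TimeoutError
    instead of hanging if the remote side never answers."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    fut = pool.submit(fn)
    try:
        return fut.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError("remote transaction exceeded its deadline")
    finally:
        pool.shutdown(wait=False)  # do not block on the hung worker
```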

required tools and technologies

  • socket management:

    • persistent connections: implementing socket connections with automatic reconnection capabilities to handle network fluctuations.
  • network messaging format:

    • standardization: defining a consistent messaging protocol for inter-host communication, possibly using formats like protocol buffers or grpc.
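a crude stand-in for the persistent-connection layer is a send helper that reconnects with exponential backoff and uses a length-prefixed framing; the framing and the 4-byte acknowledgement are assumptions, not a defined wire format.

```python
import socket
import time

def send_with_reconnect(addr, payload, attempts=3, backoff_s=0.1):
    """Send one length-prefixed message, reconnecting on failure,
    and return the peer's short acknowledgement."""
    last_err = None
    for i in range(attempts):
        try:
            with socket.create_connection(addr, timeout=2.0) as s:
                s.sendall(len(payload).to_bytes(4, "big") + payload)
                return s.recv(4)  # wait for a 4-byte acknowledgement
        except OSError as e:
            last_err = e
            time.sleep(backoff_s * (2 ** i))  # exponential backoff
    raise ConnectionError(f"gave up after {attempts} attempts") from last_err
```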

network architecture

network information data structure (ni)

the ni is a shared data structure that maintains network topology and host status information.

host status

  • states:

    • write: host is available for write operations.

    • read: host is available for read operations.

    • inactive: host is temporarily unavailable.

    • proposal: host is proposing changes or updates to the network configuration.

data attributes

  • size and checksum: metadata for data integrity and synchronization.
  • voting mechanism:

    • host id: unique identifier for each host in the voting process.
    • source: the originator of a network update or proposal.
  • messaging attributes:

    • decay: a counter that decreases over time to manage the lifespan of messages in the network.

host metrics

  • distance and direction: metrics derived from network latency measurements to optimize routing and data placement.
  • status and timestamps: operational state and last update time for each host.
  • resource information:

    • partition ids: identifiers of data partitions managed by the host.

    • data ranges: ranges of data identifiers allocated to the host.

    • free space and cpu rating: metrics for load balancing and task allocation.
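the per-host record in the ni, together with message decay, might be sketched as follows; field names are illustrative and not a wire format.

```python
from dataclasses import dataclass, field

@dataclass
class HostEntry:
    """One host's record in the shared network information (ni)."""
    host_id: int
    status: str = "inactive"          # write | read | inactive | proposal
    last_update: float = 0.0
    distance: float = float("inf")    # derived from latency measurements
    partition_ids: list = field(default_factory=list)
    id_range: tuple = (0, 0)          # allocated data identifier range
    free_space: int = 0
    cpu_rating: float = 0.0

def decay_messages(messages, amount=1):
    """Age (message, decay) pairs; drop any whose counter hits zero."""
    return [(m, d - amount) for m, d in messages if d - amount > 0]
```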

host initialization process

  • ni retrieval: new hosts request the ni from seed addresses upon startup.
  • role determination:

    • analysis: hosts analyze the ni to determine their role (e.g., data storage, replication, query processing).
    • acceptance: hosts request network acceptance, which may involve a consensus or voting mechanism.
  • identifier allocation: hosts receive partition identifier ranges and update the ni accordingly.

synchronization protocols

  • commit log alignment:

    • status advertising: hosts with out-of-sync commit logs advertise their inactive status.

    • full resynchronization: hosts request a complete data resync to realign with the network state.

network topology optimization

  • directional metrics:

    • latency-based calculations: using response times between hosts to infer network topology.
    • formula:
direction(a, c) = arccos((ac^2 + ab^2 - bc^2) / (2 * ac * ab))
  • application: optimizing data distribution and query routing based on calculated distances and directions.
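the formula above is the law of cosines solved for the angle at host a, treating pairwise latencies as distances. a sketch, with the cosine clamped to guard against floating-point drift when the three latencies are nearly collinear:

```python
import math

def direction(ab, ac, bc):
    """Angle (in degrees) at host a between hosts b and c, computed
    from the three pairwise latency 'distances' via the law of cosines."""
    cos_a = (ac * ac + ab * ab - bc * bc) / (2 * ac * ab)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
```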

reliability and assessment

unreliability scoring

  • metrics:

    • timeout tracking: incrementing unreliability scores upon communication timeouts.
    • thresholds: setting limits that, when exceeded, mark a host as unavailable.
  • dynamic adjustments:

    • score reduction: decreasing unreliability scores after successful communications.

    • status updates: regularly updating host statuses based on current unreliability scores.
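the scoring scheme above can be sketched as a small counter per host: timeouts raise the score, successes lower it, and a host above the threshold is treated as unavailable. the class and threshold value are assumptions for illustration.

```python
class Reliability:
    """Per-host unreliability scores with a configurable threshold."""

    def __init__(self, threshold=5):
        self.scores = {}
        self.threshold = threshold

    def timeout(self, host):
        """A communication timeout increments the host's score."""
        self.scores[host] = self.scores.get(host, 0) + 1

    def success(self, host):
        """A successful communication decrements it, floored at zero."""
        self.scores[host] = max(0, self.scores.get(host, 0) - 1)

    def available(self, host):
        """A host is available while its score is within the threshold."""
        return self.scores.get(host, 0) <= self.threshold
```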

network status evaluation

  • retry logic:

    • counts and timeouts: implementing retry mechanisms with defined counts and timeout intervals.
  • historical data:

    • since-time tracking: monitoring unreliability scores over time to detect patterns.

challenges and current issues

global element identification

  • key issue: establishing a method for globally unique element identifiers without incurring significant storage overhead.
  • considerations:

    • inter-host relations: elements may need to reference others stored on different hosts.

    • identifier size: balancing uniqueness with efficiency; large identifiers like uuids may be impractical.

    • record type uniqueness: determining whether record type ids need to be globally unique or can be localized.

potential solutions

  • identifier mapping:

    • short ids: mapping long, globally unique identifiers to shorter, locally unique ones within each host.
  • partitioned sequences:

    • host-based ranges: allocating specific identifier ranges to hosts to prevent collisions.
  • type-specific b-trees:

    • separate structures: using individual b-trees for each record type to improve organization and search efficiency.
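the identifier-mapping idea can be sketched as a per-host bidirectional table assigning each long global identifier a short, locally sequential one; persistence and garbage collection are out of scope here.

```python
class IdMap:
    """Map long global identifiers to short local ones within a host."""

    def __init__(self):
        self.to_short = {}   # long id -> short id
        self.to_long = []    # short id (index) -> long id

    def shorten(self, long_id):
        """Return the existing short id, or allocate the next one."""
        if long_id not in self.to_short:
            self.to_short[long_id] = len(self.to_long)
            self.to_long.append(long_id)
        return self.to_short[long_id]

    def expand(self, short_id):
        """Recover the global identifier for a local short id."""
        return self.to_long[short_id]
```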

architectural patterns

multi-process databases

  • parallel processing:

    • independent databases: running separate database instances or processes for different data segments.
    • performance gains: leveraging multi-core systems for concurrent processing.
  • data isolation:

    • separate lmdb instances: using multiple lightning memory-mapped database (lmdb) instances to isolate workloads.

cluster abstraction

  • hierarchical structure:

    • leaders and followers: implementing a hierarchy where leaders manage clusters of followers.
    • role flexibility: leaders can also act as followers in higher-level clusters.
  • operational dynamics:

    • task delegation: leaders assign tasks and manage data distribution among followers.

    • fault tolerance: hierarchical clustering enhances resilience against host failures.

linked list of hosts

  • sequential resource utilization:

    • write operations: if one host reaches capacity, operations proceed to the next host in the sequence.
    • read operations: hosts forward requests to the next in line if the data is unavailable locally.
  • simplified management:

    • ease of implementation: linear structure simplifies routing logic and host management.
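the write path of the linked-list pattern reduces to walking the chain until a host has capacity; representing hosts as (name, free_bytes) pairs is an assumption for this sketch.

```python
def linked_write(hosts, item_size):
    """Place a write on the first host in the chain with enough free
    space, deducting the item's size; hosts are (name, free_bytes)."""
    for i, (name, free) in enumerate(hosts):
        if free >= item_size:
            hosts[i] = (name, free - item_size)
            return name
    raise RuntimeError("all hosts in the chain are full")
```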

notes and considerations

  • b-tree limitations:

    • prefix deduplication: standard b-trees may not efficiently handle common key prefixes, leading to space inefficiencies.
  • data prefixing strategies:

    • separate b-trees: utilizing multiple b-trees to prefix data without significant space overhead.
    • process isolation: employing multiple processes or databases to manage distinct data partitions.
  • application-level partitioning:

    • custom solutions: applications may implement their own partitioning schemes based on specific data requirements.
  • client-server architecture:

    • necessity in distribution: a distributed database inherently requires a client-server model due to network communication.
  • data locality:

    • optimal distribution: while keeping related data together is beneficial, distribution is sometimes necessary for scalability.

unsorted thoughts

  • partition assignments:

    • single partition ids: assigning individual partition ids to hosts, expanding to ranges as they become primary writers.
  • data clustering:

    • logical grouping: keeping similar records together, such as those of the same type or creation time.
  • mandatory distribution:

    • scalability requirement: distribution is essential when local storage is insufficient or to enhance performance.
  • network mesh formation:

    • neighbor awareness: hosts need to be aware of neighboring hosts to form a resilient network mesh.

tentatively rejected ideas

global id registry / global sequence

  • concept:

    • centralized allocation: a set of hosts or a cluster responsible for issuing global identifier ranges.
  • drawbacks:

    • single point of failure: the unavailability of the registry cluster halts identifier allocation.

    • state loss: if the cluster is lost, the entire identifier range and allocation state may be unrecoverable.

    • scalability issues: centralization conflicts with the distributed nature of the system.

use of uuids

  • reasons for rejection:

    • size overhead: uuids are 128 bits, significantly increasing storage and transmission costs.
    • collision handling: while uuid collisions are rare, the system must account for potential duplicates.
    • performance impact: larger identifiers can degrade indexing and query performance.