# distributed database design

## introduction

this document outlines the design principles and implementation strategies for a distributed database system capable of automatically utilizing the combined storage space and processing power of multiple hosts. the primary goals of this system are:

* reliability through redundancy: achieving fault tolerance by replicating data across multiple hosts.
* parallel query processing: enhancing performance through synchronized parallel execution of queries.
* scalability: expanding storage capacity seamlessly by distributing datasets across hosts.
* automated networking: reducing maintenance efforts through self-managed network configurations.

the intended setup involves installing the software package on each host, specifying the address of at least one seed host, and starting the network daemon application.

## concepts

### objectives

* automatic resource utilization: efficiently leverage the collective resources of all participating hosts.
* unlimited scalability: support an unrestricted number of reader and writer hosts.
* minimal maintenance: facilitate low-effort administration through automation.
* high availability: ensure continuous data accessibility even in the event of host failures.

### core features

* commit log: a robust mechanism for tracking changes and facilitating synchronization.
* sharding and replication: distributing and duplicating data for load balancing and redundancy.
* automated networking: self-configuring network that requires minimal manual intervention.

## commit log

the commit log serves as a critical component for maintaining data consistency across the distributed system.

* atomic operations: data writes and commit log entries occur within the same transaction to ensure atomicity.
* synchronization points: each host stores metadata such as the last reset modification time (`mtime`) and the last commit id.
* structure:

  ~~~
  timestamp | commit_id | change_type | element_id | data
  mtime -> last_commit_id
  ~~~

* change types:
  * `00`: create
  * `01`: delete
  * `10`: update

this structure enables efficient change tracking and synchronization between hosts.

## host identifiers

managing host identifiers is crucial for coordinating operations and maintaining system integrity.

* identifier ranges: each host is allocated a unique identifier range, which can be subdivided for descendant hosts.
* host lifecycle management:
  * offline hosts: hosts that remain offline beyond a certain threshold lose their identifier range.
  * rejoining hosts: reappearing hosts must advertise any changes, such as new ip addresses, and gain network consensus for reacceptance.

## data distribution

effective data distribution is essential for performance optimization and resource utilization.

* partitioning:
  * dataset splitting: the dataset is divided into partitions to be allocated across hosts.
  * partition sizing: partitions are sized appropriately to accommodate hosts with varying storage capacities.
* replication:
  * redundancy: hosts maintain duplicates of certain partitions for data redundancy.
  * responsibility lists: each host keeps track of the partitions it is responsible for duplicating.
* deletion protocol:
  * consensus deletion: deleting an item requires agreement from all hosts responsible for the relevant partitions to ensure data integrity.

## query processing

### routing mechanisms

* universal entry points: any host can accept queries, serving as a gateway to the distributed system.
* host selection: the receiving host determines the optimal data host(s) based on the query.
* proxying and forwarding: queries may be proxied or forwarded to the appropriate host if direct access is unavailable.
* direct connections: while preferable for efficiency, direct connections depend on network accessibility and may not always be feasible.
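the routing flow above can be sketched in a few lines: any host accepts a query, maps the target element to a partition, and either serves the query locally or forwards it to the partition's owner. this is a minimal illustration only; the function and parameter names (`route_query`, `partition_owners`) and the modulo partitioning scheme are assumptions, not part of the design.

```python
# Illustrative sketch of universal-entry-point query routing.
# NUM_PARTITIONS and the modulo scheme are assumed for the example;
# the real system would derive partitions from the ni data structure.

NUM_PARTITIONS = 64  # assumed fixed partition count


def partition_for(element_id: int) -> int:
    """Map an element id to a partition (simple modulo scheme)."""
    return element_id % NUM_PARTITIONS


def route_query(element_id: int, local_host: str, partition_owners: dict) -> tuple:
    """Decide whether the receiving host serves the query itself or
    must proxy/forward it to the host owning the target partition.
    partition_owners maps partition id -> host address."""
    partition = partition_for(element_id)
    owner = partition_owners.get(partition)
    if owner is None:
        return ("error", f"no host owns partition {partition}")
    if owner == local_host:
        return ("local", owner)    # serve the query directly
    return ("forward", owner)      # proxy or forward to the data host
```

in practice the "forward" branch would open (or reuse) a connection to the owner, which is where the direct-connection caveat above applies: if the owner is not directly reachable, the query is proxied through a host that is.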
### execution strategies

* serialization: queries and transactions are serialized into a standardized rpc format for network transmission.
* timeout management: remote transactions are equipped with timeouts to prevent indefinite execution hangs.
* background processing:
  * concurrent queries: the system may initiate background queries to multiple hosts to gather necessary data.
  * transaction synchronization: write transactions spanning multiple hosts require coordinated commits to maintain atomicity.
* state management:
  * remote states: handling of remote transaction states and cursors is necessary to track query progress and results.

## required tools and technologies

* socket management:
  * persistent connections: implementing socket connections with automatic reconnection capabilities to handle network fluctuations.
* network messaging format:
  * standardization: defining a consistent messaging protocol for inter-host communication, possibly using formats like protocol buffers or grpc.

## network architecture

### network information data structure (ni)

the ni is a shared data structure that maintains network topology and host status information.

#### host status

* states:
  * `write`: host is available for write operations.
  * `read`: host is available for read operations.
  * `inactive`: host is temporarily unavailable.
  * `proposal`: host is proposing changes or updates to the network configuration.

#### data attributes

* size and checksum: metadata for data integrity and synchronization.
* voting mechanism:
  * host id: unique identifier for each host in the voting process.
  * source: the originator of a network update or proposal.
* messaging attributes:
  * decay: a counter that decreases over time to manage the lifespan of messages in the network.

#### host metrics

* distance and direction: metrics derived from network latency measurements to optimize routing and data placement.
* status and timestamps: operational state and last update time for each host.
* resource information:
  * partition ids: identifiers of data partitions managed by the host.
  * data ranges: ranges of data identifiers allocated to the host.
  * free space and cpu rating: metrics for load balancing and task allocation.

### host initialization process

* ni retrieval: new hosts request the ni from seed addresses upon startup.
* role determination:
  * analysis: hosts analyze the ni to determine their role (e.g. data storage, replication, query processing).
  * acceptance: hosts request network acceptance, which may involve a consensus or voting mechanism.
* identifier allocation: hosts receive partition identifier ranges and update the ni accordingly.

### synchronization protocols

* commit log alignment:
  * status advertising: hosts with out-of-sync commit logs advertise their `inactive` status.
  * full resynchronization: hosts request a complete data resync to realign with the network state.

### network topology optimization

* directional metrics:
  * latency-based calculations: using response times between hosts to infer network topology.
  * formula (law of cosines, with `ab`, `ac`, `bc` as latency-derived distances between hosts):

    ~~~
    direction(a, c) = arccos((ac^2 + ab^2 - bc^2) / (2 * ac * ab))
    ~~~

  * application: optimizing data distribution and query routing based on calculated distances and directions.

## reliability and assessment

### unreliability scoring

* metrics:
  * timeout tracking: incrementing unreliability scores upon communication timeouts.
  * thresholds: setting limits that, when exceeded, mark a host as `unavailable`.
* dynamic adjustments:
  * score reduction: decreasing unreliability scores after successful communications.
  * status updates: regularly updating host statuses based on current unreliability scores.

### network status evaluation

* retry logic:
  * counts and timeouts: implementing retry mechanisms with defined counts and timeout intervals.
* historical data:
  * since-time tracking: monitoring unreliability scores over time to detect patterns.
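the unreliability-scoring rules above (timeouts increment the score, successes decrement it, a threshold flips the host to `unavailable`) can be sketched as a small state holder. the class name, field names, and threshold value are illustrative assumptions.

```python
# Minimal sketch of per-host unreliability scoring.
# The threshold value of 5 is an arbitrary assumption for illustration.

class HostReliability:
    def __init__(self, threshold: int = 5):
        self.score = 0              # current unreliability score
        self.threshold = threshold  # score at which the host is marked unavailable

    def record_timeout(self) -> None:
        """A communication timeout increases the unreliability score."""
        self.score += 1

    def record_success(self) -> None:
        """A successful exchange reduces the score, but never below zero."""
        self.score = max(0, self.score - 1)

    @property
    def status(self) -> str:
        """Hosts at or past the threshold are marked unavailable."""
        return "unavailable" if self.score >= self.threshold else "available"
```

a periodic status-update task (the "status updates" bullet above) would then publish `status` into the ni, and the since-time tracking would log `(timestamp, score)` pairs to detect flapping hosts.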
## challenges and current issues

### global element identification

* key issue: establishing a method for globally unique element identifiers without incurring significant storage overhead.
* considerations:
  * inter-host relations: elements may need to reference others stored on different hosts.
  * identifier size: balancing uniqueness with efficiency; large identifiers like uuids may be impractical.
  * record type uniqueness: determining whether record type ids need to be globally unique or can be localized.

### potential solutions

* identifier mapping:
  * short ids: mapping long, globally unique identifiers to shorter, locally unique ones within each host.
* partitioned sequences:
  * host-based ranges: allocating specific identifier ranges to hosts to prevent collisions.
* type-specific b-trees:
  * separate structures: using individual b-trees for each record type to improve organization and search efficiency.

## architectural patterns

### multi-process databases

* parallel processing:
  * independent databases: running separate database instances or processes for different data segments.
  * performance gains: leveraging multi-core systems for concurrent processing.
* data isolation:
  * separate lmdb instances: using multiple lightning memory-mapped database (lmdb) instances to isolate workloads.

### cluster abstraction

* hierarchical structure:
  * leaders and followers: implementing a hierarchy where leaders manage clusters of followers.
  * role flexibility: leaders can also act as followers in higher-level clusters.
* operational dynamics:
  * task delegation: leaders assign tasks and manage data distribution among followers.
  * fault tolerance: hierarchical clustering enhances resilience against host failures.

### linked list of hosts

* sequential resource utilization:
  * write operations: if one host reaches capacity, operations proceed to the next host in the sequence.
  * read operations: hosts forward requests to the next in line if the data is unavailable locally.
* simplified management:
  * ease of implementation: linear structure simplifies routing logic and host management.

## notes and considerations

* b-tree limitations:
  * prefix deduplication: standard b-trees may not efficiently handle common key prefixes, leading to space inefficiencies.
* data prefixing strategies:
  * separate b-trees: utilizing multiple b-trees to prefix data without significant space overhead.
  * process isolation: employing multiple processes or databases to manage distinct data partitions.
* application-level partitioning:
  * custom solutions: applications may implement their own partitioning schemes based on specific data requirements.
* client-server architecture:
  * necessity in distribution: a distributed database inherently requires a client-server model due to network communication.
* data locality:
  * optimal distribution: while keeping related data together is beneficial, distribution is sometimes necessary for scalability.

## unsorted thoughts

* partition assignments:
  * single partition ids: assigning individual partition ids to hosts, expanding to ranges as they become primary writers.
* data clustering:
  * logical grouping: keeping similar records together, such as those of the same type or creation time.
* mandatory distribution:
  * scalability requirement: distribution is essential when local storage is insufficient or to enhance performance.
* network mesh formation:
  * neighbor awareness: hosts need to be aware of neighboring hosts to form a resilient network mesh.

## tentatively rejected ideas

### global id registry / global sequence

* concept:
  * centralized allocation: a set of hosts or a cluster responsible for issuing global identifier ranges.
* drawbacks:
  * single point of failure: the unavailability of the registry cluster halts identifier allocation.
  * state loss: if the cluster is lost, the entire identifier range and allocation state may be unrecoverable.
  * scalability issues: centralization conflicts with the distributed nature of the system.

### use of uuids

* reasons for rejection:
  * size overhead: uuids are 128 bits, significantly increasing storage and transmission costs.
  * collision handling: while uuid collisions are rare, the system must account for potential duplicates.
  * performance impact: larger identifiers can degrade indexing and query performance.
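the "partitioned sequences" alternative mentioned under potential solutions avoids both rejected approaches: each host allocates short ids from a range it owns, and hands half of its unused range to a descendant host, so no central registry and no 128-bit identifiers are needed. a minimal sketch, assuming half-open ranges and the illustrative names `IdRange`, `allocate`, and `split`:

```python
# Sketch of host-based identifier ranges ("partitioned sequences").
# Range bounds and names are illustrative assumptions, not a spec.

class IdRange:
    def __init__(self, start: int, end: int):
        self.start, self.end = start, end  # half-open range [start, end)
        self.next_id = start               # next id this host will hand out

    def allocate(self) -> int:
        """Hand out the next id from this host's range; ids are unique
        network-wide because ranges never overlap."""
        if self.next_id >= self.end:
            raise RuntimeError("range exhausted; request a new range")
        nid = self.next_id
        self.next_id += 1
        return nid

    def split(self) -> "IdRange":
        """Give the upper half of the unused range to a descendant host,
        so new hosts get collision-free ids without central coordination."""
        mid = (self.next_id + self.end) // 2
        child = IdRange(mid, self.end)
        self.end = mid
        return child
```

a host that exhausts its range would then have to request a fresh range through network consensus, which ties this scheme back to the host-identifier lifecycle described earlier (offline hosts losing their range, rejoining hosts renegotiating one).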