database distribution

a collection of notes for a potential implementation



database capability for automatically utilising the combined storage space and processing power of multiple hosts

reliability through redundancy, parallel query processing through synchronisation, storage space expansion through splitting of the dataset, low effort maintenance through automated networking

setup: install software package, add address of at least one other host, start network daemon application


currently datasets are confined to a single host because no mechanism exists to combine the storage space and processing capability of multiple hosts

new features

commit log

sharding for redundancy, replication and generally distribution

unlimited number of reader and writer hosts

automated networking managed by hosts themselves with only a manually configured seed host required

commit log

write to log in same transaction as other data write

store commit-log last reset mtime and commit id

timestamp commit-id -> change-type element-id data
mtime -> last-commit-id

change-type: 00/create 01/delete 10/update
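a minimal sketch of one way to encode such a log entry, assuming illustrative fixed field widths (64-bit timestamp, commit id and element id; the widths are not specified in these notes):

```python
import struct

# change-type codes as listed above: 00 create, 01 delete, 10 update
CREATE, DELETE, UPDATE = 0b00, 0b01, 0b10

# big-endian, packed: timestamp, commit-id, change-type, element-id, data size
HEADER = ">QQBQI"

def encode_log_entry(timestamp, commit_id, change_type, element_id, data):
    return struct.pack(HEADER, timestamp, commit_id, change_type,
                       element_id, len(data)) + data

def decode_log_entry(buf):
    fields = struct.unpack_from(HEADER, buf)
    offset = struct.calcsize(HEADER)
    return fields[:4] + (buf[offset:offset + fields[4]],)
```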

host identifiers

identifier range per host and hosts can assign sub-ranges to descendant hosts

hosts that are offline for too long are reset and lose their identifier range

reappearing hosts can advertise changed ip. needs consensus
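the sub-range assignment could work as a simple split, sketched here with an arbitrary halfway split point (the actual division scheme is not decided in these notes):

```python
def split_identifier_range(start, end, fraction=0.5):
    # a parent host keeps the lower part of its range and hands the
    # upper part to a descendant host; the split point is an assumption
    mid = start + int((end - start) * fraction)
    return (start, mid), (mid + 1, end)
```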


dataset split into partitions that can be distributed among hosts

the size of the partitions should be small enough to accommodate hosts with varying storage space

each host has a list of partitions it will duplicate

to delete an item, all hosts that are responsible for the partitions are queried
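a sketch of the responsibility lookup for a delete, with a hypothetical modulo assignment of elements to partitions (the real scheme would presumably derive the partition from the identifier layout):

```python
# hypothetical table: partition id -> hosts that duplicate the partition
partition_hosts = {
    0: {"host-a", "host-b"},
    1: {"host-b", "host-c"},
}

def partition_of(element_id, partition_count=2):
    # placeholder assignment of elements to partitions
    return element_id % partition_count

def hosts_for_delete(element_id):
    # all hosts responsible for the element's partition must be queried
    return partition_hosts[partition_of(element_id)]
```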



any host can receive query

chooses data host

proxies messages

direct connections would be better, but whether the data host is reachable from the querying host can not be predicted at this point


serialise to rpc format

remote transactions must time out

might have to query multiple hosts in the background

a write transaction may be split over multiple hosts; only the commits can be synchronised. a queue for failed sub-transactions? it gets big

remote states: transaction, cursors

needed tools

socket connections with automatic reconnect

network messaging format


network information data structure (ni)


write, read, inactive, proposal




vote ...


source: host-id

messages ...

decay: integer

host ...










new host starts

request ni using seed addresses

analyse ni to decide purpose

ask network for acceptance

receive partition identifier range

commit log out of sync

advertise ni update with status inactive

request full-resync
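the start sequence could be sketched like this; request_ni, decide_purpose and request_acceptance are hypothetical stand-ins for the real network operations:

```python
def join_network(seed_addresses, request_ni, decide_purpose, request_acceptance):
    ni = request_ni(seed_addresses)   # request ni using seed addresses
    purpose = decide_purpose(ni)      # analyse ni to decide purpose
    # ask the network for acceptance; on success a partition
    # identifier range is received
    return request_acceptance(purpose)
```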


host-to-host response times could be used to spread out the distribution

angle from a near host calculated using response times between hosts

rt: response time

a, b, c: hosts

ac = rt(a, c)

ab = rt(a, b)

bc = rt(b, c)

direction(a, c) = inverse-cos((ac ** 2 + ab ** 2 - bc ** 2) / (2 * ac * ab))
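the formula above is the law of cosines solved for the angle at host a; a direct translation with example response times (the 30/40/50 triangle values are made up):

```python
from math import acos, degrees

# hypothetical symmetric response-time table between hosts a, b, c
response_times = {frozenset("ab"): 30.0,
                  frozenset("ac"): 40.0,
                  frozenset("bc"): 50.0}

def rt(x, y):
    return response_times[frozenset((x, y))]

def direction(a, c, b="b"):
    # angle at host a between hosts b and c, by the law of cosines
    ac, ab, bc = rt(a, c), rt(a, b), rt(b, c)
    return acos((ac ** 2 + ab ** 2 - bc ** 2) / (2 * ac * ab))
```

with these example times the triangle is right-angled, so the angle at a comes out as 90 degrees.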


many nodes initialised at the same time

if they receive similar ni, they decide on similar purposes

ni synchronisation

ni serialisation is cached


pull: regularly but rarely, from random hosts


push: on update, to random hosts. no update, no push

ni updates

if timeout, increase unreliability by two

if unreliability at certain limit, set status unavailable

decrease unreliability by one on successful communication

network status assessment

retry count and timeout

unreliability-score since-time
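the scoring rule described above, with an assumed limit of six (the notes only say "a certain limit"):

```python
UNAVAILABLE_LIMIT = 6  # assumed threshold

def update_unreliability(score, success):
    # timeouts add two, successful communication subtracts one (floor at zero)
    return max(0, score - 1) if success else score + 2

def host_status(score):
    return "unavailable" if score >= UNAVAILABLE_LIMIT else "available"
```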

current issues

global identification of elements

currently the biggest obstacle: how to allow for relations between elements that might be stored on different hosts, without using large (128b+) identifiers

relations between all nodes/records of any type

record type ids might need to be globally unique

one btree per record type or all types in one btree


multiple separate lmdb processes/databases

multiple btrees

map some long identifiers to a set of short identifiers
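one way to read "map some long identifiers to a set of short identifiers" is an interning table per host; a sketch (the table itself would of course have to be persisted):

```python
class IdInterner:
    # maps large global identifiers to small local integers and back
    def __init__(self):
        self.long_to_short = {}
        self.short_to_long = []

    def intern(self, long_id):
        if long_id not in self.long_to_short:
            self.long_to_short[long_id] = len(self.short_to_long)
            self.short_to_long.append(long_id)
        return self.long_to_short[long_id]

    def resolve(self, short_id):
        return self.short_to_long[short_id]
```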

an application could use two separate databases for separate parts of the data. not all datasets are unsplittable

with distributed databases, operations are no longer only in-process and a client/server architecture is needed

linked list of hosts

for writes, if one is full, proceed to next in list

for reads, if does not have data, proceed to next in list
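the fallback behaviour along the list could look like this, with hosts modelled as plain dictionaries for illustration:

```python
def write(hosts, key, value):
    # if a host is full, proceed to the next in the list
    for host in hosts:
        if not host["full"]:
            host["data"][key] = value
            return host["name"]
    raise RuntimeError("all hosts full")

def read(hosts, key):
    # if a host does not have the data, proceed to the next in the list
    for host in hosts:
        if key in host["data"]:
            return host["data"][key]
    return None
```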

cluster abstraction

leaders and followers, leaders can themselves be followers to at most one leader


btrees do not usually deduplicate common key prefixes

separate btrees are the only same-process way to prefix all data without losing space per element

multiple processes/databases would be another way and would allow parallel writes. i do not suppose there is enough separate partition activity

other databases have separate id series / sequences per table

dg can be extended by including main.c in the extension program


give hosts only single partition ids. only give ranges if they become writers

try to keep similar things together (record types, created on the same partition)

if you can distribute, you must distribute. when remote machines store key ranges, for example, you can not also store locally or the data would not be found

ideally, only distribute when there is not enough storage space to keep things together

a host does not need to know the whole network; knowing only its neighbors is enough to span a network. that is more than knowing only about itself, and the network meshes

tentatively rejected

global id registry / global sequence

set of computers / cluster gives out id ranges

sync new id ranges before giving them out

range is lost when hosts go down

multiple clusters have separate ranges

if cluster is down, the whole range is unregisterable

if cluster is lost, the whole range is lost as the exact counter state is only known in the cluster


too big. duplicate id generation should not be catastrophic

tags: start other computer sph-dg database design software network distribution conception scaling sph-db