database distribution

draft notes



  • database capability for automatically utilising the combined storage space and processing power of multiple hosts
  • reliability through redundancy, parallel query processing through synchronisation, storage space expansion through splitting of the dataset, low effort maintenance through automated networking
  • setup: install software package, add address of at least one other host, start network daemon application


  • commit log
  • sharding for redundancy, replication and generally distribution
  • unlimited number of reader and writer hosts
  • automated networking managed by hosts themselves with only a manually configured seed host required

commit log

  • write to log in same transaction as other data write
  • store commit-log last reset mtime and commit id
timestamp commit-id -> change-type element-id data
mtime -> last-commit-id

change-type: 00/create 01/delete 10/update

host identifiers

  • identifier range per host and hosts can assign sub-ranges to descendant hosts
  • hosts that are offline for too long get reset. lose their identifier range
  • reappearing hosts can advertise changed ip. needs consensus


  • dataset split into partitions that can be distributed among hosts
  • the size of the partitions should be small enough to accommodate hosts with varying storage space
  • each hosts has a list of partitions it will duplicate
  • to delete an item, all hosts that are responsible for the partitions are queried



  • any host can receive query
  • chooses data host
  • proxies messages
  • direct connections would be better but will not predict if accessible by querying host at this point


  • serialise to rpc format
  • remote transactions must time out
  • might have to query multiple hosts in the background
  • a write transaction split over multiple hosts. can only synchronise commits. queue for failed sub-transactions? gets big
  • remote states: transaction, cursors

needed tools

  • socket connections with automatic reconnect
  • network messaging format


network information data structure (ni)


write, read, inactive, proposal


  • size
  • checksum
vote ...
  • host-id
  • source: host-id
messages ...

decay: integer

host ...
  • distance
  • direction
  • status
  • host-id
  • mtime
  • partition-ids
  • data-ranges
  • free-space
  • cpu-rating

new host starts

  • request ni using seed addresses
  • analyse ni to decide purpose
  • ask network for acceptance
  • receive partition identifier range

commit log out of sync

  • advertise ni update with status inactive
  • request full-resync


  • could be used to spread out distribution
  • angle from a near host calculated using response times between hosts
  • rt: response time
  • a, b, c: hosts
  • ac = rt(a, c)
  • ab = rt(a, b)
  • bc = rt(b, c)
  • direction(a, c) = inverse-cos((ac ** 2 + ab ** 2 - bc ** 2) / (2 * ac * ab))


many nodes initialised at the same time

if they receive similar ni, they decide for similar purposes

ni synchronisation

ni serialisation, is cached


  • regularly. rarely
  • from random hosts


  • on update. no update, no push
  • to random hosts

ni updates

  • if timeout, inrease unreliability by two
  • if unreliability at certain limit, set status unavailable
  • decrease unreliability by one on successful communication

network status assessment

  • retry count and timeout
  • unreliability-score since-time

current issues

global identification of elements

  • currently the biggest obstacle: how to allow for relations between elements that might be stored on different hosts, without using large (128b+) identifiers
  • relations between all nodes/records of any type
  • record type ids might need to be globally unique

one btree per record type or all types in one btree


  • multiple separate lmdb processes/databases
  • multiple btrees
  • map some long identifiers to a set of short identifiers
  • an application could use two separate databases for different separate parts of the data. not all datasets are unsplittable
  • with distributed databases operations are not only in-process anymore and a client/server architecture is needed

linked list of hosts

  • for writes, if one is full, proceed to next in list
  • for reads, if does not have data, proceed to next in list

cluster abstraction

leaders and followers, leaders can themselves be followers to at most one leader


  • btrees do not usually deduplicate common key prefixes
  • separate btrees are the only same-process way to prefix all data without losing space per element
  • multiple processes/databases would be another way. parallel write. i do not suppose there is enough separate partition activity
  • other databases have separate id series / sequences per table
  • dg can be extended by including main.c in the extension program


  • give hosts only single partition ids. only give ranges if they become writers
  • try to keep similar things together (record types, created on the same partition)
  • if you can distribute, you must distribute. when remote machines store key ranges for example, you can not store locally or the data would not be found
  • ideally only distribute if you can not keep things together when there is not enough storage space
  • a host does not need to know all about the whole network but only neighbors to span a network, that is more than knowing only about itself and it meshes

tentatively rejected

global id registry / global sequence

  • set of computers / cluster gives out id ranges
  • sync new id ranges before giving them out
  • range is lost when hosts go down
  • multiple clusters have separate ranges
  • if cluster is down, the whole range is unregisterable
  • if cluster is lost, the whole range is lost as the exact counter state is only known in the cluster


too big. duplicate id generation should not be catastrophical