2018-05-12

sph-dg

design of a general purpose database

v2018 currently work in progress

wip code on github

motivation

i want an embeddable database to store records without having to construct high-level query language strings for each query. the next best thing would probably be sqlite, but with it i would waste resources on frequently building sql strings and having them parsed

graph-like relations in relational databases are usually modelled with tables (and custom naming schemes). it seems there is a more efficient way to store and query relations and it could be simpler to use with specialised routines

basic concept

the original idea

each distinct binary pattern to be stored gets an identifier and becomes a node

labelled relations between nodes are the basis of all data models

data changes are relation changes

downside: wasteful if most data patterns are distinct (because of binary pattern deduplication with a reverse index)

the common database design

tables for typed sets, dictionaries or records. one btree per table

indices for table columns in extra btrees

need to choose appropriate keys

extra tables for relations between specific types. one table for each type combination

table cells are modifiable

the v2018 design

nodes

every node has a unique identifier

typed as strings, numbers, arrays or other custom record types

indices can be created and used for fields and field combinations

virtual type nodes only exist in relations and carry the data in the identifier

relations

are labelled to give them meaning

have a direction

targets are optionally ordered by weight-like values

general features

direct programming language interface, no high-level query language

shared library written in c (embeddable database)

excluded features: re-order on read, grouping

lmdb for btrees (read optimised, among the fastest available)

core btrees usable with all lmdb features (for example cursor prev/next/last/first/set-range), custom btrees possible

nodes

identifiers

include type information

for cheap filtering of relation targets by specific types

sequencing

new identifiers are monotonically incremented

counter initialised on start from max identifier

type independent. maybe separate sequences for standard types. 64 bit ids are almost too small for this; a bigger native c datatype would be needed

null

a special "null" node of type id must be available. for example the identifier zero

use cases for this are unspecified relation labels, sources or targets

node types

fields are accessed by column index and indirectly by name

fixed size columns stored first to make use of predictable offsets

node data format: fixed-size-data ... variable-size-len variable-size-data ...
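
a minimal encoding sketch for this format, assuming one byte length prefixes for variable size fields and an example type with two fixed size fields and one variable size field. names and sizes are illustrative, not the actual sph-dg encoding:

#include <stdint.h>
#include <string.h>

/* illustrative only, not the actual sph-dg encoding. example type with
   two fixed size fields (uint32, uint8) followed by one variable size
   field stored as a one byte length prefix plus data */
size_t record_encode(uint8_t* out, uint32_t a, uint8_t b,
  const uint8_t* s, uint8_t s_len) {
  size_t offset = 0;
  /* fixed size fields first, so their offsets are known from the schema */
  memcpy(out + offset, &a, 4);
  offset += 4;
  out[offset] = b;
  offset += 1;
  /* variable size fields follow as length prefix plus data */
  out[offset] = s_len;
  offset += 1;
  memcpy(out + offset, s, s_len);
  offset += s_len;
  return offset;
}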

type info cache

contains information about types and associated indices

max type id size limited to 16 bit with initial implementation

so that the type id can be used as an array index ("arrays are the fastest hash tables")

format

{type-id}:
  field-count:
  field-fixed-count:
  field-fixed-offsets:
  fields:
  flags:
  id:
  indices:
  indices-count:
  name:
  sequence:
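
a hypothetical c structure for one cache entry along these lines; the names and field types are illustrative and may differ from the implementation:

#include <stddef.h>
#include <stdint.h>

/* hypothetical cache entry, held in a plain array indexed by type id
   ("arrays are the fastest hash tables"). names and field types are
   illustrative, the actual layout may differ */
typedef struct {
  uint16_t id;                    /* type id, equals the array index */
  uint8_t flags;                  /* for example a virtual type flag */
  char* name;
  uint16_t field_count;
  uint16_t field_fixed_count;
  size_t* field_fixed_offsets;    /* precalculated offsets of fixed size fields */
  void* fields;                   /* per field: field type id and name */
  void* indices;                  /* handles of associated index btrees */
  uint16_t indices_count;
  uint64_t sequence;              /* next node id for this type */
} dg_type_cache_entry_t;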

field types

have integer ids

fixed length field type ids are even numbers, variable length field type ids are odd numbers

there are many possible fixed length integer and string types with ids and data sizes calculated from a formula (sketch below)

fixed length types

integer

id: 3b:size-exponent 1b:signed 4b:id-prefix:0000

data size in bits: 2 ** (size-exponent + 3)

string

id: 4b:size-exponent 4b:id-prefix:0010

data size in bits: 2 ** (size-exponent + 4)

float32

float64

variable length types

vbinary

vstring
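
a sketch of the integer and string formulas above, assuming the listed bit fields are packed from high to low bits into one byte with the id prefix in the low four bits. this bit order is an assumption of the example:

#include <stdint.h>

/* sketch of the integer and string id formulas only. assumes the listed
   bit fields are packed high to low into one byte, with the id prefix in
   the low four bits, which makes fixed length ids even */
uint8_t field_type_id_integer(uint8_t size_exponent, uint8_t is_signed) {
  return (size_exponent << 5) | (is_signed << 4) | 0x0; /* prefix 0000 */
}
uint8_t field_type_id_string(uint8_t size_exponent) {
  return (size_exponent << 4) | 0x2; /* prefix 0010 */
}
/* data size in bits: 2 ** (size-exponent + 3) for integers,
   2 ** (size-exponent + 4) for strings */
uint32_t integer_size_bits(uint8_t size_exponent) { return 1 << (size_exponent + 3); }
uint32_t string_size_bits(uint8_t size_exponent) { return 1 << (size_exponent + 4); }
/* the low bit distinguishes fixed (even) from variable (odd) length types */
int field_type_is_variable(uint8_t id) { return id & 1; }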

btrees

system: config-label [id/custom ...] -> data

format-label -> id-size id-type-size ordinal-size

type-label id -> name-len name field-count (field-type name-len name) ...

index-label type-id field-index ... -> ()

relations

when a node is deleted then all relations that contain its id must be deleted

there must only be one relation for every combination of left, label and right. cyclic relations may be allowed. relational integrity may be ignored when creating relations to avoid existence checks

targets are stored ordered by ordinal value. ordinal values are stored only for the left to right direction to save space

by convention left to right corresponds to general to specific

btrees

left:node label:node -> ordinal:float right:node
right:node label:node -> left:node
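
a sketch of how the left-to-right btree could be scanned with plain lmdb calls, assuming 64 bit ids packed big-endian so that byte-wise key order clusters entries by left and then label. the key and value layout here is an assumption for illustration, not the actual sph-dg format:

#include <lmdb.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* assumed layout for illustration: key is left id then label id, both
   64 bit big-endian so that byte-wise key order clusters by left then
   label. values are (ordinal, right) pairs, with the dbi opened using
   MDB_DUPSORT so one key can hold multiple values */
static void pack_u64_be(uint8_t* out, uint64_t x) {
  for (int i = 7; i >= 0; i -= 1) {
    out[i] = x & 0xff;
    x >>= 8;
  }
}

/* list all (ordinal, right) entries for a given left and label */
int relation_scan(MDB_txn* txn, MDB_dbi dbi, uint64_t left, uint64_t label) {
  MDB_cursor* cursor;
  MDB_val key, val;
  uint8_t prefix[16];
  int rc;
  pack_u64_be(prefix, left);
  pack_u64_be(prefix + 8, label);
  rc = mdb_cursor_open(txn, dbi, &cursor);
  if (rc) return rc;
  key.mv_size = 16;
  key.mv_data = prefix;
  /* position at the first key >= left:label, then step forward while
     the key still equals the left:label prefix */
  rc = mdb_cursor_get(cursor, &key, &val, MDB_SET_RANGE);
  while (!rc && 16 == key.mv_size && !memcmp(key.mv_data, prefix, 16)) {
    printf("matched relation, value size %zu\n", val.mv_size);
    rc = mdb_cursor_get(cursor, &key, &val, MDB_NEXT);
  }
  mdb_cursor_close(cursor);
  return (MDB_NOTFOUND == rc) ? 0 : rc;
}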

indices

separate btrees with data as the key and node id as the value

update on each insert of the associated node type using the same transaction

might use a hook mechanism available for extensions
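
a sketch of such an index update with plain lmdb calls, done inside the same write transaction as the node insert. index_dbi and the key layout are assumptions of the example:

#include <lmdb.h>
#include <stddef.h>
#include <stdint.h>

/* sketch: after writing the node data, also write field-data -> node-id
   into the index btree using the same write transaction, so both become
   visible atomically on commit. index_dbi and the key layout (raw field
   data as key) are assumptions */
int index_update(MDB_txn* txn, MDB_dbi index_dbi,
  void* field_data, size_t field_size, uint64_t node_id) {
  MDB_val key, val;
  key.mv_data = field_data;
  key.mv_size = field_size;
  val.mv_data = &node_id;
  val.mv_size = sizeof(node_id);
  return mdb_put(txn, index_dbi, &key, &val, 0);
}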

programming interface (roughly)

dg-node-

dg-graph-

dg-index-

dg-index-node-

dg-type-

dg-env-

dg-txn-

evaluation of other databases

sqlite

sqlite pros

multi platform, well tested, well documented, query optimiser, views, order on select, group, filtering sub-selects, triggers, sql-functions, integer primary keys are automatically auto increment

sqlite cons

sqlite can use transactions only for insert/update/delete, not for example for schema changes. lmdb based sph-dg wouldnt have that limitation

building sql strings requires quoting, escaping and generally protection against sql-injection. this adds another step where strings are analysed and possibly rewritten

a table where all fields make up the key might not be possible as efficiently. in lmdb btrees one could use duplicate keys or store all fields in the key without values

cant predict or guarantee the order in which values are stored

all types are variable length and all data fields are therefore prefixed with type information

checking if an index exists is not supported

no duplicate keys

might require copying of selected data for reading instead of offering pointers to it directly

there is always an integer row id, even when using for example strings as the primary key

shared features

create indices selectively for columns

filter table rows by matching column content and logical conditions

automatically find and use index

sph-dg pros

better performance (fewer code paths, more minimalistic design, lmdb)

uses c data structures for queries instead of sql strings which need to be constructed by concatenating strings then parsed and evaluated

sph-dg gives access to data without copying. preparator (select) -> row-positioner (next) -> column-accessor (get)

direct lmdb btree access support. custom key-value btrees can be added to the environment

no relation tables with specific layouts and indices need to be managed

automatic deletion of dependent relations when deleting nodes instead of having to create triggers

low-level filter functions that can make range searches on compound key parts and dont have to analyse table schemas or find indices

graph databases

graph databases typically store relations in an adjacency list, which tends to be efficient for traversing long paths from node to node. but adjacency lists alone arent particularly well suited for frequent random access to nodes as records or for finding all relations a node is in. sph-dg uses b+trees instead because its focus is on mapping data to identifiers (random access) and filtering the set of all relations (optimised for queries such as "is x related to y" and "which nodes are adjacent to x", but not for long path traversals)

considerations

schema changes require cache updates

cache update should only be made after the mdb transaction committed

option - current choice

make schema changes have their own transaction each time (sketch below)

pro

less complex

con

lower performance for a high frequency of schema changes. but it is questionable whether a significant frequency is likely at all

option

track schema updates with transaction and apply after successful mdb commit
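
a minimal sketch of the current choice (each schema change in its own write transaction, cache refreshed only after commit), assuming a placeholder refresh_type_cache for the cache update:

#include <lmdb.h>

/* placeholder for rebuilding the in-memory type info cache */
static void refresh_type_cache(void) {}

/* sketch of the chosen option: a schema change gets its own write
   transaction and the cache is only refreshed after the commit
   succeeded, so a failed commit never leaves a stale cache entry */
int schema_change_commit(MDB_txn* txn) {
  int rc = mdb_txn_commit(txn);
  if (0 == rc) refresh_type_cache();
  return rc;
}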

inserts during type deletion

scenario

delete and insert might happen in the same or in concurrent transactions

data for a deleted type could be inserted

consequences

data becomes inaccessible and wastes space

if a type id is ever reused, previously existing data becomes accessible again. it can then appear, confusingly, as part of the new data model. sensitive data can become re-accessible without adequate treatment

option - first choice

as long as type id re-use is not implemented, ignore the issue for initial versions

option - second choice

relieve integrity issues and delete excess data on demand

feature

delete all data with type id when a new type is created

old data might still waste space but can never be confused with new data

feature

add a garbage collection helper that deletes all data for type ids that are unused

pro

low complexity, performant solution

option

prevent excess type data from ever existing

feature

register inserted types in transaction

acquire schema lock before each commit with inserts or type delete

before commit with inserts check if all inserted types still exist

con

mutex locking/unlocking on every insert

extra complexity to track cross-transaction conflicts

index access after deletion

index open -> index delete -> index read -> error

all index handles become invalid with delete

how to choose and use indices

manually

either selecting only node ids or nodes

cursor to make custom btree searches

database handles or one open database per process

handles (chosen)

can use multiple databases in one process

almost all api routines require a transaction object which encapsulates the environment

process

dont have to pass handle argument to any api routine

must use separate processes and ipc to work with multiple databases

custom extra fields for relations

changing the default btree would add considerable complexity

possible with labels that are compound data structure nodes

use as key-value database

possible with lmdb features
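
a sketch using plain lmdb calls: a custom named btree opened in the shared environment (requires mdb_env_set_maxdbs at environment setup) used as a key-value store. the database name "kv" is an example:

#include <lmdb.h>
#include <stddef.h>
#include <string.h>

/* sketch: open a custom named btree in the shared environment (requires
   mdb_env_set_maxdbs at environment setup) and use it as a plain
   key-value store next to the sph-dg btrees. "kv" is an example name */
int kv_put(MDB_txn* txn, const char* k, const char* v) {
  MDB_dbi dbi;
  MDB_val key, val;
  int rc = mdb_dbi_open(txn, "kv", MDB_CREATE, &dbi);
  if (rc) return rc;
  key.mv_data = (void*)k;
  key.mv_size = strlen(k);
  val.mv_data = (void*)v;
  val.mv_size = strlen(v);
  return mdb_put(txn, dbi, &key, &val, 0);
}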

use as traditional record/table database

possible with typed nodes

should partial indices be supported

not yet

clustering same-type nodes

achieved by having the type at the beginning of the key, since keys are sorted

read multiple selected records in one call or only one per call

read returns next one (current choice)

pro

copy only id and data pointer to result

custom row content matching can stop at find (main reason for choice)

con

more function calls with state argument

may need to repeatedly extract and cache values from state

read returns list of multiple

con

extra mallocs for the result list entries

when filtering rows one by one more results than needed might have been loaded

pro

fewer function calls, possibly less preparation to continue the search (no re-initialisation of cursor, status and mdb-val local variables)

works the same as for relations, which are less likely to be changed at this point

separating the data read step from the data use step in time batches repeated operations and might improve low-level caching

complex filter conditions for node selection

for example conditions in a nestable linked list with comparison operators. and/or/not/begins/ends/contains/equals/etc (sketch below)

or custom tests and branches

more complex searchers could be an extension

con custom

index selection cant come from conditions

implementing the tests in a higher-level language exacerbates the costs

pro custom

hardcoded checks. no interpretation of a condition query structure
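
a hypothetical sketch of the nestable condition structure from the first option. names and fields are illustrative only:

#include <stddef.h>
#include <stdint.h>

/* hypothetical condition node for the nestable linked list option. a
   condition either combines child conditions (and/or/not) or compares
   one field of a node with a value */
enum {
  dg_cond_and, dg_cond_or, dg_cond_not,
  dg_cond_equals, dg_cond_begins, dg_cond_ends, dg_cond_contains
};

typedef struct dg_condition {
  uint8_t op;                     /* one of the operators above */
  uint16_t field_index;           /* field to test, for comparison operators */
  void* value;                    /* comparison value */
  size_t value_size;
  struct dg_condition* children;  /* child list, for and/or/not */
  struct dg_condition* next;      /* next sibling in the list */
} dg_condition_t;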

where to encode the type of each individual node

option

store type in identifier

con

increased node identifier length. affects every node. affects every relation times four (source and target in two directions)

pro

the node id is enough to know the type. selecting just relations makes it possible to filter by node type before accessing the nodes

cheap type filtering

relations between all types possible in one btree. no type combination-specific btree selection/creation process or need to access multiple btrees for a query

option

store type with the data

con

store type in data. makes get-all-of-type-x more difficult. possibly needs additional index

duplicated with each node

have to read data to find type

pro

smaller relations

option

group node types by using separate btrees

con

types need to be carried with node references to find relevant grouping btrees and read record fields

one btree for each type combination is necessary. it has to be found for queries, and possibly multiple have to be accessed for different types

pro

space saving: not every row needs to store type information

informal column types

the database only needs a fixed and a variable length binary record field datatype to store data

informal datatypes are string, integer and similar, anything beyond binary

are stored with the schema to avoid having to store them somewhere else for reifying useful datatypes

unidirectional relations to save space

maybe later

issue: how to garbage collect unidirectional relations

option (favorite): go by label. label must be connected

option: maybe a decay factor and regular refreshes, but it should be entirely an application issue at this point

database network distribution

i have started thinking about it, but it takes more resources to implement

rudimentary implementation (read scaling): a hook for each modification that gets passed the current write transaction and an extension that manages a simple commit log and applies it on another host

see database distribution

the common database design

tables for typed sets, dictionaries or records. one btree per table

indices for table columns in extra btrees

extra tables for relations between rows of different tables/types. one table for each type combination

table cells are modifiable

custom id sizes per table and sequencing. could load current max value per table on startup

have to choose the relation btrees to use for each relation query

type has to be carried anyway with ids

one btree for all tables or one for each table

one for each table because of the space saving. it acts like a prefix to a data collection

graph

should label/ordinal be optional

should graph relations of both directions be kept in the sys-info cache

are extra relation fields needed

example situation: "when every ingredient can be added in different amounts to different recipes"

custom primary keys

a dictionary, a one to one association, could be considered a subset of plain tables. yet rows might have row identifiers identifying a pairing even if that is not needed. this points to custom primary keys

ambiguity of auto-increment and provided keys. each insert requires additional checks for auto-increment and possibly whether a key has been provided. an id key is potentially quicker to join as primary and foreign key because of the efficient data type

sequences per table or one global

global

fast load on startup

per-table

can theoretically be shorter

need to query every table on startup

good for many fragmented types with local cluster relations

variable per-table

size loaded from schema variable

what happens on concurrent delete and update for a row

description

update is reinsert

last committed transaction wins

that is why updates are problematic

"update might prevent deletion"

how to create custom virtual types

set the corresponding flag with dg-type-new

how to select type ids by name

option - current choice

cache data scan

no extra work

relatively fast operation

option

index btree

option

system data scan

how to alter a type

initially not supported

planned to require a complete rewrite of all rows unless the change only appends fields

reason is maximum space saving and performance for the row encoding

an option might be to create new types for the changed versions, use multiple as if they were one and merge at some point (they might never become fully merged)

one sequence for all types or one per type

64 bit keys seem desirable because it is the biggest native c datatype. but with that size and a reasonable 16 bit type identifier, only 48 bit are left for row ids

initialise counters in array. type id equals index

at startup, for each type, find max id
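
a sketch of composing and decomposing such ids, with the type id in the high 16 bits and the per-type sequence in the low 48 bits as described above:

#include <stdint.h>

/* compose and decompose 64 bit ids: 16 bit type id in the high bits,
   48 bit per-type sequence value in the low bits */
#define DG_SEQUENCE_BITS 48
#define DG_SEQUENCE_MASK ((1ULL << DG_SEQUENCE_BITS) - 1)

uint64_t id_compose(uint16_t type_id, uint64_t sequence) {
  return ((uint64_t)type_id << DG_SEQUENCE_BITS) | (sequence & DG_SEQUENCE_MASK);
}
uint16_t id_type(uint64_t id) { return id >> DG_SEQUENCE_BITS; }
uint64_t id_sequence(uint64_t id) { return id & DG_SEQUENCE_MASK; }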

selecting from multiple indices

intersection of multiple index search results

are more node operations needed

node first/last/prev/range etc seem not needed because of the data model (ids map to data that is unsorted from the user's perspective)

but should allow first/last forward/reverse searches

implementation search performance

nodes and relations are stored in b+trees with a basic o(log n) time complexity for search/insert/delete (average and worst case)

node data

translating between node identifier and data requires only one basic b+tree search. it is as fast as it gets. the data of internal and external nodes is indexed

relations

relations are filtered by giving lists of ids to match as arguments. 16 different filter combinations for left/label/ordinal/right, the parts of relations, are theoretically possible with dg_relation_select. 2 are unsupported: all combinations that contain a filter for right and ordinal but not left. because the number of these possible combinations is relatively low, pre-compiled code optimised for the exact combination is used when executing queries. the sph-dg relation functionality is specifically designed for queries with these combinations

here is a list of estimated relative dg_relation_read performance for each filter combination from best (fastest execution per element matched) to worst (slowest), some are equal:

* * * *
left * * *
* * * right
left label * *
* label * right
* * ordinal *
left * ordinal *
left * ordinal right
left label ordinal *
left label ordinal right
left label * right
left * * right
* label * *
* label ordinal *

not supported

* * ordinal right
* label ordinal right

implementation space usage

nodes and relations are stored in b+trees with a basic o(n) space complexity (average and worst case)

the required space depends on the chosen dg_id_t and dg_ordinal_t types. the maximum possible size is restricted by the largest available c type supported by the compiler that is used to compile sph-dg. the commonly available range is 0 to 64 bit. the minimum possible size is 8 bit for identifiers and 0 bit for ordinals (untested)

nodes

type id: id-size

other types: id-size + data-size

virtual: 0 (only exist in relations)

relations

6 * id-size + ordinal-size

example with 64 32

old calculation with bi-directional ordinal

id-size = 64, ordinal-size = 32

relation-size = 6 * 64 + 2 * 32 = 448

448 bit per relation

3e9 relations: ~1.3 terabit of storage

3e6 relations: ~1.3 gigabit of storage

example with 32 0

old calculation with bi-directional ordinal

id-size = 32, ordinal-size = 0

relation-size = 6 * 32 + 2 * 0 = 192

192 bit per relation

3e9 relations: ~576 gigabit

3e6 relations: ~576 megabit
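
a minimal sketch of the current formula (unidirectional ordinal) for estimating raw relation storage, ignoring btree overhead:

#include <stdint.h>
#include <stdio.h>

/* estimated bits for the relation entries alone, without btree overhead:
   6 * id-size + ordinal-size per relation (current unidirectional design) */
uint64_t relation_storage_bits(uint64_t count, uint32_t id_bits, uint32_t ordinal_bits) {
  return count * (6 * (uint64_t)id_bits + ordinal_bits);
}

int main(void) {
  /* 3e9 relations, 64 bit ids, 32 bit ordinals: 1248000000000 bits, about 1.2 terabit */
  printf("%llu\n", (unsigned long long)relation_storage_bits(3000000000ULL, 64, 32));
  return 0;
}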

history

2018-04

after exploring designs for a more traditional table based database, the codebase of sph-dg became relevant again when it was found that table based databases lead to complexities with node/row types and the relations between them. particularly the fact that the node/row type always has to be known anyway, and the complexity of having multiple relation tables and selecting among them, made it convincing to return to encoding the type in the node id and storing everything in the same btree again, even if that costs significantly more space, especially for relations

2018-03

relations were relatively efficient to store but it became clear that structured data is not. low-complexity records with selective indexing and full table searches were not possible. only unnecessarily costly records could be created, for example using relations with nothing less than maximum normalisation and full indexes for the complete row data. tables, column indexes and custom key-value associations seemed missing

insights: separate btrees are like prefixing a data area and save more space than, for example, prefixing each element with a type. when using one big btree to relate any type with any other, relations need to store the type of related nodes when the datatype is required to address elements. the row is the more fundamental abstraction, not the relations/graph. rows are like arrays with possibly variable-length, typed fields

custom node/row/record types and limiting which nodes can be linked has interesting implications for data retrieval and storage space

2016

earlier versions of sph-dg did not support labelled relations but instead had relations that themselves have ids and identifier aliases. this was found to be overcomplicated and particularly too costly in terms of space usage and processing. one insight was that relations without labels behave like relations that all have the same label. such relations are not semantically distinguishable on their own, especially for partitioning data and for deciding whether specific relations are no longer needed. relations to numbers posed another problem: since the only available node type stored its data in the database, creating relations to numbers would potentially create a very large number of nodes. to solve this, a new node type was introduced that contains the number in the identifier and only exists in relations. earlier versions also used classifications like "shape" or "connectors" for nodes, but these classifications turned out to be useless, at least in this context

applied knowledge

i previously wrote sph-dg, which is an embeddable, lmdb based database focused on graph-like relations. it is used on this website, works reliably and relation queries are fast. i am a long-time user of database systems like sqlite, mysql, lmdb (interface similar to berkeley db), oracle databases and postgresql and have used almost all sql features professionally. ive been thinking for years about simpler and more efficient multi-use information storage and retrieval for contexts like web applications, generative art and ai knowledge storage

related

about databases

database distribution

graph data structures

sph-dg

sph-dg data modeling

sph-dg-guile


tags: start q1 specification computer sph-dg graph database design software sph-db