design for a general purpose database

currently work in progress


tables (record types) with rows (records) and columns (fields). indexes on arbitrary columns of a table

native support for fast directed graph (many-to-many) relations between rows (nodes)

direct programming language interface, no query language (more than a key-value database and less complex and faster than a relational database)

shared library written in c (embeddable database)

custom key-value containers

lmdb for btrees (because it is the best/fastest)


i want a simple, tiny, embeddable database to store records without having to construct high-level query language strings for each request. the next best thing would be sqlite, but there i would be wasting resources with frequently building sql strings and having them parsed

graph-like relations in relational databases are usually modelled with tables (and custom naming schemes). i think there is a more efficient way to store and query relations and it would be simpler to use with native support. i consider many-to-many relations to be a frequently useful feature

container types


properties: sequential-auto-id, row, columns, names, datatypes. optional: data

default: "system" -> key -> data


column-data ... -> id



left [label] -> [ordinal] right [extra ...]

right [label] -> left


left label -> [ordinal] right [extra ...]

right label -> left

optional: ordered, labeled, bi-directional, extra fields


data -> data



per table

custom id size per table

load current max value per table on startup


system cache

table id is array index ("arrays are the fastest hash tables")

field contains table schema, indices, graphs

index is array with column indices



id-size: 64

column-count: 5

column-types: int8 int32 int32 bin8 string32

column-offsets: 0 8 40 72

column-offsets-len: 4

column-names: string ...

name: string

index-count: 2


len id column-index ...

len id column-index ...

graph-left-count: 2


target id flags

target id flags


source id flags

source id flags

column types

0000 integer unsigned {size}

0010 integer signed {size}

0100 float32

0110 float64

1000 bin: size

1010 string: size

0001 binv: exponent-size

0011 stringv: exponent-size

-- alt --

bin: size

binv: exponent-size


columns are identified by name and stored with fixed size columns first (predictable offset)

btree "system"

format-label -> version

table-label id -> name id-size (col-name col-type) ...

index-label table:id id -> col-offset ...

graph-label source:id id -> target:id options:(bidirectional labeled ordinal)

custom-label id -> name


one btree per table

id -> col-data ...

size prefix for variable length types

row: fixed-size-data ... variable-size-len variable-size-data ...


one btree per index

update on each table insert using the same transaction

indices are defined by the table and indexed column offsets. the index id allows referring to that


one btree per type combination that should be related

ordinal stored only in left->right direction to save query complexity from ranged queries in the opposite direction

name is source and target table id


one btree per custom container

name is id


prefix: db-


t: table

i: index

g: graph

s: system

c: custom

infix combinations

st, si, sg, sc


db-t-select :: txn tab-struct ids col-offsets matcher:begins/ends/contains/equals/negation

db-t-delete :: txn tab-struct ids col-offsets matcher:begins/ends/contains/equals/negation

db-t-next :: selection -> row-reference

db-t-read :: selection -> record-struct

db-t-get :: txn tab-struct col-offsets matcher -> record-struct ...

db-t-put :: txn tab-struct (col-offset data) ...

db-g-put :: txn gra-struct left right ordinal extra ...

db-g-select :: txn gra-struct [left right label range]

db-g-read :: selection -> relation-struct ...

db-g-delete :: txn gra-struct [left right label range]

db-c-get :: txn kv-struct key

db-c-ensure :: txn kv-struct key data

db-c-delete :: txn kv-struct key

db-s: struct

db-st-new :: txn name id-size column ... -> tab-struct

db-st-delete :: txn tab-struct

db-st-exists? :: txn name ... -> boolean

db-st-get :: name -> tab-struct

db-st-dbi :: id

db-si-ensure :: txn tab-struct column-offset ...

db-si-delete :: txn index-struct

db-si-rebuilt :: txn index-struct

db-sg-new :: txn tab tab options

db-sg-delete :: txn gra-id

db-sg-dbi :: gra-id

db-sg-get :: source target

db-sc-new :: txn name

db-sc-delete :: txn kv

db-sc-get :: name

db-sc-dbi :: id





other desired features

associations that should be supported

id -> 0 (id only nodes)

data -> id, id -> data (symbols, deduplicated patterns, for example strings)

id -> data, data -> id (references, identifiers for external data like files)

cheap number relations

integers possibly stored in (node) ids of relations

lmdb features

all lmdb features should be available for btrees. for example cursor prev/next/last/first/set-range

to allow this, provide an accessor that gets the lmdb dbi for containers

open questions

should partial indices be supported


should multiple graphs for the same type combination be supported

should label/ordinal be optional

should graph relations of both directions be kept in the sys-info cache

are extra relation fields needed

example situation: "when every ingredient can be added in different amounts to different recipes"

column matchers

how to use column matchers like equals, contains, greater, lesser, begins, ends, etc

and combinations of them

options: passing function addresses, flags, descriptive structure

considered issues

informal column types

the database only needs a fixed and variable length binary record field datatype store data

informal datatypes are string, integer and similar

are stored with the schema to not have it to store somewhere else to reify more useful datatypes

excluded features

sub-selects, native joins, re-order on read, grouping

one btree for all tables or one for each table

one for each table because space saving, acts like a prefix to a data collection

sequences per table or one global


fast load on startup


can theoretically be shorter

need to query every table on startup

good for many fragmented types with local cluster relations

variable per-table

size loaded from schema variable

relations between one and multiple other types

is basically as before, only in separate btrees. reference elements can be used

bit slower because multiple btrees must be asked

unidirectional relations to save space

issue: how to garbage collect unidirectional relations

option: maybe decay factor and regular refresh but it should completely be an application issue at this point

database network distribution

i have started thinking about it, but it takes more resources to implement

see database distribution



renamed to sph-db because it should be less centered on directed graph like relations

relations were relatively efficient to store but it became clear that structured data is not. low-complexity records with selective indexing and full table searches were not possible. only unnecessarily costly records could be created, for example using relations with nothing less than maximum normalisation and full indexes for the complete row data. tables, column indexes and custom key-value associations seemed missing

insights: separate btrees are like prefixing a data area and more space saving than prefixing each element with a type for example. when using one big btree to relate any type with any other, relations need to store the type of related nodes when the datatype is required to address elements. the row is the more fundamental abstraction, not the relations/graph. rows are like arrays with eventually variable-length and typed fields

custom node/row/record types and limiting which nodes can be linked has interesting implications for data retrieval and storage space


earlier versions of sph-dg did not support labelled relations, but instead had relations that themselves have ids and identifier aliases. this was found to be overcomplicated and particularly too costly in terms of space usage and processing. one insight was that relations without labels are like relations with all having the same label. these relations are on their own not distinguishable semantically, especially for partitioning data and for deciding if specific relations are not needed anymore. relations between numbers posed another problem: since only a node with data type was specified, creating relations to numbers would potentially lead to a very high number of nodes. to solve this, a new node type that contains a number in the identifier and only exists in relations was introduced. earlier versions also used classifications like "shape" or "connectors" for nodes, but these classifications turned out to be useless at least in this context

applied knowledge

i previously wrote sph-dg, which is/was an embeddable, lmdb based database focused on graph-like relations. it is used on this website, works reliable and relation queries are fast. i am a long-time user of database systems like sqlite, mysql, lmdb (interface similar to berkeley db), oracle databases and postgresql and have used almost all sql features professionally. ive been contemplating for years about more simplistic and generic information storage and retrieval in the contexts of modern social media web applications and knowledge storage for ai

tags: start specification computer sph-dg graph database abstract base