database as a shared library

state: in development. relations and much of the api is implemented and stable, but new breaking features are in development (mainly record and key-value storage)

implements sph-dg


on github

data modeling

how to store data with sph-dg

compared to other databases

graph databases

graph databases typically store relations in an adjacency list, which tends to be efficient for traversing long paths from node to node. but adjacency lists alone are particularly well suited for frequent random access to nodes or finding all relations a node is in. sph-dg uses b+trees instead because its focus is on mapping data to identifiers (random access) and filtering on the set of all relations (optimised for queries such as "is x related to y" and "which nodes are adjacent to x", but not optimised for long path traversals)

relational databases like sqlite or mariadb

in relational database terms, sph-dg is currently one table for indexed blobs with identifiers and a many-to-many table for relations

to store rows in b-trees, generally speaking, an efficient mapping of key->value could be


where data is in some format that encodes the column data. rows of a table can be stored near each other for faster searches inside the data of many rows of a type. indexes that map column data to row ids can be added to avoid reading all rows when filtering by a column value


sph-dg currently supports associating ids to rows of custom binary format, but it always indexes the complete row data (not any individual columns). it does not not support indexes for column data, only indirect via data node identifiers

key-value databases

a key-value databases stores associations between arbitrary data, data->data. in sph-dg, similar functionality can be archieved with relations between nodes of type intern-small


with the restriction that data size must be smaller or equal to identifier size minus two bit. the more generic but more space intensive alternative are relations between nodes of type intern

data->id, id->data, id->id

which always indexes the data and requires the indirection via references

search performance

nodes and relations are stored in b+trees with a basic o(log n) time complexity for search/insert/delete (average and worst case)

node data

translating between node identifier and data requires only one basic b+tree search. it is as fast as it gets. the data of intern and extern nodes is indexed


relations are filtered by giving lists of ids to match as arguments. 16 different filter combinations for left/label/ordinal/right, the parts of relations, are theoretically possible with dg_relation_select. 2 are unsupported: all combinations that contain a filter for right and ordinal but not left. because the number of these possible combinations is relatively low, pre-compiled code optimised for the exact combination is used when executing queries. the sph-dg relation functionality is specifically designed for queries with these combinations

here is a list of estimated relative dg_relation_read performance for each filter combination from best (fastest execution per element matched) to worst (slowest), some are equal:

* * * *
left * * *
* * * right
left label * *
* label * right
* * ordinal *
left * ordinal *
left * ordinal right
left label ordinal *
left label ordinal right
left label * right
left * * right
* label * *
* label ordinal *

not supported

* * ordinal right
* label ordinal right

space usage

nodes and relations are stored in b+trees with a basic o(log n) space complexity (average and worst case)

the required space depends on the chosen dg_id_t and dg_ordinal_t types. the maximum possible size is restricted by the largest available c type supported by the compiler that is used to compile sph-dg. the commonly available range is 0 to 64 bit. the minimum possible size is 8 bit for identifiers and 0 bit for ordinals (unconfirmed)

64 bit identifiers and 32 bit ordinals are typical for big databases


id: id-size

intern: 2 * id-size + 2 * data-size

extern: intern-size

intern-small: 0 (only exist in relations)


6 * id-size + 2 * ordinal-size

example 64 32

id-size = 64, ordinal-size = 32

relation-size = 6 * 64 + 2 * 32 = 416

416 bit per relation

3e9 relations: ~1.3 terabit of storage

3e6 relations: ~1.3 gigabit of storage

example 32 0

id-size = 32, ordinal-size = 0

relation-size = 6 * 32 + 2 * 0 = 192

192 bit per relation

3e9 relations: ~576 gigabit

3e6 relations: ~576 megabit

node types and the internal trees they use

symbol -> id->data

symbol -> symbol->id

key -> data->data

labeled -> data->data

extern -> id->data

extern -> extern->id

small -> left->right

small -> right->left

relation -> left->right

relation -> right->left

relation -> label->left

potential improvements

scaling over multiple hosts with automated network discovery via seed hosts. ideas here

preliminary development tasks for the next version


reserve custom-length origin prefix

move the node type to the beginning of the local identifier

new counters for record and key ids




"table" -> id len name (type len column-name) ...

"index" -> table-id column-index ...

"key" -> id len name

"format" -> format-version id-type

"commit-log-reset" -> last-commit-id


type: 0/create 1/delete

data: element-type element-data ...

association: commit-id -> type data

for tables/indexes/key-value create


id -> data


data id -> data


data -> data



index: (type table-id (column-index ...))

table: (id name dbi columns: (type size name column-index indices))

key: (id name)

re-implement as table/index





tags: programming library overview start q1 computer project sph-dg database c sc graph-database base shared-library key-value