2018-02-04

database as a shared library

state: in development. relations and much of the api is implemented and stable, but new breaking features are in development (mainly record and key-value storage)

implements sph-db

related: database distribution

on github

graph databases typically store relations in an adjacency list, which tends to be efficient for traversing long paths from node to node. but adjacency lists alone are particularly well suited for frequent random access to nodes or finding all relations a node is in. sph-dg uses b+trees instead because its focus is on mapping data to identifiers (random access) and filtering on the set of all relations (optimised for queries such as "is x related to y" and "which nodes are adjacent to x", but not optimised for long path traversals)

in relational database terms, sph-dg is currently one table for indexed blobs with identifiers and a many-to-many table for relations

to store rows in b-trees, generally speaking, an efficient mapping of key->value could be

row-id->data

where data is in some format that encodes the column data. rows of a table can be stored near each other for faster searches inside the data of many rows of a type. indexes that map column data to row ids can be added to avoid reading all rows when filtering by a column value

column-data->row-id

sph-dg currently supports associating ids to rows of custom binary format, but it always indexes the complete row data (not any individual columns). it does not not support indexes for column data, only indirect via data node identifiers

a key-value databases stores associations between arbitrary data, data->data. in sph-dg, similar functionality can be archieved with relations between nodes of type intern-small

intern-small->intern-small

with the restriction that data size must be smaller or equal to identifier size minus two bit. the more generic but more space intensive alternative are relations between nodes of type intern

data->id, id->data, id->id

which always indexes the data and requires the indirection via references

nodes and relations are stored in b+trees with a basic o(log n) time complexity for search/insert/delete (average and worst case)

translating between node identifier and data requires only one basic b+tree search. it is as fast as it gets. the data of intern and extern nodes is indexed

relations are filtered by giving lists of ids to match as arguments. 16 different filter combinations for left/label/ordinal/right, the parts of relations, are theoretically possible with dg_relation_select. 2 are unsupported: all combinations that contain a filter for right and ordinal but not left. because the number of these possible combinations is relatively low, pre-compiled code optimised for the exact combination is used when executing queries. the sph-dg relation functionality is specifically designed for queries with these combinations

here is a list of estimated relative dg_relation_read performance for each filter combination from best (fastest execution per element matched) to worst (slowest), some are equal:

* * * * left * * * * * * right left label * * * label * right * * ordinal * left * ordinal * left * ordinal right left label ordinal * left label ordinal right left label * right left * * right * label * * * label ordinal *

not supported

* * ordinal right * label ordinal right

nodes and relations are stored in b+trees with a basic o(log n) space complexity (average and worst case)

the required space depends on the chosen dg_id_t and dg_ordinal_t types. the maximum possible size is restricted by the largest available c type supported by the compiler that is used to compile sph-dg. the commonly available range is 0 to 64 bit. the minimum possible size is 8 bit for identifiers and 0 bit for ordinals (unconfirmed)

64 bit identifiers and 32 bit ordinals are typical for big databases

id: id-size

intern: 2 * id-size + 2 * data-size

extern: intern-size

intern-small: 0 (only exist in relations)

6 * id-size + 2 * ordinal-size

id-size = 64, ordinal-size = 32

relation-size = 6 * 64 + 2 * 32 = 416

416 bit per relation

3e9 relations: ~1.3 terabit of storage

3e6 relations: ~1.3 gigabit of storage

id-size = 32, ordinal-size = 0

relation-size = 6 * 32 + 2 * 0 = 192

192 bit per relation

3e9 relations: ~576 gigabit

3e6 relations: ~576 megabit

symbol -> id->data

symbol -> symbol->id

key -> data->data

labeled -> data->data

extern -> id->data

extern -> extern->id

small -> left->right

small -> right->left

relation -> left->right

relation -> right->left

relation -> label->left