2017-10-20

sph-dg

design for a general purpose database

an attempt to find the most useful data organisation patterns and possibly unite what used to be separate kinds of databases

current features

deduplicated data persistence with automatically created identifiers

directed graph relations with labels and order

general

id: identifier, database unique, automatically created, for use as a reference in-place of data. identifiers encode the element type so they can be filtered efficiently by type when used with relations

node: element that can be used in relations, which is every element that can be referenced with a database unique identifier

associations are stored with ordered keys and values. b+trees can be used as the fundamental data structure for implementations

element types

id

database unique without data. this can be useful in relations for example for intermediate nodes that reference multiple others

a special "null" node of type id must be available. for example an identifier with a value of zero

symbol

association: id -> data; data -> id

automatically deduplicated data with an id to be used as a reference. each id corresponds to one distinct binary pattern

use case: short strings, names, numbers, binary data and similar

operations: ensure, get-data-by-id, get-id-by-data

key

association: data:key -> data:value

simple associations of user data. user must manually add ids to the data to use it in relations

use case: basis for custom types, caches

labeled

association: data:label id -> data

non-deduplicated data clustered by label with an automatically created id. label and data are user data

use case: records where label specifies the type, tables

small

data not bigger than id size that only exists in relations. immediately available in relations with no need to dereference it

use case: numbers in relations

extern

association: id -> data; data -> id ...

represents a reference to data stored outside the database. the data part is optional and can contain addresses or similar to locate external data

relations

association: left label -> ordinal id:right; right label -> left

association types: id id -> data id; id id -> id

data is stored numerically sorted. the ordinal allows to store relations in user defined order (weight)

there must only be one relation for every combination of left, label and right. cyclic relations may be allowed. relational integrity may be ignored when creating relations to avoid existence checks

as a convention, left to right should correspond to general to specific

when a node element is deleted then all relations that contain its id in any of left, label or right must be deleted

about this document

history

2017

previously sph-dg was only intended for graph-like data storage with a direct c interface to make data modeling simpler and data access more efficient compared for example to relational databases using sql. the graph representation and forced deduplication raised questions about how to store records, as in indexed tables, efficiently. as well as the question of how to efficiently store sequences with possibly duplicate elements in general. the new element types key and labeled were added and the database design was basically merged with record and key-value database designs

2016

earlier versions of this document did not include labels/labels for relations, but identifiers for pairs and identifier aliases instead. this was found to be overcomplicated and particularly too costly in terms of space usage and processing. the following insight was that relations without labels are like relations with all having the same label. these relations are on its own not distinguishable semantically, especially for partitioning data and for deciding if specific relations are not needed anymore and can be deleted

relations to numbers posed a problem: since only the intern type was specified, creating relations to numbers would potentially lead to a very high number of interns, since for each numeric value to be used, a new intern node with id and data would have to be created. this issue should now be solved with the virtual intern-small nodes that can be used in relations

earlier versions of this document also used classifications like "shape" or "connectors" for nodes, but these classifications turned out to be useless at least in this label. following increased clarity, the document is also shorter now


tags: start q1 specification computer sph-dg graph database abstract base