# about

sph-dg is a database as a shared library for storing records with relations like a directed graph. it is supposed to be embeddable (load the source code or link the shared library), minimalistic and fast.

the current codebase has no known faults but misses typed nodes and because of that is not very useful for now. ideas for the next version are documented [here](http://sph.mn/c/view/si).

# features
* no overhead from sql or other query parsing: direct, high-speed interface using c data structures
* nodes are indexable custom type records with identifiers for random access
* native directed, labelled and ordered data relations. cheap relations for small values
* relation search optimised with hardcoded query processors for most cases
* fully acid compliant, tested, memory mapped database that can grow to any size that fits on the local filesystem, unrestricted by available ram
* transactions for any kind of modification
* parallel database reads
* read optimised

# homepage
[sph.mn](http://sph.mn/c/view/52)

# license
gpl3+

# dependencies
* run-time
  * lmdb - http://symas.com/mdb/ (bsd-style license), code here https://github.com/LMDB/lmdb
  * c standard library, for example glibc
* quick build
  * gcc and shell for the provided compile script
* development build
  * sc - https://github.com/sph-mn/sph-sc
  * clang-format (part of cmake)

# setup
1. install run-time and quick build dependencies
1. change into the project directory and execute ``./exe/compile-c``
1. execute ``./exe/install``. this supports one optional argument: a path prefix to install to

optionally, execute ``./exe/test`` to see if the tests run successful.

# performance
the performance very nearly matches the performance of lmdb. if not, it is a bug.
benchmarks for lmdb can be found [here](https://symas.com/lightning-memory-mapped-database/technical/).

# usage in c

## preliminary
dg routines return a "status_t" object that contains a status and a status-group identifier (error code and source library identifier). bindings to work with this small object are included with the main header "sph-dg.c". the error handling pattern used by sph-dg usually uses a goto label named "exit" per routine where undesired return status ids are handled.

## compilation of programs using sph-dg
for example with gcc:
```bash
gcc example.c -o example-executable -llmdb -lsph-dg
```

## inclusion of bindings
```c
#include "<sph-dg.c>"
```

## error handling pattern
the following examples assume this pattern of calling ``status_init`` beforehand and having a label named ``exit``.

```c
int main() {
  status_init;
  // example code ...
exit:
  return status.id;
}
```

## initialisation and deinitialisation
there can be at most one open database per process (to be changed in the next version), but different processes can open the same database and multiple threads can use it.
```c
status_require_x(dg_init("/tmp/example", 0));
// code that makes use of the database ...
dg_exit();
```

## create nodes with data
there are four node types: ``id``, ``intern``, ``extern`` and ``intern-small``. the following creates nodes of type ``intern``.
```c
dg_data_t data_element = {.size = dg_size_octets_id, .data = 0};
dg_data_list_t* data_list = 0;
dg_ids_t* result_ids = 0;
dg_txn_introduce;

// create data
b32 count = 5;
while(count) {
    data_element.data = calloc(1, sizeof(dg_ids_t));
    *((dg_id_t*)(data_element.data)) = 123 + count;
    data_list = dg_data_list_add(data_list, data_element);
    count = count - 1;
};

// ensure that a node for each data element exists and copy its id to result_ids
dg_txn_write_begin;
status_require_x(dg_intern_ensure(dg_txn, data_list, &result_ids));
dg_txn_commit;

exit:
  if(dg_txn) dg_txn_abort;
  dg_data_list_destroy(data);
  dg_ids_destroy(result_ids);
```

## get node ids from data, and data from ids
```c
dg_ids_t* data_ids = 0;
dg_txn_introduce;

// the data elements to look up
dg_id_t data_element_value = 2;
dg_data_t data_element = {.size = dg_size_octets_id, .data = &data_element_value};
dg_data_list_t* data = dg_data_list_add(0, data_element);

// get ids from data
dg_txn_begin;
status_require_x(dg_intern_data_to_id(dg_txn, data, 1, &data_ids));
dg_data_list_destroy(data);

// get data from ids
data = 0;
status_require_x(dg_intern_id_to_data(dg_txn, data_ids, 1, &data));

exit:
  if(dg_txn) dg_txn_abort;
  dg_data_list_destroy(data);
  dg_ids_destroy(data_ids);
```

## create relations
```c
dg_ids_t* ids_left = 0;
dg_ids_t* ids_right = 0;
dg_ids_t* ids_label = 0;
dg_txn_introduce;

// create some nodes. node ids are needed to create relations.
// in this example nodes of type "id" are created, which do not have data stored with them.
// the second argument to dg_id_create specifies how many new nodes should be created.
dg_txn_write_begin;
status_require_x(dg_id_create(dg_txn, 1, &ids_left));
status_require_x(dg_id_create(dg_txn, 1, &ids_right));
// used as the labels of the relations. labels are ids of nodes
status_require_x(dg_id_create(dg_txn, 1, &ids_label));

// create relations for each label between all the specified left and right nodes (relations = left * right * label)
status_require_x(dg_relation_ensure(dg_txn, left, right, label, 0, 0));
dg_txn_commit;

exit:
  if(dg_txn) dg_txn_abort;
  // deallocate the id lists
  dg_ids_destroy(ids_left);
  dg_ids_destroy(ids_right);
  dg_ids_destroy(ids_label);
```

## read relations
```c
dg_ids_t* ids_left = 0;
dg_ids_t* ids_label = 0;
dg_relation_records_t* records = 0;
dg_relation_read_state_t state;
dg_txn_introduce;

// node ids to be used to filter
ids_left = dg_ids_add(ids_left, 123);
ids_label = dg_ids_add(ids_label, 456);

// select relations whose left side is in "ids_left" and label in "ids_label"
status_require_x(dg_relation_select(dg_txn, ids_left, 0, ids_label, 0, 0, &state))

// read 2 of the selected relations
dg_status_require_read_x(dg_relation_read(&state, 2, &records));

// read as many matching relations as there are left
dg_status_require_read_x(dg_relation_read(&state, 0, &records));

dg_relation_selection_destroy(&state);

// display records. "ordinal" might not be set in the record unless the query uses a filter for a left value
while(records) {
  record = dg_relation_records_first(records);
  printf("record: %lu %lu %lu %lu\n", record.left, record.label, record.ordinal, record.right);
  records = dg_relation_records_rest(records);
};

exit:
  if(dg_txn) dg_txn_abort;
  dg_ids_destroy(ids_left);
  dg_ids_destroy(ids_label);
  dg_relation_records_destroy(records);
```

# types
```c
dg_id_t
dg_ordinal_t
dg_txn_t
status_i_t
dg_ordinal_t (*dg_relation_ordinal_generator_t)(b0*)
status_t (*dg_relation_reader_t)(dg_relation_read_state_t*,b32,dg_relation_records_t**)
dg_data_list_t struct
  struct dg_data_list_struct* link;
  dg_data_t data;
dg_data_record_t struct
  dg_id_t id;
  size_t size;
  void* data;
dg_data_records_t struct
  struct dg_data_records_struct* link;
  dg_data_record_t data;
dg_data_t struct
  void* data;
  size_t size;
dg_ids_t struct
  struct dg_ids_struct* link;
  dg_id_t data;
dg_index_errors_extern_t struct
  uint8_t errors_p;
  dg_ids_t* different_data_extern;
  dg_ids_t* excess_data_extern;
  dg_ids_t* different_id_data;
  dg_ids_t* missing_id_data;
dg_index_errors_intern_t struct
  uint8_t errors_p;
  dg_ids_t* different_data_id;
  dg_ids_t* excess_data_id;
  dg_ids_t* different_id_data;
  dg_ids_t* missing_id_data;
dg_index_errors_relation_t struct
  uint8_t errors_p;
  dg_relation_records_t* missing_right_left;
  dg_relation_records_t* missing_label_left;
  dg_relation_records_t* excess_right_left;
  dg_relation_records_t* excess_label_left;
dg_init_options_t struct
  uint8_t read_only_p;
  size_t maximum_size_octets;
  uint32_t maximum_reader_count;
  uint8_t filesystem_has_ordered_writes_p;
  uint32_t env_open_flags;
  uint16_t file_permissions;
dg_intern_read_state_t struct
  status_t status;
  MDB_cursor* restrict cursor;
  uint8_t options;
dg_node_read_state_t struct
  status_t status;
  MDB_cursor* restrict cursor;
  uint8_t types;
  uint8_t options;
dg_ordinal_match_data_t struct
  dg_ordinal_t min;
  dg_ordinal_t max;
dg_relation_read_state_t struct
  status_t status;
  MDB_cursor* restrict cursor;
  MDB_cursor* restrict cursor_2;
  void* left;
  void* right;
  void* label;
  dg_ids_t* left_first;
  dg_ids_t* right_first;
  dg_ordinal_match_data_t* ordinal;
  uint8_t options;
  void* reader;
dg_relation_record_t struct
  dg_id_t left;
  dg_id_t right;
  dg_id_t label;
  dg_ordinal_t ordinal;
dg_relation_records_t struct
  struct dg_relation_records_struct* link;
  dg_relation_record_t data;
dg_statistics_t struct
  MDB_stat id_to_data;
  MDB_stat data_intern_to_id;
  MDB_stat data_extern_to_extern;
  MDB_stat left_to_right;
  MDB_stat right_to_left;
  MDB_stat label_to_left;
status_t struct
  status_i_t id;
  uint8_t group;
```

# enum
```c
dg_status_id_undefined, dg_status_id_input_type, dg_status_id_max_id,
dg_status_id_data_length, dg_status_id_not_implemented, dg_status_id_duplicate,
dg_status_id_memory, dg_status_id_condition_unfulfilled, dg_status_id_missing_argument_dg_root,
dg_status_id_path_not_accessible_dg_root, dg_status_id_no_more_data, dg_status_group_dg,
dg_status_group_lmdb, dg_status_group_libc
```

# routines
```c
dg_data_list_t* dg_data_list_add(dg_data_list_t* a, dg_data_t value)
dg_data_list_t* dg_data_list_drop(dg_data_list_t* a)
dg_data_records_t* dg_data_records_add(dg_data_records_t* a, dg_data_record_t value)
dg_data_records_t* dg_data_records_drop(dg_data_records_t* a)
dg_ids_t* dg_ids_add(dg_ids_t* a, dg_id_t value)
dg_ids_t* dg_ids_drop(dg_ids_t* a)
dg_init_options_t dg_init_options_set_defaults(dg_init_options_t* a)
dg_relation_records_t* dg_relation_records_add(dg_relation_records_t* a, dg_relation_record_t value)
dg_relation_records_t* dg_relation_records_drop(dg_relation_records_t* a)
size_t dg_data_list_length(dg_data_list_t* a)
size_t dg_data_records_length(dg_data_records_t* a)
size_t dg_ids_length(dg_ids_t* a)
size_t dg_relation_records_length(dg_relation_records_t* a)
status_t dg_debug_count_all_btree_entries(MDB_txn* txn, uint32_t* result)
status_t dg_debug_display_btree_counts(MDB_txn* txn)
status_t dg_debug_display_content_left_to_right(dg_txn_t* txn)
status_t dg_debug_display_content_right_to_left(dg_txn_t* txn)
status_t dg_delete(dg_txn_t* txn, dg_ids_t* ids)
status_t dg_exists_p(dg_txn_t* txn, dg_ids_t* ids, uint8_t* result)
status_t dg_extern_create(dg_txn_t* txn, uint32_t count, dg_data_t* data, dg_ids_t** result)
status_t dg_extern_data_to_id(dg_txn_t* txn, dg_data_t data, dg_ids_t** result)
status_t dg_extern_id_to_data(dg_txn_t* txn, dg_ids_t* ids, uint8_t every_p, dg_data_list_t** result)
status_t dg_extern_update(dg_txn_t* txn, dg_id_t id, dg_data_t data)
status_t dg_id_create(dg_txn_t* txn, uint32_t count, dg_ids_t** result)
status_t dg_identify(dg_txn_t* txn, dg_ids_t* ids, dg_ids_t** result)
status_t dg_index_errors_extern(dg_txn_t* txn, dg_index_errors_extern_t* result)
status_t dg_index_errors_intern(dg_txn_t* txn, dg_index_errors_intern_t* result)
status_t dg_index_errors_relation(dg_txn_t* txn, dg_index_errors_relation_t* result)
status_t dg_index_recreate_extern()
status_t dg_index_recreate_intern()
status_t dg_index_recreate_relation()
status_t dg_init(uint8_t* dg_root_argument, dg_init_options_t* options)
status_t dg_intern_data_to_id(dg_txn_t* txn, dg_data_list_t* data, uint8_t every_p, dg_ids_t** result)
status_t dg_intern_ensure(dg_txn_t* txn, dg_data_list_t* data, dg_ids_t** result)
status_t dg_intern_id_to_data(dg_txn_t* txn, dg_ids_t* ids, uint8_t every_p, dg_data_list_t** result)
status_t dg_intern_update(dg_txn_t* txn, dg_id_t id, dg_data_t data)
status_t dg_node_read(dg_node_read_state_t* state, uint32_t count, dg_data_records_t** result)
status_t dg_node_select(dg_txn_t* txn, uint8_t types, uint32_t offset, dg_node_read_state_t* state)
status_t dg_relation_delete(dg_txn_t* txn, dg_ids_t* left, dg_ids_t* right, dg_ids_t* label, dg_ordinal_match_data_t* ordinal)
status_t dg_relation_ensure(dg_txn_t* txn, dg_ids_t* left, dg_ids_t* right, dg_ids_t* label, dg_relation_ordinal_generator_t ordinal_generator, void* ordinal_generator_state)
status_t dg_relation_read(dg_relation_read_state_t* state, uint32_t count, dg_relation_records_t** result)
status_t dg_relation_select(dg_txn_t* txn, dg_ids_t* left, dg_ids_t* right, dg_ids_t* label, dg_ordinal_match_data_t* ordinal, uint32_t offset, dg_relation_read_state_t* result)
status_t dg_statistics(dg_txn_t* txn, dg_statistics_t* result)
uint8_t dg_extern_p(dg_id_t id)
uint8_t dg_id_p(dg_id_t id)
uint8_t dg_intern_p(dg_id_t id)
uint8_t dg_intern_small_p(dg_id_t id)
uint8_t dg_relation_p(dg_id_t id)
uint8_t* dg_status_description(status_t a)
uint8_t* dg_status_description(status_t a)
uint8_t* dg_status_group_id_to_name(status_i_t a)
uint8_t* dg_status_group_id_to_name(status_i_t a)
uint8_t* dg_status_name(status_t a)
uint8_t* dg_status_name(status_t a)
void dg_debug_display_relation_records(dg_relation_records_t* records)
void dg_debug_log_ids(dg_ids_t* a)
void dg_debug_log_ids_set(imht_set_t a)
void dg_exit()
void dg_node_selection_destroy(dg_node_read_state_t* state)
void dg_relation_selection_destroy(dg_relation_read_state_t* state)
```

# macros
```c
dg_data_list_first
dg_data_list_first_address
dg_data_list_rest
dg_data_records_first
dg_data_records_first_address
dg_data_records_rest
dg_define_ids(name)
dg_define_ids_2(name_1, name_2)
dg_define_ids_3(name_1, name_2, name_3)
dg_id_compare(a, b)
dg_id_equal_p(a, b)
dg_id_max
dg_id_type_step
dg_ids_add_x(target, source, ids_temp)
dg_ids_first
dg_ids_first_address
dg_ids_rest
dg_intern_small_data_to_id(data)
dg_intern_small_id_to_data(id)
dg_null
dg_ordinal_compare
dg_pointer_allocation_set(result, expression, result_temp)
dg_pointer_to_id(a, index)
dg_read_option_initialised
dg_read_option_is_set_left
dg_read_option_is_set_right
dg_read_option_skip
dg_relation_data_set_both(a, ordinal, id)
dg_relation_data_set_id(a, value)
dg_relation_data_set_ordinal(a, value)
dg_relation_data_to_id(a)
dg_relation_data_to_ordinal(a)
dg_relation_records_first
dg_relation_records_first_address
dg_relation_records_rest
dg_size_octets_data_max
dg_size_octets_data_min
dg_size_octets_id
dg_size_octets_ordinal
dg_size_octets_relation_data
dg_size_octets_relation_key
dg_status_require_read_x(expression)
dg_status_set_id_goto(status_id)
dg_status_success_if_mdb_notfound
dg_status_success_if_no_more_data
dg_txn_abort
dg_txn_begin
dg_txn_commit
dg_txn_introduce
dg_txn_write_begin
dg_type_bit_extern
dg_type_bit_id
dg_type_bit_intern
dg_type_bit_intern_small
dg_type_extern
dg_type_id
dg_type_intern
dg_type_intern_small
dg_type_mask
dg_type_p(dg_type_name, id)
status_failure_p
status_goto
status_group_undefined
status_id_is_p(status_id)
status_id_success
status_init
status_require
status_require_x(expression)
status_reset
status_set_both(group_id, status_id)
status_set_both_goto(group_id, status_id)
status_set_group(group_id)
status_set_group_goto(group_id)
status_set_id(status_id)
status_set_id_goto(status_id)
status_success_p
```

# variables
```c
dg_index_errors_extern_t dg_index_errors_extern_null
dg_index_errors_intern_t dg_index_errors_intern_null
dg_index_errors_relation_t dg_index_errors_relation_null
uint8_t dg_initialised
uint8_t* dg_root
```

# additional features and caveats
* the data type of node identifiers for new databases can be set at compile time. currently identifiers can not be structs or arrays (possible extension)
* returned data pointers, not other values like integers, are only valid until the transaction is finished (for example aborted or committed)
* returned data pointers point to data that is immutable or should be treated as such
* make sure that dg-data-list*, dg-ids-t*, dg-relation-records-t* and dg-data-records-t* are set to 0 before adding the first element or otherwise the first link will likely point to a random memory initialisation address
* all readers add elements until they fail, which means there might be elements that need deallocation after an error occurred
* transactions are lmdb transactions. when in doubt you may refer to the documentation of lmdb. dg_txn_begin introduces a local variable dg_txn that carries the current local transaction
* there can/should only be one active transaction per thread. nested transactions are possible though
* make sure that you do not try to insert ordinals or ids bigger than what is defined to be possible by the data types for ordinals and node identifiers. otherwise numerical overflows might occur
* updating an ordinal value requires the re-creation of the corresponding relation. ordinals are primarily intended to store data in a pre-calculated order for fast ordered retrieval but not for frequent updates. for example, the ordinal field is perhaps not well suited for storing a constantly changing vote count or weight value

# customising the data type for node identifiers and ordinals and the maximum size for data
* custom data types can be specified with preprocessor definitions before compiling the sph-dg library in a file named config.c
* internally stored data can not be greater than lmdb max keysize, which is 4088 bit by lmdb default but can be adjusted by configuring and recompiling lmdb and then setting "dg-size-octets-data-max" in config.c
* node identifier data types that are not scalar/basic c datatypes are currently not fully possible only because of the use of the "imht-set" hashtable and "mi-list" linked lists (because they do not work with pointers to values)

# dg-relation-select and dg-relation-read
* "dg-relation-select" internally chooses the reader, relevant databases and other values to use for the search
* if a filter parameter is set to a non-zero value, it is used, otherwise no filter is used for that parameter
* search strategies for most filter combinations are pre-coded to make it fast

## features
* partial reads with max read count: the reader can be called repeatedly to get the next results from the full result set
* optional filter by left, right, label, minimum and maximum ordinal/weight. for filtering by ordinal, "left" filter values must be given
* offset: selected results begin after ``n`` matches
* can return results and indicate the end of the data stream in one call

## development
this section is only intended for when you want to change the core sph-dg itself.

* make sure that the field "mv-size" of dg-null is never set to something other than zero, as it is assumed to be zero. failure to do so can lead to index corruption (which the validator routines can detect)
* "mdb_cursor_close" seems to be ok with null pointers like "free"
* struct dg-index-errors-*-t field names have the format {type-of-error}-{source-key-name}-{source-value-name}
* each node type has a primary dbi and may have secondary dbis that are indexes. for example the primary index for relations is left->right. if a relation is missing in left->right, it is not considered missing and not recreated from other dbi
* when values from an secondary dbi are not found in the primary dbi, they are excess values. when values from a primary dbi are not found in a secondary dbi, they are missing values. with these terms, which are used in the validator routines, one can discern the location of errors

## dg-relation-select
* positions every relevant mdb cursor at the first entry of the dbi or exits with an error status if the database is empty
* chooses an appropiate reader routine
* applies the read offset

## dg-relation-read
readers of type "dg-relation-reader-t" support the following:
* partial reads. for example reading up to ten matches at a time. this is why a state object is used (this has many subtle consequences, like having to get the current cursor value at the beginning of the reader code to check if the right key or data is already set)
* skipping: matching but not copying the result. this is used for preparing reads from an offset
* cursors as arguments are always in position at a valid entry. the reader ends as soon as all results have been read or an error has occurred and eventually rejects additional calls with the same state by returning the same end-of-data or error result
* readers and deleters are built using stacked goto labels because this makes it much easier in this case to control the execution flow, compared to the alternative of nested while loops. especially for choosing the best place for evaluating the read-count stop condition
* dg-relation-read-1001-1101 is a good example of how queries with ordinals make the code more complicated (range lookups), and why using ordinals is only supported when a filter on "left" is given

## dg-relation-delete
* dg-relation-delete differs from dg-relation-read in that it does not need a state because it does not support partial processing
* it also differs in that it always needs to use all three relation dbi to complete the deletion instead of just any dbi necessary to match relations

# other language bindings
* scheme: [sph-dg-guile](https://github.com/sph-mn/sph-dg-guile)