about databases

classification of databases

by the main data structure used

b-tree: used in most databases for its uniform performance characteristics

adjacency list: often used in graph databases for its element locality

by the type of data they associate


type, is-table, id -> packed-data
type, is-index, data -> id ...

stores records by type and uses indexes to skip over many records when matching specific field values. can be thought of as a table (record type) with rows (records) that have columns (fields). multiple records of the same type might be stored together to make iterating over all records of a type faster

example: relational databases, databases that use sql
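a minimal sketch in python of the two mappings above: records stored by (type, id), plus a secondary index from (type, field, value) back to ids so matching can skip non-matching records. all names here are illustrative, not taken from any particular database

```python
tables = {}   # (type, id) -> record dict: the "is-table" mapping
indexes = {}  # (type, field, value) -> set of ids: the "is-index" mapping

def insert(rtype, rid, record):
    tables[(rtype, rid)] = record
    # maintain the index for every field of the record
    for field, value in record.items():
        indexes.setdefault((rtype, field, value), set()).add(rid)

def find(rtype, field, value):
    # the index lets the lookup skip all non-matching records
    ids = sorted(indexes.get((rtype, field, value), set()))
    return [tables[(rtype, rid)] for rid in ids]

insert("user", 1, {"name": "ada", "city": "london"})
insert("user", 2, {"name": "alan", "city": "london"})
insert("user", 3, {"name": "kurt", "city": "vienna"})

london_users = find("user", "city", "london")  # records 1 and 2
```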


data -> data

associates arbitrary data, one part being a key used to refer to the other part, the value. corresponds directly to the fundamental b-tree operations

example: key-value databases
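a sketch of the fundamental ordered operations a b-tree backed key-value store provides (get, put, range scan), emulated here with a sorted key list instead of an actual b-tree. the key/value names are made up for illustration

```python
import bisect

keys, values = [], []  # kept sorted by key, standing in for a b-tree

def put(k, v):
    i = bisect.bisect_left(keys, k)
    if i < len(keys) and keys[i] == k:
        values[i] = v          # overwrite existing key
    else:
        keys.insert(i, k)      # insert in sorted position
        values.insert(i, v)

def get(k):
    i = bisect.bisect_left(keys, k)
    return values[i] if i < len(keys) and keys[i] == k else None

def scan(lo, hi):
    # range scans are cheap because keys are kept in order
    i, j = bisect.bisect_left(keys, lo), bisect.bisect_right(keys, hi)
    return list(zip(keys[i:j], values[i:j]))

put(b"user:2", b"alan")
put(b"user:1", b"ada")
found = get(b"user:1")
prefix = scan(b"user:", b"user:~")  # all keys under the "user:" prefix
```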



each distinct data pattern is associated with a uniform identifier to refer to it


parent -> child ...

example: filesystem
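the parent -> child mapping can be sketched as a dictionary from a parent path to its child names, roughly how a filesystem directory tree associates data. paths and names here are hypothetical

```python
children = {"/": []}  # parent path -> list of child names

def mkdir(parent, name):
    # record the child under its parent and create an empty entry for it
    children[parent].append(name)
    children[parent.rstrip("/") + "/" + name] = []

mkdir("/", "home")
mkdir("/home", "ada")
```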

labeled directed graph

node label -> node ...

associates one element with possibly multiple other elements, with a label and a direction per relation. labels allow relations to carry multiple meanings instead of, for example, only containment. typically optimised for traversing long paths of linked nodes
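the node label -> node mapping above can be sketched as an adjacency structure keyed by (source node, label), giving each relation a label and a direction. node and label names are made up

```python
edges = {}  # (source node, label) -> set of target nodes

def link(src, label, dst):
    edges.setdefault((src, label), set()).add(dst)

def neighbours(src, label):
    # traversal follows edges of a given label in their direction
    return edges.get((src, label), set())

link("alice", "follows", "bob")
link("alice", "follows", "carol")
link("bob", "likes", "post1")

followed = neighbours("alice", "follows")
```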

by interface

callable units

for example routines in a programming language

benefit: very little overhead for queries

domain specific language

for example sql

downsides: parsing and optimisation of the query language increase complexity and processing cost, and query language strings have to be constructed, which also has a cost

benefits: compared to function call semantics, can simplify composition with sub-selects, joins, groups and other higher level programming operations on the data
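both points can be seen with python's sqlite3 module: the query is a constructed string that the database must parse and optimise, but grouping and ordering compose declaratively inside it (in-memory database, illustrative table name)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table users (id integer primary key, city text)")
con.executemany("insert into users (city) values (?)",
                [("london",), ("london",), ("vienna",)])

# the query is a string: built, then parsed and optimised by the engine,
# yet it composes grouping, aggregation and ordering in one expression
rows = con.execute(
    "select city, count(*) from users group by city order by city").fetchall()
```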

network distribution

a single database host computer has limits in processing power, available storage space and might fail with potential data loss. to overcome the limitations of single hosts, the database can be distributed across multiple hosts. this is part of what is often called scaling (of capabilities to needs)

for increasing the total available processing power, the full data set can be copied to all hosts (replication) and database requests divided between hosts (load balancing). to keep changing data synchronised on multiple hosts, usually a kind of log is used that records changes made to a database (commit log), and other hosts receive and apply changes from this log. without a commit log, either the whole data set would always need to be copied and rewritten, or the differences between the actual data sets would have to be calculated by comparing all data elements, both of which are much more costly than keeping a log.

the log typically stores a limited number of past changes to save space, which means that hosts that have not applied changes outside the range of available log entries have to resort to a full copy or a difference calculation, of which the copy is probably faster. the number of hosts that can make changes to the data is typically limited (primary/replica) to avoid the processing and data transfer overhead of every host connecting to every other to look for and apply changes and keeping a log. it is therefore more difficult to scale for write operations than for read operations
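the commit log idea can be sketched as follows: a primary appends each change to a log, and a replica applies only the entries past the position it has already seen. names and the (key, value) change format are illustrative

```python
log = []  # ordered change entries on the primary: (key, value)

def primary_write(store, key, value):
    # apply the change locally and record it in the commit log
    store[key] = value
    log.append((key, value))

def replica_sync(store, position):
    # apply only the changes the replica has not seen yet,
    # instead of copying or diffing the whole data set
    for key, value in log[position:]:
        store[key] = value
    return len(log)  # new position in the log

primary, replica = {}, {}
pos = 0
primary_write(primary, "a", 1)
primary_write(primary, "b", 2)
pos = replica_sync(replica, pos)
```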

for increasing the total available storage space, the data set can be split and distributed to hosts. if there were only one copy of each part and one host failed, data would be unavailable or lost. the common solution is to distribute copies (redundancy) of parts (shards) of the data among hosts in a way that, if one host with one set of parts fails, other hosts still collectively have copies of all these parts and can continue to respond to requests for them. having copies of the data increases the total space requirement. for reliability and availability, the data needs to be distributed in a way that makes it unlikely for all copies of a part to become unavailable at the same time. for this, copies can be placed on hosts physically further away, to increase resilience against failures or downtimes of racks or whole datacenters (fires, natural disasters, etc)
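a sketch of shard placement with redundancy: a key hashes to a shard, and each shard is stored on several distinct hosts so a single host failure loses no part. host names, shard count and replica count are made-up parameters

```python
import hashlib

HOSTS = ["host0", "host1", "host2", "host3"]  # illustrative host list
SHARDS = 8
REPLICAS = 2  # copies per shard

def shard_of(key):
    # stable hash so every node computes the same shard for a key
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % SHARDS

def hosts_for(key):
    # place replicas on consecutive hosts so copies land on distinct hosts
    s = shard_of(key)
    return [HOSTS[(s + i) % len(HOSTS)] for i in range(REPLICAS)]

placement = hosts_for("user:42")
```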

to manage data across multiple hosts, a database needs ways to find which host has data relevant to a request. latency can lead to timeouts and adds uncertainty to results. further issues are automated network discovery, starting from a small list of pre-configured initial host addresses to find new nodes to synchronise with, and appropriately recognising and dealing with hosts becoming unavailable or available again

coordinator nodes

special nodes that store metadata about where data is distributed and decide how it will be distributed


identifiers have an origin-node part and nodes try to cluster elements of the same origin. nodes can give a list of what elements from which origins they contain. routing information can include this
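the origin-node idea can be sketched like this: identifiers carry an origin part, so a request can be routed toward the node that likely stores the element. the id format and node names are hypothetical

```python
def make_id(origin, local):
    # identifier with an origin-node part, e.g. "node2:item7"
    return f"{origin}:{local}"

def route(element_id, known_nodes):
    # route to the origin node if known, otherwise fall back to any node
    origin = element_id.split(":", 1)[0]
    return origin if origin in known_nodes else known_nodes[0]

eid = make_id("node2", "item7")
target = route(eid, ["node1", "node2"])
```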

benefits of databases

databases store data in a way that supports the processes that require the encoded information. it is possible, for example, to write data arrays to disk. to find elements with specific properties in such an array, the whole array might have to be searched, checking every element for the desired property to collect all matches. in this example, the array is how the data is organised, and matching elements is the process that looks at the elements of the data.

a database that supports the same process more efficiently could store the elements sorted, to skip a large number of elements without having to check them, or it could maintain another, smaller array that saves the offsets at which groups of elements with certain properties begin, like an alphabetical index that associates initial letters with page numbers, again allowing much of the data to be ignored and thereby saving processing cost. this is what databases do, and it is their benefit: they optimise the time and space complexity required for reading, writing, updating and deleting data
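the two optimisations described above in a short python sketch: keeping the array sorted so binary search skips most elements, and keeping a small offset index per initial letter, like the alphabetical index example. the word list is made up

```python
import bisect

words = sorted(["apple", "avocado", "banana", "cherry", "citrus"])

# sorted array: binary search skips most elements instead of scanning all
i = bisect.bisect_left(words, "banana")
found = words[i]

# offset index: first letter -> offset where that group of elements starts,
# like an alphabetical index associating initial letters with page numbers
index = {}
for offset, word in enumerate(words):
    index.setdefault(word[0], offset)

# jump straight to the "c" group and read matches from there
matches = [w for w in words[index["c"]:] if w.startswith("c")]
```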

storing records

see sqlite architecture documentation for a description of how a sql database can be designed

examples of databases

sqlite (sql record storage), cassandra (distributed, used by some of the largest interactive websites), neo4j (graph)

tags: overview start q1 document computer database persistence