about databases

general information and notes

benefits of databases

databases store data in a way that supports the processes that require the encoded information. it is possible, for example, to write data arrays to disk. to find elements with specific properties in such an array, the whole array could be searched by checking every element for the desired property and collecting all matches. in this example, the array is how the data is organised, and matching elements is a process that looks at the elements of the data. a database that supports the same process more efficiently could store the elements sorted, making it possible to skip elements without checking them, or it could maintain another, smaller array that records the offsets at which groups of elements with certain properties begin, like an alphabetical index that associates initial letters with page numbers. both approaches allow much of the data to be ignored and thereby save processing cost. this is what databases do: their benefit is that they optimise the time and space complexity of reading, writing, updating and deleting data
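the offset-index idea above can be sketched in python; the word list and the prefix search are illustrative, not from any particular database:

```python
# sketch of the "alphabetical index" idea: a small auxiliary structure
# maps the first letter of a sorted array's elements to the offset
# where that letter's group begins, so a lookup can jump straight to
# the relevant region instead of scanning everything.
words = sorted(["apple", "avocado", "banana", "cherry", "citrus", "date"])

# build the index: first letter -> offset of the first element with it
index = {}
for offset, word in enumerate(words):
    index.setdefault(word[0], offset)

def find_with_index(prefix):
    # jump to the group for the initial letter
    start = index.get(prefix[0])
    if start is None:
        return []
    matches = []
    for word in words[start:]:
        if word[0] != prefix[0]:
            break  # sorted order lets us stop early
        if word.startswith(prefix):
            matches.append(word)
    return matches
```

only the group starting at the recorded offset is inspected; everything before it, and everything after the first element with a different initial letter, is skipped without being checked.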


a single database host computer has limits in processing power and available storage space, and it might fail with potential data loss. to overcome the limitations of single hosts, the database can be distributed across multiple hosts. this is part of what is often called scaling (of capabilities to needs). to increase the total available processing power, the full data set can be copied to all hosts (replication) and database requests can be divided between hosts (load balancing). to keep changing data synchronised on multiple hosts, usually a kind of log is used that records changes made to a database (commit log); other hosts receive and apply changes from this log. without a commit log, either the whole data set would always need to be copied and rewritten, or the differences between the actual data sets would have to be calculated by comparing all data elements, both of which are much more costly than keeping a log. the log typically stores a limited number of past changes to save space, which means that hosts that have missed changes outside the range of available log entries have to fall back to a full copy or a difference calculation, of which the copy is probably faster. the number of hosts that can make changes to the data is typically limited (primary/replica) to avoid the processing and data transfer overhead of every host connecting to every other host to look for and apply changes while keeping its own log. this implies that it is more difficult to scale for write operations than for read operations.

to increase the total available storage space, the data set can be split and distributed to hosts. if there is only one copy of each part and one host fails, that data would be unavailable or lost. the common solution is to distribute copies (redundancy) of parts (shards) of the data among hosts in a way that, if one host with one set of parts fails, other hosts still collectively hold copies of all those parts and can continue to respond to requests for them. having copies of the data increases the total space requirement. for reliability and availability, the data needs to be distributed in a way that makes it unlikely for all copies of a part to become unavailable at the same time. for this, copies can be placed on hosts physically further away, to increase resilience against failures or downtimes of racks or whole datacenters (fires, natural disasters, etc).

to manage data across multiple hosts, a database needs ways to find which host has data relevant to a request. latency can lead to timeouts and adds uncertainty to results. additional issues are automated network discovery, which starts from a small list of pre-configured initial host addresses to find new nodes to synchronise with, and appropriately recognising and dealing with hosts becoming unavailable or available again
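the shard-placement idea can be sketched as a small placement function; the round-robin-with-offset scheme and all names are illustrative assumptions, not how any particular database does it:

```python
# sketch: place each shard's replicas on distinct hosts so that losing
# any single host still leaves at least one copy of every shard.
def place_shards(shards, hosts, replication_factor):
    if replication_factor > len(hosts):
        raise ValueError("need at least as many hosts as replicas")
    placement = {shard: [] for shard in shards}
    for i, shard in enumerate(shards):
        for r in range(replication_factor):
            # round-robin with a per-replica offset spreads copies out,
            # so the same shard never lands twice on the same host
            placement[shard].append(hosts[(i + r) % len(hosts)])
    return placement

placement = place_shards(["s0", "s1", "s2", "s3"], ["h0", "h1", "h2"], 2)
```

with a replication factor of 2, every shard ends up on two distinct hosts; a real placement strategy would additionally prefer hosts in different racks or datacenters, as described above.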

coordinator nodes

special nodes that store metadata about where data is distributed and decide how it will be distributed
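a minimal sketch of such routing metadata, assuming a hash-based key-to-shard mapping (real systems may use ranges or consistent hashing; all names are illustrative):

```python
# sketch of a coordinator node's metadata: it records which hosts hold
# which shard and answers which hosts to ask for a given key.
import zlib

class Coordinator:
    def __init__(self, shard_map):
        # shard id -> list of hosts holding a replica of that shard
        self.shard_map = shard_map
        self.shard_ids = sorted(shard_map)

    def shard_for_key(self, key):
        # a stable hash so every coordinator maps a key the same way
        return self.shard_ids[zlib.crc32(key.encode()) % len(self.shard_ids)]

    def hosts_for_key(self, key):
        return self.shard_map[self.shard_for_key(key)]
```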


identifiers can have an origin-node part, and nodes try to cluster elements of the same origin. nodes can report a list of which origins they hold elements from, and routing information can include this
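a sketch of such origin-prefixed identifiers; the "origin:sequence" format is an assumption for illustration:

```python
# identifiers carry the node they originated on, so the origin can be
# recovered from the id itself and used for routing decisions.
def make_id(origin_node, sequence):
    return f"{origin_node}:{sequence}"

def origin_of(element_id):
    return element_id.split(":", 1)[0]

def origins_held(element_ids):
    # the per-node summary mentioned above: which origins are present
    return sorted({origin_of(e) for e in element_ids})
```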

underlying data structures

  • b-tree: used in most databases for its uniform performance characteristics
  • adjacency list: often used in graph databases for its element locality
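a minimal adjacency-list sketch, assuming nodes named by strings; each node's outgoing edges sit together in one entry, which is the locality property mentioned above:

```python
# adjacency list: each node maps to the list of its neighbours
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

def neighbours(node):
    # all edges of a node are found with a single lookup
    return graph.get(node, [])
```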

software examples

  • sqlite (sql record storage), cassandra (distributed, used for some of the biggest interactive websites), neo4j (graph)
  • sqlite architecture documentation: in-depth design information for sqlite


single value

table with only one column

double value

table with two columns corresponds to a simple key-value association or dictionary, with the first column being the primary key; the second column can be indexed for reverse dictionary lookups
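a sketch of this two-column pattern using sqlite; table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# first column is the primary key, as described above
con.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
# an index on the value column enables the reverse (value -> key) lookup
con.execute("CREATE INDEX kv_value ON kv (value)")
con.executemany("INSERT INTO kv VALUES (?, ?)",
                [("de", "german"), ("en", "english")])

forward = con.execute("SELECT value FROM kv WHERE key = ?", ("en",)).fetchone()
reverse = con.execute("SELECT key FROM kv WHERE value = ?", ("german",)).fetchone()
```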

multi value

table with many columns and selective indices; indexing only the columns needed for queries saves space
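a sketch of selective indexing in sqlite; names are illustrative, and the query plan check is just one way to see the index being used:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, bio TEXT)")
# only name is used in queries, so only name gets an index;
# bio stays unindexed and costs no extra space
con.execute("CREATE INDEX people_name ON people (name)")
con.execute("INSERT INTO people (name, bio) VALUES ('ada', 'wrote programs')")

# EXPLAIN QUERY PLAN shows the name lookup going through the index
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM people WHERE name = 'ada'"
).fetchall()
```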


  • record type / table schema, record set / table, record / row, field / column
  • class / table schema, object / row, property / column

database interfaces

  • client/server, single-file, in-memory
  • native programming language data structures for queries
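the native-data-structure interface in that last point can be sketched as follows; rows as plain dicts and a comprehension as the query are illustrative assumptions:

```python
# querying with the host language's own data structures instead of a
# separate query language
rows = [
    {"name": "ada", "born": 1815},
    {"name": "alan", "born": 1912},
]

# the equivalent of: SELECT name FROM rows WHERE born > 1900
names = [row["name"] for row in rows if row["born"] > 1900]
```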

domain specific language for queries

  • for example sql
  • downsides: parsing and optimising the query language adds complexity and processing cost, and query language strings have to be constructed, which also has a cost
  • benefits: compared to composing function calls, sub-selects, joins, groups and other higher-level operations on the data can often be expressed more simply
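a sketch of that composition benefit, grouping and filtering in one declarative sql statement; table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("a", 10), ("a", 20), ("b", 5)])

# group, aggregate and filter composed in a single statement, where a
# function-call interface would need several chained operations
totals = con.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer HAVING SUM(amount) > 10"
).fetchall()
```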