general overview and design considerations
a database stores data in a way that enables efficient access, manipulation, and organization. while data can be stored naively (for example, by writing arrays to disk), retrieving or modifying it then requires scanning or rewriting large parts of the data. databases avoid this inefficiency by structuring and indexing data, which reduces processing time and storage overhead for common operations like search, insertion, update, and deletion.
naive storage requires scanning an entire array to locate elements with a desired property. a database improves on this by structuring the data and maintaining indexes, so a query touches only the relevant elements instead of the whole dataset.
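a minimal sketch of the contrast above: a full scan versus a lookup through an index. the data and the index layout (a dict from value to keys) are illustrative assumptions, not a prescribed design.

```python
# hypothetical sketch: naive full scan vs. indexed lookup.
records = [("a", 3), ("b", 7), ("c", 3)]

# naive storage: find every key whose value is 3 by scanning everything, O(n).
scan_hits = [k for k, v in records if v == 3]

# database approach: maintain an index from value to keys up front,
# so the same query becomes a single dictionary lookup.
index = {}
for k, v in records:
    index.setdefault(v, []).append(k)

indexed_hits = index.get(3, [])

print(scan_hits)     # ['a', 'c']
print(indexed_hits)  # ['a', 'c']
```

the index costs extra storage and must be updated on every insert or delete; that trade-off is why databases index only the columns queried often.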
databases running on a single host are limited by cpu, memory, storage capacity, and the risk of single-point failure. scaling addresses these limits by distributing data and work across multiple nodes.
coordinator nodes maintain metadata about the cluster, such as which nodes exist and which data each node holds.
data identifiers may include an origin-node prefix. nodes often group elements by origin, which simplifies routing and allows partial routing tables that list which nodes hold data from which origins.
databases rely on data structures that optimize for their access patterns.
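one way to see access patterns driving structure choice, sketched with assumed example data: a hash map serves point lookups in O(1), while a sorted sequence with binary search serves range queries in O(log n), something a hash map cannot do efficiently.

```python
import bisect

# hypothetical sketch: pick the data structure that matches the access pattern.
by_key = {10: "x", 20: "y", 30: "z"}  # hash map: O(1) point lookups
sorted_keys = sorted(by_key)          # sorted keys: O(log n) range queries

def range_query(lo, hi):
    # binary-search the boundaries, then slice out the matching keys.
    i = bisect.bisect_left(sorted_keys, lo)
    j = bisect.bisect_right(sorted_keys, hi)
    return [by_key[k] for k in sorted_keys[i:j]]

print(by_key[20])           # point lookup: 'y'
print(range_query(15, 30))  # range query: ['y', 'z']
```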
databases organize data into tables with typed columns.
- a table with one column: useful for lists or logs.
- a table with two columns, commonly representing key-value pairs: indexing the second column allows reverse lookup.
- a table with multiple columns: selective indexing enables fast queries on relevant fields while saving space.
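the two-column case can be sketched as a key-value mapping plus a secondary index on the value column for reverse lookup; the example data is assumed.

```python
# hypothetical sketch: a two-column key-value table with a reverse index.
table = {"alice": "admin", "bob": "user", "carol": "admin"}

# secondary index on the second column: value -> set of keys.
reverse = {}
for key, value in table.items():
    reverse.setdefault(value, set()).add(key)

print(table["bob"])              # forward lookup: 'user'
print(sorted(reverse["admin"]))  # reverse lookup: ['alice', 'carol']
```

without the reverse index, answering "which keys map to admin" would require scanning the whole table.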
databases differ in how they are deployed and accessed.
structured query languages (e.g. sql) are commonly used to express database operations.
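a minimal sql example using python's built-in sqlite3 module; the table and column names are illustrative. it ties together the earlier points: typed columns, an index on the queried field, and a declarative query instead of a manual scan.

```python
import sqlite3

# in-memory database; table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "admin"), ("bob", "user")])
conn.execute("CREATE INDEX idx_role ON users (role)")  # index the queried column

# declarative query: the engine uses the index rather than scanning every row.
rows = conn.execute("SELECT name FROM users WHERE role = ?",
                    ("admin",)).fetchall()
print(rows)  # [('alice',)]
conn.close()
```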