# databases

general overview and design considerations

# purpose

a database stores data in a way that enables efficient access, manipulation, and organization. while data can be stored naively - for example, by writing arrays to disk - retrieving or modifying it then requires scanning or rewriting large parts of the data. databases avoid this inefficiency by structuring and indexing data to reduce processing time and storage overhead for common operations like search, insertion, update, and deletion.

# benefits

naive storage requires scanning an entire array to locate elements with a desired property. a database can improve on this by:

* storing elements in sorted order to allow binary search
* maintaining index structures (e.g. offset tables, hash maps, trees) to skip irrelevant data
* supporting selective access patterns rather than full traversal

in effect, databases reduce the time and space complexity of data-related operations and allow applications to scale beyond simple, monolithic storage formats.

# scaling

databases running on a single host are limited by cpu, memory, storage capacity, and risk of failure. scaling addresses these limits via distribution.

## replication

* the entire dataset is copied across multiple nodes
* enables load balancing for read operations
* updates are propagated using a commit log to synchronize changes
* nodes apply updates from this log rather than syncing full datasets
* commit logs have finite length; nodes that fall too far behind require a full resync
* typically, only one node (the primary) handles writes; others (the replicas) follow

write scalability is limited because coordination between writers is expensive. read scalability is easier since replicas can serve requests in parallel.

## sharding

* the dataset is partitioned across nodes
* increases total storage capacity
* each shard can be replicated for fault tolerance
* failure of a single node doesn't cause data loss if replicas exist

replicas should be placed across physically distinct infrastructure (e.g. different racks or data centers) to improve resilience.

## coordination and discovery

* a database must track which node holds which data
* coordinator nodes manage routing and metadata
* automatic network discovery lets nodes join and synchronize dynamically
* systems must handle host availability, rejoining, and network latency gracefully

# coordinator nodes

coordinator nodes maintain metadata about:

* which node holds which data
* how new data is distributed
* how to route requests efficiently

they do not necessarily store the full dataset, but they are essential for distributed routing and load balancing.

# descendants and routing

data identifiers may include an origin-node prefix. nodes often try to group elements by origin, which simplifies routing and allows partial routing tables listing which nodes hold data from which origins.

# underlying data structures

databases rely on data structures that optimize for their access patterns.

* b-tree: used in most general-purpose databases for balanced performance on read/write workloads
* adjacency list: used in graph databases for locality and efficient traversal of connected data

# data models

databases organize data into tables with typed columns.

## single-value

a table with one column. useful for lists or logs.

## double-value

a table with two columns, commonly representing key-value pairs. indexing the second column allows reverse lookup.

## multi-value

a table with multiple columns. selective indexing enables fast queries on relevant fields while saving space.
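to make the three models concrete, here is a minimal sketch using python's built-in sqlite3 module. the table, column, and index names are invented for illustration; it also shows the reverse-lookup index mentioned above for the double-value model, and uses an in-memory database (see the interfaces section below).

```python
import sqlite3

# in-memory database, purely for illustration
con = sqlite3.connect(":memory:")

# single-value: one column, e.g. a log of messages
con.execute("CREATE TABLE log (message TEXT)")

# double-value: key-value pairs
con.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
# indexing the second column allows reverse lookup (value -> key)
con.execute("CREATE INDEX kv_value_idx ON kv (value)")

# multi-value: several typed columns, with a selective index on
# the one field we expect to query by
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
con.execute("CREATE INDEX users_email_idx ON users (email)")

con.execute("INSERT INTO kv VALUES ('color', 'blue')")
# reverse lookup served by the index instead of a full scan
print(con.execute("SELECT key FROM kv WHERE value = 'blue'").fetchone())
```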
## conceptual parallels

* record type ≈ table schema
* record set ≈ table
* record ≈ row
* field ≈ column
* class ≈ table schema
* object ≈ row
* property ≈ column

# interfaces

databases differ in how they are deployed and accessed.

* client/server: standalone server accessed via a protocol
* single-file: embedded database stored in a single file (e.g. sqlite)
* in-memory: volatile, fast, used for caching or transient data

querying options include:

* native programming interfaces: calling database functions directly
* domain-specific query languages

# query languages

structured query languages (e.g. sql) are often used for expressing database operations.

## benefits

* powerful syntax for subqueries, joins, grouping, etc.
* allows declarative composition of complex queries

## drawbacks

* adds complexity in parsing and planning
* constructing queries via strings introduces overhead and potential injection risks

for low-level or embedded use, direct function calls or struct-based queries may be more efficient and predictable.

# software examples

* sqlite: embedded, file-based, sql
* cassandra: distributed, high-volume, column-oriented
* neo4j: graph database using adjacency lists

see: [sqlite architecture documentation](https://www.sqlite.org/arch.html)
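on the injection risk noted under query languages: most client libraries support parameterized queries, which pass user input separately from the sql text. a minimal sketch with python's sqlite3 module, reusing the hypothetical kv table from the data models sketch above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
con.execute("INSERT INTO kv VALUES ('color', 'blue')")

user_input = "blue' OR '1'='1"  # hostile input

# unsafe: string concatenation lets the input rewrite the query
# query = "SELECT key FROM kv WHERE value = '" + user_input + "'"

# safe: the driver binds the value separately from the sql text
rows = con.execute("SELECT key FROM kv WHERE value = ?", (user_input,)).fetchall()
print(rows)  # [] - the hostile string matched nothing
```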
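finally, reaching back to the scaling section, a toy sketch of hash-based shard routing with replica placement. the node names, the choice of hash, and the "next nodes in the ring" placement rule are all illustrative assumptions, not how any particular system does it.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICAS = 2  # copies per shard, ideally on physically distinct racks

def owners(key: str) -> list[str]:
    """map a key to the nodes that should hold it."""
    # stable hash so every client routes the same key the same way
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    start = h % len(NODES)
    # place replicas on the next nodes in the ring
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

print(owners("user:42"))  # e.g. ['node-c', 'node-d']
print(owners("user:43"))  # a different key may land elsewhere
```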