2023-02-27

about hash tables

a hash table uses data structure and a function to translate values to addresses for locations in the data structure. its special property is that it can guess memory locations to some degree, and because of that can archieve linear access time on average. the input values that are used to identify locations are usually called "keys"

hash function

considerations

range of possible key values
distribution to destination locations
calculation effort
fixed length or variable length key values

examples

identity function

simple array indexing, pointer arithmetic. this one only works if no key is equal or greater than the size of the array

array-start-address + integer-key

modulo

target-index = integer-key % last-array-index

the hash table size should be a prime or at least odd. modulo uses division, and primes have the lowest number of divisors; the number of possible collisions is reduced. even size-values would lead to even hash-values for even keys and odd hash-values for odd keys

multiplication

m = some-value-perhaps-arbitrary-chosen
A = (sqrt(5)-1)/2 = 0.6180339887
floor(m * (k * A - floor(k * A)))

so called perfect hash functions create an unique hash for every key. for the creation of such a function all possible key values have to be known and somewhat considered beforehand

collisions

because the set of possible keys is usually larger than the set of possible hashes it is possible, and likely, that the hash function creates the same hash for different keys

some ways to resolve collisions

chaining

keys map to an array of addresses to linked lists. the linked lists are used to contain possibly several elements for array locations

linear probing

when a value is about to be inserted and a collision is found, the next free location is used. every access of the value then requires a similar search

quadratic probing

like linear probing, but the next free location is found by, if necessary repeatedly, multiplying the index with itself, wrapping the array size. depending on the key values, for example if they are nearly linear or more random, this distribution can reduce clustering of keys and can make collisions less likely. linear probing might have performance advantages through memory locality

container

often a combination of arrays and linked lists

size

big enough to store the desired number of elements, and usually bigger to reduce the likelihood of collisions