Hash

Index
Introduction
Dictionary Problem
Methods to Achieve
Table Doubling
String Matching
Open Addressing

Introduction

A hash function maps a set of keys to a set of values (hashes). For example, the set A = {w | w is an English word that starts with the letter “a”} can be mapped by a hash function h(w) to B = {k | k ∈ Z and 1 ≤ k ≤ 26} by taking the alphabetical position of the last letter of any w ∈ A, e.g., h(apple) = 5 and h(append) = 4.

Hash Value   Content
1            astasia, Asia, …
2            absorb, adverb, …
26           abuzz, …

Two keys can have the same hash value; this condition is called a collision. Colliding elements can be stored in a linked list so that each element can be accessed individually; this method is called chaining.

Dictionary Problem

We wish to construct an abstract data type (ADT) that maintains a set of items, each with a key, subject to insert, delete, and search operations. The goal of hashing is to construct such an ADT (a dictionary) that costs O(1) time per operation. To do so, we expect m = the number of slots to be on the order of n = the number of keys, and the hashing should be uniform. We define the load factor α = n/m. Uniform hashing gives Θ(1 + α) expected running time for a search operation, which simplifies to O(1) if α = O(1), i.e., m = Ω(n).
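The dictionary ADT with chaining can be sketched as follows (a minimal illustration; the class and method names are my own, and Python's built-in hash stands in for the hash function):

```python
# Chaining dictionary sketch: each slot holds a list (chain) of (key, value) pairs.
class ChainedDict:
    def __init__(self, m=8):
        self.m = m                                # number of slots
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        return hash(key) % self.m                 # stand-in hash function

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                          # key present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None                               # not found

    def delete(self, key):
        idx = self._h(key)
        self.slots[idx] = [(k, v) for k, v in self.slots[idx] if k != key]

d = ChainedDict()
d.insert("apple", 1)
d.insert("append", 2)
```

With uniform hashing each chain has expected length α = n/m, so each operation scans Θ(1 + α) entries in expectation.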

Methods to Achieve

We introduce three methods to achieve this performance; the third method works well.

1. division method:  h(k) = k mod m

This method is practical when m is a prime that is not close to a power of 2 or 10.

2. multiplication method: h(k) = [(a*k) mod 2^w] >> (w−r), where a is random, k has w bits, and m = 2^r.

This method is practical when a is odd, 2^(w−1) < a < 2^w, and a is not too close to either end of that range.
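A minimal sketch of the division and multiplication methods, with illustrative constants (w = 64 and the specific values of m and a are my own choices, not from the notes):

```python
# Division and multiplication methods; constants are illustrative.
w, r = 64, 10                       # word size and table-size exponent, m = 2^r
m_div = 786433                      # a prime (3 * 2^18 + 1) not close to a power of 2 or 10
a = 0x9E3779B97F4A7C15              # an odd 64-bit multiplier standing in for "random a"

def h_div(k):
    return k % m_div                # division method: h(k) = k mod m

def h_mul(k):
    # multiplication method: h(k) = [(a*k) mod 2^w] >> (w - r)
    return ((a * k) % (1 << w)) >> (w - r)
```

The shift keeps the top r bits of the low word of a*k, which is where the multiplier mixes the bits of k most thoroughly.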

3. universal hashing: h(k) = [(a*k+b) mod p] mod m, where a and b are random integers ∈ {0, 1, …, p-1} and p is a large prime number > |U|. U is the universe of keys.

For any worst-case keys k1 ≠ k2, the probability of collision over the random choice of a and b is P(h(k1) = h(k2)) ≤ 1/m. This indeed gives constant expected search time.
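A sketch of universal hashing (the particular prime p is an illustrative choice, assumed larger than any key; the standard analysis draws a from {1, …, p−1}, i.e., a ≠ 0):

```python
import random

# Universal hashing sketch: h(k) = [(a*k + b) mod p] mod m.
p = (1 << 61) - 1                    # a Mersenne prime, assumed > |U|
m = 1024
a = random.randrange(1, p)           # random a != 0
b = random.randrange(0, p)           # random b

def h_univ(k):
    return ((a * k + b) % p) % m
```

Because a and b are chosen at random, no fixed pair of distinct keys collides with probability more than about 1/m, regardless of how adversarial the keys are.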

Table Doubling

We want m = Θ(n) at all times, but n changes due to insert and delete operations; hence we must resize the hash table and rehash. In short, when n reaches m, we double the table size; when n drops to m/4, we halve it. Amortized analysis shows that this gives O(1) amortized cost for both insert and delete. Overall, we achieve O(1) cost for insert, delete, and search.
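The doubling/halving rule can be sketched on top of a chaining table (an illustrative layout; the minimum size of 8 and the use of Python's built-in hash are my own choices):

```python
# Table doubling sketch: grow when n == m, shrink when n == m // 4.
class ResizingTable:
    def __init__(self):
        self.m = 8                                 # never shrink below this
        self.n = 0
        self.slots = [[] for _ in range(self.m)]

    def _rehash(self, new_m):
        old = self.slots
        self.m = new_m
        self.slots = [[] for _ in range(new_m)]
        for chain in old:                          # reinsert every item
            for k, v in chain:
                self.slots[hash(k) % self.m].append((k, v))

    def insert(self, key, value):
        chain = self.slots[hash(key) % self.m]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.n += 1
        if self.n == self.m:                       # table full: double
            self._rehash(2 * self.m)

    def delete(self, key):
        chain = self.slots[hash(key) % self.m]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain.pop(i)
                self.n -= 1
                if self.m > 8 and self.n == self.m // 4:   # sparse: halve
                    self._rehash(self.m // 2)
                return

    def search(self, key):
        for k, v in self.slots[hash(key) % self.m]:
            if k == key:
                return v
        return None
```

Shrinking at m/4 rather than m/2 leaves slack between the grow and shrink thresholds, so an alternating insert/delete sequence cannot trigger a rehash on every operation.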

String Matching

Given two strings s and t, we wish to find all occurrences of s as a substring of t. The simple algorithm is to compare s against every substring of t of length |s|. This takes O(|s|·|t|) time, which can potentially be quadratic.

The Karp-Rabin algorithm uses the simple division method for hashing. Start with the first substring of t of length |s|; for this and every subsequent substring x of t with |x| = |s| (obtained by appending the next letter of t and deleting the first letter), compute the hash value and compare it with the hash value of s. If h(x) matches h(s), then perform a letter-by-letter comparison of x and s. Doing so reduces the expected time cost to O(|t|). Note that the m in the division method is chosen to be a random prime number > |s|. The string s and the substrings x of t are treated as multi-digit numbers in base a, where a depends on the character encoding; for example, a = 256 for ASCII. In base a, if u(x) is the number associated with string x, then u(x appended by a letter c) = u(x)·a + u(c), and u(x with its first letter c deleted) = u(x) − u(c)·a^(|x|−1).
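The rolling-hash update above can be sketched as follows (a = 256 for ASCII; for simplicity this sketch fixes an illustrative prime p, whereas the notes choose a random prime > |s|):

```python
def karp_rabin(s, t, a=256):
    """Return the start indices where s occurs as a substring of t."""
    n, ls = len(t), len(s)
    if ls == 0 or ls > n:
        return []
    p = 1_000_000_007                  # illustrative fixed prime (notes: random prime > |s|)
    hs = hx = 0
    for c in s:                        # hash of the pattern s
        hs = (hs * a + ord(c)) % p
    for c in t[:ls]:                   # hash of the first window of t
        hx = (hx * a + ord(c)) % p
    a_pow = pow(a, ls - 1, p)          # a^(|s|-1) mod p, weight of the first letter
    matches = []
    for i in range(n - ls + 1):
        if hx == hs and t[i:i + ls] == s:      # verify letter by letter on hash match
            matches.append(i)
        if i + ls < n:
            # slide window: delete first letter t[i], append next letter t[i+ls]
            hx = (hx - ord(t[i]) * a_pow) % p
            hx = (hx * a + ord(t[i + ls])) % p
    return matches
```

Each window update is O(1), so apart from verification on hash matches the scan over t is linear.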

Open Addressing

When we want at most 1 item per hash table slot (this requires m >= n, of course), we can use open addressing instead of chaining to deal with collisions. We introduce probing: the hash function takes both the key and the trial count as input, producing a sequence of slots to try; on a collision, we move on to the next slot in the probe sequence.
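As one concrete probing strategy, linear probing uses h(k, i) = (h'(k) + i) mod m. A minimal sketch (names are illustrative; deletion needs a tombstone marker and is omitted here for brevity):

```python
# Open addressing with linear probing: at most one item per slot, requires m >= n.
class LinearProbeTable:
    EMPTY = object()                          # sentinel for an unused slot

    def __init__(self, m=16):
        self.m = m
        self.keys = [self.EMPTY] * m
        self.vals = [None] * m

    def _probe(self, key):
        base = hash(key) % self.m
        for i in range(self.m):               # trial i = 0, 1, 2, ...
            yield (base + i) % self.m

    def insert(self, key, value):
        for idx in self._probe(key):
            if self.keys[idx] is self.EMPTY or self.keys[idx] == key:
                self.keys[idx], self.vals[idx] = key, value
                return
        raise RuntimeError("table full")      # cannot happen while m >= n

    def search(self, key):
        for idx in self._probe(key):
            if self.keys[idx] is self.EMPTY:  # hit an empty slot: key absent
                return None
            if self.keys[idx] == key:
                return self.vals[idx]
        return None
```

Search can stop at the first empty slot it probes, because an insert of that key would have used that slot (or an earlier one in the same sequence).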