Modular Hash Function

Sooner or later a programmer will need to write a function for hashing a string:


#include <stdint.h>
#include <string.h>

int64_t INITIAL_VALUE = 43;

/* Modular hash of a string: evaluate a polynomial in a over the bytes of
 * s with Horner's method, reducing modulo p at each step. Note that
 * a * hash must not overflow, so a * p should fit in an int64_t. */
int64_t
hash64(const char *s, int64_t a, int64_t p) {
  size_t i, n = strlen(s);
  int64_t hash = INITIAL_VALUE;

  for (i = 0; i < n; ++i) {
    hash = (a * hash + (unsigned char)s[i]) % p;
  }

  return hash;
}

The technique illustrated above is a modular hash function.

If we have a byte string of length k, where c_i is the value of the i-th byte, then we can map each string to an integer with this function:

(1)
$$ \begin{align} \sum_{i=0}^{k-1} c_i \;256^{\,k-1-i} \end{align} $$

The function is injective if no strings have leading null bytes.

The modular hash function uses Horner's method to evaluate this polynomial efficiently, and it reduces the result modulo p to bring the final hash code into the desired range.
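Horner's method avoids computing powers explicitly, evaluating the polynomial with one multiply and one add per byte (writing a for the polynomial indeterminate):

$$ \begin{align} c_0\,a^{k-1} + c_1\,a^{k-2} + \cdots + c_{k-1} = (\cdots(c_0\,a + c_1)\,a + \cdots)\,a + c_{k-1} \end{align} $$

Reducing modulo p at every step, as the C code does, gives the same result as reducing once at the end, since modular reduction commutes with addition and multiplication; it also keeps the intermediate values small.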

However, notice that the C code does not substitute 256, the number of possible byte values, for the polynomial indeterminate. Instead it lets the caller choose the multiplier a.

What are good choices of a and p?

Knuth's Constraints

If we hash M values using a hash function with N distinct values, what is the chance of a collision? If each of the hash values is equally likely, then it is

(2)
$$ \begin{align} 1 - \frac{{N \choose M}}{\frac{N^M}{M!}} = 1 - \frac{N \cdot (N-1) \cdot (N-2) \cdots (N - M + 1)}{N^M} \approx 1 - \bigg(1 \cdot \big(e^{\frac{-1}{N}}\big) \cdot \big(e^{\frac{-2}{N}}\big) \cdots \big(e^{\frac{-(M-1)}{N}}\big)\bigg) = 1 - e^{-\frac{(M-1)(M)}{2N}} \end{align} $$

where the approximation uses the first-order Taylor expansion e^x ≈ 1 + x. This calculation is also known as the birthday problem, since it can be used to determine the chance that at least two people in a randomly chosen group share a birthday.
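A quick numerical check of the approximation on the classic birthday numbers (M = 23 people, N = 365 days); the function names here are just illustrative:

#!/usr/bin/env python3
import math

# Exact chance of a collision: 1 - N(N-1)...(N-M+1) / N^M.
def collision_prob_exact(m, n):
    p_no_collision = 1.0
    for j in range(m):
        p_no_collision *= (n - j) / n
    return 1 - p_no_collision

# Approximate chance of a collision: 1 - e^{-M(M-1)/2N}.
def collision_prob_approx(m, n):
    return 1 - math.exp(-m * (m - 1) / (2 * n))

print(collision_prob_exact(23, 365))   # about 0.507
print(collision_prob_approx(23, 365))  # about 0.500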

The Chi-squared test can be used to check whether a hash function is fair. Suppose that M values are hashed into N buckets, and let Xi be the number of values that hash to the i-th bucket. Compute this statistic:

(3)
$$ \begin{align} \sum_{i=1}^N \frac{N}{M} \bigg(X_i - \frac{M}{N}\bigg)^2 \end{align} $$

The statistic has a Chi-squared distribution with N – 1 degrees of freedom.

Using R to get the p-value:


# Chi-squared test of hash fairness: bins[i] is the number of values that
# hashed to the i-th bucket. Returns the p-value.
hash.test = function(bins) {
  m = sum(bins)       # number of values hashed
  n = length(bins)    # number of buckets
  statistic = 0
  for (bin in bins) {
    statistic = statistic + (bin - m/n)^2 / (m/n)
  }

  1 - pchisq(statistic, n - 1)
}

> hash.test(c(2,6,2,4,2,3,1))
[1] 0.434485

Fast Hash Functions

multiplicative hash function

Using a Mersenne prime to avoid the modulo function.
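A sketch of the idea, assuming the prime is p = 2^61 − 1: since 2^61 ≡ 1 (mod p), the high bits of a number can be folded back into the low bits with shifts, masks, and adds, avoiding a division:

#!/usr/bin/env python3
import random

P = (1 << 61) - 1   # the Mersenne prime 2^61 - 1

# Reduce x modulo P without a division, for x < 2^122.
def mod_mersenne(x):
    x = (x & P) + (x >> 61)   # fold the high bits into the low bits
    x = (x & P) + (x >> 61)   # a second fold absorbs any carry
    return x - P if x >= P else x

x = random.getrandbits(120)
assert mod_mersenne(x) == x % P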

Hash Tables

chaining vs linear probing

Randomized Hash Function

Generate and append a random string to the keys. Is this the same as choosing a random INITIAL_VALUE?
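A sketch exploring the question, using a Python port of hash64 (the constants a = 31 and p = 2^61 − 1 are illustrative). One observation: by Horner's rule, prepending a salt is exactly equivalent to choosing INITIAL_VALUE = hash64(salt), while appending a salt is a different transformation:

#!/usr/bin/env python3
import random

A, P = 31, 2 ** 61 - 1   # illustrative choices of a and p

# Python port of the hash64 function at the top of the page.
def hash64(s, initial_value=43):
    h = initial_value
    for c in s.encode():
        h = (A * h + c) % P
    return h

salt = '%08x' % random.getrandbits(32)

# Appending a random salt to the key, as suggested above:
h = hash64('key' + salt)

# Prepending a salt is the same as choosing a particular initial value:
assert hash64(salt + 'key') == hash64('key', initial_value=hash64(salt))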

Diffusion property: h(m) gives no information about h(n) when m ≠ n.

Cryptographic Hash Functions


$ echo foo | cksum
3915528286 4

$ echo foo | openssl md5
d3b07384d113edec49eaa6238ad5ff00

$ echo foo | openssl sha1
f1d2d2f924e986ac86fdf7b36c94bcdf32beec15

$ echo foo | openssl dgst -sha256
b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c

cksum is a 32-bit cyclic redundancy check.

MD5 is about 4 times faster to compute than SHA-1.

Google researchers found a SHA-1 collision in 2017 (the SHAttered attack).

Families of Hash Functions

family of hash functions

aka universal hashing

testing the independence of two hash functions

Bloom Filters

Bloom filters are an example of why it is useful to have a family of hash functions.

A Bloom filter is a bit array and a set of n hash functions. Each time a value is stored in the Bloom filter, we compute the n hash codes and set the corresponding bits to 1. To check whether a value is stored in a Bloom filter, we check the same n bits. If any of the bits are zero, we know for certain that the value is not stored in the Bloom filter.
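A minimal sketch, assuming the n hash functions are derived from blake2b with distinct salts (real implementations often derive them by double hashing instead):

#!/usr/bin/env python3
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    # Derive the i-th hash of a value by salting blake2b with i.
    def _indexes(self, value):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(value.encode(), salt=str(i).encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.num_bits

    def add(self, value):
        for idx in self._indexes(value):
            self.bits[idx] = 1

    # False means definitely absent; True may be a false positive.
    def might_contain(self, value):
        return all(self.bits[idx] for idx in self._indexes(value))

bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add('foo')
assert bf.might_contain('foo')    # stored values are always found
print(bf.might_contain('bar'))    # usually False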

Chance that an item is really in the Bloom filter if all bits are set...
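A standard approximation, writing m for the number of bits in the array: after M values are stored, the chance that a particular bit is still zero is (1 − 1/m)^{nM} ≈ e^{−nM/m}, so the chance that an absent value finds all n of its bits set, i.e. the false positive rate, is approximately

$$ \begin{align} \big(1 - e^{-nM/m}\big)^n \end{align} $$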

Minhashing

Suppose that we have a set P of n elements. If we enumerate the elements of P, then the subsets of P can be represented by vectors of length n, where the i-th component of the vector is 1 if the i-th element is in the subset and 0 otherwise.

For each permutation of P, there is a hash function on subsets of P: the minhash of a subset is the position of the first of its elements to appear in the permutation.

The Jaccard similarity of two sets is

(4)
$$ \begin{align} \frac{|\,A \cap B\,|}{|\,A \cup B\,|} \end{align} $$

The chance that two sets have the same minhash is equal to their Jaccard similarity: under a random permutation, the first element of A ∪ B is equally likely to be any of its elements, and the minhashes of A and B agree exactly when that element lies in A ∩ B.
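A quick simulation of this fact (the set sizes here are arbitrary):

#!/usr/bin/env python3
import random

# The minhash of a subset under a permutation: the position of the first
# of its elements to appear in the permutation.
def minhash(subset, positions):
    return min(positions[x] for x in subset)

n = 100
a = set(range(0, 60))     # |A ∩ B| = 20
b = set(range(40, 100))   # |A ∪ B| = 100, so Jaccard similarity = 0.2

trials, matches = 1000, 0
for _ in range(trials):
    positions = list(range(n))    # positions[x] = rank of element x
    random.shuffle(positions)
    if minhash(a, positions) == minhash(b, positions):
        matches += 1

print(matches / trials)   # should be close to 0.2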

Here is some code illustrating how to perform a Fisher-Yates shuffle, which is an efficient way to generate a random permutation:

#!/usr/bin/env python3
import random
import sys

if len(sys.argv) != 2:
    raise Exception('USAGE: {} N'.format(sys.argv[0]))

n = int(sys.argv[1])
nums = list(range(n))

# Fisher-Yates: walk backward through the list, swapping each position
# with a randomly chosen position at or before it. O(n) time, in place.
for i in range(n - 1, 0, -1):
    j = random.randint(0, i)
    nums[i], nums[j] = nums[j], nums[i]

print(nums)

Variance on estimates of Jaccard similarity.

Locality Sensitive Hashing

If documents are represented by sets of features (so that Jaccard similarity is well-defined), then finding all similar documents can be accomplished by a brute-force O(m²) search over pairs of documents. Locality sensitive hashing is a faster way to find the similar documents.

Given m documents and n independent hash functions, we choose r and b such that r × b = n; that is, we group the hash functions into b bands of size r.

For each band, we compute the hash functions in the band to get a signature. All documents with the same signature in some band hash to the same bucket and become candidate pairs.
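A sketch of the banding step, assuming each document already has a signature of b × r minhash values:

#!/usr/bin/env python3
from collections import defaultdict

def candidate_pairs(signatures, b, r):
    # signatures: dict mapping doc id to a list of b*r minhash values.
    pairs = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])   # band signature
            buckets[key].append(doc)
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    pairs.add((docs[i], docs[j]))
    return pairs

sigs = {'d1': [1, 2, 3, 4], 'd2': [1, 2, 9, 9], 'd3': [7, 8, 9, 9]}
# d1,d2 share band 0 and d2,d3 share band 1, so both become candidates.
print(candidate_pairs(sigs, b=2, r=2))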

analysis of the technique; plot the chance of being a candidate pair vs Jaccard similarity
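The standard analysis, assuming independent minhashes: if two documents have Jaccard similarity s, a single minhash agrees with probability s, all r hashes in a band agree with probability s^r, and the pair becomes a candidate if at least one of the b bands agrees. So the chance of being a candidate pair is

$$ \begin{align} 1 - (1 - s^r)^b \end{align} $$

Plotted against s, this is an S-shaped curve whose threshold lies near (1/b)^{1/r}.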

LSH Amplification

Suppose that we have a distance metric d and we want to find all "candidate pairs", which are pairs of items within a certain distance threshold.

Suppose also that we have a family of independent hash functions which are related to the distance metric so as to give rise to the following definition:

A (d1, d2, p, q) family is a family of hash functions such that 0 ≤ d1 ≤ d2 ≤ 1 and p > q, where if d(x, y) ≤ d1 then the probability that (x, y) is a candidate pair is at least p, and if d(x, y) ≥ d2 then the probability that (x, y) is a candidate pair is at most q.

A (d1, d2, p, q) family gives rise via the AND construction on r hash functions to a (d1, d2, p^r, q^r) family.

A (d1, d2, p, q) family gives rise via the OR construction on b hash functions to a (d1, d2, 1 − (1 − p)^b, 1 − (1 − q)^b) family.
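For example, starting from a (d1, d2, 0.8, 0.4) family, applying the AND construction with r = 4 and then the OR construction with b = 4 yields a (d1, d2, 1 − (1 − 0.8^4)^4, 1 − (1 − 0.4^4)^4) ≈ (d1, d2, 0.8785, 0.0985) family: p is pushed toward 1 while q is pushed toward 0.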

Flajolet-Martin Algorithm

Another application of a family of hash functions is estimating the number of distinct items in a stream without keeping a complete list of the items in memory.

The idea is that for each hash function, we hash the values as they arrive, and we keep track of the hash value with the largest number of rightmost zero bits. If R is the number of rightmost zero bits for that value, then 2^R is our estimate for the number of distinct values seen so far in the stream.

The hash values must be sufficiently wide: a hash of all zeros is ambiguous, since the true number of trailing zeros may have overflowed the hash width. If N distinct values are expected, then log₂(N) bits are probably sufficient.

If we use a single hash function, then our estimate will be a power of two. We can get a more accurate estimate by using multiple hash functions. When combining them, we cannot use the mean, since it is biased by large outliers, and the median alone will always be a power of two. Hence we divide the hash functions into groups of size k, take the mean within each group, and then take the median of the group means.
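A sketch with a single hash function, with blake2b standing in for a hash drawn from the family (a real implementation would combine many hashes as just described):

#!/usr/bin/env python3
import hashlib

# Number of rightmost zero bits in x; x & -x isolates the lowest set bit.
def trailing_zeros(x):
    if x == 0:
        return 0   # an all-zero 64-bit hash is vanishingly unlikely
    return (x & -x).bit_length() - 1

def fm_estimate(stream):
    r = 0
    for item in stream:
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, 'big')
        r = max(r, trailing_zeros(h))
    return 2 ** r

# 1000 distinct values; the single-hash estimate is a rough power of two.
print(fm_estimate(str(i) for i in range(1000)))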

Hash Trick

as used by Vowpal Wabbit
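A sketch of the trick for feature vectorization: feature names are hashed directly to indexes of a fixed-size weight vector, so no dictionary from names to columns is ever stored. (Vowpal Wabbit uses MurmurHash; md5 here is illustrative, and the bucket count is arbitrary.)

#!/usr/bin/env python3
import hashlib

NUM_BUCKETS = 2 ** 18

def feature_index(name):
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest[:8], 'big') % NUM_BUCKETS

# features: a dict mapping feature name to value.
def vectorize(features):
    vec = {}
    for name, value in features.items():
        idx = feature_index(name)
        vec[idx] = vec.get(idx, 0.0) + value   # colliding features just add
    return vec

print(vectorize({'word=hash': 1.0, 'word=trick': 1.0}))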