Information Theory

The entropy H(X) of a discrete random variable X with probability mass function p(x) is:

(1)
\begin{align} H(X) = - \sum_{x \in \mathcal{X}} p(x) \; \log p(x) \end{align}

Because p(x) is in [0, 1], log p(x) is non-positive and entropy is non-negative.

If the logarithm is base 2, the units are in bits, if the logarithm is base e, the units are in nats.

Thus, a Bernoulli variable with parameter p = 1/2 has one bit of entropy.

The joint entropy H(X, Y) of discrete random variables X, Y with join distribution p(x, y) is:

(2)
\begin{align} H(X, Y) = - \sum_{x \in \mathcal{X}} \sum_{y in \mathcal{Y}} p(x, y) \; \log p(x, y) \end{align}

If X and Y have joint distribution p(x, y), then the conditional probability H(Y|X) is:

(3)
\begin{align} H(Y\,|\,X) = \sum_{x \in \mathcal{X}} p(x)\; H(Y\,|\,X=x) \end{align}

The following chain rule relates the joint and conditional entropy:

(4)
\begin{align} H(X, Y) = H(X) + H(Y\,|\,X) \end{align}

The relative entropy (Kullback Leibler distance) of discrete random variables X and Y with distributions p(x) and q(x):

(5)
\begin{align} D_{KL}(X \parallel Y) = \sum_{x \in \mathcal{X}} p(x) \; \log \frac{p(x)}{q(x)} \end{align}

t-SNE

The mutual information of discrete random variables X and Y with joint distribution p(x, y) and marginal distributions p(x) and p(y):

(6)
\begin{align} I(X; Y) = \sum_{y \in \mathcal{Y}}\sum_{x \in \mathcal{X}} p(x, y) \; \log \frac{p(x, y)}{p(x)\, p(y)} \end{align}

decision trees

more chain rules

A function f(x) is convex on an interval (a, b) if for every x1 and x,,2 in the interval
and λ ∈ (0,1):

(7)
\begin{align} f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2) \end{align}

A function f is concave if –f is convex. The logarithm function is concave.

Jensen's inequality: If f is a convex function and X is a random variable, then:

(8)
\begin{align} E(f(X)) \geq f(E(X) \end{align}
page revision: 29, last edited: 17 Jul 2016 14:31