The entropy *H(X)* of a discrete random variable *X* with probability mass function *p(x)* is:

$$H(X) = -\sum_{x} p(x)\, \log p(x)$$

Because *p(x)* is in [0, 1], *log p(x)* is non-positive, so each term *−p(x) log p(x)* is non-negative and entropy is non-negative.

If the logarithm is base 2, the units are *bits*; if the logarithm is base *e*, the units are *nats*.

Thus, a Bernoulli variable with parameter *p = 1/2* has one bit of entropy: *H = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1*.
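As a minimal numerical sketch (the function name `entropy` is my own, not from the text), the definition can be checked on the Bernoulli example:

```python
import math

def entropy(pmf):
    """Shannon entropy in bits of a discrete pmf given as a list of probabilities."""
    # By convention, 0 * log(0) is taken to be 0, so zero-probability terms are skipped.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))  # Bernoulli(1/2): 1.0 bit
print(entropy([1.0]))       # deterministic variable: zero entropy
```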

The joint entropy *H(X, Y)* of discrete random variables *X, Y* with joint distribution *p(x, y)* is:

$$H(X, Y) = -\sum_{x} \sum_{y} p(x, y)\, \log p(x, y)$$

If *X* and *Y* have joint distribution *p(x, y)*, then the conditional entropy *H(Y|X)* is:

$$H(Y|X) = \sum_{x} p(x)\, H(Y|X = x) = -\sum_{x} \sum_{y} p(x, y)\, \log p(y|x)$$

The following chain rule relates the joint and conditional entropy:

$$H(X, Y) = H(X) + H(Y|X)$$
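The chain rule can be verified numerically on a small joint distribution (the pmf below is chosen arbitrarily for illustration):

```python
import math
from collections import defaultdict

# A small joint pmf p(x, y), chosen arbitrarily for illustration
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal distribution p(x)
px = defaultdict(float)
for (x, _), p in joint.items():
    px[x] += p

H_xy = H(joint.values())  # joint entropy H(X, Y)
H_x = H(px.values())      # marginal entropy H(X)
# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), with p(y|x) = p(x,y)/p(x)
H_y_given_x = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items() if p > 0)

# Chain rule: H(X, Y) = H(X) + H(Y|X)
assert abs(H_xy - (H_x + H_y_given_x)) < 1e-9
```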

The relative entropy (Kullback-Leibler distance) between two probability mass functions *p(x)* and *q(x)* is:

$$D(p \| q) = \sum_{x} p(x)\, \log \frac{p(x)}{q(x)}$$

*(The KL divergence appears, for example, as the objective minimized by t-SNE.)*

The mutual information of discrete random variables *X* and *Y* with joint distribution *p(x, y)* and marginal distributions *p(x)* and *p(y)* is:

$$I(X; Y) = \sum_{x} \sum_{y} p(x, y)\, \log \frac{p(x, y)}{p(x)\, p(y)}$$

That is, *I(X; Y)* is the relative entropy between the joint distribution and the product of the marginals.

*(Mutual information appears, for example, as the information gain used to choose splits in decision trees.)*
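As a sketch (the joint pmf is chosen arbitrarily), mutual information can be computed from the definition and cross-checked against the identity *I(X; Y) = H(X) + H(Y) − H(X, Y)*:

```python
import math
from collections import defaultdict

# Arbitrary joint pmf p(x, y) for illustration
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals p(x) and p(y)
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

# I(X; Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Equivalent identity: I(X; Y) = H(X) + H(Y) - H(X, Y)
assert abs(mi - (H(px.values()) + H(py.values()) - H(joint.values()))) < 1e-9
print(mi)
```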

More chain rules relate these quantities, for example:

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

$$H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1})$$

A function *f(x)* is convex on an interval *(a, b)* if for every *x₁* and *x₂* in the interval and *λ ∈ (0, 1)*:

$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$$
A function *f* is concave if *–f* is convex. The logarithm function is concave.

**Jensen's inequality:** If *f* is a convex function and *X* is a random variable, then:

$$E[f(X)] \ge f(E[X])$$
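A quick numerical check of the inequality, using the convex function *f(x) = x²* and an arbitrarily chosen discrete distribution:

```python
# Convex function and a simple discrete random variable (values with probabilities)
f = lambda x: x * x
values = [0, 1, 2, 3]
probs = [0.1, 0.2, 0.3, 0.4]

E_X = sum(p * x for p, x in zip(probs, values))    # E[X]
E_fX = sum(p * f(x) for p, x in zip(probs, values))  # E[f(X)]

# Jensen's inequality: E[f(X)] >= f(E[X]) for convex f
assert E_fX >= f(E_X)
print(E_fX, f(E_X))
```

Here *E[X] = 2*, so *f(E[X]) = 4*, while *E[X²] = 5*, consistent with the inequality; the gap *E[X²] − (E[X])²* is exactly the variance of *X*.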