a tour of statistical learning using R and scikit-learn

intro | r | scikit learn

nearest neighbors | linear regression | logistic regression | linear discriminant analysis | naive bayes | support vector machine | decision tree | random forest | boosting | neural network

data sets

intro

Statistical learning is a set of techniques for building a model which attempts to predict a response variable from one or more predictor variables.

Labeled data are tuples of predictor values with their associated response values. Typically one divides the labeled data into a training set and a test set. The model is built using the training set and then the test set is used to evaluate the performance of the model.

The test set guards against overfitting, which is when a model performs well on the data used to train it, but poorly on unseen data.

Variables can be quantitative or categorical. When a variable is categorical, it can take one of k unordered values. We can convert a categorical variable to k - 1 dummy variables: at most one of the dummy variables is set to 1, and all of them are 0 when the categorical variable takes the baseline value.
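
For example, here is a sketch of the k - 1 dummy encoding using pandas; the color variable is made up for illustration:

import pandas as pd

# a categorical variable with k = 3 values; 'blue' acts as the baseline
color = pd.Series(['blue', 'green', 'red', 'blue'])

# drop_first=True keeps k - 1 = 2 dummy columns (color_green, color_red);
# a 'blue' row has both dummies equal to 0
dummies = pd.get_dummies(color, prefix='color', drop_first=True)
print(dummies)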

When the response variable is quantitative, a model which predicts it is said to be a regression model. When the response variable is categorical, the model is a classification model.

r

How to list the variables in scope and how to inspect their type:

> x = 3
> ls()
[1] "x"

> class(x)
[1] "numeric"
> str(x)
 num 3

How to get documentation for a function:

> ?mean

R has a data type called data frame which is equivalent to a relational database table.

The variable iris is always in scope and contains a data frame:

> dim(iris)
[1] 150   5

> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

The columns are vectors and can be accessed using index notation, dollar sign notation, or the with command. Notice the difference in type of the last expression:

> class(iris[, "Sepal.Length"])
[1] "numeric"

> class(iris$Sepal.Length)
[1] "numeric"

> with(iris, class(Sepal.Length))
[1] "numeric"

> class(iris["Sepal.Length"])
[1] "data.frame"

The attach command makes all of the columns available as variables:

> attach(iris)
> class(Sepal.Length)
[1] "numeric"

How to get a single row of a data frame:

> iris[1, ]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa

> class(iris[1, ])
[1] "data.frame"

The columns of a data frame can have different types:

> class(iris$Species)
[1] "factor"

explain factors

loading data

scikit learn

scikit-learn is a Python library which can be installed with pip:

$ sudo pip install scikit-learn

scikit-learn uses numpy, a Python library which provides a homogeneous multidimensional array type called ndarray.

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()

>>> type(iris.data)
<class 'numpy.ndarray'>

>>> iris.data.ndim
2
>>> iris.data.shape
(150, 4)

>>> iris.data[0, 0]
5.0999999999999996
>>> iris.data[0, :]
array([ 5.1,  3.5,  1.4,  0.2])
>>> iris.data[0:3, :]
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2]])

scipy sparse arrays

load_svmlight_file

nearest neighbors

The idea is to find the k nearest neighbors in the labeled data to an unlabeled tuple and use them to make the prediction. If the response is numeric, average the responses of the neighbors. If the response is categorical, use a majority vote.
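
A minimal sketch of the idea in plain Python (not how the class package or scikit-learn below implement it), classifying one point by majority vote among its k nearest training points:

from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # squared Euclidean distance from x to every training point
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    # indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # majority vote among the neighbors' labels
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# made-up data: two predictors, two classes
train_X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
train_y = ['a', 'a', 'b', 'b']
print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))   # 'a'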

> require('class')

> nrow(iris)
[1] 150
> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
> iris.predictors = iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
> iris.response = iris[, c("Species")]
> train = runif(150) > 0.5
> knn.model = knn(iris.predictors[train, ], iris.predictors[!train, ], iris.response[train], k=3)
> table(knn.model, iris.response[!train])

knn.model    setosa versicolor virginica
  setosa         21          0         0
  versicolor      0         26         0
  virginica       0          4        21
> mean(knn.model == iris.response[!train])
[1] 0.9444444

Nearest neighbors works best when there is a lot of training data and few predictors, say four or fewer.

The reason lots of predictors don't work well is that in high dimensions, the k nearest neighbors are often quite far away.

Here is a graph showing the radius of the neighborhood needed to capture a given fraction of the training points, assuming they are uniformly distributed in the unit p-cube; the curves plot r = f^(1/p). The dotted line marks the neighborhoods which capture 10% of the points.

> x = seq(0, 1, .01)
> plot(x, x, type='l', xlab='fraction of volume of unit p-cube', ylab='radius of neighborhood')
> lines(x, x^.5, col='red')
> lines(x, x^.33333333, col='green')
> lines(x, x^.25, col='orange')
> lines(x, x^.1, col='blue')
> lines(x, x^.2, col='purple')
> lines(rep(.1, 101), seq(0, 1, .01), lty=3, col='gray')
> legend('bottomright', c('p=1', 'p=2', 'p=3', 'p=4', 'p=5', 'p=10'), lty=c(1,1,1,1,1,1), lwd=c(2.5,2.5,2.5,2.5,2.5,2.5), col=c('black', 'red', 'green', 'orange', 'purple', 'blue'))
curse-of-dimensionality.png

The situation is actually somewhat worse than depicted, because in high dimensions a neighborhood centered near the boundary must extend outside of the p-cube, so an even larger radius is needed.
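
Plugging f = 0.1 (the dotted line) into the plot's formula r = f^(1/p) shows how quickly the required radius grows with p:

# radius r = f^(1/p) needed to capture 10% of the volume of the unit p-cube
for p in (1, 2, 3, 5, 10):
    print(p, round(0.1 ** (1 / p), 3))
# prints: 1 0.1, 2 0.316, 3 0.464, 5 0.631, 10 0.794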

scikit-learn:

>>> import random
>>> from sklearn.datasets import load_iris
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.metrics import confusion_matrix
>>> iris = load_iris()

>>> train = random.sample(range(150), 75)
>>> test = [i for i in range(0, 150) if i not in train]

>>> clf = KNeighborsClassifier(n_neighbors=5)
>>> clf = clf.fit(iris.data[train], iris.target[train])

>>> confusion_matrix(clf.predict(iris.data[test]), iris.target[test])
array([[22,  0,  0],
       [ 0, 24,  1],
       [ 0,  1, 27]])
>>> sum([1 if x else 0 for x in clf.predict(iris.data[test]) == iris.target[test]]) / len(test)
0.9733333333333334

Nearest neighbors has a parameter k which must be chosen. One way to choose it is to plot the test error for values of k:

> require('class')

> iris.predictors = iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
> iris.response = iris[, c("Species")]
> train = runif(150) > 0.5

> test.err = rep(0, 15)

> for (i in 1:15) {
+   knn.model = knn(iris.predictors[train, ], iris.predictors[!train, ], iris.response[train], k=i)
+   test.err[i] = 1 - mean(knn.model == iris.response[!train])
+ }

> plot(1:15, test.err, xlab='k')
knn-test-error.png

The plot shows 4 ≤ k ≤ 7 works best.

We note that if the test data is used to choose a parameter, then it cannot be used to estimate the accuracy of the final model. Hence, when training a model with a tunable parameter, we may split the data into three sets: training, validation, and test. The validation set is used to choose the parameter.
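
For example, here is a sketch of the three-way split with scikit-learn (the split proportions and the range of candidate k values are arbitrary choices): the validation score picks k, and the test set is used only once at the end.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

# fit a model for each candidate k and score it on the validation set
scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
          for k in range(1, 16)}
best_k = max(scores, key=scores.get)

# the test set is touched only once, to estimate the accuracy of the chosen model
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))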

There are a few ways to implement nearest neighbors. The brute force method checks every point in the training set to find the nearest neighbors, keeping a list of the k nearest neighbors found so far. This is an O(nk) algorithm, or O(n log k) if a priority queue is used for the list of k nearest neighbors.
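
A minimal sketch of the brute-force search using a bounded heap; Python's heapq.nsmallest keeps only the k best candidates seen so far:

import heapq

def k_nearest(train_X, x, k):
    # squared Euclidean distance to every training point: O(n)
    dists = ((sum((a - b) ** 2 for a, b in zip(row, x)), i)
             for i, row in enumerate(train_X))
    # nsmallest maintains a heap of size k, so the whole search is O(n log k)
    return [i for _, i in heapq.nsmallest(k, dists)]

train_X = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0), (0.5, 0.2)]
print(k_nearest(train_X, (0.1, 0.1), k=2))   # [0, 3], indices of the two closest points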

K-d trees and ball trees are data structures which can be used to index the location of the points in the training set. Whether they improve performance depends on the data; they are generally ineffective in high dimensions.
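
scikit-learn exposes these index structures directly; a small sketch using sklearn.neighbors.KDTree on made-up data (BallTree has the same interface):

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))           # 1000 made-up points in 3 dimensions

tree = KDTree(X)                    # build the index once
dist, ind = tree.query(X[:1], k=3)  # distances and indices of the 3 nearest points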

linear regression

Linear regression builds a model which is linear in the predictor variables. Usually the data overdetermines the model. The customary way to deal with this is to select the model which minimizes the sum of the squares of the residuals, which are the differences between the predicted and actual response values. This is called a least squares fit.
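
A small sketch of the least squares criterion with numpy on made-up data; numpy.linalg.lstsq returns the coefficients which minimize the residual sum of squares:

import numpy as np

# made-up data: y is roughly 1 + 2x plus noise
rng = np.random.default_rng(0)
x = rng.random(50)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)

# design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# coefficients minimizing sum((y - X @ beta)**2)
beta, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # roughly [1.0, 2.0]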

> require(MASS)
> fit = lm(medv ~ lstat, data=Boston)
> summary(fit)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max
-15.168  -3.990  -1.318   2.034  24.500

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,	Adjusted R-squared:  0.5432
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
> plot(Boston$lstat, Boston$medv, xlab='% of pop lower status', ylab='median val of homes in thousands', main='Boston')
> abline(fit, col='red')
boston-lstat-medv.png

In the R formula language, a dot is used to build a multiple linear regression model on all the variables in the data frame:

> fit = lm(medv ~ ., data=Boston)
> summary(fit)

Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max
-15.595  -2.730  -0.518   1.777  26.199

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288
chas         2.687e+00  8.616e-01   3.118 0.001925 **
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 **
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,	Adjusted R-squared:  0.7338
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

The adjusted R-squared is 0.73, which looks significantly better than the 0.54 from the simple linear regression on lstat. R-squared is computed on the training data, however, so it can reward overfitting. To compare the two models more honestly, we can reserve a test set and compare the residual sum of squares on the held-out observations:

> dim(Boston)
[1] 506  14

> set.seed(17)
> train = sample(1:506, 300)
> simple.fit.train = lm(medv ~ lstat, data=Boston[train, ])
> simple.medvhat = predict(simple.fit.train, Boston[-train, ])
> sum((Boston$medv[-train] - simple.medvhat) ^ 2)

> multiple.fit.train = lm(medv ~ ., data=Boston[train, ])
> multiple.medvhat = predict(multiple.fit.train, Boston[-train, ])
> sum((Boston$medv[-train] - multiple.medvhat) ^ 2)

Forward stepwise selection, here with the leaps package, adds predictors one at a time, at each step adding the predictor which most improves the fit:

> install.packages("leaps")
> require(leaps)

> fwd.fit = regsubsets(medv ~ ., data=Boston, nvmax=13, method="forward")
> summary(fwd.fit)
Subset selection object
Call: regsubsets.formula(medv ~ ., data = Boston, nvmax = 13, method = "forward")
13 Variables  (and intercept)
        Forced in Forced out
crim        FALSE      FALSE
zn          FALSE      FALSE
indus       FALSE      FALSE
chas        FALSE      FALSE
nox         FALSE      FALSE
rm          FALSE      FALSE
age         FALSE      FALSE
dis         FALSE      FALSE
rad         FALSE      FALSE
tax         FALSE      FALSE
ptratio     FALSE      FALSE
black       FALSE      FALSE
lstat       FALSE      FALSE
1 subsets of each size up to 13
Selection Algorithm: forward
          crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
1  ( 1 )  " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"
2  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"
3  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"
4  ( 1 )  " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"
5  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"
6  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     " "   "*"
7  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"
8  ( 1 )  " "  "*" " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"
9  ( 1 )  "*"  "*" " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"
10  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" " " "*"     "*"   "*"
11  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"
12  ( 1 ) "*"  "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"
13  ( 1 ) "*"  "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"

> plot(fwd.fit, scale="Cp")

r formula language

stepwise selection

polynomial model

hierarchy principle

lasso

logistic regression

linear discriminant analysis

naive bayes

Naive Bayes is a classification technique based on Bayes' Rule and the assumption that the features are conditionally independent given the class:

(1)
$$ P(C \mid X_1, \ldots, X_n) \;=\; \frac{P(C)\, P(X_1, \ldots, X_n \mid C)}{P(X_1, \ldots, X_n)} \;=\; \frac{P(C)\, P(X_1 \mid C) \cdots P(X_n \mid C)}{P(X_1, \ldots, X_n)} $$

Naive Bayes is popular in text classification, in which case C means that a document is in class C and Xᵢ means that the document has feature i. The denominator is the same for every class, so to classify a document it suffices to compare the numerators.

Given a corpus of training data, we count the total number of documents, the number of documents in class C, and the number of documents in class C which have feature i. This allows us to estimate the probabilities P(C) and P(Xᵢ | C) on the right hand side of the equation above.
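
A sketch with scikit-learn on a made-up four-document corpus; CountVectorizer produces the binary document-feature matrix and BernoulliNB estimates the probabilities from the counts described above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# made-up corpus; labels: 1 = sports, 0 = not sports
docs = ["the team won the game",
        "the election results are in",
        "a great game by the home team",
        "the new tax policy was announced"]
labels = [1, 0, 1, 0]

vec = CountVectorizer(binary=True)      # feature i: does the document contain word i?
X = vec.fit_transform(docs)

# fit() computes the per-class and per-feature counts and turns them into probabilities
clf = BernoulliNB().fit(X, labels)
print(clf.predict(vec.transform(["the team played a game"])))   # most likely [1]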

support vector machine

find a separating hyperplane to build a classifier

maximal margin classifier

tuning parameter C: a budget for the total distance by which points may be on the wrong side of the margin

tuning parameter γ: width parameter of the Gaussian (radial) kernel

comparison w/ logistic regression: SVM tends to work better than LR when the classes are separable, but it does not give a probability

SVM does not prune the feature space, LR with lasso does

> install.packages('e1071')
> require(e1071)
> X = iris[c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width")]
> y = ifelse(iris$Species == 'setosa', 1, -1)
> fit = svm(X, y, kernel="linear", cost=1, scale=F)
> sum(ifelse(predict(fit, X) > 0, 1, -1) == y)
[1] 150
> length(y)
[1] 150
> fit2 = svm(X, y, kernel="radial", cost=1, gamma=0.25, scale=F)
> sum(ifelse(predict(fit2, X) > 0, 1, -1) == y)
[1] 150
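
A scikit-learn counterpart (a sketch; the linear and radial kernels and the cost mirror the R calls above, with setosa vs. the rest as the two classes):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
y = np.where(iris.target == 0, 1, -1)        # setosa vs. the rest

clf = SVC(kernel='linear', C=1.0).fit(iris.data, y)
print((clf.predict(iris.data) == y).sum())   # setosa is linearly separable, so this should be 150

clf2 = SVC(kernel='rbf', C=1.0, gamma=0.25).fit(iris.data, y)
print((clf2.predict(iris.data) == y).sum())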

decision tree

A decision tree is a binary tree which is used like a flow chart to come to a decision. At each node, one of the predictor variables is examined, and one proceeds to the left or right child based on the outcome.

Finding an optimal decision tree is NP-complete, so a greedy algorithm is usually used: at each node, choose the split which most improves a purity measure such as the Gini index or the cross-entropy.

> install.packages('tree')
> require('tree')

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

> tree.iris = tree(Species~., data=iris)

> summary(tree.iris)
Classification tree:
tree(formula = Species ~ ., data = iris)
Variables actually used in tree construction:
[1] "Petal.Length" "Petal.Width"  "Sepal.Length"
Number of terminal nodes:  6
Residual mean deviance:  0.1253 = 18.05 / 144
Misclassification error rate: 0.02667 = 4 / 150

> predict(tree.iris, list(Sepal.Length=5.8, Sepal.Width=3, Petal.Length=4.35, Petal.Width=1.3))
  setosa versicolor virginica
1      0          1         0

scikit-learn:

>>> from sklearn import tree
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
>>> with open("/tmp/iris.dot", 'w') as f:
...     tree.export_graphviz(clf, out_file=f)
$ dot -Tpng /tmp/iris.dot > /tmp/iris.png
iris-decision-tree.png

pruning

random forest

Decision trees tend to overfit.

Random forests use two techniques to overcome the problem. First, they use bagging, in which samples with replacement are taken repeatedly from the training data and a decision tree is built on each sample. The final prediction is the average of the predictions of all the trees; if the response variable is categorical, a majority vote is used instead.

Second, at each node of the tree, a random subset of the features is chosen as candidates for the predictor variable to be examined at that node. This is done to reduce the correlation between the trees.
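
A sketch of bagging by hand with scikit-learn decision trees (the diabetes data is just a convenient built-in regression example; a real random forest also adds the random feature subsets described above):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(100):
    # bootstrap sample: draw len(X) rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# the bagged prediction is the average of the individual trees' predictions
pred = np.mean([t.predict(X[:5]) for t in trees], axis=0)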

> install.packages('randomForest')
> require('randomForest')
> require('MASS') # Boston

> train = sample(1:nrow(Boston), 300)
> rf.Boston = randomForest(medv ~ ., data=Boston, subset=train)
> rf.Boston

Call:
 randomForest(formula = medv ~ ., data = Boston, subset = train)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 12.6554
                    % Var explained: 84.2

The random forest technique has potentially quite a few parameters, but the only one that requires tuning in practice is the number of features randomly selected at each node (mtry above). If there are p features, sqrt(p) is a common default for classification and p/3 for regression.

The number of trees usually doesn't have to be tuned because increasing the number of trees never results in overfitting.

Finding a good value for the number of features to randomly select:

> test.err = double(13)
> for (mtry in 1:13) {
+   fit = randomForest(medv ~ ., data=Boston, subset=train, mtry=mtry)
+   pred = predict(fit, Boston[-train, ])
+   test.err[mtry] = with(Boston[-train, ], mean((medv - pred)^2))
+ }
> plot(1:13, test.err, type='b', main='Random Forest Parameter Tuning', xlab='mtry')
random-forest-param-tuning.png

scikit-learn:

>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> from sklearn.ensemble import RandomForestRegressor
>>> clf = RandomForestRegressor(n_estimators=400)
>>> clf = clf.fit(boston.data, boston.target)
>>> clf.predict(boston.data[0:1, :])

boosting

neural network

xor, parity

data sets

http://yann.lecun.com/exdb/mnist/
http://archive.ics.uci.edu/ml/datasets.html

cross validation

The test error can be used to select the best of several models, or to select a value for a model's tunable parameter.

One technique is to split the data into two parts: a training set and a validation set. Using 70% for training and 30% for validation is a popular choice. For each model, train on the training set and use the validation set to compute the test error.

Cross validation can be used to reduce the variance of the test error estimate at the expense of training more models. The technique is to split the data into k equally sized parts. Train each model k times, each time on k - 1 of the parts, holding out a different part to compute the test error. The final error estimate is the average of the k values. k is often 5 or 10.
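
A sketch with scikit-learn, whose cross_val_score performs this k-fold procedure and returns one score per fold:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
clf = KNeighborsClassifier(n_neighbors=5)

# 5-fold CV: fit on 4/5 of the data, score the held-out 1/5, repeat for each fold
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores, scores.mean())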

more topics

  • measurement: accuracy, recall, precision, confusion matrix, auc, roc curve, ...
  • cross validation
  • ensemble methods