a tour of statistical learning using R and scikit-learn

data sets

# intro

Statistical learning is a set of techniques for building a model which attempts to predict a response variable from one or more predictor variables.

Labeled data are tuples of predictor values with their associated response values. Typically one divides the labeled data into a training set and a test set. The model is built using the training set and then the test set is used to evaluate the performance of the model.

The test set guards against overfitting, which is when a model performs well on the data used to train it, but poorly on unseen data.

Variables can be quantitative or categorical. When a variable is categorical, it can take one of k of unordered values. We can convert a categorical variable to k-1 dummy variables. At most one of the dummy variables can be set to 1 and all of the categorical variables are 0 when the categorical variable has the baseline value.

When the response variable is quantitative, a model which predicts it is said to be a regression model. When the response variable is categorical, the model is a classification model.

# r

How to list the variables in scope and how to inspect their type:

> x = 3
> ls()
 "x"

> class(x)
 "numeric"
> str(x)
num 3


How to get documentation for a function:

> ?mean


R has a data type called data frame which equivalent to a relational database table.

The variable iris is always in scope and contains a data frame:

> dim(iris)
 150   5

> names(iris)
 "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"


The columns are vectors and can be accessed using index notation, dollar sign notation, or the with command. Notice the difference in type of the last expression:

> class(iris[, "Sepal.Length"])
 "numeric"

> class(iris$Sepal.Length)  "numeric" > with(iris, class(Sepal.Length))  "numeric" > class(iris["Sepal.Length"])  "data.frame"  The attach command makes all of the columns available as variables: > attach(iris) > class(Sepal.Length)  "numeric"  How to get a single row of a data frame: > iris[1, ] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa > class(iris[1, ])  "data.frame"  The columns of a data frame can have different types: > class(iris$Species)
 "factor"


explain factors

# scikit learn

scikit-learn is a Python library which can be installed with pip:

$sudo pip install scikit-learn  scikit-learn uses numpy, which is a Python library which provides a homogeneous multidimensional type called ndarray. >>> from sklearn.datasets import load_iris >>> iris = load_iris() >>> type(iris.data) <class 'numpy.ndarray'> >>> iris.data.ndim 2 >>> iris.data.shape (150, 4) >>> iris.data[0, 0] 5.0999999999999996 >>> iris.data[0, :] array([ 5.1, 3.5, 1.4, 0.2]) >>> iris.data[0:3, :] array([[ 5.1, 3.5, 1.4, 0.2], [ 4.9, 3. , 1.4, 0.2], [ 4.7, 3.2, 1.3, 0.2]])  scipy sparse arrays load_svmlight_file # nearest neighbors The idea is to find the k nearest neighbors in the labeled data to an unlabeled tuple and use them to make the prediction. If the response is numeric, average the neighbors. If the response is categorical, use majority rule. > require('class') > nrow(iris)  150 > names(iris)  "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" > train = sample(1:nrow(iris), 100) > iris.predictors = iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")] > iris.response = iris[, c("Species")] > train = runif(150) > 0.5 > knn.model = knn(iris.predictors[train, ], iris.predictors[!train, ], iris.response[train], k=3) > table(knn.model, iris.response[!train]) knn.model setosa versicolor virginica setosa 21 0 0 versicolor 0 26 0 virginica 0 4 21 > mean(knn.model == iris.response[!train])  0.9444444  Nearest neighbors works best when there is lots of input data and four or fewer predictors. The reason that lots of predictors don't work well is because in high dimensions, the k nearest neighbors are often quite far away. Here is a graph showing the radius of the p-sphere needed to include a given fraction of the predictors, assuming that they are uniformly distributed in the unit p-cube. The dotted line marks the radius of the p-spheres which include 10% of the predictors. > x = seq(0, 1, .01) > plot(x, x, type='l', xlab='fraction volume of unit n-cube', ylab='radius of sphere') > lines(x, x^.5, col='red') > lines(x, x^.33333333, col='green') > lines(x, x^.25, col='orange') > lines(x, x^.1, col='blue') > lines(x, x^.2, col='purple') > lines(rep(.1, 101), seq(0, 1, .01), lty=3, col='gray') > lines(rep(.1, 101), seq(0, 1, .01), lty=3) > legend('bottomright', c('p=1', 'p=2', 'p=3', 'p=4', 'p=5', 'p=10'), lty=c(1,1,1,1,1,1), lwd=c(2.5,2.5,2.5,2.5,2.5,2.5), col=c('black', 'red', 'green', 'orange', 'purple', 'blue')) The situation is actually somewhat worse than depicted because in high dimensions the p-sphere must often extend outside of the p-cube. scikit-learn: >>> from sklearn.datasets import load_iris >>> from sklearn.neighbors import KNeighborsClassifier >>> iris = load_iris() >>> train = sample(range(0, 150), 75) >>> test = [i for i in range(0, 150) if i not in train] >>> clf = KNeighborsClassifier(n_neighbors=5) >>> clf = clf.fit(iris.data[train], iris.target[train]) >>> confusion_matrix(clf.predict(iris.data[test]), iris.target[test]) array([[22, 0, 0], [ 0, 24, 1], [ 0, 1, 27]]) >>> sum([1 if x else 0 for x in clf.predict(iris.data[test]) == iris.target[test]]) / len(test) 0.9733333333333334  Nearest neighbors has a parameter k which must be chosen. One way to choose it is to plot the test error for values of k: > require('class') > train = sample(1:nrow(iris), 100) > iris.predictors = iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")] > iris.response = iris[, c("Species")] > train = runif(150) > 0.5 > test.err = rep(0, 15) > for (i in 1:15) { + knn.model = knn(iris.predictors[train, ], iris.predictors[!train, ], iris.response[train], k=i) + test.err[i] = 1 - mean(knn.model == iris.response[!train]) + } > plot(1:15, test.err, xlab='k') The plot shows 4 ≤ k ≤ 7 works best. We note that if the test data is used to choose a parameter, then it cannot be used to estimate the accuracy of the final model. Hence, when training a model with a tunable parameter, we may split the data into three sets: training, validation, and test. The validation set is used to choose the parameter. There a few ways to implement nearest neighbors. The brute force method checks every point in the training set to find the nearest neighbors. One must keep a list of the k nearest neighbors found so far. This is a O(nk) algorithm, or O(n log k) if a priority queue is used for the list of k nearest neighbors. K-d trees and ball trees are data structures which can be used to index the location of the points in the training set. Whether they improve performance depends on the data; they are generally ineffective in high dimensions. # linear regression Linear regressions builds a model which is linear in the predictor variables. Usually the data overdetermines the model. The customary way to deal with this is to select the model which minimizes the sum of the squares of the residuals, which are the differences between the predicted and actual response values. This is called a least squares fit. > require(MASS) > fit = lm(medv ~ lstat, data=Boston) > summary(fit) Call: lm(formula = medv ~ lstat, data = Boston) Residuals: Min 1Q Median 3Q Max -15.168 -3.990 -1.318 2.034 24.500 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 34.55384 0.56263 61.41 <2e-16 *** lstat -0.95005 0.03873 -24.53 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 6.216 on 504 degrees of freedom Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432 F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16  > plot(Boston$lstat, Bostonmedv, xlab='% of pop lower status', ylab='median val of homes in thousands', main='Boston') > abline(fit, col='red') In the R formula language, a dot is used to build a multiple linear regression model on all the variables in the data frame: > fit = lm(medv ~ ., data=Boston) > summary(fit) Call: lm(formula = medv ~ ., data = Boston) Residuals: Min 1Q Median 3Q Max -15.595 -2.730 -0.518 1.777 26.199 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 *** crim -1.080e-01 3.286e-02 -3.287 0.001087 ** zn 4.642e-02 1.373e-02 3.382 0.000778 *** indus 2.056e-02 6.150e-02 0.334 0.738288 chas 2.687e+00 8.616e-01 3.118 0.001925 ** nox -1.777e+01 3.820e+00 -4.651 4.25e-06 *** rm 3.810e+00 4.179e-01 9.116 < 2e-16 *** age 6.922e-04 1.321e-02 0.052 0.958229 dis -1.476e+00 1.995e-01 -7.398 6.01e-13 *** rad 3.060e-01 6.635e-02 4.613 5.07e-06 *** tax -1.233e-02 3.760e-03 -3.280 0.001112 ** ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 *** black 9.312e-03 2.686e-03 3.467 0.000573 *** lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4.745 on 492 degrees of freedom Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338 F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16  The adjusted R-squared is 0.73, which seems significantly better than the 0.54 value from simple linear regression on lstat. If we reserve a test set, we can show that this is overfitting and the multiple linear regression model does not perform better than the simple linear model: > dim(Boston)  506 14 > set.seed(17) > train = sample(1:506, 300) > simple.fit.train = lm(medv ~ lstat, data=Boston[train, ]) > simple.medvhat = predict(simple.fit.train, Boston[-train, ]) > sum(simple.medvhat ^ 2)  113462.7 > multiple.fit.train = lm(medv ~ ., data=Boston[train, ]) > multiple.medvhat = predict(multiple.fit.train, Boston[-train, ]) > sum(multiple.medvhat ^ 2)  118683.4  > install.packages("leaps") > require(leaps) > fwd.fit = regsubsets(medv ~ ., data=Boston, nvmax=13, method="forward") > summary(fwd.fit) Subset selection object Call: regsubsets.formula(medv ~ ., data = Boston, nvmax = 13, method = "forward") 13 Variables (and intercept) Forced in Forced out crim FALSE FALSE zn FALSE FALSE indus FALSE FALSE chas FALSE FALSE nox FALSE FALSE rm FALSE FALSE age FALSE FALSE dis FALSE FALSE rad FALSE FALSE tax FALSE FALSE ptratio FALSE FALSE black FALSE FALSE lstat FALSE FALSE 1 subsets of each size up to 13 Selection Algorithm: forward crim zn indus chas nox rm age dis rad tax ptratio black lstat 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*" 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*" 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*" 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*" 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*" 6 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" " " "*" 7 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" "*" "*" 8 ( 1 ) " " "*" " " "*" "*" "*" " " "*" " " " " "*" "*" "*" 9 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" " " " " "*" "*" "*" 10 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" " " "*" "*" "*" 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*" 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" > plot(fwd.fit, scale="Cp")  r formula language stepwise selection polynomial model hierarchy principle lasso # logistic regression # linear discrimant analysis # naive bayes Naive Bayes is a classification technique based on Bayes' Rule and an independence assumption about the features: (1) \begin{align} \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;P(C\,|\,X_1, ..., X_n) \;=\; \frac{P(C)\, P(X_1, \ldots, X_n\,|\,C)\;\;\;\;\;}{P(X_1, \ldots, X_n)} \;=\; \frac{P(C)\, P(X_1\,|\,C) \cdots P(X_n\,|\,C)\;\;\;\;\;}{P(X_1) \cdots P(X_n)} \end{align} Naive Bayes is a popular text in text classification, in which case C means that a document is in class C and Xᵢ means that a document has feature i. Given a corpus of training data, we count the total number of documents, the total number of documents in class C, the total number of documents with feature i, and the total number of documents in class C with feature i. This allows us to estimate the probabilities on the right hand side of the equation above. # support vector machine find a separating hyperplane to build a classifier maximal margin classifier tuning parameter C: sum of distance of points on wrong side of margin tuning parameter γ: for Guassian kernel comparison w/ logistic regression: works better than LR when data is separable, but does not give a probability SVM does not prune the feature space, LR with lasso does > install.packages('e1071') > require(e1071) > X = iris[c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width")] > y = ifelse(irisSpecies == 'setosa', 1, -1)
> fit = svm(X, y, kernel="linear", cost=1, scale=F)
> sum(ifelse(predict(fit, X) > 0, 1, -1) == y)
 150
> length(y)
 150
> fit2 = svm(X, y, kernel="radial", cost=1, gamma=0.25, scale=F)
> sum(ifelse(predict(fit2, X) > 0, 1, -1) == y)
 150


# decision tree

A decision tree is a binary tree which is used like a flow chart to come to a decision. At each node, one of the predictor variables is examined, and one proceeds to the left or right child based on the outcome.

Finding an optimal decision tree is NP-complete. Hence a greedy algorithm is usually used.

> install.packages('tree')
> require('tree')

> summary(iris)
Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

> tree.iris = tree(Species~., data=iris)

> summary(tree.iris)
Classification tree:
tree(formula = Species ~ ., data = iris)
Variables actually used in tree construction:
 "Petal.Length" "Petal.Width"  "Sepal.Length"
Number of terminal nodes:  6
Residual mean deviance:  0.1253 = 18.05 / 144
Misclassification error rate: 0.02667 = 4 / 150

> predict(tree.iris, list(Sepal.Length=5.8, Sepal.Width=3, Petal.Length=4.35, Petal.Width=1.3))
setosa versicolor virginica
1      0          1         0

>>> from sklearn import tree
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
>>> from sklearn.externals.six import StringIO
>>> with open("/tmp/iris.dot", 'w') as f:
...     f = tree.export_graphviz(clf, out_file=f)

\$ dot -Tpng /tmp/iris.dot > /tmp/iris.png pruning

# random forest

Decision trees tend to overfit.

Random forests use two techniques to overcome the problem. First, they use bagging, in which samples with replacement are taken repeatedly from the training data, and for each sample a decision tree is built. The final classifier takes the average of all the trees. If the response variable is categorical, the majority rule can be used.

Second, at each node of the tree, a random subset of the features is chosen as candidates for the predictor variable to be examined at that node. This is done to reduce the correlation between the trees.

> install.packages('randomForest')
> require('randomForest')
> require('MASS') # Boston

> train = sample(1:nrow(Boston), 300)
> rf.Boston = randomForest(medv ~ ., data=Boston, subset=train)
> rf.Boston

Call:
randomForest(formula = medv ~ ., data = Boston, subset = train)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4

Mean of squared residuals: 12.6554
% Var explained: 84.2


The Random Forest technique has potentially quite a few parameters, but the only one that requires tuning in practice is the number of features randomly selected at each node. If there are n features, then sqrt(n) is often a good value.

The number of trees usually doesn't have to be tuned because increasing the number of trees never results in overfitting.

Finding a good value for the number of features to randomly select:

> test.err = double(13)
> for (mtry in 1:13) {
+   fit = randomForest(medv ~ ., data=Boston, subtrain=train, mtry=mtry)
+   pred = predict(fit, Boston[-train, ])
+   test.err[mtry] = with(Boston[-train, ], mean(medv - pred)^2)
+ }
> plot(1:mtry, test.err, type='b', main='Random Forest Parameter Tuning', xlab='mtry') >>> from sklearn.datasets import load_boston
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(n_estimators=400)
>>> clf = clf.fit(boston.data, boston.target)
>>> clf.predict(boston.data[0,:])
array([ 24.])


xor, parity

# cross validation

The test error can be used to select the best of several models, or to select a value for model with a tune-able parameter.

One technique is to split the data into two parts: training set and a validation set. Using 70% for training and 30% for validation is a popular choice. For each model, train on the training set and use the validation set to compute the test error.

Cross validation can be used to reduce the variance of the test error at the expense of training more models. The technique is to split the data into k equally sized parts. Train each model k on k - 1 of the parts, holding a different part out for computing the test error each time. The final mean-squared error is the average of the k values. k is often 5 or 10.

# more topics

• measurement: accuracy, recall, precision, confusion matrix, auc, roc curve, ...
• cross validation
• ensemble methods