diff --git a/ch10/lab.Rmd b/ch10/lab.Rmd new file mode 100644 index 0000000..a61fe4b --- /dev/null +++ b/ch10/lab.Rmd @@ -0,0 +1,74 @@ +Principal Components +====================== + +We will use the USArrests data (which is in R) + +```{r} +dimnames(USArrests) +apply(USArrests, 2L, mean) +apply(USArrests, 2L, var) +``` + +We see that `Assault` has a much larger variance than the other variables. It would dominate the principal components, so we choose to standardize the variables when we perform PCA + +```{r} +pca.out = prcomp(USArrests, scale = TRUE) +pca.out +names(pca.out) +biplot(pca.out, scale = 0) +``` + +K-Means Clustering +==================== + +K-means works in any dimension, but is most fun to demonstrate in two, because we can plot pictures. Lets make some data with clusters. We do this by shifting the means of the points around. + +```{r} +set.seed(101) +x = matrix(rnorm(100L * 2L), 100L, 2L) +xmean = matrix(rnorm(8L, sd = 4), 4L, 2L) +which = sample(1L:4L, 100L, replace = TRUE) +x = x + xmean[which, ] +plot(x, col = which, pch = 19L) +``` + +We know the “true” cluster IDs, but we wont tell that to the `kmeans` algorithm. + +```{r} +km.out = kmeans(x, 4L, nstart = 15L) +km.out +plot(x, col = km.out$cluster, + cex = 2, pch = 1L, lwd = 2L) +points(x, col = which, pch = 19L) #permuted colors +points(x, col = (4:1)[which], pch = 19L) +``` + +Hierarchical Clustering +========================= + +We will use these same data and use hierarchical clustering + +```{r} +hc.complete = hclust(dist(x), method = "complete") +plot(hc.complete) +hc.single = hclust(dist(x), method = "single") +plot(hc.single) +hc.average = hclust(dist(x), method = "average") +plot(hc.average) +``` + +Lets compare this with the actualy clusters in the data. We will use the function cutree to cut the tree at level 4. This will produce a vector of numbers from 1 to 4, saying which branch each observation is on. You will sometimes see pretty plots where the leaves of the dendrogram are colored. I searched a bit on the web for how to do this, and its a little too complicated for this demonstration. + +We can use table to see how well they match: + +```{r} +hc.cut = cutree(hc.complete, 4L) +table(hc.cut, which) +table(hc.cut, km.out$cluster) +``` + +or we can use our group membership as labels for the leaves of the dendrogram: + +```{r} +plot(hc.complete, labels = which) +``` diff --git a/ch10/lab.html b/ch10/lab.html new file mode 100644 index 0000000..62bcefb --- /dev/null +++ b/ch10/lab.html @@ -0,0 +1,266 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Principal Components

+

We will use the USArrests data, which is built into R.

+
dimnames(USArrests)
+
## [[1]]
+##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
+##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
+##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
+## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
+## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
+## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
+## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
+## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
+## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
+## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
+## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
+## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
+## [49] "Wisconsin"      "Wyoming"       
+## 
+## [[2]]
+## [1] "Murder"   "Assault"  "UrbanPop" "Rape"
+
apply(USArrests, 2L, mean)
+
##   Murder  Assault UrbanPop     Rape 
+##    7.788  170.760   65.540   21.232
+
apply(USArrests, 2L, var)
+
##     Murder    Assault   UrbanPop       Rape 
+##   18.97047 6945.16571  209.51878   87.72916
+

We see that Assault has a much larger variance than the other variables. It would dominate the principal components, so we choose to standardize the variables when we perform PCA.

+
pca.out = prcomp(USArrests, scale = TRUE)
+pca.out
+
## Standard deviations:
+## [1] 1.5748783 0.9948694 0.5971291 0.4164494
+## 
+## Rotation:
+##                 PC1        PC2        PC3         PC4
+## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
+## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
+## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
+## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
+
names(pca.out)
+
## [1] "sdev"     "rotation" "center"   "scale"    "x"
+
biplot(pca.out, scale = 0)
+

+
+
+
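The sdev component of the prcomp object can be used to compute the proportion of variance explained by each component. This is a small addition, not part of the original lab:

```r
# Proportion of variance explained (PVE) by each principal component
pve = pca.out$sdev^2 / sum(pca.out$sdev^2)
round(pve, 3)
cumsum(pve)
```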

K-Means Clustering

+

K-means works in any dimension, but it is most fun to demonstrate in two, because we can plot pictures. Let's make some data with clusters. We do this by shifting the means of the points around.

+
set.seed(101)
+x = matrix(rnorm(100L * 2L), 100L, 2L)
+xmean = matrix(rnorm(8L, sd = 4), 4L, 2L)
+which = sample(1L:4L, 100L, replace = TRUE)
+x = x + xmean[which, ]
+plot(x, col = which, pch = 19L)
+

+

We know the “true” cluster IDs, but we won't tell that to the kmeans algorithm.

+
km.out = kmeans(x, 4L, nstart = 15L)
+km.out
+
## K-means clustering with 4 clusters of sizes 21, 30, 32, 17
+## 
+## Cluster means:
+##         [,1]       [,2]
+## 1 -3.1068542  1.1213302
+## 2  1.7226318 -0.2584919
+## 3 -5.5818142  3.3684991
+## 4 -0.6148368  4.8861032
+## 
+## Clustering vector:
+##   [1] 2 3 3 4 1 1 4 3 2 3 2 1 1 3 1 1 2 3 3 2 2 3 1 3 1 1 2 2 3 1 1 4 3 1 3
+##  [36] 3 1 2 2 3 2 2 3 3 1 3 1 3 4 2 1 2 2 4 3 3 2 2 3 2 1 2 3 4 2 4 3 4 4 2
+##  [71] 2 4 3 2 3 4 4 2 2 1 2 4 4 3 3 2 3 3 1 2 3 2 4 4 4 2 3 3 1 1
+## 
+## Within cluster sum of squares by cluster:
+## [1] 30.82790 54.48008 71.98228 21.04952
+##  (between_SS / total_SS =  87.6 %)
+## 
+## Available components:
+## 
+## [1] "cluster"      "centers"      "totss"        "withinss"    
+## [5] "tot.withinss" "betweenss"    "size"         "iter"        
+## [9] "ifault"
+
plot(x, col = km.out$cluster,
+     cex = 2, pch = 1L, lwd = 2L)
+points(x, col = which, pch = 19L) #permuted colors
+points(x, col = (4:1)[which], pch = 19L)
+
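A cross-tabulation of the true group labels against the kmeans assignments (a small addition, not in the original lab) shows that the four groups are recovered up to a relabeling of the cluster numbers:

```r
# Rows: true groups used to generate the data; columns: kmeans cluster labels
table(which, km.out$cluster)
```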

+
+
+

Hierarchical Clustering

+

We will use the same data to demonstrate hierarchical clustering.

+
hc.complete = hclust(dist(x), method = "complete")
+plot(hc.complete)
+

+
hc.single = hclust(dist(x), method = "single")
+plot(hc.single)
+

+
hc.average = hclust(dist(x), method = "average")
+plot(hc.average)
+

+

Let's compare this with the actual clusters in the data. We will use the function cutree to cut the tree into 4 clusters. This produces a vector of numbers from 1 to 4, saying which branch each observation is on. You will sometimes see pretty plots where the leaves of the dendrogram are colored. I searched a bit on the web for how to do this, and it's a little too complicated for this demonstration (though a rough sketch appears at the end of this section).

+

We can use table to see how well they match:

+
hc.cut = cutree(hc.complete, 4L)
+table(hc.cut, which)
+
##       which
+## hc.cut  1  2  3  4
+##      1  0  0 30  0
+##      2  1 31  0  2
+##      3 17  0  0  0
+##      4  0  0  0 19
+
table(hc.cut, km.out$cluster)
+
##       
+## hc.cut  1  2  3  4
+##      1  0 30  0  0
+##      2  2  0 32  0
+##      3  0  0  0 17
+##      4 19  0  0  0
+

Or we can use our group memberships as labels for the leaves of the dendrogram:

+
plot(hc.complete, labels = which)
+
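As for the colored-leaf dendrograms mentioned above, one option in base R is dendrapply; here is a rough sketch (an addition, not part of the original lab) that colors each leaf by its true group in which:

```r
# Sketch: color the dendrogram leaf labels by true group membership
dend = as.dendrogram(hc.complete)
color.leaf = function(node) {
  if (is.leaf(node)) {
    obs = as.integer(node)  # a leaf stores the index of its observation
    attr(node, "nodePar") = list(lab.col = which[obs], pch = NA)
  }
  node
}
plot(dendrapply(dend, color.leaf))
```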

+
+ + + + + + + + diff --git a/ch6/lab.Rmd b/ch6/lab.Rmd new file mode 100644 index 0000000..cf57658 --- /dev/null +++ b/ch6/lab.Rmd @@ -0,0 +1,165 @@ +Model Selection +================ + +This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages, and a very nice way of distributing an analysis. It has some very simple syntax rules. + + +```{r} +library(ISLR) +summary(Hitters) +``` + +There are some missing values here, so before we proceed we will remove them: + +```{r} +Hitters = na.omit(Hitters) +with(Hitters, sum(is.na(Salary))) +``` + +Best Subset regression +======================== + +We will now use the package `leaps` to evaluate all the best-subset models. + +```{r} +library(leaps) +regfit.full = regsubsets(Salary ~ ., data = Hitters) +summary(regfit.full) +``` + +It gives by default best-subsets up to size 8; lets increase that to 19, i.e. all the variables + +```{r} +regfit.full = regsubsets(Salary ~ ., data = Hitters, nvmax = 19L) +reg.summary = summary(regfit.full) +names(reg.summary) +plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp") +which.min(reg.summary$cp) +points(10, reg.summary$cp[10L], pch = 20L, col = "red") +``` + +There is a plot method for the `regsubsets` object + +```{r} +plot(regfit.full, scale = "Cp") +coef(regfit.full, 10L) +``` + +Forward Stepwise Selection +============================ + +Here we use the `regsubsets` function but specify the `method=“forward”` option: + +```{r} +regfit.fwd = regsubsets(Salary ~ ., data = Hitters, + nvmax = 19L, method = "forward") +summary(regfit.fwd) +plot(regfit.fwd, scale = "Cp") +``` + +Model Selection Using a Validation Set +======================================== + +Lets make a training and validation set, so that we can choose a good subset model. We will do it using a slightly different approach from what was done in the the book. + +```{r} +dim(Hitters) +set.seed(1) +train = sample(263L, 180L, replace = FALSE) +train +regfit.fwd = regsubsets(Salary ~ ., data = Hitters[train, ], + nvmax = 19L, method = "forward") +``` + +Now we will make predictions on the observations not used for training. We know there are 19 models, so we set up some vectors to record the errors. We have to do a bit of work here, because there is no `predict` method for `regsubsets`. + +```{r} +val.errors = numeric(19L) +x.test = model.matrix(Salary ~ ., data = Hitters[-train, ]) # notice the -index! +for (i in 1L:19L) { + coefi = coef(regfit.fwd, id = i) + pred = x.test[ , names(coefi)] %*% coefi + val.errors[i] = mean((Hitters$Salary[-train] - pred)^2) +} +plot(sqrt(val.errors), ylab = "Root MSE", + ylim = c(300, 400), pch = 19L, type = "b") +points(sqrt(regfit.fwd$rss[-1L]/180), + col = "blue", pch = 19L, type = "b") +legend("topright", legend = c("Training", "Validation"), + col = c("blue", "black"), pch = 19L) +``` + +As we expect, the training error goes down monotonically as the model gets bigger, but not so for the validation error. + +This was a little tedious - not having a predict method for regsubsets. So we will write one! + +```{r} +predict.regsubsets = function(object, newdata, id, ...) { + form = as.formula(object$call[[2L]]) + mat = model.matrix(form, newdata) + coefi = coef(object, id = id) + mat[ , names(coefi)] %*% coefi +} +``` + +Model Selection by Cross-Validation +===================================== + +We will do 10-fold cross-validation. Its really easy! 
+ +```{r} +set.seed(11) +folds = sample(rep(1L:10L, length = nrow(Hitters))) +folds +table(folds) +cv.errors = matrix(NA, 10L, 19L) +for (k in 1L:10L) { + best.fit = regsubsets(Salary ~ ., data = Hitters[folds != k, ], + nvmax = 19L, method = "forward") + for (i in 1L:19L) { + pred = predict(best.fit, Hitters[folds == k, ], id = i) + cv.errors[k, i] = mean((Hitters$Salary[folds == k] - pred)^2) + } +} +rmse.cv = sqrt(apply(cv.errors, 2L, mean)) +plot(rmse.cv, pch = 19L, type = "b") +``` + +Ridge Regression and the Lasso +================================ + +We will use the package `glmnet`, which does not use the model formula language, so we will set up an x and y. + +```{r} +library(glmnet) +x = model.matrix(Salary ~ . - 1, data = Hitters) +y = Hitters$Salary +fit.ridge = glmnet(x, y, alpha = 0) +plot(fit.ridge, xvar = "lambda", label = TRUE) +cv.ridge = cv.glmnet(x, y, alpha = 0) +plot(cv.ridge) +``` + +Now we fit a lasso model; for this we use the default `alpha=1` + +```{r} +fit.lasso = glmnet(x, y) +plot(fit.lasso, xvar = "lambda", label = TRUE) +cv.lasso = cv.glmnet(x, y) +plot(cv.lasso) +coef(cv.lasso) +``` + +Suppose we want to use our earlier train/validation division to select the `lambda` for the lasso. This is easy to do. + +```{r} +lasso.tr = glmnet(x[train, ], y[train]) +lasso.tr +pred = predict(lasso.tr, x[-train, ]) +dim(pred) +rmse = sqrt(apply((y[-train] - pred)^2, 2, mean)) +plot(log(lasso.tr$lambda), rmse, type = "b", xlab = "Log(lambda)") +lam.best = lasso.tr$lambda[order(rmse)[1]] +lam.best +coef(lasso.tr, s = lam.best) +``` diff --git a/ch6/lab.html b/ch6/lab.html new file mode 100644 index 0000000..63de722 --- /dev/null +++ b/ch6/lab.html @@ -0,0 +1,597 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Model Selection

+

This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages, and a very nice way of distributing an analysis. It has some very simple syntax rules.

+
library(ISLR)
+summary(Hitters)
+
##      AtBat            Hits         HmRun            Runs       
+##  Min.   : 16.0   Min.   :  1   Min.   : 0.00   Min.   :  0.00  
+##  1st Qu.:255.2   1st Qu.: 64   1st Qu.: 4.00   1st Qu.: 30.25  
+##  Median :379.5   Median : 96   Median : 8.00   Median : 48.00  
+##  Mean   :380.9   Mean   :101   Mean   :10.77   Mean   : 50.91  
+##  3rd Qu.:512.0   3rd Qu.:137   3rd Qu.:16.00   3rd Qu.: 69.00  
+##  Max.   :687.0   Max.   :238   Max.   :40.00   Max.   :130.00  
+##                                                                
+##       RBI             Walks            Years            CAtBat       
+##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0  
+##  1st Qu.: 28.00   1st Qu.: 22.00   1st Qu.: 4.000   1st Qu.:  816.8  
+##  Median : 44.00   Median : 35.00   Median : 6.000   Median : 1928.0  
+##  Mean   : 48.03   Mean   : 38.74   Mean   : 7.444   Mean   : 2648.7  
+##  3rd Qu.: 64.75   3rd Qu.: 53.00   3rd Qu.:11.000   3rd Qu.: 3924.2  
+##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0  
+##                                                                      
+##      CHits            CHmRun           CRuns             CRBI        
+##  Min.   :   4.0   Min.   :  0.00   Min.   :   1.0   Min.   :   0.00  
+##  1st Qu.: 209.0   1st Qu.: 14.00   1st Qu.: 100.2   1st Qu.:  88.75  
+##  Median : 508.0   Median : 37.50   Median : 247.0   Median : 220.50  
+##  Mean   : 717.6   Mean   : 69.49   Mean   : 358.8   Mean   : 330.12  
+##  3rd Qu.:1059.2   3rd Qu.: 90.00   3rd Qu.: 526.2   3rd Qu.: 426.25  
+##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.00  
+##                                                                      
+##      CWalks        League  Division    PutOuts          Assists     
+##  Min.   :   0.00   A:175   E:157    Min.   :   0.0   Min.   :  0.0  
+##  1st Qu.:  67.25   N:147   W:165    1st Qu.: 109.2   1st Qu.:  7.0  
+##  Median : 170.50                    Median : 212.0   Median : 39.5  
+##  Mean   : 260.24                    Mean   : 288.9   Mean   :106.9  
+##  3rd Qu.: 339.25                    3rd Qu.: 325.0   3rd Qu.:166.0  
+##  Max.   :1566.00                    Max.   :1378.0   Max.   :492.0  
+##                                                                     
+##      Errors          Salary       NewLeague
+##  Min.   : 0.00   Min.   :  67.5   A:176    
+##  1st Qu.: 3.00   1st Qu.: 190.0   N:146    
+##  Median : 6.00   Median : 425.0            
+##  Mean   : 8.04   Mean   : 535.9            
+##  3rd Qu.:11.00   3rd Qu.: 750.0            
+##  Max.   :32.00   Max.   :2460.0            
+##                  NA's   :59
+

There are some missing values here, so before we proceed we will remove them:

+
Hitters = na.omit(Hitters)
+with(Hitters, sum(is.na(Salary)))
+
## [1] 0
+
+
+

Best Subset Regression

+

We will now use the package leaps to evaluate all the best-subset models.

+
library(leaps)
+regfit.full = regsubsets(Salary ~ ., data = Hitters)
+summary(regfit.full)
+
## Subset selection object
+## Call: regsubsets.formula(Salary ~ ., data = Hitters)
+## 19 Variables  (and intercept)
+##            Forced in Forced out
+## AtBat          FALSE      FALSE
+## Hits           FALSE      FALSE
+## HmRun          FALSE      FALSE
+## Runs           FALSE      FALSE
+## RBI            FALSE      FALSE
+## Walks          FALSE      FALSE
+## Years          FALSE      FALSE
+## CAtBat         FALSE      FALSE
+## CHits          FALSE      FALSE
+## CHmRun         FALSE      FALSE
+## CRuns          FALSE      FALSE
+## CRBI           FALSE      FALSE
+## CWalks         FALSE      FALSE
+## LeagueN        FALSE      FALSE
+## DivisionW      FALSE      FALSE
+## PutOuts        FALSE      FALSE
+## Assists        FALSE      FALSE
+## Errors         FALSE      FALSE
+## NewLeagueN     FALSE      FALSE
+## 1 subsets of each size up to 8
+## Selection Algorithm: exhaustive
+##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
+## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
+## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
+## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
+## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
+## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
+## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
+## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "  
+## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"  
+##          CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
+## 1  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
+## 2  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
+## 3  ( 1 ) "*"  " "    " "     " "       "*"     " "     " "    " "       
+## 4  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
+## 5  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
+## 6  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
+## 7  ( 1 ) " "  " "    " "     "*"       "*"     " "     " "    " "       
+## 8  ( 1 ) " "  "*"    " "     "*"       "*"     " "     " "    " "
+

By default it gives best subsets up to size 8; let's increase that to 19, i.e., all the variables.

+
regfit.full = regsubsets(Salary ~ ., data = Hitters, nvmax = 19L)
+reg.summary = summary(regfit.full)
+names(reg.summary)
+
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
+
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp")
+which.min(reg.summary$cp)
+
## [1] 10
+
points(10, reg.summary$cp[10L], pch = 20L, col = "red")
+

+

There is a plot method for the regsubsets object:

+
plot(regfit.full, scale = "Cp")
+

+
coef(regfit.full, 10L)
+
##  (Intercept)        AtBat         Hits        Walks       CAtBat 
+##  162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798 
+##        CRuns         CRBI       CWalks    DivisionW      PutOuts 
+##    1.4082490    0.7743122   -0.8308264 -112.3800575    0.2973726 
+##      Assists 
+##    0.2831680
+
+
+

Forward Stepwise Selection

+

Here we use the regsubsets function but specify the method = "forward" option:

+
regfit.fwd = regsubsets(Salary ~ ., data = Hitters, 
+                        nvmax = 19L, method = "forward")
+summary(regfit.fwd)
+
## Subset selection object
+## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19L, method = "forward")
+## 19 Variables  (and intercept)
+##            Forced in Forced out
+## AtBat          FALSE      FALSE
+## Hits           FALSE      FALSE
+## HmRun          FALSE      FALSE
+## Runs           FALSE      FALSE
+## RBI            FALSE      FALSE
+## Walks          FALSE      FALSE
+## Years          FALSE      FALSE
+## CAtBat         FALSE      FALSE
+## CHits          FALSE      FALSE
+## CHmRun         FALSE      FALSE
+## CRuns          FALSE      FALSE
+## CRBI           FALSE      FALSE
+## CWalks         FALSE      FALSE
+## LeagueN        FALSE      FALSE
+## DivisionW      FALSE      FALSE
+## PutOuts        FALSE      FALSE
+## Assists        FALSE      FALSE
+## Errors         FALSE      FALSE
+## NewLeagueN     FALSE      FALSE
+## 1 subsets of each size up to 19
+## Selection Algorithm: forward
+##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
+## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
+## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
+## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
+## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
+## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
+## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
+## 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
+## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
+## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
+## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
+## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
+## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
+## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
+## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
+## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
+## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
+## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
+## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"  
+## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"  
+##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
+## 1  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
+## 2  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
+## 3  ( 1 )  "*"  " "    " "     " "       "*"     " "     " "    " "       
+## 4  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
+## 5  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
+## 6  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
+## 7  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
+## 8  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
+## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
+## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
+## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
+## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
+## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
+## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
+## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
+## 16  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
+## 17  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
+## 18  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
+## 19  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"
+
plot(regfit.fwd, scale = "Cp")
+

+
+
+

Model Selection Using a Validation Set

+

Let's make a training and validation set, so that we can choose a good subset model. We will do it using a slightly different approach from what was done in the book.

+
dim(Hitters)
+
## [1] 263  20
+
set.seed(1)
+train = sample(263L, 180L, replace = FALSE)
+train
+
##   [1]  70  98 150 237  53 232 243 170 161  16 259  45 173  97 192 124 178
+##  [18] 245  94 190 228  52 158  31  64  92   4  91 205  80 113 140 115  43
+##  [35] 244 153 181  25 163  93 184 144 174 122 117 251   6 104 241 149 102
+##  [52] 183 224 242  15  21  66 107 136  83 186  60 211  67 130 210  95 151
+##  [69]  17 256 207 162 200 239 236 168 249  73 222 177 234 199 203  59 235
+##  [86]  37 126  22 230 226  42  11 110 214 132 134  77  69 188 100 206  58
+## [103]  44 159 101  34 208  75 185 201 261 112  54  65  23   2 106 254 257
+## [120] 154 142  71 166 221 105  63 143  29 240 212 167 172   5  84 120 133
+## [137]  72 191 248 138 182  74 179 135  87 196 157 119  13  99 263 125 247
+## [154]  50  55  20  57   8  30 194 139 238  46  78  88  41   7  33 141  32
+## [171] 180 164 213  36 215  79 225 229 198  76
+
regfit.fwd = regsubsets(Salary ~ ., data = Hitters[train, ],
+                        nvmax = 19L, method = "forward")
+

Now we will make predictions on the observations not used for training. We know there are 19 models, so we set up some vectors to record the errors. We have to do a bit of work here, because there is no predict method for regsubsets.

+
val.errors = numeric(19L)
+x.test = model.matrix(Salary ~ ., data = Hitters[-train, ])  # notice the -index!
+for (i in 1L:19L) {
+    coefi = coef(regfit.fwd, id = i)
+    pred = x.test[ , names(coefi)] %*% coefi
+    val.errors[i] = mean((Hitters$Salary[-train] - pred)^2)
+}
+plot(sqrt(val.errors), ylab = "Root MSE",
+     ylim = c(300, 400), pch = 19L, type = "b")
+points(sqrt(regfit.fwd$rss[-1L]/180), 
+       col = "blue", pch = 19L, type = "b")
+legend("topright", legend = c("Training", "Validation"), 
+       col = c("blue", "black"), pch = 19L)
+

+

As we expect, the training error goes down monotonically as the model gets bigger, but not so for the validation error.
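To pin down the winner on the validation set (a small addition, not in the original lab), we can look up the minimizing model size and its coefficients:

```r
# Model size with the smallest validation MSE, and its coefficients
best = which.min(val.errors)
best
coef(regfit.fwd, id = best)
```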

+

This was a little tedious, since there is no predict method for regsubsets, so we will write one!

+
predict.regsubsets = function(object, newdata, id, ...) {
+    form = as.formula(object$call[[2L]])
+    mat = model.matrix(form, newdata)
+    coefi = coef(object, id = id)
+    mat[ , names(coefi)] %*% coefi
+}
+
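As a quick check (an addition, not in the original lab), predict() now dispatches to our new method for any regsubsets fit:

```r
# Predictions from the size-10 forward-stepwise model for a few held-out rows
predict(regfit.fwd, Hitters[-train, ][1:5, ], id = 10L)
```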
+
+

Model Selection by Cross-Validation

+

We will do 10-fold cross-validation. It's really easy!

+
set.seed(11)
+folds = sample(rep(1L:10L, length = nrow(Hitters)))
+folds
+
##   [1]  3  1  4  4  7  7  3  5  5  2  5  2  8  3  3  3  9  2  9  8 10  5  8
+##  [24]  5  5  5  5 10 10  4  4  7  6  7  7  7  3  4  8  3  6  8 10  4  3  9
+##  [47]  9  3  4  9  8  7 10  6 10  3  6  9  4  2  8  2  5  6 10  7  2  8  8
+##  [70]  1  3  6  2  5  8  1  1  2  8  1 10  1  2  3  6  6  5  8  8 10  4  2
+##  [93]  6  1  7  4  8  3  7  8  7  1 10  1  6  2  9 10  1  7  7  4  7  4 10
+## [116]  3  6 10  6  6  9  8 10  6  7  9  6  7  1 10  2  2  5  9  9  6  1  1
+## [139]  2  9  4 10  5  3  7  7 10 10  9  3  3  7  3  1  4  6  6 10  4  9  9
+## [162]  1  3  6  8 10  8  5  4  5  6  2  9 10  3  7  7  6  6  2  3  2  4  4
+## [185]  4  4  8  2  3  5  9  9 10  2  1  3  9  6  7  3  1  9  4 10 10  8  8
+## [208]  8  2  5  9  8 10  5  8  2  4  1  4  4  5  5  2  1  9  5  2  9  9  5
+## [231]  3  2  1  9  1  7  2  5  8  1  1  7  6  6  4  5 10  5  7  4  8  6  9
+## [254]  1  2  5  7  1  3  1  3  1  2
+
table(folds)
+
## folds
+##  1  2  3  4  5  6  7  8  9 10 
+## 27 27 27 26 26 26 26 26 26 26
+
cv.errors = matrix(NA, 10L, 19L)
+for (k in 1L:10L) {
+    best.fit = regsubsets(Salary ~ ., data = Hitters[folds != k, ], 
+                          nvmax = 19L, method = "forward")
+    for (i in 1L:19L) {
+        pred = predict(best.fit, Hitters[folds == k, ], id = i)
+        cv.errors[k, i] = mean((Hitters$Salary[folds == k] - pred)^2)
+    }
+}
+rmse.cv = sqrt(apply(cv.errors, 2L, mean))
+plot(rmse.cv, pch = 19L, type = "b")
+
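As with the validation approach, we can extract the best model size from the CV curve and refit on all the data; this is a small addition, not part of the original lab:

```r
# Size minimizing the cross-validated RMSE, then the coefficients of that
# model refit on the full data set
best.cv = which.min(rmse.cv)
best.cv
coef(regsubsets(Salary ~ ., data = Hitters, nvmax = 19L, method = "forward"),
     id = best.cv)
```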

+
+
+

Ridge Regression and the Lasso

+

We will use the package glmnet, which does not use the model formula language, so we will set up an x matrix and a y vector.

+
library(glmnet)
+
## Loading required package: Matrix
+
## Loading required package: foreach
+
## Loaded glmnet 2.0-5
+
x = model.matrix(Salary ~ . - 1, data = Hitters)
+y = Hitters$Salary
+fit.ridge = glmnet(x, y, alpha = 0)
+plot(fit.ridge, xvar = "lambda", label = TRUE)
+

+
cv.ridge = cv.glmnet(x, y, alpha = 0)
+plot(cv.ridge)
+
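The fitted cv.glmnet object stores the lambda sequence and the cross-validated errors. As an aside (not in the original lab), we can pull out the error-minimizing lambda and the ridge coefficients there:

```r
cv.ridge$lambda.min                # lambda with the smallest CV error
cv.ridge$lambda.1se                # largest lambda within one SE of the minimum
coef(cv.ridge, s = "lambda.min")   # ridge coefficients at lambda.min
```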

+

Now we fit a lasso model; for this we use the default alpha = 1.

+
fit.lasso = glmnet(x, y)
+plot(fit.lasso, xvar = "lambda", label = TRUE)
+

+
cv.lasso = cv.glmnet(x, y)
+plot(cv.lasso)
+

+
coef(cv.lasso)
+
## 21 x 1 sparse Matrix of class "dgCMatrix"
+##                        1
+## (Intercept) 127.95694754
+## AtBat         .         
+## Hits          1.42342566
+## HmRun         .         
+## Runs          .         
+## RBI           .         
+## Walks         1.58214111
+## Years         .         
+## CAtBat        .         
+## CHits         .         
+## CHmRun        .         
+## CRuns         0.16027975
+## CRBI          0.33667715
+## CWalks        .         
+## LeagueA       .         
+## LeagueN       .         
+## DivisionW    -8.06171262
+## PutOuts       0.08393604
+## Assists       .         
+## Errors        .         
+## NewLeagueN    .
+

Suppose we want to use our earlier train/validation division to select the lambda for the lasso. This is easy to do.

+
lasso.tr = glmnet(x[train, ], y[train])
+lasso.tr
+
## 
+## Call:  glmnet(x = x[train, ], y = y[train]) 
+## 
+##       Df    %Dev    Lambda
+##  [1,]  0 0.00000 246.40000
+##  [2,]  1 0.05013 224.50000
+##  [3,]  1 0.09175 204.60000
+##  [4,]  2 0.13840 186.40000
+##  [5,]  2 0.18000 169.80000
+##  [6,]  3 0.21570 154.80000
+##  [7,]  3 0.24710 141.00000
+##  [8,]  3 0.27320 128.50000
+##  [9,]  4 0.30010 117.10000
+## [10,]  4 0.32360 106.70000
+## [11,]  4 0.34310  97.19000
+## [12,]  4 0.35920  88.56000
+## [13,]  5 0.37360  80.69000
+## [14,]  5 0.38900  73.52000
+## [15,]  5 0.40190  66.99000
+## [16,]  5 0.41260  61.04000
+## [17,]  5 0.42140  55.62000
+## [18,]  5 0.42880  50.67000
+## [19,]  5 0.43490  46.17000
+## [20,]  5 0.43990  42.07000
+## [21,]  5 0.44410  38.33000
+## [22,]  5 0.44760  34.93000
+## [23,]  6 0.45140  31.83000
+## [24,]  7 0.45480  29.00000
+## [25,]  7 0.45770  26.42000
+## [26,]  7 0.46010  24.07000
+## [27,]  8 0.46220  21.94000
+## [28,]  8 0.46380  19.99000
+## [29,]  8 0.46520  18.21000
+## [30,]  8 0.46630  16.59000
+## [31,]  8 0.46730  15.12000
+## [32,]  8 0.46810  13.78000
+## [33,]  9 0.47110  12.55000
+## [34,]  9 0.47380  11.44000
+## [35,]  9 0.47620  10.42000
+## [36,] 10 0.48050   9.49500
+## [37,]  9 0.48450   8.65200
+## [38,] 10 0.48770   7.88300
+## [39,] 10 0.49360   7.18300
+## [40,] 11 0.49890   6.54500
+## [41,] 12 0.50450   5.96300
+## [42,] 12 0.51010   5.43400
+## [43,] 13 0.51470   4.95100
+## [44,] 13 0.51850   4.51100
+## [45,] 13 0.52170   4.11000
+## [46,] 14 0.52440   3.74500
+## [47,] 14 0.52670   3.41200
+## [48,] 15 0.52870   3.10900
+## [49,] 15 0.53030   2.83300
+## [50,] 15 0.53160   2.58100
+## [51,] 16 0.53280   2.35200
+## [52,] 17 0.53420   2.14300
+## [53,] 18 0.53580   1.95300
+## [54,] 18 0.53760   1.77900
+## [55,] 18 0.53890   1.62100
+## [56,] 18 0.54000   1.47700
+## [57,] 18 0.54090   1.34600
+## [58,] 18 0.54160   1.22600
+## [59,] 18 0.54220   1.11700
+## [60,] 18 0.54280   1.01800
+## [61,] 18 0.54320   0.92770
+## [62,] 18 0.54360   0.84530
+## [63,] 18 0.54380   0.77020
+## [64,] 19 0.54410   0.70180
+## [65,] 19 0.54430   0.63940
+## [66,] 19 0.54450   0.58260
+## [67,] 19 0.54470   0.53090
+## [68,] 19 0.54490   0.48370
+## [69,] 20 0.54510   0.44070
+## [70,] 20 0.54520   0.40160
+## [71,] 20 0.54530   0.36590
+## [72,] 20 0.54540   0.33340
+## [73,] 20 0.54550   0.30380
+## [74,] 20 0.54560   0.27680
+## [75,] 20 0.54570   0.25220
+## [76,] 20 0.54570   0.22980
+## [77,] 20 0.54580   0.20940
+## [78,] 20 0.54580   0.19080
+## [79,] 20 0.54590   0.17380
+## [80,] 20 0.54590   0.15840
+## [81,] 20 0.54590   0.14430
+## [82,] 20 0.54590   0.13150
+## [83,] 20 0.54600   0.11980
+## [84,] 19 0.54600   0.10920
+## [85,] 19 0.54600   0.09948
+## [86,] 19 0.54600   0.09064
+## [87,] 19 0.54600   0.08259
+## [88,] 20 0.54600   0.07525
+## [89,] 20 0.54600   0.06856
+
pred = predict(lasso.tr, x[-train, ])
+dim(pred)
+
## [1] 83 89
+
rmse = sqrt(apply((y[-train] - pred)^2, 2, mean))
+plot(log(lasso.tr$lambda), rmse, type = "b", xlab = "Log(lambda)")
+

+
lam.best = lasso.tr$lambda[order(rmse)[1]]
+lam.best
+
## [1] 19.98706
+
coef(lasso.tr, s = lam.best)
+
## 21 x 1 sparse Matrix of class "dgCMatrix"
+##                        1
+## (Intercept)  107.9416686
+## AtBat          .        
+## Hits           0.1591252
+## HmRun          .        
+## Runs           .        
+## RBI            1.7340039
+## Walks          3.4657091
+## Years          .        
+## CAtBat         .        
+## CHits          .        
+## CHmRun         .        
+## CRuns          0.5386855
+## CRBI           .        
+## CWalks         .        
+## LeagueA      -30.0493021
+## LeagueN        .        
+## DivisionW   -113.8317016
+## PutOuts        0.2915409
+## Assists        .        
+## Errors         .        
+## NewLeagueN     2.0367518
+
+ + + + + + + + diff --git a/ch7/lab.Rmd b/ch7/lab.Rmd new file mode 100644 index 0000000..69b1fd2 --- /dev/null +++ b/ch7/lab.Rmd @@ -0,0 +1,146 @@ +Nonlinear Models +================== + +Here we explore the use of nonlinear models using some tools in R + +```{r} +library(ISLR) +``` + +Polynomials +------------- + +First we will use polynomials, and focus on a single predictor age: + +```{r} +fit = lm(wage ~ poly(age, 4L), data = Wage) +summary(fit) +``` + +The `poly()` function generates a basis of orthogonal polynomials. Lets make a plot of the fitted function, along with the standard errors of the fit. + +```{r fig.width = 7, fig.height = 6} +agelims = range(Wage$age) +age.grid = seq(from = agelims[1L], to = agelims[2L]) +preds = predict(fit, newdata = list(age = age.grid), se = TRUE) +se.bands = with(preds, fit + 2 * cbind(se.fit, -se.fit)) +with(Wage, plot(age, wage, col = "darkgrey")) +lines(age.grid, preds$fit, lwd = 2, col = "blue") +matlines(age.grid, se.bands, col = "blue", lty = 2L) +``` + +There are other more direct ways of doing this in R. For example + +```{r} +fita = lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage) +summary(fita) +``` + +Here `I()` is a *wrapper* function; we need it because `age^2` means something to the formula language, while `I(age^2)` is protected. The coefficients are different to those we got before! However, the fits are the same: + +```{r} +plot(fitted(fit), fitted(fita)) +``` + +By using orthogonal polynomials in this simple way, it turns out that we can separately test for each coefficient. So if we look at the summary again, we can see that the linear, quadratic and cubic terms are significant, but not the quartic. + +```{r} +summary(fit) +``` + +This only works with linear regression, and if there is a single predictor. In general we would use `anova()` as this next example demonstrates. + +```{r} +fita = lm(wage ~ education, data = Wage) +fitb = lm(wage ~ education + age, data = Wage) +fitc = lm(wage ~ education + poly(age, 2L), data = Wage) +fitd = lm(wage ~ education + poly(age, 3L), data = Wage) +anova(fita, fitb, fitc, fitd) +``` + +### Polynomial logistic regression + +Now we fit a logistic regression model to a binary response variable, constructed from `wage`. We code the big earners (`>250K`) as 1, else 0. + +```{r} +fit = glm(I(wage > 250) ~ poly(age, 3L), data = Wage, family = binomial) +summary(fit) +preds = predict(fit, list(age = age.grid), se = TRUE) +se.bands = with(preds, fit + 2 * cbind(fit = 0, lower = -se.fit, upper = se.fit)) +se.bands[1L:5L, ] +``` + +We have done the computations on the logit scale. To transform we need to apply the inverse logit mapping + +$$ p=\frac{e^\eta}{1+e^\eta}.$$ + +(Here we have used the ability of MarkDown to interpret TeX expressions.) We can do this simultaneously for all three columns of `se.bands`: + +```{r} +prob.bands = exp(se.bands)/(1 + exp(se.bands)) +matplot(age.grid, prob.bands, col = "blue", + lwd = c(2L, 1L, 1L), lty = c(1L, 2L, 2L), + type = "l", ylim = c(0, 0.1)) +with(Wage, points(jitter(age), I(wage > 250)/10, pch = "|", cex = 0.5)) +``` + +Splines +--------- + +Splines are more flexible than polynomials, but the idea is rather similar. Here we will explore cubic splines. 
+ + +```{r plotchunk} +library(splines) +fit = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage) +with(Wage, plot(age, wage, col = "darkgrey")) +lines(age.grid, predict(fit, list(age = age.grid)), + col = "darkgreen", lwd = 2) +abline(v = c(25, 40, 60), lty = 2L, col = "darkgreen") +``` + +The smoothing splines does not require knot selection, but it does have a smoothing parameter, which can conveniently be specified via the effective degrees of freedom or `df`. + +```{r, ref.label = 'plotchunk'} +fit = with(Wage, smooth.spline(age, wage, df = 16)) +lines(fit, col = "red", lwd = 2L) +``` + +Or we can use LOO cross-validation to select the smoothing parameter for us automatically: + +```{r, ref.label = 'plotchunk'} +fit = with(Wage, smooth.spline(age, wage, cv = TRUE)) +lines(fit, col = "purple", lwd = 2L) +fit +``` + +Generalized Additive Models +----------------------------- + +So far we have focused on fitting models with mostly single nonlinear terms. The `gam` package makes it easier to work with multiple nonlinear terms. In addition it knows how to plot these functions and their standard errors. + +```{r} +library(gam) +gam1 = gam(wage ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage) +par(mfrow = c(1L, 3L)) +plot(gam1, se = TRUE) +gam2 = gam(I(wage > 250) ~ s(age, df = 4) + s(year, df = 4) + education, + data = Wage, family = binomial) +plot(gam2) +``` + +Lets see if we need a nonlinear terms for year + +```{r} +gam2a = gam(I(wage > 250) ~ s(age, df = 4) + year + education, + data = Wage, family = binomial) +anova(gam2a, gam2, test = "Chisq") +``` + +One nice feature of the `gam` package is that it knows how to plot the functions nicely, even for models fit by `lm` and `glm`. + +```{r, fig.width = 10, fig.height = 5} +par(mfrow = c(1L, 3L)) +lm1 = lm(wage ~ ns(age, df = 4) + ns(year, df = 4) + education, data = Wage) +plot.gam(lm1, se = TRUE) +``` diff --git a/ch7/lab.html b/ch7/lab.html new file mode 100644 index 0000000..2d690b3 --- /dev/null +++ b/ch7/lab.html @@ -0,0 +1,364 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Nonlinear Models

+

Here we explore the use of nonlinear models using some tools in R.

+
library(ISLR)
+
+

Polynomials

+

First we will use polynomials, and focus on a single predictor, age:

+
fit = lm(wage ~ poly(age, 4L), data = Wage)
+summary(fit)
+
## 
+## Call:
+## lm(formula = wage ~ poly(age, 4L), data = Wage)
+## 
+## Residuals:
+##     Min      1Q  Median      3Q     Max 
+## -98.707 -24.626  -4.993  15.217 203.693 
+## 
+## Coefficients:
+##                Estimate Std. Error t value Pr(>|t|)    
+## (Intercept)    111.7036     0.7287 153.283  < 2e-16 ***
+## poly(age, 4)1  447.0679    39.9148  11.201  < 2e-16 ***
+## poly(age, 4)2 -478.3158    39.9148 -11.983  < 2e-16 ***
+## poly(age, 4)3  125.5217    39.9148   3.145  0.00168 ** 
+## poly(age, 4)4  -77.9112    39.9148  -1.952  0.05104 .  
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+## 
+## Residual standard error: 39.91 on 2995 degrees of freedom
+## Multiple R-squared:  0.08626,    Adjusted R-squared:  0.08504 
+## F-statistic: 70.69 on 4 and 2995 DF,  p-value: < 2.2e-16
+

The poly() function generates a basis of orthogonal polynomials. Let's make a plot of the fitted function, along with the standard errors of the fit.

+
agelims = range(Wage$age)
+age.grid = seq(from = agelims[1L], to = agelims[2L])
+preds = predict(fit, newdata = list(age = age.grid), se = TRUE)
+se.bands = with(preds, fit + 2 * cbind(se.fit, -se.fit))
+with(Wage, plot(age, wage, col = "darkgrey"))
+lines(age.grid, preds$fit, lwd = 2, col = "blue")
+matlines(age.grid, se.bands, col = "blue", lty = 2L)
+

+

There are other, more direct ways of doing this in R. For example:

+
fita = lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage)
+summary(fita)
+
## 
+## Call:
+## lm(formula = wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage)
+## 
+## Residuals:
+##     Min      1Q  Median      3Q     Max 
+## -98.707 -24.626  -4.993  15.217 203.693 
+## 
+## Coefficients:
+##               Estimate Std. Error t value Pr(>|t|)    
+## (Intercept) -1.842e+02  6.004e+01  -3.067 0.002180 ** 
+## age          2.125e+01  5.887e+00   3.609 0.000312 ***
+## I(age^2)    -5.639e-01  2.061e-01  -2.736 0.006261 ** 
+## I(age^3)     6.811e-03  3.066e-03   2.221 0.026398 *  
+## I(age^4)    -3.204e-05  1.641e-05  -1.952 0.051039 .  
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+## 
+## Residual standard error: 39.91 on 2995 degrees of freedom
+## Multiple R-squared:  0.08626,    Adjusted R-squared:  0.08504 
+## F-statistic: 70.69 on 4 and 2995 DF,  p-value: < 2.2e-16
+

Here I() is a wrapper function; we need it because age^2 means something special to the formula language, while I(age^2) is protected. The coefficients are different from those we got before! However, the fits are the same:

+
plot(fitted(fit), fitted(fita))
+
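Equivalently (an aside, not in the original lab), poly() with raw = TRUE uses the raw polynomial basis and reproduces the coefficients of the I() fit; fit.raw is a name introduced here for illustration:

```r
# Same model in the raw (non-orthogonal) polynomial basis
fit.raw = lm(wage ~ poly(age, 4L, raw = TRUE), data = Wage)
coef(fit.raw)  # matches the coefficients of fita above
```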

+

By using orthogonal polynomials in this simple way, it turns out that we can separately test for each coefficient. So if we look at the summary again, we can see that the linear, quadratic and cubic terms are significant, but not the quartic.

+
summary(fit)
+
## 
+## Call:
+## lm(formula = wage ~ poly(age, 4L), data = Wage)
+## 
+## Residuals:
+##     Min      1Q  Median      3Q     Max 
+## -98.707 -24.626  -4.993  15.217 203.693 
+## 
+## Coefficients:
+##                Estimate Std. Error t value Pr(>|t|)    
+## (Intercept)    111.7036     0.7287 153.283  < 2e-16 ***
+## poly(age, 4)1  447.0679    39.9148  11.201  < 2e-16 ***
+## poly(age, 4)2 -478.3158    39.9148 -11.983  < 2e-16 ***
+## poly(age, 4)3  125.5217    39.9148   3.145  0.00168 ** 
+## poly(age, 4)4  -77.9112    39.9148  -1.952  0.05104 .  
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+## 
+## Residual standard error: 39.91 on 2995 degrees of freedom
+## Multiple R-squared:  0.08626,    Adjusted R-squared:  0.08504 
+## F-statistic: 70.69 on 4 and 2995 DF,  p-value: < 2.2e-16
+

This only works with linear regression, and only when there is a single predictor. In general we would use anova(), as this next example demonstrates.

+
fita = lm(wage ~ education, data = Wage)
+fitb = lm(wage ~ education + age, data = Wage)
+fitc = lm(wage ~ education + poly(age, 2L), data = Wage)
+fitd = lm(wage ~ education + poly(age, 3L), data = Wage)
+anova(fita, fitb, fitc, fitd)
+
## Analysis of Variance Table
+## 
+## Model 1: wage ~ education
+## Model 2: wage ~ education + age
+## Model 3: wage ~ education + poly(age, 2L)
+## Model 4: wage ~ education + poly(age, 3L)
+##   Res.Df     RSS Df Sum of Sq        F Pr(>F)    
+## 1   2995 3995721                                 
+## 2   2994 3867992  1    127729 102.7378 <2e-16 ***
+## 3   2993 3725395  1    142597 114.6969 <2e-16 ***
+## 4   2992 3719809  1      5587   4.4936 0.0341 *  
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+

Polynomial logistic regression

+

Now we fit a logistic regression model to a binary response variable, constructed from wage. We code the big earners (>250K) as 1, else 0.

+
fit = glm(I(wage > 250) ~ poly(age, 3L), data = Wage, family = binomial)
+summary(fit)
+
## 
+## Call:
+## glm(formula = I(wage > 250) ~ poly(age, 3L), family = binomial, 
+##     data = Wage)
+## 
+## Deviance Residuals: 
+##     Min       1Q   Median       3Q      Max  
+## -0.2808  -0.2736  -0.2487  -0.1758   3.2868  
+## 
+## Coefficients:
+##               Estimate Std. Error z value Pr(>|z|)    
+## (Intercept)    -3.8486     0.1597 -24.100  < 2e-16 ***
+## poly(age, 3)1  37.8846    11.4818   3.300 0.000968 ***
+## poly(age, 3)2 -29.5129    10.5626  -2.794 0.005205 ** 
+## poly(age, 3)3   9.7966     8.9990   1.089 0.276317    
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+## 
+## (Dispersion parameter for binomial family taken to be 1)
+## 
+##     Null deviance: 730.53  on 2999  degrees of freedom
+## Residual deviance: 707.92  on 2996  degrees of freedom
+## AIC: 715.92
+## 
+## Number of Fisher Scoring iterations: 8
+
preds = predict(fit, list(age = age.grid), se = TRUE)
+se.bands = with(preds, fit + 2 * cbind(fit = 0, lower = -se.fit, upper = se.fit))
+se.bands[1L:5L, ]
+
##         fit      lower     upper
+## 1 -7.664756 -10.759826 -4.569686
+## 2 -7.324776 -10.106699 -4.542852
+## 3 -7.001732  -9.492821 -4.510643
+## 4 -6.695229  -8.917158 -4.473300
+## 5 -6.404868  -8.378691 -4.431045
+

We have done the computations on the logit scale. To transform back to probabilities we need to apply the inverse logit mapping

+

\[ p=\frac{e^\eta}{1+e^\eta}.\]

+

(Here we have used the ability of Markdown to interpret TeX expressions.) We can do this simultaneously for all three columns of se.bands:

+
prob.bands = exp(se.bands)/(1 + exp(se.bands))
+matplot(age.grid, prob.bands, col = "blue",
+        lwd = c(2L, 1L, 1L), lty = c(1L, 2L, 2L), 
+        type = "l", ylim = c(0, 0.1))
+with(Wage, points(jitter(age), I(wage > 250)/10, pch = "|", cex = 0.5))
+
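R's built-in plogis() computes the same inverse logit, so as a quick check (an aside, not in the original lab) it should agree with the manual transform:

```r
# plogis(q) = exp(q)/(1 + exp(q)); compare with prob.bands computed above
all.equal(c(plogis(se.bands)), c(prob.bands))
```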

+
+
+
+

Splines

+

Splines are more flexible than polynomials, but the idea is rather similar. Here we will explore cubic splines.

+
library(splines)
+fit = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
+with(Wage, plot(age, wage, col = "darkgrey"))
+lines(age.grid, predict(fit, list(age = age.grid)),
+      col = "darkgreen", lwd = 2)
+abline(v = c(25, 40, 60), lty = 2L, col = "darkgreen")
+

+

A smoothing spline does not require knot selection, but it does have a smoothing parameter, which can conveniently be specified via the effective degrees of freedom, or df.

+
fit = with(Wage, smooth.spline(age, wage, df = 16))
+lines(fit, col = "red", lwd = 2L)
+

+

Or we can use leave-one-out cross-validation to select the smoothing parameter automatically:

+
fit = with(Wage, smooth.spline(age, wage, cv = TRUE))
+lines(fit, col = "purple", lwd = 2L)
+fit
+

+
+
+

Generalized Additive Models

+

So far we have mostly focused on fitting models with a single nonlinear term. The gam package makes it easier to work with multiple nonlinear terms. In addition, it knows how to plot these functions and their standard errors.

+
library(gam)
+
## Loading required package: foreach
+
## Loaded gam 1.14
+
gam1 = gam(wage ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage)
+par(mfrow = c(1L, 3L))
+plot(gam1, se = TRUE)
+

+
gam2 = gam(I(wage > 250) ~ s(age, df = 4) + s(year, df = 4) + education, 
+           data = Wage, family = binomial)
+plot(gam2)
+

+

Let's see if we need a nonlinear term for year.

+
gam2a = gam(I(wage > 250) ~ s(age, df = 4) + year + education,
+            data = Wage, family = binomial)
+anova(gam2a, gam2, test = "Chisq")
+
## Analysis of Deviance Table
+## 
+## Model 1: I(wage > 250) ~ s(age, df = 4) + year + education
+## Model 2: I(wage > 250) ~ s(age, df = 4) + s(year, df = 4) + education
+##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
+## 1      2990     603.78                     
+## 2      2987     602.87  3  0.90498   0.8242
+

One nice feature of the gam package is that it knows how to plot the functions nicely, even for models fit by lm and glm.

+
par(mfrow = c(1L, 3L))
+lm1 = lm(wage ~ ns(age, df = 4) + ns(year, df = 4) + education, data = Wage)
+plot.gam(lm1, se = TRUE)
+

+
+
+ + + + + + + + diff --git a/ch8/lab.Rmd b/ch8/lab.Rmd new file mode 100644 index 0000000..1c3a7cb --- /dev/null +++ b/ch8/lab.Rmd @@ -0,0 +1,139 @@ +Decision Trees +================ + +We will have a look at the `Carseats` data using the `tree` package in R, as in the lab in the book. +We create a binary response variable `High` (for high sales), and we include it in the same dataframe. + +```{r} +library(ISLR) +library(tree) +hist(Carseats$Sales) +High = with(Carseats, ifelse(Sales <= 8, "No", "Yes")) +Carseats = data.frame(Carseats, High) +``` + +Now we fit a tree to these data, and summarize and plot it. Notice that we have to _exclude_ `Sales` from the right-hand side of the formula, because the response is derived from it. + +```{r} +tree.carseats = tree(High ~ . - Sales, data = Carseats) +summary(tree.carseats) +plot(tree.carseats) +text(tree.carseats, pretty = 0) +``` + +For a detailed summary of the tree, print it: + +```{r} +tree.carseats +``` + +Let's create a training and test set (250,150) split of the 400 observations, grow the tree on the training set, and evaluate its performance on the test set. + +```{r} +set.seed(1011) +train = sample(1:nrow(Carseats), 250L) +tree.carseats = tree(High ~ . - Sales, Carseats, subset = train) +plot(tree.carseats) +text(tree.carseats, pretty = 0) +tree.pred = predict(tree.carseats, Carseats[-train, ], type = "class") +with(Carseats[-train, ], table(tree.pred, High)) +(72 + 33)/150 +``` + +This tree was grown to full depth, and might be too variable. We now use CV to prune it. + +```{r} +cv.carseats = cv.tree(tree.carseats, FUN = prune.misclass) +cv.carseats +plot(cv.carseats) +prune.carseats = prune.misclass(tree.carseats, best = 13) +plot(prune.carseats) +text(prune.carseats, pretty = 0) +``` + +Now lets evaluate this pruned tree on the test data. + +```{r} +tree.pred = predict(prune.carseats, Carseats[-train, ], type = "class") +with(Carseats[-train, ], table(tree.pred, High)) +(72 + 32)/150 +``` + +It has done about the same as our original tree. So pruning did not hurt us wrt misclassification errors, and gave us a simpler tree. + +Random Forests and Boosting +============================= + +These methods use trees as building blocks to build more complex models. Here we will use the Boston housing data to explore random forests and boosting. These data are in the `MASS` package. +It gives housing values and other statistics in each of 506 suburbs of Boston based on a 1970 census. + +Random Forests +--------------- + +Random forests build lots of bushy trees, and then average them to reduce the variance. + +```{r} +library(randomForest) +library(MASS) +set.seed(101) +dim(Boston) +train = sample(1:nrow(Boston), 300L) +?Boston +``` + +Lets fit a random forest and see how well it performs. We will use the response `medv`, the median housing value (in $1K dollars) + +```{r} +rf.boston = randomForest(medv ~ ., data = Boston, subset = train) +rf.boston +``` + +The MSR and % variance explained are based on OOB or _out-of-bag_ estimates, a very clever device in random forests to get honest error estimates. The model reports that `mtry=4`, which is the number of variables randomly chosen at each split. Since $p=13$ here, we could try all 13 possible values of `mtry`. We will do so, record the results, and make a plot. 
+ +```{r} +oob.err = numeric(13L) +test.err = numeric(13L) +for (mtry in 1L:13L) { + fit = randomForest(medv ~ ., data = Boston, + subset = train, mtry = mtry, ntree = 400L) + oob.err[mtry] = fit$mse[400L] + pred = predict(fit, Boston[-train, ]) + test.err[mtry] = with(Boston[-train, ], mean((medv - pred)^2)) + cat(mtry, " ") +} +matplot(1L:mtry, cbind(test.err, oob.err), pch = 19L, + col = c("red", "blue"), type = "b", ylab = "Mean Squared Error") +legend("topright", legend = c("OOB", "Test"), + pch = 19L, col = c("red", "blue")) +``` + +Not too difficult! Although the test-error curve drops below the OOB curve, these are estimates based on data, and so have their own standard errors (which are typically quite large). Notice that the points at the end with `mtry=13` correspond to bagging. + +Boosting +---------- + +Boosting builds lots of smaller trees. Unlike random forests, each new tree in boosting tries to patch up the deficiencies of the current ensemble. + +```{r} +library(gbm) +boost.boston = + gbm(medv ~ ., data = Boston[train, ], + distribution = "gaussian", n.trees = 10000L, + shrinkage = 0.01, interaction.depth = 4L) +summary(boost.boston, las = 2L) +plot(boost.boston, i = "lstat") +plot(boost.boston, i = "rm") +``` + +Lets make a prediction on the test set. With boosting, the number of trees is a tuning parameter, and if we have too many we can overfit. So we should use cross-validation to select the number of trees. We will leave this as an exercise. Instead, we will compute the test error as a function of the number of trees, and make a plot. + +```{r} +n.trees = seq(from = 100, to = 10000, by = 100) +predmat = predict(boost.boston, newdata = Boston[-train, ], n.trees = n.trees) +dim(predmat) +berr = with(Boston[-train, ], apply((predmat - medv)^2, 2L, mean)) +plot(n.trees, berr, pch = 19L, + ylab = "Mean Squared Error", xlab = "# Trees", + main = "Boosting Test Error") +abline(h = min(test.err), col = "red") +``` diff --git a/ch8/lab.html b/ch8/lab.html new file mode 100644 index 0000000..e7f8f19 --- /dev/null +++ b/ch8/lab.html @@ -0,0 +1,371 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Decision Trees

+

We will have a look at the Carseats data using the tree package in R, as in the lab in the book. We create a binary response variable High (for high sales), and we include it in the same data frame.

+
library(ISLR)
+library(tree)
+hist(Carseats$Sales)
+

+
High = with(Carseats, ifelse(Sales <= 8, "No", "Yes"))
+Carseats = data.frame(Carseats, High)
+

Now we fit a tree to these data, and summarize and plot it. Notice that we have to exclude Sales from the right-hand side of the formula, because the response is derived from it.

+
tree.carseats = tree(High ~ . - Sales, data = Carseats)
+summary(tree.carseats)
+
## 
+## Classification tree:
+## tree(formula = High ~ . - Sales, data = Carseats)
+## Variables actually used in tree construction:
+## [1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population" 
+## [6] "Advertising" "Age"         "US"         
+## Number of terminal nodes:  27 
+## Residual mean deviance:  0.4575 = 170.7 / 373 
+## Misclassification error rate: 0.09 = 36 / 400
+
plot(tree.carseats)
+text(tree.carseats, pretty = 0)
+

+

For a detailed summary of the tree, print it:

+
tree.carseats
+
## node), split, n, deviance, yval, (yprob)
+##       * denotes terminal node
+## 
+##   1) root 400 541.500 No ( 0.59000 0.41000 )  
+##     2) ShelveLoc: Bad,Medium 315 390.600 No ( 0.68889 0.31111 )  
+##       4) Price < 92.5 46  56.530 Yes ( 0.30435 0.69565 )  
+##         8) Income < 57 10  12.220 No ( 0.70000 0.30000 )  
+##          16) CompPrice < 110.5 5   0.000 No ( 1.00000 0.00000 ) *
+##          17) CompPrice > 110.5 5   6.730 Yes ( 0.40000 0.60000 ) *
+##         9) Income > 57 36  35.470 Yes ( 0.19444 0.80556 )  
+##          18) Population < 207.5 16  21.170 Yes ( 0.37500 0.62500 ) *
+##          19) Population > 207.5 20   7.941 Yes ( 0.05000 0.95000 ) *
+##       5) Price > 92.5 269 299.800 No ( 0.75465 0.24535 )  
+##        10) Advertising < 13.5 224 213.200 No ( 0.81696 0.18304 )  
+##          20) CompPrice < 124.5 96  44.890 No ( 0.93750 0.06250 )  
+##            40) Price < 106.5 38  33.150 No ( 0.84211 0.15789 )  
+##              80) Population < 177 12  16.300 No ( 0.58333 0.41667 )  
+##               160) Income < 60.5 6   0.000 No ( 1.00000 0.00000 ) *
+##               161) Income > 60.5 6   5.407 Yes ( 0.16667 0.83333 ) *
+##              81) Population > 177 26   8.477 No ( 0.96154 0.03846 ) *
+##            41) Price > 106.5 58   0.000 No ( 1.00000 0.00000 ) *
+##          21) CompPrice > 124.5 128 150.200 No ( 0.72656 0.27344 )  
+##            42) Price < 122.5 51  70.680 Yes ( 0.49020 0.50980 )  
+##              84) ShelveLoc: Bad 11   6.702 No ( 0.90909 0.09091 ) *
+##              85) ShelveLoc: Medium 40  52.930 Yes ( 0.37500 0.62500 )  
+##               170) Price < 109.5 16   7.481 Yes ( 0.06250 0.93750 ) *
+##               171) Price > 109.5 24  32.600 No ( 0.58333 0.41667 )  
+##                 342) Age < 49.5 13  16.050 Yes ( 0.30769 0.69231 ) *
+##                 343) Age > 49.5 11   6.702 No ( 0.90909 0.09091 ) *
+##            43) Price > 122.5 77  55.540 No ( 0.88312 0.11688 )  
+##              86) CompPrice < 147.5 58  17.400 No ( 0.96552 0.03448 ) *
+##              87) CompPrice > 147.5 19  25.010 No ( 0.63158 0.36842 )  
+##               174) Price < 147 12  16.300 Yes ( 0.41667 0.58333 )  
+##                 348) CompPrice < 152.5 7   5.742 Yes ( 0.14286 0.85714 ) *
+##                 349) CompPrice > 152.5 5   5.004 No ( 0.80000 0.20000 ) *
+##               175) Price > 147 7   0.000 No ( 1.00000 0.00000 ) *
+##        11) Advertising > 13.5 45  61.830 Yes ( 0.44444 0.55556 )  
+##          22) Age < 54.5 25  25.020 Yes ( 0.20000 0.80000 )  
+##            44) CompPrice < 130.5 14  18.250 Yes ( 0.35714 0.64286 )  
+##              88) Income < 100 9  12.370 No ( 0.55556 0.44444 ) *
+##              89) Income > 100 5   0.000 Yes ( 0.00000 1.00000 ) *
+##            45) CompPrice > 130.5 11   0.000 Yes ( 0.00000 1.00000 ) *
+##          23) Age > 54.5 20  22.490 No ( 0.75000 0.25000 )  
+##            46) CompPrice < 122.5 10   0.000 No ( 1.00000 0.00000 ) *
+##            47) CompPrice > 122.5 10  13.860 No ( 0.50000 0.50000 )  
+##              94) Price < 125 5   0.000 Yes ( 0.00000 1.00000 ) *
+##              95) Price > 125 5   0.000 No ( 1.00000 0.00000 ) *
+##     3) ShelveLoc: Good 85  90.330 Yes ( 0.22353 0.77647 )  
+##       6) Price < 135 68  49.260 Yes ( 0.11765 0.88235 )  
+##        12) US: No 17  22.070 Yes ( 0.35294 0.64706 )  
+##          24) Price < 109 8   0.000 Yes ( 0.00000 1.00000 ) *
+##          25) Price > 109 9  11.460 No ( 0.66667 0.33333 ) *
+##        13) US: Yes 51  16.880 Yes ( 0.03922 0.96078 ) *
+##       7) Price > 135 17  22.070 No ( 0.64706 0.35294 )  
+##        14) Income < 46 6   0.000 No ( 1.00000 0.00000 ) *
+##        15) Income > 46 11  15.160 Yes ( 0.45455 0.54545 ) *
+

Let’s create a training and test set (250,150) split of the 400 observations, grow the tree on the training set, and evaluate its performance on the test set.

+
set.seed(1011)
+train = sample(1:nrow(Carseats), 250L)
+tree.carseats = tree(High ~ . - Sales, Carseats, subset = train)
+plot(tree.carseats)
+text(tree.carseats, pretty = 0)
+

+
tree.pred = predict(tree.carseats, Carseats[-train, ], type = "class")
+with(Carseats[-train, ], table(tree.pred, High))
+
##          High
+## tree.pred No Yes
+##       No  72  27
+##       Yes 18  33
+
(72 + 33)/150
+
## [1] 0.7
+

This tree was grown to full depth, and might be too variable. We now use CV to prune it.

+
cv.carseats = cv.tree(tree.carseats, FUN = prune.misclass)
+cv.carseats
+
## $size
+##  [1] 20 14 13 10  9  7  6  5  2  1
+## 
+## $dev
+##  [1]  65  65  57  57  59  64  64  59  78 104
+## 
+## $k
+##  [1]      -Inf  0.000000  1.000000  1.333333  2.000000  2.500000  4.000000
+##  [8]  5.000000  9.000000 31.000000
+## 
+## $method
+## [1] "misclass"
+## 
+## attr(,"class")
+## [1] "prune"         "tree.sequence"
+
plot(cv.carseats)
+

+
prune.carseats = prune.misclass(tree.carseats, best = 13)
+plot(prune.carseats)
+text(prune.carseats, pretty = 0)
+

+

Now let's evaluate this pruned tree on the test data.

+
tree.pred = predict(prune.carseats, Carseats[-train, ], type = "class")
+with(Carseats[-train, ], table(tree.pred, High))
+
##          High
+## tree.pred No Yes
+##       No  72  28
+##       Yes 18  32
+
(72 + 32)/150
+
## [1] 0.6933333
+

The pruned tree performs about the same as our original tree, so pruning did not hurt us with respect to misclassification error, and it gave us a simpler tree.

+
+
+

Random Forests and Boosting

+

These methods use trees as building blocks to build more complex models. Here we will use the Boston housing data to explore random forests and boosting. These data are in the MASS package; they give housing values and other statistics for each of 506 suburbs of Boston, based on the 1970 census.

+
+

Random Forests

+

Random forests build lots of bushy trees, and then average them to reduce the variance.

+
library(randomForest)
+
## randomForest 4.6-12
+
## Type rfNews() to see new features/changes/bug fixes.
+
library(MASS)
+set.seed(101)
+dim(Boston)
+
## [1] 506  14
+
train = sample(1:nrow(Boston), 300L)
+?Boston
+

Let's fit a random forest and see how well it performs. We will use the response medv, the median housing value (in thousands of dollars).

+
rf.boston = randomForest(medv ~ ., data = Boston, subset = train)
+rf.boston
+
## 
+## Call:
+##  randomForest(formula = medv ~ ., data = Boston, subset = train) 
+##                Type of random forest: regression
+##                      Number of trees: 500
+## No. of variables tried at each split: 4
+## 
+##           Mean of squared residuals: 12.34243
+##                     % Var explained: 85.09
+

The mean of squared residuals (MSR) and % variance explained are based on OOB, or out-of-bag, estimates, a very clever device in random forests to get honest error estimates. The model reports that mtry=4, which is the number of variables randomly chosen at each split. Since \(p=13\) here, we could try all 13 possible values of mtry. We will do so, record the results, and make a plot.

+
oob.err = numeric(13L)
+test.err = numeric(13L)
+for (mtry in 1L:13L) {
+    fit = randomForest(medv ~ ., data = Boston,
+                       subset = train, mtry = mtry, ntree = 400L)
+    oob.err[mtry] = fit$mse[400L]
+    pred = predict(fit, Boston[-train, ])
+    test.err[mtry] = with(Boston[-train, ], mean((medv - pred)^2))
+    cat(mtry, " ")
+}
+
## 1  2  3  4  5  6  7  8  9  10  11  12  13
+
matplot(1L:mtry, cbind(test.err, oob.err), pch = 19L,
+        col = c("red", "blue"), type = "b", ylab = "Mean Squared Error")
+legend("topright", legend = c("OOB", "Test"),
+       pch = 19L, col = c("red", "blue"))
+

+

Not too difficult! Although the test-error curve drops below the OOB curve, these are estimates based on data, and so have their own standard errors (which are typically quite large). Notice that the points at the end with mtry=13 correspond to bagging.
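To make the point about standard errors concrete, here is a minimal sketch (reusing the Boston and train objects from above) that refits a forest at a single value of mtry and estimates the standard error of the test MSE directly from the squared errors:

```{r}
# Sketch: approximate standard error of the test-set MSE at one value of mtry.
# Reuses Boston, train, and the randomForest package loaded above.
fit.bag = randomForest(medv ~ ., data = Boston, subset = train,
                       mtry = 13L, ntree = 400L)
sq.err = with(Boston[-train, ],
              (medv - predict(fit.bag, Boston[-train, ]))^2)
c(mse = mean(sq.err),
  se = sd(sq.err) / sqrt(length(sq.err)))  # standard error of the mean squared error
```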

+
+
+

Boosting

+

Boosting builds lots of smaller trees. Unlike random forests, each new tree in boosting tries to patch up the deficiencies of the current ensemble.

+
library(gbm)
+
## Loading required package: survival
+
## Loading required package: lattice
+
## Loading required package: splines
+
## Loading required package: parallel
+
## Loaded gbm 2.1.1
+
boost.boston = 
+  gbm(medv ~ ., data = Boston[train, ],
+      distribution = "gaussian", n.trees = 10000L, 
+      shrinkage = 0.01, interaction.depth = 4L)
+summary(boost.boston, las = 2L)
+

+
##             var    rel.inf
+## lstat     lstat 33.9327851
+## rm           rm 33.0292330
+## dis         dis  8.2042669
+## crim       crim  5.6007746
+## nox         nox  4.7898634
+## ptratio ptratio  3.8768099
+## black     black  3.2206529
+## age         age  3.1046916
+## tax         tax  1.4980644
+## chas       chas  1.3570033
+## rad         rad  0.7030437
+## indus     indus  0.5757992
+## zn           zn  0.1070121
+
plot(boost.boston, i = "lstat")
+

+
plot(boost.boston, i = "rm")
+

+

Let's make predictions on the test set. With boosting, the number of trees is a tuning parameter, and if we use too many we can overfit. So we should use cross-validation to select the number of trees; we leave that as an exercise (a sketch using gbm's built-in cross-validation appears after the plot below). Instead, we will compute the test error as a function of the number of trees and make a plot.

+
n.trees = seq(from = 100, to = 10000, by = 100)
+predmat = predict(boost.boston, newdata = Boston[-train, ], n.trees = n.trees)
+dim(predmat)
+
## [1] 206 100
+
berr = with(Boston[-train, ], apply((predmat - medv)^2, 2L, mean))
+plot(n.trees, berr, pch = 19L, 
+     ylab = "Mean Squared Error", xlab = "# Trees", 
+     main = "Boosting Test Error")
+abline(h = min(test.err), col = "red")
+

+
+
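As promised, here is a minimal sketch of choosing the number of trees by cross-validation; it uses gbm's built-in cv.folds argument together with gbm.perf, and the settings simply mirror the fit above rather than being tuned:

```{r}
# Sketch: select the number of boosting trees by 5-fold cross-validation.
# Reuses Boston and train from above; settings mirror boost.boston.
cv.boston = gbm(medv ~ ., data = Boston[train, ],
                distribution = "gaussian", n.trees = 10000L,
                shrinkage = 0.01, interaction.depth = 4L,
                cv.folds = 5L)
best.n.trees = gbm.perf(cv.boston, method = "cv")  # plots CV error, returns the minimizing tree count
best.n.trees
```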
+ + + + +
+ + + + + + + + diff --git a/ch9/lab.Rmd b/ch9/lab.Rmd new file mode 100644 index 0000000..f564902 --- /dev/null +++ b/ch9/lab.Rmd @@ -0,0 +1,115 @@ +SVM +===== + +To demonstrate the SVM, it is easiest to work in low dimensions, so we can see the data. + +Linear SVM classifier +----------------------- + +Let's generate some data in two dimensions, and make them a little separated. + +```{r} +set.seed(10111) +x=matrix(rnorm(40L), 20L, 2L) +y=rep(c(-1L, 1L), each = 10L) +x[y == 1L, ] = x[y == 1L, ] + 1 +plot(x, col = y + 3L, pch = 19L) +``` + +Now we will load the package `e1071` which contains the `svm` function we will use. We then compute the fit. Notice that we have to specify a `cost` parameter, which is a tuning parameter. + +```{r} +library(e1071) +dat = data.frame(x, y = as.factor(y)) +svmfit = svm(y ~ ., data = dat, + kernel = "linear", cost = 10, scale = FALSE) +print(svmfit) +plot(svmfit, dat) +``` + +As mentioned in the the chapter, the plot function is somewhat crude, and plots X2 on the horizontal axis (unlike what R would do automatically for a matrix). Lets see how we might make our own plot. + +The first thing we will do is make a grid of values for X1 and X2. We will write a function to do that, in case we want to reuse it. It uses the handy function `expand.grid`, and produces the coordinates of `n*n` points on a lattice covering the domain of `x`. Having made the lattice, we make a prediction at each point on the lattice. We then plot the lattice, color-coded according to the classification. Now we can see the decision boundary. + +The support points (points on the margin, or on the wrong side of the margin) are indexed in the `$index` component of the fit. + +```{r} +make.grid=function(x, n = 75L){ + grange = apply(x, 2L, range) + x1 = seq(from = grange[1L, 1L], + to = grange[2L, 1L], length = n) + x2 = seq(from = grange[1L, 2L], + to = grange[2L, 2L], length = n) + expand.grid(X1 = x1, X2 = x2) +} +xgrid = make.grid(x) +ygrid = predict(svmfit, xgrid) +plot(xgrid, col = c("red", "blue")[as.integer(ygrid)], + pch = 20L, cex = .2) +points(x, col = y + 3L, pch = 19L) +points(x[svmfit$index, ], pch = 5L, cex = 2) +``` + +The `svm` function is not too friendly, in that we have to do some work to get back the linear coefficients, as described in the text. Probably the reason is that this only makes sense for linear kernels, and the function is more general. Here we will use a formula to extract the coefficients; for those interested in where this comes from, have a look in chapter 12 of ESL (“Elements of Statistical Learning”). + +We extract the linear coefficients, and then using simple algebra, we include the decision boundary and the two margins. + +```{r} +beta = drop(t(svmfit$coefs) %*% x[svmfit$index, ]) +beta0 = svmfit$rho +plot(xgrid, col = c("red","blue")[as.integer(ygrid)], + pch = 20L, cex = .2) +points(x, col = y + 3L, pch = 19L) +points(x[svmfit$index, ], pch = 5L, cex = 2) +abline(beta0/beta[2L], -beta[1L]/beta[2L]) +abline((beta0 - 1)/beta[2L], -beta[1L]/beta[2L], lty=2L) +abline((beta0 + 1)/beta[2L], -beta[1L]/beta[2L], lty=2L) +``` + +Just like for the other models in this book, the tuning parameter `C` has to be selected. Different values will give different solutions. Rerun the code above, but using `C=1`, and see what we mean. One can use cross-validation to do this. + +Nonlinear SVM +--------------- + +Instead, we will run the SVM on some data where a non-linear boundary is called for. 
We will use the mixture data from ESL + +```{r} +URL = paste0("http://www.stanford.edu/~hastie/", + "ElemStatLearn/datasets/ESL.mixture.rda") +load(url(URL)) +with(ESL.mixture, plot(x, col = y + 1L)) +dat = data.frame(y=factor(y), x) +fit = svm(y ~ ., data = dat, scale = FALSE, + kernel = "radial", cost = 5) +``` + +Now we are going to create a grid, as before, and make predictions on the grid. These data have the grid points for each variable included on the data frame. + +```{r} +xgrid = with(ESL.mixture, expand.grid(X1 = px1, X2 = px2)) +ygrid = predict(fit, xgrid) +plot(xgrid, col = as.integer(ygrid), + pch = 20L, cex = .2) +points(x, col = y + 1L, pch = 19L) +``` + +We can go further, and have the predict function produce the actual function estimates at each of our grid points. We can include the actual decision boundary on the plot by making use of the contour function. On the dataframe is also `prob`, which is the true probability of class 1 for these data, at the gridpoints. If we plot its 0.5 contour, that will give us the Bayes Decision Boundary, which is the best one could ever do. + +```{r} +func = predict(fit, xgrid, decision.values = TRUE) +func = attr(func, "decision") +xgrid = with(ESL.mixture, expand.grid(X1 = px1, X2 = px2)) +ygrid = predict(fit, xgrid) +plot(xgrid, col = as.integer(ygrid), + pch = 20L, cex = .2) +points(x, col = y + 1L, pch = 19L) +with(ESL.mixture, + contour(px1, px2, matrix(func, 69L, 99L), + level = 0, add = TRUE)) +with(ESL.mixture, + contour(px1, px2, matrix(prob, 69L, 99L), + level = .5, add = TRUE, + col = "blue", lwd = 2L)) +``` + +We see in this case that the radial kernel has done an excellent job. diff --git a/ch9/lab.html b/ch9/lab.html new file mode 100644 index 0000000..3389634 --- /dev/null +++ b/ch9/lab.html @@ -0,0 +1,249 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + +
+

SVM

+

To demonstrate the SVM, it is easiest to work in low dimensions, so we can see the data.

+
+

Linear SVM classifier

+

Let’s generate some data in two dimensions, and make them a little separated.

+
set.seed(10111)
+x=matrix(rnorm(40L), 20L, 2L)
+y=rep(c(-1L, 1L), each = 10L)
+x[y == 1L, ] = x[y == 1L, ] + 1
+plot(x, col = y + 3L, pch = 19L)
+

+

Now we load the package e1071, which contains the svm function we will use. We then compute the fit. Notice that we have to specify a cost parameter, which is a tuning parameter.

+
library(e1071)
+dat = data.frame(x, y = as.factor(y))
+svmfit = svm(y ~ ., data = dat, 
+             kernel = "linear", cost = 10, scale = FALSE)
+print(svmfit)
+
## 
+## Call:
+## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, 
+##     scale = FALSE)
+## 
+## 
+## Parameters:
+##    SVM-Type:  C-classification 
+##  SVM-Kernel:  linear 
+##        cost:  10 
+##       gamma:  0.5 
+## 
+## Number of Support Vectors:  6
+
plot(svmfit, dat)
+

+

As mentioned in the chapter, the plot function is somewhat crude: it plots X2 on the horizontal axis (unlike what R would do automatically for a matrix). Let's see how we might make our own plot.

+

The first thing we will do is make a grid of values for X1 and X2. We will write a function to do that, in case we want to reuse it. It uses the handy function expand.grid, and produces the coordinates of n*n points on a lattice covering the domain of x. Having made the lattice, we make a prediction at each point on the lattice. We then plot the lattice, color-coded according to the classification. Now we can see the decision boundary.

+

The support points (points on the margin, or on the wrong side of the margin) are indexed in the $index component of the fit.

+
make.grid=function(x, n = 75L){
+  grange = apply(x, 2L, range)
+  x1 = seq(from = grange[1L, 1L], 
+           to = grange[2L, 1L], length = n)
+  x2 = seq(from = grange[1L, 2L],
+           to = grange[2L, 2L], length = n)
+  expand.grid(X1 = x1, X2 = x2)
+}
+xgrid = make.grid(x)
+ygrid = predict(svmfit, xgrid)
+plot(xgrid, col = c("red", "blue")[as.integer(ygrid)],
+     pch = 20L, cex = .2)
+points(x, col = y + 3L, pch = 19L)
+points(x[svmfit$index, ], pch = 5L, cex = 2)
+

+

The svm function is not too friendly, in that we have to do some work to get back the linear coefficients, as described in the text. Probably the reason is that this only makes sense for linear kernels, and the function is more general. Here we will use a formula to extract the coefficients; for those interested in where this comes from, have a look in chapter 12 of ESL (“Elements of Statistical Learning”).

+

We extract the linear coefficients and then, using simple algebra, add the decision boundary and the two margins to the plot.
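To spell out that algebra: with a linear kernel, the fitted decision function is \(f(x) = \beta_1 x_1 + \beta_2 x_2 - \beta_0\), where \(\beta\) is computed below from coefs and the support vectors, and \(\beta_0\) is rho. Setting \(f(x) = 0\) for the boundary and \(f(x) = \pm 1\) for the margins, and solving for \(x_2\), gives

\[
x_2 = \frac{\beta_0}{\beta_2} - \frac{\beta_1}{\beta_2}\,x_1
\qquad\text{and}\qquad
x_2 = \frac{\beta_0 \pm 1}{\beta_2} - \frac{\beta_1}{\beta_2}\,x_1 ,
\]

which supply the intercepts and the common slope used in the abline calls below.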

+
beta = drop(t(svmfit$coefs) %*% x[svmfit$index, ])
+beta0 = svmfit$rho
+plot(xgrid, col = c("red","blue")[as.integer(ygrid)],
+     pch = 20L, cex = .2)
+points(x, col = y + 3L, pch = 19L)
+points(x[svmfit$index, ], pch = 5L, cex = 2)
+abline(beta0/beta[2L], -beta[1L]/beta[2L])
+abline((beta0 - 1)/beta[2L], -beta[1L]/beta[2L], lty=2L)
+abline((beta0 + 1)/beta[2L], -beta[1L]/beta[2L], lty=2L)
+

+

Just like for the other models in this book, the tuning parameter C (the cost argument to svm) has to be selected, and different values will give different solutions. Rerun the code above with cost = 1 and see what we mean. One can use cross-validation to choose it, as sketched below.
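As a minimal sketch of that cross-validation, the e1071 package provides tune, which by default runs 10-fold cross-validation over a grid of parameters; the cost values below are illustrative, not a recommendation:

```{r}
# Sketch: 10-fold cross-validation over cost with e1071::tune.
# Reuses dat from the linear example above; the cost grid is illustrative.
set.seed(1)
tune.out = tune(svm, y ~ ., data = dat, kernel = "linear",
                ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
tune.out$best.parameters
```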

+
+
+

Nonlinear SVM

+

Instead, we will run the SVM on some data where a non-linear boundary is called for. We will use the mixture data from ESL.

+
URL = paste0("http://www.stanford.edu/~hastie/", 
+             "ElemStatLearn/datasets/ESL.mixture.rda")
+load(url(URL))
+with(ESL.mixture, plot(x, col = y + 1L))
+

+
dat = data.frame(y=factor(y), x)
+fit = svm(y ~ ., data = dat, scale = FALSE,
+          kernel = "radial", cost = 5)
+

Now we are going to create a grid, as before, and make predictions on the grid. The ESL.mixture object conveniently includes the grid points for each variable (px1 and px2).

+
xgrid = with(ESL.mixture, expand.grid(X1 = px1, X2 = px2))
+ygrid = predict(fit, xgrid)
+plot(xgrid, col = as.integer(ygrid), 
+     pch = 20L, cex = .2)
+points(x, col = y + 1L, pch = 19L)
+

+

We can go further and have the predict function produce the actual decision-function estimates at each of our grid points. We can then draw the decision boundary on the plot using the contour function. The ESL.mixture object also contains prob, the true probability of class 1 at the grid points. If we plot its 0.5 contour, we get the Bayes decision boundary, which is the best one could ever do.

+
func = predict(fit, xgrid, decision.values = TRUE)
+func = attr(func, "decision")
+xgrid = with(ESL.mixture, expand.grid(X1 = px1, X2 = px2))
+ygrid = predict(fit, xgrid)
+plot(xgrid, col = as.integer(ygrid), 
+     pch = 20L, cex = .2)
+points(x, col = y + 1L, pch = 19L)
+with(ESL.mixture, 
+     contour(px1, px2, matrix(func, 69L, 99L),
+             level = 0, add = TRUE))
+with(ESL.mixture,
+     contour(px1, px2, matrix(prob, 69L, 99L),
+             level = .5, add = TRUE,
+             col = "blue", lwd = 2L))
+

+

We see in this case that the radial kernel has done an excellent job.

+
+
+ + + + +
+ + + + + + + +