Adding lab .Rmd to correspond to YouTube Labs #69

Open · wants to merge 5 commits into master
74 changes: 74 additions & 0 deletions ch10/lab.Rmd
Principal Components
======================

We will use the `USArrests` data (which is built into R):

```{r}
dimnames(USArrests)
apply(USArrests, 2L, mean)
apply(USArrests, 2L, var)
```

We see that `Assault` has a much larger variance than the other variables. It would dominate the principal components, so we choose to standardize the variables when we perform PCA:

```{r}
pca.out = prcomp(USArrests, scale = TRUE)
pca.out
names(pca.out)
biplot(pca.out, scale = 0)
```
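
`prcomp` returns the standard deviations of the components in `sdev`, so as a quick addition (not part of the original lab) we can compute the proportion of variance explained by each component:

```{r}
# Proportion of variance explained by each principal component.
pve = pca.out$sdev^2 / sum(pca.out$sdev^2)
pve
plot(pve, type = "b", xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", ylim = c(0, 1))
```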

K-Means Clustering
====================

K-means works in any dimension, but is most fun to demonstrate in two, because we can plot pictures. Let's make some data with clusters. We do this by shifting the means of the points around.

```{r}
set.seed(101)
x = matrix(rnorm(100L * 2L), 100L, 2L)
xmean = matrix(rnorm(8L, sd = 4), 4L, 2L)
which = sample(1L:4L, 100L, replace = TRUE)
x = x + xmean[which, ]
plot(x, col = which, pch = 19L)
```

We know the “true” cluster IDs, but we won't tell that to the `kmeans` algorithm.

```{r}
km.out = kmeans(x, 4L, nstart = 15L)
km.out
plot(x, col = km.out$cluster,
     cex = 2, pch = 1L, lwd = 2L)           # open circles: k-means clusters
points(x, col = which, pch = 19L)           # solid dots: the true cluster IDs
points(x, col = (4:1)[which], pch = 19L)    # permute the colors to match better
```
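
Since we simulated the data, we can also cross-tabulate the `kmeans` assignments against the true cluster IDs (a small addition, not in the original lab). The cluster labels themselves are arbitrary, so we only expect a good match up to a permutation of the labels:

```{r}
# Each row/column should have one dominant count if the clustering is good.
table(km.out$cluster, which)
```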

Hierarchical Clustering
=========================

We will use these same data and use hierarchical clustering

```{r}
hc.complete = hclust(dist(x), method = "complete")
plot(hc.complete)
hc.single = hclust(dist(x), method = "single")
plot(hc.single)
hc.average = hclust(dist(x), method = "average")
plot(hc.average)
```

Let's compare this with the actual clusters in the data. We will use the function `cutree` to cut the tree to give four clusters. This will produce a vector of numbers from 1 to 4, saying which branch each observation is on. You will sometimes see pretty plots where the leaves of the dendrogram are colored. I searched a bit on the web for how to do this, and it's a little too complicated for this demonstration, although a rough sketch is given at the end of this section.

We can use `table` to see how well they match:

```{r}
hc.cut = cutree(hc.complete, 4L)
table(hc.cut, which)
table(hc.cut, km.out$cluster)
```

Or we can use our group membership as labels for the leaves of the dendrogram:

```{r}
plot(hc.complete, labels = which)
```
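
As mentioned above, coloring the leaves of the dendrogram takes a bit more work. Here is a rough base-R sketch (not from the original lab, and only one of several possible approaches) that colors each leaf label by its `cutree` cluster using `dendrapply`; it assumes the leaf labels are the row indices of `x`, which is the case when `hclust` is given unnamed data:

```{r}
# Convert to a dendrogram and color each leaf label by its cluster.
dend = as.dendrogram(hc.complete)
color.leaf = function(node) {
  if (is.leaf(node)) {
    obs = as.integer(attr(node, "label"))   # leaf label = row index of x
    attr(node, "nodePar") = list(lab.col = hc.cut[obs], pch = NA)
  }
  node
}
plot(dendrapply(dend, color.leaf))
```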
266 changes: 266 additions & 0 deletions ch10/lab.html

Large diffs are not rendered by default.

165 changes: 165 additions & 0 deletions ch6/lab.Rmd
Model Selection
================

This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages, and a very nice way of distributing an analysis. It has some very simple syntax rules.


```{r}
library(ISLR)
summary(Hitters)
```

There are some missing values here, so before we proceed we will remove them:

```{r}
Hitters = na.omit(Hitters)
with(Hitters, sum(is.na(Salary)))
```

Best Subset Regression
========================

We will now use the package `leaps` to evaluate all the best-subset models.

```{r}
library(leaps)
regfit.full = regsubsets(Salary ~ ., data = Hitters)
summary(regfit.full)
```

By default it gives best subsets up to size 8; let's increase that to 19, i.e., all the variables.

```{r}
regfit.full = regsubsets(Salary ~ ., data = Hitters, nvmax = 19L)
reg.summary = summary(regfit.full)
names(reg.summary)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp")
best.cp = which.min(reg.summary$cp)   # model size with the smallest Cp
best.cp
points(best.cp, reg.summary$cp[best.cp], pch = 20L, col = "red")
```

There is a `plot` method for the `regsubsets` object:

```{r}
plot(regfit.full, scale = "Cp")
coef(regfit.full, 10L)
```

Forward Stepwise Selection
============================

Here we use the `regsubsets` function but specify the `method = "forward"` option:

```{r}
regfit.fwd = regsubsets(Salary ~ ., data = Hitters,
nvmax = 19L, method = "forward")
summary(regfit.fwd)
plot(regfit.fwd, scale = "Cp")
```
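
As an aside (not in the original lab), we can check whether forward stepwise and best subsets agree on, say, the 10-variable model by comparing their coefficients:

```{r}
# Coefficients of the 10-variable models from the two search strategies.
coef(regfit.full, 10L)
coef(regfit.fwd, 10L)
```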

Model Selection Using a Validation Set
========================================

Let's make a training and validation set, so that we can choose a good subset model. We will do it using a slightly different approach from what was done in the book.

```{r}
dim(Hitters)
set.seed(1)
train = sample(263L, 180L, replace = FALSE)
train
regfit.fwd = regsubsets(Salary ~ ., data = Hitters[train, ],
nvmax = 19L, method = "forward")
```

Now we will make predictions on the observations not used for training. We know there are 19 models, so we set up some vectors to record the errors. We have to do a bit of work here, because there is no `predict` method for `regsubsets`.

```{r}
val.errors = numeric(19L)
x.test = model.matrix(Salary ~ ., data = Hitters[-train, ]) # notice the -index!
for (i in 1L:19L) {
coefi = coef(regfit.fwd, id = i)
pred = x.test[ , names(coefi)] %*% coefi
val.errors[i] = mean((Hitters$Salary[-train] - pred)^2)
}
plot(sqrt(val.errors), ylab = "Root MSE",
ylim = c(300, 400), pch = 19L, type = "b")
points(sqrt(regfit.fwd$rss[-1L]/180),
col = "blue", pch = 19L, type = "b")
legend("topright", legend = c("Training", "Validation"),
col = c("blue", "black"), pch = 19L)
```

As we expect, the training error goes down monotonically as the model gets bigger, but not so for the validation error.

This was a little tedious, not having a `predict` method for `regsubsets`, so we will write one!

```{r}
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2L]])   # the formula used in the original fit
  mat = model.matrix(form, newdata)      # model matrix for the new data
  coefi = coef(object, id = id)          # coefficients of the model of size id
  mat[ , names(coefi)] %*% coefi         # predictions
}
```
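
As a quick sanity check (not in the original lab), we can redo the validation errors with this new method and confirm we get the same numbers as the manual loop above:

```{r}
# Recompute the validation MSEs using the new predict method;
# this should match val.errors from the loop above.
val.errors2 = sapply(1L:19L, function(i) {
  pred = predict(regfit.fwd, Hitters[-train, ], id = i)
  mean((Hitters$Salary[-train] - pred)^2)
})
all.equal(val.errors, val.errors2)
```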

Model Selection by Cross-Validation
=====================================

We will do 10-fold cross-validation. It's really easy!

```{r}
set.seed(11)
folds = sample(rep(1L:10L, length = nrow(Hitters)))
folds
table(folds)
cv.errors = matrix(NA, 10L, 19L)
for (k in 1L:10L) {
best.fit = regsubsets(Salary ~ ., data = Hitters[folds != k, ],
nvmax = 19L, method = "forward")
for (i in 1L:19L) {
pred = predict(best.fit, Hitters[folds == k, ], id = i)
cv.errors[k, i] = mean((Hitters$Salary[folds == k] - pred)^2)
}
}
rmse.cv = sqrt(apply(cv.errors, 2L, mean))
plot(rmse.cv, pch = 19L, type = "b")
```
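
Having chosen a model size by cross-validation, a natural final step (not shown in the original lab) is to refit forward stepwise on the full data set and extract the coefficients of the model of that size:

```{r}
# Model size with the smallest cross-validated RMSE, and the
# corresponding coefficients from a fit on all the data.
best.size = which.min(rmse.cv)
best.size
regfit.best = regsubsets(Salary ~ ., data = Hitters,
                         nvmax = 19L, method = "forward")
coef(regfit.best, best.size)
```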

Ridge Regression and the Lasso
================================

We will use the package `glmnet`, which does not use the model formula language, so we will set up an `x` matrix and a `y` vector.

```{r}
library(glmnet)
x = model.matrix(Salary ~ . - 1, data = Hitters)
y = Hitters$Salary
fit.ridge = glmnet(x, y, alpha = 0)
plot(fit.ridge, xvar = "lambda", label = TRUE)
cv.ridge = cv.glmnet(x, y, alpha = 0)
plot(cv.ridge)
```
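
The `cv.glmnet` object stores the value of `lambda` that minimizes the cross-validated error; as a small addition (not in the original lab), we can pull out the ridge coefficients at that value:

```{r}
# Ridge coefficients at the lambda chosen by cross-validation.
cv.ridge$lambda.min
coef(cv.ridge, s = "lambda.min")
```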

Now we fit a lasso model; for this we use the default `alpha = 1`:

```{r}
fit.lasso = glmnet(x, y)
plot(fit.lasso, xvar = "lambda", label = TRUE)
cv.lasso = cv.glmnet(x, y)
plot(cv.lasso)
coef(cv.lasso)
```

Suppose we want to use our earlier train/validation division to select the `lambda` for the lasso. This is easy to do.

```{r}
lasso.tr = glmnet(x[train, ], y[train])
lasso.tr
pred = predict(lasso.tr, x[-train, ])
dim(pred)
rmse = sqrt(apply((y[-train] - pred)^2, 2, mean))
plot(log(lasso.tr$lambda), rmse, type = "b", xlab = "Log(lambda)")
lam.best = lasso.tr$lambda[order(rmse)[1]]
lam.best
coef(lasso.tr, s = lam.best)
```
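
For a rough comparison (not in the original lab, and remembering that `cv.lasso` was fit on all the data rather than just the training set), we can set this validation-set choice of `lambda` against the values selected by `cv.glmnet`:

```{r}
# Validation-set lambda vs. the cross-validation choices.
c(validation = lam.best, cv.min = cv.lasso$lambda.min, cv.1se = cv.lasso$lambda.1se)
```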
597 changes: 597 additions & 0 deletions ch6/lab.html

Large diffs are not rendered by default.

146 changes: 146 additions & 0 deletions ch7/lab.Rmd
Nonlinear Models
==================

Here we explore the use of nonlinear models using some tools in R.

```{r}
library(ISLR)
```

Polynomials
-------------

First we will use polynomials, and focus on a single predictor, `age`:

```{r}
fit = lm(wage ~ poly(age, 4L), data = Wage)
summary(fit)
```

The `poly()` function generates a basis of orthogonal polynomials. Let's make a plot of the fitted function, along with the standard errors of the fit.

```{r fig.width = 7, fig.height = 6}
agelims = range(Wage$age)
age.grid = seq(from = agelims[1L], to = agelims[2L])
preds = predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands = with(preds, fit + 2 * cbind(se.fit, -se.fit))
with(Wage, plot(age, wage, col = "darkgrey"))
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, col = "blue", lty = 2L)
```

There are other more direct ways of doing this in R. For example

```{r}
fita = lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage)
summary(fita)
```

Here `I()` is a *wrapper* function; we need it because `age^2` means something special in the formula language, while `I(age^2)` is protected and treated as the literal square of `age`. The coefficients are different from those we got before! However, the fits are the same:

```{r}
plot(fitted(fit), fitted(fita))
```

By using orthogonal polynomials in this simple way, it turns out that we can separately test for each coefficient. So if we look at the summary again, we can see that the linear, quadratic, and cubic terms are significant, but the quartic is not.

```{r}
summary(fit)
```
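
To make this concrete (this check is not in the lab itself, though the book discusses it), the squared t-statistics of the orthogonal polynomial terms equal the F-statistics obtained by comparing the nested polynomial fits with `anova()`:

```{r}
# Squared t-statistics for the four polynomial terms in the quartic fit...
coef(summary(fit))[-1L, "t value"]^2
# ...should equal the F-statistics from the sequential comparison of
# the intercept-only model with the fits of degree 1 through 4.
anova(lm(wage ~ 1, data = Wage),
      lm(wage ~ poly(age, 1L), data = Wage),
      lm(wage ~ poly(age, 2L), data = Wage),
      lm(wage ~ poly(age, 3L), data = Wage),
      lm(wage ~ poly(age, 4L), data = Wage))
```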

This only works for linear regression, and only when there is a single predictor. In general we would use `anova()`, as this next example demonstrates.

```{r}
fita = lm(wage ~ education, data = Wage)
fitb = lm(wage ~ education + age, data = Wage)
fitc = lm(wage ~ education + poly(age, 2L), data = Wage)
fitd = lm(wage ~ education + poly(age, 3L), data = Wage)
anova(fita, fitb, fitc, fitd)
```

### Polynomial logistic regression

Now we fit a logistic regression model to a binary response variable, constructed from `wage`. We code the big earners (`>250K`) as 1, else 0.

```{r}
fit = glm(I(wage > 250) ~ poly(age, 3L), data = Wage, family = binomial)
summary(fit)
preds = predict(fit, list(age = age.grid), se = TRUE)
se.bands = with(preds, fit + 2 * cbind(fit = 0, lower = -se.fit, upper = se.fit))
se.bands[1L:5L, ]
```

We have done the computations on the logit scale. To transform we need to apply the inverse logit mapping

$$ p=\frac{e^\eta}{1+e^\eta}.$$

(Here we have used the ability of Markdown to interpret TeX expressions.) We can do this simultaneously for all three columns of `se.bands`:

```{r}
prob.bands = exp(se.bands)/(1 + exp(se.bands))
matplot(age.grid, prob.bands, col = "blue",
lwd = c(2L, 1L, 1L), lty = c(1L, 2L, 2L),
type = "l", ylim = c(0, 0.1))
with(Wage, points(jitter(age), I(wage > 250)/10, pch = "|", cex = 0.5))
```
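
Alternatively (a small addition, not in the original lab), `predict` can return the fitted probabilities directly with `type = "response"`; these match the first column of `prob.bands`, the fitted curve we just transformed by hand:

```{r}
# Fitted probabilities computed directly on the probability scale.
prob.direct = predict(fit, list(age = age.grid), type = "response")
all.equal(as.numeric(prob.direct), as.numeric(prob.bands[, 1L]))
```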

Splines
---------

Splines are more flexible than polynomials, but the idea is rather similar. Here we will explore cubic splines.


```{r plotchunk}
library(splines)
fit = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
with(Wage, plot(age, wage, col = "darkgrey"))
lines(age.grid, predict(fit, list(age = age.grid)),
col = "darkgreen", lwd = 2)
abline(v = c(25, 40, 60), lty = 2L, col = "darkgreen")
```

Smoothing splines do not require knot selection, but they do have a smoothing parameter, which can conveniently be specified via the effective degrees of freedom, `df`.

```{r, ref.label = 'plotchunk'}
fit = with(Wage, smooth.spline(age, wage, df = 16))
lines(fit, col = "red", lwd = 2L)
```

Or we can use leave-one-out cross-validation to select the smoothing parameter for us automatically:

```{r, ref.label = 'plotchunk'}
fit = with(Wage, smooth.spline(age, wage, cv = TRUE))
lines(fit, col = "purple", lwd = 2L)
fit
```

Generalized Additive Models
-----------------------------

So far we have focused on fitting models with mostly single nonlinear terms. The `gam` package makes it easier to work with multiple nonlinear terms. In addition it knows how to plot these functions and their standard errors.

```{r}
library(gam)
gam1 = gam(wage ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage)
par(mfrow = c(1L, 3L))
plot(gam1, se = TRUE)
gam2 = gam(I(wage > 250) ~ s(age, df = 4) + s(year, df = 4) + education,
data = Wage, family = binomial)
plot(gam2)
```

Let's see if we need a nonlinear term for `year`:

```{r}
gam2a = gam(I(wage > 250) ~ s(age, df = 4) + year + education,
data = Wage, family = binomial)
anova(gam2a, gam2, test = "Chisq")
```

One nice feature of the `gam` package is that it knows how to plot the functions nicely, even for models fit by `lm` and `glm`.

```{r, fig.width = 10, fig.height = 5}
par(mfrow = c(1L, 3L))
lm1 = lm(wage ~ ns(age, df = 4) + ns(year, df = 4) + education, data = Wage)
plot.gam(lm1, se = TRUE)
```