ISLR - Unsupervised Learning¶
Chapter 10
10.2: Principal Component Analysis¶
A large set of correlated variables in a dataset can be summarized with a smaller number of representative variables that collectively explain most of the variability in the original set.
- Principal component regression: use the principal components as predictors in a regression model, in place of the original, larger set of variables.
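A minimal sketch of principal component regression, using scikit-learn's `PCA` and `LinearRegression` on synthetic, illustrative data (the feature count and component count here are arbitrary choices, not from the text):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 observations, 10 predictors (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 10))
y = x @ rng.normal(size=10) + rng.normal(scale=0.1, size=100)

# Replace the 10 predictors with their first 3 principal components,
# then regress y on those components instead of the original variables
z = PCA(n_components=3).fit_transform(x)
model = LinearRegression().fit(z, y)
```

The regression now estimates only 3 coefficients rather than 10, which is the dimension-reduction benefit the note describes.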
10.2.1 What are Principal Components?¶
PCA finds a low-dimensional representation of a data set (with many features) that retains as much of the data set's variation as possible.
A dimension (feature) is interesting to the extent that the observations vary along it.
- Each dimension found by PCA is a linear combination of the p features.
The first principal component of a set of features \(X_{1}, X_{2}, \dots , X_{p}\) is the normalized linear combination of the features
\(Z_{1} = \phi_{11}X_{1} + \phi_{21}X_{2} + \dots + \phi_{p1}X_{p}\)
that has the largest variance. Normalized means that \(\sum_{j=1}^{p} \phi_{j1}^{2} = 1\).
The loading vector \(\phi_{1}\) with elements \(\phi_{11}, \phi_{21}, … , \phi_{p1}\) defines a direction in feature space along which the data vary the most.
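The loading vector can be computed from the centered data matrix via the singular value decomposition; \(\phi_{1}\) is the top right-singular vector. A minimal numpy sketch on made-up data:

```python
import numpy as np

# Toy data: 50 observations, 3 correlated features (illustrative values)
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
x[:, 1] += 0.8 * x[:, 0]          # induce correlation between features
x -= x.mean(axis=0)               # center each feature first

# The first loading vector phi_1 is the top right-singular vector of X
_, _, vt = np.linalg.svd(x, full_matrices=False)
phi1 = vt[0]

# phi_1 is normalized: its squared loadings sum to 1
print(np.isclose(np.sum(phi1 ** 2), 1.0))   # True

# Scores of the first principal component: Z_1 = X @ phi_1
z1 = x @ phi1
```

Because \(\phi_{1}\) maximizes the variance of the projection, the variance of `z1` is at least that of any single (centered) feature.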
10.3 Clustering Methods¶
Clustering refers to techniques for finding subgroups in a data set. "... we seek to partition data into distinct groups, where the observations in each group are similar to each other, and the observations in different groups are quite different from each other."
10.3.1 K-Means Clustering¶
- Specify the desired number of clusters, \(K\); the K-means clustering algorithm then assigns each observation to exactly one of the \(K\) clusters.
- Each observation belongs to at least one of the \(K\) clusters.
- The clusters are non-overlapping, i.e., no observation belongs to more than one cluster.
A good cluster is one with the smallest possible within-cluster variation.
The measure \(W(C_{k})\) for cluster \(C_{k}\) quantifies the amount by which the observations within the cluster differ from one another.
- Using squared Euclidean distance to define the within-cluster variation:
\(W(C_{k}) = \frac{1}{|C_{k}|} \sum_{i, i' \in C_{k}} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2}\),
where \(|C_{k}|\) denotes the number of observations in the \(k\)th cluster.
- The optimization problem that defines K-means clustering:
\(\underset{C_{1}, \dots, C_{K}}{\text{minimize}} \left\{ \sum_{k=1}^{K} W(C_{k}) \right\}\)
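The within-cluster variation can be computed directly; a small sketch assuming the standard squared-Euclidean definition \(W(C_{k}) = \frac{1}{|C_{k}|}\sum_{i,i' \in C_{k}}\sum_{j}(x_{ij}-x_{i'j})^{2}\) (the function name and data below are illustrative):

```python
import numpy as np

def within_cluster_variation(x, labels, k):
    """Sum over clusters of (1/|C_k|) times the sum of squared
    pairwise distances between observations in the cluster.
    Note: the double sum over i, i' counts each pair twice."""
    total = 0.0
    for j in range(k):
        c = x[labels == j]
        if len(c) == 0:
            continue
        # All ordered-pair differences within cluster j: shape (m, m, p)
        diffs = c[:, None, :] - c[None, :, :]
        total += (diffs ** 2).sum() / len(c)
    return total
```

For example, a cluster containing the points \((0,0)\) and \((2,0)\) contributes \(\frac{1}{2}(4 + 4) = 4\), since the ordered double sum counts the pair in both directions.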
Algorithm for K-means Clustering: a pretty good solution¶
- Randomly assign a number, from 1 to \(K\), to each of the observations. These serve as initial cluster assignments for the observations.
- Iterate until the cluster assignments stop changing:
  a. For each of the \(K\) clusters, compute the cluster centroid. The \(k\)th cluster centroid is the vector of the \(p\) feature means for the observations in the \(k\)th cluster.
  b. Assign each observation to the cluster whose centroid is closest, where closest is defined by Euclidean distance.
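The two steps above can be sketched in plain numpy (the function name, random-restart handling for an emptied cluster, and toy data are illustrative choices, not from the text):

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Sketch of the K-means algorithm: random initial assignments,
    then alternate centroid computation and reassignment."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of k clusters
    labels = rng.integers(k, size=len(x))
    for _ in range(n_iter):
        # Step 2a: each centroid is the vector of feature means;
        # if a cluster empties out, reseed it at a random observation
        centroids = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j)
            else x[rng.integers(len(x))]
            for j in range(k)
        ])
        # Step 2b: reassign each observation to its closest centroid
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

# Usage on two well-separated synthetic blobs
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels, centroids = kmeans(x, k=2)
```

Each iteration can only decrease the total within-cluster variation, which is why the procedure converges, though possibly to a local rather than global optimum.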
10.3.2 Hierarchical Clustering¶
Algorithm 10.2 Hierarchical Clustering¶
- Begin with n observations and a measure (such as Euclidean distance) of all the \(\binom{n}{2} = n(n-1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.
- For \(i = n, n-1, \dots, 2\):
  a. Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
  b. Compute the new pairwise inter-cluster dissimilarities among the \(i - 1\) remaining clusters.
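Algorithm 10.2 is implemented by scipy's hierarchical-clustering routines; a minimal sketch on made-up data (the choice of complete linkage and the cut at 3 clusters are arbitrary for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 2))

# Step 1: all n(n-1)/2 pairwise Euclidean dissimilarities
d = pdist(x)                        # condensed vector of length 45

# Step 2: agglomerative fusions; each of the n-1 rows of z records
# the two clusters fused and the dendrogram height of that fusion
z = linkage(d, method="complete")

# Cut the dendrogram to obtain, say, 3 clusters
labels = fcluster(z, t=3, criterion="maxclust")
```

With \(n = 10\) observations there are \(\binom{10}{2} = 45\) pairwise dissimilarities and \(n - 1 = 9\) fusions, matching the counts in the algorithm.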
| Linkage | Description |
|---|---|
| Complete | Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities. |
| Single | Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one-at-a-time. |
| Average | Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities. |
| Centroid | Dissimilarity between the centroid for cluster A (a mean vector of length \(p\)) and the centroid for cluster B. Centroid linkage can result in undesirable inversions. |
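The four linkage rules in the table correspond to the `method` argument of scipy's `linkage`; a short sketch on synthetic data comparing the fusion heights they produce:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
d = pdist(rng.normal(size=(20, 2)))   # pairwise Euclidean dissimilarities

# Same data, the four linkage rules from the table
fusions = {m: linkage(d, method=m)
           for m in ("complete", "single", "average", "centroid")}

# Fusion heights (third column of each row) depend on the linkage;
# since single linkage records the smallest inter-cluster dissimilarity
# and complete the largest, single's final fusion is never higher
print(fusions["single"][-1, 2] <= fusions["complete"][-1, 2])   # True
```

Inspecting `fusions["single"]` on chained data is a quick way to see the extended, trailing clusters that the table warns about for single linkage.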