ISL
Chapter 10: Unsupervised Learning
PCA
10.1 Review Questions:
QUESTION:
You are analyzing a dataset where each observation is an age, height, length, and width of a particular turtle. You want to know if the data can be well described by fewer than four dimensions (maybe for plotting), so you decide to do Principal Component Analysis. Which of the following is most likely to be the loadings of the first Principal Component?
Explanation
We know that options 1 and 4 cannot be right because the sum of their squared loadings exceeds 1. The second option is most likely correct because we expect all four variables to be positively correlated with each other.
Note that it is fairly common for the loadings of the first principal component to all have the same sign. In such a case, the principal component is often referred to as a size component.
THREE BULLETS:
- The sum of the squared loadings of a principal component equals 1, so it can never exceed 1
- We can capture additional structure in the data by adding further (higher-order) principal components
- The first two principal components and their loadings can be plotted together in a biplot
HIGHER ORDER PRINCIPAL COMPONENTS
- How is PCA different from least squares regression? Least squares regression also finds a hyperplane that gets close to the data, but there "closest" means "closest in y" (vertical distance to the response), whereas in PCA there is no y, so "closest" means the shortest perpendicular distance from each point to the hyperplane
- If the variables are in different units, scaling each to have a standard deviation of 1 is recommended
- TOTAL VARIANCE = sum of the variance of all the variables
- You want to look for the elbow in the scree plot
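A minimal sketch of this workflow in Python (assuming scikit-learn; the toy "turtle-like" data and variable names are made up for illustration, not from ISL):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Fabricated data: four positively correlated measurements driven by an
# underlying "size" variable (purely for illustration).
rng = np.random.default_rng(0)
size = rng.normal(0, 1, 200)
X = np.column_stack([size + rng.normal(0, 0.3, 200) for _ in range(4)])

# Standardize each variable to mean 0, standard deviation 1
# (recommended when the variables are measured in different units).
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
scores = pca.fit_transform(X_scaled)   # principal component scores
print(pca.components_)                 # loadings, one row per component

# Scree plot: proportion of variance explained by each component.
# Look for the "elbow" when deciding how many components to keep.
pve = pca.explained_variance_ratio_
plt.plot(range(1, len(pve) + 1), pve, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.show()
```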
QUIZ QUESTION
Suppose we have a data set where each data point represents a single student's scores on a math test, a physics test, a reading comprehension test, and a vocabulary test.
We find the first two principal components, which capture 90% of the variability in the data, and interpret their loadings. We conclude that the first principal component represents overall academic ability, and the second represents a contrast between quantitative ability and verbal ability.
What loadings would be consistent with that interpretation? Choose all that apply.
- (0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, -0.5, -0.5)
- (0.5, 0.5, 0.5, 0.5) and (-0.5, -0.5, 0.5, 0.5)
Explanation
For the first two choices, the two loading vectors are not orthogonal. For the fifth and sixth choices, the first set of loadings only has to do with two specific tests. For the third and fourth pairs of loadings, the first component is proportional to average score, and the second component measures the difference between the first pair of scores and the second pair of scores.
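A quick numeric check of the accepted loadings (a sketch; the test ordering math, physics, reading, vocabulary and the sample scores are assumed for illustration):

```python
import numpy as np

# Loadings from the accepted answer (test order assumed to be:
# math, physics, reading comprehension, vocabulary).
phi1 = np.array([0.5, 0.5, 0.5, 0.5])    # "overall academic ability"
phi2 = np.array([0.5, 0.5, -0.5, -0.5])  # "quantitative vs. verbal"

print(phi1 @ phi2)                        # 0.0 -> the loading vectors are orthogonal
print((phi1**2).sum(), (phi2**2).sum())   # 1.0 1.0 -> sums of squared loadings

# Scores for one hypothetical student (centering ignored for simplicity):
x = np.array([90, 85, 70, 65], dtype=float)
print(phi1 @ x)   # 155.0 = 2 * (average score)                     -> overall level
print(phi2 @ x)   # 20.0  = ((math + physics) - (reading + vocab)) / 2 -> contrast
```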
k-means clustering
K-means clustering defines a "good" clustering as one in which the within-cluster variation is as small as possible, i.e. the points assigned to the same cluster are close to one another.
- We get different results if we choose a different starting point, because the algorithm only finds a local optimum; in practice it is run from several random starts and the best solution is kept
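Written out, the objective k-means tries to solve is (using squared Euclidean distance as the within-cluster variation, as in ISL):

```latex
\min_{C_1,\dots,C_K} \sum_{k=1}^{K} W(C_k),
\qquad
W(C_k) = \frac{1}{|C_k|} \sum_{i,\, i' \in C_k} \sum_{j=1}^{p} \left(x_{ij} - x_{i'j}\right)^2
```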
(FROM HOML)
KMEANS:
Suppose we are given centroids?
We label each point by assigning it to whichever centroid it is closest to
Suppose we are given labeled data?
We compute the centroids by taking the mean of the instances for each cluster
What happens if we are given data and no centroids and no labels?
- We randomly assign centroids
- Label the instances
- Update the centroids
- Label the instances
- Update the centroids etc. until centroids don’t move
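A minimal NumPy sketch of that alternation (Lloyd's algorithm); the function and names below are mine, not HOML's, and edge cases like empty clusters are ignored:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Tiny k-means sketch: alternate the label step and the update step."""
    rng = np.random.default_rng(seed)
    # Random start: pick k observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Label step: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Because the answer depends on the random start, in practice you run it from
# several different initializations and keep the clustering with the smallest
# within-cluster variation (scikit-learn's KMeans does this via n_init).
```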
FEATURE SCALING
PROBLEM: We have all these points but no idea how to categorize them!
SOLUTION: Clustering!! We know how many groups/categories there should be, right?! (This is an easy way to find "k" in k-means; for example, if we have a group and we know it consists of men and women, then k=2)
PROBLEM: No, sadly, we don't know how many groups or categories there are.
SOLUTION: Hierarchical clustering!!
PROBLEM: But Joe, we don’t know what that is!
SOLUTION: Watch the lecture, of course!!
PROBLEM: The lecture says to “start at the bottom” and “work our way up” finding the “distances between the points” – WTF does that mean!?
SOLUTION: Well, we frequently find the distances between points by using squared distance.
PROBLEM: But why do we take SQUARED distance?! Why not just distance!?
SOLUTION: Squaring is mostly a convenience: it avoids the square root, penalizes big differences more heavily, and ranks pairs of points exactly the same way ordinary Euclidean distance does. What matters more is that the features feeding into the distance are on comparable scales, so we usually NORMALIZE the data first. This is also called Feature Scaling.
From Stack Overflow
*For a start normalization in this context is also known as features scaling, which pretty much sums it up. You scale your features, your data to get rid of variances and range of values which would disturb your algorithm and your results in the end.
https://en.wikipedia.org/wiki/Feature_scaling
In data processing, normalization is quite useful (depending on the application). E.g. in distance based machine learning algorithms you should normalize your features in order to get a proportional contribution to the outcome of your algorithm, independent of the range of value the features comprise.
To do so, you can use different statistical measurements, like the Sum of squares:
SUM_i (Xi - Xbar)²
Other than that you could use the variance or the standard deviation of your data.
https://www.westgard.com/lesson35.htm#4
Those statistical terms can then be used to normalize your data, to improve e.g. the clustering quality of your algorithm. Which term to use and which method highly depends on the algorithms and data you’re using and what you’re aiming at. Here is a paper which compares some of the approaches you could choose from for clustering:
http://maxwellsci.com/print/rjaset/v6-3299-3303.pdf
I hope this can help you a little.*
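A small illustration of feature scaling before a distance-based method, assuming scikit-learn (the income/age data is fabricated just to show the scale problem):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy data whose two features live on wildly different scales:
# income in dollars vs. age in years. Without scaling, income alone
# would dominate every Euclidean distance.
rng = np.random.default_rng(1)
income = rng.normal(50_000, 15_000, size=300)
age = rng.normal(40, 12, size=300)
X = np.column_stack([income, age])

# Feature scaling / normalization: each column gets mean 0, std 1.
X_scaled = StandardScaler().fit_transform(X)

# n_init re-runs k-means from several random starts and keeps the best fit.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```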
Hierarchical Clustering
Four ways of measuring distance between clusters:
- Complete linkage: the largest pairwise distance between a point in one cluster and a point in the other
- Average linkage: the average of all pairwise distances between points in the two clusters
- Single linkage: the smallest pairwise distance between a point in one cluster and a point in the other
- Centroid linkage: the distance between the centroids (mean vectors) of the two clusters
Complete and average linkage are most commonly used. Single linkage tends to produce extended, trailing "chained" clusters in which observations are fused one at a time, which makes it error prone.
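A minimal sketch comparing linkages with SciPy's hierarchy module (the two-blob data is made up just to have something to cluster):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (20, 2)),    # one blob of points
               rng.normal(5, 1, (20, 2))])   # a second, separated blob

# Build the dendrogram bottom-up under different linkage rules.
for method in ["complete", "average", "single"]:
    Z = linkage(X, method=method)   # (n-1) x 4 merge table
    plt.figure()
    plt.title(f"{method} linkage")
    dendrogram(Z)
plt.show()
```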
All about measuring distance
- Squared distance
- Euclidean distance
- Correlation between the two profiles (good for time series data)
- Sometimes it matters if things are positively correlated or anti-correlated and other times we just use the absolute value of a correlation
CORRELATION-BASED DISTANCE: considers two observations to be similar if their features are highly correlated. (This is an unusual use of correlation: it is usually computed between VARIABLES, but here it is computed between the observation profiles of each pair of observations; see the sketch after this list.)
NOTE: the only thing you need for clustering (hierarchical or k-means) is a dissimilarity/distance measure of your choosing
- Scaling of the variables matters!! (If the variables are not in the same units, we must standardize!! The same goes for PCA!!)
- HOW TO: if the variables are not in the same units, standardize each to have mean 0 and standard deviation 1
- If they are in the same units, try it both ways: once on the raw data and once after standardizing
- You can choose the features you use! If you change the features, you will change the kind of cluster you get
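A sketch of correlation-based distance between observation profiles, fed into hierarchical clustering via SciPy (using 1 - correlation as the dissimilarity, which is one common convention, not the only one):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 10))   # 30 observations, 10 features

# Correlation between observation *profiles* (rows), not between variables.
R = np.corrcoef(X)              # 30 x 30 matrix of row-wise correlations
D = 1.0 - R                     # high correlation -> small "distance"
# Use D = 1.0 - np.abs(R) instead if anti-correlated profiles should also count as similar.
np.fill_diagonal(D, 0.0)

# linkage() expects a condensed distance vector, so convert with squareform().
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")   # cut into 3 clusters
```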
Review Questions
True or False: If we cut the dendrogram at a lower point, we will tend to get more clusters, and cannot get fewer clusters (assuming complete, single or average linkage).
Answer: True
Explanation
After cutting the dendrogram at threshold t, we keep all the joins with linkage distance less than t and discard the joins with larger linkage distance. Thus, decreasing the threshold gives us fewer joins, and thus more clusters. If, in decreasing the threshold, we don’t cross a junction of the dendrogram, the number of clusters will remain the same.
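A tiny SciPy illustration of that monotonicity: lowering the cut height t never decreases the number of clusters (arbitrary data, complete linkage assumed):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 2))
Z = linkage(X, method="complete")

# Lowering the cut height can keep or increase the number of clusters,
# but it can never decrease it.
for t in [4.0, 2.0, 1.0, 0.5]:
    labels = fcluster(Z, t=t, criterion="distance")
    print(t, labels.max())   # labels run from 1..k, so the max is the cluster count
```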
UNSUPERVISED LEARNING, A WRAP UP!
- Unsupervised learning is where we don’t have labeled data
- It is harder to deal with because there is no “standard” we are reaching for (we cannot test accuracy and there isn’t an “outcome variable”)
- There are other methods of handling unsupervised learning, but in this section we covered PCA and clustering.
- Other methods (outlined in ESL ch. 14) include self-organizing maps, independent component analysis, and spectral clustering