IST736 WK 8 – Async
8.1 ReadingsPage
8.2 K-Means for Document Clustering
– 2 vid 1 answer
Similarity/Distance measures
- Eucledian distance
- Cosine similarity
Centroid
- Choose best centroid using SSE – Sum of Squared Errors
- We want a smaller SSE
How to choose the K?
- Domain knowedge
- The elbow method
What if the iteration never stops?
- Set max number of iterations
- Set min of SSE change
Variations of K-means
- Considered “hard clustering”
- Softer clustering is called EM clustering (Expectation Maximization Algo)
Cluster Validity
- Clusters are in the eyes of the beholder (we don’t have things like accuracy precision and recall)
- Clustering is best used for exploratory data analysis
- Clustering is usually best as a starting point for further analysis
ANSWER
I just did this in sklearn but I’m pretty sure the results are different because of different centroids and different Ks. Depending on the differences in centroids and Ks you can get very different results.
8.3 LDA for Topic Modeling
– 1 vid 1 answer
Why Topic Modeling?
- Gives a birds eye view using LDA
- Latent Semantic Analysis (LSA)
- Probabilistic Topic Models
What is a topic?
- Distribution over a vocabulary (eg evolutionary biology topic has words about evolutionary biology with high probability)
- Topic Trend Analysis (if topic has a time stamp)
ANSWER
In the aftermath of the 2016 election, it came out that a lot of people harbored secret feelings of dissatisfaction with our government and wanted a change, leading to the “unpredicted outcome.” It would be interesting to see what the “social climate” was before the 2016 election – we could take twitter data, facebook data, any data that is self-referential and see if we could get both topics and trend analysis. This might help us better understand what the polls weren’t telling us – how do people REALLY feel vs how do they claim to feel?
8.4 LDA Topic Modeling with Mallet and sklearn
– 2 vid 2 answers
Use MALLET’s sample data to build an LDA model. List the top 10 words for each topic.
THREE STEPS
- Data prep
- Topic modeling
- Examine output file and see what model has learned
–
- Like vectorization
8.5 Explaining the Topics
– written
Word Intrusion and Topic Intrusion are interesting both conceptually and mechanically (thinking how I would implement this in a coding environment). To be completely honest, I’d be more interested in trying to come up with a way to make my computer do this than to attempt to quantify the “gut reaction” of the humans.