IST736 WK 8 – Async

2 minute read

8.1 ReadingsPage

8.2 K-Means for Document Clustering

– 2 vid 1 answer

Similarity/Distance measures

  • Eucledian distance
  • Cosine similarity

Centroid

  • Choose best centroid using SSE – Sum of Squared Errors
  • We want a smaller SSE

How to choose the K?

  • Domain knowedge
  • The elbow method

What if the iteration never stops?

  • Set max number of iterations
  • Set min of SSE change

Variations of K-means

  • Considered “hard clustering”
  • Softer clustering is called EM clustering (Expectation Maximization Algo)

Cluster Validity

  • Clusters are in the eyes of the beholder (we don’t have things like accuracy precision and recall)
  • Clustering is best used for exploratory data analysis
  • Clustering is usually best as a starting point for further analysis

ANSWER

I just did this in sklearn but I’m pretty sure the results are different because of different centroids and different Ks. Depending on the differences in centroids and Ks you can get very different results.

8.3 LDA for Topic Modeling

– 1 vid 1 answer

Why Topic Modeling?

  • Gives a birds eye view using LDA
  • Latent Semantic Analysis (LSA)
  • Probabilistic Topic Models

What is a topic?

  • Distribution over a vocabulary (eg evolutionary biology topic has words about evolutionary biology with high probability)
  • Topic Trend Analysis (if topic has a time stamp)

ANSWER

In the aftermath of the 2016 election, it came out that a lot of people harbored secret feelings of dissatisfaction with our government and wanted a change, leading to the “unpredicted outcome.” It would be interesting to see what the “social climate” was before the 2016 election – we could take twitter data, facebook data, any data that is self-referential and see if we could get both topics and trend analysis. This might help us better understand what the polls weren’t telling us – how do people REALLY feel vs how do they claim to feel?

8.4 LDA Topic Modeling with Mallet and sklearn

– 2 vid 2 answers

Use MALLET’s sample data to build an LDA model. List the top 10 words for each topic.

THREE STEPS

  1. Data prep
  2. Topic modeling
  3. Examine output file and see what model has learned

  1. Like vectorization

8.5 Explaining the Topics

– written

Word Intrusion and Topic Intrusion are interesting both conceptually and mechanically (thinking how I would implement this in a coding environment). To be completely honest, I’d be more interested in trying to come up with a way to make my computer do this than to attempt to quantify the “gut reaction” of the humans.

Updated: