IST736 WK 8 – Async

2 minute read

8.1 ReadingsPage

8.2 K-Means for Document Clustering

– 2 vid 1 answer

Similarity/Distance measures

Eucledian distance
Cosine similarity

Centroid

Choose best centroid using SSE – Sum of Squared Errors
We want a smaller SSE

How to choose the K?

Domain knowedge
The elbow method

What if the iteration never stops?

Set max number of iterations
Set min of SSE change

Variations of K-means

Considered “hard clustering”
Softer clustering is called EM clustering (Expectation Maximization Algo)

Cluster Validity

Clusters are in the eyes of the beholder (we don’t have things like accuracy precision and recall)
Clustering is best used for exploratory data analysis
Clustering is usually best as a starting point for further analysis

ANSWER

I just did this in sklearn but I’m pretty sure the results are different because of different centroids and different Ks. Depending on the differences in centroids and Ks you can get very different results.

8.3 LDA for Topic Modeling

– 1 vid 1 answer

Why Topic Modeling?

Gives a birds eye view using LDA
Latent Semantic Analysis (LSA)
Probabilistic Topic Models

What is a topic?

Distribution over a vocabulary (eg evolutionary biology topic has words about evolutionary biology with high probability)
Topic Trend Analysis (if topic has a time stamp)

ANSWER

In the aftermath of the 2016 election, it came out that a lot of people harbored secret feelings of dissatisfaction with our government and wanted a change, leading to the “unpredicted outcome.” It would be interesting to see what the “social climate” was before the 2016 election – we could take twitter data, facebook data, any data that is self-referential and see if we could get both topics and trend analysis. This might help us better understand what the polls weren’t telling us – how do people REALLY feel vs how do they claim to feel?

8.4 LDA Topic Modeling with Mallet and sklearn

– 2 vid 2 answers

Use MALLET’s sample data to build an LDA model. List the top 10 words for each topic.

THREE STEPS

Data prep
Topic modeling
Examine output file and see what model has learned

–

Like vectorization

8.5 Explaining the Topics

– written

Word Intrusion and Topic Intrusion are interesting both conceptually and mechanically (thinking how I would implement this in a coding environment). To be completely honest, I’d be more interested in trying to come up with a way to make my computer do this than to attempt to quantify the “gut reaction” of the humans.

Share on

Twitter Facebook LinkedIn

Daniel Caraway

IST736 WK 8 – Async

8.1 ReadingsPage

8.2 K-Means for Document Clustering

Similarity/Distance measures

Centroid

How to choose the K?

What if the iteration never stops?

Variations of K-means

Cluster Validity

ANSWER

8.3 LDA for Topic Modeling

Why Topic Modeling?

What is a topic?

ANSWER

8.4 LDA Topic Modeling with Mallet and sklearn

8.5 Explaining the Topics

Share on

You may also enjoy

daily log 03-25-21

How to use Data Science Superpowers for Useless Things: Getting a Job at Amazon, Take 2

How to use Data Science Superpowers for Useless Things: Getting a Job at Amazon

How to use Data Science Superpowers for Useless Things: Adding Text to Images (aka Cats Narrate the Big Lebowski)