NLPIA Ch4 Notes

` conda activate nlpiaenv `

RESOURCES

  • Sklearn.lda.LDA
  • Truncated SVD
  • Deep Learning – PCA
  • The man the myth the legend – Andrew Ng
  • Jurafsky Ch.16
  • Low Dimensional Metrics
  • Spotify Annoy

BOOK NOTES

`import pandas as pd; pd.options.display.width = 120` (widens pandas’ console output so wide term-document matrices print on one line)

Usually, a Naive Bayes classifier won’t work well when your vocabulary is much larger than the number of labeled examples in your dataset. That’s where the semantic analysis techniques of this chapter can help.

The math you use to uncover the meaning of words in LSA is called singular value decomposition. SVD, from your linear algebra class, is what LSA uses to create vectors like those in the word-topic matrices just discussed.
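A minimal numpy sketch of that factorization (toy counts of my own, not the book's code):

```python
import numpy as np

# Toy term-document matrix: one ROW per term, one COLUMN per document
tdm = np.array([
    [2, 1, 0, 0],   # "harry"
    [1, 2, 0, 0],   # "faster"
    [0, 0, 2, 1],   # "hairy"
    [0, 0, 1, 2],   # "jill"
], dtype=float)

U, s, Vt = np.linalg.svd(tdm, full_matrices=False)
# U  : term-topic matrix (how strongly each term belongs to each topic)
# s  : singular values (how much each topic "matters")
# Vt : topic-document matrix (each column is a document's topic vector)

# Keeping only the top 2 topics is what "truncated SVD" (and LSA) does
topic_docs = np.diag(s[:2]) @ Vt[:2]
print(topic_docs.shape)   # (2, 4): a 2-D topic vector for each of the 4 documents
```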

Researchers kept finding that the semantic similarity between two natural language expressions (or individual words) is proportional to the similarity between the contexts in which those words or expressions are used.

4.4 Principal Component Analysis

  • Another name for SVD (when used for dimension reduction)
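A quick numerical check of that claim on made-up data (my own sketch, not from the book): sklearn's PCA recovers the same directions as a plain SVD of the mean-centered matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # pretend: 100 documents, 10 TF-IDF features

# PCA centers the data and then runs an SVD under the hood,
# so its components are the right singular vectors of the centered matrix
pca = PCA(n_components=3).fit(X)
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

# Same directions, up to a sign flip on each component
print(np.allclose(np.abs(pca.components_), np.abs(Vt[:3])))   # True
```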

Whichever algorithm or implementation you use for semantic analysis (LSA, PCA, SVD, truncated SVD, or LDiA), you should normalize your BOW or TF-IDF vectors first.
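For instance, a sketch of that ordering with scikit-learn; the corpus and parameter choices here are placeholders of mine, not the book's:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

corpus = [
    "Harry got to the store faster than Jill.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]

lsa = make_pipeline(
    TfidfVectorizer(),             # BOW counts -> TF-IDF (L2-normalized by default)
    Normalizer(norm="l2"),         # explicit normalization, in case defaults change
    TruncatedSVD(n_components=2),  # TF-IDF vectors -> 2-D topic vectors
)
topic_vectors = lsa.fit_transform(corpus)
print(topic_vectors.shape)         # (3, 2): one topic vector per document
```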

TO REMEMBER!

  • Term-document matrix is a matrix where ROWS are TERMS and COLUMNS are DOCUMENTS
  • The SVD algorithm behind LSA “notices” when terms frequently appear together and lumps them into a shared topic dimension for us, giving us dimensions for free (see the sketch below)
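To see that “lumping” concretely, here's a toy follow-up of mine to the earlier sketch: terms that always show up in the same documents end up sharing a topic dimension.

```python
import numpy as np

terms = ["cat", "kitten", "stock", "market"]
# "cat"/"kitten" co-occur in the first two documents, "stock"/"market" in the rest
tdm = np.array([
    [3, 2, 0, 0, 0],   # cat
    [2, 3, 0, 0, 0],   # kitten
    [0, 0, 2, 3, 1],   # stock
    [0, 0, 3, 2, 2],   # market
], dtype=float)

U, _, _ = np.linalg.svd(tdm, full_matrices=False)

# Each column of U is a topic; terms that co-occur get similar weights in it,
# so 4 terms collapse onto 2 useful dimensions "for free"
for term, weights in zip(terms, U[:, :2].round(2)):
    print(f"{term:>7}: {weights}")
```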

TO GOOGLE

  • Adding a “ghost” count

A full text index in a database like PostgreSQL is usually based on trigrams of characters, to deal with spelling errors and text that doesn’t parse into words.
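A rough Python sketch of why character trigrams tolerate misspellings (an illustration, not how PostgreSQL actually implements its pg_trgm index):

```python
def trigrams(text):
    """Character 3-grams, lowercased and padded so word boundaries count."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard overlap of trigram sets, roughly how pg_trgm scores a match."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# A one-letter typo still shares most of its trigrams with the correct spelling,
# so it scores far higher than an unrelated word does
print(trigram_similarity("semantic", "semnatic"))   # ~0.38
print(trigram_similarity("semantic", "syntax"))     # ~0.07
```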

In this chapter, you’ve learned two ways—LSA and LDiA—to compute topic vectors that capture the semantics (meaning) of words and documents in a vector. One of the reasons that latent semantic analysis was first called latent semantic indexing was because it promised to power semantic search with an index of numerical values, like BOW and TF-IDF tables. Semantic search was the next big thing in information retrieval.
