NLPIA Ch4 Notes
`conda activate nlpiaenv`
RESOURCES
The man, the myth, the legend – Andrew Ng
BOOK NOTES
import pandas as pd
pd.options.display.width = 120  # widen pandas console output so wide DataFrames don't wrap
Usually, a Naive Bayes classifier won’t work well when your vocabulary is much larger than the number of labeled examples in your dataset. That’s where the semantic analysis techniques of this chapter can help.
The math you use to uncover the meaning of words in LSA is called singular value decomposition. SVD, from your linear algebra class, is what LSA uses to create vectors like those in the word-topic matrices just discussed.
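A minimal sketch of what that looks like in code, assuming scikit-learn and NumPy and a made-up four-document corpus: SVD factors the term-document matrix so the left singular vectors act as a word-topic matrix.

```python
# Minimal sketch: SVD of a small TF-IDF matrix (hypothetical toy corpus).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "cats and dogs are pets",
    "dogs chase cats",
    "NLP turns text into vectors",
    "vectors capture the meaning of text",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus).toarray()   # shape: (documents, terms)

# SVD factors the (term x document) matrix into U * S * Vt.
# Rows of U are term vectors in "topic" space; columns of Vt are document vectors.
U, s, Vt = np.linalg.svd(tfidf.T, full_matrices=False)

terms = vectorizer.get_feature_names_out()
word_topic = U[:, :2]                                # keep the top 2 "topics"
for term, vec in zip(terms, word_topic):
    print(f"{term:10s} {vec.round(2)}")
```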
They were all finding that the semantic similarity between two natural language expressions (or individual words) is proportional to the similarity between the contexts in which words or expressions are used.
4.4 Principal Component Analysis
- Another name for SVD (when used for dimension reduction)
Whichever algorithm or implementation you use for semantic analysis (LSA, PCA, SVD, truncated SVD, or LDiA), you should normalize your BOW or TF-IDF vectors first.
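A sketch of that normalize-then-decompose order as a scikit-learn pipeline (the corpus and component count here are made up for illustration):

```python
# Sketch: L2-normalize TF-IDF vectors before truncated SVD (LSA), using scikit-learn.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

corpus = [
    "the cat sat on the mat",
    "dogs and cats are friendly pets",
    "stock prices fell sharply today",
    "the market rallied after the report",
]

lsa = make_pipeline(
    TfidfVectorizer(),          # TF-IDF vectors
    Normalizer(norm="l2"),      # explicit unit-length normalization (TfidfVectorizer l2-normalizes by default)
    TruncatedSVD(n_components=2, random_state=42),  # 2 topic dimensions
)

topic_vectors = lsa.fit_transform(corpus)   # shape: (4 documents, 2 topics)
print(topic_vectors.round(2))
```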
TO REMEMBER!
- Term-document matrix is a matrix where ROWS are TERMS and COLUMNS are DOCUMENTS (see the sketch after this list)
- The SVD algo behind LSA “notices” when terms frequently occur together and lumps them into shared topic dimensions for us (giving us dimensions for free)
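A quick sketch of that row/column orientation, assuming scikit-learn and pandas and a made-up three-document corpus (CountVectorizer returns a document-term matrix, so it gets transposed):

```python
# Sketch: building a term-document matrix (rows = terms, columns = documents).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog barks", "the cat meows", "dogs and cats"]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(docs)             # shape: (documents, terms)
term_doc = pd.DataFrame(
    doc_term.T.toarray(),                             # transpose: (terms, documents)
    index=vectorizer.get_feature_names_out(),         # rows are terms
    columns=[f"doc{i}" for i in range(len(docs))],    # columns are documents
)
print(term_doc)
```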
TO GOOGLE
- Adding a “ghost” count
A full-text index in a database like PostgreSQL is usually based on trigrams of characters, to deal with spelling errors and text that doesn’t parse into words.
In this chapter, you’ve learned two ways (LSA and LDiA) to compute topic vectors that capture the semantics (meaning) of words and documents in a vector. One of the reasons latent semantic analysis was first called latent semantic indexing is that it promised to power semantic search with an index of numerical values, like BOW and TF-IDF tables. Semantic search was the next big thing in information retrieval.
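As a toy illustration of that kind of semantic search (hypothetical corpus and query, scikit-learn assumed), documents and a query can be projected into LSA topic space and ranked by cosine similarity:

```python
# Sketch: toy semantic search with LSA topic vectors and cosine similarity.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

docs = [
    "the cat chased the mouse",
    "kittens and cats make good pets",
    "the stock market fell today",
    "investors sold shares after the earnings report",
]

lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
doc_topics = lsa.fit_transform(docs)        # one topic vector per document

query = "cats as pets"
query_topics = lsa.transform([query])       # project the query into topic space

scores = cosine_similarity(query_topics, doc_topics)[0]
best = scores.argmax()
print(f"best match: {docs[best]!r} (score {scores[best]:.2f})")
```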