NLPIA CH3


Math With Words (TF-IDF vectors)



  • Analyze meaning by counting words and term frequencies (a counting sketch follows this list)
  • Predict word occurrence probabilities with Zipf’s Law
  • Represent words as vectors
  • Find relevant documents using inverse document frequencies
  • Estimate similarity of pairs of documents using cosine similarity and Okapi BM25 (a BM25 sketch closes these notes)
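
A minimal sketch of the counting step, assuming a naive split-and-strip tokenizer and an invented example sentence (illustrative assumptions, not the book’s exact code):

```python
from collections import Counter

# Bag-of-words sketch: the sentence and the naive tokenizer below are
# illustrative assumptions, not a production pipeline.
sentence = "The faster Harry got to the store, the faster Harry would get home."
tokens = [tok.strip(".,").lower() for tok in sentence.split()]
bag = Counter(tokens)
print(bag.most_common(3))  # [('the', 3), ('faster', 2), ('harry', 2)]
```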

3.1 Bag of Words

3.2 Vectorizing

3.2.1 Vector Spaces

3.3 Zipf’s Law

3.4 Topic Modeling

3.4.1 Return of Zipf

3.4.2 Relevance ranking

3.4.3 Tools

3.4.4 Alternatives

3.4.5 Okapi BM25

3.4.6 What’s next

  • Web-scale search engines have a TF-IDF term-document matrix hidden under the hood
  • Term frequencies must be weighted by their inverse document frequency so that the most important, most meaningful words bubble to the top (sketched below)
  • Zipf’s law can help predict the frequencies of ALL the things! Words, characters, people, oh my! (sketched below)
  • A row of a TF-IDF matrix can be used as a vector representation of an individual word’s meaning
  • Those rows together form a vector space model of word semantics
  • Euclidean distance between pairs of high-dimensional vectors doesn’t adequately represent their similarity for most NLP applications
  • Cosine similarity (the amount of overlap between two vectors) can be calculated efficiently by multiplying the elements of normalized vectors together and summing those products (sketched below)
  • Cosine distance is the go-to similarity score for most natural language vector representations
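
A minimal TF-IDF sketch, assuming a two-document toy corpus, whitespace tokenization, and the plain log10(N/df) IDF variant (libraries such as scikit-learn use smoothed variants, so their numbers will differ):

```python
import math
from collections import Counter

# TF-IDF sketch over a toy two-document corpus (invented for illustration).
docs = [
    "the faster harry got to the store the faster harry would get home",
    "harry is hairy and faster than most",
]
bags = [Counter(doc.split()) for doc in docs]
vocab = sorted({tok for bag in bags for tok in bag})
N = len(docs)

def tfidf_vector(bag):
    vec = []
    for term in vocab:
        tf = bag[term] / sum(bag.values())      # normalized term frequency
        df = sum(1 for b in bags if term in b)  # document frequency
        idf = math.log10(N / df)                # inverse document frequency
        vec.append(tf * idf)
    return vec

vectors = [tfidf_vector(bag) for bag in bags]
print(dict(zip(vocab, vectors[0])))
```

Each row of `vectors` is one document’s position in the term vector space; a term that appears in every document gets an IDF of 0 and drops out.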
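Zipf’s law says the frequency of an item is roughly inversely proportional to its frequency rank, so the rank-2 word shows up about half as often as the rank-1 word. A tiny sketch with an invented top count:

```python
# Zipf's law sketch: predicted_count(rank) ~ top_count / rank.
# The 100,000 figure is invented for illustration.
top_count = 100_000
for rank in range(1, 6):
    print(rank, round(top_count / rank))  # 100000, 50000, 33333, 25000, 20000
```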
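A sketch of the cosine math from the bullets above; the vectors are made up, and the function is the generic formula rather than any particular library’s API:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 0.0]
b = [2.0, 3.0, 1.0]
print(cosine_similarity(a, b))      # similarity in [-1, 1]
print(1 - cosine_similarity(a, b))  # cosine distance
```

For unit-length (normalized) vectors both norms are 1, so the dot product alone gives the similarity, which is the efficiency win the bullet refers to.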
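Finally, a sketch of Okapi BM25 (section 3.4.5), assuming the common defaults k1 = 1.5 and b = 0.75 and the smoothed IDF variant log((N - df + 0.5) / (df + 0.5) + 1); real search engines tune these parameters:

```python
import math
from collections import Counter

# Okapi BM25 sketch over the same toy corpus; the corpus and parameter
# values are illustrative assumptions.
docs = [doc.split() for doc in (
    "the faster harry got to the store the faster harry would get home",
    "harry is hairy and faster than most",
)]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
bags = [Counter(d) for d in docs]

def bm25(query, doc_idx, k1=1.5, b=0.75):
    bag, dl = bags[doc_idx], len(docs[doc_idx])
    score = 0.0
    for term in query.split():
        df = sum(1 for g in bags if term in g)
        if df == 0:
            continue  # unseen query term contributes nothing
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = bag[term]
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

for i in range(N):
    print(i, bm25("hairy harry", i))
```

BM25 saturates term frequency via k1 and normalizes for document length via b, which is why it often ranks search results better in practice than raw TF-IDF cosine scores.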
