WK2


Corresponding Jupyter Notebook

Readings

2.2 Tokenization

  • To get computers to understand language, we need to turn things into numbers they can count

What to count

  • Tokens – defined by a set of rules for grouping characters
  • First, split into sentences (with punctuation)
  • Second, split into words (with spaces)
  • Which tokenizer is best? It depends on context & task
  • N-Gram: Multiword Tokens (see the sketch after this list)
    • Uni-grams
    • Bi-grams
    • Tri-grams
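
A minimal sketch of the two-step split described above (sentences via punctuation, then words via spaces) and of n-gram grouping, using NLTK. The sample sentence and the punkt download are illustrative assumptions, not part of the lecture.

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.util import ngrams

nltk.download("punkt", quiet=True)    # one-time download of the tokenizer models

text = "Tokenization splits text. Which tokenizer is best? It depends on the task."

sentences = sent_tokenize(text)       # first: split into sentences (punctuation-aware)
tokens = word_tokenize(sentences[0])  # second: split a sentence into word tokens

unigrams = list(ngrams(tokens, 1))    # uni-grams: single tokens
bigrams = list(ngrams(tokens, 2))     # bi-grams: pairs of adjacent tokens
trigrams = list(ngrams(tokens, 3))    # tri-grams: triples of adjacent tokens

print(sentences)
print(bigrams)
```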

Challenges with Tokenization

  • When whitespace is absent (e.g., in URLs, or in languages written without spaces), tokenization is hard
  • Lowercase vs. Uppercase – which to choose?
  • Same word form means different things (e.g., bank the financial institution vs. bank of a river)
  • Inflection (dishwasher, dishwashers)

Word Sense Disambiguation (WSD)

  • Use the word’s context to decide the word sense
  • So far it does not help search engines significantly
  • Not widely used in text mining
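
One concrete (if simple) WSD technique is the Lesk algorithm shipped with NLTK; the notes do not prescribe a method, so this is just an illustrative sketch, and the example sentence is made up.

```python
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)  # Lesk uses WordNet's sense inventory

context = "I deposited the check at the bank on Main Street".split()
sense = lesk(context, "bank")         # pick the synset whose gloss overlaps the context most

print(sense)
print(sense.definition() if sense else "no sense found")
```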

2.3 Vectorization

How to count

  • Bag of Words
    • Boolean (sparse matrix)
    • Term frequency
    • Normalized term frequency
    • Tf*idf
  • TF*IDF Vectors
    • Tf: term (word) frequency
    • Df: document frequency (how many docs contain this term, e.g. 2 out of 3 documents – 2/3)
    • Idf: inverse document frequency (3/2)
    • Tfidf = tf*log(idf)
  • TL;DR – TF*IDF penalizes common words and boosts the weight of rare words
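
A toy sketch of the tf*log(idf) weighting above, written out by hand so the df and idf fractions are visible. The three-document corpus is an assumption, and real libraries (e.g. scikit-learn) use slightly different smoothing and normalization.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]   # bag of words: order is ignored, only counts matter
N = len(tokenized)                      # total number of documents

def tfidf(term, doc_tokens):
    tf = Counter(doc_tokens)[term]                # term frequency in this document
    df = sum(1 for d in tokenized if term in d)   # document frequency, e.g. 2 out of 3 docs
    idf = N / df                                  # inverse document frequency, e.g. 3/2
    return tf * math.log(idf)                     # a word that occurs in every doc gets weight 0

print(tfidf("the", tokenized[0]))   # common word -> low weight
print(tfidf("cat", tokenized[0]))   # rarer word -> higher weight
```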

Normalization

  • Normalization by document length (L1): divide each count by the total number of tokens in the document
  • L2 normalization: divide by the Euclidean norm of the vector
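
A small NumPy sketch contrasting the two normalizations; the count vector is made up.

```python
import numpy as np

counts = np.array([3.0, 1.0, 0.0, 2.0])   # raw term frequencies for one document

l1 = counts / np.abs(counts).sum()        # L1: divide by document length (sum of counts)
l2 = counts / np.linalg.norm(counts)      # L2: divide by the Euclidean norm

print(l1, l1.sum())                       # L1-normalized entries sum to 1
print(l2, np.linalg.norm(l2))             # L2-normalized vector has unit length
```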

Sparse Matrix

  • Compressed text representation: keep a lookup dictionary of “index:word” pairs; each word that occurs in a document is recorded as an “index:value” pair
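
A hedged sketch of this sparse “index:value” storage using scikit-learn’s CountVectorizer; the notes don’t name a library, so this choice (and the two toy documents) is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # X is a SciPy sparse matrix: zeros are not stored

print(vectorizer.vocabulary_)        # lookup dictionary of word -> column index
print(X)                             # each stored entry is a (row, index)  value pair
```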

2.4 Reducing Vocabulary Size

Stemming

NLTK Stemming Demo
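
A minimal stemming sketch with NLTK’s PorterStemmer (one of several stemmers NLTK provides); the word list is illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["dishwashers", "running", "studies"]:
    print(word, "->", stemmer.stem(word))   # crude suffix stripping; output may not be a real word
```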

Lemmatizer
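
A lemmatizer maps inflected forms back to dictionary words instead of stripping suffixes. A hedged sketch with NLTK’s WordNetLemmatizer, which needs the WordNet corpus downloaded; the example words are made up.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)              # the lemmatizer looks words up in WordNet

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dishwashers"))        # treated as a noun by default -> "dishwasher"
print(lemmatizer.lemmatize("running", pos="v"))   # with a verb POS tag -> "run"
```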

Turn all to lowercase

  • Note: Do not do this if the uppercase words are important

Remove Stopwords

… with caution!

  • The Secret Life of Pronouns: What Our Words Say About Us
  • (Identifying a personal blog/homepage by pronoun count)
  • Function words for authorship attribution
  • Gender differences in speech and writing
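
A minimal stopword-removal sketch with NLTK’s English stopword list, using a made-up sentence. Note (per the caution above) that the removed pronouns and function words can be exactly the signal needed for authorship attribution or gender studies.

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

stop = set(stopwords.words("english"))
tokens = "i think that she wrote this post about her dog".split()

filtered = [t for t in tokens if t not in stop]
print(filtered)   # pronouns like "i", "she", "her" are removed along with other function words
```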
