WK2
Corresponding Jupyter Notebook
Readings
2.2 Tokenization
- To get computers to understand language, we need to turn text into numbers they can count
What to count
- Tokens – defined by a set of rules for grouping characters
- First, split into sentences (with punctuation)
- Second, split into words (with spaces)
- Which tokenizer is best? It depends on context & task
- N-grams: multiword tokens (see the sketch after this list)
- Uni-grams
- Bi-grams
- Tri-grams
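A minimal sketch of this pipeline in plain Python (whitespace/punctuation rules only; the example text and helper names are illustrative, and real tokenizers handle far more edge cases):

```python
import re

text = "Tokenization turns text into countable units. It is the first step."

# First, split into sentences on end-of-sentence punctuation
sentences = re.split(r"(?<=[.!?])\s+", text)

# Second, split each sentence into word tokens
tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]

def ngrams(words, n):
    """Return all n-grams (as tuples) for one sentence."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

for sent in tokenized:
    print("uni-grams:", ngrams(sent, 1))
    print("bi-grams: ", ngrams(sent, 2))
    print("tri-grams:", ngrams(sent, 3))
```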
Challenges with Tokenization
- When whitespace is absent (for example, in URLs or in languages written without spaces), tokenization is hard
- Lowercase vs. Uppercase – which to choose?
- The same word can mean different things (bank of a river vs. bank as a financial institution)
- Inflection (dishwasher, dishwashers)
Word Sense Disambiguation (WSD)
- Use the surrounding context to decide which sense of a word is meant
- So far it does not help search engines significantly
- Not widely used in text mining
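One common context-based baseline is the Lesk algorithm; a hedged sketch using NLTK's implementation (assumes nltk is installed with the wordnet and punkt resources downloaded; this is just one illustrative approach, and its sense picks are often rough):

```python
from nltk.tokenize import word_tokenize  # needs the 'punkt' resource
from nltk.wsd import lesk                # needs the 'wordnet' corpus

# Same surface word "bank", two different contexts
money = word_tokenize("I deposited the check at the bank before noon")
river = word_tokenize("We had a picnic on the bank of the river")

# lesk() picks the WordNet sense whose definition overlaps the context most
print(lesk(money, "bank"), lesk(money, "bank").definition())
print(lesk(river, "bank"), lesk(river, "bank").definition())
```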
2.3 Vectorization
How to count
- Bag of Words
- Boolean (sparse matrix)
- Term frequency
- Normalized term frequency
- Tf*idf
- TF*IDF Vectors
- Tf: term (word) frequency
- Df: document frequency (how many docs contain this term, e.g. 2 out of 3 documents – 2/3)
- Idf: inverse document frequency – the reciprocal of df (3/2)
- Tf-idf = tf * log(idf) (see the sketch below)
- TL;DR – TF*IDF penalizes common words and boosts the weight of rare words
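A small from-scratch sketch of these counts (the toy corpus and names are made up for illustration; it follows the tf * log(idf) formula above with idf = N/df):

```python
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# df: in how many documents does each term occur?
df = Counter()
for words in tokenized:
    df.update(set(words))

def tfidf(words):
    """Per-document tf-idf weights: tf * log(N / df)."""
    tf = Counter(words)
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

for words in tokenized:
    print(tfidf(words))  # "the" (common) gets a low weight, rare words get boosted
```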
Normalization
- L1 normalization – divide by document length (total number of terms)
- L2 normalization – divide by the vector's Euclidean length (see the sketch below)
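A sketch of both normalizations applied to one raw count vector (pure Python; the counts are illustrative):

```python
import math

counts = {"the": 3, "cat": 1, "sat": 1}  # raw term frequencies for one document

# L1: divide by document length (sum of counts) -> weights sum to 1
l1_len = sum(counts.values())
l1 = {w: c / l1_len for w, c in counts.items()}

# L2: divide by the Euclidean length -> vector has unit length
l2_len = math.sqrt(sum(c * c for c in counts.values()))
l2 = {w: c / l2_len for w, c in counts.items()}

print(l1)  # {'the': 0.6, 'cat': 0.2, 'sat': 0.2}
print(l2)
```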
Sparse Matrix
- Compressed text representation: keep a lookup dictionary of “index:word” pairs; each word that occurs in a document is then recorded as an “index:value” pair
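A sketch of that compressed representation, assuming a fixed vocabulary that maps each word to an integer index:

```python
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5}

doc = "the cat sat on the mat".split()

# Record only the words that actually occur, as {index: count} pairs
sparse = {}
for word in doc:
    idx = vocab[word]
    sparse[idx] = sparse.get(idx, 0) + 1

print(sparse)  # {0: 2, 1: 1, 2: 1, 3: 1, 4: 1}; a dense vector would store every vocab index
```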
2.4 Reducing Vocabulary Size
Stemming
Lemmatization
Turn all to lowercase
- Note: Do not do this if the uppercase words are important
Remove Stopwords
… with caution! (full reduction pipeline sketched below)
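A hedged sketch of the whole reduction pipeline with NLTK (assumes nltk is installed and the stopwords and wordnet corpora have been downloaded; the token list is only illustrative):

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = ["The", "Dishwashers", "are", "running", "in", "the", "Kitchens"]

# Lowercase -- skip if case carries meaning (e.g. "US" vs. "us")
tokens = [t.lower() for t in tokens]

# Remove stopwords -- with caution: they matter for pronoun/function-word studies
stops = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stops]

# Stemming chops suffixes by rule; lemmatization maps to dictionary forms
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # crude stems, e.g. 'dishwash'
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary forms, e.g. 'dishwasher'
```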
- The Secret Life of Pronouns: What Our Words Say About Us
- (Identifying a personal blog/homepage by pronoun count)
- Function words for authorship attribution
- Gender Differences in Speech and Writing