IST736WK3
IST 736
WK 3 CORPUS ANALYSIS
OVERVIEW:
Corpus Construction
- Collect text data from the web
- Clean text data (regex)
Corpus analysis
- Use frequency analysis to explore the corpus
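As a rough illustration of the frequency-analysis step, here is a minimal Python sketch. The sample text and the `word_frequencies` helper are made up for this note, and the tokenizer is a deliberately simple regex rather than anything the course prescribes.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    # Simple tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical two-sentence corpus, for illustration only.
sample = "The food was great. The service was slow, but the food made up for it."
freqs = word_frequencies(sample)

# The most frequent tokens give a first look at what the corpus is about.
top_three = freqs.most_common(3)
print(top_three)
```

On a real corpus the top of the list is usually dominated by words like "the", which is one reason to look at function words and content words separately.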
3.3
ANSWER:
Using the Regex option of Find and Replace in Visual Studio Code:
Removed all lines that started with Thread or Reply using ^Thread.+$
and ^Reply.+$
Removed all blank lines using ^\n$
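The same cleanup could be scripted instead of done by hand in the editor. This is only a sketch: the `clean_forum_text` name and the sample text are invented, and the blank-line step is slightly broader than ^\n$ because it also drops lines that contain only spaces.

```python
import re

def clean_forum_text(raw):
    """Drop lines starting with 'Thread' or 'Reply', then drop blank lines."""
    # re.MULTILINE makes ^ and $ match at every line boundary, as in the editor.
    no_headers = re.sub(r"^(?:Thread|Reply).+$", "", raw, flags=re.MULTILINE)
    # Keep only lines that contain something other than whitespace.
    kept = [line for line in no_headers.splitlines() if line.strip()]
    return "\n".join(kept)

# Hypothetical forum export, for illustration only.
sample = (
    "Thread 12: pizza places\n"
    "Best slice in town.\n"
    "\n"
    "Reply from user42\n"
    "Agreed, great crust.\n"
)
cleaned = clean_forum_text(sample)
print(cleaned)
```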
3.4 Corpus Construction
ANSWER: Oh man! This is the best question, and even if answers drooled out of my mouth for the next hour, I’d still have more. I want to look at movie reviews, song lyrics, and restaurant reviews (AND REMOVE ALL RESTAURANT REVIEWS ABOUT THE SERVICE!! I just want to know about the food, people!! I could use regex to flag and pull out reviews containing words like “waiter” or “service” and then see whether what’s left is more food-centric?!). I also want to look at debate transcripts and ALL the things!!
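The restaurant-review idea could start out something like the sketch below. The word list, the `food_centric` helper, and the three reviews are all hypothetical; a real filter would need tuning against actual reviews.

```python
import re

# Hypothetical service-related word list; \b keeps "service" from matching
# inside longer words such as "serviceable".
SERVICE_PATTERN = re.compile(r"\b(?:waiter|waitress|server|service)\b", re.IGNORECASE)

def food_centric(reviews):
    """Return only the reviews that never mention a service-related word."""
    return [review for review in reviews if not SERVICE_PATTERN.search(review)]

# Made-up reviews, for illustration only.
reviews = [
    "The pad thai was perfectly balanced.",
    "Our waiter forgot the drinks twice.",
    "Great tacos, but the service was slow.",
]
kept = food_centric(reviews)
print(kept)
```

Note that this throws away the whole review, so the third one loses its comment about the tacos too. Filtering sentence by sentence would keep more of the food talk.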
3.5 Use Specific Lexicons for Corpus Analysis
I looked at some of President Trump’s recent tweets from http://www.trumptwitterarchive.com, and the “authenticity” rating surprised me. The average for social is 55.66; Trump’s tweets clocked in at 8% (I’m quite curious how they are calculating this metric!).
Things to pay attention to:
- FUNCTION WORDS: style and status
- CONTENT: topic oriented
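To make the function/content distinction concrete, here is a small sketch that counts the two groups separately. The `FUNCTION_WORDS` set is a tiny hand-picked sample, not a real lexicon, and this is not how the authenticity score above is calculated.

```python
import re
from collections import Counter

# Tiny hand-picked sample of function words (an assumption for illustration);
# a real lexicon would contain several hundred entries.
FUNCTION_WORDS = {
    "i", "you", "we", "the", "a", "an", "of", "to", "in",
    "and", "but", "is", "was", "will", "it", "not",
}

def split_counts(text):
    """Count function words and content words in separate Counters."""
    tokens = re.findall(r"[a-z']+", text.lower())
    function = Counter(t for t in tokens if t in FUNCTION_WORDS)
    content = Counter(t for t in tokens if t not in FUNCTION_WORDS)
    return function, content

# Made-up sentence, for illustration only.
function, content = split_counts("I think the economy is strong, and we will not stop.")
print(function)  # style and status signals
print(content)   # topic signals
```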
3.6 Comparative and Trend Analysis
ANSWER: Article
3.7 Power of Big Data
ANSWER: I first searched “Trump” and “Obama” in Google Books Ngrams. Since ‘Trump’ is a word AND a name, it had many, many occurrences, whereas Obama only started getting mentioned in the last two decades. I then wanted to see if the newer “Fifty-Shades-of-Grey” trend was reflected in Google Books (and would Google Books even add these texts-formerly-known-as-smut?!) and YES: searching “BDSM” is very interesting.