IST736WK3


IST 736

WK 3 CORPUS ANALYSIS

OVERVIEW:

Corpus Construction

  • Collect text data from the web
  • Clean text data (regex)

Corpus analysis

  • Use frequency analysis to explore the corpus
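As a rough idea of what frequency analysis can look like in practice, here is a minimal Python sketch. The sample sentence, the simple tokenizer regex, and the function name `word_frequencies` are all assumptions made up for illustration, not part of the course material.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens in a string (hypothetical helper)."""
    # Very simple tokenizer: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Made-up sample text standing in for a real corpus.
sample = "The food was great. The service was slow, but the food made up for it."
freqs = word_frequencies(sample)

# Show the three most frequent tokens with their counts.
print(freqs.most_common(3))
```

On a real corpus the top of this list is usually dominated by words like "the" and "was", which is why stop-word handling tends to be the next step.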

3.3

ANSWER: Using Find and Replace in Visual Studio Code with the Regex option, I removed all lines that started with Thread or Reply using ^Thread.+$ and ^Reply.+$, then removed the leftover blank lines using ^\n$
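The same cleanup could be scripted instead of done by hand in the editor. This is only a sketch under assumptions: the sample forum text and the function name `clean_forum_text` are invented here, and the blank-line step also drops whitespace-only lines, which is slightly broader than ^\n$.

```python
import re


def clean_forum_text(raw):
    """Remove 'Thread...'/'Reply...' lines and blank lines (hypothetical helper)."""
    # Same idea as ^Thread.+$ and ^Reply.+$: MULTILINE makes ^ and $ match per line.
    text = re.sub(r"^(?:Thread|Reply).+$", "", raw, flags=re.MULTILINE)
    # Drop lines that are empty (or whitespace-only) after the removal above.
    lines = [line for line in text.splitlines() if line.strip()]
    return "\n".join(lines)


# Made-up sample standing in for the scraped forum data.
raw = (
    "Thread: Best pizza in town\n"
    "I love the crust.\n"
    "\n"
    "Reply from user42\n"
    "The sauce is better.\n"
)
cleaned = clean_forum_text(raw)
print(cleaned)
```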

3.4 Corpus Construction

ANSWER: Oh man! This is the best question and even if answers drooled out of my mouth for the next hour, I’d still have more answers – I want to look at movie reviews, song lyrics, restaurant reviews (AND REMOVE ALL RESTAURANT REVIEWS ABOUT THE SERVICE!! I just want to know about the food, people!! I could use regex to highlight and pull out words with “waiter” or “service” or something and then see if that would give more food-centric reviews?!) I also want to look at debate transcripts and ALL the things!!
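A quick sketch of that restaurant-review idea, assuming the reviews are already sitting in a Python list. The example reviews, the pattern, and the function name `split_reviews` are hypothetical, and a two-word pattern like this would certainly miss plenty of service complaints.

```python
import re

# Matches "waiter", "waiters", "service", "services", and so on, ignoring case.
SERVICE_PATTERN = re.compile(r"\b(?:waiter|service)\w*", re.IGNORECASE)


def split_reviews(reviews):
    """Separate reviews that mention service from those that do not."""
    food_only, service_related = [], []
    for review in reviews:
        if SERVICE_PATTERN.search(review):
            service_related.append(review)
        else:
            food_only.append(review)
    return food_only, service_related


# Made-up reviews for illustration only.
reviews = [
    "The ramen broth was rich and the noodles were perfect.",
    "Our waiter forgot the order twice.",
    "Great tacos, terrible SERVICE.",
]
food_only, service_related = split_reviews(reviews)
print(food_only)
```

Whether the remaining reviews are really more food-centric is exactly the kind of thing a frequency analysis on each pile could check.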

3.5 Use Specific Lexicons for Corpus Analysis

LIWC

I looked at some of President Trump’s recent tweets at http://www.trumptwitterarchive.com and the “authenticity” rating surprised me. The average for social media is 55.66, while Trump’s tweets clocked in at 8% (I’m quite curious how they are calculating this metric!)

Things to pay attention to:

  • FUNCTION WORDS: style and status
  • CONTENT: topic oriented
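To make the function/content split concrete, here is a toy sketch. LIWC's actual dictionaries are proprietary and far larger, so the tiny `FUNCTION_WORDS` set and the helper `split_function_content` below are stand-ins invented for illustration, not how LIWC itself works.

```python
import re

# Tiny hand-picked set of function words (an assumption, not LIWC's list).
FUNCTION_WORDS = {
    "i", "you", "we", "the", "a", "an", "and", "but", "of",
    "to", "in", "is", "was", "it", "not", "with",
}


def split_function_content(text):
    """Split tokens into function words (style) and content words (topic)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    function = [t for t in tokens if t in FUNCTION_WORDS]
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return function, content


function, content = split_function_content("I was not happy with the pizza")
print(function)
print(content)
```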

3.6 Comparative and Trend Analysis

ANSWER: Article

3.7 Power of Big Data

ANSWER: I first searched “Trump” and “Obama” in Google Books Ngrams and, since ‘Trump’ is a word AND a name, there were many, many occurrences, whereas Obama only started getting mentioned in the last two decades. I then wanted to see if the newer “Fifty-Shades-of-Grey” trend was reflected in Google Books (and would Google Books even add these texts-formerly-known-as-smut?) and YES: searching “BDSM” is very interesting.
