MSDS IST 736: Text Mining

3 minute read

Text Mining

MSDS - Q6: IST736

Description:

Introduces concepts and methods for knowledge discovery from large amounts of text data and the application of text mining techniques for business intelligence, digital humanities, and social behavior analysis.

Additional Course Description:

The main goal of this course is to increase student awareness of the power of large amounts of text data and computational methods to find patterns in large text corpora. This course is designed as a general introductory- level course for all students who are interested in text mining. Programming skill is preferred but not required in this class. This course will introduce the concepts and methods of text mining technologies rooted from machine learning, natural language processing, and statistics. This course will also showcase the applications of text mining technologies in (1) information organization and access, (2) business intelligence,(3) social behavior analysis,and (4) digital humanities.

Learning Objectives:

After taking this course, the students will be able to:

  • describe basic concepts and methods in text mining, for example document representation, information extraction, text classification and clustering, and topic modeling;
  • use benchmark corpora, commercial and open-source text analysis and visualization tools to explore interesting patterns;
  • understand conceptually the mechanism of advanced text mining algorithms for information extraction, text classification and clustering, opinion mining, and their applications in real-world problems; and
  • choose appropriate technologies for specific text analysis tasks and evaluate the benefit and challenges of the chosen technical solution.

Deliverables

Class Outline

Assessments

  • Homework 1 Assignment Completed
  • Homework 2 Assignment Completed
  • Homework 3 Assignment Completed
  • Homework 4 Assignment Completed
  • Homework 5 Assignment Completed
  • Homework 6 Assignment Completed
  • Homework 7 Assignment Completed
  • Homework 8 Assignment Completed
  • Final Project Assignment Completed

Week 1 | Introduction to Text Mining

  • 1.1 Readings
  • 1.2 Student Self-Introduction
  • 1.3 Text Representation/Vectorization
  • 1.4 Exploratory Text Mining
  • 1.5 Predictive Text Mining
  • 1.6 Difference Between Text Mining and NLP
  • 1.7 Class Policies
  • 1.8 Software Download

Week 2 | Document Vectorization

  • 2.1 Readings
  • 2.2 Tokenization
  • 2.3 Vectorization
  • 2.4 Reducing Vocabulary Size
  • 2.5 Vectorization in sklearn (Optional, Advanced Content, See Also Unit 6 Vectorization)

Week 3 | Corpus Analysis

  • 3.1 Readings
  • 3.2 Corpus Analysis Overview
  • 3.3 Regular Expression for Data Cleaning and Information Extraction
  • 3.4 Corpus Construction
  • 3.5 Use Specific Lexicons for Corpus Analysis
  • 3.6 Comparative and Trend Analysis
  • 3.7 Power of Big Data

Week 4 | Naive Bayes for Text Categorization

  • 4.1 Readings
  • 4.2 Review of Vectorization
  • 4.3 Decision Tree for Text Categorization
  • 4.4 Naive Bayes for Text Categorization (Theory)
  • 4.5 Multinomial Naive Bayes in Weka
  • 4.6 Feature Ranking in MultinomialNB
  • 4.7 MultinomialNB vs. MultinomialNBText in Weka

Week 5 | Text Classifier Evaluation and Human Annotation

  • 5.1 Readings
  • 5.2 Model Overfitting
  • 5.3 Evaluation Methods
  • 5.4 Evaluation Measures
  • 5.5 Training Label Acquisition
  • 5.6 Crowdsourcing Annotations

Week 6 | scikit-learn

  • 6.1 Readings
  • 6.2 Data Preparation
  • 6.3 Vectorization
  • 6.4 Training and Explaining MultinomialNB Model
  • 6.5 Use Model for Prediction and Explain Prediction Result
  • 6.6 Kaggle Submission
  • 6.7 Compare BernoulliNB and MultinomialNB With Cross Validation

Week 7 | SVMs for Text Categorization

  • 7.1 Readings
  • 7.2 Document Similarity Measure
  • 7.3 SVM for Text Categorization
  • 7.4 SVM in Weka
  • 7.5 LinearSVC in Sklearn
  • 7.6 Sklearn Feature Engineering (Optional)

Week 8 | Document Clustering and Topic Modeling

  • 8.1 Readings
  • 8.2 K-Means for Document Clustering
  • 8.3 LDA for Topic Modeling
  • 8.4 LDA Topic Modeling with Mallet and sklearn
  • 8.5 Explaining the Topics

Week 9 | Literature Review on Text Mining Applications/Project Idea Presentation

  • 9.1 Readings
  • 9.2 Text Mining Applications
  • 9.3 Final Project Presentations

Week 10 | New Topics in Text Mining/Project Clinic and Presentation

  • 10.1 Readings
  • 10.2 Computational Social Science
  • 10.3 Bias in AI Algorithms
  • 10.4 Final Project Presentations