MSDS IST 736: Text Mining
MSDS - Q6: IST736
Description:
Introduces concepts and methods for knowledge discovery from large amounts of text data and the application of text mining techniques for business intelligence, digital humanities, and social behavior analysis.
Additional Course Description:
The main goal of this course is to increase student awareness of the power of large amounts of text data and of the computational methods used to find patterns in large text corpora. This course is designed as a general introductory-level course for all students who are interested in text mining. Programming skills are preferred but not required. The course introduces the concepts and methods of text mining technologies rooted in machine learning, natural language processing, and statistics. It also showcases applications of text mining technologies in (1) information organization and access, (2) business intelligence, (3) social behavior analysis, and (4) digital humanities.
Learning Objectives:
After taking this course, the students will be able to:
- describe basic concepts and methods in text mining, such as document representation, information extraction, text classification and clustering, and topic modeling;
- use benchmark corpora and commercial and open-source text analysis and visualization tools to explore interesting patterns;
- understand conceptually the mechanisms of advanced text mining algorithms for information extraction, text classification and clustering, and opinion mining, and their applications to real-world problems; and
- choose appropriate technologies for specific text analysis tasks and evaluate the benefits and challenges of the chosen technical solution.
Deliverables
Class Outline
Assessments
- Homework 1 Assignment Completed
- Homework 2 Assignment Completed
- Homework 3 Assignment Completed
- Homework 4 Assignment Completed
- Homework 5 Assignment Completed
- Homework 6 Assignment Completed
- Homework 7 Assignment Completed
- Homework 8 Assignment Completed
- Final Project Assignment Completed
Week 1 | Introduction to Text Mining
- 1.1 Readings
- 1.2 Student Self-Introduction
- 1.3 Text Representation/Vectorization
- 1.4 Exploratory Text Mining
- 1.5 Predictive Text Mining
- 1.6 Difference Between Text Mining and NLP
- 1.7 Class Policies
- 1.8 Software Download
Week 2 | Document Vectorization
- 2.1 Readings
- 2.2 Tokenization
- 2.3 Vectorization
- 2.4 Reducing Vocabulary Size
- 2.5 Vectorization in sklearn (Optional, Advanced Content, See Also Unit 6 Vectorization)
Week 3 | Corpus Analysis
- 3.1 Readings
- 3.2 Corpus Analysis Overview
- 3.3 Regular Expression for Data Cleaning and Information Extraction
- 3.4 Corpus Construction
- 3.5 Use Specific Lexicons for Corpus Analysis
- 3.6 Comparative and Trend Analysis
- 3.7 Power of Big Data
Week 4 | Naive Bayes for Text Categorization
- 4.1 Readings
- 4.2 Review of Vectorization
- 4.3 Decision Tree for Text Categorization
- 4.4 Naive Bayes for Text Categorization (Theory)
- 4.5 Multinomial Naive Bayes in Weka
- 4.6 Feature Ranking in MultinomialNB
- 4.7 MultinomialNB vs. MultinomialNBText in Weka
Week 5 | Text Classifier Evaluation and Human Annotation
- 5.1 Readings
- 5.2 Model Overfitting
- 5.3 Evaluation Methods
- 5.4 Evaluation Measures
- 5.5 Training Label Acquisition
- 5.6 Crowdsourcing Annotations
Week 6 | scikit-learn
- 6.1 Readings
- 6.2 Data Preparation
- 6.3 Vectorization
- 6.4 Training and Explaining MultinomialNB Model
- 6.5 Use Model for Prediction and Explain Prediction Result
- 6.6 Kaggle Submission
- 6.7 Compare BernoulliNB and MultinomialNB With Cross Validation
Week 7 | SVMs for Text Categorization
- 7.1 Readings
- 7.2 Document Similarity Measure
- 7.3 SVM for Text Categorization
- 7.4 SVM in Weka
- 7.5 LinearSVC in sklearn
- 7.6 sklearn Feature Engineering (Optional)
Week 8 | Document Clustering and Topic Modeling
- 8.1 Readings
- 8.2 K-Means for Document Clustering
- 8.3 LDA for Topic Modeling
- 8.4 LDA Topic Modeling with Mallet and sklearn
- 8.5 Explaining the Topics
Week 9 | Literature Review on Text Mining Applications/Project Idea Presentation
- 9.1 Readings
- 9.2 Text Mining Applications
- 9.3 Final Project Presentations
Week 10 | New Topics in Text Mining/Project Clinic and Presentation
- 10.1 Readings
- 10.2 Computational Social Science
- 10.3 Bias in AI Algorithms
- 10.4 Final Project Presentations