MSDS IST 707: Data Analytics

4 minute read

Data Analytics

MSDS - Q3: IST707

Description:

Introduction to data mining techniques, familiarity with particular real-world applications, challenges involved in these applications, and future directions of the field. Hands-on experience with open-source software packages.

Additional Course Description:

This course will introduce popular data mining methods for extracting knowledge from data. The principles and theories behind these methods will be discussed and related to the practical issues that arise when applying data mining to real problems. Students will also gain hands-on experience using state-of-the-art software to develop data mining solutions to scientific and business problems. The focus of the course is on understanding data and on formulating data mining tasks so that problems can be solved with that data.

The course covers the key tasks of data mining: data preparation, concept description, association rule mining, classification, clustering, and evaluation and analysis. Through exploring data mining concepts and techniques and working through practical exercises, students will develop skills that can be applied to business, scientific, or other organizational problems.

Class meetings will combine lecture and lab: lectures and class discussions cover the material, and lab time is used to investigate small examples of the week's topic. There will be weekly readings from the textbook and from other materials posted online.

Learning Goals

After taking this course, students will be able to:

  • Document, analyze, and translate data mining needs into technical designs and solutions.
  • Apply data mining concepts, algorithms, and evaluation methods to real-world problems.
  • Employ data storytelling: dive into the data, find useful patterns, and articulate what patterns were found, how they were found, and why they are valuable and trustworthy.

Deliverables:

Class Outline:

Assessments

  • Homework Assignment 1 (week 1)
  • Homework Assignment 2 (week 2)
  • Homework Assignment 3 (week 3)
  • Homework Assignment 4 (week 4)
  • Homework Assignment 5 (week 5)
  • Homework Assignment 6 (week 7)
  • Homework Assignment 7 (week 8)
  • Homework Assignment 8 (week 9)
  • Project Proposal (Forum)
  • Project Progress Report (Forum)
  • Project Report

Week 1 | What Is Data Mining?

  • 1.1 Readings
  • 1.2 FAQs
  • 1.3 Student Self Introduction
  • 1.4 Classification
  • 1.5 Clustering
  • 1.6 Association Rule Mining
  • 1.7 Relationship with Other Fields
  • 1.8 Descriptive vs. Predictive Analysis
  • 1.9 Challenges of Data Mining
  • 1.10 Data Communication Skills
  • 1.11 Install RStudio and Weka (see the setup sketch after this list)
  • 1.12 Week 1 Class Participation
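
A quick way to sanity-check the item 1.11 setup is a short R session. The package list below (arules, RWeka, e1071, tm) is my assumption based on the tools that appear in later weeks, not an official course requirement, and RWeka additionally needs a Java runtime alongside the separate Weka install.

    # Assumed package list -- not prescribed by the syllabus
    pkgs <- c("arules", "RWeka", "e1071", "tm")
    missing <- setdiff(pkgs, rownames(installed.packages()))
    if (length(missing) > 0) install.packages(missing)

    R.version.string                                 # which R installation RStudio is using
    sapply(pkgs, requireNamespace, quietly = TRUE)   # TRUE for every package that loads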

Week 2 | Data Preparation

  • 2.1 Readings
  • 2.2 Data and Code
  • 2.3 Data Set Types
  • 2.4 Attribute Types
  • 2.5 Convert Attribute Type in R
  • 2.6 Data Quality Issues
  • 2.7 Summary Statistics
  • 2.8 Visualization
  • 2.8.1 Exercise: Visualize Titanic Data (Forum)
  • 2.9 Aggregation
  • 2.10 Transformation
  • 2.10.1 Exercise: Transformation (Forum)
  • 2.11 Sampling
  • 2.12 Data Preparation in R (see the sketch after this list)
  • 2.13 Week 2 Class Participation
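
Tied to items 2.4 through 2.12, here is a minimal R sketch of attribute-type conversion, summary statistics, aggregation, and sampling. It uses the Titanic contingency table built into R as a stand-in; the course's own Titanic file and scripts are posted under item 2.2.

    # Built-in Titanic contingency table as a stand-in for the course data file
    titanic <- as.data.frame(Titanic)          # columns: Class, Sex, Age, Survived, Freq

    str(titanic)                               # inspect attribute types (factor vs. numeric)
    titanic$Class <- as.ordered(titanic$Class) # nominal -> ordinal, keeping the existing level order
    summary(titanic)                           # summary statistics per attribute

    # Aggregation: passenger counts by class and survival
    aggregate(Freq ~ Class + Survived, data = titanic, FUN = sum)

    # Simple random sampling of rows
    set.seed(1)
    titanic[sample(nrow(titanic), 5), ]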

Week 3 | Association Rule Mining

  • 3.1 Readings
  • 3.2 Data and Code
  • 3.3 What Is Association Rule Mining?
  • 3.4 Basic Concepts in Association Rule Mining
  • 3.5 Apriori Algorithm
  • 3.6 Rule Evaluation
  • 3.7 Weka AR Demonstration
  • 3.8 R AR Demonstration (see the sketch after this list)
  • 3.9 RMD Demonstration
  • 3.10 Week 3 Class Participation
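
For item 3.8, a minimal Apriori sketch with the arules package. The Groceries transactions ship with arules, and the support/confidence thresholds are illustrative choices, not the ones used in the course demonstration.

    library(arules)
    data("Groceries")                      # transaction data bundled with arules

    # Mine rules with illustrative support and confidence thresholds
    rules <- apriori(Groceries,
                     parameter = list(supp = 0.01, conf = 0.5, minlen = 2))

    # Rule evaluation: inspect the rules with the highest lift
    inspect(head(sort(rules, by = "lift"), 5))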

Week 4 | Clustering Techniques

  • 4.1 Readings
  • 4.2 Data and Code
  • 4.3 What Is Clustering Analysis?
  • 4.4 Distance Measure
  • 4.5 K-Means Algorithm
  • 4.6 Tuning K-Means
  • 4.7 Weka Demonstration on K-Means
  • 4.7.1 Centroid Interpretation (Forum)
  • 4.7.2 K-Means for Outlier Detection (Forum)
  • 4.8 R Demonstration on K-Means (see the sketch after this list)
  • 4.8.1 Forum: K-Means With R
  • 4.9 K-Means Case Study
  • 4.10 HAC Algorithm
  • 4.11 Week 4 Class Participation
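
For item 4.8, a minimal k-means and HAC sketch in base R on the iris measurements; the data set, k = 3, and the average-linkage choice are illustrative assumptions.

    x <- scale(iris[, 1:4])                # standardize so no attribute dominates the distance

    # K-means with k = 3 and multiple random starts
    set.seed(42)
    km <- kmeans(x, centers = 3, nstart = 25)
    km$centers                             # centroid interpretation
    table(km$cluster, iris$Species)        # compare clusters to the known labels

    # "Elbow" check for tuning k: total within-cluster sum of squares for k = 1..8
    wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
    plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

    # HAC with Euclidean distance and average linkage
    hc <- hclust(dist(x), method = "average")
    plot(hc, labels = FALSE)
    cutree(hc, k = 3)                      # cut the dendrogram into 3 clusters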

Week 5 | Decision Tree Theory

  • 5.1 Readings
  • 5.2 Data and Code
  • 5.3 Review Relevant Content
  • 5.4 What Is a Decision Tree Model?
  • 5.4.1 What Is a Decision Tree Model? (Forum)
  • 5.5 C4.5 Algorithm (1): How to Split Data at a Node
  • 5.6 C4.5 Algorithm (2): Which Attribute to Choose as a Node
  • 5.7 Information Gain Calculation Exercise (Optional, Forum)
  • 5.8 Gain Ratio
  • 5.9 Apply Model
  • 5.10 Overfitting and Pruning
  • 5.11 Weka J48 Tutorial
  • 5.11.1 Weka Titanic Data Preprocessing Tutorial
  • 5.11.2 Use Weka J48 to Build Decision Tree
  • 5.11.3 Forum: Weka J48 Exercise
  • 5.12 RWeka J48 Tutorial (see the sketch after this list)
  • 5.13 Week 5 Class Participation
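
For item 5.12, a minimal RWeka sketch; J48 is Weka's implementation of C4.5. It uses iris instead of the preprocessed Titanic data from the Weka tutorial, and the control values shown are simply J48's defaults. RWeka requires a Java runtime.

    library(RWeka)                              # J48 = Weka's implementation of C4.5

    # Grow a pruned tree; C is the pruning confidence factor, M the minimum leaf size
    fit <- J48(Species ~ ., data = iris,
               control = Weka_control(C = 0.25, M = 2))
    print(fit)                                  # the induced tree, split attributes first

    # Apply the model to new cases
    predict(fit, iris[c(1, 51, 101), ])

    # An unpruned tree for comparison with the overfitting/pruning discussion
    fit_unpruned <- J48(Species ~ ., data = iris, control = Weka_control(U = TRUE))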

Week 6 | Model Evaluation

  • 6.1 Readings
  • 6.2 Model Overfitting
  • 6.2.1 Forum: Unpruned and Pruned Trees
  • 6.3 Model Evaluation Methods
  • 6.4 Model Evaluation Metrics (see the sketch after this list)
  • 6.5 Model Comparison
  • 6.6 Training Set Size
  • 6.7 Training Data Acquisition
  • 6.8 Reproducible Research
  • 6.9 Week 6 Class Participation
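
For items 6.3 and 6.4, a minimal sketch of holdout evaluation and the usual metrics in base R. The rpart classifier, the 70/30 split, and the seed are stand-ins chosen for illustration; the same evaluation applies to any of the course's models.

    library(rpart)                                   # stand-in classifier for illustration

    set.seed(7)
    idx   <- sample(nrow(iris), size = round(0.7 * nrow(iris)))   # 70/30 holdout split
    train <- iris[idx, ]
    test  <- iris[-idx, ]

    fit  <- rpart(Species ~ ., data = train, method = "class")
    pred <- predict(fit, test, type = "class")

    cm <- table(Predicted = pred, Actual = test$Species)  # confusion matrix
    cm
    sum(diag(cm)) / sum(cm)                                # accuracy

    # Per-class precision and recall read off the confusion matrix
    precision <- diag(cm) / rowSums(cm)
    recall    <- diag(cm) / colSums(cm)
    rbind(precision, recall)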

Week 7 | Naive Bayes

  • 7.1 Readings
  • 7.2 Data and Code
  • 7.3 Review of Probability Concepts
  • 7.3.1 Chances Are…
  • 7.4 Bayes' Rule in the Mammogram Example
  • 7.5 Naive Bayes
  • 7.6 Smoothing
  • 7.7 Probability of Continuous Variable
  • 7.8 Summary of Naive Bayes
  • 7.9 Variations in Naive Bayes Implementations
  • 7.9.1 Mitchell’s Pseudocode for a Multinomial Naive Bayes Text Classifier
  • 7.10 Weka Demonstration
  • 7.11 R Demonstration (see the sketch after this list)
  • 7.12 Week 7 Class Participation
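
For item 7.11, a minimal naive Bayes sketch with e1071. The mammogram numbers in the comment are round figures made up purely to illustrate Bayes' rule, not the values used in class, and iris is a stand-in data set.

    # Bayes' rule illustration (assumed round numbers, not the course's figures):
    # P(cancer) = 0.01, P(+ | cancer) = 0.8, P(+ | no cancer) = 0.1
    # P(cancer | +) = 0.01 * 0.8 / (0.01 * 0.8 + 0.99 * 0.1) ≈ 0.075

    library(e1071)

    # Gaussian naive Bayes on continuous attributes;
    # laplace = 1 smooths categorical counts (no categorical predictors here, so no effect)
    nb <- naiveBayes(Species ~ ., data = iris, laplace = 1)
    nb$tables$Sepal.Length            # per-class mean and sd used for the Gaussian likelihood

    predict(nb, iris[c(1, 51, 101), ])                  # class predictions
    predict(nb, iris[c(1, 51, 101), ], type = "raw")    # posterior probabilities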

Week 8 | SVM and Others

  • 8.1 Readings
  • 8.2 Data and Code
  • 8.3 Review Distance Measure
  • 8.4 kNN
  • 8.5 Max-Margin SVMs
  • 8.6 Soft-Margin SVMs
  • 8.7 SVM Kernels
  • 8.7.1 SVM With a Polynomial Kernel Notation
  • 8.7.2 SVM Kernels, Continued
  • 8.8 Strengths and Weaknesses
  • 8.9 Ensemble Learning
  • 8.10 Weka Demonstration
  • 8.11 R Demonstration (see the sketch after this list)
  • 8.12 Exercise on kNN
  • 8.13 Exercise on SVMs
  • 8.14 Exercise on Random Forest
  • 8.15 Exercise on Algorithm Comparison
  • 8.16 Week 8 Class Participation
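
For item 8.11 and the exercises, a minimal sketch comparing kNN, a soft-margin SVM, and a random forest on one holdout split. The packages (class, e1071, randomForest) and the parameter values (k = 5, cost = 1, 200 trees) are my choices for illustration, not the course's settings.

    library(class)          # knn()
    library(e1071)          # svm()
    library(randomForest)   # randomForest()

    set.seed(7)
    idx   <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
    train <- iris[idx, ]
    test  <- iris[-idx, ]

    # kNN is distance-based, so standardize the attributes first
    x_tr <- scale(train[, 1:4])
    x_te <- scale(test[, 1:4],
                  center = attr(x_tr, "scaled:center"),
                  scale  = attr(x_tr, "scaled:scale"))
    pred_knn <- knn(x_tr, x_te, cl = train$Species, k = 5)

    # Soft-margin SVM with an RBF kernel; cost trades margin width against violations
    fit_svm  <- svm(Species ~ ., data = train, kernel = "radial", cost = 1)
    pred_svm <- predict(fit_svm, test)

    # Random forest as the ensemble-learning baseline
    fit_rf  <- randomForest(Species ~ ., data = train, ntree = 200)
    pred_rf <- predict(fit_rf, test)

    # Simple accuracy comparison across the three models
    sapply(list(knn = pred_knn, svm = pred_svm, rf = pred_rf),
           function(p) mean(p == test$Species))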

Week 9 | Text Mining

  • 9.1 Readings
  • 9.2 Data and Code
  • 9.3 Review Data Set Type in Week 2
  • 9.4 Review Importance of Normalization in Week 4
  • 9.5 Tokenization
  • 9.6 Vectorization (see the sketch after this list)
  • 9.7 Reducing Vocabulary Size
  • 9.8 Little Words Can Make a Big Difference
  • 9.9 Advanced Text Representation
  • 9.10 Weka Demonstration on Text Categorization
  • 9.11 Exercise: Weka Text Categorization
  • 9.12 Week 9 Class Participation
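
The week's demonstration uses Weka, but the same tokenization/vectorization pipeline (items 9.5 through 9.7) can be sketched in R with the tm package; the three toy documents and the sparsity cutoff below are made up for illustration.

    library(tm)

    docs <- c("Data mining finds patterns in data.",
              "Text mining applies data mining to text.",
              "Little words can make a big difference.")      # toy documents, made up

    corpus <- VCorpus(VectorSource(docs))
    corpus <- tm_map(corpus, content_transformer(tolower))        # normalization
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))   # reduce vocabulary size

    # Vectorization: document-term matrix with TF-IDF weights
    dtm <- DocumentTermMatrix(corpus,
                              control = list(weighting = weightTfIdf))
    inspect(dtm)

    # Further vocabulary reduction: drop terms absent from half or more of the documents
    dtm_small <- removeSparseTerms(dtm, sparse = 0.5)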

Week 10 | Literature Review / Student Project Presentation

  • 10.1 Readings
  • 10.2 Code and Data
  • 10.3 Exercise: Find Data Mining Applications in Academic and Business Literature
  • 10.4 Student Project Presentation
  • 10.5 Week 10 Class Participation