MSDS IST 652: Scripting for Data Analysis

3 minute read

Scripting for Data Analysis

MSDS - Q5: IST652 (2019-0703)

Description:

Scripting for the data science pipeline. Acquiring, accessing, and transforming data in the forms of structured, semistructured, and unstructured data.

Additional Course Description:

The goal of this class is to teach students the tools and skills of scripting needed to solve problems of accessing and preparing data in a variety of formats and situations, sometimes known as data wrangling. The scripting will provide the skills needed to form data science pipelines, from acquiring and cleaning data to accessing data and transforming data for analysis or visualization.

The main content focus is on information access and processing tasks on the types of structured, semistructured, and unstructured data in current use in information applications. For these three types of data, the course includes the use of structured numeric and text data such as that from a spreadsheet or database, the use of data obtained through standard data exchange formats such as HTML or XML from web pages or JSON from web-based APIs, and the use of data obtained by pattern matching from text or log files. The scripting language Python was chosen because of its ease of use and available packages to work with data in many information applications. The skills learned in this class are intended to complement the analytical and visualization skills learned in other data science courses. The scripting language Python will be taught, but it will be assumed that students already have a programming background, either through course work or through online study.

Learning Objectives:

Upon successful completion of this course, the student will be able to:

  • write scripts to access and amass data from fields in structured data, access fields in semistructured data, and define and find patterns of data in unstructured data;
  • prepare and transform data to produce data summaries, lists, and networks;
  • analyze and solve data access problems for the three types of data and to find and deploy appropriate software packages that can be integrated into the problem solution; and
  • frame real-world data questions and show how they can be answered from data.

Deliverables:

Assessments

  • Lab Problem 1
  • Lab Problem 2
  • Lab Problem 3
  • Quiz 1
  • Quiz 2
  • Participation
  • Homework 1: Structured Data
  • Homework 2: Semistructured Data
  • Final Project ProposalAssignment You must submit this assignment to mark it complete.
  • Final Project Presentation
  • Final Project Report

Unit 1 | Data Pipeline and Python Language Basics

  • 1.1 Readings
  • 1.2 Why Python for Data Analysis?
  • 1.3 Building Blocks of Python
  • 1.4 Booleans and Conditional Statements
  • 1.5 Lists and Simple Loops

Unit 2 | Booleans and Dictionaries

  • 2.1 Readings
  • 2.2 More About Strings
  • 2.3 Tuples
  • 2.4 Writing a Program
  • 2.5 Reading Files

Unit 3 | Exploring and Transforming Data for Structured Data

  • 3.1 Readings
  • 3.2 Data Exploration
  • 3.3 Using Dictionaries
  • 3.4 Sorting Lists
  • 3.5 Reading CSV files
  • 3.6 Loops and Numeric Summarization

Unit 4 | Arrays, Functions, and Categorical Summarization

  • 4.1 Readings
  • 4.2 Functions and Documentation Strings
  • 4.3 Categorical Summarization
  • 4.4 Numeric Arrays
  • 4.5 Reading Data From a Spreadsheet

Unit 5 | Stacking and Unstacking Data

  • 5.1 Readings
  • 5.2 Data Transformations
  • 5.3 Data Stacking and Unstacking
  • 5.4 Writing Reports and Tables to a File

Unit 6 | Semi-structured Data

  • 6.1 Readings
  • 6.2 Semi-structured Data
  • 6.3 HTML—Hypertext Markup Language
  • 6.4 XML, DOM, and Element Tree
  • 6.5 Beautiful Soup

Unit 7 | Mongo Database, JSON From RSS

  • 7.1 Readings
  • 7.2 Semi-structured Data: JSON
  • 7.3 Unicode and Python
  • 7.4 MongoDB
  • 7.5 MongoDB and JSON Data

Unit 8 | Processing Twitter and Facebook

  • 8.1 Readings
  • 8.2 Twitter API and Python Package Tweepy
  • 8.3 Users, Trends, and Tweets
  • 8.4 Twitter: Collecting Tweets
  • 8.5 Facebook API

Unit 9 | Unstructured Data

  • 9.1 Readings
  • 9.2 Text Tokenization
  • 9.3 Regular Expressions
  • 9.4 Finding Patterns in Text
  • 9.5 Sentiment Analysis

Unit 10 | Network Structures

  • 10.1 Readings
  • 10.2 Social Network Analysis
  • 10.3 Showing Geolocations on Maps
  • 10.4 Facebook Post Comments