Presentation Talking Points


DATA CLEANING:

This is the part we would leave out if we were presenting to our stakeholders. However, since we are presenting to fellow data scientists, we wanted to include this part about data cleaning, and we hope Professor Gates will allow (and maybe even get excited by) these slides.

As data scientists, we know two things: (1) the world is full of data, and (2) almost none of it is in a tidy format like our 707 class lulled us into believing. This class highlights a particularly challenging type (pun intended) of data: text. With our spirits high and our toolbags full, we prepared to run our text data (last statements) through our pipelines, no problem. What we didn’t anticipate, as the naive young saplings we are, is that much of this text data wasn’t actually… in text format. At least, not yet. It was in image format: scanned images of text, from different types of intake forms, spanning over 50 years.

Dog and wolf

THREE IMAGES:

Standardization was a big issue for us. Here are three examples of the different kinds of documents we were working with:

How?! How would we handle these issues? Enter the beautiful world of Optical Character Recognition (OCR).

OCR EXAMPLES:

First, what do we mean by “good” and “bad”? How did OCR do on the documents? Did it read the text well? How “standardized” were the documents? “Bad” means not standard or handwritten (see the previous slide).
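One way to put a number on "did it read the text well" is to compare the OCR output against a hand-typed copy of the same document. The sketch below is only an illustration, not the scoring we actually used: the function names, the example strings, and the 0.9 cutoff are all assumptions made up for this example, and it relies only on `difflib` from the Python standard library.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are not counted as errors."""
    return " ".join(text.lower().split())


def ocr_similarity(ocr_text: str, reference_text: str) -> float:
    """Return a 0-1 similarity ratio between OCR output and a hand-typed reference."""
    return SequenceMatcher(None, normalize(ocr_text), normalize(reference_text)).ratio()


def label_quality(score: float, threshold: float = 0.9) -> str:
    """Label a document "good" or "bad". The 0.9 cutoff is an arbitrary assumption."""
    return "good" if score >= threshold else "bad"


# Made-up strings for illustration only, not taken from our data.
reference = "I would like to thank my family."
clean_scan = "I would like to thank my family."
messy_scan = "l wou1d 1ike t0 th@nk rny fam"

for name, text in [("clean", clean_scan), ("messy", messy_scan)]:
    score = ocr_similarity(text, reference)
    print(name, round(score, 2), label_quality(score))
```

A ratio like this is crude (it treats every character the same), but it would be enough to sort documents into rough "good" and "bad" piles before deciding which ones need manual transcription.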