Tutorial - vectorize text documents in sklearn

This tutorial demonstrates how to use the scikit-learn (sklearn) package to vectorize text documents.

Step 1: Read in data

In [1]:
# read in the tab-separated (tsv) file as input
import pandas as pd
# use a name other than "input" so the Python built-in is not shadowed
reviews = pd.read_csv("../A-data/moviereview.tsv", delimiter='\t')
docs = reviews['text'].values
print(docs[0:2])
['\'plot : two teen couples go to a church party , drink and then drive . \\nthey get into an accident . \\none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \\nwhat\\\'s the deal ? \\nwatch the movie and \\" sorta \\" find out . . . \\ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \\nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\\\'t snag this one correctly . \\nthey seem to have taken this pretty neat concept , but executed it terribly . \\nso what are the problems with the movie ? \\nwell , its main problem is that it\\\'s simply too jumbled . \\nit starts off \\" normal \\" but then downshifts into this \\" fantasy \\" world in which you , as an audience member , have no idea what\\\'s going on . \\nthere are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . \\nnow i personally don\\\'t mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film\\\'s biggest problem . \\nit\\\'s obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . \\nand do they make things entertaining , thrilling or even engaging , in the meantime ? \\nnot really . 
\\nthe sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn\\\'t the make the film all that more entertaining . \\ni guess the bottom line with movies like this is that you should always make sure that the audience is \\" into it \\" even before they are given the secret password to enter your world of understanding . \\ni mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! \\nokay , we get it . . . there \\nare people chasing her and we don\\\'t know who they are . \\ndo we really need to see it over and over again ? \\nhow about giving us different scenes offering further insight into all of the strangeness going down in the movie ? \\napparently , the studio took this film away from its director and chopped it up themselves , and it shows . \\nthere might\\\'ve been a pretty decent teen mind-fuck movie in here somewhere , but i guess \\" the suits \\" decided that turning it into a music video with little edge , would make more sense . \\nthe actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . \\nbut my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character\\\'s unraveling . \\noverall , the film doesn\\\'t stick because it doesn\\\'t entertain , it\\\'s confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . \\noh , and by the way , this is not a horror or teen slasher flick . . . it\\\'s \\njust packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . 
\\nit also wrapped production two years ago and has been sitting on the shelves ever since . \\nwhatever . . . skip \\nit ! \\nwhere\\\'s joblo coming from ? \\na nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) \\n\''
 '\'the happy bastard\\\'s quick movie review \\ndamn that y2k bug . \\nit\\\'s got a head start in this movie starring jamie lee curtis and another baldwin brother ( william this time ) in a story regarding a crew of a tugboat that comes across a deserted russian tech ship that has a strangeness to it when they kick the power back on . \\nlittle do they know the power within . . . \\ngoing for the gore and bringing on a few action sequences here and there , virus still feels very empty , like a movie going for all flash and no substance . \\nwe don\\\'t know why the crew was really out in the middle of nowhere , we don\\\'t know the origin of what took over the ship ( just that a big pink flashy thing hit the mir ) , and , of course , we don\\\'t know why donald sutherland is stumbling around drunkenly throughout . \\nhere , it\\\'s just \\" hey , let\\\'s chase these people around with some robots \\" . \\nthe acting is below average , even from the likes of curtis . \\nyou\\\'re more likely to get a kick out of her work in halloween h20 . \\nsutherland is wasted and baldwin , well , he\\\'s acting like a baldwin , of course . \\nthe real star here are stan winston\\\'s robot design , some schnazzy cgi , and the occasional good gore shot , like picking into someone\\\'s brain . \\nso , if robots and body parts really turn you on , here\\\'s your movie . \\notherwise , it\\\'s pretty much a sunken ship of a movie . \\n\'']
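The printed reviews above appear to contain the literal two-character sequence \n (a backslash followed by "n") instead of real line breaks, as well as a stray quote at each end. Left as is, the default tokenizer would turn "\nthey" into the token "nthey". The following is only a minimal sketch of one possible cleanup, not part of the original notebook: the two sample rows, the "label" column, and the clean_review helper are hypothetical stand-ins, since the real file is not reproduced here.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for moviereview.tsv (assumed columns: label, text).
# "\\n" in this Python source is a literal backslash followed by "n", as in the data.
sample_tsv = (
    "label\ttext\n"
    "neg\t'plot : two teen couples go to a church party . \\nthey get into an accident . \\n'\n"
    "neg\t'the acting is below average . \\nthe robots are fun . \\n'\n"
)

reviews = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t")

def clean_review(text):
    """Hypothetical helper: replace literal backslash-n with a space and trim stray quotes."""
    text = text.replace("\\n", " ")
    return text.strip().strip("'").strip()

cleaned_docs = reviews["text"].map(clean_review).values
print(cleaned_docs[0])
```

Passing cleaned_docs instead of docs to a vectorizer should keep such artifacts out of the vocabulary, though whether this cleanup is appropriate depends on how the file was originally produced.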

Step 2: Prepare Vectorizer

In [4]:
# sklearn provides several text vectorizers; this tutorial uses two of them

# CountVectorizer can give you Boolean or TF vectors
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

# TfidfVectorizer can give you TF or TFIDF vectors
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

# Read the sklearn documentation to understand all vectorization options

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# several commonly used vectorizer settings

#  unigram boolean vectorizer, set minimum document frequency to 5
unigram_bool_vectorizer = CountVectorizer(encoding='latin-1', binary=True, min_df=5, stop_words='english')

#  unigram term frequency vectorizer, set minimum document frequency to 5
unigram_count_vectorizer = CountVectorizer(encoding='latin-1', binary=False, min_df=5, stop_words='english')

#  unigram and bigram term frequency vectorizer, set minimum document frequency to 5
gram12_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english')

#  unigram tfidf vectorizer, set minimum document frequency to 5
unigram_tfidf_vectorizer = TfidfVectorizer(encoding='latin-1', use_idf=True, min_df=5, stop_words='english')
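To make the difference between these settings concrete, here is a small sketch on a three-document toy corpus. The corpus and variable names are made up for illustration. The stop_words and encoding options are left out, and min_df stays at its default of 1 for most vectorizers, because a corpus this small would otherwise lose nearly every term.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, small enough to check by hand
toy_docs = [
    "good movie good plot",
    "bad movie",
    "good acting bad plot",
]

bool_vec = CountVectorizer(binary=True)           # 1 if the term occurs, else 0
count_vec = CountVectorizer(binary=False)         # raw term frequency
gram12_vec = CountVectorizer(ngram_range=(1, 2))  # unigrams plus bigrams
min_df2_vec = CountVectorizer(min_df=2)           # keep terms found in >= 2 documents
tfidf_vec = TfidfVectorizer(use_idf=True)         # tf-idf weights, L2-normalized per row

bool_matrix = bool_vec.fit_transform(toy_docs)
count_matrix = count_vec.fit_transform(toy_docs)
gram12_matrix = gram12_vec.fit_transform(toy_docs)
min_df2_matrix = min_df2_vec.fit_transform(toy_docs)
tfidf_matrix = tfidf_vec.fit_transform(toy_docs)

good_idx = count_vec.vocabulary_["good"]
print(bool_matrix[0, good_idx])         # 1: "good" is present in document 0
print(count_matrix[0, good_idx])        # 2: "good" occurs twice in document 0
print(len(count_vec.vocabulary_))       # 5 unigrams
print(len(gram12_vec.vocabulary_))      # 12 = 5 unigrams + 7 bigrams
print(sorted(min_df2_vec.vocabulary_))  # "acting" is dropped: it occurs in only one document
print(tfidf_matrix[0].toarray().round(2))
```

The Boolean and count vectorizers share the same vocabulary and differ only in the cell values, while ngram_range and min_df change which columns exist at all.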

Step 3: Vectorize the text documents

In [5]:
# A vectorizer supports "fit" and "transform"
# fit collects the unique tokens in the documents into a vocabulary
# transform converts each document into a vector based on that vocabulary
# The two steps can be run together with fit_transform(), or separately with fit() and transform()

# fit the vocabulary on the documents and transform the documents into vectors
vecs = unigram_count_vectorizer.fit_transform(docs)

# check the shape of the document-term matrix and the content of the first document vector
print(vecs.shape)
print(vecs[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_count_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_count_vectorizer.vocabulary_.items())[:10])

# look up the index of a word in the vocabulary
print(unigram_count_vectorizer.vocabulary_.get('year'))
(2000, 13724)
[[0 0 0 ... 0 0 0]]
13724
[('plot', 9380), ('teen', 12334), ('couples', 2622), ('church', 2013), ('party', 9067), ('drink', 3551), ('drive', 3556), ('nthey', 8563), ('accident', 196), ('guys', 5200)]
13670