Tutorial: Feature Engineering

In sklearn, you are not limited to word features: you can also add calculated features such as sentence length, average word length, whether negations occur, or anything else you can define and compute.
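
For instance, sentence length and average word length can be computed directly from the raw text. A minimal sketch (the column names here are just illustrative):

import pandas as pd

df = pd.DataFrame({'text': ['this is good', 'this is not bad']})
# number of words per text
df['num_words'] = df['text'].apply(lambda x: len(x.split()))
# average word length in characters
df['avg_word_len'] = df['text'].apply(
    lambda x: sum(len(w) for w in x.split()) / len(x.split()))
print(df)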

sklearn provides a FeatureUnion tool to combine features from different sources: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py

FeatureUnion works together with sklearn's Pipeline. It is a powerful tool, but it is not convenient for storing and checking intermediate results. Below is a slower but more intuitive approach.
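
For reference, a FeatureUnion setup looks roughly like this. This is only a sketch, assuming a small helper (neg_column, defined here, not part of sklearn) that turns raw texts into one dense negation column:

import re
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

pattern_neg = re.compile(r'\b(not|no|never)\b')

def neg_column(texts):
    # one dense column: 1.0 if the text contains a negation word, else 0.0
    return np.array([[1.0 if pattern_neg.search(t.lower()) else 0.0] for t in texts])

clf = Pipeline([
    ('features', FeatureUnion([
        ('words', CountVectorizer(binary=True)),
        ('neg', FunctionTransformer(neg_column, validate=False)),
    ])),
    ('svm', LinearSVC()),
])
# clf.fit(texts, labels) would then vectorize and train in one shot,
# but the intermediate feature matrices stay hidden inside the pipeline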

First, let's look at a simple example that adds a feature to detect negations in text. Here negation is simply defined as the occurrence of any of the three words "not", "no", and "never". A pandas DataFrame stores the original text in one column and the generated negation feature in another. Later the text is vectorized, and scipy.sparse.hstack() is used to combine the word vectors and the negation feature.

In [90]:
import pandas as pd
import re
txts = ['this is good', 'this is bad', 'this is not good', 'this is not bad', 'this is useless']
df = pd.DataFrame({'text':txts})
pattern_neg = re.compile(r'\b(not|no|never)\b')
df['neg'] = df['text'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)
print(df)
               text  neg
0      this is good    0
1       this is bad    0
2  this is not good    1
3   this is not bad    1
4   this is useless    0

Alternatively, you can define a more complicated negation-detection function separately. The version below also flags words ending in "less", such as "useless".

In [91]:
def has_negation(post):
    # negation words, plus words ending in "less" (e.g. "useless")
    pattern_neg_1 = re.compile(r'\b(not|no|never)\b')
    pattern_neg_2 = re.compile(r'\b([a-z]+less)\b')
    if pattern_neg_1.search(post.lower()) or pattern_neg_2.search(post.lower()):
        return 1
    else:
        return 0

# has_negation already returns 1 or 0, so it can be applied directly
df['neg'] = df['text'].apply(has_negation)
print(df)
               text  neg
0      this is good    0
1       this is bad    0
2  this is not good    1
3   this is not bad    1
4   this is useless    1

Now vectorize the text and combine the word vectors with the negation feature values.

In [56]:
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
unigram_bool = CountVectorizer(encoding='latin-1', binary=True)
vecs = unigram_bool.fit_transform(df['text']).astype(float)
#print(vecs)
X_dense = df[['neg']]
X_sparse = vecs
# append the dense negation column after the word-vector columns
X = sparse.hstack([X_sparse, X_dense]).tocsr()
print(X)
  (0, 1)	1.0
  (0, 2)	1.0
  (0, 4)	1.0
  (1, 0)	1.0
  (1, 2)	1.0
  (1, 4)	1.0
  (2, 1)	1.0
  (2, 2)	1.0
  (2, 3)	1.0
  (2, 4)	1.0
  (2, 5)	1.0
  (3, 0)	1.0
  (3, 2)	1.0
  (3, 3)	1.0
  (3, 4)	1.0
  (3, 5)	1.0
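
To see which column is which, you can recover the vectorizer's vocabulary in column order; the negation feature is the last column appended by hstack. A quick check, assuming the cell above has been run:

# vocabulary terms in column order, plus the appended 'neg' column
vocab = sorted(unigram_bool.vocabulary_, key=unigram_bool.vocabulary_.get)
print(vocab + ['neg'])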

Let's try this approach on the Kaggle sentiment data and see how much improvement simple negation detection can offer.

In [80]:
train = pd.read_csv("/Users/byu/Desktop/data/kaggle/train.tsv", delimiter='\t')
y = train['Sentiment']

# flag negations in each phrase, then combine with the word vectors
pattern_neg = re.compile(r'\b(not|no|never)\b')
train['neg'] = train['Phrase'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)
X_dense = train[['neg']]
X_sparse = unigram_bool.fit_transform(train['Phrase']).astype(float)
X = sparse.hstack([X_sparse, X_dense]).tocsr()
In [74]:
%%time
# test the model with negation detection
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
svm_clf = LinearSVC()
scores = cross_val_score(svm_clf, X, y, cv=3, n_jobs=3)
avg = sum(scores) / len(scores)
print(avg)
0.569293741676
CPU times: user 265 ms, sys: 77.4 ms, total: 342 ms
Wall time: 1min 37s

In [75]:
%%time
# test the model without negation detection
# note: this cross-validation is not the standard pipeline method,
# but a cut-corner version that vectorizes first and then trains/tests models.
# the cut-corner version lets the vectorizer see the text of the test folds,
# but the model still never sees the labels of the test folds
svm_clf2 = LinearSVC()
scores2 = cross_val_score(svm_clf2, X_sparse, y, cv=3, n_jobs=3)
avg2 = sum(scores2) / len(scores2)
print(avg2)
0.569223263356
CPU times: user 263 ms, sys: 74.1 ms, total: 337 ms
Wall time: 1min 43s


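For comparison, the standard method keeps vectorization inside the cross-validation loop by wrapping the vectorizer and classifier in a Pipeline, so the vectorizer is refit on the training folds only. A sketch using just the word features:

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('vec', CountVectorizer(encoding='latin-1', binary=True)),
    ('clf', LinearSVC()),
])
# cross_val_score refits the whole pipeline on each training fold
scores_pipe = cross_val_score(pipe, train['Phrase'], y, cv=3, n_jobs=3)
print(sum(scores_pipe) / len(scores_pipe))
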
In [79]:
print(X.shape)
# accuracy difference times number of samples = extra correct predictions
num_more_correct_predictions = X.shape[0] * (0.569293741676 - 0.569223263356)
print(num_more_correct_predictions)
(156060, 15241)
10.998846619193118

Negation detection yields only about 11 more correct predictions, which is perhaps not surprising given how simplistic this negation detector is.
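
If you want to push further, one obvious refinement is to also catch contracted negations such as "isn't" or "can't". A hedged sketch (the word list and regex are illustrative, not exhaustive):

pattern_neg_v2 = re.compile(r"\b(not|no|never|nothing|nobody|none)\b|n't\b")

def has_negation_v2(post):
    # also flags contracted negations like "isn't" and "can't"
    return 1 if pattern_neg_v2.search(post.lower()) else 0

train['neg'] = train['Phrase'].apply(has_negation_v2)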