In sklearn, you can use not only the word features, but also some other calculated features like sentence length, average word length, whether negations occur, or anything you can define and calculate.
sklearn provides a FeatureUnion tool to combine features from different sources: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
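For comparison, here is a minimal sketch of what the FeatureUnion route might look like on a toy corpus. The `NegationFlag` class is a hypothetical helper written for this illustration, not something sklearn ships; it emits a single column that is 1.0 when a text contains "not", "no", or "never".

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class NegationFlag(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column, 1.0 if the text has not/no/never."""

    pattern = re.compile(r'\b(not|no|never)\b')

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        # one row per text, one column holding the negation flag
        return np.array([[1.0 if self.pattern.search(t.lower()) else 0.0]
                         for t in X])


texts = ['this is good', 'this is bad', 'this is not good',
         'this is not bad', 'this is useless']

union = FeatureUnion([
    ('words', CountVectorizer(binary=True)),  # boolean unigram features
    ('neg', NegationFlag()),                  # extra negation column
])
X_union = union.fit_transform(texts)
print(X_union.shape)  # word columns plus one negation column
```

The negation column ends up as the last column of the combined matrix, because FeatureUnion concatenates the transformer outputs in the order they are listed.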
FeatureUnion works through sklearn's pipeline interface. It is a powerful tool, but it is not convenient for storing and checking intermediate results. Below is a slower but more intuitive approach.
First, let's look at a simple example that adds a feature to detect negations in text. Here negation is simply defined as the occurrence of any of the three words "not", "no", and "never". In this example a pandas DataFrame stores the original text data in one column and the generated negation feature in another column. Later the text data will be vectorized, and scipy.sparse.hstack() will be used to combine the word vectors and the negation feature.
import pandas as pd
import re
txts = ['this is good', 'this is bad', 'this is not good', 'this is not bad', 'this is useless']
df = pd.DataFrame({'text':txts})
pattern_neg = re.compile(r'\b(not|no|never)\b')
df['neg'] = df['text'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)
print(df)
# extended detector: also flag words ending in "-less" (e.g. "useless")
def has_negation(post):
    pattern_neg_1 = re.compile(r'\b(not|no|never)\b')
    pattern_neg_2 = re.compile(r'\b([a-z]+less)\b')
    if pattern_neg_1.search(post.lower()) or pattern_neg_2.search(post.lower()):
        return 1
    else:
        return 0
# has_negation already returns 1 or 0, so it can be applied directly
df['neg'] = df['text'].apply(has_negation)
print(df)
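To get a feel for what the two regular expressions catch and miss, the sketch below runs the same patterns over a few probe sentences. The probe sentences are made up for illustration; the patterns are the ones used above, compiled once at module level.

```python
import re

PATTERN_WORDS = re.compile(r'\b(not|no|never)\b')
PATTERN_LESS = re.compile(r'\b([a-z]+less)\b')


def has_negation(post):
    """Return 1 if either negation pattern matches the lowercased text."""
    text = post.lower()
    return 1 if PATTERN_WORDS.search(text) or PATTERN_LESS.search(text) else 0


# hypothetical probe sentences, chosen to show edge cases
probes = ['No way', 'nothing special', 'this is useless',
          'unless it rains', "I don't like it"]
results = {p: has_negation(p) for p in probes}
for text, flag in results.items():
    print(flag, text)
```

The word boundaries keep "nothing" from matching "no" or "not", but the "-less" pattern also fires on "unless", and contractions such as "don't" are missed. These are the kinds of limits to keep in mind when judging how much this feature can help.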
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
unigram_bool = CountVectorizer(encoding='latin-1', binary=True)
vecs = unigram_bool.fit_transform(df['text']).astype(float)
#print(vecs)
X_dense = df[['neg']]
X_sparse = vecs
X = sparse.hstack([X_sparse, X_dense]).tocsr()
print(X)
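Printing a sparse matrix lists only its non-zero coordinates, which makes it hard to see where the extra column went. The small sketch below, on made-up numbers, stacks a sparse block and a dense column the same way and prints the dense view; the variable names are illustrative only.

```python
import numpy as np
from scipy import sparse

# toy stand-in for the vectorized text: 2 documents, 3 word columns
word_vecs = sparse.csr_matrix(np.array([[1.0, 0.0, 1.0],
                                        [0.0, 1.0, 1.0]]))
# toy stand-in for the negation feature: one column, one row per document
neg_col = np.array([[0.0],
                    [1.0]])

X_demo = sparse.hstack([word_vecs, neg_col]).tocsr()
print(X_demo.shape)
print(X_demo.toarray())  # dense view; only sensible for small matrices
```

The dense column is appended to the right of the word columns, so the negation feature is always the last column of the combined matrix.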
train=pd.read_csv("/Users/byu/Desktop/data/kaggle/train.tsv", delimiter='\t')
y=train['Sentiment']
pattern_neg = re.compile(r'\b(not|no|never)\b')
train['neg'] = train['Phrase'].apply(lambda x: 1 if pattern_neg.search(x.lower()) else 0)
X_dense = train[['neg']]
X_sparse = unigram_bool.fit_transform(train['Phrase']).astype(float)
X = sparse.hstack([X_sparse, X_dense]).tocsr()
%%time
# test the model with negation detection
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
svm_clf= LinearSVC()
scores = cross_val_score(svm_clf, X, y, cv=3, n_jobs=3)
avg=sum(scores)/len(scores)
print(avg)
Sample output:
0.569293741676
CPU times: user 265 ms, sys: 77.4 ms, total: 342 ms
Wall time: 1min 37s
%%time
# test the model without negation detection
# note: this cross-validation is not the standard pipeline method but a
# shortcut that vectorizes all the data first and then trains/tests the models.
# With this shortcut the vectorizer sees the text of the test folds,
# but the model still never sees the labels of the test folds.
svm_clf2= LinearSVC()
scores2 = cross_val_score(svm_clf2, X_sparse, y, cv=3, n_jobs=3)
avg2=sum(scores2)/len(scores2)
print(avg2)
Sample output:
0.569223263356
CPU times: user 263 ms, sys: 74.1 ms, total: 337 ms
Wall time: 1min 43s
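The comments in the cell above mention that vectorizing before cross-validation is a shortcut. For reference, a minimal sketch of the standard pipeline method might look like the following, where the vectorizer is refit on the training part of every fold. The toy texts and labels are invented for illustration and are far too small to give a meaningful score.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# hypothetical toy data: 1 = positive, 0 = negative
toy_texts = ['this is good', 'this is bad', 'this is not good', 'this is not bad',
             'really good', 'really bad', 'not so good', 'not so bad']
toy_labels = [1, 0, 0, 1, 1, 0, 0, 1]

# vectorizer and classifier are fit together inside each training fold,
# so the vocabulary never includes words seen only in the test fold
pipe = make_pipeline(CountVectorizer(binary=True), LinearSVC())
pipe_scores = cross_val_score(pipe, toy_texts, toy_labels, cv=2)
print(pipe_scores.mean())
```

The trade-off is the one described earlier: the pipeline version is the cleaner evaluation, but the intermediate feature matrix is rebuilt inside each fold and is less convenient to inspect.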
print(X.shape)
# X.shape[0] is the number of phrases (156060 in this data set)
num_more_correct_predictions = X.shape[0]*(0.569293741676 - 0.569223263356)
print(num_more_correct_predictions)
Adding negation detection yields only about 11 more correct predictions out of 156060 phrases, which is not surprising given how simplistic this negation detector is.