HW4 -- Sentiment and Lies

STEP 1: Import the data

NOTE: You may need to change the delimiter (the sep argument) to match the data file.

In [2]:
import pandas as pd
df = pd.read_csv('deception_data_converted_final.csv',  sep='\t')
df[:5]
Out[2]:
lie,sentiment,review
0 f,n,'Mike\'s Pizza High Point, NY Service was ...
1 f,n,'i really like this buffet restaurant in M...
2 f,n,'After I went shopping with some of my fri...
3 f,n,'Olive Oil Garden was very disappointing. ...
4 f,n,'The Seven Heaven restaurant was never kno...
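The header in the output above comes back as one fused column, `lie,sentiment,review`, which suggests the tab separator did not match this file. A minimal sketch of checking a header line before choosing `sep`; `guess_delimiter` and its candidate list are hypothetical helpers for illustration, not part of pandas:

```python
def guess_delimiter(header_line, candidates=(',', '\t', ';', '|')):
    """Return the candidate that occurs most often in the header line."""
    # Hypothetical helper: a simple count, not a full CSV dialect sniffer.
    counts = {delim: header_line.count(delim) for delim in candidates}
    best = max(counts, key=counts.get)
    # If no candidate occurs at all, the file may hold a single column.
    return best if counts[best] > 0 else None


print(repr(guess_delimiter('lie,sentiment,review')))
```

Note that a review containing commas would itself be split by `sep=','`, so a matching delimiter alone may not be enough for this file.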

STEP 2: Pull out the labels

In [3]:
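def get_labels(row):
    # Each raw line has the form: lie,sentiment,review
    split_row = str(row).split(',')
    lie = split_row[0]
    sentiment = split_row[1]
    # The review may contain commas, so keep every remaining piece.
    return [lie, sentiment, split_row[2:]]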

df['all'] = df.apply(lambda row: get_labels(row['lie,sentiment,review']), axis=1)
df[:5]
Out[3]:
lie,sentiment,review all
0 f,n,'Mike\'s Pizza High Point, NY Service was ... [f, n, ['Mike\'s Pizza High Point, NY Service...
1 f,n,'i really like this buffet restaurant in M... [f, n, ['i really like this buffet restaurant ...
2 f,n,'After I went shopping with some of my fri... [f, n, ['After I went shopping with some of my...
3 f,n,'Olive Oil Garden was very disappointing. ... [f, n, ['Olive Oil Garden was very disappointi...
4 f,n,'The Seven Heaven restaurant was never kno... [f, n, ['The Seven Heaven restaurant was never...
In [4]:
df['lie'] = df.apply(lambda row: row['all'][0][0], axis=1)
df[:5]
Out[4]:
lie,sentiment,review all lie
0 f,n,'Mike\'s Pizza High Point, NY Service was ... [f, n, ['Mike\'s Pizza High Point, NY Service... f
1 f,n,'i really like this buffet restaurant in M... [f, n, ['i really like this buffet restaurant ... f
2 f,n,'After I went shopping with some of my fri... [f, n, ['After I went shopping with some of my... f
3 f,n,'Olive Oil Garden was very disappointing. ... [f, n, ['Olive Oil Garden was very disappointi... f
4 f,n,'The Seven Heaven restaurant was never kno... [f, n, ['The Seven Heaven restaurant was never... f
In [5]:
df['sentiment'] = df.apply(lambda row: row['all'][1][0], axis=1)
df[:5]
Out[5]:
lie,sentiment,review all lie sentiment
0 f,n,'Mike\'s Pizza High Point, NY Service was ... [f, n, ['Mike\'s Pizza High Point, NY Service... f n
1 f,n,'i really like this buffet restaurant in M... [f, n, ['i really like this buffet restaurant ... f n
2 f,n,'After I went shopping with some of my fri... [f, n, ['After I went shopping with some of my... f n
3 f,n,'Olive Oil Garden was very disappointing. ... [f, n, ['Olive Oil Garden was very disappointi... f n
4 f,n,'The Seven Heaven restaurant was never kno... [f, n, ['The Seven Heaven restaurant was never... f n
In [6]:
df['review'] = df.apply(lambda row: ''.join(row['all'][2]), axis=1)
df[:5]
Out[6]:
lie,sentiment,review all lie sentiment review
0 f,n,'Mike\'s Pizza High Point, NY Service was ... [f, n, ['Mike\'s Pizza High Point, NY Service... f n 'Mike\'s Pizza High Point NY Service was very ...
1 f,n,'i really like this buffet restaurant in M... [f, n, ['i really like this buffet restaurant ... f n 'i really like this buffet restaurant in Marsh...
2 f,n,'After I went shopping with some of my fri... [f, n, ['After I went shopping with some of my... f n 'After I went shopping with some of my friend ...
3 f,n,'Olive Oil Garden was very disappointing. ... [f, n, ['Olive Oil Garden was very disappointi... f n 'Olive Oil Garden was very disappointing. I ex...
4 f,n,'The Seven Heaven restaurant was never kno... [f, n, ['The Seven Heaven restaurant was never... f n 'The Seven Heaven restaurant was never known f...
In [7]:
clean_df = df.copy()
In [8]:
clean_df.drop(['lie,sentiment,review', 'all'], axis=1, inplace=True)
In [9]:
clean_df
Out[9]:
lie sentiment review
0 f n 'Mike\'s Pizza High Point NY Service was very ...
1 f n 'i really like this buffet restaurant in Marsh...
2 f n 'After I went shopping with some of my friend ...
3 f n 'Olive Oil Garden was very disappointing. I ex...
4 f n 'The Seven Heaven restaurant was never known f...
... ... ... ...
87 t p 'Pastablities is a locally owned restaurant in...
88 t p 'I like the Pizza at Dominoes for their specia...
89 t p 'It was a really amazing Japanese restaurant. ...
90 t p 'How do I even pick a best experience at Joe\'...
91 t p 'My sister and I ate at this restaurant called...

92 rows × 3 columns
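Joining the review pieces with `''` drops the commas that were inside each review (compare `High Point, NY` in the raw column with `High Point NY` above). If those commas matter, one alternative is to split on the first two commas only. A minimal sketch on made-up rows, assuming every line has the form `lie,sentiment,review`; the sample data and the names `raw` and `parts` are illustrative only:

```python
import pandas as pd

# Toy frame mimicking the single fused column from Step 1 (made-up rows).
raw = pd.DataFrame({
    'lie,sentiment,review': [
        "f,n,'Slow service, cold pizza.'",
        "t,p,'Great sushi, friendly staff, fair prices.'",
    ]
})

# n=2 limits the split to the first two commas, so commas inside the
# review text stay in the third column.
parts = raw['lie,sentiment,review'].str.split(',', n=2, expand=True)
parts.columns = ['lie', 'sentiment', 'review']

print(parts['review'].tolist())
```

This is a swapped-in technique, not what the notebook does; the surrounding quotes are still present and would be handled by the cleaning step.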

STEP 3: Clean the data

In [10]:
def clean_rogue_characters(string):
    # Characters to strip: backslash, single quote, double quote
    exclude = ['\\', "'", '"']
    # Remove literal "\n" sequences (a backslash followed by the letter n)
    string = ''.join(string.split('\\n'))
    string = ''.join(ch for ch in string if ch not in exclude)
    return string

clean_df['review'] = clean_df['review'].apply(clean_rogue_characters)
clean_df['review'][0]
Out[10]:
'Mikes Pizza High Point NY Service was very slow and the quality was low. You would think they would know at least how to make good pizza not. Stick to pre-made dishes like stuffed pasta or a salad. You should consider dining else where.'
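To make the cleaning rule concrete, here is a small sketch of the same idea on a made-up string: drop literal `\n` sequences, then drop backslashes and quote characters. It uses `str.translate` instead of the character loop; `strip_rogue_characters` is a hypothetical name, and the behavior is intended to mirror `clean_rogue_characters` rather than replace it:

```python
# Characters removed from every review: backslash, single quote, double quote.
ROGUE_CHARS = str.maketrans('', '', '\\\'"')


def strip_rogue_characters(text):
    # Remove literal backslash-n sequences (two characters) first, so the
    # 'n' is not left behind once backslashes are stripped.
    text = text.replace('\\n', '')
    return text.translate(ROGUE_CHARS)


# Made-up sample containing an escaped quote, double quotes, and a literal \n.
sample = "'Mike\\'s Pizza was \"slow\".\\n'"
print(strip_rogue_characters(sample))
```

Stripping every apostrophe also turns contractions such as "Mike's" into "Mikes", which matches the cleaned output shown above.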

STEP 4: Export cleaned, formatted CSV

In [11]:
clean_df.to_csv('hw4_data.csv',index=False)
In [12]:
df = pd.read_csv('hw4_data.csv')
df[:5]
Out[12]:
lie sentiment review
0 f n Mikes Pizza High Point NY Service was very slo...
1 f n i really like this buffet restaurant in Marsha...
2 f n After I went shopping with some of my friend w...
3 f n Olive Oil Garden was very disappointing. I exp...
4 f n The Seven Heaven restaurant was never known fo...

STEP 5: Split df into data sets

LIE DFs

In [13]:
lie_df_f = df[df['lie'] == 'f']
lie_df_t = df[df['lie'] == 't']

SENTIMENT DFs

In [14]:
sent_df_n = df[df['sentiment'] == 'n']
sent_df_p = df[df['sentiment'] == 'p']
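The four filters above can also be expressed with `groupby`, which avoids hard-coding each label value. A minimal sketch on made-up rows; the frame `toy` and the dictionary names are illustrative only:

```python
import pandas as pd

# Made-up rows with the same three columns as the cleaned data.
toy = pd.DataFrame({
    'lie': ['f', 'f', 't', 't'],
    'sentiment': ['n', 'p', 'n', 'p'],
    'review': ['bad and fake', 'good and fake', 'bad and true', 'good and true'],
})

# One sub-frame per label value, keyed by that value.
lie_dfs = {label: group for label, group in toy.groupby('lie')}
sent_dfs = {label: group for label, group in toy.groupby('sentiment')}

# Every row should land in exactly one subset of each split.
print({label: len(group) for label, group in lie_dfs.items()})
```

Checking that the subset sizes add up to the full row count is a quick guard against unexpected label values.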

STEP 5b: Export each subset as a corpus of text files to run on the existing pipelines

In [15]:
def print_to_file(rating, review, num, title):
    # File name pattern: <rating>_<title>_<num>.txt
    output_filename = str(rating) + '_' + title + '_' + str(num) + '.txt'
    with open(output_filename, 'w') as outfile:
        outfile.write(review)

def export_to_corpus(df, subj, title):
    for num,row in enumerate(df['review']):
        print_to_file(subj, row, num, title)
In [16]:
export_to_corpus(sent_df_n, 'neg', 'hw4_n')
export_to_corpus(sent_df_p, 'pos', 'hw4_p')
In [17]:
export_to_corpus(lie_df_f, 'false', 'hw4_f')
export_to_corpus(lie_df_t, 'true', 'hw4_t')
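The export writes every file into the current working directory, and `enumerate` numbers files by position within each subset rather than by the original row index. A sketch of a variant that writes into a dedicated folder; `export_reviews`, the folder argument, and the sample reviews are assumptions for illustration:

```python
import tempfile
from pathlib import Path


def export_reviews(reviews, label, title, out_dir):
    """Write one text file per review and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for num, review in enumerate(reviews):
        # Same naming scheme as print_to_file: <label>_<title>_<num>.txt
        path = out_dir / f'{label}_{title}_{num}.txt'
        path.write_text(review, encoding='utf-8')
        paths.append(path)
    return paths


# Made-up reviews written to a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    written = export_reviews(['slow service', 'cold pizza'], 'neg', 'hw4_n', tmp)
    names = [p.name for p in written]
    first_text = written[0].read_text(encoding='utf-8')
    print(names)
```

Keeping each corpus in its own folder makes it easier to point a pipeline at one label at a time.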