Sentiment Analysis - Data Preprocessing and Feature Engineering: Part 4

In this part, we walk through the following data preprocessing steps:

1. Load data

In [6]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # show full column contents (-1 is deprecated)

import warnings
warnings.filterwarnings('ignore')  # suppress library warnings in the notebook output
In [34]:
df = pd.read_csv('data.csv')
In [35]:
df.head()
Out[35]:
title sentiment
0 Fed official says weak data caused by weather, should not slow taper -1
1 Fed's Charles Plosser sees high bar for change in pace of tapering 1
2 US open: Stocks fall after Fed official hints at accelerated tapering 1
3 Fed risks falling 'behind the curve', Charles Plosser says 0
4 Fed's Plosser: Nasty Weather Has Curbed Job Growth -1
In [36]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
title        2000 non-null object
sentiment    2000 non-null int64
dtypes: int64(1), object(1)
memory usage: 31.3+ KB
In [38]:
%matplotlib inline

df['sentiment'].plot(kind='hist')
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f43bbac46a0>

2. Remove stop words

In [43]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = stopwords.words('english')
In [44]:
def remove_stopwords(document):
    tokens = word_tokenize(document)
    words = [w for w in tokens if w not in stop_words]
    return " ".join(words)
In [45]:
remove_stopwords('This is a test for Natural language processing stopwords')
Out[45]:
'This test Natural language processing stopwords'
In [46]:
df['sentiment_1'] = df['title'].apply(lambda x: remove_stopwords(x))
In [47]:
df.head()
Out[47]:
title sentiment sentiment_1
0 Fed official says weak data caused by weather, should not slow taper -1 Fed official says weak data caused weather , slow taper
1 Fed's Charles Plosser sees high bar for change in pace of tapering 1 Fed 's Charles Plosser sees high bar change pace tapering
2 US open: Stocks fall after Fed official hints at accelerated tapering 1 US open : Stocks fall Fed official hints accelerated tapering
3 Fed risks falling 'behind the curve', Charles Plosser says 0 Fed risks falling 'behind curve ' , Charles Plosser says
4 Fed's Plosser: Nasty Weather Has Curbed Job Growth -1 Fed 's Plosser : Nasty Weather Has Curbed Job Growth

3. Remove special characters and expand abbreviations (e.g., what's -> what is)

In [76]:
import re

def remove_special_chars(text):
    # keep only purely alphabetic tokens; drop punctuation and anything containing a digit
    return " ".join(e for e in text.split() if e.isalnum() and not re.search(r"\d", e))

def remove_abbr_and_math_symbol(text):
    # Mathematical symbols
    text = re.sub(r"\+", " plus ", text)
    text = re.sub(r"\-", " minus ", text)
    text = re.sub(r"\*", " multiply ", text)
    text = re.sub(r"\=", " equal ", text)
    
    # Abbreviations / contractions
    text = re.sub(r"What's", "What is ", text)
    text = re.sub(r"Who's", "Who is ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"'m", " am ", text)
    
    # collapse the extra spaces introduced by the substitutions
    text = re.sub(r"\s+", " ", text).strip()
    return text
In [77]:
df['sentiment_2'] = df['sentiment_1'].apply(lambda x: remove_special_chars(x))
df['sentiment_3'] = df['sentiment_2'].apply(lambda x: remove_abbr_and_math_symbol(x))
In [78]:
df.head()
Out[78]:
title sentiment sentiment_1 sentiment_2 sentiment_3
0 Fed official says weak data caused by weather, should not slow taper -1 Fed official says weak data caused weather , slow taper Fed official says weak data caused weather slow taper Fed official says weak data caused weather slow taper
1 Fed's Charles Plosser sees high bar for change in pace of tapering 1 Fed 's Charles Plosser sees high bar change pace tapering Fed Charles Plosser sees high bar change pace tapering Fed Charles Plosser sees high bar change pace tapering
2 US open: Stocks fall after Fed official hints at accelerated tapering 1 US open : Stocks fall Fed official hints accelerated tapering US open Stocks fall Fed official hints accelerated tapering US open Stocks fall Fed official hints accelerated tapering
3 Fed risks falling 'behind the curve', Charles Plosser says 0 Fed risks falling 'behind curve ' , Charles Plosser says Fed risks falling curve Charles Plosser says Fed risks falling curve Charles Plosser says
4 Fed's Plosser: Nasty Weather Has Curbed Job Growth -1 Fed 's Plosser : Nasty Weather Has Curbed Job Growth Fed Plosser Nasty Weather Has Curbed Job Growth Fed Plosser Nasty Weather Has Curbed Job Growth

4. Lemmatization

In [79]:
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer reduces words to their dictionary (lemma) form
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("sang", pos='v'), lemmatizer.lemmatize("singing", pos='v')
Out[79]:
('sing', 'sing')
In [80]:
def lemmatization(document):
    tokens = word_tokenize(document)
    words = [w for w in tokens if w not in stop_words]
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in words]
    return " ".join(lemmas)
In [81]:
lemmatization('This is a test for Natural language processing stopwords')
Out[81]:
'This test Natural language process stopwords'
In [82]:
df['sentiment_4'] = df['sentiment_3'].apply(lambda x: lemmatization(x))
In [83]:
df.head()
Out[83]:
title sentiment sentiment_1 sentiment_2 sentiment_3 sentiment_4
0 Fed official says weak data caused by weather, should not slow taper -1 Fed official says weak data caused weather , slow taper Fed official says weak data caused weather slow taper Fed official says weak data caused weather slow taper Fed official say weak data cause weather slow taper
1 Fed's Charles Plosser sees high bar for change in pace of tapering 1 Fed 's Charles Plosser sees high bar change pace tapering Fed Charles Plosser sees high bar change pace tapering Fed Charles Plosser sees high bar change pace tapering Fed Charles Plosser see high bar change pace taper
2 US open: Stocks fall after Fed official hints at accelerated tapering 1 US open : Stocks fall Fed official hints accelerated tapering US open Stocks fall Fed official hints accelerated tapering US open Stocks fall Fed official hints accelerated tapering US open Stocks fall Fed official hint accelerate taper
3 Fed risks falling 'behind the curve', Charles Plosser says 0 Fed risks falling 'behind curve ' , Charles Plosser says Fed risks falling curve Charles Plosser says Fed risks falling curve Charles Plosser says Fed risk fall curve Charles Plosser say
4 Fed's Plosser: Nasty Weather Has Curbed Job Growth -1 Fed 's Plosser : Nasty Weather Has Curbed Job Growth Fed Plosser Nasty Weather Has Curbed Job Growth Fed Plosser Nasty Weather Has Curbed Job Growth Fed Plosser Nasty Weather Has Curbed Job Growth

5. TF-IDF

Suppose you have a corpus of documents about mountains. TF-IDF works in three parts:

i. Term Frequency (TF)

Term frequency simply counts how many times a term occurs in a single document. Let's take an example. The word 'is' appears in almost every document, so it tells you little about what any particular document is about. On the other hand, if the word 'Everest' appears, you can guess that the document is about mountains. So we need a way to reduce the weight of words like 'is'. One option is to take the reciprocal of the count, which shrinks the influence of very frequent words. But this creates another problem: a rare word such as 'stratovolcano' may occur only once or twice, so its reciprocal is 1 (or close to 1, the highest possible weight), even though it says little about mountains.

For now, assume the raw term counts are as follows:

tf("is") = 1000

tf("Everest") = 50

tf("stratovolcano") = 2

ii. Inverse Document Frequency (IDF)

Inverse document frequency handles this imbalance between rare and very frequent words. It downweights words that occur in many documents of the corpus by taking the logarithm of the total number of documents divided by the number of documents containing the word, i.e. IDF = log(total number of documents / number of documents containing the word).

Let's calculate IDF for the words above, assuming the corpus has 20 documents in total.

iDF("is") = log(20/20) = 0 , Since 'is' occurs in all the documents

iDF("Everest") = log(20/5) = 0.6, Since corpus is talking about 'Mountains'

iDF("stratovolcano") = log(10/1) = 1, Since stratovolcano occurs in one doc.

iii. TF-IDF

TF-IDF = TF * IDF

Therefore,

`TF-IDF("is") = 1000 * 0 = 0`

`TF-IDF("Everest") = 50 * 0.6 = 30`

`TF-IDF("stratovolcano") = 2 * 1.3 = 2.6`
In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer

sample_train_data = ['Dispute delays National Assembly formation process',
              'Country is looking to encourage entrepreneurs and startup process',
              'Airline fuel surcharges to go up from Tuesday'] 

tfidf_vec = TfidfVectorizer()
sample_train_tfidf = tfidf_vec.fit_transform(sample_train_data)

df1 = pd.DataFrame(sample_train_tfidf.toarray(), columns=tfidf_vec.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn
In [90]:
print(df1)
    airline       and  assembly   country    delays   dispute  encourage  \
0  0.000000  0.000000  0.423394  0.000000  0.423394  0.423394  0.000000    
1  0.000000  0.350139  0.000000  0.350139  0.000000  0.000000  0.350139    
2  0.363255  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    

   entrepreneurs  formation      from    ...           go        is   looking  \
0  0.000000       0.423394   0.000000    ...     0.000000  0.000000  0.000000   
1  0.350139       0.000000   0.000000    ...     0.000000  0.350139  0.350139   
2  0.000000       0.000000   0.363255    ...     0.363255  0.000000  0.000000   

   national   process   startup  surcharges        to   tuesday        up  
0  0.423394  0.322002  0.000000  0.000000    0.000000  0.000000  0.000000  
1  0.000000  0.266290  0.350139  0.000000    0.266290  0.000000  0.000000  
2  0.000000  0.000000  0.000000  0.363255    0.276265  0.363255  0.363255  

[3 rows x 21 columns]
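
Note that scikit-learn's TfidfVectorizer smooths the IDF term and L2-normalizes each row by default, so its numbers differ slightly from the hand calculation above. Finally, here is a sketch (the names headline_vec, X and y are mine, not from the original notebook) of vectorizing the cleaned sentiment_4 column the same way to build the feature matrix for a classifier:

# reuse the TfidfVectorizer import from the cell above
headline_vec = TfidfVectorizer()

# one row per headline, one column per vocabulary term (sparse matrix)
X = headline_vec.fit_transform(df['sentiment_4'])
y = df['sentiment']   # labels: -1, 0, 1

print(X.shape)   # (2000, vocabulary_size)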
