Natural Language Processing: How Are N-gram Models Used to Solve NLP Problems?

Natural language processing applies different methods to extract patterns and build knowledge bases from text data. The n-gram model is one such language model: we use the previous N-1 words (N being the size of the n-gram) to predict the next word.

Along with sequence prediction, the n-gram model is used for spelling correction (as in Google search), machine translation, and text summarization.

Math behind n-grams

The n-gram model is based on the idea of computing the probability of a sentence, i.e. a sequence of words:


P(W) = P(w1, w2, w3, ..., wn)

If we need to predict the upcoming word w4, we compute P(w4 | w1, w2, w3).


Here, we need to calculate the probability of a sequence of words, which can be represented as a joint probability and decomposed using the chain rule.

Conditional probability can be written as:

P(B | A) = P(A,B) / P(A)

=> P(A,B) = P (B | A) * P(A)

If we include more variables:

P(A,B,C,D,E) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) P(E|A,B,C,D)

Therefore, we use the chain rule to compute the joint probability of the words in a sentence.


Let us take a sentence: "I like green salad"

P(I like green salad) = P(I) P(like | I) P(green | I like) P(salad | I like green)

So, we can estimate these probabilities by simple counting, as follows:

P(I) = count('I') / total # of words

P(like | I) = count('I like') / count('I')

P(green | I like) = count('I like green') / count('I like')

P(salad | I like green) = count('I like green salad') / count('I like green')
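These counting estimates can be sketched in a few lines of Python with `collections.Counter`. The one-sentence corpus below is a made-up toy example, so most conditional probabilities come out to 1:

```python
from collections import Counter

# A tiny hypothetical corpus; a real model needs far more text.
tokens = "I like green salad".split()

unigrams = Counter(tokens)                  # count('I'), count('like'), ...
bigrams = Counter(zip(tokens, tokens[1:]))  # count('I like'), count('like green'), ...

p_i = unigrams["I"] / len(tokens)           # P(I) = count('I') / total words = 1/4
p = bigrams[("I", "like")] / unigrams["I"]  # P(like | I) = count('I like') / count('I')
print(p_i, p)
```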

But the problem here is that even a very large corpus will not contain every possible long word sequence, so the counts needed for these estimates are often zero or unreliable.

Bigram Model

In order to solve the above problem, we make an approximation: the probability of the next word depends only on the previous word (the Markov assumption).

i.e. P(I like green salad) = P(I) P(like | I) P(green | like) P(salad | green)

Here, we use only the previous word to estimate the probability of the current word.
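Under the Markov assumption, the bigram estimate of a sentence probability can be sketched as below; the function name and the toy corpus are my own illustration, not a standard API:

```python
from collections import Counter

def bigram_sentence_prob(sentence, corpus_tokens):
    """P(w1..wn) ~ P(w1) * product of P(wi | w(i-1)), estimated by counting."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    words = sentence.split()
    prob = unigrams[words[0]] / len(corpus_tokens)  # P(w1)
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history word
        prob *= bigrams[(prev, cur)] / unigrams[prev]  # P(cur | prev)
    return prob

tokens = "I like green salad".split()
result = bigram_sentence_prob("I like green salad", tokens)
print(result)  # P(I) * 1 * 1 * 1 = 0.25 on this one-sentence corpus
```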


Let us take an example corpus and calculate the probability of a word occurring next in a sequence of words.

Corpus :

a. I read a book about NLP.
b. I read a paper on n-grams.
c. If you need to expand your knowledge, you should read books daily.

After removing stopwords and lemmatizing (stemming), our sentences become:

 a. read book about NLP
 b. read paper n-grams
 c. If you need expand your knowledge, you should read book daily.

Suppose we need to find the probability of the word 'book' following 'read':

= (No. of times “read book” occurs) / (No. of times “read” occurs) 
= 2/3 
= 0.67

Suppose we need to find the probability of the word 'paper' following 'read':

= (No. of times “read paper” occurs) / (No. of times “read” occurs) 
= 1/3 
= 0.33

This means that when the user types 'read', our system (based on this corpus) will rank 'book' higher than 'paper' as the suggested next word.
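Putting the worked example into code, a toy next-word suggester over the same three preprocessed sentences might look like this (the `suggest` function and the lowercasing step are my own choices):

```python
from collections import Counter

# The three preprocessed sentences from the example above.
sentences = [
    "read book about NLP",
    "read paper n-grams",
    "if you need expand your knowledge you should read book daily",
]

unigrams = Counter()
bigrams = Counter()
for s in sentences:
    toks = s.lower().split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def suggest(word):
    """Rank candidate next words by the bigram estimate P(candidate | word)."""
    cands = {nxt: c / unigrams[word]
             for (prev, nxt), c in bigrams.items() if prev == word}
    return sorted(cands.items(), key=lambda kv: -kv[1])

print(suggest("read"))  # 'book' (2/3) ranks above 'paper' (1/3)
```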

