Natural language processing (NLP) applies a variety of methods to extract patterns and build knowledge from text data. The n-gram is one such language model: we use the previous N-1 words (N being the size of the n-gram) to predict the next word.

Along with sequence prediction, n-gram models are used for spelling correction (as in Google search), language translation, and text summarization.

#### Math behind n-grams

The n-gram model is based on the idea of computing the probability of a sentence, i.e. a sequence of words.

Mathematically,

P(W) = P(w1, w2, w3, ..., wn)

If we need to predict the upcoming word (w4) given the words so far:

P(w4 | w1, w2, w3)

Here, we need to calculate the probability of a whole sequence of words, which can be represented as a joint probability and decomposed using the `Chain Rule`.
Conditional probability can be written as:

P(B | A) = P(A,B) / P(A)

=> P(A,B) = P (B | A) * P(A)

If we include more variables:

P(A,B,C,D,E) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) P(E|A,B,C,D)

Therefore, we use the `Chain Rule` to compute the joint probability of the words in a sentence.
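As a quick sanity check, the chain-rule factorisation can be verified numerically on a toy distribution (a minimal sketch; the distribution below is invented for illustration):

```python
import itertools
import random

# Build an arbitrary joint distribution P(A, B, C) over three binary variables.
random.seed(0)
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
p = {o: w / total for o, w in zip(outcomes, weights)}

def marginal(*fixed):
    """Probability that the first len(fixed) variables take the given values."""
    return sum(pr for o, pr in p.items() if o[:len(fixed)] == fixed)

for a, b, c in outcomes:
    # Chain rule: P(A,B,C) = P(A) * P(B|A) * P(C|A,B)
    chain = marginal(a) * (marginal(a, b) / marginal(a)) * (p[(a, b, c)] / marginal(a, b))
    assert abs(chain - p[(a, b, c)]) < 1e-12
print("chain rule verified")
```

The conditional probabilities are computed exactly as in the formula above, so the product recovers the joint probability for every outcome.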

#### Example

Let us take a sentence: "I like green salad"

P(I like green salad) = P(I) P(like | I) P(green | I like) P(salad | I like green)

So, we can estimate the probability by using simple counting method as follows:

```
P(I) = count('I') / (total number of words)
P(like | I) = count('I like') / count('I')
P(green | I like) = count('I like green') / count('I like')
P(salad | I like green) = count('I like green salad') / count('I like green')
```
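These counting estimates can be sketched in Python (a minimal illustration; `sentence_probability` is a hypothetical helper name, whitespace tokenisation is assumed, and every required n-gram must actually occur in the corpus):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_probability(tokens, corpus_tokens):
    """Estimate P(w1..wk) = P(w1) * prod_i P(wi | w1..wi-1) by raw counts."""
    # P(w1) = count(w1) / total number of words
    prob = ngram_counts(corpus_tokens, 1)[(tokens[0],)] / len(corpus_tokens)
    for i in range(1, len(tokens)):
        history = tuple(tokens[:i])          # w1..wi-1
        extended = tuple(tokens[:i + 1])     # w1..wi
        prob *= (ngram_counts(corpus_tokens, i + 1)[extended]
                 / ngram_counts(corpus_tokens, i)[history])
    return prob

corpus = "I like green salad".split()
print(sentence_probability("I like green salad".split(), corpus))  # 0.25
```

On this one-sentence corpus every conditional probability is 1, so the result is just P(I) = 1/4.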

But the problem here is that we rarely have a corpus large enough to estimate counts of long word sequences reliably.

#### Bigram Model

To solve the above problem, we make an approximation: the probability of the next word depends only on the previous word (the Markov assumption).

i.e. P(I like green salad) = P(I) P(like | I) P(green | like) P(salad | green)

Here, we use only the previous word to estimate the probability of the current word.

#### Example

Let us take an example corpus and calculate the probability of a word occurring next in a sequence of words.

Corpus:

```
a. I read a book about NLP.
b. I read a paper on n-grams.
c. If you need to expand your knowledge, you should read books daily.
```

After removing stopwords and lemmatizing (stemming), our sentences become:

```
a. read book about NLP
b. read paper n-grams
c. If you need expand your knowledge, you should read book daily.
```

Suppose we need to find the probability of the word 'book' following 'read':

```
= (No. of times “read book” occurs) / (No. of times “read” occurs)
= 2/3
= 0.67
```

Suppose we need to find the probability of the word 'paper' following 'read':

```
= (No. of times “read paper” occurs) / (No. of times “read” occurs)
= 1/3
= 0.33
```

This means that when a user types `read`, the probability of suggesting the word `book` is higher than that of suggesting `paper` in our system (based on our corpus).
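The bigram estimates above can be reproduced with a short Python sketch (a minimal illustration; the name `bigram_prob` is made up, and the processed corpus is tokenised by plain whitespace splitting with punctuation removed):

```python
from collections import Counter

# Processed corpus from the example above.
sentences = [
    "read book about NLP".split(),
    "read paper n-grams".split(),
    "If you need expand your knowledge you should read book daily".split(),
]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def bigram_prob(word, prev):
    """Maximum-likelihood estimate P(word | prev) = count(prev word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(round(bigram_prob("book", "read"), 2))   # 0.67
print(round(bigram_prob("paper", "read"), 2))  # 0.33
```

'read' occurs three times in the corpus, 'read book' twice, and 'read paper' once, which yields the 2/3 and 1/3 estimates computed above.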