Naive Bayes Algorithm

KDAG IIT KGP
13 min read · Sep 25, 2021

“Probability theory is nothing but common sense reduced to calculation.”

Have you ever noticed emails being categorized into different buckets and automatically marked as important, spam, promotions, etc.? And if you have, has it ever piqued your curiosity as to how it happens? Isn’t it really helpful to see our machines smartly do this job and save our time? So, what does the machine really do?

Let’s take the example of spam emails. More often than not, we receive many unwanted emails, right? And most of them include words like, ‘Won!’, ‘Prize’, ‘Congratulations!’, ‘Bravo!’ etc. So, what the machine does is, scans through an email and tries to identify these specific words, and then classifies the email as spam. This is a kind of predictive classification problem that can be easily solved with the Naive Bayes algorithm.

In this article, we shall cover the following topics:

  1. Introduction to ‘Naive’ and ‘Bayes’
  2. Conditional Probability
  3. Total Probability Theorem and Bayes’ Rule
  4. Understanding the Algorithm
  5. Laplace Smoothing
  6. Logarithmic Probability
  7. Naive Bayes vs Logistic Regression
  8. Hyperparameter tuning
  9. Types of Naive Bayes (NB) Classifiers
  10. Advantages and Disadvantages
  11. Let’s begin coding!

Introduction to ‘Naive’ and ‘Bayes’

‘Naive’ means that the input features of our algorithm are independent of each other and thus, changing one or more input features will not affect the remaining ones. This is probably why it’s called Naive since the assumption that it makes might not always be true.

‘Bayes’ on the other hand refers to the Bayesian inference which suggests that the probability of an event depends on certain conditions and can vary accordingly.

The Naive Bayes Algorithm is based on the Bayes Theorem which calculates the conditional probability of an event.

So, let us take a moment to understand the conditional probability of the occurrence of an event.

Conditional Probability

You must have heard your teacher say, “I’ll correct your notebook given that you submit it on time.” What your teacher means is that he/she will correct the notebook based on a condition: that you must submit it on time. This brings us to the idea of conditional probability, which forms the base for Bayes’ Theorem.

So, we formally define conditional probability as the probability of an event or outcome occurring, given that another event has already occurred. It is calculated by dividing the probability of both events occurring together by the probability of the conditioning event: P(B|A) = P(A⋂B) / P(A).

Take the example of rolling a fair six-sided die. Let A be the event of getting an odd number and B the event of getting a value less than 4.

P(A) = probability of getting an odd number, i.e., 1, 3, or 5 = 3/6

P(B) = probability of getting a value < 4, i.e., 1, 2, or 3 = 3/6

P(A⋂B) = probability of getting an odd number that is less than 4, i.e., 1 or 3 = 2/6

Therefore, according to the formula, we get

P(B|A) = probability of getting a value < 4, given that the number is odd = (2/6) / (3/6) = 2/3

Let’s understand it in a different way. The condition given is that the value has to be an odd number, which reduces our sample space to {1, 3, 5} instead of {1, 2, 3, 4, 5, 6}. Now, among these three odd numbers, our favorable outcomes are 1 and 3 (value < 4). Since probability is the ratio of the number of favorable outcomes to the total number of outcomes, we get P(B|A) = 2/3.
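As a quick sanity check, a minimal Python sketch can enumerate the outcomes of a fair die and count its way to the same P(B|A):

```python
from fractions import Fraction

# Sample space of a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}

A = {x for x in sample_space if x % 2 == 1}   # odd numbers: event A
B = {x for x in sample_space if x < 4}        # values < 4: event B

P_A = Fraction(len(A), len(sample_space))             # 3/6
P_A_and_B = Fraction(len(A & B), len(sample_space))   # 2/6
P_B_given_A = P_A_and_B / P_A                         # (2/6) / (3/6) = 2/3

print(P_A, P_A_and_B, P_B_given_A)  # 1/2 1/3 2/3
```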

Now, let us recall some basics of probability:

P(x, y) = P(x|y) P(y) = P(y|x) P(x) (product rule)

P(x, y | z) = P(x|z) P(y|z) when x and y are conditionally independent given z, which is equivalent to saying P(x | y, z) = P(x | z).

Note the conditional independence of x and y given z: once z is known, knowing y tells us nothing more about x.

Total Probability Theorem and Bayes’ Rule

Total Probability Theorem

Let the events A_1, A_2, …, A_k partition the finite discrete sample space S for an experiment, and let B be an event defined on S. Then

P(B) = P(B|A_1) P(A_1) + P(B|A_2) P(A_2) + … + P(B|A_k) P(A_k)

Bayes’ Rule

For any event A_i of the partition and an observed event B with P(B) > 0,

P(A_i | B) = P(B | A_i) P(A_i) / P(B)

where the denominator P(B) is given by the total probability theorem above. The terms are read as follows:

P(B): Evidence, i.e., the probability of the observed event B, which forms the pre-existing condition of our inference.

P(A_i): Prior probability of A_i, i.e., the probability of the event before the evidence is seen.

P(A_i|B): Posterior probability of A_i, i.e., the conditional probability of event A_i after the evidence B is seen.

P(B|A_i): Likelihood, i.e., the probability of observing the evidence B assuming that it arose from the specific scenario A_i.

Thus, the above equation can be read in words as:

Posterior = (Likelihood × Prior) / Evidence
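To make the terms concrete, here is a small sketch that applies Bayes’ rule to the spam example from the introduction; the numbers (20% spam, word frequencies) are purely illustrative assumptions:

```python
# Illustrative numbers only (assumed for this sketch): 20% of mail is spam,
# and the word "Prize" appears in 60% of spam and 5% of non-spam mail.
p_spam = 0.20                      # prior P(A_1)
p_not_spam = 0.80                  # prior P(A_2)
p_word_given_spam = 0.60           # likelihood P(B|A_1)
p_word_given_not_spam = 0.05       # likelihood P(B|A_2)

# Evidence via the total probability theorem: P(B) = sum_i P(B|A_i) P(A_i)
p_word = p_word_given_spam * p_spam + p_word_given_not_spam * p_not_spam

# Bayes' rule: posterior P(A_1|B) = P(B|A_1) P(A_1) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.75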

Now, take a deep breath as we are all set to understand the idea behind the Naive Bayes Algorithm.

Understanding the Algorithm

Assumption: Each input feature makes an independent and equal contribution to the final outcome.

For a given feature vector X of size ‘n’ with independent attributes x_1, x_2, …, x_n, we predict the class Y_pred for the new instance as:

Y_pred = argmax_{y ∈ Y} P(y | x_1, x_2, …, x_n)

where Y is the set of all possible classes.

Then, on applying Bayes’ Theorem, we get:

Y_pred = argmax_{y ∈ Y} [P(x_1, x_2, …, x_n | y) P(y) / P(x_1, x_2, …, x_n)]

As P(x_1, x_2, …, x_n) is the same for every class, Y_pred can also be written as

Y_pred = argmax_{y ∈ Y} P(x_1, x_2, …, x_n | y) P(y)

But for computing P(x_1, x_2, …, x_n | y), we would need training data belonging to class y with this exact combination of x_1, x_2, …, x_n, and it is highly unlikely that we have encountered the same combination before! How do we calculate it then? Well, here comes the assumption to our rescue.

As per our assumption, since x_1, x_2, …, x_n are independent feature inputs given class y, we apply conditional independence to get:

P(x_1, x_2, …, x_n | y) = P(x_1|y) P(x_2|y) … P(x_n|y)

So, the updated Naive Bayes classifier is:

Y_pred = argmax_{y ∈ Y} P(y) P(x_1|y) P(x_2|y) … P(x_n|y)
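In code, the classifier is just this argmax over classes; the sketch below uses hand-picked, hypothetical priors and likelihoods purely to illustrate the computation:

```python
import math

# Hypothetical, hand-picked probabilities purely to illustrate the argmax;
# in practice these come from counting the training data.
priors = {"Yes": 0.6, "No": 0.4}
likelihoods = {
    "Yes": [0.3, 0.5, 0.7],   # P(x_1|Yes), P(x_2|Yes), P(x_3|Yes)
    "No":  [0.6, 0.2, 0.4],   # P(x_1|No),  P(x_2|No),  P(x_3|No)
}

def naive_bayes_score(y):
    # P(y) * product over i of P(x_i | y)
    return priors[y] * math.prod(likelihoods[y])

y_pred = max(priors, key=naive_bayes_score)
print({y: round(naive_bayes_score(y), 4) for y in priors}, "->", y_pred)
```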

Too much of probability! Don’t worry, let us understand the algorithm with the help of an example.

John enjoys playing tennis with his friend, Alice. So, he calls her up and asks if she is ready to go. Alice looks at the weather and decides whether she should play or not. Based on his previous experience, John has noted down the outcome for various weather conditions, which is reported in the table below.

Today, the outlook is sunny, the temperature is mild, humidity is normal, and windy is false. So, let us try to predict whether Alice would go out to play tennis with John or not. We have the testing data:

X_test = (Outlook = sunny, Temperature = mild, Humidity = normal, Windy = false)

The weather data summary is described in the table below —

Using the Naive Bayes algorithm, we compute P(Yes) P(X_test|Yes) from the summary table and, similarly, P(No) P(X_test|No). Comparing the two, we observe that

P(Yes) P(X_test|Yes) > P(No) P(X_test|No)

Yippee! Good day for John, and we predict the Play as Yes.

So far so good. The feature values of the test data that we assumed were all present in the given dataset. But that might not always be the case. Consider the example in which the outlook is cloudy, the temperature is hot, humidity is high, and windy is false, i.e., the test data is given as

X_test = (Outlook = cloudy, Temperature = hot, Humidity = high, Windy = false)

Note that cloudy is a new outlook value that is not present in our training dataset. Remember that in Naive Bayes we treat the input features as independent, but we still estimate their probabilities from the same training dataset while testing new samples.

Then, if we try to calculate the probability of Play = Yes, we need the likelihood P(cloudy|Yes) among the factors of P(X_test|Yes).

Since cloudy is not present in the training data, P(cloudy|Yes) = 0, which forces P(Yes) P(X_test|Yes) = 0, so we would never pick Yes. The same happens with No. That means whenever we encounter a completely new attribute value, the scores of all the classes become zero! How do we handle such cases?

Laplace Smoothing

Are you thinking, “such a complicated term this is! LAPLACE SMOOTHING!”? Well, it’s not so. It’s a pretty simple concept that solves the problem of zero probability. Have a read of this section and you will see how important this part is in the Naive Bayes implementation.

Laplace smoothing is used to solve the problem of zero probability in Naive Bayes. We add a smoothing value α to each of the counts so that the probability of an attribute value, given any class, never becomes zero. The probability is then calculated as:

P(x_i | y) = (count(x_i, y) + α) / (count(y) + α·k)

where k represents the number of dimensions (features) in the data. Every likelihood is now non-zero and has a reasonable value, so the problem of zero probability is resolved.

In this case, k = 4. Let us assume α = 1, which is generally preferred, since using higher alpha values pushes the likelihoods towards the same value for the Yes and No classes. Since such smoothed likelihoods no longer tell the classes apart, large α is not preferable.

Computing the smoothed likelihoods for both classes in the same way, we find that P(Yes|X_test) < P(No|X_test). Therefore, we predict the Play as No.
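Here is a minimal sketch of the whole worked example, assuming the classic 14-row play-tennis weather dataset (the article’s summary table is not reproduced here, so this dataset is an assumption) and the article’s convention of α = 1 and k = 4:

```python
from collections import Counter, defaultdict

# Assumed training data: the classic 14-row play-tennis weather dataset
# (outlook, temperature, humidity, windy) -> play
data = [
    ("sunny", "hot", "high", False, "No"),     ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"), ("rainy", "mild", "high", False, "Yes"),
    ("rainy", "cool", "normal", False, "Yes"), ("rainy", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"), ("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"), ("rainy", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"),  ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"), ("rainy", "mild", "high", True, "No"),
]

alpha, k = 1, 4  # smoothing value and k, following the article's convention

class_counts = Counter(row[-1] for row in data)
feature_counts = defaultdict(Counter)   # (feature_index, class) -> value counts
for *features, label in data:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def smoothed_likelihood(i, value, label):
    # (count(x_i, y) + alpha) / (count(y) + alpha * k)
    return (feature_counts[(i, label)][value] + alpha) / (class_counts[label] + alpha * k)

def score(x, label):
    p = class_counts[label] / sum(class_counts.values())   # prior P(y)
    for i, value in enumerate(x):
        p *= smoothed_likelihood(i, value, label)
    return p

x_test = ("cloudy", "hot", "high", False)   # "cloudy" never appears in training
scores = {label: score(x_test, label) for label in class_counts}
print(scores, "->", max(scores, key=scores.get))
```

With these assumptions, the smoothed score for No comes out larger than for Yes, matching the prediction above.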

Logarithmic Probability

Amid all these formulae, there is one thing of particular interest. Say we have a large number of input features; to calculate the final score we multiply all the conditional probabilities (each between 0 and 1), and the result is a very small value approaching 0.

Now, working with such small values can be difficult as it can lead to underflow of numeric precision and our algorithm may approximate the calculated value to 0! Hence, this is when the concept of logarithmic probability comes to our rescue where we take the log of the obtained value to avoid this problem.

That is, instead of using P(x_j | y_i) and P(y_i), we use ln P(x_j | y_i) and ln P(y_i) (ln means the natural logarithm, though any logarithm works).

This transformation works (and is rather elegant) because:

• A probability is always between 0 and 1 so a log probability is always between −∞ and 0 (a much bigger range of numbers than between 0 and 1).

• If a < b, then ln a < ln b.

This property of logarithms means that comparing probabilities to figure out which one is the biggest is equivalent to comparing the log probabilities.

• Also the math part of it works out rather well because the logarithm of a product is the sum of the individual log probabilities.

So, we can definitely conclude that log probabilities solve the precision problem that we may encounter during the normal calculations.
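A tiny experiment makes the underflow problem visible; the probabilities below are randomly generated stand-ins for per-feature likelihoods:

```python
import math
import random

random.seed(0)
# 1,000 made-up conditional probabilities, each between 0 and 1
probs = [random.uniform(0.01, 0.2) for _ in range(1000)]

product = 1.0
for p in probs:
    product *= p            # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)   # stays a perfectly ordinary float

print(product)    # 0.0  (numeric underflow)
print(log_sum)    # a large negative number (roughly -2450)
```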

Naive Bayes Vs Logistic Regression

Ummmm! I thought these two were used only for classification problems! And I read somewhere that both of these are supervised learning models. Then what’s the difference between them? Let us find out.

The learning mechanism differs a bit between the two models: Naive Bayes is a generative model, i.e., it models each class, while logistic regression is a discriminative model, i.e., it constructs the decision boundary directly. What does this mean?

Generative model: Naive Bayes models the joint distribution of the features X and the target Y, and then derives the posterior probability P(y|x) via Bayes’ rule.

Discriminative model: Logistic regression models the posterior probability P(y|x) directly, learning the input-to-output mapping by minimizing the error.

Hyperparameter tuning

Naive Bayes is a classification machine learning algorithm, and how well the trained model fits the dataset depends on the smoothing hyperparameter α introduced earlier.

In the previous section, where we learned Laplace smoothing, we saw that as we increase α (also called a hyperparameter), the likelihoods p(x_i|Yes) and p(x_i|No) are pushed towards the same value; they become biased away from what the data says. In this case, the problem of underfitting occurs.

Similarly, when the value of α is very small, the model gives importance to even very rare words or data points from the training data, and an overfitting problem occurs. Hence, to find a good value of α, we tune the model, cross-validating its performance for different hyperparameter values.
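One common way to tune α is a grid search with cross-validation, for example with scikit-learn’s GridSearchCV; the sketch below uses two 20 Newsgroups categories purely as a stand-in text dataset:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Any labeled text dataset works; two 20 Newsgroups categories are a stand-in here.
train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
param_grid = {"multinomialnb__alpha": [0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0]}

# 5-fold cross-validation over the candidate alpha values
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(train.data, train.target)

print(search.best_params_)           # the alpha with the best cross-validated score
print(round(search.best_score_, 3))  # mean accuracy for that alpha
```

GridSearchCV refits the pipeline on the whole training split with the best α, so the fitted `search` object can be used directly for prediction afterwards.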

Types of Naive Bayes classifiers

1. Multinomial Naive Bayes

Typically used when features represent the frequency of some events.

The model is a variant of Naive Bayes that is mostly used in Natural Language Processing (NLP). It uses the Bayes theorem to predict the tag of a text such as a piece of email or a newspaper article. And for every tag in a given sample, it calculates the probability and outputs the tag with the highest probability.

2. Bernoulli Naive Bayes

This model is used for discrete data and it works on Bernoulli distributions. Bernoulli Naive Bayes has the main characteristic that it only accepts binary features, such as true or false, yes or no, success or failure, 0 or 1. So, whenever the feature values are binary, we use Bernoulli Naive Bayes classifiers.

Now, let ‘p’ be the probability of success and ‘q’ the probability of failure, which equals (1 - p). Then for a binary random variable ‘X’ following the Bernoulli distribution:

P(X = x) = p^x · q^(1 - x), for x ∈ {0, 1}

The decision rule for Bernoulli Naive Bayes uses the likelihood

P(x_i | y) = P(i | y)·x_i + (1 - P(i | y))·(1 - x_i)

where i is an event (feature) and x_i must be binary (0 or 1).
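A minimal scikit-learn sketch, with hypothetical binary “word present” features standing in for real emails:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features: [contains "won", contains "prize", contains "meeting"]
X = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
])
y = np.array(["spam", "spam", "spam", "ham", "ham", "ham"])

clf = BernoulliNB(alpha=1.0)   # Laplace smoothing, as discussed above
clf.fit(X, y)

print(clf.predict([[1, 0, 1]]))        # predicted class for a new binary vector
print(clf.predict_proba([[1, 0, 1]]))  # posterior probabilities for each class
```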

3. Gaussian Naive Bayes

This model is commonly used when dealing with continuous data; the continuous values associated with each class are assumed to be distributed according to a normal distribution (or Gaussian distribution).

The likelihood used in the Gaussian Naive Bayes decision rule is:

P(x_i | y) = (1 / √(2πσ_y²)) · exp(-(x_i - μ_y)² / (2σ_y²))

where μ_y is the mean and σ_y is the standard deviation of the feature for class y, both of which we estimate from the data.
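The likelihood above is easy to verify by hand, and scikit-learn packages the whole classifier as GaussianNB; the iris dataset below is just a stand-in:

```python
import math
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

def gaussian_likelihood(x, mu, sigma):
    # P(x_i|y) = 1 / sqrt(2*pi*sigma^2) * exp(-(x_i - mu)^2 / (2*sigma^2))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

print(round(gaussian_likelihood(1.0, mu=0.0, sigma=1.0), 4))  # ~0.242

# The same idea packaged in scikit-learn, on the iris dataset (a stand-in here)
X, y = load_iris(return_X_y=True)
clf = GaussianNB().fit(X, y)
print(clf.predict(X[:3]), clf.score(X, y))   # first predictions and training accuracy
```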

Advantages and Disadvantages of Naive Bayes Classifier

Advantages

  1. This algorithm works quickly and can save a lot of time.
  2. Naive Bayes is suitable for solving multi-class prediction problems.
  3. If its assumption of the independence of features holds true, it can perform better than other models and requires much less training data.
  4. Naive Bayes is better suited for categorical input variables than numerical variables.

Disadvantages

  1. Naive Bayes assumes that all features are independent, which rarely happens in real life. This limits the practical applicability of the algorithm.
  2. The algorithm requires smoothing to handle feature values that appear in the testing data but were absent in the training data.
  3. The algorithm can give wrong estimations. Naive Bayes assumes that features contribute independently to the probability of a class, so when we apply it to a dataset whose features are correlated with each other, it still takes an equal, independent contribution from every feature, thereby giving a wrong estimate of the probability.

Also, Naive Bayes gives inaccurate probabilities when it encounters unknown features, i.e., features not present in the training dataset.

Let’s begin coding!

The code and dataset given below can be found here.

  1. Text Classification Using Naive Bayes (using sklearn):

Let’s have a look at one use case of the Naive Bayes Algorithm:

Given some emails and target labels (the category under which the content of the email falls) as training data, we want to classify the mails under these 20 categories.

The whole implementation can be found in this colab notebook. Have a good time going through it 🙂

Notebook Link
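Since the notebook itself is linked rather than reproduced, here is a minimal sketch of the kind of pipeline it covers, assuming the 20 Newsgroups dataset (which matches the 20 categories mentioned above); the α value here is an arbitrary choice:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Assumed dataset: 20 Newsgroups, which has exactly 20 target categories
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Turn raw text into TF-IDF features, then fit a Multinomial Naive Bayes model
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.1))
model.fit(train.data, train.target)

pred = model.predict(test.data)
print(round(accuracy_score(test.target, pred), 3))   # test-set accuracy

# Classify a new piece of text
sample = ["The rocket launch was delayed due to bad weather."]
print(train.target_names[model.predict(sample)[0]])
```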

2. Building Gaussian Naive Bayes from scratch

(the link to the dataset can be found here)

import math

Creating the required functions:
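The original functions live in the linked notebook; as a hedged stand-in, the sketch below shows one way the required pieces might look: per-class mean and standard deviation, the Gaussian likelihood, class summaries, and a log-probability prediction step (the tiny dataset at the end is made up purely for the demo):

```python
import math
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def stdev(values):
    # sample standard deviation
    mu = mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / (len(values) - 1))

def gaussian_pdf(x, mu, sigma):
    # Gaussian likelihood P(x|y) with the class's mean and standard deviation
    exponent = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return exponent / math.sqrt(2 * math.pi * sigma ** 2)

def summarize_by_class(X, y):
    # For each class, store (mean, stdev) of every feature, plus the class prior
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)
    summaries = {}
    for label, rows in grouped.items():
        stats = [(mean(col), stdev(col)) for col in zip(*rows)]
        summaries[label] = (stats, len(rows) / len(X))
    return summaries

def predict(summaries, row):
    # Use log probabilities to avoid underflow, then take the argmax class
    scores = {}
    for label, (stats, prior) in summaries.items():
        scores[label] = math.log(prior) + sum(
            math.log(gaussian_pdf(x, mu, sigma)) for x, (mu, sigma) in zip(row, stats)
        )
    return max(scores, key=scores.get)

# Tiny made-up dataset: two features, two classes
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.1, 4.0], [2.9, 4.2], [3.0, 3.9]]
y = [0, 0, 0, 1, 1, 1]
summaries = summarize_by_class(X, y)
print(predict(summaries, [1.1, 2.0]))  # expected: 0
print(predict(summaries, [3.0, 4.1]))  # expected: 1
```

The same `predict` function is what the prediction step in the next subsection relies on.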

Data Visualization and Prediction

Evaluation (Confusion Matrix and Accuracy)
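For the evaluation step, a self-contained sketch of a confusion matrix and accuracy computation might look like this; the label and prediction lists are illustrative:

```python
from collections import Counter

# Illustrative true labels and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

labels = sorted(set(y_true) | set(y_pred))
counts = Counter(zip(y_true, y_pred))

# Confusion matrix: rows are true labels, columns are predicted labels
matrix = [[counts[(t, p)] for p in labels] for t in labels]
for t, row in zip(labels, matrix):
    print(f"true {t}: {row}")

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)   # 0.8 for the lists above
```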

Conclusion

A sigh of relief! We have finally reached the conclusion, and kudos on making it this far. This was a comprehensive article on the Naive Bayes Algorithm, and we hope you got the idea behind this probabilistic classifier. Though it is very useful when the features are not highly correlated and are conditionally independent, there is still a need for a more robust model that gives accurate results in nearly all use cases. Stay tuned, it’s coming up soon in this blog series!

So, revise, code, and watch out for our next article!

You can follow us on:

  1. Facebook
  2. Instagram
  3. Linkedin
  4. YouTube

