**Logistic Regression**

What do you do when you see a “You have won a lottery” email? You prefer to report it as ‘spam’ rather than ignoring it.

The above image gives an overview of spam filtering. Plenty of emails arrive every day. Some of them go to the spam folder while the rest remain in the primary inbox. Also, emails get classified as primary, social, promotion, updates, forum.

*How do mail apps classify them?*

The blue box in the middle — **Machine Learning Model** decides which mail is spam and which is not. How can a model decide which mail is spam and which is not? There are many algorithms by which we can do this. One of them is **Logistic Regression**.

That is going to be the topic of today’s blog. Indeed, Logistic Regression is one of the most important analytic tools. Logistic Regression is a machine learning algorithm that resembles a **single-layer perceptron** with one output unit, the only difference being the introduction of non-linearity in the output.

*In this article, we are going to look at –*

- Sigmoid Function
- Getting familiar with Notations
- Cross-entropy Cost function (using Bernoulli distribution)
- Gradient Descent
- Evaluation metric
- Multiclass Classification

So, without further ado, let’s get started!

We have used a linear regression algorithm to try to predict y given x. However, there are lots of examples for which linear regression performs poorly. Therefore, Logistic Regression is used for classification. For classification, the output of the hypothesis function should be in the range [0,1]. The hypothesis function should also be able to represent a straight line that classifies data points.

How do we construct a hypothesis function which satisfies the above conditions?

**Sigmoid Function**

To construct such a hypothesis function, let’s change the form of our hypothesis function. We shall choose the sigmoid function or logistic function in this case.

The plot of the sigmoid function is shown below:

# Getting familiar with Notations

Logistics Regression solves problems by learning, from a training set, a vector of weights, and a bias term. Each weight ** wi** is a real number and is associated with one of the input features

*xi**s*. The weight

**represents how important that input feature is to the classification decision.**

*wi*To decide on a test instance — after we’ve learned the weights in training — the classifier first multiplies each ** xi** by its weight

**, sums up the weighted features, and adds the bias term**

*wi***. The resulting single number**

*b***expresses the weighted sum of the evidence for the class.**

*z*which is equivalent to:

We can get the output in the range (0,1) using the sigmoid function, but does this hypothesis function also represent a straight line that classifies points into two classes? The answer to this question is YES. Let’s see how.

So,

**Exploring Cost Function**

The problem with decision boundaries is we can get many decision boundaries by varying the values of **w** and **b**. How do we know which boundary is the best? How do we choose the best decision boundary? By looking at visualization, we can say which decision boundary is best, but how do we choose it mathematically?

Here comes the role of the loss function. We need a loss function to objectively evaluate different possible values of **w** and **b**, and arrive at the best decision boundary. The better the decision boundary the lower should be the loss value.

In Linear Regression, we have seen a convex cost function that would be easy to optimize. Similarly, can we use **Mean Squared Error** as the cost function for Logistic Regression? NO, because if MSE is used as the cost function in Logistic Regression, the cost function will be **non-convex** and the optimization algorithm might not be able to find the global minimum.

Now, it is clear that we cannot use MSE as a cost function. What else can we use? Is there any function that is convex and also can find the best decision boundary? Yes, there is a **cross-entropy loss function**.

Let’s derive the loss function. We would like to learn weights that maximize the probability of the correct label *p*(y|x). Since there are only two discrete outcomes (1 or 0), this is **Bernoulli distribution** and we can express the probability *p*(y|x) as the following-

Now, we take logs on both sides (For mathematical convenience).

The above equation describes log-likelihood that should be maximized. To make it a loss function, we have to minimize it to get the best boundary. Hence, we just flip the sign. The result of the cross-entropy loss function is -

**Gradient Descent**

Now that we can evaluate different parameter values and say which represents a better boundary as we have a convex cost function, can we use the Gradient Descent optimization to find the best model in Logistic Regression?

Yes, we can use Gradient Descent for optimization in Logistic Regression too.

**Evaluation Metrics**

Done with gradient descent! But, how can we say that our model is performing well? We cannot use the cost function alone for evaluation as it is based on the probabilities of classes and not the final prediction. Can we use an accuracy metric here? Sure, we can, but the accuracy metric is not a good metric for imbalanced data. We might have to use **F1-score** in such cases.

**Here we are done with Binary Classification! But after all, the world is not binary!!**

**Multiclass Classification**

As we classified data points into two classes, can we do multiclass classification with Logistic Regression?

**Time to Code!**

Now, we will code this algorithm from scratch using a gradient descent algorithm.

The links to the dataset and code are attached herewith.

**Conclusion**

Logistic Regression is a generalized linear model that we can use to model or predict categorical outcome variables. It is a classical method mainly used for binary classification problems, however, it can also be used for multi-class classification problems as well with some modification. Logistic Regression is one of the most used models in ML. It is relatively fast compared to other supervised classification techniques such as **kernel SVM** or **ensemble methods** (see later in AI weekly series) but suffers to some degree in its accuracy. It tends to underperform in the cases when the decision boundary that we want is not linear.

So, revise, code, and watch out for our next article!

Follow us on: