Learn Logistic Regression in Machine Learning From Scratch

A descriptive walk-through for the curious mind

11 min read · Jan 17, 2025

In machine learning, logistic regression is used for classification: it models the probability that an observation belongs to one of two classes. It is among the most widely used algorithms for binary classification, and the sigmoid curve that maps inputs to probabilities is a large part of its appeal. In this article, we will learn logistic regression from scratch with a hands-on approach in R. Once you grasp the concept, you will see how approachable and valuable it is in machine learning.

Table of Contents

Logistic regression
Installing packages
Why logistic regression
Simple logistic regression
Multiple logistic regression
Assessing model accuracy
Model concerns
Feature interpretation

Logistic Regression

Linear regression is used to approximate the linear relationship between a continuous response variable and a set of predictor variables. However, when the response variable is binary (Yes/No), linear regression is not appropriate. Fortunately, analysts can turn to an analogous method, logistic regression, which is similar to linear regression in many ways. Here we explore the use of logistic regression for binary response variables.

Installing Packages

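# Helper packages
library(dplyr)
library(ggplot2)
library(rsample)
library(broom)   # provides tidy(), used later to summarize fitted models

dplyr: used for data manipulation in R.
ggplot2: used for data visualization in R.
rsample: used for data splitting in R.
broom: used for turning model output into tidy tables in R.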

# Modeling packages
library(caret)

caret: used for training and cross-validating models (including logistic regression) in R.

# Model interpretability packages
library(vip)

vip: used for variable importance plots in R.

To illustrate logistic regression concepts we will use the employee attrition data, where our intent is to predict the Attrition response variable, which is coded as "Yes"/"No".

As in the earlier guides, we set aside 30% of our data as a test set to assess our generalization error.

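# Convert ordered factors to plain (unordered) factors.
# Note: recent rsample releases may no longer ship this data set; if `attrition`
# is not found, data("attrition", package = "modeldata") should provide it.
df <- attrition %>% mutate_if(is.ordered, factor, ordered = FALSE)

# Create training (70%) and test (30%) sets for the attrition data.
set.seed(123) # for reproducibility
churn_split <- initial_split(df, prop = .7, strata = "Attrition")
churn_train <- training(churn_split)
churn_test <- testing(churn_split)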
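To see what the strata argument is doing, here is a minimal base-R sketch of a stratified 70/30 split. The toy data frame and its names are made up for illustration (it is not the attrition data), and rsample's own implementation may differ in detail. The idea is simply to sample 70% of the rows within each class, so both pieces keep the same class ratio.

```r
# Toy data: 80 "No" and 20 "Yes" outcomes (hypothetical, for illustration only)
toy <- data.frame(
  id = 1:100,
  Attrition = factor(rep(c("No", "Yes"), times = c(80, 20)))
)

set.seed(123)
# Sample 70% of the row indices within each class
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$Attrition),
  function(rows) sample(rows, size = round(0.7 * length(rows)))
))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions are the same in both sets
prop.table(table(toy_train$Attrition))
prop.table(table(toy_test$Attrition))
```

Without stratification, a rare class can end up over- or under-represented in the test set purely by chance.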

Why Logistic Regression

To provide a clear motivation for logistic regression, assume we have credit card default data for customers and we want to understand whether a customer's current credit card balance is an indicator of whether or not they will default on their credit card.

To classify a customer as a high- vs. low-risk defaulter based on their balance we could use linear regression; the left plot in our figure shows how linear regression would predict the probability of defaulting. Unfortunately, for balances close to zero we predict a negative probability of defaulting, and for very large balances we would get values bigger than 1. These predictions are not sensible, since the true probability of defaulting, regardless of credit card balance, must fall between 0 and 1. These inconsistencies only increase as our data become more imbalanced and the number of outliers increases. Contrast this with the logistic regression line (right plot), which is non-linear (sigmoidal-shaped).
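The problem described above is easy to reproduce on a tiny example. The sketch below uses a made-up binary outcome (not the credit card data) and compares fitted values from lm() and glm(); the data and object names are assumptions for illustration only.

```r
# Toy data (hypothetical): a binary outcome that becomes more likely as x grows
x <- 1:20
y <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1)

fit_lm    <- lm(y ~ x)                        # linear regression on a 0/1 response
fit_logit <- glm(y ~ x, family = "binomial")  # logistic regression

p_lm    <- predict(fit_lm)
p_logit <- predict(fit_logit, type = "response")

range(p_lm)     # extends below 0 and above 1
range(p_logit)  # stays strictly between 0 and 1
```

Even within the observed range of x, the straight line produces "probabilities" outside 0 to 1, while the logistic fit cannot.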

To avoid the inadequacies of the linear model fit on a binary response, we must model the probability of our response using a function that gives outputs between 0 and 1 for all values of 𝑋. Many functions meet this description. In logistic regression we use the logistic function, which produces the S-shaped curve in the right plot above:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \tag{1}$$

The 𝛽i parameters represent the coefficients as in linear regression, and 𝑝(𝑋) may be interpreted as the probability that the positive class (default in the above example) is present. The lower bound for 𝑝(𝑥) is approached as $\lim_{a \to -\infty} \left[ \frac{e^a}{1 + e^a} \right] = 0$ and the upper bound for 𝑝(𝑥) is approached as $\lim_{a \to \infty} \left[ \frac{e^a}{1 + e^a} \right] = 1$, which restricts the output probabilities to 0–1.
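These limits can be checked numerically. The sketch below defines a small helper, logistic() (a name chosen here for illustration), in the algebraically equivalent form 1 / (1 + exp(-a)), and compares it with base R's plogis().

```r
# Logistic (sigmoid) function: e^a / (1 + e^a), rewritten as 1 / (1 + e^(-a))
logistic <- function(a) 1 / (1 + exp(-a))

a <- c(-1000, -5, 0, 5, 1000)
logistic(a)   # approaches 0 on the left, equals 0.5 at a = 0, approaches 1 on the right

# Matches base R's built-in logistic distribution function
all.equal(logistic(a), plogis(a))
```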

Rearranging the logistic function yields the logit transformation, which is where logistic regression gets its name:

$$g(X) = \ln\left[\frac{p(X)}{1 - p(X)}\right] = \beta_0 + \beta_1 X$$

Comparing the predicted probabilities of linear regression (left) to logistic regression (right). Predicted probabilities using linear regression result in flawed logic, whereas predicted values from logistic regression will always lie between 0 and 1.

Applying a logit transformation to 𝑝(𝑋) results in a linear equation similar to the mean response in a simple linear regression model. Using the logit transformation also results in an intuitive interpretation for the magnitude of 𝛽1: the odds (e.g., of defaulting) increase multiplicatively by exp(𝛽1) for every one-unit increase in 𝑋.
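A short numerical check makes the odds interpretation concrete. The coefficients below are hypothetical values picked only for illustration; the point is that the log-odds are linear in x and that the odds ratio for a one-unit increase is the same at every x.

```r
# Hypothetical coefficients, chosen only for illustration
b0 <- -3
b1 <- 0.5

p    <- function(x) plogis(b0 + b1 * x)   # probability from the logistic function
odds <- function(x) p(x) / (1 - p(x))     # odds = p / (1 - p)

x <- c(0, 2, 7)
log(odds(x))             # linear in x: equals b0 + b1 * x
odds(x + 1) / odds(x)    # the same ratio at every x
exp(b1)                  # and that ratio equals exp(b1)
```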

Simple Logistic Regression

We will fit two logistic regression models in order to predict the probability of an employee attriting. The first predicts the probability of attrition based on their monthly income (MonthlyIncome) and the second is based on whether or not the employee works overtime (OverTime). The glm() function fits generalized linear models, a class of models that includes both logistic regression and simple linear regression as special cases. The syntax of the glm() function is similar to that of lm(), except that we must pass the argument family = "binomial" in order to tell R to run a logistic regression rather than some other type of generalized linear model. The default is family = "gaussian", which is equivalent to ordinary linear regression assuming normally distributed errors.

model1 <- glm(Attrition ~ MonthlyIncome, family = "binomial", data = churn_train)
model2 <- glm(Attrition ~ OverTime, family = "binomial", data = churn_train)
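As a quick check of the claim that the default gaussian family reproduces ordinary linear regression, the sketch below fits the same model with lm() and glm(). It uses R's built-in mtcars data purely as a convenient continuous-response example; it is not part of the attrition analysis.

```r
# Built-in mtcars data, used here only as a convenient continuous-response example
fit_lm  <- lm(mpg ~ wt, data = mtcars)
fit_glm <- glm(mpg ~ wt, data = mtcars)   # family defaults to gaussian

coef(fit_lm)
coef(fit_glm)
all.equal(coef(fit_lm), coef(fit_glm))    # same coefficient estimates
```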

In the background, glm() uses maximum likelihood (ML) estimation to estimate the unknown model parameters. The basic intuition behind using ML estimation to fit a logistic regression model is as follows: we seek estimates for 𝛽0 and 𝛽1 such that the predicted probability 𝑝̂(𝑋i) of attrition for each employee corresponds as closely as possible to the employee's observed attrition status. In other words, we try to find 𝛽̂0 and 𝛽̂1 such that plugging these estimates into the model for 𝑝(𝑋) yields a number close to one for all employees who attrited and a number close to zero for all employees who did not. This intuition can be formalized using a mathematical equation called a likelihood function:

$$\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(X_i) \prod_{i' : y_{i'} = 0} \left[ 1 - p(X_{i'}) \right]$$

The estimates 𝛽̂0 and 𝛽̂1 are chosen to maximize this likelihood function.
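To make the idea of maximizing a likelihood tangible, here is a minimal sketch that maximizes the log-likelihood directly with optim() and compares the result to glm(). The toy data and the helper names neg_loglik and neg_grad are assumptions for illustration; glm() itself uses a different algorithm (iteratively reweighted least squares) to reach the same maximum.

```r
# Toy data (hypothetical), with some overlap between the two classes
x <- 1:20
y <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1)

# Negative log-likelihood of the logistic model for beta = c(b0, b1)
neg_loglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

# Its gradient: minus the sum over observations of (y_i - p_i) * (1, x_i)
neg_grad <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)
  -c(sum(y - p), sum((y - p) * x))
}

fit_optim <- optim(c(0, 0), neg_loglik, gr = neg_grad, method = "BFGS",
                   control = list(maxit = 500, reltol = 1e-12))
fit_glm   <- glm(y ~ x, family = "binomial")

fit_optim$par   # estimates from direct maximization
coef(fit_glm)   # estimates from glm()
```

Minimizing the negative log-likelihood is equivalent to maximizing the likelihood, and working on the log scale turns the product into a sum that is easier to handle numerically.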

What results is the predicted probability of attrition. The figure illustrates the predicted probabilities for the two models.

Predicted probabilities of employee attrition based on monthly income (left) and overtime (right). As monthly income increases, model1 predicts a decreased probability of attrition, and if employees work overtime, model2 predicts an increased probability.

The tidy() output below shows the coefficient estimates and related information that result from fitting a logistic regression model to predict the probability of Attrition = Yes for our two models. Bear in mind that the coefficient estimates from logistic regression characterize the relationship between the predictor and response variable on a log-odds (i.e., logit) scale.

For model1, the estimated coefficient for MonthlyIncome is 𝛽̂1 = -0.000130, which is negative, indicating that an increase in MonthlyIncome is associated with a decrease in the probability of attrition. Similarly, for model2, employees who work OverTime are associated with an increased probability of attrition compared to those that do not work OverTime.

tidy(model1)

tidy(model2)

As we discussed earlier, it is easier to interpret the coefficients using an exp()
transformation.

exp(coef(model1))

exp(coef(model2))

Thus, the odds of an employee attriting in model1 are multiplied by roughly 1 (more precisely, about 0.9999, a very slight decrease) for every one-dollar increase in MonthlyIncome, whereas the odds of attriting in model2 increase multiplicatively by 4.081 for employees that work OverTime compared to those that do not.
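Because a one-dollar change in income is tiny, the per-dollar odds multiplier is almost indistinguishable from 1. The arithmetic below uses the MonthlyIncome coefficient reported earlier; rescaling to a $1,000 increase is an illustrative choice made here, not something taken from the model output.

```r
b1 <- -0.000130    # MonthlyIncome coefficient reported above (log-odds scale)

exp(b1)            # odds multiplier per $1 increase: about 0.99987
exp(1000 * b1)     # odds multiplier per $1,000 increase: about 0.878
```

On that scale, a $1,000 increase in monthly income is associated with roughly a 12% reduction in the odds of attrition.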

Many aspects of the logistic regression output are similar to those discussed for linear regression. For example, we can use the estimated standard errors to get confidence intervals, as we did for linear regression in our previous article.

confint(model1)

# MonthlyIncome -0.000185 -8.11e-05
confint(model2)
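For intuition about how standard errors become intervals, the sketch below builds a Wald 95% interval by hand on a toy data set (hypothetical data, not the attrition models) and checks it against confint.default(). As far as I know, confint() on a glm object uses profile likelihood instead, so its limits can differ slightly from the Wald interval.

```r
# Toy data (hypothetical)
x <- 1:20
y <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1)
fit <- glm(y ~ x, family = "binomial")

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Wald 95% interval: estimate +/- z * standard error
wald <- cbind(lower = est - qnorm(0.975) * se,
              upper = est + qnorm(0.975) * se)

wald
confint.default(fit)   # the same limits, computed by R
exp(wald)              # the same interval expressed on the odds scale
```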

Multiple Logistic Regression

We can also extend the logistic function so that we can predict a binary response using multiple predictors:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$

Let's go ahead and fit a model that predicts the probability of Attrition based on MonthlyIncome and OverTime. Our results show that both features are statistically significant at the 0.05 level, and the figure below illustrates common trends between MonthlyIncome and Attrition; however, working OverTime tends to nearly double the probability of attrition.

model3 <- glm(
  Attrition ~ MonthlyIncome + OverTime,
  family = "binomial",
  data = churn_train
)

tidy(model3)

Predicted probability of attrition based on monthly income and whether or not employees work overtime.

Assessing Model Accuracy

With a basic understanding of logistic regression under our belt, our concern now shifts, as it did with linear regression, to how well our models predict. As in the last guide, we will use caret::train() and fit three 10-fold cross-validated logistic regression models. Extracting the accuracy measures (in this case, classification accuracy), we see that both cv_model1 and cv_model2 had an average accuracy of 83.88%. However, cv_model3, which used all predictor variables in our data, achieved an average accuracy rate of 87.58%.

set.seed(123)
cv_model1 <- train(Attrition ~ MonthlyIncome, data = churn_train, method = "glm", family = "binomial", trControl = trainControl(method = "cv", number = 10))

set.seed(123)
cv_model2 <- train(Attrition ~ MonthlyIncome + OverTime, data = churn_train, method = "glm", family = "binomial", trControl = trainControl(method = "cv", number = 10))

# cv_model3 uses all predictor variables, as described in the text
set.seed(123)
cv_model3 <- train(Attrition ~ ., data = churn_train, method = "glm", family = "binomial", trControl = trainControl(method = "cv", number = 10))

# extract out-of-sample performance measures
summary(resamples(list(
  model1 = cv_model1,
  model2 = cv_model2,
  model3 = cv_model3
)))$statistics$Accuracy
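For readers curious about what 10-fold cross-validation does under the hood, here is a minimal base-R sketch on simulated data. The data, the 0.5 classification cutoff, and the variable names are assumptions for illustration; caret handles fold creation and bookkeeping for you and may differ in detail.

```r
set.seed(123)
# Simulated data (hypothetical): one predictor, binary outcome
n <- 200
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(-1 + 1.5 * x), "Yes", "No"))
dat <- data.frame(x = x, y = y)

k <- 10
fold <- sample(rep(1:k, length.out = n))   # random fold assignment

acc <- sapply(1:k, function(i) {
  # Fit on the other nine folds, predict on the held-out fold
  fit  <- glm(y ~ x, family = "binomial", data = dat[fold != i, ])
  prob <- predict(fit, newdata = dat[fold == i, ], type = "response")
  pred <- ifelse(prob > 0.5, "Yes", "No")
  mean(pred == dat$y[fold == i])           # accuracy on the held-out fold
})

mean(acc)   # cross-validated accuracy: the average over the 10 folds
```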

We can get a better understanding of our model's performance by assessing the confusion matrix. We can use caret::confusionMatrix() to compute a confusion matrix. We need to supply our model's predicted class and the actuals from our training data. The confusion matrix provides a wealth of information. In particular, we can see that although we do well predicting cases of non-attrition (note the high specificity), our model does particularly poorly predicting actual cases of attrition (note the low sensitivity).

By default, the predict() function predicts the response class for a caret model; however, you can change the type argument to predict the probabilities.

?caret::predict.train

# predict class
pred_class <- predict(cv_model3, churn_train)

# create confusion matrix
confusionMatrix(data = relevel(pred_class, ref = "Yes"), reference = relevel(churn_train$Attrition, ref = "Yes"))


One thing to point out: in the confusion matrix above you will note the metric No Information Rate: 0.839. This represents the proportion of the majority class (non-attrition) in our training data (table(churn_train$Attrition) %>% prop.table()). Consequently, if we simply predicted "No" for every employee we would still get an accuracy rate of 83.9%. Therefore, our goal is to maximize our accuracy rate over and above this no-information baseline while also trying to balance sensitivity and specificity.
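The metrics mentioned above can be computed by hand from a confusion matrix. The sketch below uses made-up predictions and actuals (not the attrition results), with "Yes" treated as the positive class.

```r
# Hypothetical predictions and actuals, with "Yes" as the positive class
actual <- factor(c(rep("Yes", 10), rep("No", 40)), levels = c("Yes", "No"))
pred   <- factor(c(rep("Yes", 3), rep("No", 7),    # 3 of 10 "Yes" cases caught
                   rep("Yes", 2), rep("No", 38)),  # 38 of 40 "No" cases correct
                 levels = c("Yes", "No"))

cm <- table(Predicted = pred, Actual = actual)
cm

sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # 3 / 10  = 0.30
specificity <- cm["No", "No"]   / sum(cm[, "No"])    # 38 / 40 = 0.95
accuracy    <- sum(diag(cm)) / sum(cm)               # 41 / 50 = 0.82
no_info     <- max(prop.table(table(actual)))        # 40 / 50 = 0.80
```

In this toy case the accuracy of 0.82 barely beats the no-information rate of 0.80, even though the model misses most of the positive cases, which is why accuracy alone can be misleading on imbalanced data.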

To that end, we plot the ROC curve, which is displayed in the figure. If we compare our simple model (cv_model1) to our full model (cv_model3), we see the lift achieved with the more accurate model.

# Load the ROCR package
library(ROCR)

# Compute predicted probabilities
m1_prob <- predict(cv_model1, churn_train, type = "prob")$Yes
m3_prob <- predict(cv_model3, churn_train, type = "prob")$Yes

# Compute ROC curve coordinates (TPR vs. FPR) for cv_model1 and cv_model3
perf1 <- prediction(m1_prob, churn_train$Attrition) %>% performance(measure = "tpr", x.measure = "fpr")
perf2 <- prediction(m3_prob, churn_train$Attrition) %>% performance(measure = "tpr", x.measure = "fpr")

# Plot ROC curves for cv_model1 and cv_model3
plot(perf1, col = "black", lty = 2)
plot(perf2, add = TRUE, col = "blue")
legend(0.8, 0.2, legend = c("cv_model1", "cv_model3"), col = c("black", "blue"), lty = 2:1, cex = 0.6)
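To show what sits behind an ROC curve and its AUC, here is a small base-R sketch on made-up scores and labels. It computes the curve by sweeping the threshold and then the area under it in two equivalent ways; ROCR does this work for you in the code above.

```r
# Hypothetical scores and labels (1 = positive class)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
label <- c(1,   1,   0,   1,   0,    0,   1,   0)

# ROC points: sweep the threshold from high to low
ord <- order(score, decreasing = TRUE)
tpr <- c(0, cumsum(label[ord] == 1) / sum(label == 1))
fpr <- c(0, cumsum(label[ord] == 0) / sum(label == 0))

# AUC by the trapezoidal rule
auc_trap <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# AUC as the share of (positive, negative) pairs that are ranked correctly
auc_rank <- mean(outer(score[label == 1], score[label == 0], ">"))

c(auc_trap, auc_rank)   # both 0.75
```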

Similar to linear regression, we can perform a PLS logistic regression to assess if reducing the dimension of our numeric predictors helps to improve accuracy. There are 16 numeric features in our data set, so this code performs a 10-fold cross-validated PLS model while tuning the number of principal components to use from 1–16. The optimal model uses 14 principal components, which is not reducing the dimension by much. Moreover, the mean accuracy of 0.876 is no better than the average CV accuracy of cv_model3 (0.876).

# Perform 10-fold CV on a PLS model tuning the number of PCs to
# use as predictors
set.seed(123)

cv_model_pls <- train(Attrition ~ ., data = churn_train, method = "pls", family = "binomial", trControl = trainControl(method = "cv", number = 10), preProcess = c("zv", "center", "scale"), tuneLength = 16)

ROC curve for cross-validated models 1 and 3. The increase in the AUC represents the 'lift' that we achieve with model 3.

# Model with the highest cross-validated accuracy
cv_model_pls$bestTune

# Plot cross-validated accuracy
ggplot(cv_model_pls)

Model Concerns

As with linear models, it is important to check the adequacy of the logistic regression model; in fact, this should be done for all parametric models. For linear models, the residuals played an important role in this check. Although not as common, residual analysis and diagnostics are equally important for generalized linear models. The problem is that there is no obvious way to define what a residual is for more general models. For instance, how might we define a residual in logistic regression when the outcome is either 0 or 1? Nonetheless, attempts have been made and a number of useful diagnostics can be constructed based on the idea of a pseudo residual; see, for example, Harrell (2015).

More recently, Liu and Zhang (2018) introduced the concept of surrogate residuals, which allows for residual-based diagnostic procedures and plots not unlike those in traditional linear regression (e.g., checking for outliers and mis-specified link functions). For an overview with examples in R, see the sure package.

Feature Interpretation

Similar to linear regression, once our preferred logistic regression model is identified, we need to interpret how the features are influencing the results. As with normal linear regression models, variable importance for logistic regression models can be computed using the absolute value of the 𝑧-statistic for each coefficient, albeit with the same issues previously discussed. Using vip::vip() we can extract our top 20 influential variables. The figure shows that OverTime is the most influential, followed by JobSatisfaction and EnvironmentSatisfaction.

vip(cv_model3, num_features = 20)
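The importance measure described above, the absolute 𝑧-statistic, can be pulled straight from a fitted glm. The sketch below does this on simulated data with one strong and one weak predictor (the data and names are assumptions for illustration, not the attrition model).

```r
set.seed(123)
# Simulated data (hypothetical): x1 has a strong effect, x2 a weak one
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.5 * x1 + 0.1 * x2))

fit   <- glm(y ~ x1 + x2, family = "binomial")
coefs <- coef(summary(fit))

# Importance = |z| = |estimate / standard error|, excluding the intercept
importance <- abs(coefs[-1, "z value"])
sort(importance, decreasing = TRUE)
```

Because the 𝑧-statistic divides by the standard error, it reflects how precisely a coefficient is estimated as well as how large it is, so it is sensitive to sample size and to correlation among predictors.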

Top 20 most important variables for the full logistic regression model (cv_model3).

Similar to linear regression, logistic regression assumes a monotonic linear relationship. However, the linear relationship occurs on the logit scale; on the probability scale, the relationship will be nonlinear. This is illustrated by the partial dependence plot (PDP) in the figure, which shows the functional relationship between the predicted probability of attrition and the number of companies an employee has worked for (NumCompaniesWorked) while taking into account the average effect of all the other predictors in the model. Employees who have experienced more employment changes tend to have a high probability of making another change in the future.
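A partial dependence curve can be computed by hand, which helps clarify what "taking into account the average effect of all the other predictors" means. The sketch below uses simulated data and made-up names; for each grid value it overwrites one predictor for every row, predicts, and averages.

```r
set.seed(123)
# Simulated data (hypothetical): two predictors and a binary outcome
n   <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-1 + 1 * dat$x1 - 0.5 * dat$x2))

fit <- glm(y ~ x1 + x2, family = "binomial", data = dat)

# Partial dependence of the predicted probability on x1
grid <- seq(-2, 2, by = 0.5)
pd <- sapply(grid, function(g) {
  tmp <- dat
  tmp$x1 <- g   # fix x1 at the grid value for every row
  mean(predict(fit, newdata = tmp, type = "response"))   # average over the other predictors
})

data.frame(x1 = grid, avg_prob = pd)
```

The averaged probabilities move monotonically with x1 but not in equal steps, which is the nonlinearity on the probability scale that the text describes.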

Furthermore, the PDPs for the top three categorical predictors (OverTime, JobSatisfaction, and EnvironmentSatisfaction) illustrate the change in predicted probability of attrition based on the employee's status for each predictor.

Partial dependence plots for the first four most important variables. We can see how the predicted probability of attrition changes for each value of the influential predictors.
