22 — Pandas Data Cleaning: Using Linear Regression To Identify Data Points With Significant Influence

A.I Hub
6 min read · Sep 13, 2024


Linear regression is a powerful statistical tool for uncovering relationships within data, but the real insights often lie not just in the overall trend but in the specific data points that exert significant influence on the model. Getting ready means preparing a clean dataset, understanding potential outliers, and setting up your linear regression analysis for success. How to do it involves using diagnostics such as Cook’s Distance or leverage to spot these influential points while keeping your model robust and interpretable. How it works comes down to analyzing each data point’s impact on the regression line and the overall predictions, so you can refine the model for accuracy. But there’s more: once you have identified these key data points, you can make an informed decision about whether to keep, adjust, or remove them to improve your analysis. See also advanced methods, such as robust regression and other diagnostic tools, that go beyond the basics to ensure you are making the most of your data analysis.

Table of Contents

  • Using Linear Regression To Identify Data Points With Significant Influence
  • Getting Ready
  • How To Do It
  • How It Works
  • There’s More

Using Linear Regression To Identify Data Points With Significant Influence

The remaining recipes in this chapter use statistical modeling to identify
outliers. The advantage of these techniques is that they are less dependent on
the distribution of the variable of concern and take more into account than
can be revealed in univariate or bivariate analyses. This allows us to identify
outliers that are not otherwise apparent. On the other hand, by taking more
factors into account, multivariate techniques may provide evidence that a
previously suspect value is actually within the expected range and carries
meaningful information. In this recipe, we use linear regression to identify
observations (rows) that have an outsized influence on models of a target, or
dependent, variable. This can indicate that one or more values for a few
observations are so extreme that they compromise model fit for all of the
other observations.

Getting Ready

The code in this recipe requires the matplotlib and statsmodels libraries.
You can install both with pip (the Python package manager) in a terminal
window, or in PowerShell on Windows. We will be working with data on total
COVID-19 cases and deaths per country.

pip install matplotlib
pip install statsmodels

How To Do It

We will use the statsmodels OLS method to fit a linear regression model of
total cases per million of the population. We then identify those countries that
have the greatest influence on that model.

  • Import pandas, matplotlib, and statsmodels, and load the COVID-19 case
    data.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
covidtotals = pd.read_csv("data/covidtotals720.csv")
covidtotals.set_index("iso_code", inplace=True)
  • Create an analysis file and generate descriptive statistics.

Get just the columns required for analysis. Drop any row with missing data for analysis columns.

xvars = ['pop_density','median_age','gdp_per_capita']
covidanalysis = covidtotals.loc[:,['total_cases_pm'] + xvars].dropna()
covidanalysis.describe()

Output:

       total_cases_pm  pop_density  median_age  gdp_per_capita
count             174          174         174             174
mean            2,200          208          30          18,795
std             3,964          642           9          19,527
min                 1            2          15             661
25%               199           36          22           4,454
50%               768           82          30          12,623
75%             2,666          207          39          27,114
max            35,795        7,916          48         116,936
  • Fit a linear regression model.

There are good conceptual reasons to believe that population density, median
age and GDP per capita may be predictors of total cases per million. We use
all three variables in our model.

def getlm(df):
    Y = df.total_cases_pm
    X = df[['pop_density', 'median_age', 'gdp_per_capita']]
    X = sm.add_constant(X)
    return sm.OLS(Y, X).fit()

lm = getlm(covidanalysis)
lm.summary()

Output:

                     coef    std err        t      P>|t|
--------------------------------------------------------
const           2591.8003    929.551    2.788      0.006
pop_density       -0.0364      0.394   -0.093      0.926
median_age      -104.0544     34.837   -2.987      0.003
gdp_per_capita     0.1482      0.017    8.869      0.000
  • Identify those countries with an outsized influence on the model.

Cook’s Distance values of greater than 0.5 should be scrutinized closely.

influence = lm.get_influence().summary_frame()
influence.loc[influence.cooks_d>0.5, ['cooks_d']]

Output:

          cooks_d
iso_code
QAT          4.30
SGP          6.11

covidanalysis.loc[influence.cooks_d>0.5]

Output:

          total_cases_pm  pop_density  median_age  gdp_per_capita
iso_code
QAT               35,795          227          32         116,936
SGP                7,826        7,916          42          85,535
  • Create an influence plot.

Countries with higher Cook’s Distance values have larger circles.

fig, ax = plt.subplots(figsize=(10,6))
sm.graphics.influence_plot(lm, ax = ax, criterion="cooks")
plt.show()

Output:

Influence plot, including countries with the highest Cook’s Distance
  • Run the model without the two outliers.

Removing these outliers, particularly Qatar, has a dramatic effect on the
model. The estimates for median_age and for the constant are no longer
significant.

covidanalysisminusoutliers = covidanalysis.loc[influence.cooks_d<0.5]
lm = getlm(covidanalysisminusoutliers)
lm.summary()

Output:

                     coef    std err        t      P>|t|
--------------------------------------------------------
const            901.1855    803.745    1.121      0.264
pop_density        1.8371      0.793    2.317      0.022
median_age       -23.2250     31.229   -0.744      0.458
gdp_per_capita     0.0828      0.016    5.079      0.000

This gives us a sense of the countries that are most unlike the others in terms
of the relationship between demographic variables and total cases per million
in population.

How It Works

Cook’s Distance is a measure of how much each observation influences the
model. The large impact of the two outliers is confirmed when we rerun the
model without them. The question for the analyst is whether outliers such as
these add important information or distort the model and limit its
applicability. The coefficient of -104 for median_age in the first regression
results indicates that each one-year increase in median age is associated with
a 104-point reduction in cases per million people. But this seems largely due
to the model trying to fit a quite extreme total cases per million value for
Qatar. Without Qatar, the coefficient on age is no longer significant. The
P>|t| value in the regression output tells us whether the coefficient is
significantly different from 0. In the first regression, the coefficients for
median_age and gdp_per_capita are significant at the 99% level; that is, the
P>|t| value is less than 0.01. Only gdp_per_capita is significant when the
model is run without the two outliers, though pop_density approaches
significance there.

There’s More

We run a linear regression model in this recipe not so much because we are
interested in the parameter estimates of the model, but because we want to
determine whether there are observations with potentially outsized influence
on any multivariate analysis we might conduct. That definitely seems to be
true in this case. Often it makes sense to remove the outliers, as we have done
here, but that is not always true. When we have independent variables that do
a good job of capturing what makes outliers different, then the parameter
estimates for the other independent variables are less vulnerable to distortion.
We might also consider transformations, such as the log transformation we
did in a previous recipe and the scaling we will do in the next two recipes.
An appropriate transformation, given your data, can reduce the influence of
outliers by limiting the size of residuals at the extremes.

Conclusion

In conclusion, using linear regression to identify data points with significant influence is a powerful way to improve the accuracy and reliability of your predictive models. It not only helps in spotting outliers and high-leverage points, but also shows how certain data points can skew results or carry critical information. By getting ready with a clear understanding of your dataset and applying the right diagnostics, you can uncover influential points and refine your analysis. Understanding how it works, from fitting the model to analyzing residuals and leverage, lets you make better-informed decisions from your data. And there is always more to explore: linear regression is just the beginning of deeper statistical insight. Continue your journey with more advanced methods, such as ridge regression or robust regression, to further sharpen your skills and improve model accuracy.
