19 — Pandas Data Cleaning: Identifying Outliers With One Variable

A.I Hub
10 min read · Sep 12, 2024


Outliers can skew your data, distort results, and lead you to inaccurate conclusions, making it crucial to identify them early in any analysis. Whether you are dealing with financial figures, customer behavior, or scientific measurements, pinpointing outliers is an essential first step toward cleaner data. In this guide, we will walk you through the process of identifying outliers with just one variable, ensuring you are prepared to spot anomalies with precision. From setting the groundwork, to executing the right techniques, to understanding the mechanisms behind the scenes, this article covers it all. But there’s more: beyond identifying outliers, we will explore additional resources and tools to fine-tune your approach, giving you a well-rounded grasp of outlier detection. Ready to elevate your data skills? Let’s get started.

Table of Contents

  • Identifying outliers with one variable
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Identifying Outliers With One Variable

The concept of an outlier is somewhat subjective, but it is closely tied to the properties of a particular distribution: its central tendency, spread, and shape. We make assumptions about whether a value is expected or unexpected based on how likely we are to get that value given the variable’s distribution. We are more inclined to view a value as an outlier if it is multiple standard deviations away from the mean and it is from a distribution that is approximately normal; that is, one that is symmetrical (has low skew) and has relatively skinny tails (low kurtosis). This becomes clear if we imagine trying to identify outliers from a uniform distribution. There is no central tendency and there are no tails. Each value is equally likely. If, for example, Covid cases per country were uniformly distributed, with a minimum of 1 and a maximum of 10,000,000, neither 1 nor 10,000,000 would be considered an outlier.

We need to understand how a variable is distributed, then, before we can identify outliers. Several Python libraries provide tools to help us understand how variables of interest are distributed. We use a couple of them in this recipe to identify when a value is sufficiently out of range to be of concern.
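To see why the standard deviation rule breaks down without a central tendency and tails, here is a minimal sketch using synthetic numpy data; the samples and seed are illustrative and not part of the Covid dataset.

import numpy as np

# The "multiple standard deviations from the mean" rule flags points in a
# normal sample, but can never flag anything in a uniform sample, where no
# value sits farther than about 1.73 standard deviations from the mean.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0, scale=1, size=100_000)
uniform = rng.uniform(low=1, high=10_000_000, size=100_000)

for name, sample in [("normal", normal), ("uniform", uniform)]:
    z = (sample - sample.mean()) / sample.std()
    print(f"{name}: {(abs(z) > 3).mean():.4%} of values beyond 3 SDs")

# Expect roughly 0.27% for the normal sample and exactly 0% for the uniform one.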

Getting Ready

You will need the matplotlib, statsmodels, and scipy libraries, in addition to pandas and numpy, to run the code in this recipe. You can install matplotlib, statsmodels, and scipy with pip, Python’s package manager. We continue to work with the Covid cases dataset.

pip install matplotlib
pip install statsmodels
pip install scipy

How To Do It

We take a good look at the distribution of some of the key continuous variables in the Covid data. We examine the central tendency and shape of the distribution, generating measures and visualizations of normality.

  • Load the pandas, numpy, matplotlib, statsmodels, and scipy libraries, read the Covid case data file, and set up the Covid case and demographic column lists.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as scistat

covidtotals = pd.read_csv("data/covidtotals720.csv")
covidtotals.set_index("iso_code", inplace=True)
totvars = ['location','total_cases', 'total_deaths','total_cases_pm', 'total_deaths_pm']
demovars = ['population','pop_density', 'median_age','gdp_per_capita', 'hosp_beds']
  • Get descriptive statistics for the Covid case data. Create a data frame with just the key case data.
covidtotalsonly = covidtotals.loc[:, totvars]
covidtotalsonly.describe()

Output:

total_cases total_deaths total_cases_pm \
count 209 209 209
mean 60,757 2,703 2,297
std 272,440 11,895 4,040
min 3 0 1
25% 342 9 203
50% 2,820 53 869
75% 25,611 386 2,785
max 3,247,684 134,814 35,795
total_deaths_pm
count 209
mean 74
std 156
min 0
25% 3
50% 15
75% 58
max 1,238
  • Show more detailed percentile data.
covidtotalsonly.quantile(np.arange(0.0, 1.1, 0.1))

Output:

total_cases total_deaths total_cases_pm \
0.00 3.00 0.00 1.23
0.10 63.60 0.00 63.33
0.20 231.20 3.60 144.82
0.30 721.60 14.40 261.51
0.40 1,324.40 28.40 378.78
0.50 2,820.00 53.00 868.87
0.60 6,695.60 116.60 1,398.33
0.70 14,316.40 279.00 2,307.93
0.80 40,245.40 885.20 3,492.31
0.90 98,632.80 4,719.00 5,407.65
1.00 3,247,684.00 134,814.00 35,795.16
total_deaths_pm
0.00 0.00
0.10 0.00
0.20 1.24
0.30 3.76
0.40 7.02
0.50 15.22
0.60 29.37
0.70 47.73
0.80 76.28
0.90 201.42
1.00 1,237.55

Also show skewness and kurtosis. Skewness and kurtosis describe how symmetrical the distribution is and how fat its tails are, respectively. Both measures are significantly higher than we would expect if our variables were distributed normally.

covidtotalsonly.skew()

Output:

total_cases 9.33
total_deaths 8.13
total_cases_pm 4.28
total_deaths_pm 3.91
dtype: float64

covidtotalsonly.kurtosis()

Output:

total_cases 99.15
total_deaths 79.38
total_cases_pm 26.14
total_deaths_pm 19.44
dtype: float64
  • Test the Covid data for normality.

Use the Shapiro-Wilk test from the scipy library. Print out the p-value from
the test. The null hypothesis of a normal distribution can be rejected at the
95% level at any p-value below 0.05.

def testnorm(var, df):
    stat, p = scistat.shapiro(df[var])
    return p

print("total cases: %.5f" % testnorm("total_cases", covidtotalsonly))
print("total deaths: %.5f" % testnorm("total_deaths", covidtotalsonly))
print("total cases pm: %.5f" % testnorm("total_cases_pm", covidtotalsonly))
print("total deaths pm: %.5f" % testnorm("total_deaths_pm", covidtotalsonly))

Output:

total cases: 0.00000
total deaths: 0.00000
total cases pm: 0.00000
total deaths pm: 0.00000
  • Show normal quantile-quantile plots (qqplots) of total cases and total
    cases per million.

The straight lines show what the distributions would look like if they were
normal.

sm.qqplot(covidtotalsonly[['total_cases']].sort_values(['total_cases']), line='s')
plt.title("QQ Plot of Total Cases")

sm.qqplot(covidtotals[['total_cases_pm']].sort_values(['total_cases_pm']), line='s')
plt.title("QQ Plot of Total Cases Per Million")
plt.show()

Output:

Distribution of Covid cases compared with a normal distribution

Even when adjusted by population with the total cases per million column,
the distribution is substantially different from normal.

Distribution of Covid cases per million compared with a normal distribution
  • Show the outlier range for total cases.

One way to define an outlier for a continuous variable is by distance above the third quartile or below the first quartile. If that distance is more than 1.5 times the interquartile range (the distance between the first and third quartiles), that value is considered an outlier. In this case, since only zero or positive values are possible, any total cases value above 63,514 is considered an outlier.

thirdq, firstq = covidtotalsonly.total_cases.quantile(0.75), \
  covidtotalsonly.total_cases.quantile(0.25)
interquartilerange = 1.5*(thirdq-firstq)
outlierhigh, outlierlow = interquartilerange+thirdq, firstq-interquartilerange
print(outlierlow, outlierhigh, sep=" <--> ")

Output:

-37561.5 <--> 63514.5
  • Generate a data frame of outliers and write it to Excel.

Iterate over the four Covid case columns. Calculate the outlier thresholds for
each column as we did in the previous step. Select from the data frame those
rows above the high threshold or below the low threshold. Add columns that
indicate the variable examined (varname) for outliers and the threshold
levels.

def getoutliers():
    dfout = pd.DataFrame(columns=covidtotals.columns, data=None)
    for col in covidtotalsonly.columns[1:]:
        thirdq, firstq = covidtotalsonly[col].quantile(0.75), \
          covidtotalsonly[col].quantile(0.25)
        interquartilerange = 1.5*(thirdq-firstq)
        outlierhigh, outlierlow = interquartilerange+thirdq, \
          firstq-interquartilerange
        df = covidtotals.loc[(covidtotals[col]>outlierhigh) |
          (covidtotals[col]<outlierlow)]
        df = df.assign(varname=col, threshlow=outlierlow,
          threshhigh=outlierhigh)
        dfout = pd.concat([dfout, df])
    return dfout

outliers = getoutliers()
outliers.varname.value_counts()

Output:

total_deaths 36
total_cases 33
total_deaths_pm 28
total_cases_pm 17
Name: varname, dtype: int64

outliers.to_excel("views/outlierscases.xlsx")
  • Look a little more closely at outliers for cases per million.

Use the varname column we created in the previous step to select the outliers for total_cases_pm. Also show columns (pop_density and gdp_per_capita) that might help to explain the extreme values, along with the quartiles for those columns.

outliers.loc[outliers.varname=="total_cases_pm",
  ['location','total_cases_pm','pop_density','gdp_per_capita']].\
  sort_values(['total_cases_pm'], ascending=False)

Output:

location total_cases_pm pop_density \
QAT Qatar 35,795 227
SMR San Marino 21,038 557
BHR Bahrain 19,082 1,936
CHL Chile 16,323 24
VAT Vatican 14,833 NaN
KWT Kuwait 12,658 232
AND Andorra 11,066 164
OMN Oman 10,711 15
ARM Armenia 10,594 103
PAN Panama 10,274 55
USA United States 9,812 36
PER Peru 9,787 25
BRA Brazil 8,656 25
SGP Singapore 7,826 7,916
LUX Luxembourg 7,735 231
SWE Sweden 7,416 25
BLR Belarus 6,854 47
gdp_per_capita
QAT 116,936
SMR 56,861
BHR 43,291
CHL 22,767
VAT NaN
KWT 65,531
AND NaN
OMN 37,961
ARM 8,788
PAN 22,267
USA 54,225
PER 12,237
BRA 14,103
SGP 85,535
LUX 94,278
SWE 46,949
BLR 17,168

covidtotals[['pop_density','gdp_per_capita']].quantile([0.25,0.5,0.75])

Output:

pop_density gdp_per_capita
0.25 37.42 4,485.33
0.50 87.25 13,031.53
0.75 213.54 27,882.13
  • Show a histogram of total cases.
plt.hist(covidtotalsonly['total_cases']/1000, bins=7)
plt.title("Total Covid Cases (thousands)")
plt.xlabel('Cases')
plt.ylabel("Number of Countries")
plt.show()

Output:

Histogram of total Covid cases
  • Perform a log transformation of the Covid data. Show a histogram of the
    log transformation of total cases.
covidlogs = covidtotalsonly.copy()
for col in covidtotalsonly.columns[1:]:
    covidlogs[col] = np.log1p(covidlogs[col])

plt.hist(covidlogs['total_cases'], bins=7)
plt.title("Total Covid Cases (log)")
plt.xlabel('Cases')
plt.ylabel("Number of Countries")
plt.show()

Output:

Histogram of total Covid cases with log transformation

The tools we used in the preceding steps tell us a fair bit about how Covid cases and deaths are distributed, and about where outliers are located.

How It Works

The percentile data shown in step 3 reflect the skewness of the cases and deaths data. If, for example, we look at the range of values between the 20th and 30th percentiles, and compare it with the range from the 70th to the 80th percentiles, we see that the range is much greater in the higher percentiles for each variable. This is confirmed by the very high values for skewness and kurtosis, compared with values of 0 for a normal distribution. (pandas reports excess kurtosis, which subtracts the normal distribution’s conventional kurtosis of 3, so the normal baseline for both measures is 0.)
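As a quick check on that claim, here is a minimal sketch, reusing the covidtotalsonly data frame loaded in step 1, that compares the two percentile bands directly:

# Width of the 20th-30th percentile band versus the 70th-80th band.
# With right-skewed data, the upper band is far wider; for total_cases
# the ratio works out to roughly 50 to 1 in this dataset.
quantiles = covidtotalsonly.quantile([0.2, 0.3, 0.7, 0.8], numeric_only=True)
lowrange = quantiles.loc[0.3] - quantiles.loc[0.2]
highrange = quantiles.loc[0.8] - quantiles.loc[0.7]
print(highrange / lowrange)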
We run formal tests of normality in step 4, which indicate that the distributions of the Covid variables are not normal at high levels of significance. This is consistent with the qqplots we run in step 5. The distributions of both total cases and total cases per million differ significantly from normal, as represented by the straight line. Many cases hover around zero, and there is a dramatic increase in slope at the right tail.

We identify outliers in steps 6 and 7. Using 1.5 times the interquartile range to determine outliers is a reasonable rule of thumb. I like to output those values to an Excel file, along with associated data, to see what patterns I can detect in the data. This often leads to more questions, of course. We will try to answer some of them in the next recipe, but one question we can consider now is what accounts for the countries with high cases per million, displayed in step 8. Some of the countries with extreme values are very small in terms of land mass, so perhaps population density matters. But half of the countries on this list are near or below the 75th percentile in population density. On the other hand, most countries on this list are above the 75th percentile in GDP per capita. It is worth exploring these bivariate relationships further, which we do in subsequent recipes.

Our identification of outliers in step 7 assumes a normal distribution, an assumption that we have shown to be unwarranted. Looking again at the distribution in step 9, it seems much more like a log-normal distribution, with values clustered around 0 and a right skew. We transform the data in step 10 and plot the results of the transformation.

There’s More

We could also have used standard deviation, rather than the interquartile range, to identify outliers in steps 6 and 7; a sketch of that approach follows.
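Here is a minimal sketch of that alternative, reusing covidtotalsonly from above; the 3-standard-deviation cutoff is a common convention rather than anything prescribed by this recipe.

# Flag total_cases values more than 3 standard deviations from the mean.
mean = covidtotalsonly.total_cases.mean()
std = covidtotalsonly.total_cases.std()
outlierlow, outlierhigh = mean - 3*std, mean + 3*std
print(outlierlow, outlierhigh, sep=" <--> ")

sdoutliers = covidtotalsonly.loc[
    (covidtotalsonly.total_cases < outlierlow) |
    (covidtotalsonly.total_cases > outlierhigh)]

With data this skewed, the standard deviation itself is inflated by the extreme values, so this approach flags far fewer countries than the interquartile range approach did in step 7.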
I should add here that outliers are not necessarily data collection or measurement errors, and we may or may not need to make adjustments to the data. However, extreme values can have a meaningful and persistent impact on our analysis, particularly with small datasets like this one.

The overall impression we should have of the Covid case data is that it is relatively clean; that is, there are not many invalid values, narrowly defined. Looking at each variable independently of how it moves with other variables does not identify much that screams out as a clear data error. However, the distribution of the variables is quite problematic statistically. Building statistical models dependent on these variables will be complicated, as we might have to rule out parametric tests.

It is also worth remembering that our sense of what constitutes an outlier is shaped by our assumption of a normal distribution. If, instead, we allow our expectations to be guided by the actual distribution of the data, we have a different understanding of extreme values. If our data reflects a social, biological, or physical process that is inherently not normally distributed (uniform, logarithmic, exponential, Weibull, Poisson, and so on), our sense of what constitutes an outlier should adjust accordingly.

See Also

Box plots might also have been illuminating here; a quick sketch follows. We do a few box plots on this data in section 5, Using Visualizations for Exploratory Data Analysis. We explore bivariate relationships in this same dataset in the next recipe, for any insights they might provide about outliers and unexpected values. In subsequent sections, we consider strategies for imputing values for missing data and for making adjustments to extreme values.
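As a preview, here is a minimal box plot sketch using matplotlib and the covidtotalsonly data frame from this recipe. matplotlib's default whiskers use the same 1.5 times the interquartile range rule applied in steps 6 and 7, so the flier points correspond to the outliers identified there.

# Box plot of cases per million; points beyond the whiskers are
# the same outliers flagged by the IQR rule in steps 6 and 7.
plt.boxplot(covidtotalsonly.total_cases_pm.dropna())
plt.title("Box Plot of Total Cases Per Million")
plt.show()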

Conclusion

In conclusion, identifying outliers with one variable is a crucial step in ensuring data integrity and improving the accuracy of analyses. By recognizing and addressing outliers, businesses and analysts can prevent skewed insights and make more informed decisions. Getting ready involves proper data preparation, understanding the dataset, and setting the stage for meaningful analysis. Once the groundwork is laid, knowing how to do it becomes simpler, as it requires using statistical techniques like z-scores and the IQR, or visualization tools, to detect anomalies.

As we explored how it works, it became evident that outlier detection is an essential part of data analytics, helping businesses minimize risks and enhance the reliability of their models. But there’s more to this process: outliers often offer hidden insights that, when properly understood, can reveal new trends, risks, or opportunities. By going beyond simple removal or correction, businesses can turn outliers into actionable intelligence. For those looking to dive deeper, exploring advanced techniques, resources, and related methodologies can provide even greater clarity and power in the data-driven decision-making journey.
