20 — Pandas Data Cleaning: Identifying outliers and unexpected values in bivariate relationships
In data analysis, identifying outliers and unexpected values in bivariate relationships is more than a statistical exercise: it is about uncovering hidden patterns that can make or break the insights you extract. Whether you are dealing with customer behavior, financial trends, or product performance, outliers often hold the key to understanding anomalies that affect decision making. Getting ready for this work requires not only a keen understanding of your data but also the right tools and mindset to interpret the results. So, how do you actually do it? The process involves careful analysis, choosing appropriate methods, and interpreting results that may challenge your initial assumptions. Once you understand how it works, you will find that outlier detection is not just about flagging data points; it is about gaining deeper insight into the dynamics of your dataset. There is more to this than meets the eye, too: mastering the skill opens the door to more advanced techniques that can further refine your approach. See also the complementary methods and tools, covered at the end of this recipe, that can take your analysis further.
Table of Contents
- Identifying outliers and unexpected values in bivariate relationships
- Getting ready
- How to do it
- How it works
- There's more
- See also
Identifying Outliers and Unexpected Values in Bivariate Relationships
A value might be unexpected even if it is not an extreme value, that is, even when it does not deviate significantly from the mean of its distribution. Some values for a variable are unexpected when a second variable takes certain values. This is easy to illustrate when one variable is categorical and the other is continuous. Imagine fictional data on the number of bird sightings per day over a several-year period at two sites, with a different distribution at each site: mean sightings per day of 33 at site A and 52 at site B, with an overall mean (not shown) of 42. What should we make of a value of 58 for daily sightings? Is that an outlier? That clearly depends on which of the two sites was being observed. If there were 58 sightings on a day at site A, 58 would be an unusually high number. Not so for site B, where 58 sightings would not be very different from that site's mean.
This hints at a useful rule of thumb: whenever a variable of interest is significantly correlated with another variable, we should take that relationship into account when trying to identify outliers (indeed, when doing any statistical analysis with that variable). It is helpful to state this a little more precisely and to extend it to cases where both variables are continuous. If we assume a linear relationship between variable x and variable y, we can describe that relationship with the familiar y = mx + b equation, where m is the slope and b is the y-intercept. We can then expect y to increase by m for every 1-unit increase in x. Unexpected values are those that deviate substantially from this relationship, where the value of y is much higher or lower than would be predicted given the value of x; a minimal sketch of this idea in code follows below. This can be extended to multiple x, or predictor, variables. In this recipe, we demonstrate how to identify outliers and unexpected values by examining the relationship of a variable to one other variable. In subsequent recipes in this section, we use multivariate techniques to make additional improvements in our outlier detection.
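To make this concrete, here is a minimal sketch (not part of the recipe itself) of flagging unexpected values as large residuals from a fitted line. The synthetic data and the two-standard-deviation cutoff are illustrative assumptions, not choices the recipe makes.
import numpy as np

# illustrative data: y depends linearly on x, plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 200)
y = 3 * x + 10 + rng.normal(0, 15, 200)

# fit y = mx + b, then compute each point's residual
m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

# flag points more than 2 residual standard deviations from the line;
# the cutoff of 2 is an arbitrary illustrative choice
unexpected = np.abs(residuals) > 2 * residuals.std()
print(f"flagged {unexpected.sum()} of {len(x)} points as unexpected")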
Getting Ready
We use the matplotlib and seaborn libraries in this recipe. You can install them with pip, the Python package manager:
pip install matplotlib
pip install seaborn
How To Do It
We examine the relationship between total cases and total deaths. We take a
closer look at those countries where deaths are higher or lower than expected
given the number of cases.
- Load pandas, matplotlib, seaborn, and the Covid cumulative data.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load the cumulative data and index it by country ISO code
covidtotals = pd.read_csv("data/covidtotals720.csv")
covidtotals.set_index("iso_code", inplace=True)

# lists of the cumulative totals columns and the demographic columns
totvars = ['location', 'total_cases', 'total_deaths',
  'total_cases_pm', 'total_deaths_pm']
demovars = ['population', 'pop_density', 'median_age',
  'gdp_per_capita', 'hosp_beds']
- Generate a correlation matrix for the cumulative and demographic columns.
Unsurprisingly, there is a very high correlation (0.93) between total cases and total deaths, and a smaller (0.49) but still substantial one between total cases per million and total deaths per million. There is also a fairly strong (0.58) relationship between GDP per capita and cases per million.
# numeric_only excludes the string location column (needed on newer pandas)
covidtotals.corr(method="pearson", numeric_only=True)
Output:
total_cases total_deaths \
total_cases 1.00 0.93
total_deaths 0.93 1.00
total_cases_pm 0.23 0.20
total_deaths_pm 0.26 0.41
population 0.34 0.28
pop_density -0.03 -0.03
median_age 0.12 0.17
gdp_per_capita 0.13 0.16
hosp_beds -0.01 -0.01
total_cases_pm ... median_age \
total_cases 0.23 ... 0.12
total_deaths 0.20 ... 0.17
total_cases_pm 1.00 ... 0.22
total_deaths_pm 0.49 ... 0.38
population -0.04 ... 0.03
pop_density 0.08 ... 0.14
median_age 0.22 ... 1.00
gdp_per_capita 0.58 ... 0.64
hosp_beds 0.02 ... 0.66
gdp_per_capita hosp_beds
total_cases 0.13 -0.01
total_deaths 0.16 -0.01
total_cases_pm 0.58 0.02
total_deaths_pm 0.37 0.09
population -0.06 -0.04
pop_density 0.30 0.31
median_age 0.64 0.66
gdp_per_capita 1.00 0.30
hosp_beds 0.30 1.00
[9 rows x 9 columns]
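A matrix this wide is easier to scan visually. Since seaborn is already loaded, one option is a heatmap; this is a supplementary sketch rather than one of the recipe's steps:
# visualize the correlation matrix; annot prints each coefficient
corrmatrix = covidtotals.corr(method="pearson", numeric_only=True)
sns.heatmap(corrmatrix, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()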
- Check to see whether some countries have unexpectedly high or low
total deaths, given total cases.
First create a DataFrame with only the key totals columns (the totvars list from step 1). Use qcut to create columns that break the cases and deaths data into quintiles. Then show a crosstab of total cases quintiles by total deaths quintiles.
covidtotalsonly = covidtotals.loc[:, totvars]
covidtotalsonly['total_cases_q'] = pd.qcut(covidtotalsonly['total_cases'],
  labels=['very low','low','medium','high','very high'],
  q=5, precision=0)
covidtotalsonly['total_deaths_q'] = pd.qcut(covidtotalsonly['total_deaths'],
  labels=['very low','low','medium','high','very high'],
  q=5, precision=0)
pd.crosstab(covidtotalsonly.total_cases_q, covidtotalsonly.total_deaths_q)
Output:
total_deaths_q very low low medium high very high
total_cases_q
very low 35 7 0 0 0
low 7 25 10 0 0
medium 0 8 24 9 0
high 0 1 7 26 8
very high 0 1 0 7 34
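The crosstab makes off-diagonal countries easy to spot by eye, but they can also be pulled out programmatically. This is a supplementary sketch, assuming the covidtotalsonly DataFrame from the previous step; the threshold of two or more quintiles apart is an illustrative choice:
# qcut labels are ordered categoricals, so cat.codes gives 0 through 4
# (rows with missing values get -1 and may need to be filtered out)
qgap = (covidtotalsonly.total_cases_q.cat.codes -
  covidtotalsonly.total_deaths_q.cat.codes).abs()
covidtotalsonly.loc[qgap>=2, ['total_cases_q','total_deaths_q']]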
- Take a look at countries that do not fit along the diagonal.
There is one country with very high total cases but low total deaths. It is also worth looking at countries with low cases but high deaths; a sketch of that mirror-image query follows the output below. Since the covidtotals and covidtotalsonly DataFrames have the same index, we can use the Boolean Series created from the latter to return selected rows from the former.
covidtotals.loc[(covidtotalsonly.total_cases_q=="very high") &
  (covidtotalsonly.total_deaths_q=="low")].T
Output:
iso_code SGP
lastdate 2020-07-12
location Singapore
total_cases 45,783.00
total_deaths 26.00
total_cases_pm 7,825.69
total_deaths_pm 4.44
population 5,850,343.00
pop_density 7,915.73
median_age 42.40
gdp_per_capita 85,535.38
hosp_beds 2.40
region East Asia
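The same pattern works in the other direction, for countries whose deaths quintile sits well above their cases quintile. This sketch is not one of the recipe's steps, and the labels are illustrative; adjust them to match whichever off-diagonal cell of the crosstab you want to inspect:
covidtotals.loc[(covidtotalsonly.total_cases_q=="low") &
  (covidtotalsonly.total_deaths_q=="medium")].T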
- Do a scatter plot of total cases by total deaths.
Use Seaborn’s regplot method to generate a linear regression line in
addition to the scatter plot.
ax = sns.regplot(x=covidtotals.total_cases/1000, y=covidtotals.total_deaths)
ax.set(xlabel="Cases (thousands)", ylabel="Deaths", title="Total Covid Cases and Deaths by Country")
plt.show()
Output: [scatter plot of total cases (thousands) against total deaths by country, with fitted regression line]
- Examine unexpected values above the regression line.
It is worth taking a closer look at countries whose cases and deaths coordinates sit noticeably above or below the regression line through the data. There are five countries with fewer than 400,000 cases and more than 25,000 deaths.
covidtotals.loc[(covidtotals.total_cases<400000) &
  (covidtotals.total_deaths>25000)].T
Output:
iso_code FRA ITA MEX \
lastdate 2020-07-12 2020-07-12 2020-07-12
location France Italy Mexico
total_cases 170,752 242,827 295,268
total_deaths 30,004 34,945 34,730
total_cases_pm 2,616 4,016 2,290
total_deaths_pm 460 578 269
population 65,273,512 60,461,828 128,932,753
pop_density 123 206 66
median_age 42 48 29
gdp_per_capita 38,606 35,220 17,336
hosp_beds 6 3 1
region Western Europe Western Europe North America
iso_code ESP GBR
lastdate 2020-07-11 2020-07-12
location Spain United Kingdom
total_cases 253,908 288,953
total_deaths 28,403 44,798
total_cases_pm 5,431 4,256
total_deaths_pm 607 660
population 46,754,783 67,886,004
pop_density 93 273
median_age 46 41
gdp_per_capita 34,272 39,753
hosp_beds 3 3
region Western Europe Western Europe
- Examine unexpected values below the regression line.
There are two countries with more than 700,000 cases but fewer than 25,000
deaths.
covidtotals.loc[(covidtotals.total_cases>700000) &
  (covidtotals.total_deaths<25000)].T
Output:
iso_code IND RUS
lastdate 2020-07-12 2020-07-12
location India Russia
total_cases 849,553 720,547
total_deaths 22,674 11,205
total_cases_pm 616 4,937
total_deaths_pm 16 77
population 1,380,004,385 145,934,460
pop_density 450 9
median_age 28 40
gdp_per_capita 6,427 24,766
hosp_beds 1 8
region South Asia Eastern Europe
- Do a scatter plot of total cases per million by total deaths per million.
ax = sns.regplot(x="total_cases_pm", y="total_deaths_pm", data=covidtotals)
ax.set(xlabel="Cases Per Million", ylabel="Deaths Per Million",
title="Total Covid Cases per Million and Deaths per Million by Country")
plt.show()
Output: [scatter plot of cases per million against deaths per million by country, with fitted regression line]
- Examine deaths per million above and below the regression line.
covidtotals.loc[(covidtotals.total_cases_pm<7500) &
  (covidtotals.total_deaths_pm>600),
  ['location','total_cases_pm','total_deaths_pm']]
Output:
location total_cases_pm total_deaths_pm
iso_code
BEL Belgium 5,402 844
ESP Spain 5,431 607
GBR United Kingdom 4,256 660
covidtotals.loc[(covidtotals.total_cases_pm>15000) &
  (covidtotals.total_deaths_pm<=100),
  ['location','total_cases_pm','total_deaths_pm']]
Output:
location total_cases_pm total_deaths_pm
iso_code
BHR Bahrain 19,082 61
QAT Qatar 35,795 51
The preceding steps examined the relationship between variables in order to
identify outliers.
How It Works
Looking at bivariate relationships raises a number of questions that did not surface in our univariate exploration in the previous recipe. There is confirmation of anticipated relationships, such as that between total cases and total deaths, but this makes deviations from those relationships all the more curious. There are possible substantive explanations for unusually high death rates given a certain number of cases, but measurement error or poor reporting of cases cannot be ruled out either.

Step 2 shows a high correlation (0.93) between total cases and total deaths, but there is variation even there. In step 3, we divide the cases and deaths into quintiles and then do a crosstab of the quintile values. Most countries fall along the diagonal or close to it. However, one country, Singapore, has a very high number of cases but low deaths. This is also a reminder that Singapore had very high total cases per million, well into the 90th percentile, so it is reasonable to wonder whether there are reporting issues. One country, Yemen, had a low number of cases but a high number of deaths. That could perhaps be seen as consistent with Yemen's very low number of hospital beds, but it could also mean that coronavirus cases have been underreported.

We do a scatter plot of total cases and deaths in step 5. The strong upward-sloping relationship between the two is confirmed, but there are a number of countries whose deaths sit above the regression line. We can see that five countries (France, Italy, Mexico, Spain, and the United Kingdom) have higher deaths than would be predicted by their number of cases. Two countries, Russia and India, have a much lower number of deaths. It is at least worth wondering whether this is a reporting problem or reflects differences in how countries define a Covid death.

Not surprisingly, there is even more scatter around the regression line in the plot of cases per million by deaths per million. Countries such as Belgium, Spain, and the United Kingdom have much higher deaths per million than their cases per million would suggest, while Bahrain and Qatar have significantly lower rates. One way to make these judgments more precise is to compute each country's residual from the fitted line, as sketched below.
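This sketch, which is supplementary rather than part of the recipe, reuses the covidtotals DataFrame from step 1 and ranks countries by how far their deaths fall from the line fitted to cases:
import numpy as np

# fit deaths on cases, then compute each country's residual
df = covidtotals[['location','total_cases','total_deaths']].dropna()
m, b = np.polyfit(df.total_cases, df.total_deaths, 1)
df = df.assign(residual=df.total_deaths - (m * df.total_cases + b))
df.nlargest(5, 'residual')   # deaths furthest above the line
df.nsmallest(5, 'residual')  # deaths furthest below the line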
There’s More
We are beginning to get a good sense of what our data looks like, but the data in this form does not enable us to examine how the univariate distributions and bivariate relationships might change over time. For example, one reason why countries might have more deaths per million than their cases per million would indicate is that more time has passed since their first confirmed cases. We cannot explore that in the cumulative data; we need the daily data for that, which we look at in subsequent sections.

This recipe and the previous one show how much data cleaning can bleed into exploratory data analysis, even when you are first starting to get a sense of your data. I would, however, draw a distinction between data exploration and what we are doing here. We are trying to get a sense of how the data hangs together, and of why certain variables take on certain values in certain situations and not in others. We want to get to the point where there are no huge surprises when we begin the analysis.

I find it helpful to do small things to formalize this process. I use different naming conventions for files that are not quite ready for analysis. If nothing else, this helps remind me that any numbers produced at this point are far from ready for distribution.
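For example, an explicit suffix in the filename can mark interim output as preliminary; the name below is just an illustration, not a convention the recipe prescribes:
# save interim work with a name that flags it as not ready for distribution
covidtotalsonly.to_csv("data/covidtotalsonly_prelim.csv")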
See Also
We still have not done much to examine possible data issues that only become apparent when examining subsets of the data; for example, positive wage income values for people who say they are not working (both variables are in the National Longitudinal Survey data). We do that in the next recipe. We do much more with matplotlib and seaborn in section 5, Using Visualizations for Exploratory Data Analysis.
Conclusion
In conclusion, identifying outliers and unexpected values in bivariate relationships is a critical step in ensuring the accuracy and reliability of data analysis. By detecting these anomalies, businesses and analysts can avoid skewed results and uncover insights that might otherwise stay hidden. Proper preparation is essential, including ensuring data cleanliness and selecting appropriate tools. Once you are ready, the methodology is straightforward, with techniques such as scatter plots, regression analysis, and machine learning algorithms offering clear paths to finding irregularities. Understanding how these methods work lets you apply them with confidence and refine your results so that they reflect reality. It does not end there, though: improvements in detection techniques, a better understanding of contextual outliers, and new tools will continue to shape how we analyze data. If you want to dive deeper, explore related methods and tools for handling multivariate outliers and anomaly detection.