26 — Pandas Data Cleaning: Using Boxplot To Identify Outliers For Continuous Variables

A.I Hub
7 min read · Sep 14, 2024


In data analysis, outliers can obscure important patterns and lead to misleading conclusions. One effective tool for spotting them is the boxplot, a simple graphical summary of how a variable is distributed. Whether you are preparing a dataset for analysis or cleaning up messy data, knowing how to identify outliers with a boxplot is an essential skill. This recipe covers which variables to focus on, how to plot them, how a boxplot displays the distribution of a continuous variable, and how it reveals data points that deviate from the rest. It closes with a closer reading of the plots and pointers to related recipes.

Table of Contents

  • Using boxplot to identify outliers for continuous variables
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Using Boxplot To Identify Outliers For Continuous Variables

Boxplots are essentially a graphical representation of our work in the Identifying outliers with one variable recipe in section 4, Identifying Missing Values and Outliers in Subsets of Data. There, we used the interquartile range (IQR), the distance between the value at the first quartile and the value at the third quartile, to determine outliers. Any value greater than the third quartile value + (1.5 * IQR), or less than the first quartile value - (1.5 * IQR), was considered an outlier. That is precisely what is revealed in a boxplot.
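The threshold rule above can be sketched in a few lines of pandas. This is a minimal illustration, not the code from the earlier recipe: the `scores` values are made up, and the variable names are assumptions chosen for readability.

```python
import pandas as pd

# Hypothetical scores, invented for illustration (not the NLS data)
scores = pd.Series([14, 380, 430, 470, 500, 500, 530, 570, 620, 800])

# First and third quartiles and the interquartile range
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Outlier thresholds: 1.5 * IQR beyond each quartile
lower_threshold = q1 - 1.5 * iqr
upper_threshold = q3 + 1.5 * iqr

# Values outside the thresholds are flagged as outliers
outliers = scores[(scores < lower_threshold) | (scores > upper_threshold)]
print(f"thresholds: {lower_threshold} to {upper_threshold}")
print(outliers.tolist())
```

Any value that falls outside the two thresholds here is what a boxplot would draw as an individual point beyond the whiskers.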

Getting Ready

We will work with cumulative data on coronavirus cases and deaths by
country and the National Longitudinal Surveys data. You will need
the matplotlib library to run the code on your computer.

How To Do It

We use boxplots to show the shape and spread of Scholastic Assessment
Test scores, weeks worked and Covid cases and deaths.

  • Load the pandas and matplotlib libraries. Also, load the NLS and Covid data.
import pandas as pd
import matplotlib.pyplot as plt
nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)
covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])
covidtotals.set_index("iso_code", inplace=True)
  • Do a boxplot of SAT verbal scores.

Produce some descriptives first. The boxplot method draws a rectangle that represents the IQR, the values between the first and third quartiles. The whiskers extend from that rectangle to the most extreme values that lie within 1.5 times the IQR of each quartile. Any values above or below the whiskers, beyond what we have labeled the outlier threshold, are considered outliers. We use annotate to point to the first and third quartile points, the median, and the outlier thresholds.

nls97.satverbal.describe()

Output:

count   1,406
mean      500
std       112
min        14
25%       430
50%       500
75%       570
max       800
Name: satverbal, dtype: float64

plt.boxplot(nls97.satverbal.dropna(), labels=['SAT Verbal'])
plt.annotate('outlier threshold', xy=(1.05,780), xytext=(1.15,780), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02))
plt.annotate('3rd quartile', xy=(1.08,570), xytext=(1.15,570), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02))
plt.annotate('median', xy=(1.08,500), xytext=(1.15,500), size=7,
arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02))
plt.annotate('1st quartile', xy=(1.08,430), xytext=(1.15,430), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02))
plt.annotate('outlier threshold', xy=(1.05,220), xytext=(1.15,220), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02))
#plt.annotate('outlier threshold', xy=(1.95,15), xytext=(1.55,15), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02))
plt.show()

Output:

Boxplot of SAT verbal scores with labels for interquartile range and outliers
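To check the numbers behind a boxplot without reading them off the figure, one option is matplotlib's `cbook.boxplot_stats` helper, which computes the quartiles, whisker ends, and outlier points that a boxplot would draw. The sketch below uses made-up values rather than the SAT data, so the results are illustrative only.

```python
import pandas as pd
from matplotlib.cbook import boxplot_stats

# Hypothetical values, invented for illustration (not the NLS data)
values = pd.Series([14, 380, 430, 470, 500, 500, 530, 570, 620, 800])

# boxplot_stats returns one dict per dataset; whis=1.5 matches the
# 1.5 * IQR rule used for the outlier thresholds
stats = boxplot_stats(values.to_numpy(), whis=1.5)[0]

print("quartiles:", stats["q1"], stats["med"], stats["q3"])
# The whiskers end at the most extreme data points inside the thresholds
print("whiskers:", stats["whislo"], stats["whishi"])
# Points beyond the whiskers, drawn as circles in the plot
print("outliers:", stats["fliers"])
```

Note that the whisker ends are actual data values, so they can sit well inside the thresholds when no observation falls close to them.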
  • Show some descriptives on weeks worked.
weeksworked = nls97.loc[:, ['highestdegree','weeksworked16','weeksworked17']]
weeksworked.describe()

Output:

       weeksworked16  weeksworked17
count          7,068          6,670
mean              39             39
std               21             19
min                0              0
25%               23             37
50%               53             49
75%               53             52
max               53             52
  • Do boxplots of weeks worked.
plt.boxplot([weeksworked.weeksworked16.dropna(),
    weeksworked.weeksworked17.dropna()],
    labels=['Weeks Worked 2016','Weeks Worked 2017'])
plt.title("Boxplots of Weeks Worked")
plt.tight_layout()
plt.show()

Output:

Boxplots of two variables side by side
  • Show some descriptives for the Covid data. Create a list of labels (totvarslabels) for columns to use in a later step.
totvars = ['total_cases','total_deaths','total_cases_pm','total_deaths_pm']
totvarslabels = ['cases','deaths', 'cases per million','deaths per million']
covidtotalsonly = covidtotals[totvars]
covidtotalsonly.describe()

Output:

       total_cases  total_deaths  total_cases_pm  total_deaths_pm
count          209           209             209              209
mean        60,757         2,703           2,297               74
std        272,440        11,895           4,040              156
min              3             0               1                0
25%            342             9             203                3
50%          2,820            53             869               15
75%         25,611           386           2,785               58
max      3,247,684       134,814          35,795            1,238
  • Do a boxplot of cases and deaths per million.
fig, ax = plt.subplots()
plt.title("Boxplots of Covid Cases and Deaths Per Million")
ax.boxplot([covidtotalsonly.total_cases_pm, covidtotalsonly.total_deaths_pm],
    labels=['cases per million','deaths per million'])
plt.tight_layout()
plt.show()

Output:

Boxplots of two variables side by side
  • Show boxplots as separate subplots on one figure.

It is hard to view multiple boxplots on one figure when the variable values
are very different, as is true for Covid cases and deaths. Fortunately,
matplotlib allows us to create multiple subplots on each figure, each of which
can use different x and y axes.

fig, axes = plt.subplots(2, 2)
fig.suptitle("Boxplots of Covid Cases and Deaths")
axes = axes.ravel()
for j, ax in enumerate(axes):
    ax.boxplot(covidtotalsonly.iloc[:, j], labels=[totvarslabels[j]])
plt.tight_layout()
fig.subplots_adjust(top=0.94)
plt.show()

Output:

Boxplots with different y axes

Boxplots are a relatively straightforward but exceedingly useful way to view how variables are distributed. They make it easy to visualize spread, central tendency and outliers, all in one graphic.

How It Works

It is fairly easy to create a boxplot with matplotlib, as Step 2 shows. Passing a series to pyplot’s boxplot method is all that is required; we use the plt alias for pyplot. We call pyplot’s show method to display the figure. This step also demonstrates how to use annotate to add text and arrows to your figure. We show multiple boxplots in Step 4 by passing a list of series to boxplot.

It can be difficult to show multiple boxplots in a single figure when the scales are very different, as is the case with the Covid outcome data: cases, deaths, cases per million and deaths per million. Step 7 shows one way to deal with that. We can create several subplots on one figure. We start by indicating that we want four subplots, in two rows and two columns. That is what we get with plt.subplots(2, 2), which returns both a figure and the four axes. We can then loop through the axes, calling boxplot on each one. Nifty! However, it is still hard to see the IQR for cases and deaths because of some of the extreme values. In the next recipe, we remove some of the extreme values to give us a better visualization of the remaining data.
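The subplot approach can be tried without the Covid file. The following is a minimal sketch under stated assumptions: the `demo` DataFrame and its column values are randomly generated for illustration, and titles are set with set_title rather than a labels argument so the example does not depend on a particular matplotlib version.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to view plots on screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data on four very different scales (not the Covid totals)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cases": rng.lognormal(8, 2, 200),
    "deaths": rng.lognormal(4, 2, 200),
    "cases per million": rng.lognormal(6, 1, 200),
    "deaths per million": rng.lognormal(2, 1, 200),
})

# Two rows and two columns of subplots; ravel flattens the 2x2 array of axes
fig, axes = plt.subplots(2, 2)
fig.suptitle("Boxplots on separate axes")
axes = axes.ravel()

# One boxplot per subplot, so each variable gets its own y axis
for j, ax in enumerate(axes):
    ax.boxplot(demo.iloc[:, j])
    ax.set_title(demo.columns[j], size=8)

fig.tight_layout()
```

When running interactively, remove the backend line and finish with plt.show(), or call fig.savefig with a file name to write the figure to disk.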

There’s More

The boxplot of SAT verbal scores in Step 2 suggests a relatively normal distribution. The median is close to the center of the IQR. This is not surprising given that the descriptives we ran show that the mean and median have the same value. There is, however, substantially more room for outliers at the lower end than at the upper end. Indeed, the very low SAT verbal scores seem implausible and should be checked.

The boxplots of weeks worked in 2016 and 2017 in Step 4 show variables that are distributed much differently than SAT scores. The medians are near the top of the IQR and are much greater than the means. This suggests a negative skew. Also, notice that there are no whiskers or outliers at the upper end of the distribution, as the median value is at, or near, the maximum.
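The link between a median well above the mean and negative skew can be checked numerically. The sketch below uses a small made-up series shaped like the weeks worked data, with many values piled up at the maximum; the numbers are assumptions for illustration, not the NLS values.

```python
import pandas as pd

# Hypothetical weeks worked, invented for illustration: most values sit at the maximum
weeks = pd.Series([0, 10, 20, 30, 40, 45, 50, 52, 52, 52, 52, 52])

mean = weeks.mean()
median = weeks.median()
skew = weeks.skew()  # negative when the long tail is on the low side

print(f"mean={mean:.1f}, median={median:.1f}, skew={skew:.2f}")

# When the third quartile equals the maximum, there is no room for
# an upper whisker or for outliers at the top of the distribution
print(weeks.quantile(0.75) == weeks.max())
```

A mean below the median together with a negative skew statistic matches what the boxplot shows visually: a box pushed against the top of the range with a long lower tail.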

See Also

Some of these boxplots suggest that the data we are examining is not
normally distributed. The Identifying outliers with one variable recipe in
section 4 covers some normal distribution tests. It also shows how to take a
closer look at the values outside of the outlier thresholds, the circles in the
boxplots.

Conclusion

In conclusion, boxplots are a powerful way to identify outliers in continuous variables because they show the quartiles, the whiskers and the points beyond the whiskers in a single graphic. Creating and interpreting a boxplot is straightforward, and it allows for quick identification of values that might skew results or hide deeper trends. Once you are familiar with boxplots, you will find that they complement other exploratory data analysis techniques, including the outlier checks covered in the related recipes.
