27 — Pandas Data Cleaning: Using Grouped Boxplot To Uncover Unexpected Values in a Particular Group

8 min read · Sep 15, 2024

Discovering hidden patterns in data can feel like searching for a needle in a haystack, especially when dealing with multiple variables. A grouped boxplot helps you uncover unexpected values within particular groups that would otherwise go unnoticed. Grouped boxplots provide a visual breakdown of your data, enabling you to identify outliers, trends, and anomalies that could affect your analysis or decision making. Whether you are just getting started or looking to dive deeper into how it works, understanding this tool will change how you interpret complex datasets. Let’s explore how to use it, along with what else there is to discover once the basic plots are in place.

Table of Contents

Using grouped boxplot to uncover unexpected values in a particular group
Getting ready
How to do it
How it works
There’s more
See also

Using Grouped Boxplot To Uncover Unexpected Values in a Particular Group

We saw in the previous recipe that boxplots are a great tool for examining the
distribution of continuous variables. They can also be useful when we want to
see if those variables are distributed differently for parts of our dataset:
salaries for different age groups, number of children by marital status, litter
size for different mammal species. Grouped boxplots are a handy and
intuitive way to view differences in variable distribution by categories in our
data.

Getting Ready

We will work with the NLS and the Covid case data. You will need
matplotlib and seaborn installed on your computer to run the code in this
recipe.

How To Do It

We generate descriptive statistics of weeks worked by highest degree earned.
We then use grouped boxplots to visualize the spread of the weeks worked
distribution by degree and of Covid cases by region.

1. Import the pandas, matplotlib, and seaborn libraries, and load the NLS and Covid data.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)
covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])
covidtotals.set_index("iso_code", inplace=True)

2. View the median, and first and third quartile values for weeks worked
for each degree attainment level.

First, define a function that returns those values as a series, then use apply
to call it for each group.

def gettots(x):
    # collect the summary statistics for one group in a dictionary
    out = {}
    out['min'] = x.min()
    out['qr1'] = x.quantile(0.25)
    out['med'] = x.median()
    out['qr3'] = x.quantile(0.75)
    out['max'] = x.max()
    out['count'] = x.count()
    return pd.Series(out)

nls97.groupby(['highestdegree'])['weeksworked17'].\
    apply(gettots).unstack()

Output:

                  min  qr1  med  qr3  max  count
highestdegree
0. None             0    0   40   52   52    510
1. GED              0    8   47   52   52    848
2. High School      0   31   49   52   52  2,665
3. Associates       0   42   49   52   52    593
4. Bachelors        0   45   50   52   52  1,342
5. Masters          0   46   50   52   52    538
6. PhD              0   46   50   52   52     51
7. Professional     0   47   50   52   52     97
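A similar table can be produced without a custom function. The sketch below is not part of the recipe: it uses pandas' built-in describe on a small made-up data frame (the values are invented for illustration, not taken from the NLS data) and keeps only the columns that gettots reports.

```python
import pandas as pd

# Made-up stand-in for the two NLS columns used above (assumption: toy values).
toy = pd.DataFrame({
    "highestdegree": ["1. GED"] * 4 + ["4. Bachelors"] * 4,
    "weeksworked17": [0, 10, 40, 52, 44, 48, 52, 52],
})

# describe() returns count, mean, std, min, 25%, 50%, 75%, and max per group.
summary = toy.groupby("highestdegree")["weeksworked17"].describe()

# Keep the same statistics, in the same order, as gettots.
summary = summary[["min", "25%", "50%", "75%", "max", "count"]]
print(summary)
```

A custom function like gettots remains the more flexible choice when you want statistics or column names that describe does not provide.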

3. Do a boxplot of weeks worked by highest degree earned.

Use Seaborn for these boxplots. First, create a subplot and name it myplt.
This makes it easier to access subplot attributes later. Use the order
parameter of boxplot to order by highest degree earned. We notice that there
are no outliers or whiskers at the lower end for individuals with no degree
ever received. This is because the IQR for those individuals covers the whole
range of values; that is, the value at the 25th percentile is 0 and the value at
the 75th percentile is 52.

# x and y are passed by keyword; recent seaborn releases no longer
# accept them as positional arguments
myplt = sns.boxplot(x='highestdegree', y='weeksworked17', data=nls97,
    order=sorted(nls97.highestdegree.dropna().unique()))
myplt.set_title("Boxplots of Weeks Worked by Highest Degree")
myplt.set_xlabel('Highest Degree Attained')
myplt.set_ylabel('Weeks Worked 2017')
myplt.set_xticklabels(myplt.get_xticklabels(), rotation=60,
    horizontalalignment='right')
plt.tight_layout()
plt.show()

Output:

Boxplots of weeks worked with IQR and outliers by highest degree
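The missing lower whisker for the no-degree group follows from how whisker ends are usually computed. The sketch below assumes the common rule (also seaborn's default, whis=1.5) that a whisker stops at the most extreme observation within 1.5 times the IQR of the box. The helper whisker_ends and both series are hypothetical, made up for illustration.

```python
import pandas as pd

def whisker_ends(x, k=1.5):
    """Quartiles, whisker ends, and outlier count under the k * IQR rule."""
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    n_out = int(((x < lower_fence) | (x > upper_fence)).sum())
    return pd.Series({"q1": q1, "q3": q3,
                      "whisker_low": inside.min(),
                      "whisker_high": inside.max(),
                      "n_outliers": n_out})

# Toy group whose IQR spans the full range, like the no-degree group.
wide = pd.Series([0, 0, 0, 20, 40, 52, 52, 52])
# Toy group with a narrow IQR, like the college-degree groups.
narrow = pd.Series([0, 20, 46, 48, 50, 50, 52, 52, 52])

print(whisker_ends(wide))    # whisker ends coincide with q1 and q3
print(whisker_ends(narrow))  # the low values fall outside the fences
```

When the whisker ends equal the quartiles, as in the first series, there is nothing left to draw beyond the box, and no point can be flagged as an outlier.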

4. View the minimum, maximum, median and first and third quartile
values for total cases per million by region. Use the gettots function defined in Step 2.

covidtotals.groupby(['region'])['total_cases_pm'].\
apply(gettots).unstack()

Output:

                    min    qr1    med    qr3     max  count
region
Caribbean            95    252    339  1,726   4,435     22
Central Africa       15     71    368  1,538   3,317     11
Central America      93    925  1,448  2,191  10,274      7
Central Asia        374    919  1,974  2,907  10,594      6
East Africa           9     65    190    269   5,015     13
East Asia             3     16     65    269   7,826     16
Eastern Europe      347    883  1,190  2,317   6,854     22
North Africa        105    202    421    427     793      5
North America     2,290  2,567  2,844  6,328   9,812      3
Oceania / Aus         1     61    234    424   1,849      8
South America       284    395  2,857  4,044  16,323     13
South Asia          106    574    885  1,127  19,082      9
Southern Africa      36     86    118    263   4,454      9
West Africa          26    114    203    780   2,862     17
West Asia            23    273  2,191  5,777  35,795     16
Western Europe      200  2,193  3,769  5,357  21,038     32

5. Do boxplots of cases per million by region.

Flip the axes since there are a large number of regions. Also, do a swarm plot
to give some sense of the number of countries by region. The swarm plot
displays a dot for each country in each region. Some of the IQRs are hard to
see because of the extreme values.

sns.boxplot(x='total_cases_pm', y='region', data=covidtotals)
sns.swarmplot(y="region", x="total_cases_pm", data=covidtotals,
    size=2, color=".3", linewidth=0)
plt.title("Boxplots of Total Cases Per Million by Region")
plt.xlabel("Cases Per Million")
plt.ylabel("Region")
plt.tight_layout()
plt.show()

Output:

Boxplots and swarm plots of cases per million by region, with IQR and outliers

6. Show the most extreme values for cases per million.

covidtotals.loc[covidtotals.total_cases_pm>=14000,\
['location','total_cases_pm']]

Output:

            location  total_cases_pm
iso_code
BHR          Bahrain          19,082
CHL            Chile          16,323
QAT            Qatar          35,795
SMR       San Marino          21,038
VAT          Vatican          14,833

7. Redo the boxplots without the extreme values.

sns.boxplot(x='total_cases_pm', y='region',
    data=covidtotals.loc[covidtotals.total_cases_pm<14000])
sns.swarmplot(y="region", x="total_cases_pm",
    data=covidtotals.loc[covidtotals.total_cases_pm<14000],
    size=3, color=".3", linewidth=0)
plt.title("Total Cases Without Extreme Values")
plt.xlabel("Cases Per Million")
plt.ylabel("Region")
plt.tight_layout()
plt.show()

Output:

Boxplots of cases per million by region without the extreme values

These grouped boxplots reveal how much the distribution of cases, adjusted
by population, varies by region.

How It Works

We use seaborn for the figures we create in this recipe. We could have also
used matplotlib. Seaborn is actually built on top of matplotlib, extending it in
some areas and making some things easier. It sometimes produces more
aesthetically pleasing figures with the default settings than matplotlib does.

It is a good idea to have some descriptives in front of us before creating figures
with multiple boxplots. In Step 2, we get the first and third quartile values,
and the median, for each degree attainment level. We do this by first creating
a function called gettots, which returns a series with those values. We
apply gettots to each group in the data frame with the following statement.

nls97.groupby(['highestdegree'])['weeksworked17'].apply(gettots).unstack()

The groupby method creates a groupby object holding the grouping information.
Selecting weeksworked17 from it and calling apply passes the values for each
group to gettots, which then calculates summary values for that group.
unstack reshapes the returned rows, from multiple rows per group (one for
each summary statistic) to one row per group, with columns for each summary
statistic.

In Step 3, we generate a boxplot for each degree attainment level. We do not
normally need to name the subplot object we create when we use seaborn’s
boxplot method. We do so in this step, naming it myplt, so that we can easily
change attributes, such as tick labels, later. We rotate the labels on the x axis
using set_xticklabels so that the labels do not run into each other.

We flip the axes for the boxplots in Step 5 since there are more group levels
(regions) than there are ticks for the continuous variable, cases per million.
We do that by making total_cases_pm the x value rather than the y value. We
also do a swarm plot to give some sense of the number of observations
(countries) in each region.

Extreme values can sometimes make it difficult to view a boxplot. Boxplots
show both the outliers and the IQR, but the IQR rectangle will be so small that
it is not viewable when outliers are several times the third or first quartile
value. In Step 7, we remove all values of total_cases_pm greater than or equal
to 14000. This improves the presentation of each IQR.
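The reshaping done by unstack in Step 2 is easier to see on a toy example. The sketch below uses a trimmed-down, hypothetical version of gettots (named gettots_small here) and a made-up data frame, so the numbers are illustrative only.

```python
import pandas as pd

def gettots_small(x):
    # hypothetical, shortened version of gettots with three statistics
    return pd.Series({"min": x.min(), "med": x.median(), "max": x.max()})

# Made-up data: two groups with three values each.
toy = pd.DataFrame({
    "grp": ["a", "a", "a", "b", "b", "b"],
    "val": [1, 2, 6, 10, 20, 60],
})

# apply returns one row per (group, statistic) pair in a two-level index.
stacked = toy.groupby("grp")["val"].apply(gettots_small)
print(stacked)

# unstack moves the inner index level (the statistic names) into columns,
# leaving one row per group.
wide = stacked.unstack()
print(wide)
```

Printing both objects side by side shows the same numbers in two layouts: long before unstack, wide after it.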

There’s More

The boxplots of weeks worked by educational attainment in Step 3 reveal
high variation in weeks worked, something that is not obvious in univariate
analysis. The lower the educational attainment level, the greater the spread in
weeks worked. There is substantial variability in weeks worked in 2017 for
individuals with less than a high school degree and very little variability for
individuals with college degrees. This is quite relevant, of course, to our
understanding of what is an outlier in terms of weeks worked. For example,
someone with a college degree who worked 20 weeks is an outlier, but they
would not be an outlier if they had less than a high school diploma.

The Cases Per Million boxplots also invite us to think more flexibly about
what an outlier is. For example, none of the outliers for cases per million in
East Africa would have been identified as an outlier in the dataset as a whole.
In addition, those values are all lower than the third quartile value for North
America. But they definitely are outliers for East Africa.

One of the first things I notice when looking at a boxplot is where the median
is in the IQR. When the median is not at all close to the center, I know I am
not dealing with a normally distributed variable. It also gives me a good sense
of the direction of the skew. If it is near the bottom of the IQR, meaning that
the median is much closer to the first quartile than the third, then there is
positive skew. Compare the boxplot for the Caribbean to that of Western Europe.
A large number of low values and a few high values bring the median close to
the first quartile value for the Caribbean.
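The idea that an outlier depends on the group can be checked numerically. The sketch below is one possible approach, not the recipe's method: it applies the usual 1.5 times IQR rule once to the whole column and once within each group. The helper iqr_outlier, the column names, and the values are all made up for illustration.

```python
import pandas as pd

def iqr_outlier(x, k=1.5):
    """Boolean mask: True where a value lies beyond k * IQR from the quartiles."""
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Made-up data: one region with low values, one with high values.
toy = pd.DataFrame({
    "region": ["Low"] * 6 + ["High"] * 6,
    "cases_pm": [10, 12, 14, 16, 18, 200,
                 2000, 2200, 2400, 2600, 2800, 3000],
})

# Rule applied to the whole column, ignoring region.
toy["outlier_overall"] = iqr_outlier(toy["cases_pm"])

# Same rule applied separately within each region; astype(bool) guards
# against transform returning 0/1 instead of False/True.
toy["outlier_in_region"] = (
    toy.groupby("region")["cases_pm"].transform(iqr_outlier).astype(bool)
)

print(toy[toy["outlier_in_region"]])
```

In this toy data the value 200 is unremarkable next to the high-value region, so the whole-column rule misses it, while the within-region rule flags it. That mirrors the East Africa example above.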

Conclusion

Grouped boxplots offer a powerful way to uncover unexpected values within specific groups of data. Whether you are just getting started or have experience with data visualization, the steps above make it straightforward to create and interpret boxplots, shedding light on outliers, distributions, and trends that may otherwise go unnoticed. Once you understand how they work, you can judge outliers relative to the group they belong to rather than the dataset as a whole, and draw more precise conclusions. This is just the beginning: there are more advanced techniques and tools that can further enhance your data exploration.


Written by A.I Hub
