27 — Pandas Data Cleaning: Using Grouped Boxplot To Uncover Unexpected Values in a Particular Group
Discovering hidden patterns in data can feel like searching for a needle in a haystack, especially when several variables are involved. A grouped boxplot helps by breaking a continuous variable down by category, so that outliers and unusual distributions within a particular group stand out instead of going unnoticed. This recipe covers getting ready, how to do it, and how it works, then points to further considerations and related recipes.
Table of Contents
- Using grouped boxplot to uncover unexpected values in a particular group
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Using Grouped Boxplot To Uncover Unexpected Values in a Particular Group
We saw in the previous recipe that boxplots are a great tool for examining the
distribution of continuous variables. They can also be useful when we want to
see whether those variables are distributed differently for parts of our
dataset: salaries for different age groups, number of children by marital
status, or litter size for different mammal species. Grouped boxplots are a
handy and intuitive way to view differences in variable distribution by
categories in our data.
Getting Ready
We will work with the NLS and the Covid case data. You will need
matplotlib and seaborn installed on your computer to run the code in this
recipe.
How To Do It
We generate descriptive statistics of weeks worked by highest degree earned.
We then use grouped boxplots to visualize the spread of the weeks worked
distribution by degree and of Covid cases by region.
- Import the pandas, matplotlib, and seaborn libraries, and load the NLS and Covid data.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)
covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])
covidtotals.set_index("iso_code", inplace=True)
- View the median, and first and third quartile values for weeks worked
for each degree attainment level.
First, define a function that returns those values as a series, then use apply
to call it for each group.
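def gettots(x):
    # Build the summary statistics for one group and return them as a series
    out = {}
    out['min'] = x.min()
    out['qr1'] = x.quantile(0.25)
    out['med'] = x.median()
    out['qr3'] = x.quantile(0.75)
    out['max'] = x.max()
    out['count'] = x.count()
    return pd.Series(out)

# Call gettots for each degree level, then move the statistics into columns
nls97.groupby(['highestdegree'])['weeksworked17'].\
  apply(gettots).unstack()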
Output:
                  min  qr1  med  qr3  max  count
highestdegree
0. None             0    0   40   52   52    510
1. GED              0    8   47   52   52    848
2. High School      0   31   49   52   52  2,665
3. Associates       0   42   49   52   52    593
4. Bachelors        0   45   50   52   52  1,342
5. Masters          0   46   50   52   52    538
6. PhD              0   46   50   52   52     51
7. Professional     0   47   50   52   52     97
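As a cross-check on a helper like gettots, the quartiles can also be obtained from the built-in groupby quantile method. The sketch below is only an illustration: the toy data frame and its values are invented here and are not taken from the NLS file, and the column labels are renamed by hand to match the ones used in this recipe.

```python
import pandas as pd

# Toy stand-in for nls97; these values are invented for illustration only.
toy = pd.DataFrame({
    "highestdegree": ["1. GED"] * 5 + ["4. Bachelors"] * 5,
    "weeksworked17": [0, 10, 40, 52, 52, 44, 48, 50, 52, 52],
})

# Passing a list of quantiles returns one row per (group, quantile) pair;
# unstack then moves the quantiles into columns, one row per group.
quartiles = (
    toy.groupby("highestdegree")["weeksworked17"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["qr1", "med", "qr3"]
print(quartiles)
```

If the two approaches disagree on real data, that usually points to a problem in the custom function rather than in the data.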
- Do a boxplot of weeks worked by highest degree earned.
Use Seaborn for these boxplots. First, create a subplot and name it myplt.
This makes it easier to access subplot attributes later. Use the order
parameter of boxplot to order by highest degree earned. We notice that there
are no outliers or whiskers at the lower end for individuals with no degree
ever received. This is because the IQR for those individuals covers the whole
range of values; that is, the value at the 25th percentile is 0 and the value at
the 75th percentile is 52.
# Pass x and y by keyword; recent seaborn releases do not accept them positionally
myplt = sns.boxplot(x='highestdegree', y='weeksworked17', data=nls97,
  order=sorted(nls97.highestdegree.dropna().unique()))
myplt.set_title("Boxplots of Weeks Worked by Highest Degree")
myplt.set_xlabel('Highest Degree Attained')
myplt.set_ylabel('Weeks Worked 2017')
# Rotate the tick labels so that the degree names do not overlap
myplt.set_xticklabels(myplt.get_xticklabels(), rotation=60,
  horizontalalignment='right')
plt.tight_layout()
plt.show()
Output:
[Figure: Boxplots of Weeks Worked by Highest Degree]
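The missing lower whisker and outliers for the no-degree group follow from how whisker limits are computed. Here is a minimal sketch of that arithmetic, assuming the default whisker rule in seaborn and matplotlib (1.5 times the IQR beyond each quartile) and using quartile values from the Step 2 output. The fences helper is a hypothetical name introduced for this example.

```python
def fences(q1, q3, whis=1.5):
    """Return the (lower, upper) limits beyond which points are drawn as outliers."""
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# "0. None": first quartile 0, third quartile 52 (from the Step 2 output)
print(fences(0, 52))    # (-78.0, 130.0)

# "4. Bachelors": first quartile 45, third quartile 52
print(fences(45, 52))   # (34.5, 62.5)
```

For the no-degree group, every value between 0 and 52 lies inside the limits, and the minimum already equals the first quartile, so there is nothing left for a lower whisker or outlier points to show. For the Bachelors group, any value below 34.5 weeks is drawn as an outlier.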
- View the minimum, maximum, median and first and third quartile
values for total cases per million by region. Use the gettots function defined in Step 2.
covidtotals.groupby(['region'])['total_cases_pm'].\
apply(gettots).unstack()
Output:
                    min    qr1    med    qr3    max  count
region
Caribbean            95    252    339  1,726  4,435     22
Central Africa       15     71    368  1,538  3,317     11
Central America      93    925  1,448  2,191 10,274      7
Central Asia        374    919  1,974  2,907 10,594      6
East Africa           9     65    190    269  5,015     13
East Asia             3     16     65    269  7,826     16
Eastern Europe      347    883  1,190  2,317  6,854     22
North Africa        105    202    421    427    793      5
North America     2,290  2,567  2,844  6,328  9,812      3
Oceania / Aus         1     61    234    424  1,849      8
South America       284    395  2,857  4,044 16,323     13
South Asia          106    574    885  1,127 19,082      9
Southern Africa      36     86    118    263  4,454      9
West Africa          26    114    203    780  2,862     17
West Asia            23    273  2,191  5,777 35,795     16
Western Europe      200  2,193  3,769  5,357 21,038     32
- Do boxplots of cases per million by region.
Flip the axes since there are a large number of regions. Also, do a swarm plot
to give some sense of the number of countries by region. The swarm plot
displays a dot for each country in each region. Some of the IQRs are hard to
see because of the extreme values.
# The continuous variable goes on the x axis so that the boxes are horizontal
sns.boxplot(x='total_cases_pm', y='region', data=covidtotals)
# Overlay one dot per country on top of each box
sns.swarmplot(y="region", x="total_cases_pm", data=covidtotals,
  size=2, color=".3", linewidth=0)
plt.title("Boxplots of Total Cases Per Million by Region")
plt.xlabel("Cases Per Million")
plt.ylabel("Region")
plt.tight_layout()
plt.show()
Output:
[Figure: Boxplots of Total Cases Per Million by Region]
- Show the most extreme values for cases per million.
covidtotals.loc[covidtotals.total_cases_pm>=14000,\
['location','total_cases_pm']]
Output:
            location  total_cases_pm
iso_code
BHR          Bahrain          19,082
CHL            Chile          16,323
QAT            Qatar          35,795
SMR       San Marino          21,038
VAT          Vatican          14,833
- Redo the boxplots without the extreme values.
# Keep only the countries below the 14,000 cases per million cutoff
sns.boxplot(x='total_cases_pm', y='region',
  data=covidtotals.loc[covidtotals.total_cases_pm<14000])
sns.swarmplot(y="region", x="total_cases_pm",
  data=covidtotals.loc[covidtotals.total_cases_pm<14000],
  size=3, color=".3", linewidth=0)
plt.title("Total Cases Without Extreme Values")
plt.xlabel("Cases Per Million")
plt.ylabel("Region")
plt.tight_layout()
plt.show()
Output:
[Figure: Total Cases Without Extreme Values]
These grouped boxplots reveal how much the distribution of cases, adjusted
by population, varies by region.
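The points a grouped boxplot draws as outliers can also be listed in a data frame, which is handy when we want to inspect them. The sketch below is one possible approach and not part of the original recipe: it computes an upper limit of the third quartile plus 1.5 times the IQR within each group and, for comparison, across all rows. The toy data and the upper_fence helper are invented for illustration.

```python
import pandas as pd

# Toy data: region A has low values, region B has high values (invented numbers).
toy = pd.DataFrame({
    "region": ["A"] * 6 + ["B"] * 6,
    "total_cases_pm": [10, 20, 30, 40, 50, 400,
                       2000, 2500, 3000, 3500, 4000, 4500],
})

def upper_fence(s, whis=1.5):
    """Third quartile plus whis times the IQR, the usual boxplot upper limit."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return q3 + whis * (q3 - q1)

# transform broadcasts each group's limit back to the rows of that group
toy["group_fence"] = toy.groupby("region")["total_cases_pm"].transform(upper_fence)
toy["group_outlier"] = toy["total_cases_pm"] > toy["group_fence"]

# The same rule applied to the whole column, ignoring the groups
overall_fence = upper_fence(toy["total_cases_pm"])
toy["overall_outlier"] = toy["total_cases_pm"] > overall_fence

print(toy.loc[toy["group_outlier"], ["region", "total_cases_pm", "overall_outlier"]])
```

In this toy example, the value 400 is flagged within region A but not against the whole column, because region B's much larger values pull the overall limit up.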
How It Works
We use seaborn for the figures we create in this recipe. We could also have used matplotlib. Seaborn is built on top of matplotlib, extending it in some areas and making some things easier. It sometimes produces more aesthetically pleasing figures with the default settings than matplotlib does.
It is a good idea to have some descriptive statistics in front of us before creating figures with multiple boxplots. In Step 2, we get the minimum, maximum, median, and first and third quartile values, along with the count, for each degree attainment level. We do this by first creating a function called gettots, which returns a series with those values. We apply gettots to each group in the data frame with the following statement.
nls97.groupby(['highestdegree'])['weeksworked17'].apply(gettots).unstack()
The groupby method creates a groupby object that holds the grouping information, and apply calls gettots once for each group. gettots then calculates summary values for each group. unstack reshapes the returned rows, from multiple rows per group (one for each summary statistic) to one row per group, with a column for each summary statistic.
In Step 3, we generate a boxplot for each degree attainment level. We do not normally need to name the axes object that seaborn's boxplot function returns. We do so in this step, naming it myplt, so that we can easily change attributes, such as tick labels, later. We rotate the labels on the x axis using set_xticklabels so that the labels do not run into each other.
We flip the axes for the boxplots in Step 5 since there are more group levels (regions) than there are ticks for the continuous variable, cases per million. We do that by making total_cases_pm the x value and region the y value. We also do a swarm plot to give some sense of the number of observations (countries) in each region.
Extreme values can sometimes make it difficult to view a boxplot. Boxplots show both the outliers and the IQR, but the IQR rectangle will be so small that it is not viewable when outliers are several times the third or first quartile value. In Step 7, we remove all values of total_cases_pm greater than or equal to 14000. This improves the presentation of each IQR.
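To see what unstack does to the result of apply, it can help to look at the intermediate series. The following is a small sketch on invented data. The med_and_max helper is a hypothetical, shortened stand-in for gettots.

```python
import pandas as pd

# Invented data with two groups of three values each
toy = pd.DataFrame({
    "grp": ["a", "a", "a", "b", "b", "b"],
    "val": [1, 2, 3, 10, 20, 60],
})

def med_and_max(x):
    # Return two summary statistics for one group as a series
    return pd.Series({"med": x.median(), "max": x.max()})

# apply returns a series indexed by (group, statistic): several rows per group
stacked = toy.groupby("grp")["val"].apply(med_and_max)
print(stacked)

# unstack moves the inner index level into columns: one row per group
wide = stacked.unstack()
print(wide)
```

The stacked result has four rows (two statistics for each of two groups), while the unstacked result has two rows and two columns.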
There’s More
The boxplots of weeks worked by educational attainment in Step 3 reveal high variation in weeks worked, something that is not obvious in univariate analysis. The lower the educational attainment level, the greater the spread in weeks worked. There is substantial variability in weeks worked in 2017 for individuals with less than a high school degree and very little variability for individuals with college degrees. This is quite relevant, of course, to our understanding of what is an outlier in terms of weeks worked. For example, someone with a college degree who worked 20 weeks is an outlier, but they would not be an outlier if they had less than a high school diploma.
The Cases Per Million boxplots also invite us to think more flexibly about what an outlier is. For example, none of the outliers for cases per million in East Africa would have been identified as an outlier in the dataset as a whole. In addition, those values are all lower than the third quartile value for North America. But they definitely are outliers for East Africa.
One of the first things I notice when looking at a boxplot is where the median is in the IQR. When the median is not at all close to the center, I know I am not dealing with a normally distributed variable. It also gives me a good sense of the direction of the skew. If it is near the bottom of the IQR, meaning that the median is much closer to the first quartile than the third, then there is positive skew. Compare the boxplot for the Caribbean to that of Western Europe. A large number of low values and a few high values bring the median close to the first quartile value for the Caribbean.
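The position of the median inside the IQR can be put into a number. This is a rough sketch, not a formal skewness statistic: it uses the quartile values from the Step 4 output, and median_position is a hypothetical helper name introduced here. A result near 0 means the median sits at the first quartile, near 1 at the third quartile, and near 0.5 in the middle of the box.

```python
def median_position(q1, med, q3):
    """Return where the median falls within the IQR, from 0 (at Q1) to 1 (at Q3)."""
    return (med - q1) / (q3 - q1)

# Quartiles of total cases per million, taken from the Step 4 output
caribbean = median_position(252, 339, 1726)
western_europe = median_position(2193, 3769, 5357)

print(round(caribbean, 2))       # 0.06
print(round(western_europe, 2))  # 0.5
```

The Caribbean median sits almost on the first quartile, which matches the positive skew described above, while the Western Europe median is close to the center of its box.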
See Also
We work much more with groupby in section 7, Fixing Messy Data When
Aggregating. We work more with stack and unstack in section 9, Tidying
and Reshaping Data.
Conclusion
Grouped boxplots are an effective way to uncover unexpected values within specific groups of data. The steps in this recipe show how to create and interpret them, bringing to light outliers and differences in distribution that may otherwise go unnoticed. Once you understand how they work, you can judge whether a value is unusual relative to its own group and not only relative to the whole dataset. There is more to explore from here, including the grouping and reshaping techniques referenced in the See Also section.