28 — Pandas Data Cleaning: Examine Both Distribution Shape and Outliers With Violin Plots

7 min read · Sep 15, 2024

When diving into data analysis, understanding the distribution and spotting outliers are critical steps that can make or break your insights. One of the most powerful yet underutilized tools for this is the violin plot. It not only allows you to visualize the shape of your data distribution but also highlights those pesky outliers that can distort your conclusions. Whether you’re preparing to analyze complex datasets or fine-tuning your model, mastering the art of visual interpretation with violin plots can significantly enhance your analytical accuracy. Let’s delve into how to leverage this tool effectively and uncover the hidden patterns in your data.

Table of Contents

Examine both distribution shape and outliers with violin plots
Getting ready
How to do it
How it works
There’s more
See also

Examine Both Distribution Shape and Outliers With Violin Plots

Violin plots combine features of histograms and box plots in one plot. They show the IQR, median, and whiskers, as well as the frequency of observations at all ranges of values. It is hard to visualize how that is possible without seeing an actual violin plot. We generate a few violin plots on the same data we used for box plots in the previous recipe, to make it easier to grasp how they work.

Getting Ready

We will work with the NLS and the Covid case data. You need matplotlib
and seaborn installed on your computer to run the code in this recipe.

How To Do It

We do violin plots to view both the spread and shape of the distribution on
the same graphic. We then do violin plots by groups.

Load pandas, matplotlib, and seaborn, and the Covid case and NLS data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)
covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])
covidtotals.set_index("iso_code", inplace=True)

Do a violin plot of the SAT verbal score.

sns.violinplot(y=nls97.satverbal, color="wheat", orient="v")
plt.title("Violin Plot of SAT Verbal Score")
plt.ylabel("SAT Verbal")
plt.text(0.08, 780, "outlier threshold", horizontalalignment='center', size='x-small')
plt.text(0.065, nls97.satverbal.quantile(0.75), "3rd quartile", horizontalalignment='center', size='x-small')
plt.text(0.05, nls97.satverbal.median(), "Median", horizontalalignment='center', size='x-small')
plt.text(0.065, nls97.satverbal.quantile(0.25), "1st quartile", horizontalalignment='center', size='x-small')
plt.text(0.08, 210, "outlier threshold", horizontalalignment='center', size='x-small')
plt.text(-0.4, 500, "frequency", horizontalalignment='center', size='x-small')
plt.show()

Output:

Violin plot of SAT verbal score with labels for IQR and outlier threshold
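The quartile labels in the plot come from quantile() and median(), while the two outlier thresholds were typed in by hand. As a minimal sketch, assuming the usual 1.5 × IQR rule for whiskers, the thresholds can be computed instead of hard-coded. The scores below are made-up values for illustration, not the NLS data.

```python
import pandas as pd

# Hypothetical scores for illustration only (not taken from nls97)
scores = pd.Series([100, 400, 450, 500, 550, 600, 650], name="satverbal")

q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Conventional outlier thresholds: 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the thresholds are the candidates to inspect
outliers = scores[(scores < lower) | (scores > upper)]

print(f"Q1={q1}, median={scores.median()}, Q3={q3}")
print(f"outlier thresholds: {lower} to {upper}")
print(outliers.tolist())
```

With the real data, the computed lower and upper values could replace the literal y positions passed to plt.text.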

Get some descriptives for weeks worked.

nls97.loc[:, ['weeksworked16','weeksworked17']].describe()

Output:

       weeksworked16  weeksworked17
count          7,068          6,670
mean              39             39
std               21             19
min                0              0
25%               23             37
50%               53             49
75%               53             52
max               53             52
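The quartile rows of this output are enough to compute the IQR for each year. The sketch below re-enters the reported 25% and 75% values by hand so that it runs without the data file. With the real data, the same subtraction could be applied to the result of describe().

```python
import pandas as pd

# Quartiles copied by hand from the describe() output
quartiles = pd.DataFrame(
    {"weeksworked16": [23, 53], "weeksworked17": [37, 52]},
    index=["25%", "75%"],
)

# IQR = third quartile minus first quartile, column by column
iqr = quartiles.loc["75%"] - quartiles.loc["25%"]
print(iqr)
```

The IQR for 2016 comes out twice as wide as the one for 2017, which is the difference the violin plots in the next step make visible.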

Show weeks worked for 2016 and 2017.

Use a more object-oriented approach to make it easier to access some axes attributes. We notice that the weeksworked distributions are bimodal, with bulges near the top and the bottom of the distribution. Also, note the very different IQR for 2016 and 2017.

myplt = sns.violinplot(data=nls97.loc[:, ['weeksworked16','weeksworked17']])
myplt.set_title("Violin Plots of Weeks Worked")
myplt.set_xticklabels(["Weeks Worked 2016","Weeks Worked 2017"])
plt.show()

Output:

Violin plots showing spread and shape of distribution for two variables side by side

Do a violin plot of wage income by gender and marital status.

First, create a collapsed marital status column. Specify gender for the x axis, wage income for the y axis, and the new collapsed marital status column for hue. The hue parameter is used for grouping, which is added to any grouping already used for the x axis. We also indicate scale="count" to generate violin plots that are sized according to the number of observations in each category. (In seaborn 0.13 and later, this parameter is named density_norm.)

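# Collapse the five marital status values into three groups
nls97["maritalstatuscollapsed"] = nls97.maritalstatus.replace(
    ['Married','Never-married','Divorced','Separated','Widowed'],
    ['Married','Never Married','Not Married','Not Married','Not Married'])
# Name x and y explicitly; seaborn 0.12 and later reject them as positional arguments
sns.violinplot(x=nls97.gender, y=nls97.wageincome, hue=nls97.maritalstatuscollapsed, scale="count")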
plt.title("Violin Plots of Wage Income by Gender and Marital Status")
plt.xlabel('Gender')
plt.ylabel('Wage Income 2017')
plt.legend(title="", loc="upper center", framealpha=0, fontsize=8)
plt.tight_layout()
plt.show()

Output:

Violin plots showing spread and shape of distribution by two different groups
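The grouped plot can also be written in long form, passing column names together with data=. The sketch below is one way to do that on synthetic data (the column names mirror the recipe, but the values are randomly generated, not the NLS data). Because the count-scaling parameter is named differently across seaborn releases, the sketch checks the function signature first. That check is a workaround assumed for this example, not something the recipe itself does.

```python
import inspect

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for nls97 (values are random, for illustration only)
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "gender": rng.choice(["Female", "Male"], size=n),
    "maritalstatus": rng.choice(
        ["Married", "Never-married", "Divorced", "Separated", "Widowed"], size=n
    ),
    "wageincome": rng.gamma(shape=2.0, scale=20000.0, size=n),
})

# A dict keeps each old value next to its replacement
collapse = {
    "Never-married": "Never Married",
    "Divorced": "Not Married",
    "Separated": "Not Married",
    "Widowed": "Not Married",
}
demo["maritalstatuscollapsed"] = demo["maritalstatus"].replace(collapse)

# Use density_norm where the installed seaborn has it, otherwise fall back to scale
params = inspect.signature(sns.violinplot).parameters
norm_kwarg = {"density_norm": "count"} if "density_norm" in params else {"scale": "count"}

fig, ax = plt.subplots()
sns.violinplot(
    data=demo, x="gender", y="wageincome",
    hue="maritalstatuscollapsed", ax=ax, **norm_kwarg
)
ax.set_title("Violin Plots of Wage Income by Gender and Marital Status")
```

Passing data= with column names keeps the axis labels tied to the column names and avoids repeating the DataFrame name in every argument.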

Do violin plots of weeks worked by the highest degree attained.

myplt = sns.violinplot(x='highestdegree', y='weeksworked17', data=nls97)
myplt.set_xticklabels(myplt.get_xticklabels(), rotation=60, horizontalalignment='right')
myplt.set_title("Violin Plots of Weeks Worked by Highest Degree")
myplt.set_xlabel('Highest Degree Attained')
myplt.set_ylabel('Weeks Worked 2017')
plt.tight_layout()
plt.show()

Output:

Violin plots showing spread and shape of distribution by group

These steps show just how much violin plots can tell us about how continuous variables in our DataFrame are distributed and how that might vary by group.
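What a violin plot suggests visually by group can be checked with a quick numeric summary. The sketch below computes, for each degree group, the share of people who worked 5 or fewer weeks and the share who worked 50 or more. The rows and the degree labels are hand-made stand-ins chosen for illustration, not values from the NLS data.

```python
import pandas as pd

# Small hand-made sample; the degree labels are hypothetical stand-ins
demo = pd.DataFrame({
    "highestdegree": ["0. None"] * 4 + ["4. Bachelors"] * 4,
    "weeksworked17": [0, 3, 52, 51, 52, 50, 48, 40],
})

# Flag each person as low (<= 5 weeks) or high (>= 50 weeks)
flags = demo.assign(
    low=demo["weeksworked17"] <= 5,
    high=demo["weeksworked17"] >= 50,
)

# The mean of a boolean column is the share of True values in each group
shares = flags.groupby("highestdegree")[["low", "high"]].mean()
print(shares)
```

A group with sizeable shares in both columns is one whose violin would show bulges at both ends.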

How It Works

Similar to box plots, violin plots show the median, the first and third quartiles, and the whiskers. They also show the relative frequency of variable values. When the violin plot is displayed vertically, the relative frequency is the width at a given point. The violin plot produced in Step 2, with its annotations, provides a good illustration. We can tell from the violin plot that the distribution of SAT verbal scores is not dramatically different from normal, other than the extreme values at the lower end. The greatest bulge (greatest width) is at the median, declining fairly symmetrically from there.
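The width of a violin at a given value is a smoothed estimate of how many observations fall near that value. The sketch below hand-rolls a Gaussian kernel density estimate in NumPy to show the idea. It uses Scott's rule for the bandwidth, which is an assumption of this sketch, and synthetic scores in place of the SAT data. The helper name gaussian_kde_widths is made up for this example.

```python
import numpy as np

def gaussian_kde_widths(values, grid):
    """Return a Gaussian kernel density estimate of `values` evaluated on `grid`."""
    values = np.asarray(values, dtype=float)
    n = values.size
    # Scott's rule of thumb for the bandwidth (an assumption of this sketch)
    bandwidth = values.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per observation, averaged at each grid point
    z = (grid[:, None] - values[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# Synthetic, roughly normal scores standing in for satverbal
rng = np.random.default_rng(0)
scores = rng.normal(loc=500, scale=100, size=1000)

# Evaluate the density on a grid that extends well past the data
grid = np.linspace(scores.min() - 300, scores.max() + 300, 400)
density = gaussian_kde_widths(scores, grid)

widest = grid[np.argmax(density)]            # where the violin would bulge most
area = density.sum() * (grid[1] - grid[0])   # Riemann sum, should be close to 1

print(f"widest point near {widest:.0f}, median {np.median(scores):.0f}")
print(f"area under the density: {area:.3f}")
```

For roughly normal data like this, the widest point lands close to the median, which matches what the SAT verbal violin shows.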

The median is roughly equidistant from the first and third quartiles.

We can create a violin plot in seaborn by passing one or more data series to the violinplot function. We can also pass a whole DataFrame of one or more columns. We do that in Step 4 because we want to plot more than one continuous variable.

We sometimes need to experiment with the legend a bit to get it to be both informative and unobtrusive. In Step 5, we used the following command to remove the legend title (since it is clear from the values), place the legend at the top center of the figure, and make the box transparent (framealpha=0).

plt.legend(title="", loc="upper center", framealpha=0, fontsize=8)

We can pass data series to violinplot in a variety of ways. In older seaborn releases (before 0.12), if you did not indicate an axis with "x=" or "y=", seaborn worked that out from the order of the arguments, so the Step 5 call could be written as follows.

sns.violinplot(nls97.gender, nls97.wageincome, hue=nls97.maritalstatuscollapsed, scale="count")

Seaborn 0.12 and later accept only data as a positional argument, so it is safer to name the axes explicitly. The following works in both older and newer releases.

sns.violinplot(x=nls97.gender, y=nls97.wageincome, hue=nls97.maritalstatuscollapsed, scale="count")

Because the arguments are named, their order does not matter. This gives the same result.

sns.violinplot(y=nls97.wageincome, x=nls97.gender, hue=nls97.maritalstatuscollapsed, scale="count")

Although I have highlighted this flexibility in this recipe, these techniques for sending data to matplotlib and seaborn apply to all of the plotting methods discussed in this section, though not all of them have a hue parameter.

There’s More

Once you get the hang of violin plots, you will appreciate the enormous amount of information they make available on one figure. We get a sense of the shape of the distribution, its central tendency, and its spread. We can also easily show that information for different subsets of our data.

The distribution of weeks worked in 2016 is different enough from weeks worked in 2017 to give the careful analyst pause. The IQR is quite different: 30 for 2016 (23 to 53) and 15 for 2017 (37 to 52).

An unusual fact about the distribution of wage income is revealed when examining the violin plots produced in Step 5. There is a bunching-up of incomes at the top of the distribution for married males and, to a lesser extent, for married females. That is quite unusual for a wage income distribution. As it turns out, it looks like there is a ceiling on wage income of $235,884. This is something that we definitely want to take into account in future analyses that include wage income.

The income distributions have a similar shape across gender and marital status, with bulges slightly below the median and extended positive tails. The IQRs have relatively similar lengths. However, the distribution for married males is noticeably higher (or to the right, depending on the chosen orientation) than that for the other groups.

The violin plots of weeks worked by degree attained show very different distributions by group, as we also discovered in the box plots of the same data in the previous recipe. What is clearer here, though, is the bimodal nature of the distribution at lower levels of education. There is a bunching at low levels of weeks worked for individuals without college degrees. Individuals without a high school diploma or a GED (a Graduate Equivalency Diploma) were nearly as likely to work 5 or fewer weeks in 2017 as they were to work 50 or more weeks.

We used seaborn exclusively to produce violin plots in this recipe. Violin plots can also be produced with matplotlib. However, the default graphics in matplotlib for violin plots look very different from those for seaborn.
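As a rough illustration of the matplotlib route, the sketch below draws two violins with Axes.violinplot. The data are random integers standing in for the two weeks-worked columns, not the NLS values. Note that matplotlib does not draw the inner quartile box by default, so showmedians=True is set here to get at least a median marker.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the two weeks-worked columns (not the NLS data)
rng = np.random.default_rng(0)
weeks16 = rng.integers(0, 54, size=200)
weeks17 = rng.integers(0, 53, size=200)

fig, ax = plt.subplots()

# violinplot returns a dict of the artists it created (bodies, medians, and so on)
parts = ax.violinplot([weeks16, weeks17], showmedians=True, showextrema=True)

# Violins are placed at positions 1, 2, ... by default, so label those ticks
ax.set_xticks([1, 2])
ax.set_xticklabels(["Weeks Worked 2016", "Weeks Worked 2017"])
ax.set_title("Violin Plots of Weeks Worked (matplotlib)")
ax.set_ylabel("Weeks Worked")
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() to display the figure.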

Conclusion

In conclusion, examining distribution shape and identifying outliers with violin plots is a powerful, intuitive way to understand patterns in your data. Creating and interpreting violin plots is straightforward once broken down into steps, and a single figure shows the shape, central tendency, and spread of a variable, overall or by group. This is just the beginning. Beyond violin plots, there are other techniques that can take your data exploration further. With these tools at your disposal, you are not just visualizing data; you are building an understanding of it that supports better decisions.

Written by A.I Hub
