28 — Pandas Data Cleaning: Examine Both Distribution Shape and Outliers With Violin Plots
When diving into data analysis, understanding the distribution and spotting outliers are critical steps that can make or break your insights. One of the most powerful yet underutilized tools for this is the violin plot. It not only allows you to visualize the shape of your data distribution but also highlights those pesky outliers that can distort your conclusions. Whether you’re preparing to analyze complex datasets or fine-tuning your model, mastering the art of visual interpretation with violin plots can significantly enhance your analytical accuracy. Let’s delve into how to leverage this tool effectively and uncover the hidden patterns in your data.
Table of Contents
- Examine both distribution shape and outliers with violin plots
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Examine Both Distribution Shape and Outliers With Violin Plots
Violin plots combine histograms and boxplots in one plot. They show the
IQR, median and whiskers, as well as the frequency of observations at all
ranges of values. It is hard to visualize how that is possible without seeing an
actual violin plot. We generate a few violin plots on the same data we used for box plots in the previous recipe, to make it easier to grasp how they work.
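Before turning to the plots, it may help to see the two ingredients of a violin plot as numbers. The sketch below is an illustration only: it uses synthetic scores rather than the NLS data, and kde_widths is a hypothetical helper that hand-rolls a Gaussian kernel density estimate with a Scott's-rule bandwidth (an assumption chosen for simplicity, not necessarily what your plotting library computes). The quartiles are the boxplot part of the violin, and the density values are the widths of its outline.

```python
import numpy as np

def kde_widths(values, grid):
    """Hypothetical helper: Gaussian kernel density estimate evaluated on grid."""
    values = np.asarray(values, dtype=float)
    n = values.size
    # Scott's rule of thumb for the kernel bandwidth (an assumption for this sketch).
    bandwidth = values.std(ddof=1) * n ** (-1 / 5)
    # Place one Gaussian bump on each observation and average them at every grid point.
    z = (grid[:, None] - values[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
scores = rng.normal(loc=500, scale=100, size=2000)  # synthetic, not the NLS data

# The "histogram" half of the violin: a smoothed frequency at each value.
grid = np.linspace(scores.min() - 100, scores.max() + 100, 400)
widths = kde_widths(scores, grid)

# The "boxplot" half of the violin: the quartiles.
q1, median, q3 = np.percentile(scores, [25, 50, 75])
widest_at = grid[np.argmax(widths)]
print(f"quartiles: {q1:.0f}, {median:.0f}, {q3:.0f}; widest point near {widest_at:.0f}")
```

For a roughly symmetric variable like this one, the widest point of the outline sits close to the median, which is the pattern to look for in the SAT verbal plot.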
Getting Ready
We will work with the NLS and the Covid case data. You need matplotlib
and seaborn installed on your computer to run the code in this recipe.
How To Do It
We do violin plots to view both the spread and shape of the distribution on
the same graphic. We then do violin plots by groups.
- Load pandas, matplotlib, and seaborn, and the Covid case and NLS data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)
covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])
covidtotals.set_index("iso_code", inplace=True)
- Do a violin plot of the SAT verbal score.
sns.violinplot(y=nls97.satverbal, color="wheat")
plt.title("Violin Plot of SAT Verbal Score")
plt.ylabel("SAT Verbal")
plt.text(0.08, 780, "outlier threshold", horizontalalignment='center', size='x-small')
plt.text(0.065, nls97.satverbal.quantile(0.75), "3rd quartile", horizontalalignment='center', size='x-small')
plt.text(0.05, nls97.satverbal.median(), "Median", horizontalalignment='center', size='x-small')
plt.text(0.065, nls97.satverbal.quantile(0.25), "1st quartile", horizontalalignment='center', size='x-small')
plt.text(0.08, 210, "outlier threshold", horizontalalignment='center', size='x-small')
plt.text(-0.4, 500, "frequency", horizontalalignment='center', size='x-small')
plt.show()
Output:
[Figure: violin plot of SAT verbal score with labels for the quartiles, median, and outlier thresholds]
- Get some descriptives for weeks worked.
nls97.loc[:, ['weeksworked16','weeksworked17']].describe()
Output:
       weeksworked16  weeksworked17
count          7,068          6,670
mean              39             39
std               21             19
min                0              0
25%               23             37
50%               53             49
75%               53             52
max               53             52
- Show weeks worked for 2016 and 2017.
Use a more object-oriented approach to make it easier to access some axes
attributes. We notice that the weeksworked distributions are bimodal, with
bulges near the top and the bottom of the distribution. Also, note the very
different IQR for 2016 and 2017.
myplt = sns.violinplot(data=nls97.loc[:, ['weeksworked16','weeksworked17']])
myplt.set_title("Violin Plots of Weeks Worked")
myplt.set_xticklabels(["Weeks Worked 2016","Weeks Worked 2017"])
plt.show()
Output:
[Figure: violin plots of weeks worked in 2016 and 2017, side by side]
- Do a violin plot of wage income by gender and marital status.
First, create a collapsed marital status column. Specify gender for the x axis,
salary for the y axis, and a new collapsed marital status column for hue. The
hue parameter is used for grouping, which will be added to any grouping
already used for the x axis. We also indicate scale="count" to generate
violin plots sized according to the number of observations in each category.
nls97["maritalstatuscollapsed"] = nls97.maritalstatus.\
  replace(['Married','Never-married','Divorced','Separated','Widowed'],\
  ['Married','Never Married','Not Married','Not Married','Not Married'])
# Positional x and y rely on older seaborn behavior; newer releases expect x= and y= keywords.
sns.violinplot(nls97.gender, nls97.wageincome, hue=nls97.maritalstatuscollapsed, scale="count")
plt.title("Violin Plots of Wage Income by Gender and Marital Status")
plt.xlabel('Gender')
plt.ylabel('Wage Income 2017')
plt.legend(title="", loc="upper center", framealpha=0, fontsize=8)
plt.tight_layout()
plt.show()
Output:
[Figure: violin plots of wage income by gender and marital status]
- Do violin plots of weeks worked by the highest degree attained.
myplt = sns.violinplot(x='highestdegree', y='weeksworked17', data=nls97)
myplt.set_xticklabels(myplt.get_xticklabels(), rotation=60, horizontalalignment='right')
myplt.set_title("Violin Plots of Weeks Worked by Highest Degree")
myplt.set_xlabel('Highest Degree Attained')
myplt.set_ylabel('Weeks Worked 2017')
plt.tight_layout()
plt.show()
Output:
[Figure: violin plots of weeks worked in 2017 by highest degree attained]
These steps show just how much violin plots can tell us about how
continuous variables in our DataFrame are distributed and how that might
vary by group.
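What a violin plot suggests visually can be confirmed with a few numbers. The sketch below is a minimal illustration on synthetic data, not the NLS file: distribution_summary is a hypothetical helper, the 1.5 x IQR rule for outlier thresholds is the conventional boxplot choice and is assumed here, and the cap of 200,000 is made up. The share_at_max value is a quick way to check whether a bulge at the top of a violin reflects many observations sitting exactly at a maximum.

```python
import numpy as np
import pandas as pd

def distribution_summary(series):
    """Hypothetical helper: boxplot statistics plus a check for bunching at the maximum."""
    s = series.dropna()
    q1, med, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    return pd.Series({
        "q1": q1,
        "median": med,
        "q3": q3,
        "iqr": iqr,
        # Conventional 1.5 x IQR outlier thresholds (an assumption for this sketch).
        "lower_threshold": q1 - 1.5 * iqr,
        "upper_threshold": q3 + 1.5 * iqr,
        # Fraction of observations exactly equal to the largest value.
        "share_at_max": (s == s.max()).mean(),
    })

rng = np.random.default_rng(1)
# Synthetic income values with an artificial ceiling at 200,000.
income = pd.Series(rng.lognormal(mean=10.8, sigma=0.8, size=5000)).clip(upper=200_000)

summary = distribution_summary(income)
print(summary.round(2))
```

With real data, the same function could be applied per group, for example through groupby(...).apply(distribution_summary), to compare IQRs across the categories shown in a grouped violin plot.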
How It Works
Similar to boxplots, violin plots show the median, first and third quartiles, and the whiskers. They also show the relative frequency of variable values. When the violin plot is displayed vertically, the relative frequency is the width at a given point. The violin plot produced in Step 2, and the associated annotations, provide a good illustration. We can tell from the violin plot that the distribution of SAT verbal scores is not dramatically different from normal, other than the extreme values at the lower end. The greatest bulge (greatest width) is at the median, declining fairly symmetrically from there. The median is relatively equidistant from the first and third quartiles.
We can create a violin plot in seaborn by passing one or more data series to the violinplot function. We can also pass a whole DataFrame of one or more columns. We do that in Step 4 because we want to plot more than one continuous variable.
We sometimes need to experiment with the legend a bit to get it to be both informative and unobtrusive. In Step 5, we used the following command to remove the legend title (since it is clear from the values), locate it in the best place on the figure, and make the box transparent (framealpha=0).
plt.legend(title="", loc="upper center", framealpha=0, fontsize=8)
We can pass data series to violinplot in a variety of ways. If you do not indicate an axis with "x=" or "y=", or grouping with "hue=", the seaborn version used for this recipe will figure that out based on order (newer seaborn releases expect these to be passed by keyword). For example, in Step 5, we did the following.
sns.violinplot(nls97.gender, nls97.wageincome, hue=nls97.maritalstatuscollapsed, scale="count")
We would have got the same results if we had done the following.
sns.violinplot(x=nls97.gender, y=nls97.wageincome, hue=nls97.maritalstatuscollapsed, scale="count")
We could have also done this to obtain the same result.
sns.violinplot(y=nls97.wageincome, x=nls97.gender, hue=nls97.maritalstatuscollapsed, scale="count")
Although I have highlighted this flexibility in this recipe, these techniques for
sending data to matplotlib and seaborn apply to all of the plotting methods
discussed in this section, though not all of them have a hue parameter.
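One more way to send data to violinplot is to pass the DataFrame through data= and name the columns as strings, with every role spelled out by keyword. The sketch below is only an illustration: the toy DataFrame is synthetic, its column names (gender, status, wageincome) are made up to echo the recipe, and the Agg backend is selected so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
n = 300
# Synthetic stand-in for the NLS columns used in Step 5.
toy = pd.DataFrame({
    "gender": rng.choice(["Female", "Male"], size=n),
    "status": rng.choice(["Married", "Never Married", "Not Married"], size=n),
    "wageincome": rng.lognormal(mean=10.5, sigma=0.6, size=n),
})

fig, ax = plt.subplots()
# Every role is named explicitly, so argument order does not matter.
sns.violinplot(data=toy, x="gender", y="wageincome", hue="status", ax=ax)
ax.set_title("Wage Income by Gender and Marital Status (synthetic data)")
```

Because the columns are named, seaborn labels the axes from the column names, and the call reads the same regardless of the order in which the keywords appear.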
There’s More
Once you get the hang of violin plots, you will appreciate the enormous amount of information they make available on one figure. We get a sense of the shape of the distribution, its central tendency, and its spread. We can also easily show that information for different subsets of our data.
The distribution of weeks worked in 2016 is different enough from weeks worked in 2017 to give the careful analyst pause. The IQR is quite different: 30 for 2016 (23 to 53) and 15 for 2017 (37 to 52).
An unusual fact about the distribution of wage income is revealed when examining the violin plots produced in Step 5. There is a bunching-up of incomes at the top of the distribution for married males and, to a lesser extent, for married females. That is quite unusual for a wage income distribution. As it turns out, it looks like there is a ceiling on wage income of $235,884. This is something that we definitely want to take into account in future analyses that include wage income.
The income distributions have a similar shape across gender and marital status, with bulges slightly below the median and extended positive tails. The IQRs have relatively similar lengths. However, the distribution for married males is noticeably higher (or to the right, depending on the chosen orientation) than that for the other groups.
The violin plots of weeks worked by degree attained show very different distributions by group, as we also discovered in the box plots of the same data in the previous recipe. What is clearer here, though, is the bimodal nature of the distribution at lower levels of education. There is a bunching at low levels of weeks worked for individuals without college degrees. Individuals without high school diplomas or a GED (a Graduate Equivalency Diploma) were nearly as likely to work 5 or fewer weeks in 2017 as they were to work 50 or more weeks.
We used seaborn exclusively to produce violin plots in this recipe. Violin plots can also be produced with matplotlib. However, the default graphics in matplotlib for violin plots look very different from those for seaborn.
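For comparison, here is a minimal sketch of the matplotlib route. It uses synthetic weeks-worked values rather than the NLS data, and the Agg backend is selected only so the code runs without a display. By default, matplotlib draws the density outline and the extremes but no inner box, so showmedians=True is passed to add a median marker.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
# Synthetic stand-ins for the two weeks-worked variables.
weeks16 = np.clip(rng.normal(40, 15, 500), 0, 53)
weeks17 = np.clip(rng.normal(42, 12, 500), 0, 52)

fig, ax = plt.subplots()
# One violin per array; by default they are placed at x positions 1 and 2.
parts = ax.violinplot([weeks16, weeks17], showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["Weeks Worked 2016", "Weeks Worked 2017"])
ax.set_title("matplotlib Violin Plots (synthetic data)")
```

The returned dictionary holds the drawn artists (for example, parts["bodies"] for the outlines), which can be restyled if the defaults look too sparse next to seaborn's.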
See Also
It might be helpful to compare the violin plots in this recipe to histograms,
box plots and grouped box plots in the previous recipes in this section.
Conclusion
In conclusion, exploring the distribution shape and identifying outliers through violin plots provides a powerful, intuitive way to understand data patterns. With proper preparation and the right setup, you are equipped to dive deep into the analysis. The process of creating and interpreting violin plots is straightforward once broken down, revealing not only the distribution but also hidden insights about your data. But this is just the beginning; there’s much more to uncover. Beyond violin plots, there are a variety of advanced techniques that can further enhance your data exploration. With these tools at your disposal, you are not just visualizing data, you are gaining a comprehensive understanding of it that supports decision making. Continue exploring these resources to unlock the full potential of your data analysis journey.