25 - Pandas Data Cleaning: Using Histograms To Examine The Distribution Of Continuous Variables
The histogram is one of the most useful and visually intuitive tools for examining how a continuous variable is distributed. It shows at a glance where values cluster, how widely they spread, and whether the distribution is skewed or has heavy tails. This recipe walks through preparing the data, building histograms and a QQ plot with matplotlib and statsmodels, and interpreting what they reveal, followed by notes on how the code works and some additional observations.
Table of Contents
- Using histograms to examine the distribution of continuous variables
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Using Histograms To Examine The Distribution Of Continuous Variables
The go-to visualization tool for statisticians trying to understand how single
variables are distributed is the histogram. Histograms plot a continuous
variable on the x axis, in bins determined by the researcher, and the frequency
of occurrence on the y axis.
Histograms provide a clear and meaningful illustration of the shape of a
distribution, including central tendency, skewness (lack of symmetry), excess
kurtosis (relatively fat tails) and spread. This matters for statistical
testing, as many tests make assumptions about a variable’s distribution.
Moreover, our expectations about data values should be guided by our
understanding of the distribution’s shape. For example, a value at the 90th
percentile has very different implications when it comes from a normal
distribution rather than from a uniform distribution.
One of the first tasks I ask introductory statistics students to do is
construct a histogram manually from a small sample. We do boxplots in the
following class. Together, histograms and boxplots provide a solid foundation
for subsequent analysis. In my data science work, I try to remember to
construct histograms and boxplots on all continuous variables of interest
shortly after the initial importing and cleaning of data. We create histograms
in this recipe and boxplots in the following two recipes.
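Building a histogram by hand, as in the classroom exercise mentioned above, amounts to choosing bin edges and counting how many values land in each bin. The following is a minimal sketch in plain Python. The sample values, the bin width of 5 and the manual_histogram helper are all made up for illustration and do not come from the recipe’s data.

```python
# Hypothetical small sample (illustrative values, not from the recipe's datasets)
sample = [2, 3, 7, 8, 8, 11, 12, 14, 16, 19]
bin_width = 5  # assumed bin width, chosen by the researcher


def manual_histogram(values, width):
    """Count values per half-open bin [start, start + width)."""
    counts = {}
    for v in values:
        start = (v // width) * width  # left edge of the bin containing v
        counts[start] = counts.get(start, 0) + 1
    # Only non-empty bins are recorded; sort them by left edge
    return dict(sorted(counts.items()))


hist = manual_histogram(sample, bin_width)
for start, count in hist.items():
    # One text bar per bin, with one asterisk per observation
    print(f"[{start:2d}, {start + bin_width:2d}): {'*' * count}")
```

Changing bin_width changes the apparent shape of the distribution, which is why the choice of bins is described as the researcher’s decision.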
Getting Ready
We will use the matplotlib library to generate histograms. Some tasks can be
done quickly and straightforwardly in matplotlib, and histograms are one of
those tasks. In this section, we will switch between matplotlib and seaborn,
which is built on matplotlib, based on which tool gets us to the required
graphic more easily. We will also use the statsmodels library. You can install
matplotlib and statsmodels with pip, the Python package manager:
pip install matplotlib
pip install statsmodels
We will work with data on land temperature and on coronavirus cases in this
recipe. The land temperature DataFrame has one row per weather station. The
coronavirus DataFrame has one row per country and reflects totals as of
July 18, 2020.
The land temperature DataFrame has the average temperature
reading (in °C) in 2019 from over 12,000 stations across the world,
though a majority of the stations are in the United States. The raw
data was retrieved from the Global Historical Climatology Network
integrated database. It is made available for public use by the
United States National Oceanic and Atmospheric Administration at:
https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-4
Our World in Data provides Covid-19 public use data at:
https://ourworldindata.org/coronavirus-source-data
The data used in this recipe was downloaded on June 1, 2020. Some of the data
was missing for Hong Kong as of this date, but this problem was
fixed in files after that.
How To Do It
We take a close look at the distribution of land temperatures by weather
station in 2019 and total coronavirus cases per million in population for each
country. We start with a few descriptive statistics before doing a QQ plot,
histograms and stacked histograms.
- Import the pandas, numpy, matplotlib and statsmodels libraries.
Also, load data on land temperatures and Covid cases.
import pandas as pd
import numpy as np  # needed later for np.arange when setting tick positions
import matplotlib.pyplot as plt
import statsmodels.api as sm
landtemps = pd.read_csv("data/landtemps2019avgs.csv")
covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])
covidtotals.set_index("iso_code", inplace=True)
- Show some of the station temperature rows.
The latabs column is the value of latitude without the north or south
indicators; so, Cairo, Egypt at approximately 30 degrees north and Porto
Alegre, Brazil at about 30 degrees south have the same value:
landtemps[['station','country','latabs',
  'elevation','avgtemp']].\
  sample(10, random_state=1)
Output:
       station         country        latabs  elevation  avgtemp
10526  NEW_FORK_LAKE   United States      43      2,542        2
1416   NEIR_AGDM       Canada             51      1,145        2
2230   CURICO          Chile              35        225       16
6002   LIFTON_PUM...   United States      42      1,809        4
2106   HUAILAI         China              40        538       11
2090   MUDANJIANG      China              45        242        6
7781   CHEYENNE_6S...  United States      36        694       15
10502  SHARKSTOOTH     United States      38      3,268        4
11049  CHALLIS_AP      United States      45      1,534        7
2820   METHONI         Greece             37         52       18
- Show some descriptive statistics.
Also, look at the skew and the kurtosis.
landtemps.describe()
Output:
       latabs  elevation  avgtemp
count  12,095     12,095   12,095
mean       40        589       11
std        13        762        9
min         0       -350      -61
25%        35         78        5
50%        41        271       10
75%        47        818       17
max        90      9,999       34
landtemps.avgtemp.skew()
Output:
-0.2678382583481769
landtemps.avgtemp.kurtosis()
Output:
2.1698313707061074
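When reading these numbers, it helps to know the convention behind them: the pandas kurtosis method uses Fisher’s definition, so a normal distribution scores about 0 rather than 3, and positive values point to fatter tails. The sketch below illustrates this with simulated data. The seed, the sample size and the three distributions are assumptions chosen for illustration and are unrelated to the temperature data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Simulated series, not the recipe's data
normal = pd.Series(rng.normal(size=100_000))
heavy = pd.Series(rng.standard_t(df=5, size=100_000))  # fatter tails than normal
rightskew = pd.Series(rng.exponential(size=100_000))   # long right tail

normal_kurt = normal.kurtosis()  # close to 0 for normally distributed data
heavy_kurt = heavy.kurtosis()    # clearly positive: heavy tails
right_skew = rightskew.skew()    # clearly positive: skewed to the right

print(round(normal_kurt, 2), round(heavy_kurt, 2), round(right_skew, 2))
```

Under this convention, a kurtosis near 2 for avgtemp would indicate tails that are heavier than those of a normal distribution.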
- Do a histogram of average temperatures.
Also, draw a line at the overall mean.
plt.hist(landtemps.avgtemp)
plt.axvline(landtemps.avgtemp.mean(), color='red', linestyle='dashed', linewidth=1)
plt.title("Histogram of Average Temperatures (Celsius)")
plt.xlabel("Average Temperature")
plt.ylabel("Frequency")
plt.show()
Output:
[Figure: histogram of average temperatures by weather station in 2019]
- Run a QQ plot to examine where the distribution deviates from a normal
distribution.
Notice that much of the distribution of temperatures falls along the red line. All dots would fall on the red line if the distribution were perfectly normal, but the tails fall off dramatically from normal.
sm.qqplot(landtemps[['avgtemp']].sort_values(['avgtemp']), line='s')
plt.title("QQ Plot of Average Temperatures")
plt.show()
Output:
[Figure: QQ plot of average temperatures against a normal distribution]
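It can help to see what a QQ plot computes: each sorted observation is paired with the value a normal distribution would place at the same rank. The sketch below uses only the Python standard library. The qq_points helper, the sample values and the (i + 0.5) / n plotting positions are assumptions made for illustration, and statsmodels may compute its positions and reference line differently.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Pair each sorted value with the normal quantile at the same rank."""
    ordered = sorted(values)
    n = len(ordered)
    # Normal distribution with the sample's own mean and standard deviation
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # (i + 0.5) / n is one common choice of plotting position (an assumption)
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Hypothetical sample with one extreme low value
sample = [-30, 4, 6, 8, 9, 10, 11, 12, 14, 16]
points = qq_points(sample)
for theo, obs in points:
    print(f"theoretical={theo:7.2f}  observed={obs:4d}")
```

Points where the observed value sits far from the theoretical one, such as the extreme low value here, are the ones that would fall away from the reference line in the plot.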
- Show skewness and kurtosis for total Covid cases per million.
This is from the coronavirus DataFrame, which has one row for each country.
covidtotals.total_cases_pm.skew()
covidtotals.total_cases_pm.kurtosis()
Output (kurtosis; the skew is about 4):
26.137524276840452
- Do a stacked histogram of the Covid case data.
Select data from four of the regions, since stacked histograms can get messy with any more categories than that. Define a getcases function that returns a
series of total_cases_pm for the countries of a region. Pass those series to
the hist method ([getcases(k) for k in showregions]) to create the
stacked histogram. Notice that much of the distribution, almost 40 of the 65
countries in these regions, has cases per million below 2,000.
showregions = ['Oceania / Aus','East Asia','Southern Africa',
  'Western Europe']
def getcases(regiondesc):
  # cases per million for the countries in one region
  return covidtotals.loc[covidtotals.region==regiondesc,
    'total_cases_pm']
plt.hist([getcases(k) for k in showregions],
  color=['blue','mediumslateblue','plum','mediumvioletred'],
  label=showregions,
  stacked=True)
plt.title("Stacked Histogram of Cases Per Million for Selected Regions")
plt.xlabel("Cases Per Million")
plt.ylabel("Frequency")
plt.xticks(np.arange(0, 22500, step=2500))
plt.legend()
plt.show()
Output:
[Figure: stacked histogram of countries by cases per million levels, for the selected regions]
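A stacked histogram only makes sense if every group is counted against the same bin edges, so that the per-group counts in a bin can be added into one bar. The sketch below shows that idea with numpy alone. The two groups and their values are made up for illustration, and this is not how matplotlib is implemented internally.

```python
import numpy as np

# Hypothetical cases-per-million values for two made-up groups
groups = {
    "Region A": np.array([100, 400, 900, 1500, 5200]),
    "Region B": np.array([50, 300, 2600, 7400]),
}

# Compute one set of bin edges from all values combined
all_values = np.concatenate(list(groups.values()))
edges = np.histogram_bin_edges(all_values, bins=4)

# Count each group against the shared edges
counts = {name: np.histogram(vals, bins=edges)[0] for name, vals in groups.items()}

# Total bar height per bin is the sum of the per-group counts
stacked_heights = np.sum(list(counts.values()), axis=0)

for name, c in counts.items():
    print(name, c)
print("stacked:", stacked_heights)
```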
- Show multiple histograms on one figure.
This allows different x and y axis values. We need to loop through each axis
and select a different region from showregions for each subplot.
fig, axes = plt.subplots(2, 2)
fig.suptitle("Histograms of Covid Cases Per Million by Selected Regions")
axes = axes.ravel()
for j, ax in enumerate(axes):
  ax.hist(covidtotals.loc[covidtotals.region==showregions[j]].\
    total_cases_pm, bins=5)
  ax.set_title(showregions[j], fontsize=10)
  for tick in ax.get_xticklabels():
    tick.set_rotation(45)  # rotate the tick labels; 45 degrees is an assumed angle
plt.tight_layout()
fig.subplots_adjust(top=0.88)
plt.show()
Output:
[Figure: histograms by region of countries by cases per million levels]
The preceding steps demonstrated how to visualize the distribution of a
continuous variable using histograms and QQ plots.
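The subplot loop can also be written by pairing each axes object directly with its data, which avoids indexing by position. The sketch below is one way to do that, using simulated data so that it runs without the recipe’s CSV files. The group names, the seed, the Agg backend and the 45-degree rotation are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
# Hypothetical groups of right-skewed values standing in for regions
demo = {f"Group {k}": rng.exponential(scale=1000 * (k + 1), size=30) for k in range(4)}

fig, axes = plt.subplots(2, 2)
fig.suptitle("Histograms by group (simulated data)")
flat_axes = axes.ravel()  # flatten the 2x2 array of axes to length 4

for ax, (name, values) in zip(flat_axes, demo.items()):
    ax.hist(values, bins=5)
    ax.set_title(name, fontsize=10)
    ax.tick_params(axis="x", labelrotation=45)  # rotate the x tick labels

plt.tight_layout()
fig.subplots_adjust(top=0.88)  # leave room for the figure-level title
```

Because each subplot has its own axes, the x and y ranges adapt to each group, which is what allows groups with very different scales to be compared side by side.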
How It Works
Step 4 shows how easy it is to display a histogram. This can be done by
passing a series to the hist method of Matplotlib’s pyplot module, for which
we use the alias plt. We could also have passed any ndarray, or even a list of
data series.
We also get great access to the attributes of the figure and its axes. We can
set the labels for each axis, as well as the tick marks and tick labels. We
can also specify the content and look and feel of the legend. We will be
taking advantage of this functionality often in this section.
We pass multiple series to the hist method in Step 7 to produce the stacked
histogram. Each series holds the total_cases_pm (cases per million of
population) values for the countries in a region. To get the series for each
region, we call the getcases function for each item in showregions. We choose
colors for each series rather than allowing that to happen automatically. We
also use the showregions list to select labels for the legend.
In Step 8, we start by indicating that we want four subplots, in two rows and
two columns. That is what we get with plt.subplots(2, 2), which returns both a
figure and the four axes. We loop through the axes with
for j, ax in enumerate(axes). On each pass through the loop, we select a
different region for the histogram from showregions. Within each axis, we loop
through the tick labels and change the rotation. We also adjust the start of
the subplots to make enough room for the figure title. Note that we need to
use suptitle to add a title in this case. Using title would add the title to a
subplot.
There’s More
The land temperature data is not quite normally distributed, as the histograms
and the skew and kurtosis measures show. It is skewed to the left (skew of
-0.27). Its kurtosis of 2.17 is an excess kurtosis, since pandas reports
kurtosis relative to a normal distribution’s value of 0 rather than 3, so the
tails are somewhat fatter than normal, which matches what the QQ plot shows.
Although there are some extreme values, there are not that many of them
relative to the overall size of the dataset. While it is not perfectly
bell-shaped, the land temperature data is a fair bit easier to deal with than
the Covid case data.
The skew and kurtosis of the Covid cases per million variable show that it is
some distance from normal. The skew of 4 and kurtosis of 26 indicate a high
positive skew and much fatter tails than with a normal distribution. This is
also reflected in the histograms, even when we look at the numbers by region.
There are a number of countries at very low levels of cases per million in
most regions, and just a few countries with high levels of cases. The Using
grouped boxplots to uncover unexpected values in a particular group recipe in
this chapter shows that there are outliers in almost every region.
If you work through all of the recipes in this section, and you are relatively
new to matplotlib and seaborn, you will find those libraries either usefully
flexible or confusingly flexible. It is difficult to even pick one strategy
and stick with it because you might need to set up your figure and axes in a
particular way to get the visualization you want. It is helpful to keep two
things in mind when working through these recipes. First, you will generally
need to create a figure and one or more subplots. Second, the main plotting
functions work similarly regardless, so plt.hist and ax.hist will both often
work.
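The point about plt.hist and ax.hist can be checked directly: both draw the same histogram and return the same counts and bin edges, and they differ only in whether the axes are implicit or explicit. The sketch below uses made-up values and the Agg backend, both of which are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up data

# pyplot interface: draws on the current axes, creating a figure if needed
counts_plt, edges_plt, _ = plt.hist(values, bins=4)
plt.title("pyplot interface")
plt.close()

# Object-oriented interface: explicit figure and axes
fig, ax = plt.subplots()
counts_ax, edges_ax, _ = ax.hist(values, bins=4)
ax.set_title("axes interface")
plt.close(fig)

# Both interfaces produce the same bin counts
print(list(counts_plt) == list(counts_ax))
```

The explicit form tends to be easier to manage once a figure has more than one subplot, since each call names the axes it draws on.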
Conclusion
Using histograms to examine the distribution of continuous variables is an essential early step in understanding a dataset. Once the data is loaded and the variables of interest are identified, a histogram makes it easy to see the shape of a distribution, spot extreme values, and judge skewness or departures from normality, all of which can affect later analysis. Descriptive statistics such as skew and kurtosis, together with a QQ plot, help confirm what the histogram suggests. Histograms are only one of several tools for continuous variables; boxplots, covered in the following recipes, complement them well.