16 — Pandas Data Cleaning: Generating Summary Statistics For Continuous Variables
Generating summary statistics for continuous variables is a crucial first step in data analysis, providing a snapshot of the underlying trends and distributions. Whether you are working with the mean, median, variance, or standard deviation, knowing how to compute and interpret these statistics helps you uncover patterns and anomalies hidden in the data. This recipe walks through getting the data ready, generating the statistics, and understanding what they reveal, and closes with pointers to related topics for going further.
Table of Contents
- Generating summary statistics for continuous variables
- Getting ready
- How to do it
- How it works
- See also
Generating Summary Statistics For Continuous Variables
Pandas has a good number of tools we can use to get a sense of the
distribution of continuous variables. This recipe focuses on the describe
and quantile methods, and also demonstrates the usefulness of histograms
for visualizing variable distributions. Before doing any analysis with a
continuous variable, it is important to have a good understanding of how it
is distributed: its central tendency, its spread, and its skewness. This
understanding greatly informs our efforts to identify outliers and unexpected
values, but it is also crucial information in and of itself. It is no
overstatement to say that we only understand a particular variable well if we
have a good understanding of how it is distributed; any interpretation
without that understanding will be incomplete or flawed in some way.
Getting Ready
We will work with the COVID totals data in this recipe. You will need
Matplotlib to run it. If Matplotlib is not installed on your machine already,
you can install it from the terminal by entering pip install matplotlib.
How To Do It
We take a look at the distribution of a few key continuous variables.
- Import pandas, numpy and matplotlib and load the COVID case
total data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
covidtotals = pd.read_csv("data/covidtotals720.csv", parse_dates=['lastdate'])
covidtotals.set_index("iso_code", inplace=True)
- Let’s remind ourselves of the structure of the data.
covidtotals.shape
Output:
(209, 12)
covidtotals.sample(1, random_state=1).T
Output:
iso_code COG
lastdate 2020-07-12 00:00:00
location Congo
total_cases 2,028.00
total_deaths 47.00
total_cases_pm 367.52
total_deaths_pm 8.52
population 5,518,092.00
pop_density 15.40
median_age 19.00
gdp_per_capita 4,881.41
hosp_beds NaN
region Central Africa
covidtotals.dtypes
Output:
lastdate datetime64[ns]
location object
total_cases float64
total_deaths float64
total_cases_pm float64
total_deaths_pm float64
population float64
pop_density float64
median_age float64
gdp_per_capita float64
hosp_beds float64
region object
dtype: object
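Given the dtypes above, the continuous columns can also be selected programmatically rather than listed by hand. A minimal sketch using select_dtypes on a tiny frame that mirrors the column types (the frame and its values are made up for illustration):

```python
import pandas as pd

# Tiny frame mirroring the covidtotals dtypes; the values are made up
df = pd.DataFrame({
    "location": ["Congo"],    # object
    "total_cases": [2028.0],  # float64
    "median_age": [19.0],     # float64
})

# select_dtypes keeps only the numeric columns, i.e. the continuous variables
contvars = df.select_dtypes(include="number").columns.tolist()
print(contvars)  # ['total_cases', 'median_age']
```

This avoids silently missing a numeric column when the dataset changes.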
- Get the descriptive statistics on the COVID totals and demographic
columns.
covidtotals.describe()
Output:
total_cases total_deaths total_cases_pm ... \
count 209.0 209.0 209.0 ...
mean 60,757.4 2,703.0 2,297.0 ...
std 272,440.1 11,895.0 4,039.8 ...
min 3.0 0.0 1.2 ...
25% 342.0 9.0 202.8 ...
50% 2,820.0 53.0 868.9 ...
75% 25,611.0 386.0 2,784.9 ...
max 3,247,684.0 134,814.0 35,795.2 ...
median_age gdp_per_capita hosp_beds
count 185.0 182.0 164.0
mean 30.6 19,285.0 3.0
std 9.1 19,687.7 2.5
min 15.1 661.2 0.1
25% 22.2 4,485.3 1.3
50% 29.9 13,031.5 2.4
75% 38.7 27,882.1 3.9
max 48.2 116,935.6 13.8
[8 rows x 9 columns]
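If the default quartiles in the describe output are not enough, describe accepts a percentiles argument. A sketch on a small made-up series (the values below are illustrative, not taken from the COVID data):

```python
import pandas as pd

# Illustrative right-skewed values, not the real covidtotals column
cases = pd.Series([3.0, 342.0, 2820.0, 25611.0, 3247684.0])

# Request the 10th and 90th percentiles in addition to the median
stats = cases.describe(percentiles=[0.1, 0.9])
print(stats[["10%", "50%", "90%"]])
```

The median (50%) is always included, so passing [0.1, 0.9] yields 10%, 50%, and 90% rows.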
- Take a closer look at the distribution of values for the cases and deaths
columns.
Use NumPy’s arange function to pass an array of floats from 0.0 to 1.0, in
steps of 0.1, to the quantile method of the DataFrame.
totvars = ['location','total_cases','total_deaths',
'total_cases_pm','total_deaths_pm']
covidtotals[totvars].quantile(np.arange(0.0, 1.1, 0.1), numeric_only=True)
Output:
total_cases total_deaths total_cases_pm total_deaths_pm
0.0 3.0 0.0 1.2 0.0
0.1 63.6 0.0 63.3 0.0
0.2 231.2 3.6 144.8 1.2
0.3 721.6 14.4 261.5 3.8
0.4 1,324.4 28.4 378.8 7.0
0.5 2,820.0 53.0 868.9 15.2
0.6 6,695.6 116.6 1,398.3 29.4
0.7 14,316.4 279.0 2,307.9 47.7
0.8 40,245.4 885.2 3,492.3 76.3
0.9 98,632.8 4,719.0 5,407.7 201.4
1.0 3,247,684.0 134,814.0 35,795.2 1,237.6
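The skewness suggested by these percentiles can also be quantified directly: pandas exposes skew and kurtosis methods on Series and DataFrames. A sketch on illustrative values (the numbers are made up, not the actual dataset):

```python
import pandas as pd

# Illustrative right-skewed columns; values invented for the sketch
df = pd.DataFrame({
    "total_cases": [3.0, 342.0, 2820.0, 25611.0, 3247684.0],
    "total_deaths": [0.0, 9.0, 53.0, 386.0, 134814.0],
})

# Positive skew confirms a long right tail; high kurtosis flags heavy tails
print(df.skew())
print(df.kurtosis())
```

Values of skew well above zero, as here, are consistent with the mean sitting far above the median.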
- View the distribution of total cases.
plt.hist(covidtotals['total_cases']/1000, bins=12)
plt.title("Total Covid Cases")
plt.xlabel('Cases')
plt.ylabel("Number of Countries")
plt.show()
Output: (a histogram of total COVID cases in thousands, heavily right-skewed
with most countries in the lowest bin)
The preceding steps demonstrated the use of describe and Matplotlib’s
hist method, which are essential tools when working with continuous
variables.
How It Works
We use the describe method in step 3 to examine summary statistics and
the distribution of the key variables. It is often a red flag when the mean
and the median (the 50% value) differ dramatically. Cases and deaths are
heavily skewed to the right, as reflected in the mean being much higher than
the median. This alerts us to the presence of outliers at the upper end, and
it remains true even after adjusting for population size, as both
total_cases_pm and total_deaths_pm show the same skew. We do more analysis
of outliers in the next section.
The more detailed percentile data in step 4 further supports this sense of
skewness. For instance, the gaps between the 90th and 100th percentile
values for cases and deaths are substantial. These are good first indicators
that we are not dealing with normally distributed data. Even when this is
not due to errors, it matters for the statistical testing we will do down
the road; on the list of things we want to note when asked, "How does the
data look?", this is one of the first things to say. We should also note the
large number of zero values for total deaths, over 10% of countries, which
will also matter for statistical testing.
The histogram of total cases confirms that much of the distribution lies
between 0 and 150,000, with a few outliers and one extreme outlier.
Visually, the distribution looks much more lognormal than normal. Lognormal
distributions have fatter tails and do not have negative values.
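Two of the checks above can be scripted directly: counting the share of zero values, and log-transforming before plotting since the data looks lognormal. A sketch with made-up values (np.log1p, i.e. log(1 + x), is used so the zeros do not break the transform):

```python
import numpy as np
import pandas as pd

# Made-up values including zeros, standing in for total_deaths
deaths = pd.Series([0.0, 0.0, 0.0, 9.0, 53.0, 386.0, 4719.0, 134814.0])

# Share of observations that are exactly zero
zero_share = (deaths == 0).mean()
print(f"{zero_share:.0%} of values are zero")

# log1p tolerates zeros; a histogram of these values, e.g.
# plt.hist(np.log1p(deaths)), is far closer to symmetric
logged = np.log1p(deaths)
print(logged.round(2))
```

The same zero-share check applied to the real total_deaths column is what yields the over-10% figure noted above.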
See Also
We take a closer look at outliers and unexpected values in the next section.
We do much more with visualizations in section 5, Using Visualizations for
Exploratory Analysis.
Conclusion
Generating summary statistics for continuous variables is an essential skill that provides crucial insight into your data. With the right tools and a clear sense of your dataset's structure, the describe, quantile, and histogram steps above form a solid foundation for effective analysis. The sections that follow build on these basics, looking more closely at outliers, unexpected values, and visualization for exploratory analysis.