15 — Pandas Data Cleaning: Generating Frequencies For Categorical Variables
Understanding a dataset often begins with its simplest patterns, and one of the most important steps is generating frequencies for categorical variables. Whether you are tackling a complex dataset or just getting started, this skill shapes how you analyze and interpret your data. It is not just about organizing values; it is about uncovering trends, informing predictions, and ultimately driving better decisions. Knowing how to generate frequencies effectively matters at every step, from choosing the right tools to understanding how they work behind the scenes. There is more to it than meets the eye, too: advanced techniques can take your skills further, opening doors to deeper insights and greater precision. And if you are eager to explore further, related topics and resources can guide your journey into data manipulation and analysis.
Table of Contents
- Generating frequencies for categorical variables
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Generating Frequencies For Categorical Variables
Many years ago, a very seasoned researcher said to me, "90% of what we are
going to find, we will see in the frequency distributions." That message has
stayed with me. The more one-way and two-way frequency distributions
(crosstabs) I run on a DataFrame, the better I understand it. We will do one-way distributions in this recipe and crosstabs in subsequent recipes.
Getting Ready
We continue our work with the NLS data. We will also be doing a fair bit of
column selection using filter methods. It is not necessary to review the earlier
recipe on column selection, but it might be helpful.
How To Do It
We use pandas tools to generate frequencies, particularly the very handy
value_counts.
- Load the pandas library and the nls97 file. Also convert the columns with object data type to category data type.
import pandas as pd
nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)
nls97.loc[:, nls97.dtypes == 'object'] = \
    nls97.select_dtypes(['object']). \
    apply(lambda x: x.astype('category'))
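If you want to confirm the conversion worked before moving on, a quick check, not part of the original steps, is to count the columns by data type:
# count columns by data type; most should now be category
nls97.dtypes.value_counts()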
- Show the names of the columns with the category data type and check
the number of missing values.
Notice that there are no missing values for gender and relatively few for
highestdegree, but many for maritalstatus and a number of other columns.
catcols = nls97.select_dtypes(include=["category"]).columns
nls97[catcols].isnull().sum()
Output:
gender 0
maritalstatus 2312
weeklyhrscomputer 2274
weeklyhrstv 2273
highestdegree 31
...
colenroct15 1515
colenrfeb16 1948
colenroct16 2251
colenrfeb17 2251
colenroct17 2250
Length: 57, dtype: int64
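To see at a glance which columns are missing the most values, you can sort the counts; this is an extra step beyond the recipe:
# sort missing-value counts in descending order and show the worst offenders
nls97[catcols].isnull().sum().sort_values(ascending=False).head()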
- Show the frequencies for marital status.
nls97.maritalstatus.value_counts()
Output:
Married 3066
Never-married 2766
Divorced 663
Separated 154
Widowed 23
Name: maritalstatus, dtype: int64
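Note that value_counts excludes missing values by default, which is why the 2,312 missing values for maritalstatus from step 2 do not appear in these counts. If you want missing values counted as their own category, pass dropna=False, a variation not shown in the recipe:
# include NaN as a category in the counts
nls97.maritalstatus.value_counts(dropna=False)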
- Turn off sorting by frequency.
nls97.maritalstatus.value_counts(sort=False)
Output:
Divorced 663
Married 3066
Never-married 2766
Separated 154
Widowed 23
Name: maritalstatus, dtype: int64
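With sort=False, pandas returns the counts in the order of the underlying categories, which is alphabetical here because astype('category') creates categories in sorted order; that is why Divorced comes first. You can inspect that order directly, an extra check beyond the recipe:
# show the category order used when sort=False
nls97.maritalstatus.cat.categories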
- Show percentages instead of counts.
nls97.maritalstatus.value_counts(sort=False, normalize=True)
Output:
Divorced 0.10
Married 0.46
Never-married 0.41
Separated 0.02
Widowed 0.00
Name: maritalstatus, dtype: float64
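Two details are worth flagging here. First, normalize=True computes proportions of the non-missing values only, so these percentages sum to 1 across the observed categories. Second, the two-decimal display suggests a float format option was set before these steps were run; a sketch of such a setting, which is my assumption rather than something shown in the recipe, would be:
# display floats with two decimal places (assumed display setting)
pd.set_option('display.float_format', '{:,.2f}'.format)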
- Show the percentages for all government responsibility columns.
Filter the DataFrame for just the government responsibility columns, then use
apply to run value_counts on all columns in that DataFrame.
nls97.filter(like="gov").apply(pd.value_counts, normalize=True)
Output:
govprovidejobs govpricecontrols ... \
1. Definitely 0.25 0.54 ...
2. Probably 0.34 0.33 ...
3. Probably not 0.25 0.09 ...
4. Definitely not 0.16 0.04 ...
govdecenthousing govprotectenvironment
1. Definitely 0.44 0.67
2. Probably 0.43 0.29
3. Probably not 0.10 0.03
4. Definitely not 0.02 0.02
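One caution: passing the top-level pd.value_counts function to apply works here, but that top-level function has been deprecated in newer pandas releases. An equivalent form that avoids it, if you are on a recent version, is:
# run value_counts column by column via a lambda instead of pd.value_counts
nls97.filter(like="gov").apply(lambda col: col.value_counts(normalize=True))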
- Find the percentages for all government responsibility columns of
people who are married.
Do what we did in step 6, but first select only rows with marital status equal
to Married.
nls97[nls97.maritalstatus=="Married"].\
    filter(like="gov").\
    apply(pd.value_counts, normalize=True)
Output:
govprovidejobs govpricecontrols ... \
1. Definitely 0.17 0.46 ...
2. Probably 0.33 0.38 ...
3. Probably not 0.31 0.11 ...
4. Definitely not 0.18 0.05 ...
govdecenthousing govprotectenvironment
1. Definitely 0.36 0.64
2. Probably 0.49 0.31
3. Probably not 0.12 0.03
4. Definitely not 0.03 0.01
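If you want this breakdown for every marital status at once, rather than just for married respondents, one option, an extension of mine rather than part of the recipe, is to group by maritalstatus before counting, shown here for a single column:
# proportions of govprovidejobs responses within each marital status group
nls97.groupby('maritalstatus')['govprovidejobs'].value_counts(normalize=True)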
- Find the frequencies and percentages for all category columns in the
DataFrame.
First, open a file to write out the frequencies.
freqout = open('views/frequencies.txt', 'w')
for col in nls97.select_dtypes(include=["category"]):
    print(col, "----------------------", "frequencies",
        nls97[col].value_counts(sort=False), "percentages",
        nls97[col].value_counts(normalize=True, sort=False),
        sep="\n\n", end="\n\n\n", file=freqout)
freqout.close()
This generates a file, the beginning of which looks like this.
Output:
gender
----------------------
frequencies
Female 4385
Male 4599
Name: gender, dtype: int64
percentages
Female 0.49
Male 0.51
Name: gender, dtype: float64
As these steps demonstrate, value_counts is quite useful when we need to
generate frequencies for one or more columns of a DataFrame.
How It Works
Most of the columns in the nls97 DataFrame (57 out of 88) have the object
data type. If we are working with data that is logically categorical, but does
not have a category data type in pandas, there are good reasons to convert it
to the category type. Not only does this save memory (the sketch at the end of
this section shows how to check), it also makes data cleaning a little easier,
as we saw in this recipe.
The star of the show for this recipe is the value_counts method. It can
generate frequencies for a series, as we do with
nls97.maritalstatus.value_counts. It can also be run on a whole
DataFrame, as we do with
nls97.filter(like="gov").apply(pd.value_counts, normalize=True).
We first create a DataFrame with just the government responsibility columns
and then pass the resulting DataFrame to value_counts with apply.
You probably noticed that in step 7, I split the chaining over several lines to
make it easier to read. There is no rule about when it makes sense to do that. I
generally try to do it whenever the chaining involves three or more operations.
In step 8, we iterate over all of the columns with the category data type:
for col in nls97.select_dtypes(include=["category"]). For each of
those columns, we run value_counts once to get frequencies and again to get
percentages. We use the print function so that we can generate the carriage
returns necessary to make the output readable. All of this is saved to the
frequencies.txt file in the views subfolder. I find it handy to have a bunch
of one-way frequencies around to check before doing any work with
categorical variables. Step 8 accomplishes that.
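To make the memory point above concrete, you can compare a column's footprint before and after conversion; a minimal sketch, assuming the nls97 DataFrame from step 1:
# bytes used by maritalstatus as category vs. as object
nls97.maritalstatus.memory_usage(deep=True)
nls97.maritalstatus.astype('object').memory_usage(deep=True)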
There’s More
Frequency distributions may be the most important statistical tool for
discovering potential data issues with categorical data. The one-way
frequencies we generate in this recipe are a good foundation for further
insights. However, we often only detect problems once we examine the
relationships between categorical variables and other variables, categorical or
continuous.
Although we stop short of doing two-way frequencies in this recipe, we do
start the process of splitting up the data for investigation in step 7. In that
step, we look at government responsibility responses for married individuals
and see that those responses differ from those for the sample overall. This
raises several questions about our data that we need to explore. Are there
important differences in response rates by marital status, and might this
matter for the distribution of the government responsibility variables? We
also want to be careful about drawing conclusions before considering
potential confounding variables. Are married respondents likely to be older or
to have more children, and are those more important factors in their
government responsibility answers?
I am using the marital status variable as an example of the kind of questions
that one-way frequencies, like the ones in this recipe, are likely to generate. It
is always good to have some bivariate analyses (a correlation matrix, some
crosstabs, or a few scatter plots) at the ready should questions like these come
up. We will generate those in the next two recipes.
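As a preview of the bivariate tools mentioned above, a two-way frequency can be generated with pd.crosstab; a minimal sketch, with the crosstabs themselves covered in the recipes that follow:
# proportions of highest degree attained within each marital status
pd.crosstab(nls97.maritalstatus, nls97.highestdegree, normalize='index')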
Conclusion
In conclusion, generating frequencies for categorical variables is a fundamental technique for gaining deeper insight into your data and making more informed decisions. Getting ready for the task means understanding the structure of your dataset, preparing the categories, and determining the goals of your analysis. Once you know how to do it, built-in tools such as value_counts make it quick to calculate frequencies accurately, and seeing how it works in practice makes it easier to spot the patterns, anomalies, and trends that drive action. There is more beyond the basics, too: these practices can be extended with two-way frequencies and integrated into broader data pipelines. The related techniques and deeper statistical analyses covered in the recipes that follow complement frequency generation, strengthening your overall data analysis toolkit for more robust and reliable results.