15 — Pandas Data Cleaning: Generating Frequencies For Categorical Variables


Unlocking the power of data often begins with understanding its hidden patterns, and one of the most critical steps is generating frequencies for categorical variables. Whether you are preparing to tackle complex datasets or just getting started, mastering this skill can transform the way you analyze and interpret information. It's not just about organizing data; it's about uncovering trends, making predictions, and ultimately driving better decisions. As you get ready to dive into the world of data frequencies, knowing how to do it effectively becomes key. From choosing the right tools to understanding how it works behind the scenes, each step is crucial to ensuring accurate results. But there's more to it than meets the eye: advanced techniques can take your skills to the next level, opening doors to deeper insights and greater precision. And if you are eager to explore further, there's a wealth of related topics and resources to guide your journey into the exciting world of data manipulation and analysis.

Table of Contents

  • Generating frequencies for categorical variables
  • Getting ready
  • How to do it
  • How it works
  • There's more
  • Conclusion

Generating Frequencies For Categorical Variables

Many years ago, a very seasoned researcher said to me, "90% of what we are going to find, we will see in the frequency distributions." That message has stayed with me. The more one-way and two-way frequency distributions (crosstabs) I run on a DataFrame, the better I understand it. We will do one-way distributions in this recipe and crosstabs in subsequent recipes.

Getting Ready

We continue our work with the NLS data. We will also be doing a fair bit of column selection using filter methods. It is not necessary to review the earlier recipe on column selection before this one, but it might be helpful.

How To Do It

We use pandas tools to generate frequencies, particularly the very handy
value_counts.

  • Load the pandas library and the nls97 file. Also convert the columns with object data type to category data type.

import pandas as pd

nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)

# Convert every object column to the category data type
nls97.loc[:, nls97.dtypes == 'object'] = \
    nls97.select_dtypes(['object']). \
    apply(lambda x: x.astype('category'))
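Depending on your pandas version, assigning the converted columns back with .loc may or may not preserve the category data type, so it is worth verifying the result. This small check is not part of the original steps:

# Tally columns by data type; most should now show as category
nls97.dtypes.value_counts()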
  • Show the names for columns with the category data type and check for the number of missing values.

Notice that there are no missing values for gender and few for highestdegree, but many for maritalstatus and other columns.

catcols = nls97.select_dtypes(include=["category"]).columns
nls97[catcols].isnull().sum()

Output:

gender 0
maritalstatus 2312
weeklyhrscomputer 2274
weeklyhrstv 2273
highestdegree 31
...
colenroct15 1515
colenrfeb16 1948
colenroct16 2251
colenrfeb17 2251
colenroct17 2250
Length: 57, dtype: int64
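If you want to surface the most incomplete columns first, sorting the same missing-value counts is a small, optional addition to the recipe:

# Show the category columns with the most missing values first
nls97[catcols].isnull().sum().sort_values(ascending=False).head()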
  • Show the frequencies for marital status.
nls97.maritalstatus.value_counts()

Output:

Married 3066
Never-married 2766
Divorced 663
Separated 154
Widowed 23
Name: maritalstatus, dtype: int64
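Since maritalstatus has 2,312 missing values, it can be useful to count them alongside the categories. value_counts supports this through its dropna parameter; a small addition, not part of the original steps:

# Include missing values as their own row in the frequency table
nls97.maritalstatus.value_counts(dropna=False)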
  • Turn off sorting by frequency.
nls97.maritalstatus.value_counts(sort=False)

Output:

Divorced 663
Married 3066
Never-married 2766
Separated 154
Widowed 23
Name: maritalstatus, dtype: int64
  • Show percentages instead of counts.
nls97.maritalstatus.value_counts(sort=False, normalize=True)

Output:

Divorced 0.10
Married 0.46
Never-married 0.41
Separated 0.02
Widowed 0.00
Name: maritalstatus, dtype: float64
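If you want counts and percentages in a single table rather than two separate calls, concatenating the two results along the columns works; a minimal sketch, not part of the original recipe:

# Counts and percentages side by side for one column
counts = nls97.maritalstatus.value_counts(sort=False)
percents = nls97.maritalstatus.value_counts(sort=False, normalize=True)
pd.concat([counts, percents], axis=1, keys=["count", "percent"])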
  • Show the percentages for all government responsibility columns.

Filter the DataFrame for just the government responsibility columns, then use
apply to run value_counts on all columns in that DataFrame.

nls97.filter(like="gov").apply(pd.value_counts, normalize=True)

Output:

govprovidejobs govpricecontrols ... \
1. Definitely 0.25 0.54 ...
2. Probably 0.34 0.33 ...
3. Probably not 0.25 0.09 ...
4. Definitely not 0.16 0.04 ...
govdecenthousing govprotectenvironment
1. Definitely 0.44 0.67
2. Probably 0.43 0.29
3. Probably not 0.10 0.03
4. Definitely not 0.02 0.02
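One caveat: the top-level pd.value_counts function is deprecated in recent pandas releases. If your version warns or errors on it, an equivalent is to apply the Series method directly:

# Same result using the Series method instead of the deprecated pd.value_counts
nls97.filter(like="gov").apply(lambda x: x.value_counts(normalize=True))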
  • Find the percentages for all government responsibility columns of
    people who are married.

Do what we did in step 6, but first select only rows with marital status equal
to Married.

nls97[nls97.maritalstatus=="Married"].\
    filter(like="gov").\
    apply(pd.value_counts, normalize=True)

Output:

govprovidejobs govpricecontrols ... \
1. Definitely 0.17 0.46 ...
2. Probably 0.33 0.38 ...
3. Probably not 0.31 0.11 ...
4. Definitely not 0.18 0.05 ...
govdecenthousing govprotectenvironment
1. Definitely 0.36 0.64
2. Probably 0.49 0.31
3. Probably not 0.12 0.03
4. Definitely not 0.03 0.01
  • Find the frequencies and percentages for all category columns in the
    DataFrame.

First, open a file to write out the frequencies.

freqout = open('views/frequencies.txt', 'w')

# For each category column, write both frequencies and percentages to the file
for col in nls97.select_dtypes(include=["category"]):
    print(col, "----------------------", "frequencies",
          nls97[col].value_counts(sort=False), "percentages",
          nls97[col].value_counts(normalize=True, sort=False),
          sep="\n\n", end="\n\n\n", file=freqout)

freqout.close()

This generates a file, the beginning of which looks like this.

Output:

gender
----------------------
frequencies
Female 4385
Male 4599
Name: gender, dtype: int64
percentages
Female 0.49
Male 0.51
Name: gender, dtype: float64
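If you prefer not to manage the file handle manually, the same loop can be written with a context manager, which closes the file even if an error occurs; a minimal variant of step 8:

# Variant of step 8: the with block closes the file automatically
with open('views/frequencies.txt', 'w') as freqout:
    for col in nls97.select_dtypes(include=["category"]):
        print(col, "----------------------", "frequencies",
              nls97[col].value_counts(sort=False), "percentages",
              nls97[col].value_counts(normalize=True, sort=False),
              sep="\n\n", end="\n\n\n", file=freqout)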

As these steps demonstrate, value_counts is quite useful when we need to
generate frequencies for one or more columns of a DataFrame.

How It Works

Most of the columns in the nls97 DataFrame (57 out of 88) have the object data type. If we are working with data that is logically categorical but does not have a category data type in pandas, there are good reasons to convert it to the category type. Not only does this save memory, it also makes data cleaning a little easier, as we saw in this recipe.
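To see the memory savings concretely, you can compare a single column's footprint under both data types; a quick, optional check (the exact numbers will vary with your data):

# Rough memory comparison: object strings vs. category integer codes
print(nls97.maritalstatus.astype('object').memory_usage(deep=True))
print(nls97.maritalstatus.astype('category').memory_usage(deep=True))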
The star of the show for this recipe is the value_counts method. It can generate frequencies for a series, as we do with nls97.maritalstatus.value_counts. It can also be run on a whole DataFrame, as we do with nls97.filter(like="gov").apply(pd.value_counts, normalize=True). We first create a DataFrame with just the government responsibility columns and then pass the resulting DataFrame to value_counts with apply.

You probably noticed that in step 7, I split the chaining over several lines to make it easier to read. There is no rule about when it makes sense to do that; I generally try to do it whenever the chaining involves three or more operations.

In step 8, we iterate over all of the columns with the category data type: for col in nls97.select_dtypes(include=["category"]). For each of those columns, we run value_counts to get frequencies, and value_counts again to get percentages. We use a print function so that we can generate the carriage returns necessary to make the output readable. All of this is saved to the frequencies.txt file in the views subfolder. I find it handy to have a bunch of one-way frequencies around to check before doing any work with categorical variables. Step 8 accomplishes that.

There’s More

Frequency distributions may be the most important statistical tool for discovering potential data issues with categorical data. The one-way frequencies we generate in this recipe are a good foundation for further insights. However, we often only detect problems once we examine the relationships between categorical variables and other variables, categorical or continuous.

Although we stop short of doing two-way frequencies in this recipe, we do start the process of splitting up the data for investigation in step 7. In that step, we look at government responsibility responses for married individuals and see that those responses differ from those for the sample overall. This raises several questions about our data that we need to explore. Are there important differences in response rates by marital status, and might this matter for the distribution of the government responsibility variables? We also want to be careful about drawing conclusions before considering potential confounding variables. Are married respondents likely to be older or to have more children, and are those more important factors in their government responsibility answers?

I am using the marital status variable as an example of the kind of questions that one-way frequencies, like the ones in this recipe, are likely to generate. It is always good to have some bivariate analyses (a correlation matrix, some crosstabs, or a few scatter plots) at the ready should questions like these come up. We will generate those in the next two sections.
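As a preview of the two-way frequencies to come, pd.crosstab builds a crosstab directly from two columns; a minimal sketch using columns from this dataset:

# Two-way frequency distribution (crosstab) of marital status by highest degree
pd.crosstab(nls97.maritalstatus, nls97.highestdegree)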

Conclusion

In conclusion, generating frequencies for categorical variables is a fundamental technique that empowers you to gain deeper insights into your data, helping you make more informed decisions. Getting ready for this task involves understanding the structure of your dataset, preparing the categories, and determining the specific goals you want to achieve. Once you know how to do it, leveraging built-in tools and functions simplifies the process, ensuring that you can quickly and accurately calculate the frequencies of categorical variables. As you see how it works in practice, the visualization and analysis of data become seamless, allowing you to easily spot patterns, anomalies, and trends that drive actionable business intelligence. But there's more beyond the basics: you can explore advanced methodologies and integrate these practices into broader data pipelines. Related techniques and deeper statistical analyses complement categorical frequency generation, enhancing your overall data analytics toolkit for more robust, reliable, and impactful outcomes.
