21 — Pandas Data Cleaning: Using Subsetting To Examine Logical Inconsistencies in Variable Relationships


In the world of data analysis, uncovering hidden insights requires a keen eye for detail and a willingness to question the relationships between variables. Using subsetting to examine logical inconsistencies not only sharpens that insight but also reveals underlying patterns that could easily be missed. Before diving into the practicalities of these techniques, it is worth understanding how to approach the process and what steps to follow to get the most out of this powerful method. By understanding how subsetting works and its inner mechanics, you will unlock a new level of precision in your analyses, revealing not just what the data shows but how it behaves in unexpected ways. And that is just the beginning: each new perspective opens the door to further possibilities. For those looking to deepen their understanding and discover related methods, we also point you toward resources that can expand your expertise and sharpen your analytical skills even further.

Table of Contents

  • Using subsetting to examine logical inconsistencies in variable relationships
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Using Subsetting To Examine Logical Inconsistencies in Variable Relationships

At a certain point, data issues come down to deductive logic problems, such as variable x has to be greater than some quantity a when variable y is less than some quantity b. Once we are through some initial data cleaning, it is important to check for logical inconsistencies. Pandas makes this kind of error checking relatively straightforward with subsetting tools such as loc and Boolean indexing. These can be combined with summary methods on series and data frames, allowing us to easily compare values for a particular row to values for the whole dataset or some subset of rows. We can also easily aggregate over columns. Just about any question we might have about the logical relationships between variables can be answered with these tools. We work through some examples in this recipe.
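
As a quick illustration of the general pattern, here is a minimal sketch using a small made-up data frame (not the NLS data) that shows how loc, Boolean indexing, and a summary method combine to flag rows that violate an expected relationship between two variables:

import pandas as pd

# hypothetical data: hours worked and pay received
df = pd.DataFrame({'hours': [0, 35, 40], 'pay': [500, 700, 800]})

# flag rows where pay is positive but hours worked is 0
print(df.loc[(df.hours == 0) & (df.pay > 0)])

# compare each row's pay to a summary statistic for the whole data frame
print(df.loc[df.pay > df.pay.mean()])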

Getting Ready

We will work with the National Longitudinal Survey of Youth (NLS), mainly with data on employment and education. We use apply and lambda functions several times in this recipe, but go into more detail on their use in Section 7, Fixing Messy Data when Aggregating. It is not necessary to review Section 7 to follow along, however, even if you have no experience with those tools.
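
If apply and lambda are unfamiliar, the following minimal sketch (on a made-up data frame, not the NLS data) shows the pattern this recipe relies on: apply runs the lambda once per column, and the lambda returns a Boolean series built from each value's first character:

import pandas as pd

# hypothetical data frame of coded string values
df = pd.DataFrame({'a': ['1. No', '3. Yes'], 'b': ['3. Yes', '1. No']})

# test the first character of every value in every column
print(df.apply(lambda x: x.str[0:1] == '3'))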

The NLS, administered by the United States Bureau of Labor Statistics, is a longitudinal survey of individuals who were in high school in 1997, when the survey started. Participants were surveyed each year through 2017.

How To Do It

We run a number of logical checks on the NLS data, such as individuals with postgraduate enrollment but no undergraduate enrollment, or with wage income but no weeks worked. We also check for large changes in key values for a given individual from one period to the next.

  • Import pandas and then load the NLS data.
import pandas as pd
nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)
  • Look at some of the employment and education data.

The dataset has weeks worked each year from 2000 through 2017, and college enrollment status each month from February 1997 through October 2017. We use the ability of the loc accessor to choose all columns from the column indicated on the left of the colon through the column indicated on the right, for example, nls97.loc[:,"colenroct09":"colenrfeb14"].

nls97[['wageincome','highestgradecompleted','highestdegree']].head(3).T

Output:

personid 100061 100139 100284
wageincome 12,500 120,000 58,000
highestgradecompleted 13 12 7
highestdegree 2. High School 2. High School 0. None
nls97.loc[:, "weeksworked12":"weeksworked17"].head(3).T

Output:

personid 100061 100139 100284
weeksworked12 40 52 0
weeksworked13 52 52 nan
weeksworked14 52 52 11
weeksworked15 52 52 52
weeksworked16 48 53 47
weeksworked17 48 52 0
nls97.loc[:, "colenroct09":"colenrfeb14"].head(2).T

Output:

personid 100061 100139
colenroct09 1. Not enrolled 1. Not enrolled
colenrfeb10 1. Not enrolled 1. Not enrolled
colenroct10 1. Not enrolled 1. Not enrolled
colenrfeb11 1. Not enrolled 1. Not enrolled
colenroct11 3. 4-year college 1. Not enrolled
colenrfeb12 3. 4-year college 1. Not enrolled
colenroct12 3. 4-year college 1. Not enrolled
colenrfeb13 1. Not enrolled 1. Not enrolled
colenroct13 1. Not enrolled 1. Not enrolled
colenrfeb14 1. Not enrolled 1. Not enrolled
  • Show individuals with wage income but no weeks worked.

The wage income variable reflects wage income for 2016.

nls97.loc[(nls97.weeksworked16==0) & (nls97.wageincome>0), ['weeksworked16','wageincome']]

Output:

weeksworked16 wageincome
personid
102625 0 1,200
109403 0 5,000
118704 0 25,000
130701 0 12,000
131151 0 65,000
... ... ...
957344 0 90,000
966697 0 65,000
969334 0 5,000
991756 0 9,000
992369 0 35,000
[145 rows x 2 columns]
  • Check whether an individual was ever enrolled in a 4-year college.

Chain several methods. First, create a data frame with the columns that start with colenr (nls97.filter(like="colenr")). These are the college enrollment columns for October and February of each year. Then, use apply to run a lambda function that examines the first character of each colenr column (apply(lambda x: x.str[0:1]=='3')). This returns a value of True or False for all of the college enrollment columns: True if the first character of the string is 3, meaning enrollment at a 4-year college. Finally, use the any method to test whether any of the values returned from the previous step is True (any(axis=1)). This identifies whether the individual was enrolled in a 4-year college at any point between February 1997 and October 2017. The first statement here shows the results of the first two steps for explanatory purposes only. Only the second statement needs to be run to get the desired result: whether the individual was enrolled at a 4-year college at some point.

nls97.filter(like="colenr").apply(lambda x: x.str[0:1]=='3').head(2).T

Output:

personid 100061 100139
...
colenroct09 False False
colenrfeb10 False False
colenroct10 False False
colenrfeb11 False False
colenroct11 True False
colenrfeb12 True False
colenroct12 True False
colenrfeb13 False False
colenroct13 False False
colenrfeb14 False False
nls97.filter(like="colenr"). \ apply(lambda x: x.str[0:1]=='3').\ any(axis=1).head(2)

Output:

personid
100061 True
100139 False
dtype: bool
  • Show individuals with postgraduate enrollment but no bachelor’s enrollment.

We can use what we tested in step 4 to do some checking. We want individuals who have 4 (graduate enrollment) as the first character of a colenr value in any month, but who never had 3 (bachelor enrollment). Note the “~” before the second half of the test, for negation. There are 22 individuals who fall into this category.

nobach = nls97.loc[nls97.filter(like="colenr").\
apply(lambda x: x.str[0:1]=='4').\
any(axis=1) & ~nls97.filter(like="colenr").\
apply(lambda x: x.str[0:1]=='3').\
any(axis=1), "colenrfeb97":"colenroct17"]
len(nobach)

Output:

22
nobach.head(2).T

Output:

personid 153051 154535
...
colenroct08 1. Not enrolled 1. Not enrolled
colenrfeb09 1. Not enrolled 1. Not enrolled
colenroct09 1. Not enrolled 1. Not enrolled
colenrfeb10 1. Not enrolled 1. Not enrolled
colenroct10 1. Not enrolled 4. Graduate program
colenrfeb11 1. Not enrolled 4. Graduate program
colenroct11 1. Not enrolled 4. Graduate program
colenrfeb12 1. Not enrolled 4. Graduate program
colenroct12 1. Not enrolled 4. Graduate program
colenrfeb13 4. Graduate program 4. Graduate program
colenroct13 1. Not enrolled 4. Graduate program
colenrfeb14 4. Graduate program 4. Graduate program
colenroct14 4. Graduate program 4. Graduate program
colenrfeb15 1. Not enrolled 4. Graduate program
colenroct15 1. Not enrolled 4. Graduate program
colenrfeb16 1. Not enrolled 4. Graduate program
colenroct16 1. Not enrolled 4. Graduate program
colenrfeb17 1. Not enrolled 4. Graduate program
colenroct17 1. Not enrolled 4. Graduate program
  • Show individuals with bachelor’s degrees or more, but no 4-year college enrollment.

Use isin to compare the first character in highestdegree with all of the values in a list (nls97.highestdegree.str[0:1].isin(['4','5','6','7'])).

nls97.highestdegree.value_counts().sort_index()

Output:

0. None 953
1. GED 1146
2. High School 3667
3. Associates 737
4. Bachelors 1673
5. Masters 603
6. PhD 54
7. Professional 120
Name: highestdegree, dtype: int64
no4yearenrollment = \
nls97.loc[nls97.highestdegree.str[0:1].\
isin(['4','5','6','7']) & \
~nls97.filter(like="colenr").\
apply(lambda x: x.str[0:1]=='3').\
any(axis=1), "colenrfeb97":"colenroct17"]
len(no4yearenrollment)

Output:

39
no4yearenrollment.head(3).T

Output:

personid 113486 118749
colenroct01 2. 2-year college 1. Not enrolled
colenrfeb02 2. 2-year college 1. Not enrolled
colenroct02 2. 2-year college 1. Not enrolled
colenrfeb03 2. 2-year college 1. Not enrolled
colenroct03 2. 2-year college 1. Not enrolled
colenrfeb04 2. 2-year college 1. Not enrolled
colenroct04 1. Not enrolled 1. Not enrolled
colenrfeb05 1. Not enrolled 1. Not enrolled
colenroct05 1. Not enrolled 1. Not enrolled
colenrfeb06 1. Not enrolled 1. Not enrolled
colenroct06 1. Not enrolled 1. Not enrolled
colenrfeb07 1. Not enrolled 2. 2-year college
colenroct07 1. Not enrolled 2. 2-year college
colenrfeb08 1. Not enrolled 1. Not enrolled
  • Show individuals with a high wage income.

Define high wages as 3 standard deviations above the mean. It looks as
though wage income values have been truncated at $235,884.

highwages = \
nls97.loc[nls97.wageincome > nls97.wageincome.mean() + \
(nls97.wageincome.std()*3), ['wageincome']]

print(highwages)

Output:

wageincome
personid
131858 235,884
133619 235,884
151863 235,884
164058 235,884
164897 235,884
... ...
964406 235,884
966024 235,884
976141 235,884
983819 235,884
989896 235,884
[121 rows x 1 columns]
  • Show individuals with large changes in weeks worked for the most recent years.

Calculate the average value for weeks worked between 2012 and 2016 for each person (nls97.loc[:, "weeksworked12":"weeksworked16"].mean(axis=1)). We indicate axis=1 to calculate the mean across columns for each individual, rather than over individuals. We then check whether that mean is either less than 50% of the weeks worked in 2017 or more than twice that value. We also exclude rows where weeks worked in 2017 is null, since those would satisfy the criteria trivially. There are 1,160 individuals with sharp changes in weeks worked in 2017.

workchanges = nls97.loc[~nls97.loc[:,
"weeksworked12":"weeksworked16"].mean(axis=1).\
between(nls97.weeksworked17*0.5,\
nls97.weeksworked17*2) \
& ~nls97.weeksworked17.isnull(), "weeksworked12":"weeksworked17"]
len(workchanges)

Output:

1160
workchanges.head(6).T

Output:

personid 100284 101526 ... 102228 102454
weeksworked12 0 0 ... 52 52
weeksworked13 nan 0 ... 52 52
weeksworked14 11 0 ... 17 7
weeksworked15 52 0 ... 0 0
weeksworked16 47 0 ... 0 0
weeksworked17 0 45 ... 0 0
  • Show inconsistencies in the highest grade completed and the highest degree.

Use the crosstab function to show highestgradecompleted by highestdegree for people with highestgradecompleted less than 12. A good number of these individuals indicate that they have completed high school, which is unusual in the United States if the highest grade completed is less than 12.

ltgrade12 = nls97.loc[nls97.highestgradecompleted<12, ['highestgradecompleted','highestdegree']]
pd.crosstab(ltgrade12.highestgradecompleted, ltgrade12.highestdegree)

Output:

highestdegree 0. None 1. GED 2. High School
highestgradecompleted
5 0 0 1
6 11 5 0
7 24 6 1
8 113 78 7
9 112 169 8
10 111 204 13
11 120 200 41

These steps reveal a number of logical inconsistencies in the NLS data.

How It Works

The syntax required to do the kind of subsetting that we have done in this recipe may seem a little complicated if you are seeing it for the first time. You do get used to it, however, and it allows you to quickly run almost any query against the data that you might imagine. Some of the inconsistencies or unexpected values suggest either respondent error or entry error, and so warrant further investigation. It is hard to explain positive values for wage income when weeks worked is 0. Other unexpected values might not be data problems at all, but might suggest that we should be careful about how we use that data. For example, we might not want to use the weeks worked in 2017 by itself. Instead, we might consider using three-year averages in many analyses.
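
As a minimal sketch of that last idea (assuming nls97 is loaded and indexed as in step 1 above), a three-year average of weeks worked can be computed with the same mean(axis=1) pattern used earlier, which skips missing years by default:

# assumes nls97 is loaded and indexed as in step 1
# three-year average of weeks worked per person, 2015 through 2017
weeks3yravg = nls97.loc[:, "weeksworked15":"weeksworked17"].mean(axis=1)
print(weeks3yravg.head(3))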

See Also

The Selecting and organizing columns and Selecting rows recipes in the Taking the Measure of Your Data section demonstrate some of the techniques for subsetting data used here. We examine apply functions in more detail in the Fixing Messy Data when Aggregating section.

Conclusion:

In conclusion, leveraging subsetting to examine logical inconsistencies in variable relationships is a powerful technique that enhances data integrity and analysis precision. By carefully preparing your dataset and understanding how subsetting works, you can uncover hidden patterns and contradictions that might otherwise go unnoticed. This step-by-step approach ensures that you not only spot inconsistencies but also address them effectively, leading to more accurate insights and better-informed decision making. And the journey doesn’t stop there: there is always more to explore as you refine your methods and dive deeper into complex data. Be sure to explore additional resources and techniques to continue honing your skills and expanding your analytical toolkit.
