14 — Pandas Data Cleaning: Selecting Rows

A.I Hub
10 min readSep 11, 2024

--

Image owned by Canva

When it comes to diving into any data driven project, selecting the right rows is the first key to unlocking meaningful insights. Getting ready for this step involves more than just data filtering, it requires a solid understanding of what you are looking for and how to pinpoint it efficiently. Once you have a clear vision, knowing how to do it becomes second nature. Whether you are using Python, SQL or any other tool, the process remains straightforward when done right. But understanding how it works behind the scenes can give you an edge, revealing optimization tips and advanced techniques that most overlook. And just when you think you have covered it all, there’s more hidden features and tricks that take your approach to the next level. If you are eager to explore further, be sure to check out related resources, which dive deeper into this ever evolving subject.

Table of Content

  • Selecting rows
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Selecting Rows

When we are taking the measure of our data and otherwise answering the
question, “How does it look?”, we are constantly zooming in and out. We are

looking at aggregated numbers and particular rows. But there are also
important data issues that are only obvious at an intermediate zoom level issues that we only notice when looking at some subset of rows. This recipe

demonstrates how to use the pandas tools for detecting data issues in subsets
of our data.

Getting Ready

We will continue working with the NLS data in this recipe.

How To Do It

We will go over several techniques for selecting rows in a pandas DataFrame.

  • Import pandas and numpy and load the nls97 data.
import pandas as pd
import numpy as np
nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)
  • Use slicing to start at the 1001st
    row and go to the 1004th
    row.

nls97[1000:1004] selects every row starting from the row indicated by the
integer to the left of the colon (1000 , in this case) to, but not including, the
row indicated by the integer to the right of the colon (1004). The row at
1000 is actually the 1001st
row because of zero-based indexing. Each row

appears as a column in the output since we have transposed the resulting
DataFrame.

nls97[1000:1004].T

Output:

personid 195884 195891 195970 195996
gender Male Male Female Female
birthmonth 12 9 3 9
birthyear 1981 1980 1982 1980
highestgradecompleted NaN 12 17 NaN
maritalstatus NaN Never-married Never-married NaN
... ... ... ... ...
colenroct15 NaN 1. Not enrolled 1. Not enrolled NaN
colenrfeb16 NaN 1. Not enrolled 1. Not enrolled NaN
colenroct16 NaN 1. Not enrolled 1. Not enrolled NaN
colenrfeb17 NaN 1. Not enrolled 1. Not enrolled NaN
colenroct17 NaN 1. Not enrolled 1. Not enrolled NaN
  • Use slicing to start at the 1001st
    row and go to the 1004th

    row, skipping
    every other row.

The integer after the second colon (2 in this case) indicates the size of the
step. When the step is excluded it is assumed to be 1. Notice that by setting
the value of the step to 2, we are skipping every other row

nls97[1000:1004:2].T

Output:

personid 195884 195970
gender Male Female
birthmonth 12 3
birthyear 1981 1982
highestgradecompleted NaN 17
maritalstatus NaN Never-married
... ... ...
colenroct15 NaN 1. Not enrolled
colenrfeb16 NaN 1. Not enrolled
colenroct16 NaN 1. Not enrolled
colenrfeb17 NaN 1. Not enrolled
colenroct17 NaN 1. Not enrolled
  • Select the first three rows using head and [] operator slicing.

Note that nls97[:3] returns the same DataFrame as nls97.head(3) . By
not providing a value to the left of the colon in [:3], we are telling the
operator to get rows from the start of the DataFrame.

nls97.head(3).T

Output:

personid 100061 100139 100284
gender Female Male Male
birthmonth 5 9 11
birthyear 1980 1983 1984
... ... ... ...
colenroct15 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenrfeb16 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenroct16 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenrfeb17 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenroct17 1. Not enrolled 1. Not enrolled 1. Not enrolled
nls97[:3].T

Output:

personid 100061 100139 100284
gender Female Male Male
birthmonth 5 9 11
birthyear 1980 1983 1984
... ... ... ...
colenroct15 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenrfeb16 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenroct16 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenrfeb17 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenroct17 1. Not enrolled 1. Not enrolled 1. Not enrolled
  • Select the last three rows using tail and [] operator slicing.

Note that nls97.tail(3) returns the same DataFrame as nls97[-3:].

nls97.tail(3).T

Output:

personid 999543 999698 999963
gender Female Female Female
birthmonth 8 5 9
birthyear 1984 1983 1982
... ... ... ...
colenroct15 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenrfeb16 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenroct16 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenrfeb17 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenroct17 1. Not enrolled 1. Not enrolled 1. Not enrolled
  • Select a few rows using the loc data accessor.

Use the loc accessor to select by index label. We can pass a list of index
labels or we can specify a range of labels. Recall that we have set personid
as the index. Note thatnls97.loc[[195884,195891,195970]] and
nls97.loc[195884:195970] return the same DataFrame.

nls97.loc[[195884,195891,195970]].T

Output:

personid 195884 195891 195970
gender Male Male Female
birthmonth 12 9 3
birthyear 1981 1980 1982
highestgradecompleted NaN 12 17
maritalstatus NaN Never-married Never-married
... ... ... ...
colenroct15 NaN 1. Not enrolled 1. Not enrolled
colenrfeb16 NaN 1. Not enrolled 1. Not enrolled
colenroct16 NaN 1. Not enrolled 1. Not enrolled
colenrfeb17 NaN 1. Not enrolled 1. Not enrolled
colenroct17 NaN 1. Not enrolled 1. Not enrolled
nls97.loc[195884:195970].T

Output:

personid 195884 195891 195970
gender Male Male Female
birthmonth 12 9 3
birthyear 1981 1980 1982
highestgradecompleted NaN 12 17
maritalstatus NaN Never-married Never-married
... ... ... ...
colenroct15 NaN 1. Not enrolled 1. Not enrolled
colenrfeb16 NaN 1. Not enrolled 1. Not enrolled
colenroct16 NaN 1. Not enrolled 1. Not enrolled
colenrfeb17 NaN 1. Not enrolled 1. Not enrolled
colenroct17 NaN 1. Not enrolled 1. Not enrolled
  • Select a row from the beginning of the DataFrame with the iloc data
    accessor.

iloc differs from loc in that it takes a list of row position integers, rather
than index labels. For that reason, it works similarly to bracket operator
slicing. In this step, we first pass a one-item list with the value of 0. That
returns a DataFrame with the first row.

nls97.iloc[[0]].T

Output:

personid 100061
gender Female
birthmonth 5
birthyear 1980
highestgradecompleted 13
maritalstatus Married
... ...
colenroct15 1. Not enrolled
colenrfeb16 1. Not enrolled
colenroct16 1. Not enrolled
colenrfeb17 1. Not enrolled
colenroct17 1. Not enrolled
  • Select a few rows from the beginning of the DataFrame with the iloc
    data accessor.

We pass a three-item list, [0,1,2] , to return a DataFrame of the first three
rows of nls97 . We would get the same result if we passed [0:3] to the accessor.

nls97.iloc[[0,1,2]].T

Output:

personid 999543 999698 999963
gender Female Female Female
birthmonth 8 5 9
birthyear 1984 1983 1982
... ... ... ...
colenroct15 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenrfeb16 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenroct16 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenrfeb17 1. Not enrolled 1. Not enrolled 1. Not enrolled
colenroct17 1. Not enrolled 1. Not enrolled 1. Not enrolled
  • Select multiple rows conditionally using boolean indexing.

Create a DataFrame of just individuals receiving very little sleep. About 5%
of survey respondents got 4 or fewer hours’ sleep per night of the 6,706

individuals who responded to that question. Test who is getting 4 or fewer
hours of sleep with nls97.nightlyhrssleep<=4 , which generates pandas
series of True and False values that we assign to sleepcheckbool. Pass
that series to the loc accessor to create a lowsleep DataFrame. lowsleep

has approximately the number of rows we are expecting. We do not need to
do the extra step of assigning the boolean series to a variable. This is done
here only for explanatory purposes.

nls97.nightlyhrssleep.quantile(0.05)

Output:

4.0
nls97.nightlyhrssleep.count()

Output:

6706
sleepcheckbool = nls97.nightlyhrssleep<=4
sleepcheckbool

Output:

personid
100061 False
100139 False
100284 False
100292 False
100583 False
999291 False
999406 False
999543 False
999698 False
999963 False
Name: nightlyhrssleep, Length: 8984, dtype: bool
lowsleep = nls97.loc[sleepcheckbool]
lowsleep.shape

Output:

(364, 88)
  • Select rows based on multiple conditions.

It may be that folks who are not getting a lot of sleep also have a fair number
of children who live with them. Use describe to get a sense of the
distribution of the number of children for those who have lowsleep. About a
quarter have three or more children. Create a new DataFrame with

individuals who have nightlyhrssleep of 4 or less and the number of
children at home of 3 or more. The & is the logical and operator in pandas
and indicates that both conditions have to be true for the row to be selected.

(We would have gotten the same result if we worked from the lowsleep
DataFrame lowsleep3pluschildren = lowsleep.loc[lowsleep.childathome>=3] but then we would not have been able to demonstrate testing multiple
conditions.

lowsleep.childathome.describe()

Output:

count 293.00
mean 1.79
std 1.40
min 0.00
25% 1.00
50% 2.00
75% 3.00
max 9.00
lowsleep3pluschildren = nls97.loc[(nls97.nightlyhrssleep<=4) & (nls97.childathome>=3)]
lowsleep3pluschildren.shape

Output:

(82, 88)
  • Select rows and columns based on multiple conditions.

Pass the condition to the loc accessor to select rows. Also, pass a list of
column names to select.

lowsleep3pluschildren = nls97.loc[(nls97.nightlyhrssleep<=4) & (
nls97.childathome>=3),['nightlyhrssleep','childathome']]
lowsleep3pluschildren

Output:

nightlyhrssleep childathome
personid
119754 4 4
141531 4 5
152706 4 4
156823 1 3
158355 4 4
... ... ...
905774 4 3
907315 4 3
955166 3 3
956100 4 6
991756 4 3

The preceding steps demonstrated the key techniques for selecting rows in
pandas.

How It Works

We used the [] bracket operator in steps 2 through 5 to do standard Python like slicing to select rows. That operator allows us to easily select rows based
on a list or a range of values indicated with slice notation. This notation takes

the form of [start:end:step], where a value of 1 for step is assumed if

no value is provided. When a negative number is used to start, it
represents the number of rows from the end of theDataFrame. The loc
accessor used in step 6, selects rows based on row index labels. Since
personid is the index for the DataFrame, we can pass a list of one or more
personid values to the loc accessor to get a DataFrame with rows for those

index labels. We can also pass a range of index labels to the accessor, which
will return a DataFrame with all rows having index labels between the label

to the left of the colon and the label to the right (inclusive) so,

nls97.loc[195884:195970] returns a DataFrame for rows with personid. between 195884 and 195970, including those two values. The iloc
accessor works very much like the bracket operator. We see this in step 7
through 9. We can pass either a list of integers or a range using slicing
notation. One of the most valuable pandas capabilities is boolean indexing. It
makes it easy to select rows conditionally. We see this in step 10. A test

returns a boolean series. The loc accessor selects all rows for which the test
is True . We actually didn’t need to assign the boolean data series to the
variable that we then passed to the loc operator. We could have just passed
the test to the loc accessor with
nls97.loc[nls97.nightlyhrssleep<=4] .We should take a closer look at

how we used the loc accessor to select rows in step 11. Each condition in
nls97.loc[(nls97.nightlyhrssleep<=4) & (nls97.childathome>=3)] is

placed in parentheses. An error will be generated if the parentheses are
excluded. The & operator is the equivalent of and in standard Python,
meaning that both conditions have to be True for the row to be selected. We
would have used | for or if we had wanted to select the row if either
condition was True. Finally, step 12 demonstrates how to select both rows
and columns in one call to the loc accessor. The criteria for rows appear
before the comma and the columns to select appear after the comma, as in the
following statement.

nls97.loc[(nls97.nightlyhrssleep<=4) & (nls97.childathome>=3), [
'nightlyhrssleep','childathome']]

This returns the nightlyhrssleep and childathome columns for all rows
where the individual has nightlyhrssleep of less than or equal to 4 and
childathome greater than or equal to 3.

There’s More

We used three different tools to select rows from a pandas DataFrame in this
recipe: the [] bracket operator and two pandas-specific accessors, loc and

iloc. This is a little confusing if you are new to pandas, but it becomes clear
which tool to use in which situation after just a few months. If you came to
pandas with a fair bit of Python and NumPy experience, you likely find the
[] operator most familiar. However, the pandas documentation recommends

against using the [] operator for production code. I have settled on a routine
of using that operator only for selecting columns from a DataFrame. I use the loc accessor when selecting rows by boolean indexing or by index label and the iloc accessor for selecting rows by row number. Since my
workflow has me using a fair bit of boolean indexing, I use loc much more

than the other methods.

See Also

The recipe immediately preceding this one has a more detailed discussion on
selecting columns.

Conclusion

In conclusion, mastering the process of selecting rows is an essential foundation for effectively manipulating and analyzing data. Getting ready involves preparing your environment and ensuring you have the right tools at hand, making the execution smoother and more efficient. Understanding how to do it step by step allows you to apply these skills confidently, while knowing how it works under the hood deepens your grasp on the mechanics behind the scenes. But don’t stop there, there’s always more to explore and refine as you advance in your data manipulation journey. For those eager to dive deeper, be sure to see additional resources and references that will further enhance your knowledge and expertise. With these steps, you are not just selecting rows you are unlocking the true potential of your data analysis capabilities.

--

--

A.I Hub
A.I Hub

Written by A.I Hub

We writes about Data Science | Software Development | Machine Learning | Artificial Intelligence | Ethical Hacking and much more. Unleash your potential with us

No responses yet