When it comes to navigating complex datasets and ensuring clarity in your results, selecting and organizing the right columns is more than just a technical step; it is the foundation of effective data analysis. Before diving into the details, it is essential to get your environment ready, ensuring that you have the necessary tools and data in place. Once you are set up, the process becomes streamlined, with a clear pathway to follow. How exactly do you do it? That is where step-by-step precision comes in, breaking down each action so that your data aligns with your goals. As you dig deeper, the inner workings of the process reveal themselves, showing the mechanisms that drive your outcomes. And this is only the beginning: there is more to uncover as you fine-tune your methods and explore additional resources. For those curious to go further, you will find pathways to broaden your understanding and optimize your approach.
Table of Contents
- Selecting and organizing columns
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Selecting and Organizing Columns
We explore several ways to select one or more columns from your
DataFrame in this recipe. We can select columns by passing a list of column
names to the [] bracket operator, or by using the pandas-specific loc and
iloc data accessors. When cleaning data or doing exploratory or
statistical analyses, it is helpful to focus on the variables that are relevant to
the issue or analysis at hand. This makes it important to group columns
according to their substantive or statistical relationships with each other, or to limit the columns we are investigating at any one time. How many times have
we said to ourselves something like, "Why does variable A have a value of x
when variable B has a value of y?" We can only answer questions like that when the amount of
data we are viewing at a given moment does not exceed our perceptive
abilities.
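Before we begin, a minimal sketch on a small hypothetical DataFrame (our own illustration, not the NLS data) previews the distinction that runs through this recipe: a bare column name returns a series, while a one-element list returns a DataFrame.
import pandas as pd
# hypothetical two-column DataFrame for illustration
df = pd.DataFrame({'gender': ['Female', 'Male'], 'birthyear': [1980, 1982]})
type(df['gender'])     # <class 'pandas.core.series.Series'>
type(df[['gender']])   # <class 'pandas.core.frame.DataFrame'>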
Getting Ready
We will continue working with the National Longitudinal Survey (NLS)
data in this recipe.
How To Do It
We will explore several ways to select columns.
- Import the pandas library and load the NLS data into pandas.
Also convert all columns with object data type in the NLS data to category
data type. Do this by selecting object data type columns with
select_dtypes and using apply plus a lambda function to change the data
type to category.
import pandas as pd
import numpy as np
nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)
# convert all object columns to category
nls97.loc[:, nls97.dtypes == 'object'] = \
  nls97.select_dtypes(['object']). \
  apply(lambda x: x.astype('category'))
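To confirm that the conversion worked, one quick check (our own addition, not part of the original steps) is to count columns by data type:
# sanity check: no object columns should remain after the conversion
nls97.dtypes.value_counts()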
- Select a column using the pandas [] bracket operator, and the loc and
iloc accessors.
We pass a string matching a column name to the bracket operator to return a
pandas series. If we pass a one-element list containing that column name
(nls97[['gender']]), a DataFrame is returned. We can also use the loc
and iloc accessors to select columns.
analysisdemo = nls97['gender']
type(analysisdemo)
Output:
<class 'pandas.core.series.Series'>
analysisdemo = nls97[['gender']]
type(analysisdemo)
Output:
<class 'pandas.core.frame.DataFrame'>
analysisdemo = nls97.loc[:,['gender']]
type(analysisdemo)
Output:
<class 'pandas.core.frame.DataFrame'>
analysisdemo = nls97.iloc[:,[0]]
type(analysisdemo)
Output:
<class 'pandas.core.frame.DataFrame'>
- Select multiple columns from a pandas DataFrame. Use the bracket operator and loc to select a few columns.
analysisdemo = nls97[['gender','maritalstatus', 'highestgradecompleted']]
analysisdemo.shape
Output:
(8984, 3)
analysisdemo.head()
Output:
gender maritalstatus highestgradecompleted
personid
100061 Female Married 13
100139 Male Married 12
100284 Male Never-married 7
100292 Male NaN nan
100583 Male Married 13
analysisdemo = nls97.loc[:,['gender','maritalstatus', 'highestgradecompleted']]
analysisdemo.shape
Output:
(8984, 3)
analysisdemo.head()
Output:
gender maritalstatus highestgradecompleted
personid
100061 Female Married 13
100139 Male Married 12
100284 Male Never-married 7
100292 Male NaN nan
100583 Male Married 13
- Select multiple columns based on a list of columns.
If you are selecting more than a few columns, it is helpful to create the list of
column names separately. Here, we create a keyvars list of key variables for
analysis.
keyvars = ['gender','maritalstatus',
'highestgradecompleted','wageincome',
'gpaoverall','weeksworked17','colenroct17']
analysiskeys = nls97[keyvars]
analysiskeys.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8984 entries, 100061 to 999963
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 8984 non-null category
1 maritalstatus 6672 non-null category
2 highestgradecompleted 6663 non-null float64
3 wageincome 5091 non-null float64
4 gpaoverall 6004 non-null float64
5 weeksworked17 6670 non-null float64
6 colenroct17 6734 non-null category
dtypes: category(3), float64(4)
memory usage: 377.7 KB
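A list such as keyvars also makes it easy to grab the complement. As a small sketch (assuming you want to keep the original column order), a list comprehension selects every column not in the list:
# columns not in keyvars, preserving their original order
othervars = [col for col in nls97.columns if col not in keyvars]
analysisother = nls97[othervars]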
- Select one or more columns by filtering on the column name. Select all of the weeks worked columns using the filter method.
analysiswork = nls97.filter(like="weeksworked")
analysiswork.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8984 entries, 100061 to 999963
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 weeksworked00 8603 non-null float64
1 weeksworked01 8564 non-null float64
2 weeksworked02 8556 non-null float64
3 weeksworked03 8490 non-null float64
4 weeksworked04 8458 non-null float64
5 weeksworked05 8403 non-null float64
6 weeksworked06 8340 non-null float64
7 weeksworked07 8272 non-null float64
8 weeksworked08 8186 non-null float64
9 weeksworked09 8146 non-null float64
10 weeksworked10 8054 non-null float64
11 weeksworked11 7968 non-null float64
12 weeksworked12 7747 non-null float64
13 weeksworked13 7680 non-null float64
14 weeksworked14 7612 non-null float64
15 weeksworked15 7389 non-null float64
16 weeksworked16 7068 non-null float64
17 weeksworked17 6670 non-null float64
dtypes: float64(18)
memory usage: 1.3 MB
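filter can also take an explicit list of names through its items parameter. A brief sketch: it returns only the listed columns that actually exist, so it will not raise an error on a missing name.
# items= returns just the listed columns, silently skipping any that do not exist
nls97.filter(items=['gender', 'maritalstatus'])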
- Select all columns with the category data type. Use the select_dtypes method to select columns by data type.
analysiscats = nls97.select_dtypes(include=["category"])
analysiscats.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8984 entries, 100061 to 999963
Data columns (total 57 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 8984 non-null category
1 maritalstatus 6672 non-null category
2 weeklyhrscomputer 6710 non-null category
3 weeklyhrstv 6711 non-null category
4 highestdegree 8953 non-null category
...
49 colenrfeb14 7624 non-null category
50 colenroct14 7469 non-null category
51 colenrfeb15 7469 non-null category
52 colenroct15 7469 non-null category
53 colenrfeb16 7036 non-null category
54 colenroct16 6733 non-null category
55 colenrfeb17 6733 non-null category
56 colenroct17 6734 non-null category
dtypes: category(57)
memory usage: 580.0 KB
- Select all columns with numeric data types.
analysisnums = nls97.select_dtypes(include=["number"])
analysisnums.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8984 entries, 100061 to 999963
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 birthmonth 8984 non-null int64
1 birthyear 8984 non-null int64
2 highestgradecompleted 6663 non-null float64
...
23 weeksworked10 8054 non-null float64
24 weeksworked11 7968 non-null float64
25 weeksworked12 7747 non-null float64
26 weeksworked13 7680 non-null float64
27 weeksworked14 7612 non-null float64
28 weeksworked15 7389 non-null float64
29 weeksworked16 7068 non-null float64
30 weeksworked17 6670 non-null float64
dtypes: float64(29), int64(2)
memory usage: 2.2 MB
- Organize columns using lists of column names.
Use lists to organize the columns in your DataFrame. You can easily change
the order of columns or exclude some columns in this way. Here, we move
the columns in the demoadult list to the front.
demo = ['gender','birthmonth','birthyear']
highschoolrecord = ['satverbal','satmath','gpaoverall',
'gpaenglish','gpamath','gpascience']
govresp = ['govprovidejobs','govpricecontrols',
'govhealthcare','govelderliving','govindhelp',
'govunemp','govincomediff','govcollegefinance',
'govdecenthousing','govprotectenvironment']
demoadult = ['highestgradecompleted','maritalstatus',
'childathome','childnotathome','wageincome',
'weeklyhrscomputer','weeklyhrstv','nightlyhrssleep',
'highestdegree']
weeksworked = ['weeksworked00','weeksworked01',
'weeksworked02','weeksworked03','weeksworked04',
'weeksworked14','weeksworked15','weeksworked16',
'weeksworked17']
colenr = ['colenrfeb97','colenroct97','colenrfeb98',
'colenroct98','colenrfeb99','colenroct99',
'colenrfeb15','colenroct15','colenrfeb16',
'colenroct16','colenrfeb17','colenroct17']
- Create the new reorganized DataFrame.
nls97 = nls97[demoadult + demo + highschoolrecord + \
govresp + weeksworked + colenr]
nls97.dtypes
Output:
highestgradecompleted float64
maritalstatus category
childathome float64
childnotathome float64
wageincome float64
...
colenroct15 category
colenrfeb16 category
colenroct16 category
colenrfeb17 category
colenroct17 category
Length: 88, dtype: object
The preceding steps showed how to select columns and change the order of
columns in a pandas DataFrame.
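As a quick check (our own addition, not one of the original steps), you can confirm the reordering by comparing the leading column names with the demoadult list:
# the first columns should now be the demoadult variables
nls97.columns[:len(demoadult)].tolist() == demoadult   # True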
How It Works
Both the [] bracket operator and the loc data accessor are very handy for
selecting and organizing columns. Each returns a DataFrame when passed a
list of column names, with the columns ordered according to the passed list.
In step 1, we use nls97.select_dtypes(['object']) to select columns with object data type
and chain that with apply and a lambda function
(apply(lambda x: x.astype('category'))) to change those columns to
category. We use the loc accessor to update only the columns with object data
type (nls97.loc[:, nls97.dtypes == 'object']). We go into much more detail on apply and lambda functions in section 6, Cleaning and
Exploring Data with Series Operations. We also select columns by data type
in steps 6 and 7. select_dtypes becomes quite useful when passing
columns to methods such as describe or value_counts and you want to
limit the analysis to continuous or categorical variables. In step 9, we
concatenate six different lists when using the bracket operator. This moves
the column names in demoadult to the front and organizes all of the columns
by those six groups. There are now clear high school record and weeks
worked sections among our DataFrame's columns.
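For example, the select_dtypes pattern mentioned above pairs naturally with describe and value_counts. A minimal sketch of each:
# summary statistics for the continuous variables only
nls97.select_dtypes(include=['number']).describe()
# frequencies for one of the categorical variables
nls97.select_dtypes(include=['category'])['maritalstatus'].value_counts()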
There’s More
We can also use select_dtypes to exclude data types. And if we are just
interested in the info results, we can chain the select_dtypes call with the
info method.
nls97.select_dtypes(exclude=["category"]).info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8984 entries, 100061 to 999963
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 highestgradecompleted 6663 non-null float64
1 childathome 4791 non-null float64
2 childnotathome 4791 non-null float64
3 wageincome 5091 non-null float64
4 nightlyhrssleep 6706 non-null float64
5 birthmonth 8984 non-null int64
6 birthyear 8984 non-null int64
...
25 weeksworked12 7747 non-null float64
26 weeksworked13 7680 non-null float64
27 weeksworked14 7612 non-null float64
28 weeksworked15 7389 non-null float64
29 weeksworked16 7068 non-null float64
30 weeksworked17 6670 non-null float64
dtypes: float64(29), int64(2)
memory usage: 2.2 MB
The filter method can also take a regular expression. For example, you can
return the columns that have income in their names.
nls97.filter(regex='income')
Output:
wageincome govincomediff
personid
100061 12,500 NaN
100139 120,000 NaN
100284 58,000 NaN
100292 nan NaN
100583 30,000 NaN
... ... ...
999291 35,000 NaN
999406 116,000 NaN
999543 nan NaN
999698 nan NaN
999963 50,000 NaN
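Because regex gives full regular-expression matching, you can also anchor a pattern. As a small sketch (the pattern here is our own illustration), all of the college enrollment columns begin with colenr:
# ^ anchors the match to the start of the column name
nls97.filter(regex='^colenr')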
See Also
Many of these techniques can be used to create pandas series as well as
DataFrames. We demonstrate this in section 6, Cleaning and Exploring
Data with Series Operations.
Conclusion
In conclusion, selecting and organizing columns effectively sets the foundation for any data-driven process, ensuring clarity and efficiency from the start. Preparation is key as you get ready to dive into the core of your task, allowing you to streamline your workflow and avoid potential roadblocks. The step-by-step approach to how to do it gives you the actionable guidance needed to accomplish your goals with precision. Once you understand how it works, you unlock the full potential of your data. And remember, there is always more to explore, as evolving tools and techniques offer new ways to enhance your process. Be sure to check out the related resources in the See also sections to further expand your expertise and keep ahead of the curve.