12 — Pandas Data Cleaning: Getting a First Look at Your Data

A.I Hub
7 min read · Sep 10, 2024


When diving into the world of data, your first look can feel like uncovering hidden treasure. The raw numbers may seem overwhelming, but they hold insights that can transform the way you make decisions. Before you can fully grasp their potential, it's crucial to prepare yourself for the journey ahead, ensuring you have the right tools and mindset. Whether you are cleaning, organizing, or analyzing, working with data is both an art and a science, requiring focus and precision. As you begin, you will quickly discover how each step fits together to reveal patterns, trends, and answers you never imagined. But the story doesn't end there; there is always more to learn, deeper insights to uncover, and new techniques to apply. For those hungry to expand their understanding, the path forward is endless, offering opportunities to dive even deeper into the world of data and its possibilities. If you are curious, take a look at additional resources that can broaden your perspective and push your skills further.

Table of Contents

  • Getting a first look at your data
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Getting a First Look at Your Data

We will work with two datasets in this section. The first is data from the National Longitudinal Survey (NLS) of Youth for 1997, a survey conducted by the United States government that followed the same group of individuals from 1997 through 2017. The second is counts of COVID cases and deaths by country from Our World in Data.

Getting Ready

We will mainly be using the pandas library for this recipe. We will use pandas tools to take a closer look at the National Longitudinal Survey and the coronavirus case data.

The NLS of Youth was conducted by the United States Bureau of Labor Statistics. The survey started in 1997 with a cohort of individuals born between 1980 and 1985, with annual follow-ups through 2017. For this recipe, I pulled 89 variables on grades, employment, income, and attitudes toward government from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository. NLS data can be downloaded from the link below.

How To Do It

We will get an initial look at the NLS and COVID data, including the number
of rows and columns and the data types.

  • Import libraries and load the DataFrames.
import pandas as pd
import numpy as np
# load the NLS data, and parse the COVID data's lastdate column as a datetime
nls97 = pd.read_csv("data/nls97.csv")
covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=['lastdate'])
  • Set and show the index and the size of the nls97 data. Also check to see whether the index values are unique; an alternative uniqueness check is sketched after this step's output.
nls97.set_index("personid", inplace=True)
nls97.index

Output:

Int64Index([100061, 100139, 100284, 100292, 100583, 100833,
            ...
            999543, 999698, 999963],
           dtype='int64', name='personid', length=8984)
nls97.shape

Output:

(8984, 88)
nls97.index.nunique()

Output:

8984
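Pandas also offers a more direct uniqueness check through the index's is_unique attribute. This is a minimal sketch, assuming nls97 has been indexed by personid as above; because the 8,984 unique values match the 8,984 rows, it returns True.

# True when every personid appears exactly once in the index
nls97.index.is_unique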
  • Show the data types and non-null value counts.
nls97.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8984 entries, 100061 to 999963
Data columns (total 88 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 8984 non-null object
1 birthmonth 8984 non-null int64
2 birthyear 8984 non-null int64
3 highestgradecompleted 6663 non-null float64
4 maritalstatus 6672 non-null object
5 childathome 4791 non-null float64
6 childnotathome 4791 non-null float64
7 wageincome 5091 non-null float64
8 weeklyhrscomputer 6710 non-null object
9 weeklyhrstv 6711 non-null object
10 nightlyhrssleep 6706 non-null float64
11 satverbal 1406 non-null float64
12 satmath 1407 non-null float64
...
83 colenroct15 7469 non-null object
84 colenrfeb16 7036 non-null object
85 colenroct16 6733 non-null object
86 colenrfeb17 6733 non-null object
87 colenroct17 6734 non-null object
dtypes: float64(29), int64(2), object(57)
memory usage: 6.1+ MB
  • Show the first row of the nls97 data. Use transpose to show a little more of the output.
nls97.head(2).T

Output:

personid 100061 100139
gender Female Male
birthmonth 5 9
birthyear 1980 1983
highestgradecompleted 13 12
maritalstatus Married Married
... ...
colenroct15 1. Not enrolled 1. Not enrolled
colenrfeb16 1. Not enrolled 1. Not enrolled
colenroct16 1. Not enrolled 1. Not enrolled
colenrfeb17 1. Not enrolled 1. Not enrolled
colenroct17 1. Not enrolled 1. Not enrolled
  • Set and show the index and size for the COVID data. Also check to see whether index values are unique.
covidtotals.set_index("iso_code", inplace=True)
covidtotals.index

Output:

Index(['AFG', 'ALB', 'DZA', 'AND', 'AGO', 'AIA', 'ATG', 'ARG', 'ARM',
       'ABW',
       ...
       'VIR', 'URY', 'UZB', 'VAT', 'VEN', 'VNM', 'ESH', 'YEM', 'ZMB',
       'ZWE'],
      dtype='object', name='iso_code', length=209)
covidtotals.shape

Output:

(209, 12)
covidtotals.index.nunique()

Output:

209
  • Show the data types and non-null value counts.
covidtotals.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 209 entries, AFG to ZWE
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 lastdate 209 non-null datetime64[ns]
1 location 209 non-null object
2 total_cases 209 non-null float64
3 total_deaths 209 non-null float64
4 total_cases_pm 209 non-null float64
5 total_deaths_pm 209 non-null float64
6 population 209 non-null float64
7 pop_density 198 non-null float64
8 median_age 185 non-null float64
9 gdp_per_capita 182 non-null float64
10 hosp_beds 164 non-null float64
11 region 209 non-null object
dtypes: datetime64[ns](1), float64(9), object(2)
memory usage: 29.3+ KB
  • Show a sample of a few rows of the COVID case data.
covidtotals.sample(2, random_state=1).T

Output:

iso_code COG THA
lastdate 2020-07-12 00:00:00 2020-07-12 00:00:00
location Congo Thailand
total_cases 2,028 3,217
total_deaths 47 58
total_cases_pm 368 46
total_deaths_pm 9 1
population 5,518,092 69,799,978
pop_density 15 135
median_age 19 40
gdp_per_capita 4,881 16,278
hosp_beds NaN 2
region Central Africa East Asia

This has given us a good foundation for understanding our DataFrames,
including their size and column data types.

How It Works

We set and display the index of the nls97 DataFrame, which is called personid, in step 2. It is a more meaningful index than the default pandas RangeIndex, which is essentially zero-based row numbers. When individuals are the unit of analysis, there is often a unique identifier, and that identifier is a good candidate for an index. It makes selecting a row by that identifier easier. Rather than filtering with a boolean condition such as nls97.loc[nls97.index==100061] to get the row for that person, we can use nls97.loc[100061]. We try this out in the next recipe.

Pandas makes it easy to view the number of rows and columns, the data type and number of non-missing values for each column, and the values of the columns for a few rows of your data. This can be accomplished by using the shape attribute and calling the info and head or sample methods. Using the head(2) method shows the first two rows, but sometimes it is helpful to grab a row from anywhere in the DataFrame, in which case we would use sample. We set the seed when we call sample (random_state=1) to get the same results whenever we run the code. We can chain our call to head or sample with T to transpose it, which reverses the display of rows and columns. That is helpful when there are more columns than can be shown horizontally and you want to be able to see all of them.

The shape attribute of the nls97 DataFrame tells us that there are 8,984 rows and 88 non-index columns. Since personid is the index, it is not included in the column count. The info method shows us that many of the columns have object data types and that some have a large number of missing values. satverbal and satmath have only about 1,400 valid values.

The shape attribute of the covidtotals DataFrame tells us that there are 209 rows and 12 columns, which does not include the iso_code column used for the index. iso_code is a unique three-character identifier for each country. The key variables for most analyses we would do are total_cases, total_deaths, total_cases_pm, and total_deaths_pm. All four are present for every country in this data.
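Here is a minimal sketch of that selection shortcut, assuming the DataFrames have been loaded and indexed as in the steps above; the index labels and column names are taken from the output already shown.

# select one person's row by its personid index label
nls97.loc[100061, ['gender', 'birthyear', 'maritalstatus']]
# select one country's row by its iso_code index label
covidtotals.loc['THA', ['location', 'total_cases', 'total_deaths']]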

There’s More

I find that thinking through the index when working with a data file can remind me of the unit of analysis. That is not actually obvious with the NLS data, as it is really panel data disguised as person-level data. Panel, or longitudinal, datasets have data for the same individuals over some regular duration. In this case, data was collected for each person over the years 1997 through 2017. The administrators of the survey have flattened it for analysis purposes by creating columns for certain responses over the years, such as college enrollment (colenroct15 through colenroct17). This is a fairly standard practice, but it is likely that we will need to do some reshaping for some analyses.

One thing I pay careful attention to when receiving any panel data is drop-off in responses to key variables over time. Notice the drop-off in valid values from colenroct15 to colenroct17. By October 2017, only 75% of respondents provided a valid response (6,734 out of 8,984). That is definitely worth keeping in mind during subsequent analysis, since the 6,734 remaining respondents may be different in important ways from the overall sample of 8,984.
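A quick way to quantify that drop-off is to count non-null values across the college enrollment columns; this is a small sketch, assuming the columns share the colenr prefix shown in the info output.

# count valid (non-null) responses for each college enrollment column
nls97.filter(like="colenr").notnull().sum()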

See Also

A recipe in section 1, Anticipating Data Cleaning Issues when Importing Tabular Data into Pandas, shows how to persist pandas DataFrames as feather or pickle files. In later recipes in this section, we will look at descriptives and frequencies for these two DataFrames. We reshape the NLS data in section 9, Tidying and Reshaping Data, recovering some of its actual structure as panel data. This is necessary for statistical methods such as survival analysis and is closer to tidy data ideals.

Conclusion

In summary, mastering the fundamentals of data exploration and preparation is the first critical step toward successful data analysis. By taking the time to thoroughly inspect your data at the outset, you can uncover valuable insights, detect patterns, and anticipate challenges before diving into more complex analysis. Once you are ready to proceed, having a structured plan for how to clean, organize, and transform your data ensures smoother execution of your analysis. Understanding how it all works together, from data exploration to processing, enables you to efficiently extract meaningful insights and make informed decisions. However, this is just the beginning; there is always more to discover and learn, as data science is an ever-evolving field with new tools, techniques, and best practices emerging constantly. For those seeking to deepen their expertise, exploring additional resources and related methodologies can help unlock even greater potential in data-driven projects.
