18 — Pandas Data Cleaning: Finding Missing Values

A.I Hub
6 min readSep 12, 2024

--

Image owned by Canva

In the realm of data science and business analytics, uncovering hidden patterns begins with addressing a fundamental challenge, finding missing values. Whether you are preparing your dataset for analysis or gearing up for more advanced techniques, the process of getting ready is crucial for ensuring your models yield accurate and actionable insights. Understanding how to do it right can make all the difference between muddled results and meaningful conclusions. But how does it work and what steps should you take to achieve seamless integration of missing data handling in your workflow? There’s more to this than meets the eye, as the art of managing incomplete data goes beyond simple imputation. For those curious to dive deeper, see also some expert tips and resources that can elevate your approach to tackling data imperfections and unlock the true potential of your projects.

Table of Content

  • Finding missing values
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Finding Missing Values

Before starting any analysis, we need to have a good sense of the number of
missing values for each variable, and why those values are missing. We also
want to know which rows in our data frame are missing values for several

key variables. We can get this information with just a couple of statements in
pandas. We also need good strategies for dealing with missing values before
we begin statistical modeling, since those models do not typically handle
missing values flexibly. We introduce imputation strategies in this recipe and
go into more detail in subsequent recipes in this section.

Getting Ready

We will work with cumulative data on coronavirus cases and deaths by
country. The data frame has other relevant information including population
density, age and GDP.

Our World in Data provides Covid-19 public use data at (https://ourworldindata.org/coronavirus-source-data). The data used
in this recipe was downloaded on June 1, 2020. The covid case and
death data were missing for Hong Kong as of this date, but this
problem was rectified in files after that.

We will also be doing some routine plotting with Matplotlib in this recipe to
help us visualize the distributions of Covid cases and deaths. You can install

Matplotlib.

pip install matplotlib

How To Do It

We make good use of the isnull and sum functions to count the number of
missing values for selected columns and the number of rows that have
missing values for several key variables. We then use the very handy data frame fillna method to impute missing values.

  • Load the pandas library, along with the Covid case data file.

Also, set up the covid case and demographic columns.

import pandas as pd
covidtotals = pd.read_csv("data/covidtotalswithmissings.csv")
totvars = ['location','total_cases', 'total_deaths','total_cases_pm', 'total_deaths_pm']
demovars = ['population','pop_density', 'median_age','gdp_per_capita', 'hosp_beds']
  • Check the demographic columns for missing data.

Set the axis to 0 (the default) to check for the count of countries that are
missing values for each of the demographic variables missing values down
columns. You can notice that 46 out of 210 countries more than 20 percent of
countries are missing hosp_beds. Set the axis to 1 to check for the number

of demographic variables that are missing for each country missing values
across rows. Next, get value_counts on the resulting demovarsmisscnt
series to see whether some countries have missing values for much of the

demographic data. Notice that 10 countries are missing values for 3 out of the
5 demographic variables, while 8 countries are missing values for 4 out of 5

demographic variables.

covidtotals[demovars].isnull().sum(axis=0)

Output:

population 0
pop_density 12
median_age 24
gdp_per_capita 28
hosp_beds 46
dtype: int64
demovarsmisscnt = covidtotals[demovars].isnull().sum(axis=1)
demovarsmisscnt.value_counts()

Output:

0 156
1 24
2 12
3 10
4 8
dtype: int64
  • List the countries with three or more missing values for the demographic
    data.

Index alignment and Boolean indexing allow us to use the count of missing
values (demovarsmisscnt) to select rows. Append the location to the

demo vars list to see the country. We only show the first five of these
countries here.

covidtotals.loc[demovarsmisscnt>=3, ['location'] + demovars].head(5).T

Output:

iso_code AND AIA BES \
location Andorra Anguilla Bonaire ...
population 77,265 15,002 26,221
pop_density 164 NaN NaN
median_age NaN NaN NaN
gdp_per_capita NaN NaN NaN
hosp_beds NaN NaN NaN
iso_code VGB FRO
location British Vi ... Faeroe Islands
population 30,237 48,865
pop_density 208 35
median_age NaN NaN
gdp_per_capita NaN NaN
hosp_beds NaN NaN
type(demovarsmisscnt)

Output:

<class 'pandas.core.series.Series'>
  • Check the covid case data for missing values. You can notice that only one country has missing values for any of this data.
covidtotals[totvars].isnull().sum(axis=0)

Output:

location 0
total_cases 0
total_deaths 0
total_cases_pm 1
total_deaths_pm 1
dtype: int64
totvarsmisscnt = covidtotals[totvars].isnull().sum(axis=1)
totvarsmisscnt.value_counts()

Output:

0 209
2 1
dtype: int64
covidtotals.loc[totvarsmisscnt>0].T

Output:

iso_code HKG
lastdate 2020-05-26 00:00:00
location Hong Kong
total_cases 0
total_deaths 0
total_cases_pm NaN
total_deaths_pm NaN
population 7,496,988
pop_density 7,040
median_age 45
gdp_per_capita 56,055
hosp_beds NaN
  • Use the fillna method to fix the missing cases data for the one country
    affected (Hong Kong).

We could just set the values to 0 , since the numerator is 0 in both cases.
However, it is helpful in terms of code reuse to use the correct logic.

covidtotals.total_cases_pm. \ fillna(covidtotals.total_cases/
(covidtotals.population/1000000), inplace=True)
covidtotals.total_deaths_pm. \
fillna(covidtotals.total_deaths/
(covidtotals.population/1000000), inplace=True)
covidtotals[totvars].isnull().sum(axis=0)

Output:

location 0
total_cases 0
total_deaths 0
total_cases_pm 0
total_deaths_pm 0
dtype: int64

These steps give us a good sense of the number of missing values that we
have for each column and which countries have many missing values.

How It Works

Step 2 shows that there is a fair bit of missing data for the demographic
variables, particularly for the number of hospital beds. 18 countries have at

least 3 of the 5 demographic variables missing. We will either have to
exclude those variables from any multivariate analyses we will do in the
future or impute values for those variables. We make no attempt to fix those

values here. We look more at fixing missing values, including by imputing
values, in subsequent sections. The key Covid case data is relatively free of
missing values. We have one country with missing cases or death data, which
we resolve in step 5. We use fillna to fix the missing value. We could have
also used fillna to set the missing value to 0. We should not gloss over the
little bit of pandas magic in steps 2 and 3. We create a series,
demovarsmisscnt, which has the count of demographic columns that have
missing values for each country. We are able to use that series, along with the

three or more test series (demovarsmisscnt>=3), because of pandas index
alignment and Boolean indexing.

See Also

We examine other pandas techniques for fixing missing values in section 6,
Cleaning and Wrangling Data with Pandas Series Operations.

Conclusion

In conclusion, finding missing values is a critical first step in ensuring data accuracy and integrity. As you get ready to tackle the issue, having a well defined approach will streamline the process and set you up for success. By knowing how to do it, whether through simple imputation, advanced algorithms or leveraging powerful tools, you can ensure the missing data won’t derail your analysis. Understanding how it works allows you to make informed decisions, empowering your data driven strategies. But there’s more. Addressing missing values is just one aspect of data preparation and there are countless other techniques and tools that can further refine your analysis. To expand your knowledge and dive deeper into these topics, be sure to explore the resources in the "See Also" section, where you will find valuable insights to help you master the art of data handling and elevate your analytical skills.

--

--

A.I Hub
A.I Hub

Written by A.I Hub

We writes about Data Science | Software Development | Machine Learning | Artificial Intelligence | Ethical Hacking and much more. Unleash your potential with us

No responses yet