17 — Pandas Data Cleaning: Using Generative AI To View Our Data

A.I Hub
8 min readSep 11, 2024

--

Image owned by Canva

In the fast paced world of data-driven decision making, generative AI is revolutionizing how we perceive and interact with our data. No longer bound by traditional visualization techniques, we now have the power to generate fresh insights, uncover hidden patterns and unlock new dimensions of data interpretation. But before diving headfirst into this transformation, it’s crucial to ensure that you are fully prepared to leverage the full potential of generative AI, both in terms of tools and mindset. It’s not just about implementing the technology but understanding the process of how it works and applying it in a way that aligns with your business objectives. As you explore the mechanics behind generative AI, you will discover a world of possibilities from creating predictive models to enhancing your data analytics pipelines. Yet, the journey doesn’t end there, there’s even more to discover beyond the basic functionalities. This is just the beginning. For those eager to take the next step, a deeper exploration awaits, where the potential of AI extends far beyond its current applications. Ready to see what’s possible? Dive into the related resources and take your understanding to the next level.

Table of Content

  • Using generative ai to view our data
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Using Generative AI To View Our Data

Generative AI tools provide data scientists with a great opportunity to
streamline the data cleaning and exploration parts of our workflow. Large
language models, in particular, have the potential to make this work much
easier and more intuitive. Using these tools we can select rows and columns

by criteria, generate summary statistics, and plot variables. A simple way to
introduce generative AI tools into your data exploration is with PandasAI.
PandasAI uses the OpenAI API to translate natural language queries into data

selection and operations that pandas can understand. As of July 2023,
OpenAI is the only large language model API that can be used with
PandasAI, though the developers of the library anticipate adding other APIs.We can use PandasAI to substantially reduce the lines of code we need

to write to produce some of the tabulations and visualizations we have
created so far in this chapter. The steps in this recipe show how you can do
that.

Getting Ready

You need to install PandasAI to run the code in this recipe. You can do that
with pandas ai. We will work with the COVID data again, which is
available in the GitHub repository, as is the code.You will also need an API
key from OpenAI. You can get one at: (https://www.platform.openai.com). You will need to
setup an account and then click on your profile in the upper right corner and
then View API keys.

How To Do It

We create a PandasAI instance in the following steps and use it to take a look
at the COVID data.

  • We start by importing pandas and the PandasAI library.
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
  • Next, we load the COVID data and instantiate a PandasAI object.
covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=['lastdate'])
covidtotals.set_index("iso_code", inplace=True)
llm = OpenAI(api_token="Your API Key")
pandas_ai = PandasAI(llm)
  • Let’s start looking at the COVID data by displaying the first two rows
    and looking at the data types. We can do this by passing a DataFrame
    and natural language instructions to the PandasAI run method.
pandas_ai.run(covidtotals, "Show first two rows.").T

Output:

iso_code AFG
ALB
lastdate 2020-07-12 00:00:00 2020-07-12 00:00:00
location Afghanistan Albania
total_cases 34,451.0 3,371.0
total_deaths 1,010.0 89.0
total_cases_pm 885.0 1,171.4
total_deaths_pm 25.9 30.9
population 38,928,341.0 2,877,800.0
pop_density 54.4 104.9
median_age 18.6 38.0
gdp_per_capita 1,804.0 11,803.4
hosp_beds 0.5 2.9
region South Asia Eastern Europe
pandas_ai.run(covidtotals, "Show column types.")

Output:

lastdate datetime64[ns]
location object
total_cases float64
total_deaths float64
total_cases_pm float64
total_deaths_pm float64
population float64
pop_density float64
median_age float64
gdp_per_capita float64
hosp_beds float64
region object
dtype: object
  • We can see which locations (countries) have the highest total cases.
pandas_ai.run(covidtotals, "Show total cases for locations with the most.and location for locations with the 10 highest total cases pm.")

Output:

total_cases_pm total_deaths_pm location
iso_code
QAT 35,795.2 50.7 Qatar
SMR 21,038.4 1,237.6 San Marino
BHR 19,082.2 61.1 Bahrain
CHL 16,322.7 360.0 Chile
VAT 14,833.1 0.0 Vatican
KWT 12,658.3 90.4 Kuwait
AND 11,065.8 673.0 Andorra
OMN 10,711.0 48.6 Oman
ARM 10,593.8 188.6 Armenia
PAN 10,274.5 207.0 Panama
  • We can also create a DataFrame with selected columns. This is easy
    because the PandasAI run method returns a pandas DataFrame.
covidtotalsabb = pandas_ai.run(covidtotals, "Select total cases
pm, total deaths pm, and location.")

print(covidtotalsabb)

Output:

total_cases_pm total_deaths_pm location
iso_code
AFG 885.0 25.9 Afghanistan
ALB 1,171.4 30.9 Albania
DZA 426.7 22.9 Algeria
AND 11,065.8 673.0 Andorra
AGO 14.7 0.8 Angola
... ... ...
VNM 3.8 0.0 Vietnam
ESH 868.9 1.7 Western Sahara
YEM 46.6 12.2 Yemen
ZMB 103.1 2.3 Zambia
ZWE 66.1 1.2 Zimbabwe
[209 rows x 3 columns]
  • We do not need to be very precise in the language we pass to PandasAI.
    Instead of writing Select , we could have written Get. We get the
    same results. Even Grab will work not shown.
covidtotalsabb = pandas_ai.run(covidtotals, "Get total cases pm, total deaths pm, and location.")

print(covidtotalsabb)

Output:

total_cases_pm total_deaths_pm location
iso_code
AFG 885.0 25.9 Afghanistan
ALB 1,171.4 30.9 Albania
DZA 426.7 22.9 Algeria
AND 11,065.8 673.0 Andorra
AGO 14.7 0.8 Angola
... ... ...
VNM 3.8 0.0 Vietnam
ESH 868.9 1.7 Western Sahara
YEM 46.6 12.2 Yemen
ZMB 103.1 2.3 Zambia
ZWE 66.1 1.2 Zimbabwe
[209 rows x 3 columns]
  • We can select rows by summary statistic. For example, we can choose
    those rows where total cases per million is greater than the 95th
    percentile. This might take a little while to run on your machine.
pandas_ai.run(covidtotals, "Show total cases pm and location where total cases pm greater than 95th percentile.")

Output:

total_cases_pm location
iso_code
AND 11,065.8 Andorra
ARM 10,593.8 Armenia
BHR 19,082.2 Bahrain
CHL 16,322.7 Chile
KWT 12,658.3 Kuwait
OMN 10,711.0 Oman
PAN 10,274.5 Panama
QAT 35,795.2 Qatar
SMR 21,038.4 San Marino
USA 9,811.7 United States
VAT 14,833.1 Vatican
  • We can see how continuous variables are distributed by asking for their
    distribution.
pandas_ai.run(covidtotals, "Show the distribution of total cases pm and total deaths pm.")

Output:

total_cases_pm total_deaths_pm
count 209.0 209.0
mean 2,297.0 73.6
std 4,039.8 156.3
min 1.2 0.0
25% 202.8 2.8
50% 868.9 15.2
75% 2,784.9 58.4
max 35,795.2 1,237.6
  • We can also generate summary statistics. This reveals that there were
    12,698,299 cases and 564,917 deaths worldwide as of the time this
    dataset was downloaded.
pandas_ai.run(covidtotals, "Show sum of total cases and total deaths.")

Output:

'12698299 564917'
  • We can get group totals. Let’s get the total cases and deaths by region.
pandas_ai.run(covidtotals, "Show sum of total cases and total deaths by region.")

Output:

total_cases total_deaths
region
Caribbean 68,689 1,325
Central Africa 41,848 862
Central America 120,018 3,211
Central Asia 142,814 1,365
East Africa 44,604 1,219
East Asia 307,282 11,041
Eastern Europe 996,788 19,805
North Africa 118,046 5,106
North America 3,650,287 178,317
Oceania / Aus 11,207 136
South America 2,845,898 103,917
South Asia 1,368,451 31,352
Southern Africa 272,934 4,094
West Africa 101,128 1,708
West Asia 1,088,492 25,165
Western Europe 1,519,813 176,294
  • We can easily generate plots on the COVID data.
pandas_ai.run(covidtotals, "Plot a histogram of total cases pm")

Output:

  • We can also generate a scatterplot. Let’s look at total cases per million
    by total deaths per million.
pandas_ai.run(covidtotals, "Plot total cases pm by total deaths pm")

Output:

  • We can indicate which plotting tool we want to use. A regplot here
    might be helpful. It might give us a better sense of the relationship
    between cases and deaths.
pandas_ai.run(covidtotals, "Use regplot to show total deaths pm by total cases pm")

Output:

  • The extreme values for cases or deaths make it harder to visualize the
    relationship between the two. Let’s also ask PandasAI to remove
    extreme values.
pandas_ai.run(covidtotals, "Use regplot to show total deaths pm by total cases pm without extreme values")

Output:

This removed deaths per million above 350 and cases per million above
10000. It is easier to see the slope of the relationship over much of the data.
We will work more with regplot and many other plotting tools in section

5, Using Visualizations for the Identification of Unexpected Values.

How It Works

These examples demonstrate how intuitive it is to use PandasAI. Generative
AI tools like PandasAI have the potential to improve our exploratory work by making it possible to interact with the data nearly as quickly as we can

imagine new analyses. We only need to pass natural language queries to the
PandasAI object to get the results we want.The queries we pass are not
commands. We can use any language we want that conveys our intent.
Recall, for example, that we were able write select or get or even grab

to choose columns. OpenAI’s large language model is generally very good at
understanding what we mean.A tool that helps us move more swiftly from
question to answer can improve our thinking and analysis. It is definitely worth experimenting with if you have not done so already, even if you have

well established routines for looking at your data.

See Also

The PandasAI GitHub repository is a great place to go for more information
and to keep apprised of updates in the library. You can get to it here: (https://github.com/gventuri/pandas-ai). We will return to the PandasAI library
in section 5, Using Visualizations for the Identification of Unexpected

Values and section 8, Fixing Messy Data When Aggregating.

Conclusion

In conclusion, using generative AI to view our data is not just a futuristic concept but a powerful tool that businesses can leverage to gain deeper insights, streamline operations and unlock new levels of creativity. As we prepare to harness its potential, it’s essential to get ready by understanding the right tools, infrastructure and strategies needed to integrate AI effectively into business processes. Implementing it requires a clear plan, starting with defining objectives, selecting the right models and ensuring data readiness. How it works is rooted in complex algorithms, but with user friendly platforms, businesses can easily use generative AI to transform raw data into actionable insights. Beyond these capabilities, there’s more, AI is constantly evolving, offering new solutions that stretch far beyond data analysis into creativity, automation and decision making. To continue exploring, take a closer look at related technologies and applications, which can further enhance your understanding of how AI is reshaping industries. Generative AI is not just a tool, it’s a catalyst for business innovation and those who embrace it will lead in this new era of data driven success.

--

--

A.I Hub
A.I Hub

Written by A.I Hub

We writes about Data Science | Software Development | Machine Learning | Artificial Intelligence | Ethical Hacking and much more. Unleash your potential with us

No responses yet