3 — Pandas Data Cleaning: Importing Data From SQL Database

Unlocking the true potential of data starts with accessing it effectively, and for many, that means tapping into SQL databases. Whether you are working with complex datasets or managing vast repositories of information, the ability to import data seamlessly from a SQL database can transform your workflow. Before diving into the technicalities, though, it is essential to make sure you are equipped with the right tools and understanding so the integration goes smoothly. As you follow the steps to master this process, you will see how easy and efficient data transfer can be when done correctly. Once you grasp how it works, you will be surprised at the speed and accuracy of your results, paving the way for more complex tasks. And it doesn’t end there: more advanced techniques and tricks are waiting to elevate your data management skills further. Make sure to check out the related resources to deepen your expertise and stay ahead of the curve.

Table of Contents

  • Importing data from SQL database
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Importing Data From SQL Database

In this recipe, we will use the pymssql and mysql.connector APIs to read data from Microsoft SQL Server and MySQL (now owned by Oracle) databases, respectively. Data from sources such as these tends to be well structured, since it is designed to facilitate simultaneous transactions by members of organizations and those who interact with them. Each transaction is also likely related to some other organizational transaction. This means that although data tables from enterprise systems are more reliably structured than data from CSV and Excel files, their logic is less likely to be self-contained. You need to know how the data from one table relates to data from another table to understand its full meaning. These relationships need to be preserved, including the integrity of primary and foreign keys, when pulling data.

Moreover, well-structured data tables are not necessarily uncomplicated data tables. There are often sophisticated coding schemes that determine data values, and these coding schemes can change over time. For example, codes for staff ethnicity at a retail store chain might be different in 1998 than they are in 2020. Similarly, there are frequently codes for missing values, such as 99999, that pandas will read as valid values. Since much of this logic is business logic implemented in stored procedures or other applications, it is lost when the data is pulled out of this larger system. Some of what is lost will eventually have to be reconstructed when preparing data for analysis. This almost always involves combining data from multiple tables, so it is important to preserve the ability to do that. But it may also involve adding some of the coding logic back after loading the SQL table into a pandas DataFrame. We explore how to do that in this recipe.

Getting Ready

This recipe assumes you have the pymssql and mysql.connector APIs installed. If you do
not, it is relatively straightforward to install them with pip from the
terminal (or PowerShell in Windows):

pip install pymssql
pip install mysql-connector-python

The dataset used in this recipe is available for public use at the link below:

https://archive.ics.uci.edu/ml/machine-learning-databases/00320/.

How To Do It

We import SQL Server and MySQL data tables into a pandas DataFrame as
follows.

  • Import pandas, numpy, pymssql and mysql.connector.

This step assumes that you have installed the pymssql and mysql.connector APIs.

import pandas as pd
import numpy as np
import pymssql
import mysql.connector
  • Use the pymssql API and read_sql to retrieve and load data from a SQL
    Server instance.

Select the columns we want from the SQL Server data and use SQL aliases to
improve column names (for example, fedu AS fathereducation). Create a
connection to the SQL Server data by passing database credentials to the
pymssql connect function. Create a pandas DataFrame by passing the
SELECT statement and connection object to read_sql. Close the
connection to return it to the pool on the server.

sqlselect = "SELECT studentid, school, sex, age, famsize,\
  medu AS mothereducation, fedu AS fathereducation,\
  traveltime, studytime, failures, famrel, freetime,\
  goout, g1 AS gradeperiod1, g2 AS gradeperiod2,\
  g3 AS gradeperiod3 FROM studentmath"

server = "pdcc.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdcctest"

conn = pymssql.connect(server=server,
  user=user, password=password, database=database)
studentmath = pd.read_sql(sqlselect, conn)
conn.close()
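If you want the cleanup to be explicit, a try/finally block (a minimal sketch, not part of the original recipe) guarantees the connection is closed even when the query raises an error; it assumes server, user, password, database and sqlselect are defined as above.

conn = pymssql.connect(server=server,
  user=user, password=password, database=database)
try:
    # run the query while the connection is open
    studentmath = pd.read_sql(sqlselect, conn)
finally:
    # close the connection even if read_sql raised an exception
    conn.close()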
  • Check the data types and the first few rows.
studentmath.dtypes

Output:

studentid object
school object
sex object
age int64
famsize object
mothereducation int64
fathereducation int64
traveltime int64
studytime int64
failures int64
famrel int64
freetime int64
goout int64
gradeperiod1 int64
gradeperiod2 int64
gradeperiod3 int64
dtype: object
studentmath.head()

Output:

studentid school ... gradeperiod2 gradeperiod3
0 001 GP ... 6 6
1 002 GP ... 5 6
2 003 GP ... 8 10
3 004 GP ... 14 15
4 005 GP ... 10 10
[5 rows x 16 columns]
  • Alternative: use the mysql connector and read_sql to get data from
    MySQL.

Create a connection to the MySQL data and pass that connection to read_sql
to retrieve the data and load it into a pandas DataFrame. The same data file
on student math scores was uploaded to SQL Server and MySQL, so we can
use the same SQL SELECT statement we used in the previous step.

host = "pdccmysql.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdccschema"
connmysql = mysql.connector.connect(host=host,
  database=database, user=user, password=password)
studentmath = pd.read_sql(sqlselect, connmysql)
connmysql.close()
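Note that recent versions of pandas prefer a SQLAlchemy connectable over a raw DBAPI connection for read_sql and may warn otherwise. If SQLAlchemy is installed alongside mysql-connector-python (an assumption, not a requirement of the recipe), a sketch of that approach looks like this, reusing the same placeholder credentials:

from sqlalchemy import create_engine

# build an engine that uses the mysql-connector driver under the hood
engine = create_engine(
    f"mysql+mysqlconnector://{user}:{password}@{host}/{database}")
studentmath = pd.read_sql(sqlselect, engine)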
  • Rearrange the columns, set an index and check for missing values.

Move the grade data to the left of the DataFrame, just after studentid. Also
move the freetime column to the right of traveltime and studytime.
Confirm that each row has an ID and that the IDs are unique, then set
studentid as the index.

newcolorder = ['studentid', 'gradeperiod1',
'gradeperiod2','gradeperiod3', 'school',
'sex', 'age', 'famsize','mothereducation',
'fathereducation', 'traveltime',
'studytime', 'freetime', 'failures',
'famrel','goout']
studentmath = studentmath[newcolorder]
studentmath.studentid.count()

Output:

395
studentmath.studentid.nunique()

Output:

395
studentmath.set_index('studentid', inplace=True)
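An equivalent, stricter alternative to the count and nunique checks above (my own sketch, not part of the recipe) is to let set_index enforce uniqueness itself via its verify_integrity parameter:

# alternative to the preceding call: raises ValueError if studentid has duplicates
studentmath.set_index('studentid', verify_integrity=True, inplace=True)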
  • Use the DataFrame’s count function to check for missing values.
studentmath.count()

Output:

gradeperiod1 395
gradeperiod2 395
gradeperiod3 395
school 395
sex 395
age 395
famsize 395
mothereducation 395
fathereducation 395
traveltime 395
studytime 395
freetime 395
failures 395
famrel 395
goout 395
dtype: int64
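count reports non-missing values only; a quick complementary check (my own small sketch, not a step in the recipe) is isnull().sum(), which counts missing values directly and should return zero for every column at this point.

# number of missing (NaN) values per column; all zeros is expected here
studentmath.isnull().sum()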
  • Replace coded data values with more informative values.

Create a dictionary with the replacement values for the columns and then use
replace to set those values.

setvalues = {"famrel":{1:"1:very bad",2:"2:bad",
    3:"3:neutral",4:"4:good",5:"5:excellent"},
  "freetime":{1:"1:very low",2:"2:low",
    3:"3:neutral",4:"4:high",5:"5:very high"},
  "goout":{1:"1:very low",2:"2:low",3:"3:neutral",
    4:"4:high",5:"5:very high"},
  "mothereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
    3:"3:secondary ed",4:"4:higher ed"},
  "fathereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
    3:"3:secondary ed",4:"4:higher ed"}}

studentmath.replace(setvalues, inplace=True)
  • Change the type for columns with the changed data to category.

Check any changes in memory usage.

setvalueskeys = [k for k in setvalues]
studentmath[setvalueskeys].memory_usage(index=False)

Output:

famrel 3160
freetime 3160
goout 3160
mothereducation 3160
fathereducation 3160
dtype: int64
for col in studentmath[setvalueskeys].columns:
    studentmath[col] = studentmath[col].astype('category')

studentmath[setvalueskeys].memory_usage(index=False)

Output:

famrel 607
freetime 607
goout 607
mothereducation 599
fathereducation 599
dtype: int64
  • Calculate percentages for values in the famrel column.

Run value_counts and set normalize to True to generate percentages.

studentmath['famrel'].value_counts(sort=False, normalize=True)

Output:

1:very bad 0.02
2:bad 0.05
3:neutral 0.17
4:good 0.49
5:excellent 0.27
Name: famrel, dtype: float64
  • Use apply to calculate percentages for multiple columns.
studentmath[['freetime','goout']].\
  apply(pd.Series.value_counts, sort=False, normalize=True)

Output:

freetime goout
1:very low 0.05 0.06
2:low 0.16 0.26
3:neutral 0.40 0.33
4:high 0.29 0.22
5:very high 0.10 0.13
studentmath[['mothereducation','fathereducation']].\
  apply(pd.Series.value_counts, sort=False, normalize=True)

Output:

mothereducation fathereducation
1:k-4 0.15 0.21
2:5-9 0.26 0.29
3:secondary ed 0.25 0.25
4:higher ed 0.33 0.24

The preceding steps retrieved a data table from a SQL database, loaded that
data into pandas and did some initial data checking and cleaning.

How It Works

Since data from enterprise systems is typically better structured than CSV or Excel files, we do not need to do things such as skip rows or deal with different logical data types in a column. But some massaging is still usually required before we can begin exploratory analysis. There are often more columns than we need, and some column names are not intuitive or not ordered in the best way for analysis. The meaningfulness of many data values is not stored in the data table, to avoid entry errors and save on storage space. For example, 3 is stored for mother’s education rather than secondary education. It is a good idea to reconstruct that coding as early in the cleaning process as possible.

To pull data from a SQL database server, we need a connection object to authenticate us on the server and a SQL select string. These can be passed to read_sql to retrieve the data and load it into a pandas DataFrame. I usually use the SQL SELECT statement to do a bit of cleanup of column names at this point. I sometimes also reorder columns, but I do that later in this recipe.

We set the index in step 5, first confirming that every row has a value for studentid and that it is unique. This is often more important when working with enterprise data because we will almost always need to merge the retrieved data with other data files on the system. Although an index is not required for this merging, the discipline of setting one prepares us for the tricky business of merging data down the road. It will also likely improve the speed of the merge.
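To illustrate why that discipline pays off (a sketch using a small hypothetical table, not data from the recipe), a second table keyed by studentid can be joined on the index directly:

# hypothetical lookup table keyed by studentid, used only to illustrate the join
otherdata = pd.DataFrame({'studentid': ['001', '002', '003'],
    'address': ['urban', 'urban', 'rural']}).set_index('studentid')

# both DataFrames are indexed by studentid, so join aligns rows by index
combined = studentmath.join(otherdata, how='left')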
We use the DataFrame’s count function to check for missing values, and there are none: the count of non-missing values is 395, the number of rows, for every column. This is almost too good to be true. There may be values that are logically missing, that is, valid numbers that nonetheless connote missing values, such as -1, 0, 9 or 99. We address this possibility in the next step.

Step 7 demonstrates a useful technique for replacing data values for multiple columns. We create a dictionary to map original values to new values for each column and then run it using replace. To reduce the amount of storage space taken up by the new, more verbose values, we convert the data type of those columns to category. We do this by generating a list of the keys of our setvalues dictionary: setvalueskeys = [k for k in setvalues] generates ['famrel', 'freetime', 'goout', 'mothereducation', 'fathereducation']. We then iterate over those five columns and use the astype method to change the data type to category. Notice that the memory usage for those columns is reduced substantially. Finally, we check the assignment of new values by using value_counts to view relative frequencies. We use apply because we want to run value_counts on multiple columns. To avoid value_counts sorting by frequency, we set sort to False.

The DataFrame replace method is also a handy tool for dealing with logical missing values that will not be recognized as missing when retrieved by read_sql. The 0 values for mothereducation and fathereducation seem to fall into that category. We fix this problem in the setvalues dictionary by indicating that 0 values for mothereducation and fathereducation should be replaced with NaN. It is important to address these kinds of missing values shortly after the initial import because they are not always obvious and can significantly impact all subsequent work.
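One way to confirm that those 0 values really did become missing (a small check of my own, not part of the recipe) is to include NaN in the frequency counts by setting dropna to False:

# dropna=False makes NaN appear as its own row in the counts
studentmath['mothereducation'].value_counts(dropna=False, sort=False)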
Users of packages such as SPSS, SAS and R will notice the difference between this approach and value labels in SPSS and R, and proc format in SAS. In pandas, we need to change the actual data to get more informative values. However, we reduce how much data is actually stored by giving the column a category data type. This is similar to factors in R.

There’s More

I moved the grade data to near the beginning of the DataFrame. I find it helpful to have potential target or dependent variables in the leftmost columns, to keep them at the forefront of my thinking. It is also helpful to keep similar columns together. In this example, the personal demographic variables (sex, age) are next to one another, as are the family variables (mothereducation, fathereducation) and the columns on how students spend their time (traveltime, studytime and freetime).

You could have used map instead of replace in step 7, as sketched below. Prior to pandas version 0.19.2, map was significantly more efficient. Since then, the difference in efficiency has been much smaller. If you are working with a very large dataset, the difference may still be enough to consider using map.
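For example, the famrel recoding from step 7 could be written with map instead (a sketch of the alternative, not the recipe’s own code). Note that map returns NaN for any value missing from the mapping dictionary, whereas replace leaves unmapped values untouched.

famrelmap = {1:"1:very bad", 2:"2:bad", 3:"3:neutral",
    4:"4:good", 5:"5:excellent"}

# map recodes famrel; values absent from famrelmap would become NaN
studentmath['famrel'] = studentmath['famrel'].map(famrelmap)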

See Also

The recipes in section 8, Addressing Data Issues when Combining Data
Frames, go into detail on merging data. We will take a closer look at
bivariate and multivariate relationships between variables in section 4,
Identifying Missing Values and Outliers in Subsets of Data. We demonstrate
how to use some of these same approaches in packages such as SPSS, SAS and R in subsequent recipes in these sections.

Conclusion

In conclusion, importing data from a SQL database is an essential skill that empowers businesses and developers to work efficiently with vast amounts of data. By getting ready with the right tools and understanding the steps required, you can streamline the process and ensure that data flows seamlessly into your projects. Knowing how to do it involves understanding key concepts like SQL queries, connectors and data management best practices, which are critical to unlocking the potential of your data. Once you understand how it works, you can easily manipulate, analyze and transform data, driving valuable insights that can redefine business strategies. But this is just the beginning: there is more to explore in optimizing data workflows, automating imports and integrating data from diverse sources. For further learning, see the additional resources that dive deeper into database management, automation and data transformation. Embrace the power of data imports and you will open new doors to greater efficiency and business growth.
