When diving into the world of data management, understanding how to persist tabular data is crucial. It is not just about storing information but about ensuring it remains accessible, organized, and ready for retrieval. Getting ready for this task means equipping yourself with the right tools, frameworks, and mindset to handle structured data effectively, regardless of its complexity. The process may seem technical, but with a clear roadmap anyone can master the steps needed to build efficient, scalable, and reliable workflows. As you implement these strategies, you will also discover how persistence works behind the scenes: optimizing storage, reducing redundancy, and boosting performance. And there is more to the story: advanced techniques, potential pitfalls to avoid, and enhancements that can take your data management to the next level.
For those eager to dig deeper, don't miss the additional resources and examples in the "See Also" section, which will further solidify your understanding and unlock even more potential.
Table of Contents
- Persisting tabular data
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Persisting Tabular Data
We persist data, copying it from memory to local or remote storage, for several reasons: to be able to access the data without having to repeat the steps we used to generate it, to share the data with others, or to make it available for use with different software. In this recipe, we save data that we have loaded into a pandas DataFrame as several different file types: CSV, Excel, pickle, and feather.

Another important, but sometimes overlooked, reason to persist data is to preserve some segment of our data that needs to be examined more closely; perhaps it needs to be scrutinized by others before our analysis can be completed. For analysts who work with operational data in medium-to-large-sized organizations, this process is part of the daily data cleaning workflow.

In addition to these reasons for persisting data, our decisions about when and how to serialize data are shaped by several other factors: where we are in our data analysis project, the hardware and software resources of the machines saving and reloading the data, and the size of our dataset. Analysts end up having to be much more intentional when saving data than they are when pressing Ctrl+S in their word processing application.

Once we persist data, it is stored separately from the logic that we used to create it. I find this to be one of the most important threats to the integrity of our analysis. Often, we end up loading data that we saved some time in the past (a week ago? a month ago? a year ago?) and forget how a variable was defined and how it relates to other variables. If we are in the middle of a data cleaning task, it is best not to persist our data, so long as our workstation and network can easily handle the burden of regenerating it. It is a good idea to persist data only once we have reached milestones in our work.

Beyond the question of when to persist data, there is the question of how. If we are persisting data for our own reuse with the same software, it is best to save it in a binary format native to that software. That is pretty straightforward for tools such as SPSS, SAS, Stata, and R, but not so much for pandas. In a way, though, that is good news: we have lots of choices, from CSV and Excel to pickle and feather. We save to all of these file types in this recipe.
Getting Ready
You will need to install pyarrow if you do not have it on your system; it provides the feather support that pandas uses. You can do that by entering pip install pyarrow in a terminal window, or in PowerShell on Windows. If you do not already have a subfolder named views in your section 1 folder, you will need to create it in order to run the code for this recipe.
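As a convenience, the following minimal sketch (assuming you run it from your section 1 folder) creates the views subfolder if it is missing and confirms that pyarrow is importable:

from pathlib import Path

# Create the views subfolder if it does not already exist;
# the to_excel and to_csv calls later in this recipe write here.
Path("views").mkdir(exist_ok=True)

# Confirm that pyarrow is installed; pandas needs it for feather files.
try:
    import pyarrow
    print("pyarrow version:", pyarrow.__version__)
except ImportError:
    print("pyarrow is missing -- run: pip install pyarrow")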
This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at: https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-4. This is just a 100,000-row sample of the full dataset, which is also available in the repository.
How To Do It
We will load a CSV file into pandas and then save it as a pickle and as a
feather file. We will also save subsets of the data to CSV and Excel formats.
- Import pandas and pyarrow.
pyarrow needs to be installed in order to save pandas data to feather; importing it here confirms it is available.
import pandas as pd
import pyarrow
- Load the land temperatures CSV file into pandas, drop rows with missing data, and set an index.
landtemps = pd.read_csv('data/landtempssample.csv',
    names=['stationid','year','month','avgtemp',
        'latitude','longitude','elevation',
        'station','countryid','country'],
    skiprows=1, parse_dates=[['month','year']], low_memory=False)
landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
landtemps.dropna(subset=['avgtemp'], inplace=True)
landtemps.dtypes
Output:
measuredate datetime64[ns]
stationid object
avgtemp float64
latitude float64
longitude float64
elevation float64
station object
countryid object
country object
dtype: object
landtemps.set_index(['measuredate','stationid'], inplace=True)
- Write extreme values for temperature to CSV and Excel files.
Use the quantile method to select outlier rows, those at the 1 in 1,000 level
at each end of the distribution.
extremevals = landtemps[(landtemps.avgtemp < landtemps.avgtemp.quantile(.001)) |
    (landtemps.avgtemp > landtemps.avgtemp.quantile(.999))]
extremevals.shape
Output:
(171, 7)
extremevals.sample(7)
Output:
avgtemp ... country
measuredate stationid ...
2013-08-01 QAM00041170 35.30 ... Qatar
2005-01-01 RSM00024966 -40.09 ... Russia
1973-03-01 CA002401200 -40.26 ... Canada
2007-06-01 KU000405820 37.35 ... Kuwait
1987-07-01 SUM00062700 35.50 ... Sudan
1998-02-01 RSM00025325 -35.71 ... Russia
1968-12-01 RSM00024329 -43.20 ... Russia
[7 rows x 7 columns]
extremevals.to_excel('views/tempext.xlsx')
extremevals.to_csv('views/tempext.csv')
- Save to pickle and feather files.
The index needs to be reset in order to save a feather file.
landtemps.to_pickle('data/landtemps.pkl')
landtemps.reset_index(inplace=True)
landtemps.to_feather("data/landtemps.ftr")
- Load the pickle and feather files we just saved. Our index was preserved when saving and loading the pickle file.
landtemps = pd.read_pickle('data/landtemps.pkl')
landtemps.head(2).T
Output:
measuredate 2000-04-01 1940-05-01
stationid USS0010K01S CI000085406
avgtemp 5.27 18.04
latitude 39.90 -18.35
longitude -110.75 -70.33
elevation 2,773.70 58.00
station INDIAN_CANYON ARICA
countryid US CI
country United States Chile
landtemps = pd.read_feather("data/landtemps.ftr")
landtemps.head(2).T
Output:
measuredate 2000-04-01 00:00:00 1940-05-01 00:00:00
stationid USS0010K01S CI000085406
avgtemp 5.27 18.04
latitude 39.90 -18.35
longitude -110.75 -70.33
elevation 2,773.70 58.00
station INDIAN_CANYON ARICA
countryid US CI
country United States Chile
The previous steps demonstrate how to serialize pandas DataFrames using two different binary formats, pickle and feather.
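As a quick sanity check, a minimal sketch like the following (assuming the files saved above exist) verifies that a round trip through pickle preserves both values and index, and that the feather round trip matches once the index is restored:

import pandas as pd

# Reload both files saved earlier in this recipe.
frompickle = pd.read_pickle('data/landtemps.pkl')
fromfeather = pd.read_feather('data/landtemps.ftr')

# The pickle kept the multi-index; rebuild it on the feather copy
# before comparing the two DataFrames.
fromfeather = fromfeather.set_index(['measuredate','stationid'])
pd.testing.assert_frame_equal(frompickle.sort_index(), fromfeather.sort_index())
print("pickle and feather round trips match")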
How It Works
Persisting pandas data is quite straightforward. DataFrames have to_csv,
to_excel, to_pickle, and to_feather methods. Pickling preserves our
index.
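To see these four methods side by side, here is a minimal sketch on a small, made-up DataFrame (the file names and values are purely illustrative):

import pandas as pd

# A tiny, made-up DataFrame to illustrate the four persistence methods.
df = pd.DataFrame({'avgtemp': [5.27, 18.04]},
    index=pd.Index(['USS0010K01S', 'CI000085406'], name='stationid'))

df.to_csv('views/demo.csv')      # plain text; index saved as a column
df.to_excel('views/demo.xlsx')   # spreadsheet; index saved as a column
df.to_pickle('views/demo.pkl')   # binary; index and dtypes preserved
df.reset_index().to_feather('views/demo.ftr')  # feather requires a default index

# Only the pickle round trip restores the index automatically.
print(pd.read_pickle('views/demo.pkl').index.name)   # stationid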
There’s More
The advantage of storing data in CSV files is that saving them uses very little additional memory. The disadvantage is that writing CSV files is quite slow and we lose important metadata, such as data types. (read_csv can often figure out the data types when we reload the file, but not always.) Pickle files keep that data, but serializing can burden a system that is low on resources. Feather is easier on resources and can be loaded easily in both R and Python, but we have to sacrifice our index in order to serialize. Also, the authors of feather make no promises regarding long-term support.

You may have noticed that I do not make a recommendation about what to use for data serialization, other than to limit your persistence of full datasets to project milestones. This is definitely one of those "right tool for the job" situations. I use CSV or Excel files when I want to share a segment of a file with colleagues for discussion. I use feather for ongoing Python projects, particularly when I am using a machine with subpar RAM and an outdated chip, and when I am also using R. When I am wrapping up a project, I pickle the DataFrames.
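The data type point is easy to demonstrate. The following minimal sketch (with a made-up DataFrame) shows a datetime column surviving a pickle round trip but coming back as plain text from CSV unless we re-parse it on load:

import pandas as pd

# Made-up example: one datetime column and one float column.
df = pd.DataFrame({'measuredate': pd.to_datetime(['2000-04-01', '1940-05-01']),
                   'avgtemp': [5.27, 18.04]})

df.to_csv('views/dtypes.csv', index=False)
df.to_pickle('views/dtypes.pkl')

# CSV loses the datetime dtype unless we re-parse on load...
print(pd.read_csv('views/dtypes.csv').dtypes['measuredate'])      # object
print(pd.read_csv('views/dtypes.csv',
    parse_dates=['measuredate']).dtypes['measuredate'])           # datetime64[ns]

# ...while pickle preserves it automatically.
print(pd.read_pickle('views/dtypes.pkl').dtypes['measuredate'])   # datetime64[ns]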
Conclusion
In conclusion, understanding how to persist tabular data is a crucial step for any data-driven project, ensuring that information is stored securely and efficiently for future access. Having grasped the basics of getting ready, from setting up your environment to preparing your data, you are well on your way to confidently managing your data assets. By following the "how to do it" steps, you have learned practical methods for implementing persistence, ensuring data remains consistent across sessions and use cases. You have also seen "how it works" under the hood, gaining insight into the underlying mechanics and why these choices matter for long-term data retention. But don't stop here: there is more to explore. Advanced techniques can optimize performance, improve scalability, and integrate seamlessly with other systems. As you continue your journey, explore related topics that will further enhance your understanding and capabilities. Keep experimenting, learning, and applying your knowledge, and you will find yourself mastering the art of data persistence and its applications.