Persisting JSON data is about more than writing a file to disk: the key is storing the data efficiently and reliably so that it moves between systems and applications without loss. Because JSON carries so much of the data that fuels modern business operations, mastering this process ensures seamless communication between platforms, reduces redundancy and improves data integrity. Before diving in, it is worth getting ready with the right tools, frameworks and an understanding of how JSON is structured. How to do it begins with setting up proper storage and following sound practices for encoding, writing and reloading the data; once you grasp how it works, the process becomes intuitive and you can move quickly from theory to practice. There is more to explore, too: advanced techniques such as optimizing JSON serialization for performance or handling complex nested structures will take your skills further. For those eager to go deeper, see also related methods of data persistence, such as NoSQL databases or serverless architectures that streamline JSON handling for high-performance applications.
Table of Contents
- Persisting JSON data
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Persisting JSON Data
There are several reasons why we might want to serialize a JSON file:
- We may have retrieved the data with an API, but need to keep a snapshot of the data.
- The data in the JSON file is relatively static and informs our data cleaning and analysis over multiple phases of a project.
- We might decide that the flexibility of a schema-less format such as JSON helps us solve many data cleaning and analysis problems.
It is worth expanding on this last reason: JSON can solve many data problems. Although tabular data structures clearly have many benefits, particularly for operational data, they are often not the best way to store data for analysis purposes. In preparing data for analysis, a substantial amount of time is spent either merging data from different tables or dealing with data redundancy when working with flat files. Not only are these processes time consuming, but every merge or reshaping leaves the door open to a data error of broad scope. It can also mean that we end up paying too much attention to the mechanics of manipulating data and too little to the conceptual issues at the core of our work.
We return to the Cleveland Museum of Art collections data in this recipe. There are at least three possible units of analysis for this data file: the collection item level, the creator level and the citation level. JSON allows us to nest citations and creators within collection items. You can examine the structure of the JSON file in the Getting ready section of this recipe. This data cannot be persisted in a tabular structure without flattening the file, which we did in an earlier recipe in this section. In this recipe, we will use two different methods to persist JSON data, each with its own advantages and disadvantages.
Getting Ready
We will be working with data on the Cleveland Museum of Art’s collection
of works by African American artists. This is the structure of the
JSON data returned by the API. It has been abbreviated to save space.
{"info": { "total": 778, "parameters": {"african_american_artist
s": "" }},
"data": [
{
"id": 165157,
"accession_number": "2007.158",
"title": "Fulton and Nostrand",
"creation_date": "1958",
“citations": [
{
"citation": "Annual Exhibition: Sculpture, Paintings...",
"page_number": "Unpaginated, [8],[12]",
"url": null
},
{
"citation": "\"Moscow to See Modern U.S. Art,\"<em> New York.
..",
"page_number": "P. 60",
"url": null
}]
"creators": [
{
"description": "Jacob Lawrence (American, 1917-2000)",
"extent": null,
"qualifier": null,
"role": "artist",
"birth_year": "1917",
"death_year": "2000"
}
]
}
How To Do It
We will serialize the JSON data using two different methods.
- Load the pandas, json, pprint, requests and msgpack libraries.
import pandas as pd
import json
import pprint
import requests
import msgpack
- Load the JSON data from an API. I have abbreviated the JSON output.
response = requests.get("https://openaccess-api.clevelandart.org/api/artworks/?african_american_artists")
camcollections = json.loads(response.text)
print(len(camcollections['data']))
Output:
778
pprint.pprint(camcollections['data'][0])
Output:
{'accession_number': '2007.158',
'catalogue_raisonne': None,
'citations': [
{'citation': 'Annual Exhibition: Sculpture...',
'page_number': 'Unpaginated, [8],[12]',
'url': None},
{'citation': '"Moscow to See Modern U.S....',
'page_number': 'P. 60',
'url': None}],
'collection': 'American - Painting',
'creation_date': '1958',
'creators': [
{'biography': 'Jacob Lawrence (born 1917)...',
'birth_year': '1917',
'description': 'Jacob Lawrence (American...)',
'role': 'artist'}],
'type': 'Painting'}
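Before saving, it can help to confirm that the parsed object matches the structure shown in the Getting ready section. A minimal check, assuming the same API response as above:

# Top-level keys should include 'info' and 'data'
print(camcollections.keys())
# The API reports the total number of matching items
print(camcollections['info']['total'])   # 778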
- Save and reload the JSON file using Python's json library.
Persist the JSON data in human-readable form. Reload it from the saved file and confirm that it worked by retrieving the creators data from the first collections item.
with open("data/camcollections.json","w") as f:
json.dump(camcollections, f)
with open("data/camcollections.json","r") as f:
.camcollections = json.load(f)
pprint.pprint(camcollections['data'][0]['creators'])
Output:
[{'biography': 'Jacob Lawrence (born 1917) has been a prominent artist since...',
  'birth_year': '1917',
  'description': 'Jacob Lawrence (American, 1917-2000)',
  'role': 'artist'}]
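If readability of the saved file matters more than size, json.dump's indent parameter pretty-prints the output. A minimal variation on the step above; the indented file will be larger:

with open("data/camcollections.json", "w") as f:
    json.dump(camcollections, f, indent=2)   # one key per line, nested levels indented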
- Save and reload the JSON file using msgpack.
with open("data/camcollections.msgpack", "wb") as outfile:
.packed = msgpack.packb(camcollections)
outfile.write(packed)
Output (the return value of outfile.write, that is, the number of bytes written, as displayed in an interactive session):
1586507
with open("data/camcollections.msgpack", "rb") as data_file:
.msgbytes = data_file.read()
camcollections = msgpack.unpackb(msgbytes)
pprint.pprint(camcollections['data'][0]['creators'])
Output:
[{'biography': 'Jacob Lawrence (born 1917) has been a prominent...',
'birth_year': '1917',
'death_year': '2000',
'description': 'Jacob Lawrence (American, 1917-2000)',
'role': 'artist'}]
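One way to see the storage trade-off between the two formats is to compare the sizes of the files we just wrote. A quick check with the standard library, using the paths from the two steps above:

import os

for path in ["data/camcollections.json", "data/camcollections.msgpack"]:
    print(path, os.path.getsize(path), "bytes")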
How It Works
We use the Cleveland Museum of Art's collections API to retrieve collections items. The african_american_artists flag in the query string indicates that we just want collections for those creators. json.loads returns a dictionary with two keys: info, a dictionary, and data, a list of dictionaries. We check the length of the data list, which tells us that there are 778 items in collections. We then display the first item of collections to get a better look at the structure of the data. I have abbreviated the JSON output.
We save and then reload the data using Python's json library in step 3. The advantage of persisting the data in this way is that it keeps the data in human-readable form. Unfortunately, it has two disadvantages: saving takes longer than alternative serialization methods, and it uses more storage space.
In step 4, we use msgpack to persist our data. This is faster than Python's json library, and the saved file uses less space. The disadvantage, of course, is that the resulting file is binary rather than text-based.
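To verify the speed claim on your own data, you can time the two serializations in memory. A minimal sketch using time.perf_counter; exact timings will vary by machine and data size:

import time

start = time.perf_counter()
jsontext = json.dumps(camcollections)    # text serialization
print("json.dumps:   ", time.perf_counter() - start)

start = time.perf_counter()
packed = msgpack.packb(camcollections)   # binary serialization
print("msgpack.packb:", time.perf_counter() - start)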
There’s More
I use both methods for persisting JSON data in my work. When I am working with small amounts of data, and that data is relatively static, I prefer human-readable JSON. A great use case for this is the recipes in the previous section, where we needed to create value labels. I use msgpack when I am working with large amounts of data that changes regularly. msgpack files are also great when you want to take regular snapshots of key tables in enterprise databases.
The Cleveland Museum of Art's collections data is similar in at least one important way to the data we work with every day: the unit of analysis frequently changes. Here we are looking at collections, citations and creators. In our work, we might have to simultaneously look at students and courses, or households and deposits. An enterprise database system for the museum data would likely have separate collections, citations and creators tables that we would eventually need to merge. The resulting merged file would have data redundancy issues that we would need to account for whenever we changed the unit of analysis. When we alter our data cleaning process to work directly from JSON, or parts of it, we eliminate a major source of errors (see the sketch below for one way to pull out just the piece we need). We do more data cleaning with JSON in the Classes that handle non-tabular data structures recipe in section 10, User Defined Functions and Classes to Automate Data Cleaning.
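Since pandas is already imported, one way to move to the citation level without a database merge is to flatten just that piece of the JSON. A minimal sketch with pd.json_normalize, assuming a recent pandas and the structure shown in the Getting ready section:

# One row per citation, carrying the collection item's id and title
citations = pd.json_normalize(
    camcollections['data'],
    record_path='citations',
    meta=['id', 'title'])
print(citations.head())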
Conclusion
In conclusion, persisting JSON data is an essential skill, offering a powerful means to store dynamic, nested data efficiently. With the right tools in place, getting ready to handle JSON persistence is a straightforward task, and understanding the step-by-step process, from choosing a storage method to implementing it in your code, sets you up for scalable, robust applications. Seeing how the two serialization methods work makes the trade-offs concrete: human-readable text where transparency matters, and a compact binary format where speed and storage matter. There is always more to explore, from further optimizations to related persistence techniques, and the resources referenced above offer good next steps for leveraging JSON data in your own projects.