8 — Pandas Data Cleaning: Importing More Complicated JSON Data From an API

A.I Hub
8 min read · Sep 9, 2024


When it comes to working with APIs, the real challenge begins when you are tasked with importing more complicated JSON data structures. At first glance, the nested layers of data might seem overwhelming, but don't worry: with the right approach and tools, you will be ready to handle them like a pro. Whether you are preparing to pull in data from a third-party service or to manage complex relationships within your own datasets, understanding how to efficiently retrieve, parse, and manipulate JSON is critical. In this guide, we will walk through the process step by step, breaking it down so it works seamlessly in your workflow. You will learn not only how to deal with intricate data structures, but also how to extract meaningful insights from them. And just when you think you have mastered the basics, there are more advanced techniques to make sure you are fully equipped for any API task. For those who want to dive deeper into related topics, be sure to check out our other resources, which will expand your knowledge and skills in no time.

Table of Contents

  • Importing more complicated JSON data from an API
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Importing More Complicated JSON Data From an API

In the previous recipe, we discussed one significant advantage and challenge of working with JSON data: its flexibility. A JSON file can have just about any structure its authors can imagine. This often means that the data does not have the tabular structure of the data sources we have discussed so far, and that pandas DataFrames have. Often, analysts and application developers use JSON precisely because it does not insist on a tabular structure. I know I do!

Retrieving data from multiple tables often requires us to do a one-to-many merge. Saving that data to one table or file means duplicating data on the "one" side of the one-to-many relationship. For example, when student demographic data is merged with data on the courses studied, the demographic data is repeated for each course. With JSON, duplication is not required to capture these items of data in one file: we can nest the data on the courses studied within the data for each student. But doing analysis with JSON structured in this way will eventually require us to either: 1) manipulate the data in a very different way than we are used to doing, or 2) convert the JSON to a tabular form. We examine the first approach in the Classes that handle non-tabular data structures recipe in Section 10, User-Defined Functions and Classes to Automate Data Cleaning. This recipe takes the second approach, using a very handy tool for converting selected nodes of JSON to a tabular structure: json_normalize.

We first use an API to get the JSON data, because that is how JSON is frequently consumed. One advantage of retrieving the data with an API, rather than working from a file we have saved locally, is that it is easier to rerun our code when the source data is refreshed.

Getting Ready

This recipe assumes you have the requests and pprint libraries already installed. If they are not, you can install them with pip from the terminal (or PowerShell in Windows):

pip install pprint
pip install requests

The following is the structure of the JSON file that is created when using the collections API of the Cleveland Museum of Art. There is a helpful info section at the beginning, but we are interested in the data section. This data does not fit nicely into a tabular data structure: there may be several citations objects and several creators objects for each collection object. I have abbreviated the JSON file to save space.
{"info": { "total": 778, "parameters": {"african_american_artist
s": "" }},
"data": [
{
"id": 165157,
"accession_number": "2007.158",
"title": "Fulton and Nostrand",
"creation_date": "1958",
“citations": [
{
"citation": "Annual Exhibition: Sculpture, Paintings...",
"page_number": "Unpaginated, [8],[12]",
"url": null
},
{
"citation": "\"Moscow to See Modern U.S. Art,\"<em> New York.
..",
"page_number": "P. 60",
"url": null
}]
"creators": [
{
"description": "Jacob Lawrence (American, 1917-2000)",
"extent": null,
"qualifier": null,
"role": "artist",
"birth_year": "1917",
"death_year": "2000"
}
]
}

The API used in this recipe is provided by the Cleveland Museum of Art. It is available for public use at https://openaccess-api.clevelandart.org/.

How To Do It

Create a DataFrame from the museum's collections data with one row for each citation, and with the title and creation_date duplicated on each citation's row.

  • Import the json, requests, and pprint libraries.

We need the requests library to use an API to retrieve JSON data. pprint
improves the display of lists and dictionaries.

import pandas as pd
import numpy as np
import json
import pprint
import requests
  • Use an API to load the JSON data.

Make a GET request to the collections API of the Cleveland Museum of Art, using the query string to indicate that you just want collections from African American artists. Then display the first collection item. I have truncated the output for the first item to save space.

response = requests.get("https://openaccess-api.clevelandart.org
/api/artworks/?african_american_artists")
camcollections = json.loads(response.text)
print(len(camcollections['data']))

Output:

778
pprint.pprint(camcollections['data'][0])

Output:

{'accession_number': '2007.158',
 'catalogue_raisonne': None,
 'citations': [{'citation': 'Annual Exhibition: Sculpture...',
                'page_number': 'Unpaginated, [8],[12]',
                'url': None},
               {'citation': '"Moscow to See Modern U.S....',
                'page_number': 'P. 60',
                'url': None}],
 'collection': 'American - Painting',
 'creation_date': '1958',
 'creators': [{'biography': 'Jacob Lawrence (born 1917)...',
               'birth_year': '1917',
               'description': 'Jacob Lawrence (American...)',
               'role': 'artist'}],
 'type': 'Painting'}
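Before moving on, note that requests will happily return an error page, which json.loads then fails to parse in a confusing way. Here is a minimal defensive sketch of the same request (one option, not the only way to do it); it passes the query string through the params argument, assuming the API treats an empty value the same as the bare flag used above.

response = requests.get(
    "https://openaccess-api.clevelandart.org/api/artworks/",
    params={"african_american_artists": ""})
response.raise_for_status()        # raise an exception on 4xx/5xx responses
camcollections = response.json()   # equivalent to json.loads(response.text)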
  • Flatten the JSON data.

Create a DataFrame from the JSON data using the json_normalize function. Indicate that the number of citations will determine the number of rows, and that accession_number, title, creation_date, collection, creators, and type will be repeated. Observe that the data has been flattened by displaying the first two observations, transposing them with .T to make them easier to view.

camcollectionsdf = pd.json_normalize(
    camcollections['data'], 'citations',
    ['accession_number', 'title', 'creation_date',
     'collection', 'creators', 'type'])
camcollectionsdf.head(2).T

Output:

                                    0                         1
citation            Annual Exhibiti...  "Moscow to See Modern...
page_number            Unpaginated,...                     P. 60
url                               None                      None
accession_number              2007.158                  2007.158
title                 Fulton and No...          Fulton and No...
creation_date                     1958                      1958
collection            American - Pa...          American - Pa...
creators         [{'description':'J...    [{'description':'J...
type                          Painting                  Painting
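Incidentally, if you wanted one row per creator instead of one per citation, the same function accepts creators as the record path. This is a sketch under the assumption, borne out by the sample above, that every item carries a creators list:

creatorsdf = pd.json_normalize(camcollections['data'], 'creators',
    ['accession_number', 'title', 'creation_date'])
creatorsdf.head(2).T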

  • Pull the birth_year value from creators.

creator = camcollectionsdf[:1].creators[0]
type(creator[0])

Output:

<class 'dict'>
pprint.pprint(creator)

Output:

[{'biography': 'Jacob Lawrence (born 1917) has been a prominent
art...',
'birth_year': '1917',
'death_year': '2000',
'description': 'Jacob Lawrence (American, 1917-2000)',
'extent': None,
'name_in_original_language': None,
'qualifier': None,
'role': 'artist'}]
camcollectionsdf['birthyear'] = camcollectionsdf.\
    creators.apply(lambda x: x[0]['birth_year'])
camcollectionsdf.birthyear.value_counts().\
    sort_index().head()

Output:

1821    18
1886     2
1888     1
1892    13
1899    17
Name: birthyear, dtype: int64

This gives us a pandas DataFrame with one row for each citation for each collection item, with the collection information (title, creation_date, and so on) duplicated.
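Two refinements are worth a quick, hedged sketch. First, because the DataFrame has one row per citation, the birthyear.value_counts() tally above counts citations rather than collection items; dropping duplicate accession_number values first (assuming that column uniquely identifies an item) counts each item once. Second, if you eventually need every creator rather than just the first, explode unpacks the nested creators list.

# Count each collection item once rather than once per citation
# (assumes accession_number uniquely identifies an item).
camcollectionsdf.drop_duplicates('accession_number').\
    birthyear.value_counts().sort_index().head()

# One row per citation-creator pair; after explode, each creators
# cell holds a single dictionary rather than a list.
creatorrows = camcollectionsdf.explode('creators')
creatorrows['creatorbirthyear'] = creatorrows.\
    creators.apply(lambda x: x['birth_year'])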

How It Works

We work with a much more interesting JSON file in this recipe than in the previous one. Each object in the JSON file is an item in the collection of the Cleveland Museum of Art. Nested within each collection item are one or more citations. There are also one or more dictionaries for the creators of the collection item (the artist or artists), and those dictionaries contain the birth_year value that we want. The only way to capture this information in a tabular DataFrame is to flatten it.

We want one row for every citation for all collection items. To understand this, imagine that we are working with relational data, have a collections table and a citations table, and are doing a one-to-many merge from collections to citations. We do something similar with json_normalize by using citations as the second parameter. That tells json_normalize to create one row for each citation and to use the key values in each citation dictionary (citation, page_number, and url) as data values. The third parameter in the call to json_normalize is the list of column names for the data that will be repeated with each citation. Notice that accession_number, title, creation_date, collection, creators, and type are repeated in observations one and two, while citation and page_number change. (url happens to be the same value for the first and second citations; otherwise, it would also change.)

This still leaves us with the problem of the creators dictionaries: there can be more than one creator. When we ran json_normalize, it grabbed the value for each key we indicated in the third parameter and stored it in the data for that column and row, whether that value was simple text or a list of dictionaries, as is the case for creators. We took a look at the first (and, in this case, only) creators item for the first collections row above, naming it creator. Note that the creators list is duplicated across all citations for a collection item, just as the values for title, creation_date, and so on are. We want the birth year of the first creator for each collection item, which can be found at creator[0]['birth_year']. To create a birthyear series using this, we use apply and a lambda function.

camcollectionsdf['birthyear'] = camcollectionsdf.\
    creators.apply(lambda x: x[0]['birth_year'])

We take a closer look at lambda functions in Section 6, Cleaning and Exploring Data with Series Operations. Here, it is helpful to think of x as representing one value of the creators series (a list of creator dictionaries), so x[0] gives us the list item we want, creators[0], and we grab the value from its birth_year key.
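To see the mechanics of json_normalize in isolation, here is a small self-contained example with hypothetical student data (the names and fields are made up for illustration), echoing the students-and-courses scenario from the introduction. Courses are nested within each student, so nothing is duplicated in the JSON itself; flattening repeats name and city on each course row, just as title and creation_date are repeated for each citation.

import pandas as pd

students = [
    {"name": "Ava", "city": "Akron",
     "courses": [{"title": "Algebra", "grade": "A"},
                 {"title": "Biology", "grade": "B"}]},
    {"name": "Ben", "city": "Toledo",
     "courses": [{"title": "Algebra", "grade": "B"}]}
]

# One row per course; name and city come along as repeated metadata.
studentsdf = pd.json_normalize(students, "courses", ["name", "city"])
print(studentsdf)

Output:

     title grade name    city
0  Algebra     A  Ava   Akron
1  Biology     B  Ava   Akron
2  Algebra     B  Ben  Toledo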

There’s More

You may have noticed that we left out some of the JSON returned by the API in our call to json_normalize. The first parameter that we passed to json_normalize was camcollections['data'], so we effectively ignore the info object at the beginning of the JSON data. The information we want does not start until the data object. This is not very different conceptually from the skiprows parameter we used in the second recipe of the previous section. There is sometimes metadata like this at the beginning of JSON files.
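That metadata can still be useful, though. As a quick sanity check (a sketch, assuming the info structure shown at the start of this recipe), we can confirm that we received every record the API reports:

print(camcollections['info']['total'])   # 778, per the output above
assert camcollections['info']['total'] == len(camcollections['data'])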

See Also

The preceding recipe demonstrates some useful techniques for doing data integrity checks without pandas, including list operations and comprehensions. Those are all relevant for the data in this recipe as well.

Conclusion

In conclusion, importing more complicated JSON data from an API is a crucial skill for handling dynamic and complex datasets, empowering developers to seamlessly integrate various sources of information into their applications. Getting ready for this task involves understanding the structure of the JSON, setting up the necessary tools, and ensuring that the API provides the data you need. Once you are equipped with the right knowledge, "how to do it" becomes a step-by-step process of parsing, mapping, and transforming the data efficiently. Understanding "how it works" is key to unlocking the power of automation and data manipulation, giving you the ability to automate workflows, streamline processes, and extract valuable insights. But the journey doesn't stop here: there is always more to explore, from optimizing performance to dealing with edge cases in your data. Dive deeper, experiment with different approaches, and continue expanding your skill set as you venture further into the realm of API-driven development. For more in-depth resources and advanced techniques, check out the "See Also" section, which will guide you to the next stage of mastery in handling JSON and APIs.
