When it comes to working with data, importing JSON files is one of the simplest yet most powerful ways to get started. Whether you are a beginner or a seasoned developer, understanding how to seamlessly pull in JSON data can dramatically enhance your workflow. But before you dive in, you need to ensure you are fully prepared, whether that means setting up your environment or gaining a clear understanding of the JSON format itself. The process is straightforward once you know how to do it, and I’m here to walk you through every step. From understanding the file structure to using simple commands, you’ll see how efficiently you can import JSON data into your projects. The magic happens as you begin to understand how it works: how the data is structured, parsed, and integrated into your application. But that’s just the beginning; there’s much more to explore in the world of JSON, from advanced techniques to tools that make your life easier. And if you’re curious for more, be sure to check out related concepts and resources to deepen your knowledge.
Table of Contents
- Importing simple JSON data
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Importing Simple JSON Data
JavaScript Object Notation (JSON) has turned out to be an incredibly
useful standard for transferring data from one machine, process, or node to
another. Often a client sends a data request to a server, upon which that
server queries the data in the local storage and then converts it from
something like a SQL Server table or tables into JSON, which the client can
consume. This is sometimes complicated further by the first server, say a
web server, forwarding the request to a database server. JSON facilitates this, as does XML, by doing the following:
- Being readable by humans.
- Being consumable by most client devices.
- Not being limited in structure.
JSON is quite flexible, which means that it can accommodate just about
anything, no matter how unwise. The structure can even change within a
JSON file, so different keys might be present at different points. For example,
the file might begin with some explanatory keys that have a very different
structure than the remaining data keys. Or some keys might be present in
some cases, but not others. We go over some approaches for dealing with that
messiness (uh, flexibility).
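For example, the made-up snippet below (not the recipe's data) mixes objects with very different keys, and json.loads accepts it without complaint. Counting the keys per object, as we do later in this recipe, is a quick way to spot the differences:

```python
import json
from collections import Counter

# A hypothetical JSON string where different objects carry different keys
raw = '''
[
  {"note": "collection started"},
  {"title": "Story A", "source": "Site 1", "date": "2019-01-01"},
  {"title": "Story B", "source": "Site 2", "date": "2019-01-02", "category": "Satire"}
]
'''
items = json.loads(raw)

# Count how many keys each object has to spot structural differences
keycounts = Counter(len(item) for item in items)
print(keycounts)
```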
Getting Ready
We are going to work with data on news stories about political candidates in
this recipe. This data is made available for public use at
dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0ZLHOK. I have combined the JSON files there into one file and randomly selected
60,000 news stories from the combined data. This sample (allcandidatenewssample.json) is available in the GitHub repository of
this section. We will do a little work with list and dictionary comprehensions in
this recipe. DataCamp has good guides to list comprehension (https://www.datacamp.com/community/tutorials/python-list-comprehension) and dictionary comprehension (https://www.datacamp.com/community/tutorials/python-dictionary-comprehension) if you are feeling a little rusty.
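As a quick refresher before we start: a list comprehension builds a list from an iterable in a single expression, and a dictionary comprehension does the same for key/value pairs.

```python
# List comprehension: squares of the even numbers below 10
squares = [n * n for n in range(10) if n % 2 == 0]
print(squares)  # [0, 4, 16, 36, 64]

# Dictionary comprehension: map each word to its length
lengths = {word: len(word) for word in ["title", "source", "date"]}
print(lengths)  # {'title': 5, 'source': 6, 'date': 4}
```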
How To Do It
We will import a JSON file into pandas after doing some data checking and
cleaning.
- Import the json and pprint libraries.
pprint improves the display of the lists and dictionaries that are returned
when we load JSON data.
import pandas as pd
import numpy as np
import json
import pprint
from collections import Counter
- Load the JSON data and look for potential issues.
Use the json load method to return data on news stories about political
candidates. load returns a list of dictionaries. Use len to get the size of the
list which is the total number of news stories in this case. Each list item is a
dictionary with keys for the title, source and so on and their respective
values. Use pprint to display the first two dictionaries. Get the value from
the source key for the first list item.
with open('data/allcandidatenewssample.json') as f:
    candidatenews = json.load(f)
len(candidatenews)
Output:
60000
pprint.pprint(candidatenews[0:2])
Output:
[{'date': '2019-12-25 10:00:00',
'domain': 'www.nbcnews.com',
'panel_position': 1,
'query': 'Michael Bloomberg',
'source': 'NBC News',
'story_position': 6,
'time': '18 hours ago',
'title': 'Bloomberg cuts ties with company using prison inmates to make campaign calls',
'url': 'https://www.nbcnews.com/politics/2020-election/bloomberg-cuts-ties-company-using-prison-inmates-make-campaign-calls-n1106971'},
{'date': '2019-11-09 08:00:00',
'domain': 'www.townandcountrymag.com',
'panel_position': 1,
'query': 'Amy Klobuchar',
'source': 'Town & Country Magazine',
'story_position': 3,
'time': '18 hours ago',
'title': "Democratic Candidates React to Michael Bloomberg's Potential Run",
'url': 'https://www.townandcountrymag.com/society/politics/a29739854/michael-bloomberg-democratic-candidates-campaign-reactions/'}]
pprint.pprint(candidatenews[0]['source'])
Output:
'NBC News'
- Check for differences in the structure of the dictionaries.
Use Counter to check for any dictionaries in the list with fewer than, or
more than, the 9 keys that are normal. Look at a few of the dictionaries with
almost no data (just two keys) before removing them. Confirm that
the remaining list of dictionaries has the expected length: 60000-2382=57618.
Counter([len(item) for item in candidatenews])
Output:
Counter({9: 57202, 2: 2382, 10: 416})
pprint.pprint(next(item for item in candidatenews if len(item)<9))
Output:
{'date': '2019-09-11 18:00:00', 'reason': 'Not collected'}
pprint.pprint(next(item for item in candidatenews if len(item)>9))
Output:
{'category': 'Satire',
'date': '2019-08-21 04:00:00',
'domain': 'politics.theonion.com',
'panel_position': 1,
'query': 'John Hickenlooper',
'source': 'Politics | The Onion',
'story_position': 8,
'time': '4 days ago',
'title': '‘And Then There Were 23,’ Says Wayne Messam Crossing Out '
         'Hickenlooper Photo \n'
         'In Elaborate Grid Of Rivals',
'url': 'https://politics.theonion.com/and-then-there-were-23-says-wayne-messam-crossing-ou-1837311060'}
pprint.pprint([item for item in candidatenews if len(item)==2][0:10])
Output:
[{'date': '2019-09-11 18:00:00', 'reason': 'Not collected'},
{'date': '2019-07-24 00:00:00', 'reason': 'No Top stories'},
{'date': '2019-01-03 00:00:00', 'reason': 'No Top stories'}]
candidatenews = [item for item in candidatenews if len(item)>2]
len(candidatenews)
Output:
57618
- Generate counts from the JSON data.
Get the dictionaries just for Politico, a website that covers political news, and
display a couple of them.
politico = [item for item in candidatenews if item["source"] == "Politico"]
len(politico)
Output:
2732
pprint.pprint(politico[0:2])
Output:
[{'date': '2019-05-18 18:00:00',
'domain': 'www.politico.com',
'panel_position': 1,
'query': 'Marianne Williamson',
'source': 'Politico',
'story_position': 7,
'time': '1 week ago',
'title': 'Marianne Williamson reaches donor threshold for Dem debates',
'url': 'https://www.politico.com/story/2019/05/09/marianne-williamson-2020-election-1315133'},
{'date': '2018-12-27 06:00:00', 'domain': 'www.politico.com',
'panel_position': 1,
'query': 'Julian Castro',
'source': 'Politico',
'story_position': 1,
'time': '1 hour ago',
'title': "O'Rourke and Castro on collision course in Texas",
'url': 'https://www.politico.com/story/2018/12/27/orourke-julian-castro-collision-texas-election-1073720'}]
- Get the source data and confirm that it has the anticipated length.
Show the first few items in the new sources list. Generate a count of news
stories by source and display the 10 most common sources. Notice that stories
from The Hill can have either TheHill (without a space) or The Hill as the value
for source.
sources = [item.get('source') for item in candidatenews]
type(sources)
Output:
<class 'list'>
len(sources)
Output:
57618
sources[0:5]
Output:
['NBC News', 'Town & Country Magazine', 'TheHill', 'CNBC.com', 'Fox News']
pprint.pprint(Counter(sources).most_common(10))
Output:
[('Fox News', 3530),
('CNN.com', 2750),
('Politico', 2732),
('TheHill', 2383),
('The New York Times', 1804),
('Washington Post', 1770),
('Washington Examiner', 1655),
('The Hill', 1342),
('New York Post', 1275),
('Vox', 941)]
- Fix any errors in the values in the dictionary.
Fix the source values for The Hill. Notice that The Hill is now the most
frequent source for news stories.
for newsdict in candidatenews:
    newsdict.update((k, "The Hill") for k, v in newsdict.items()
                    if k == "source" and v == "TheHill")
sources = [item.get('source') for item in candidatenews]
pprint.pprint(Counter(sources).most_common(10))
Output:
[('The Hill', 3725),
('Fox News', 3530),
('CNN.com', 2750),
('Politico', 2732),
('The New York Times', 1804),
('Washington Post', 1770),
('Washington Examiner', 1655),
('New York Post', 1275),
('Vox', 941),
('Breitbart', 799)]
- Create a pandas DataFrame.
Pass the JSON data to the pandas DataFrame method. Convert the date
column to a datetime data type.
candidatenewsdf = pd.DataFrame(candidatenews)
candidatenewsdf.dtypes
Output:
title object
url object
source object
time object
date object
query object
story_position int64
panel_position object
domain object
category object
dtype: object
- Confirm that we are getting the expected values for source. Also, rename the date column.
candidatenewsdf.rename(columns={'date':'storydate'}, inplace=True)
candidatenewsdf.storydate = candidatenewsdf.storydate.astype('datetime64[ns]')
candidatenewsdf.shape
Output:
(57618, 10)
candidatenewsdf.source.value_counts(sort=True).head(10)
Output:
The Hill 3725
Fox News 3530
CNN.com 2750
Politico 2732
The New York Times 1804
Washington Post 1770
Washington Examiner 1655
New York Post 1275
Vox 941
Breitbart 799
Name: source, dtype: int64
We now have a pandas DataFrame with only the news stories where there is
meaningful data and with the values for source fixed.
How It Works
The json.load method returns a list of dictionaries. This makes it possible
to use a number of familiar tools when working with this data: list methods,
slicing, list comprehensions, dictionary updates, and so on. There may be
times when you just have to populate a list or count the number of items in a
given category, and there is no need to use pandas. In steps
2 to 6, we use list methods to do many of the same checks we have done with
pandas in previous recipes. In step 3, we use Counter with a list
comprehension (Counter([len(item) for item in candidatenews])) to
get the number of keys in each dictionary. This tells us that there are 2,382
dictionaries with just 2 keys and 416 with 10. We use next to look for an
example of a dictionary with fewer than 9 keys, and one with more than 9 keys, to get a
sense of the structure of those items. We use slicing to show dictionaries
with 2 keys to see if there is any data in those dictionaries. We then select
only those dictionaries with more than 2 keys. In step 4, we create a subset of
the list of dictionaries, one that just has a source equal to Politico, and take a look at a couple of items. We then create a list with just the source data and
use Counter to list the 10 most common sources in step 5. Step 6
demonstrates how to replace key values conditionally in a list of dictionaries.
In this case, we update the key value to The Hill whenever the key (k) is
source and the value (v) is TheHill. The for k, v in newsdict.items()
section is the unsung hero of this line. It loops through all key/value pairs for
all dictionaries in candidatenews. It is easy to create a pandas DataFrame by
passing the list of dictionaries to the pandas DataFrame method. We do this
in step 7. The main complication is that we need to convert the date column
from a string to a date, since dates are just strings in JSON.
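The whole flow can be condensed into a small self-contained sketch, with an inline, made-up list of dictionaries standing in for the downloaded file:

```python
import pandas as pd

# Inline stand-in for the JSON file used in the recipe
candidatenews = [
    {"date": "2019-12-25 10:00:00", "source": "NBC News", "title": "Story A"},
    {"date": "2019-09-11 18:00:00", "reason": "Not collected"},
    {"date": "2019-11-09 08:00:00", "source": "TheHill", "title": "Story B"},
]

# Drop the near-empty items (those with only 2 keys)
candidatenews = [item for item in candidatenews if len(item) > 2]

# Fix the inconsistent source value, as in step 6
for newsdict in candidatenews:
    newsdict.update((k, "The Hill") for k, v in newsdict.items()
                    if k == "source" and v == "TheHill")

# Build the DataFrame and convert the date strings to datetimes
df = pd.DataFrame(candidatenews)
df["date"] = df["date"].astype("datetime64[ns]")
print(df["source"].tolist())  # ['NBC News', 'The Hill']
```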
There’s More
In steps 5 and 6, we use item.get('source') instead of item['source'].
This is handy when there might be missing keys in a dictionary. get returns
None when the key is missing, but we can use an optional second argument
to specify a value to return instead. I renamed the date column to storydate in step
8. This is not necessary, but it is a good idea. Not only does date not tell you
anything about what the dates actually represent; it is also so generic a
column name that it is bound to cause problems at some point. The news
stories data fits nicely into a tabular structure. It makes sense to represent
each list item as one row, and the key/value pairs as columns and column
values for that row. There are no significant complications, such as key
values that are themselves lists of dictionaries. Imagine an authors key for
each story with a list item for each author as the key value, where that list item
is a dictionary of information about the author. This is not at all unusual when
working with JSON data in Python. The next recipe shows how to work with
data structured in this way.
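The fallback argument to get is worth a quick illustration (made-up dictionaries, not the recipe's data):

```python
story = {"title": "Story A", "source": "NBC News"}
short = {"date": "2019-09-11 18:00:00", "reason": "Not collected"}

# Bracket access raises KeyError for a missing key; get does not
print(story.get("source"))             # NBC News
print(short.get("source"))             # None
print(short.get("source", "Unknown"))  # Unknown
```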
Conclusion
In conclusion, importing simple JSON data is a fundamental step that sets the foundation for leveraging the power of data in modern applications. Once your tools and environment are ready, the process becomes straightforward and accessible even to beginners. The steps are simple, but the impact is profound, as they allow you to transform raw data into actionable insights that drive decision making. Understanding how it works, breaking down the structure and integrating it seamlessly into your system, opens up countless opportunities for more advanced functionality. And the exciting part? There is always more to explore. As you dive deeper into JSON manipulation, you will discover new techniques, optimizations, and connections to more complex systems. For those eager to push their learning further, numerous resources and related methods are available to broaden your understanding of JSON. So, while this is just the beginning, it sets the stage for endless possibilities in your data journey.