9 — Pandas Data Cleaning: Importing Data From Web Pages


In today’s data-driven world, the ability to import data from web pages opens up endless possibilities for businesses, researchers and developers alike. By mastering this skill, you can unlock a vast array of real-time information, helping you make more informed decisions and stay ahead of the competition. Getting ready with the right tools and mindset is key: understanding the steps, knowing how to do it efficiently and grasping how it works behind the scenes are all essential to streamlining the process. But that’s just the beginning. There’s always more to explore, from optimizing your workflow to integrating advanced techniques like APIs and automation. As you continue your journey, remember that data sourcing is a constantly evolving skill, with new methods and best practices emerging regularly. For further reading and resources, check out related topics that can deepen your knowledge and expand your data-handling capabilities.

Table of Contents

  • Importing data from web pages
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Importing Data From Web Pages

We use Beautiful Soup in this recipe to scrape data from a web page and load that data into pandas. Web scraping is very useful when there is data at a website that is updated regularly, but there is no API. We can rerun our code to generate new data whenever the page is updated. Unfortunately, the web scrapers we build can break when the structure of the targeted page changes. That is less likely to happen with APIs because they are designed for data exchange and carefully curated with that end in mind. The priority for most web designers is the quality of the display of information, not the reliability and ease of data exchange. This causes data cleaning challenges unique to web scraping, including HTML elements that house the data being in surprising and changing locations, formatting tags that obfuscate the underlying data, and explanatory text that aids data interpretation being difficult to retrieve. In addition to these challenges, scraping presents familiar data cleaning issues, such as changing data types in columns, less-than-ideal headings, and missing values. We deal with the data issues that occur most frequently in this recipe.

Getting Ready

You will need Beautiful Soup installed to run the code in this recipe. You can install it with pip by entering the following in a terminal or Windows PowerShell:

pip install beautifulsoup4

We will scrape data from a web page, find the following table on that page, and load it into a pandas DataFrame.

Figure 1.1 - COVID-19 data from six countries

I created this web page, http://www.alrb.org/datacleaning/covidcaseoutliers.html, based on COVID-19 data for public use from Our World in Data, available at https://ourworldindata.org/coronavirus-source-data.
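
Before parsing anything, it can be worth confirming that the page is actually reachable. The following is a small optional sketch, not part of the original recipe, using the same URL as in the steps below.

import requests

webpage = requests.get("http://www.alrb.org/datacleaning/covidcaseoutliers.html")
webpage.raise_for_status()   # raises an HTTPError if the page could not be retrieved
print(webpage.status_code)   # 200 indicates the request succeeded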

How To Do It

We scrape the COVID data from the website and do some routine data
checks.

  • Import the pandas, numpy, json, pprint, requests and Beautiful Soup libraries.
import pandas as pd
import numpy as np
import json
import pprint
import requests
from bs4 import BeautifulSoup
  • Parse the web page and get the header row of the table.

Use Beautiful Soup’s find method to get the table we want and then use find_all to retrieve the th elements nested within that table’s header. Create a list of column labels based on the text of those th elements.

webpage = requests.get("http://www.alrb.org/datacleaning/covidcaseoutliers.html")
bs = BeautifulSoup(webpage.text, 'html.parser')
theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
type(theadrows)

Output:

<class 'bs4.element.ResultSet'>
labelcols = [j.get_text() for j in theadrows]
labelcols[0] = "rowheadings"
labelcols

Output:

['rowheadings', 'Cases', 'Deaths', 'Cases per Million', 'Deaths per Million', 'population', 'population_density', 'median_age', 'gdp_per_capita', 'hospital_beds_per_100k']
  • Get the data from the table cells.

Find all of the table rows for the table we want. For each table row, find the th element and retrieve its text. We will use that text for our row labels. Also, for each row, find all the td elements (the table cells holding the data) and save the text from all of them in a list. This gives us datarows, which has all of the numeric data in the table. You can confirm that it matches the table on the web page. We then insert the corresponding row heading from labelrows at the beginning of each list in datarows (a quick consistency check on the two lists follows the output below).

rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')
datarows = []
labelrows = []
for row in rows:
    rowlabels = row.find('th').get_text()
    cells = row.find_all('td', {'class':'data'})
    if (len(rowlabels)>3):
        labelrows.append(rowlabels)
    if (len(cells)>0):
        cellvalues = [j.get_text() for j in cells]
        datarows.append(cellvalues)

pprint.pprint(datarows[0:2])

Output:

[['9,394', '653', '214', '15', '43,851,043', '17', '29', '13,914', '1.9'],
 ['16,642', '668', '1848', '74', '9,006,400', '107', '44', '45,437', '7.4']]
pprint.pprint(labelrows[0:2])

Output:

['Algeria', 'Austria']
for i in range(len(datarows)):
    datarows[i].insert(0, labelrows[i])
pprint.pprint(datarows[0:1])

Output:

[['Algeria','9,394','653','214','15','43,851,043','17','29','13,914','1.9']]
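
The insert loop above assumes that datarows and labelrows line up one for one. A quick check along these lines, which is not part of the original recipe, can catch a mismatch early:

assert len(datarows) == len(labelrows), "row labels and data rows do not line up"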
  • Load the data into pandas.

Pass the datarows list to the pandas DataFrame method. Notice that all data is read into pandas with the object data type, and that some columns have values that cannot be converted into numeric values in their current form because of the commas; a quick diagnostic of this follows the dtypes output below.

totaldeaths = pd.DataFrame(datarows, columns=labelcols)
totaldeaths.iloc[:,1:5].head()

Output:

    Cases  Deaths  Cases per Million  Deaths per Million
0   9,394     653                214                  15
1  16,642     668               1848                  74
2  47,153     650                286                   4
3  58,381    9467               5037                 817
4 514,849   29314               2422                 138
totaldeaths.dtypes

Output:

rowheadings object
Cases object
Deaths object
Cases per Million object
Deaths per Million object
population object
population_density object
median_age object
gdp_per_capita object
hospital_beds_per_100k object
dtype: object
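
Before fixing the columns, it can help to see exactly which values resist a straight numeric conversion. The following quick diagnostic is a sketch rather than part of the recipe; it uses pd.to_numeric with errors='coerce' on the totaldeaths DataFrame built above.

# values that cannot be parsed as numbers become NaN with errors='coerce'
cases_numeric = pd.to_numeric(totaldeaths['Cases'], errors='coerce')
totaldeaths.loc[cases_numeric.isna(), 'Cases'].head()   # values such as '9,394' fail because of the commas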
  • Fix the column names and convert the data to numeric values.

Remove spaces from column names. Remove all non-numeric characters, including the commas, from the columns holding integer data (str.replace("[^0-9]","", regex=True)) and convert those columns to numeric values, leaving the row headings column unchanged.

totaldeaths.columns = totaldeaths.columns.str.replace(" ", "_").str.lower()
for col in totaldeaths.columns[1:-1]:
    totaldeaths[col] = totaldeaths[col].\
        str.replace("[^0-9]","", regex=True).astype('int64')
totaldeaths['hospital_beds_per_100k'] = totaldeaths['hospital_beds_per_100k'].astype('float')
totaldeaths.head()

Output:

  rowheadings   cases  ...  gdp_per_capita
0     Algeria    9394  ...           13914
1     Austria   16642  ...           45437
2  Bangladesh   47153  ...            3524
3     Belgium   58381  ...           42659
4      Brazil  514849  ...           14103
totaldeaths.dtypes

Output:

rowheadings object
cases int64
deaths int64
cases_per_million int64
deaths_per_million int64
population int64
population_density int64
median_age int64
gdp_per_capita int64
hospital_beds_per_100k float64
dtype: object

We have now created a pandas DataFrame from an HTML table.
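
With the DataFrame in place, a few routine checks are worth running. These lines are illustrative rather than part of the original recipe and assume the totaldeaths DataFrame built above.

totaldeaths.shape                             # number of rows and columns
totaldeaths.isnull().sum()                    # missing values per column
totaldeaths[['cases','deaths']].describe()    # sanity-check the magnitudes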

How It Works

Beautiful Soup is a very useful tool for finding specific HTML elements in a web page and retrieving text from them. You can get one HTML element with find and get one or more with find_all. The first argument for both find and find_all is the HTML element to get. The second argument takes a Python dictionary of attributes. You can retrieve text from all of the HTML elements you find with get_text. Some amount of looping is usually necessary to process the elements and text, as in steps 2 and 3. These two statements in step 2 are fairly typical:

theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
labelcols = [j.get_text() for j in theadrows]

The first statement finds all the th elements we want and creates a Beautiful Soup result set called theadrows from the elements it found. The second statement iterates over the theadrows result set, using the get_text method to get the text from each element, and stores it in the labelcols list. Step 3 is a little more involved, but makes use of the same Beautiful Soup methods. We find all of the table rows (tr) in the target table (rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')). We then iterate over each of those rows, finding the th element and getting the text in that element (rowlabels = row.find('th').get_text()). We also find all of the table cells (td) for each row (cells = row.find_all('td', {'class':'data'})) and get the text from all table cells (cellvalues = [j.get_text() for j in cells]). Note that this code is dependent on the class of the td elements being data. Finally, we insert the row labels we get from the th elements at the beginning of each list in datarows.

for i in range(len(datarows)):
    datarows[i].insert(0, labelrows[i])

In step 4, we use the DataFrame method to load the list we created in steps 2 and 3 into pandas. We then do some cleaning similar to what we have done in previous recipes in this section. We use string replace to remove spaces from column names and to remove all non-numeric characters, including commas, from what are otherwise valid numeric values. We convert all columns except for the rowheadings column to numeric.
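
To see these methods in isolation, here is a small self-contained sketch that parses a toy table. The table id tblExample and its contents are made up for illustration and are not part of the recipe.

from bs4 import BeautifulSoup

html = """
<table id="tblExample">
  <thead><tr><th></th><th>Cases</th></tr></thead>
  <tbody>
    <tr><th>Algeria</th><td class="data">9,394</td></tr>
  </tbody>
</table>
"""
bs_toy = BeautifulSoup(html, 'html.parser')
table = bs_toy.find('table', {'id':'tblExample'})                                 # one element, by tag and attribute dictionary
headers = [th.get_text() for th in table.thead.find_all('th')]                    # ['', 'Cases']
cells = [td.get_text() for td in table.tbody.find_all('td', {'class':'data'})]    # ['9,394']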

There’s More

Our scraping code is dependent on several aspects of the web page’s structure not changing: the ID of the main table, the presence of th tags with column and row labels, and the td elements continuing to have their class set to data. The good news is that if the structure of the web page does change, this will likely only affect the find and find_all calls. The rest of the code would not need to change.
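
One way to make that dependency explicit is to fail loudly when an expected element is missing. The following optional sketch is not part of the original recipe; it reuses the webpage response from step 2.

bs = BeautifulSoup(webpage.text, 'html.parser')
table = bs.find('table', {'id':'tblDeaths'})
if table is None:
    raise ValueError("no table with id 'tblDeaths'; the page structure may have changed")
if table.thead is None or table.tbody is None:
    raise ValueError("table 'tblDeaths' is missing its thead or tbody section")

For a well-formed table like this one, pandas.read_html (with its attrs argument to select the table by id) is another option, although it still needs the same cleaning steps afterward and requires an additional parser library such as lxml to be installed.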

Conclusion

In conclusion, importing data from web pages is a game changer for anyone looking to harness the power of online information to fuel analysis, decision making or research. As you get ready to dive into this process, having the right tools, a clear objective and a grasp of how to structure the data is crucial for success. The actual execution may seem technical at first, but by following best practices and leveraging simple methods, you can transform raw web content into actionable insights. The beauty of this approach lies in how it works behind the scenes: automating the retrieval of data, saving time and enabling you to stay ahead in the fast-moving world of data-driven decisions. But that’s just the beginning; there is much more to explore, from advanced data cleaning to automation techniques that can scale your web scraping efforts. As you continue to build your expertise, you will uncover a wealth of possibilities that go beyond simple extraction, making your workflow more efficient and powerful. See also the additional resources on web-based data importation, as they will help you uncover even deeper insights and boost your competitive edge.
