30 — Pandas Data Cleaning: Using Line Plots To Examine Trends In Continuous Variables
Visualizing trends in continuous variables is an essential part of exploring data, and the line plot is one of the most effective tools for the job. Whether you are tracking sales over time, analyzing market fluctuations, or observing changes in customer behavior, a line plot shows clearly how a value changes from one period to the next. This article walks through a recipe in the usual order: what you need to get ready, how to build the plots, how the code works, what else the plots reveal, and where to read more.
Table of Contents
- Using line plots to examine trends in continuous variables
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Using Line Plots To Examine Trends In Continuous Variables
A typical way to visualize values for a continuous variable over regular intervals of time is through a line plot, though sometimes bar charts are used
for small numbers of intervals. We will use line plots in this recipe to display
variable trends and examine sudden deviations in trends and differences in
values over time by groups.
Getting Ready
We will work with daily Covid case data in this recipe. In previous recipes,
we have used totals by country. The daily data provides us with the number
of new cases and new deaths each day by country, in addition to the same
demographic variables we used in other recipes. You will need matplotlib
installed to run the code in this recipe.
How To Do It
We use line plots to visualize trends in daily coronavirus cases and deaths.
We create line plots by region, and stacked plots to get a better sense of how
much one country can drive the number of cases for a whole region.
- Import pandas, matplotlib, and the matplotlib dates and date
formatting utilities.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
coviddaily = pd.read_csv("data/coviddaily720.csv", parse_dates=["casedate"])
- View a couple of rows of the Covid daily data.
coviddaily.sample(2, random_state=1).T
Output:
2478 9526
iso_code BRB FRA
casedate 2020-06-11 2020-02-16
location Barbados France
continent North America Europe
new_cases 4 0
new_deaths 0 0
population 287,371 65,273,512
pop_density 664 123
median_age 40 42
gdp_per_capita 16,978 38,606
hosp_beds 6 6
region Caribbean Western Europe
- Calculate new cases and deaths by day.
Select dates between 2020-02-01 and 2020-07-12, and then use groupby to
summarize cases and deaths across all countries for each day.
coviddailytotals = coviddaily.loc[coviddaily.casedate.between('2020-02-01','2020-07-12')].\
  groupby(['casedate'])[['new_cases','new_deaths']].\
  sum().\
  reset_index()
coviddailytotals.sample(7, random_state=1)
Output:
casedate new_cases new_deaths
44 2020-03-16 12,386 757
47 2020-03-19 20,130 961
94 2020-05-05 77,474 3,998
78 2020-04-19 80,127 6,005
160 2020-07-10 228,608 5,441
11 2020-02-12 2,033 97
117 2020-05-28 102,619 5,168
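The date filter and daily summary in this step can be checked on a tiny made-up frame. The sketch below is not the Covid data: the toy DataFrame, its two locations, and all of its values are hypothetical, chosen only so the sums are easy to verify by hand. It uses parentheses for the method chain instead of backslashes, which is a style choice rather than a change in behavior.

```python
import pandas as pd

# Hypothetical miniature of the daily data: two locations, three days each.
toy = pd.DataFrame({
    "casedate": pd.to_datetime(
        ["2020-01-31", "2020-02-01", "2020-02-02"] * 2),
    "location": ["A"] * 3 + ["B"] * 3,
    "new_cases": [1, 2, 3, 10, 20, 30],
    "new_deaths": [0, 0, 1, 1, 2, 3],
})

# between() includes both endpoints by default, so 2020-01-31 is dropped
# and both February dates are kept.
inrange = toy.loc[toy.casedate.between("2020-02-01", "2020-02-02")]

# One row per day, with cases and deaths summed over all locations.
toytotals = (
    inrange.groupby("casedate")[["new_cases", "new_deaths"]]
    .sum()
    .reset_index()
)
print(toytotals)
```

Because reset_index turns casedate back into an ordinary column, the result can be passed straight to a plotting call as the x values.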
- Show line plots for new cases and new deaths by day. Show cases and deaths on different subplots.
fig = plt.figure()
plt.suptitle("New Covid Cases and Deaths By Day Worldwide in 2020")
ax1 = plt.subplot(2,1,1)
ax1.plot(coviddailytotals.casedate, coviddailytotals.new_cases)
ax1.xaxis.set_major_formatter(DateFormatter("%b"))
ax1.set_xlabel("New Cases")
ax2 = plt.subplot(2,1,2)
ax2.plot(coviddailytotals.casedate, coviddailytotals.new_deaths)
ax2.xaxis.set_major_formatter(DateFormatter("%b"))
ax2.set_xlabel("New Deaths")
plt.tight_layout()
fig.subplots_adjust(top=0.88)
plt.show()
Output: a two-panel figure, with the line plot of new cases above the line plot of new deaths.
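As an alternative layout, the same two-panel figure can be built with plt.subplots, which returns the figure and both axes in one call. This is a sketch under assumptions, not the recipe's own code: the totals frame below is synthetic random data standing in for coviddailytotals, and the labels are placed on the y axis, which is a choice made here for readability.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.dates import DateFormatter

# Synthetic stand-in for coviddailytotals (random values, not real counts).
dates = pd.date_range("2020-02-01", "2020-07-12", freq="D")
rng = np.random.default_rng(0)
totals = pd.DataFrame({
    "casedate": dates,
    "new_cases": rng.integers(0, 1000, len(dates)),
    "new_deaths": rng.integers(0, 50, len(dates)),
})

# Two rows, one column; sharex keeps the date axes aligned.
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
fig.suptitle("New Cases and Deaths By Day (synthetic data)")

ax1.plot(totals.casedate, totals.new_cases)
ax1.set_ylabel("New Cases")

ax2.plot(totals.casedate, totals.new_deaths)
ax2.set_ylabel("New Deaths")
ax2.xaxis.set_major_formatter(DateFormatter("%b"))

fig.tight_layout()
```

With a shared x axis, only the bottom panel shows tick labels, so the month formatter needs to be set once.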
- Calculate new cases and deaths by day and region.
regiontotals = coviddaily.loc[coviddaily.casedate.between('2020-02-01','2020-07-12')].\
  groupby(['casedate','region'])[['new_cases','new_deaths']].\
  sum().\
  reset_index()
regiontotals.sample(7, random_state=1)
Output:
casedate region new_cases new_deaths
1518 2020-05-16 North Africa 634 28
2410 2020-07-11 Central Asia 3,873 26
870 2020-04-05 Western Europe 30,090 4,079
1894 2020-06-08 Western Europe 3,712 180
790 2020-03-31 Western Europe 30,180 2,970
2270 2020-07-02 North Africa 2,006 89
306 2020-02-26 Oceania / Aus 0 0
- Show line plots of new cases by selected regions.
Loop through the regions in showregions. Do a line plot of the total
new_cases by day for each region. Use the gca method to get the x axis and
set the date format.
showregions = ['East Asia','Southern Africa', 'North America','Western Europe']
for j in range(len(showregions)):
  # keep only the rows for the current region
  rt = regiontotals.loc[regiontotals.region==showregions[j],
    ['casedate','new_cases']]
  plt.plot(rt.casedate, rt.new_cases,
    label=showregions[j])
plt.title("New Covid Cases By Day and Region in 2020")
plt.gca().get_xaxis().set_major_formatter(DateFormatter("%b"))
plt.ylabel("New Cases")
plt.legend()
plt.show()
Output: a single figure with one line of daily new cases for each of the four selected regions.
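One possible variation on the loop above is to reshape the long region totals into a wide table, with one column per region, and then plot each column. The sketch below assumes a frame shaped like regiontotals; the toy values and the two-region list are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical long-format data shaped like regiontotals.
toy = pd.DataFrame({
    "casedate": pd.to_datetime(
        ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"]),
    "region": ["East Asia", "Western Europe", "East Asia", "Western Europe"],
    "new_cases": [5, 7, 6, 9],
})
showregions = ["East Asia", "Western Europe"]

# One row per date, one column per region, restricted to the regions of interest.
wide = toy.pivot(index="casedate", columns="region", values="new_cases")[showregions]

fig, ax = plt.subplots()
for region in showregions:
    ax.plot(wide.index, wide[region], label=region)
ax.set_ylabel("New Cases")
ax.legend()
```

The wide table is also convenient for inspecting the numbers behind the lines, since each date's values for all regions sit on one row.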
- Use a stacked plot to examine the uptick in Southern Africa more
closely.
See if one country (South Africa) in Southern Africa is driving the trend line.
Create a DataFrame (af) for new_cases by day for Southern Africa (the region). Add a series for new_cases in South Africa (the country) to the af
DataFrame. Then, create a new series in the af DataFrame for Southern
Africa cases minus South Africa cases (afcasesnosa). Select only data in
April or later, since that is when we start to see an increase in new cases.
af = regiontotals.loc[regiontotals.\
region=='Southern Africa',
['casedate','new_cases']].\
rename(columns={'new_cases':'afcases'})
sa = coviddaily.loc[coviddaily.\
location=='South Africa',
['casedate','new_cases']].\
rename(columns={'new_cases':'sacases'})
af = pd.merge(af, sa, left_on=['casedate'], right_on=['casedate'], how="left")
af['sacases'] = af.sacases.fillna(0)
af['afcasesnosa'] = af.afcases-af.sacases
afabb = af.loc[af.casedate.between('2020-04-01','2020-07-12')]
fig = plt.figure()
ax = plt.subplot()
ax.stackplot(afabb.casedate, afabb.sacases, afabb.afcasesnosa,
  labels=['South Africa','Other Southern Africa'])
ax.xaxis.set_major_formatter(DateFormatter("%m-%d"))
plt.title("New Covid Cases in Southern Africa")
plt.tight_layout()
plt.legend(loc="upper left")
plt.show()
Output: a stacked plot of daily new cases by country (South Africa) and by the rest of the region (Southern Africa).
These steps show how to use line plots to examine trends in a variable over
time and how to display trends for different groups on one figure.
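The merge-and-subtract logic behind the stacked plot can be verified on a few made-up rows. In this sketch the region and country frames and their values are hypothetical; the column names afcases, sacases, and afcasesnosa follow the recipe. The country frame deliberately lacks the first date, to show what the left merge and fillna do with a missing day.

```python
import pandas as pd

# Hypothetical regional totals for three days.
region = pd.DataFrame({
    "casedate": pd.to_datetime(["2020-04-01", "2020-04-02", "2020-04-03"]),
    "afcases": [10, 12, 20],
})

# Hypothetical country series with no row for 2020-04-01.
country = pd.DataFrame({
    "casedate": pd.to_datetime(["2020-04-02", "2020-04-03"]),
    "sacases": [8, 15],
})

# A left merge keeps every regional date; dates missing from the country
# frame get NaN, which is then treated as zero cases.
merged = pd.merge(region, country, on="casedate", how="left")
merged["sacases"] = merged["sacases"].fillna(0)

# The remainder is what the other countries in the region contribute.
merged["afcasesnosa"] = merged["afcases"] - merged["sacases"]
print(merged)
```

Since the two stacked series always add up to the regional total, the top edge of the stacked plot traces the same line as the regional line plot.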
How It Works
We need to do some manipulation of the daily Covid data before we do the
line charts. We use groupby in Step 3 to summarize new cases and deaths
over all countries for each day. We use groupby in Step 5 to summarize
cases and deaths for each region and day.
In Step 4, we set up our first subplot with plt.subplot(2,1,1). That will
give us a figure with two rows and one column. The 1 for the third argument
indicates that this subplot will be the first, or top, subplot. We can pass a
data series for date and for the values for the y axis. So far, this is pretty
much what we have done with the hist, scatterplot, boxplot, and violinplot
methods. But since we are working with dates here, we take advantage of
matplotlib’s utilities for date formatting and indicate that we want only the
month to show, with xaxis.set_major_formatter(DateFormatter("%b")). Since we
are working with subplots, we use set_xlabel rather than xlabel to indicate
the label we want for the x axis.
We show line plots for four selected regions in Step 6. We do this by
calling plot for each region that we want plotted. We could have done it for
all of the regions, but it would have been too difficult to view.
We have to do some additional manipulation in Step 7 to pull the cases for
South Africa (the country) out of the cases for Southern Africa (the region).
Once we do that, we can do a stacked plot with the Southern Africa cases
(minus South Africa) and South Africa. This figure suggests that the increase
in cases in Southern Africa is almost completely driven by increases in
South Africa.
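The date formatting described above can be tried in isolation. The sketch below pairs DateFormatter("%b") with a MonthLocator so that there is one tick per month; the locator is an addition made here for illustration and is not part of the recipe, and the plotted values are placeholders.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove for interactive use
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.dates import DateFormatter

dates = pd.date_range("2020-02-01", "2020-07-12", freq="D")
values = list(range(len(dates)))  # placeholder y values, not real case counts

fig, ax = plt.subplots()
ax.plot(dates, values)

# One major tick at the start of each month, labelled with the abbreviated
# month name.
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(DateFormatter("%b"))

# A formatter can also be called directly on a matplotlib date number,
# which is a quick way to preview a format string.
label = DateFormatter("%b")(mdates.date2num(datetime.datetime(2020, 3, 1)))
print(label)
```

Swapping "%b" for "%m-%d" gives the month-and-day labels used in the stacked plot.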
There’s More
The figure produced in Step 6 reveals a couple of potential data issues. There
are unusual spikes in mid-February in East Asia and in late April in North
America. It is important to examine these anomalies to see if there is a data
collection error.
It is difficult to miss how much the trends differ by region. There are
substantive reasons for this, of course. The different lines reflect what we
know to be reality about different rates of spread by country and region.
However, it is worth exploring any significant change in the direction or
slope of trend lines to make sure that we can confirm that the data is
accurate. We want to be able to explain what happened in Western Europe in
early April and in North America and Southern Africa in early June. One
question is whether the trends reflect changes in the whole region, such as
with the decline in Western Europe in early April, or changes in one or two
large countries in the region (the United States in North America and South
Africa in Southern Africa).
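One simple way to list candidate spikes for follow-up, rather than reading them off a figure, is to compare each day with a rolling median of its neighbors. This is a sketch under assumptions, not a method from the recipe: the series values are made up, and the 5-day window and the threshold of three times the median are arbitrary choices that would need tuning on real data.

```python
import pandas as pd

# Hypothetical daily counts with one obvious one-day spike.
s = pd.Series(
    [100, 110, 105, 900, 115, 120, 118],
    index=pd.date_range("2020-02-10", periods=7, freq="D"),
    name="new_cases",
)

# Centered rolling median over up to 5 days; the median is barely affected
# by a single extreme value, so it works as a local baseline.
baseline = s.rolling(window=5, center=True, min_periods=3).median()

# Flag days that are more than 3 times their local baseline
# (the factor 3 is an assumption, not a standard).
spikes = s[s > 3 * baseline]
print(spikes)
```

A flagged date is only a prompt to look at the source records; it does not by itself say whether the value is a reporting artifact or a real change.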
See Also
We cover groupby in more detail in section 7, Fixing Messy Data When
Aggregating. We go over merging data, as we did in Step 7, in section 8, Addressing Data Issues When Combining Data Frames.
Conclusion
Line plots are a straightforward way to see how a continuous variable changes over time and to compare trends across groups. In this recipe, we summarized daily data with groupby, plotted worldwide and regional trends, and used a stacked plot to see how much one country drives a regional total. The spikes and shifts that these plots reveal are worth investigating before any further analysis, and the sections on aggregating and combining data listed above cover the preparation steps in more detail.