29 — Pandas Data Cleaning: Using Scatter Plots To View Bi-Variate Relationship
Scatter plots are one of the most useful tools for visualizing the relationship between two variables, giving an immediate picture of how one variable changes as the other does. Whether you are analyzing trends, identifying correlations, or spotting outliers, scatter plots help you make sense of complex datasets. This recipe walks through the process step by step: preparing the data, choosing the axes, and interpreting the results. It then moves on to more advanced techniques, such as three-dimensional plots and regression lines, and closes with pointers to related recipes that complement scatter plots.
Table of Contents
- Using scatter plots to view bi-variate relationship
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Using Scatter Plots To View Bi-Variate Relationship
My sense is that there are few plots that data analysts rely on more than
scatter plots, with the possible exception of histograms. We are all very used
to looking at relationships that can be illustrated in two dimensions. Scatter plots capture an important real-world phenomenon, the relationship between
variables, and are quite intuitive for most people. This makes them a valuable
addition to our visualization toolkit.
Getting Ready
You will need matplotlib and seaborn for this recipe. We will be working
with the landtemps dataset, which provides the average temperature in 2019
for 12,095 weather stations across the world.
How To Do It
We level up our scatter plot skills from the previous chapter and visualize
more complicated relationships. We display the relationship between average
temperature, latitude and elevation by showing multiple scatter plots on one
chart, creating 3D scatter plots and showing multiple regression lines.
- Load pandas, NumPy, matplotlib, the Axes3D module, and seaborn.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
landtemps = pd.read_csv("data/landtemps2019avgs.csv")
- Run a scatter plot of latitude (latabs) by average temperature.
plt.scatter(x="latabs", y="avgtemp", data=landtemps)
plt.xlabel("Latitude (N or S)")
plt.ylabel("Average Temperature (Celsius)")
plt.yticks(np.arange(-60, 40, step=20))
plt.title("Latitude and Average Temperature in 2019")
plt.show()
Output:
- Show the high elevation points in red.
Create low and high elevation data frames. Notice that the high elevation
points are generally lower (that is, cooler) on the figure at each latitude.
low, high = landtemps.loc[landtemps.elevation<=1000], landtemps.loc[landtemps.elevation>1000]
plt.scatter(x="latabs", y="avgtemp", c="blue", data=low)
plt.scatter(x="latabs", y="avgtemp", c="red", data=high)
plt.legend(('low elevation', 'high elevation'))
plt.xlabel("Latitude (N or S)")
plt.ylabel("Average Temperature (Celsius)")
plt.title("Latitude and Average Temperature in 2019")
plt.show()
Output:
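As a possible alternative to the two scatter calls above, the colors can be computed per station and passed in a single call. This is only a sketch: the `demo` frame is a made-up stand-in for landtemps (the values are not real station data), and it reuses the recipe's cutoff of 1000 for elevation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for landtemps; the values are invented for illustration.
demo = pd.DataFrame({
    "latabs": [5.0, 20.0, 35.0, 50.0, 65.0],
    "elevation": [2500.0, 150.0, 1200.0, 300.0, 40.0],
    "avgtemp": [14.0, 25.0, 10.0, 7.0, -3.0],
})

# One color per station, using the same elevation cutoff as the recipe.
colors = np.where(demo["elevation"] <= 1000, "blue", "red").tolist()

fig, ax = plt.subplots()
ax.scatter(demo["latabs"], demo["avgtemp"], c=colors)
ax.set_xlabel("Latitude (N or S)")
ax.set_ylabel("Average Temperature (Celsius)")
```

One trade-off: a single scatter call does not give a separate legend entry per group, so the two-call version in the recipe is simpler when a legend is needed.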
- View a three-dimensional plot of temperature, latitude and elevation.
fig = plt.figure()
plt.suptitle("Latitude, Temperature, and Elevation in 2019")
# Create the 3D axes first; ax must exist before we can set its title.
ax = plt.axes(projection='3d')
ax.set_title('Three D')
ax.set_xlabel("Elevation")
ax.set_ylabel("Latitude")
ax.set_zlabel("Avg Temp")
ax.scatter3D(low.elevation, low.latabs, low.avgtemp, label="low elevation", c="blue")
ax.scatter3D(high.elevation, high.latabs, high.avgtemp, label="high elevation", c="red")
ax.legend()
plt.show()
Output:
- Show a regression line of average temperature on latitude. Use regplot to get a regression line.
sns.regplot(x="latabs", y="avgtemp", color="blue", data=landtemps)
plt.title("Latitude and Average Temperature in 2019")
plt.xlabel("Latitude (N or S)")
plt.ylabel("Average Temperature")
plt.show()
Output: scatter plot with a regression line
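To see the slope and intercept of a least-squares line as numbers rather than only as a drawn line, one option is NumPy's `polyfit`. The sketch below uses a made-up `demo` frame (not the real landtemps data) whose points lie exactly on a line, so the fitted coefficients are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for landtemps; the points lie exactly on avgtemp = 30 - 0.5 * latabs.
demo = pd.DataFrame({
    "latabs": [0.0, 10.0, 20.0, 30.0, 40.0],
    "avgtemp": [30.0, 25.0, 20.0, 15.0, 10.0],
})

# A degree-1 polyfit returns the least-squares coefficients, highest power first.
slope, intercept = np.polyfit(demo["latabs"], demo["avgtemp"], deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On the real data, the same call with `landtemps.latabs` and `landtemps.avgtemp` should describe a line like the one regplot draws, assuming regplot is left at its default linear fit.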
- Show separate regression lines for low and high elevation stations.
We use lmplot this time instead of regplot. The two methods have similar
functionality. Unsurprisingly, high elevation stations appear to have both
lower intercepts (where the line crosses the y axis) and steeper negative
slopes.
landtemps['elevation_group'] = np.where(landtemps.elevation<=1000,'low','high')
sns.lmplot(x="latabs", y="avgtemp", hue="elevation_group", palette=dict(low="blue", high="red"), legend_out=False, data=landtemps)
plt.xlabel("Latitude (N or S)")
plt.ylabel("Average Temperature")
plt.legend(('low elevation', 'high elevation'), loc='lower left')
plt.yticks(np.arange(-60, 40, step=20))
plt.title("Latitude and Average Temperature in 2019")
plt.tight_layout()
plt.show()
Output: scatter plot with separate regression lines for low and high elevation
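The comparison of intercepts and slopes can also be checked numerically by fitting one line per elevation group. This is a sketch under assumptions: `demo` is an invented stand-in for landtemps, constructed so that the high elevation group has the lower intercept and the steeper negative slope, mirroring the pattern described above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for landtemps; the values are invented for illustration.
demo = pd.DataFrame({
    "latabs":    [0.0, 20.0, 40.0, 0.0, 20.0, 40.0],
    "elevation": [100.0, 200.0, 300.0, 1500.0, 2000.0, 2500.0],
    "avgtemp":   [28.0, 20.0, 12.0, 20.0, 8.0, -4.0],
})
demo["elevation_group"] = np.where(demo["elevation"] <= 1000, "low", "high")

# Fit a separate least-squares line for each group.
fits = {}
for group, rows in demo.groupby("elevation_group"):
    slope, intercept = np.polyfit(rows["latabs"], rows["avgtemp"], deg=1)
    fits[group] = (slope, intercept)
    print(f"{group}: slope={slope:.2f}, intercept={intercept:.2f}")
```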
- Show some stations above the low and high elevation regression lines.
high.loc[(high.latabs>38) & (high.avgtemp>=18),['station','country','latabs','elevation','avgtemp']]
Output:
            station        country  latabs  elevation  avgtemp
3943       LAJES_AB       Portugal      39      1,016       18
5805  WILD_HORSE_6N  United States      39      1,439       23
low.loc[(low.latabs>47) & (low.avgtemp>=14),['station','country','latabs','elevation','avgtemp']]
Output:
                 station        country  latabs  elevation  avgtemp
1048      SAANICHTON_CDA         Canada      49         61       18
1146     CLOVERDALE_EAST         Canada      49         50       15
6830  WINNIBIGOSHISH_DAM  United States      47        401       18
7125            WINIFRED  United States      48        988       16
- Show some stations below the low and high elevation regression lines.
high.loc[(high.latabs<5) & (high.avgtemp<18),['station','country','latabs','elevation','avgtemp']]
Output:
              station   country  latabs  elevation  avgtemp
2250  BOGOTA_ELDORADO  Colombia       5      2,548       15
2272         SAN_LUIS  Colombia       1      2,976       11
2303         IZOBAMBA   Ecuador       0      3,058       13
2306            CANAR   Ecuador       3      3,083       13
2307  LOJA_LA_ARGELIA   Ecuador       4      2,160       17
low.loc[(low.latabs<50) & \
(low.avgtemp<-9), ['station','country','latabs', 'elevation','avgtemp']]
Output:
                  station        country  latabs  elevation  avgtemp
1189  FT_STEELE_DANDY_CRK         Canada      50        856      -12
1547               BALDUR         Canada      49        450      -11
1833       POINTE_CLAVEAU         Canada      48          4      -11
1862     CHUTE_DES_PASSES         Canada      50        398      -13
6544         PRESQUE_ISLE  United States      47        183      -10
Scatter plots are a great way to view the relationship between two variables.
These steps also show how we can display that relationship for different
subsets of our data.
How It Works
We can run a scatter plot by just providing column names for x and y and a DataFrame. Nothing more is required. We get the same access to the attributes of the figure and its axes that we get when we run histograms and boxplots: titles, axis labels, tick marks and labels, and so on. Note that to set attributes such as labels on an axes object rather than through pyplot, we use set_xlabel or set_ylabel, not xlabel or ylabel.
3D plots are a little more complicated. First, we need to have imported Axes3D. Then, we set the projection of our axes to 3d, plt.axes(projection='3d'), as we do in Step 4. We can then use the scatter3D method for each group of stations (low and high elevation).
Since scatter plots are designed to illustrate the relationship between a regressor (the x variable) and a dependent variable, it is quite helpful to see a least-squares regression line on the scatter plot. Seaborn provides two methods for doing that, regplot and lmplot. I typically use regplot, since it is less resource intensive. But sometimes I need the features of lmplot. We use lmplot and its hue attribute in Step 6 to generate separate regression lines for each elevation level.
In Steps 7 and 8, we view some of the outliers, those stations with temperatures much higher or lower than the regression line for their group. We would want to investigate the data for the LAJES_AB station in Portugal and the WILD_HORSE_6N station in the United States ((high.latabs>38) & (high.avgtemp>=18)). Their average temperatures are higher than would be predicted at that latitude and elevation level. Similarly, there are four stations in Canada and one in the United States that are at low elevation and have lower average temperatures than would be expected ((low.latabs<50) & (low.avgtemp<-9)).
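The difference between the pyplot functions and the axes methods is easiest to see side by side with the recipe's plt.xlabel calls. Here is a minimal sketch of the axes-based version, again using an invented `demo` frame rather than the real landtemps data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for landtemps; the values are invented for illustration.
demo = pd.DataFrame({"latabs": [10.0, 30.0, 50.0], "avgtemp": [26.0, 17.0, 5.0]})

# plt.subplots returns the figure and an axes object we can configure directly.
fig, ax = plt.subplots()
ax.scatter("latabs", "avgtemp", data=demo)

# On an axes object the setters are set_xlabel, set_ylabel, and set_title.
ax.set_xlabel("Latitude (N or S)")
ax.set_ylabel("Average Temperature (Celsius)")
ax.set_title("Latitude and Average Temperature")
```

Working with the axes object tends to pay off once a figure holds more than one plot, since each axes then carries its own labels and title.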
There’s More
We see the expected relationship between latitude and average temperatures. Temperatures fall as latitude increases. But elevation is another important factor. Being able to visualize all three variables at once helps us identify outliers more easily. Of course, there are additional factors that matter for temperatures, such as warm ocean currents. That data is not in this dataset, unfortunately.
Scatter plots are great for visualizing the relationship between two continuous variables. With some tweaking, matplotlib’s and seaborn’s scatter plot tools can also provide some sense of relationships between three variables: by adding a third dimension, by creative use of colors (when the third dimension is categorical), or by changing the size of the dots. The Using linear regression to identify data points with high influence recipe in section 4, Identifying Missing Values and Outliers in Subsets of Data, provides an example of that.
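Changing the size of the dots can be sketched with the `s` argument of matplotlib's scatter, which sets marker area. The `demo` frame below is invented, and the 20 to 200 size range is an arbitrary choice for illustration, not something taken from the recipe.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for landtemps; the values are invented for illustration.
demo = pd.DataFrame({
    "latabs": [5.0, 20.0, 35.0, 50.0, 65.0],
    "elevation": [2500.0, 150.0, 1200.0, 300.0, 40.0],
    "avgtemp": [14.0, 25.0, 10.0, 7.0, -3.0],
})

# Rescale elevation to marker areas between 20 and 200 (an arbitrary range).
elev = demo["elevation"]
sizes = 20 + 180 * (elev - elev.min()) / (elev.max() - elev.min())

fig, ax = plt.subplots()
ax.scatter(demo["latabs"], demo["avgtemp"], s=sizes, alpha=0.6)
ax.set_xlabel("Latitude (N or S)")
ax.set_ylabel("Average Temperature (Celsius)")
```

With thousands of stations, large markers overlap heavily, so a lower alpha or a smaller size range may be needed.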
See Also
This is a section on visualization and on identifying unexpected values through
visualizations. But these figures also scream out for the kind of multivariate
analyses we did in section 4, Identifying Missing Values and Outliers in
Subsets of Data. In particular, linear regression analysis, and a close look at
the residuals, would be useful for identifying outliers.
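As a rough illustration of looking at residuals, the sketch below fits a line, computes each station's distance from it, and flags large residuals. Everything here is an assumption made for the example: the `demo` data and station names are invented, and the cutoff of two standard deviations is an arbitrary choice rather than the method used in section 4.

```python
import numpy as np
import pandas as pd

# Hypothetical data: seven stations on avgtemp = 30 - 0.5 * latabs, and S4 is 12 degrees warmer.
demo = pd.DataFrame({
    "station": ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"],
    "latabs": [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
    "avgtemp": [30.0, 25.0, 20.0, 27.0, 10.0, 5.0, 0.0, -5.0],
})

# Fit the least-squares line and compute each station's residual.
slope, intercept = np.polyfit(demo["latabs"], demo["avgtemp"], deg=1)
demo["predicted"] = intercept + slope * demo["latabs"]
demo["residual"] = demo["avgtemp"] - demo["predicted"]

# Flag stations whose residual exceeds two standard deviations (arbitrary cutoff).
threshold = 2 * demo["residual"].std()
outliers = demo.loc[demo["residual"].abs() > threshold,
                    ["station", "latabs", "avgtemp", "residual"]]
print(outliers)
```

Compared with the hand-picked latitude and temperature cutoffs in Steps 7 and 8, this kind of check applies one rule along the whole line, though a single extreme point can also pull the fitted line toward itself.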
Conclusion
In conclusion, visualizing bi-variate relationships with scatter plots is a crucial step in uncovering patterns and insights within data. Getting ready means preparing the data so that it is clean and relevant for analysis. Creating the plots themselves is simple, and the results are easy to interpret: each point represents a pair of values, which makes trends, clusters, and outliers immediately visible. But scatter plots do more than visualize data; they form the foundation for deeper statistical analysis and more advanced techniques such as regression. The related recipes mentioned above cover methods that complement scatter plots and broaden your perspective on data analysis and visualization.