31 — Pandas Data Cleaning: Generating a Heat Map Based On a Correlation Matrix
Imagine transforming raw, complex data into a vivid, easily interpretable visual that reveals hidden relationships and patterns at a glance. Generating a heat map based on a correlation matrix does just that, turning numerical correlations into a colorful narrative that makes data analysis both intuitive and insightful. Before diving into the heart of data analysis, it’s crucial to lay a solid foundation. Getting ready involves setting up your tools, understanding your dataset and preparing for a journey where every detail matters, ensuring that you are equipped to tackle the challenges and opportunities ahead. Unlocking the full potential of your data requires a precise and methodical approach. Discover the step-by-step process of how to do it, turning theoretical concepts into practical actions that will guide you from raw data to meaningful insights. Ever wondered how seemingly complex data transformations happen seamlessly? Delve into how it works and uncover the sophisticated mechanisms behind the scenes that turn your data into actionable intelligence. The journey doesn’t end with just one technique or tool. There’s more to explore and discover with additional strategies and insights that can further enhance your data analysis and visualization skills, opening up new possibilities and deeper understanding. The world of data analysis is vast and interconnected. See also a curated selection of resources and related topics that will expand your knowledge and enhance your expertise, guiding you to even greater discoveries and innovations.
Table of Contents
- Generating a heat map based on a correlation matrix
- Getting ready
- How to do it
- How it works
- There’s more
- See Also
Generating a Heat Map Based On a Correlation Matrix
The correlation between two variables is a measure of how much they move
together. A correlation of 1 means that the two variables are perfectly
positively correlated. As one variable increases in size, so does the other. A
value of -1 means that they are perfectly negatively correlated. As one
variable increases in size, the other decreases. Correlations of 1 or -1 only
rarely happen, but correlations above 0.5 or below -0.5 might still be
meaningful. There are several correlation measures (Pearson, Spearman, and
Kendall), each with tests that can tell us whether the relationship is
statistically significant. Since this is a section on visualizations, we will
focus on viewing important correlations.
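As a small illustration of what these values look like, the sketch below computes correlations on a toy DataFrame with the pandas corr method. The column names and numbers are invented for illustration and are not taken from the Covid data.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],        # rises in lockstep with x
    "down": [10, 8, 6, 4, 2],      # falls in lockstep with x
    "curved": [1, 4, 9, 16, 25],   # always rises with x, but not in a straight line
})

# Pearson measures linear association; Spearman works on ranks.
# method="kendall" is also accepted by DataFrame.corr.
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

Here x and up have a Pearson correlation of 1, and x and down have -1. The curved column is slightly below 1 under Pearson because the relationship is not linear, but exactly 1 under Spearman because the ranks agree perfectly.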
Getting Ready
You will need Matplotlib and Seaborn installed to run the code in this recipe.
Both can be installed with pip, the Python package manager.
pip install matplotlib
pip install seaborn
How to Do It
We first show part of a correlation matrix of the Covid data and scatter plots
of some key relationships. We then show a heat map of the correlation matrix
to visualize the correlations between all variables.
- Import matplotlib and seaborn and load the Covid totals data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])
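- Generate a correlation matrix.
View part of the matrix.
# numeric_only=True skips text and date columns; recent pandas versions raise an error without it
corr = covidtotals.corr(numeric_only=True)
corr[['total_cases','total_deaths', 'total_cases_pm','total_deaths_pm']]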
Output:
                 total_cases  total_deaths  total_cases_pm  total_deaths_pm
total_cases             1.00          0.93            0.23             0.26
total_deaths            0.93          1.00            0.20             0.41
total_cases_pm          0.23          0.20            1.00             0.49
total_deaths_pm         0.26          0.41            0.49             1.00
population              0.34          0.28           -0.04            -0.00
pop_density            -0.03         -0.03            0.08             0.02
median_age              0.12          0.17            0.22             0.38
gdp_per_capita          0.13          0.16            0.58             0.37
hosp_beds              -0.01         -0.01            0.02             0.09
- Show scatter plots of median age and gross domestic product (GDP)
per capita by cases per million.
Indicate that we want the subplots to share y axis values with sharey=True.
fig, axes = plt.subplots(1,2, sharey=True)
# Pass x and y as keywords; recent Seaborn versions do not accept them positionally
sns.regplot(x=covidtotals.median_age, y=covidtotals.total_cases_pm, ax=axes[0])
sns.regplot(x=covidtotals.gdp_per_capita, y=covidtotals.total_cases_pm, ax=axes[1])
axes[0].set_xlabel("Median Age")
axes[0].set_ylabel("Cases Per Million")
axes[1].set_xlabel("GDP Per Capita")
axes[1].set_ylabel("")
plt.suptitle("Scatter Plots of Age and GDP with Cases Per Million")
plt.tight_layout()
fig.subplots_adjust(top=0.92)
plt.show()
Output:
[Figure: scatter plots of median age and GDP per capita by cases per million, side by side]
- Generate a heat map of the correlation matrix.
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap="coolwarm")
plt.title('Heat Map of Correlation Matrix')
plt.tight_layout()
plt.show()
Output:
[Figure: heat map of the correlation matrix, with the strongest correlations in red and peach]
Heat maps are a great way to visualize how all key variables in our
DataFrame are correlated with one another.
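A correlation matrix is symmetric, so half of the heat map repeats the other half. The following is a minimal sketch of one common variation, not part of the recipe itself: it hides the upper triangle with the mask argument, prints each coefficient with annot=True, and pins the color scale to the range -1 to 1. It uses random stand-in data (the demo and corr_demo names are made up here); with the recipe's data you would pass corr instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data; in the recipe this would be the covidtotals DataFrame.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
demo["b"] = demo["a"] * 0.8 + demo["b"] * 0.2  # make a and b strongly correlated

corr_demo = demo.corr()

# True cells are hidden. k=1 starts above the diagonal, so the diagonal stays visible.
mask = np.triu(np.ones_like(corr_demo, dtype=bool), k=1)

fig, ax = plt.subplots()
sns.heatmap(corr_demo, mask=mask, annot=True, fmt=".2f",
            vmin=-1, vmax=1, center=0, cmap="coolwarm", ax=ax)
ax.set_title("Heat Map of Correlation Matrix (lower triangle)")
fig.tight_layout()
fig.savefig("corr_heatmap_demo.png")  # in an interactive session, plt.show() works instead
```

Fixing vmin and vmax keeps the colors comparable across different datasets, since the same shade always means the same coefficient.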
How It Works
The corr method of a DataFrame generates correlation coefficients of all
numeric variables by all other numeric variables. We display part of that
matrix in Step 2. In Step 3, we do scatter plots of median age by cases per
million and GDP per capita by cases per million. These plots give a sense of
what it looks like when the correlation is 0.22 (median age and cases per
million) and when it is 0.58 (GDP per capita and cases per million). There is
not much of a relationship between median age and cases per million. There is
more of a relationship between GDP per capita and cases per million. The
heat map provides a visualization of the correlation matrix we created in Step
2. All of the red squares are correlations of 1.0, which is the correlation of
each variable with itself. The slightly lighter red squares are between
total_cases and total_deaths (0.93). The peach squares (those with
correlations between 0.55 and 0.65) are also interesting. GDP per capita,
median age, and hospital beds per 1,000 people are positively correlated with
each other, and GDP per capita is positively correlated with cases per million.
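Reading correlation bands off cell colors can be imprecise, so it sometimes helps to list the strongest pairs directly. The following is a minimal sketch of one way to do that. The strongest_pairs function is a hypothetical helper, not part of pandas or of this recipe, and the small matrix reuses the coefficients for four variables from the output shown in Step 2.

```python
import numpy as np
import pandas as pd

def strongest_pairs(corr, threshold=0.5):
    """Return variable pairs with |correlation| >= threshold, strongest first."""
    # Keep only cells above the diagonal so each pair appears once
    # and self-correlations (always 1.0) are excluded.
    keep = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(keep).stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Coefficients copied from the correlation matrix output in Step 2.
names = ["total_cases", "total_deaths", "total_cases_pm", "gdp_per_capita"]
corr_subset = pd.DataFrame(
    [[1.00, 0.93, 0.23, 0.13],
     [0.93, 1.00, 0.20, 0.16],
     [0.23, 0.20, 1.00, 0.58],
     [0.13, 0.16, 0.58, 1.00]],
    index=names, columns=names,
)

top = strongest_pairs(corr_subset, threshold=0.5)
print(top)
```

With a threshold of 0.5, only two pairs remain: total_cases with total_deaths (0.93) and total_cases_pm with gdp_per_capita (0.58). Sorting by absolute value means strong negative correlations would rank alongside strong positive ones.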
There’s More
I find it helpful to always have a correlation matrix or heat map close by
when I am doing exploratory analysis or statistical modeling. I understand the
data much better when I am able to keep these bivariate relationships in mind.
See Also
We go over tools for examining the relationship between two variables in
more detail in the Identifying outliers and unexpected values in bivariate
relationships recipe in section 4, Identifying Issues in Subsets of Data.
Conclusion
In the dynamic realm of data analysis, generating a heat map based on a correlation matrix is a powerful tool that can transform complex datasets into clear actionable insights. By visualizing the relationships between variables, a heat map provides an intuitive way to understand patterns and correlations that may not be immediately apparent from raw data alone. As you prepare to harness this technique, it’s crucial to grasp not only the "how-to" but also the underlying mechanics that drive its effectiveness. Implementing this method involves a blend of technical skill and analytical acumen, ensuring that the resulting visualizations are both informative and accurate. But remember, the world of data visualization is vast and continually evolving. There are always new techniques and tools to explore that can enhance your analytical capabilities even further. For those eager to dive deeper, additional resources and advanced methodologies are available to expand your understanding and application of these concepts. Embrace the journey of learning and exploration, as the potential to unlock new insights and drive informed decision making is boundless.