31 — Pandas Data Cleaning: Generating a Heat Map Based On a Correlation Matrix

5 min read · Sep 15, 2024

A heat map of a correlation matrix turns a table of correlation coefficients into a color-coded grid, so relationships between variables can be spotted at a glance. This recipe covers what you need to install, the step-by-step code, an explanation of how it works, and a few closing notes on keeping a correlation matrix close by during exploratory analysis.

Table of Contents

Generating a heat map based on a correlation matrix
Getting ready
How to do it
How it works
There’s more
See Also

Generating a Heat Map Based On a Correlation Matrix

The correlation between two variables is a measure of how much they move
together. A correlation of 1 means that the two variables are perfectly
positively correlated: as one variable increases in size, so does the other. A
value of -1 means that they are perfectly negatively correlated: as one
variable increases in size, the other decreases. Correlations of 1 or -1 only
rarely happen, but correlations above 0.5 or below -0.5 might still be
meaningful. There are several correlation measures (Pearson, Spearman, and
Kendall), each with an associated test of whether the relationship is
statistically significant. Since this is a section on visualizations, we will
focus on viewing important correlations.
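
As a minimal sketch of how these measures can differ, the snippet below computes Pearson and Spearman coefficients with pandas on a small made-up dataset. The column names x and y_cubed are illustrative and are not part of the Covid data. Because y_cubed is a monotonic but nonlinear function of x, the rank-based Spearman coefficient is 1 while Pearson falls short of 1.

```python
import pandas as pd

# Made-up data: y_cubed rises whenever x rises, but not along a straight line.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
df["y_cubed"] = df["x"] ** 3

# DataFrame.corr takes a method argument; "pearson" is the default.
results = {
    method: df.corr(method=method).loc["x", "y_cubed"]
    for method in ("pearson", "spearman")
}
for method, value in results.items():
    print(f"{method}: {value:.3f}")
```

The same call also accepts method="kendall" for Kendall's tau.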

Getting Ready

You will need Matplotlib and Seaborn installed to run the code in this recipe.
Both can be installed with pip, the Python package manager.

pip install matplotlib
pip install seaborn

How To Do It

We first show part of a correlation matrix of the Covid data and scatter plots
of some key relationships. We then show a heat map of the correlation matrix
to visualize the correlations between all variables.

Import matplotlib and seaborn and load the Covid totals data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])

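Generate a correlation matrix.

Passing numeric_only=True restricts the calculation to numeric columns. Since
pandas 2.0 this is no longer the default, so any text columns in the data would
otherwise raise an error. View part of the matrix.

corr = covidtotals.corr(numeric_only=True)
corr[['total_cases','total_deaths', 'total_cases_pm','total_deaths_pm']]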

Output:

                 total_cases  total_deaths  total_cases_pm  total_deaths_pm
total_cases             1.00          0.93            0.23             0.26
total_deaths            0.93          1.00            0.20             0.41
total_cases_pm          0.23          0.20            1.00             0.49
total_deaths_pm         0.26          0.41            0.49             1.00
population              0.34          0.28           -0.04            -0.00
pop_density            -0.03         -0.03            0.08             0.02
median_age              0.12          0.17            0.22             0.38
gdp_per_capita          0.13          0.16            0.58             0.37
hosp_beds              -0.01         -0.01            0.02             0.09
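
When a matrix has many variables, scanning it by eye gets tedious. One possible approach, sketched below on invented data, keeps only the upper triangle of the matrix so that each pair appears once, then sorts the pairs by absolute correlation. The frame demo and its columns a, b, and c are hypothetical stand-ins for covidtotals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # almost a copy of a
    "c": rng.normal(size=200),                 # unrelated noise
})
corr = demo.corr()

# k=1 excludes the diagonal, so self-correlations of 1.0 are left out.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Blank out everything but the upper triangle, reshape to one row per pair,
# and drop the blanked (NaN) cells.
pairs = corr.where(upper).stack().dropna()

# Strongest relationships first, regardless of sign.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

The pair (a, b) lands at the top because b was built as a noisy copy of a.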

Show scatter plots of median age and gross domestic product (GDP)
per capita by cases per million.

Indicate that we want the subplots to share y axis values with sharey=True.

fig, axes = plt.subplots(1,2, sharey=True)
# Recent seaborn releases require x and y to be passed as keyword arguments.
sns.regplot(x=covidtotals.median_age, y=covidtotals.total_cases_pm, ax=axes[0])
sns.regplot(x=covidtotals.gdp_per_capita, y=covidtotals.total_cases_pm, ax=axes[1])
axes[0].set_xlabel("Median Age")
axes[0].set_ylabel("Cases Per Million")
axes[1].set_xlabel("GDP Per Capita")
axes[1].set_ylabel("")
plt.suptitle("Scatter Plots of Age and GDP with Cases Per Million")
plt.tight_layout()
fig.subplots_adjust(top=0.92)
plt.show()

Output:

Scatter plots of median age and GDP by cases per million, side by side
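
A regression line on a scatter plot and a correlation coefficient summarize the same linear relationship, so it can help to print the coefficients next to the plots. The sketch below uses DataFrame.corrwith on a small made-up frame that borrows the Covid column names. The values are invented for illustration and are not the recipe's data.

```python
import pandas as pd

# Invented values; only the column names mirror the Covid totals data.
demo = pd.DataFrame({
    "median_age": [19.0, 28.0, 31.0, 38.0, 42.0, 47.0],
    "gdp_per_capita": [1500.0, 9000.0, 4000.0, 30000.0, 45000.0, 38000.0],
    "total_cases_pm": [40.0, 300.0, 90.0, 900.0, 1500.0, 1100.0],
})

# Correlate each predictor with cases per million in one call.
by_cases = demo[["median_age", "gdp_per_capita"]].corrwith(demo["total_cases_pm"])
print(by_cases.round(2))
```

With the real data, the same call on covidtotals should reproduce the median_age and gdp_per_capita entries in the total_cases_pm column of the matrix.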

Generate a heat map of the correlation matrix.

sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap="coolwarm")
plt.title('Heat Map of Correlation Matrix')
plt.tight_layout()
plt.show()

Output:

Heat map of Covid data, with strongest correlations in red and peach

Heat maps are a great way to visualize how all key variables in our
DataFrame are correlated with one another.
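
A few optional seaborn arguments can make a correlation heat map easier to read. The sketch below is one possible variation, not the recipe's own code. It pins the color scale to the full -1 to 1 range so that colors stay comparable across datasets, prints the coefficient in each cell, and hides the redundant upper triangle. The function name plot_corr_heatmap and the demo data are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def plot_corr_heatmap(corr: pd.DataFrame):
    """Draw a lower-triangle heat map with annotated coefficients."""
    # Cells where the mask is True are hidden; k=1 keeps the diagonal visible.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
                vmin=-1, vmax=1, center=0, cmap="coolwarm", ax=ax)
    ax.set_title("Heat Map of Correlation Matrix")
    fig.tight_layout()
    return ax

# Hypothetical data: four unrelated random columns.
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
ax = plot_corr_heatmap(demo.corr())
```

Call plt.show() afterwards to display the figure. With the recipe's data, the function would be called as plot_corr_heatmap(corr).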

How It Works

The corr method of a DataFrame generates correlation coefficients of all
numeric variables by all other numeric variables. We display part of that
matrix in Step 2. In Step 3, we do scatter plots of median age by cases per
million and GDP per capita by cases per million. These plots give a sense of
what it looks like when the correlation is 0.22 (median age and cases per
million) and when it is 0.58 (GDP per capita and cases per million). There is
not much of a relationship between median age and cases per million. There is
more of a relationship between GDP per capita and cases per million.

The heat map provides a visualization of the correlation matrix we created in
Step 2. The darkest red squares are correlations of 1.0, which is the
correlation of each variable with itself. The slightly lighter red squares are
between total_cases and total_deaths (0.93). The peach squares (those with
correlations between 0.55 and 0.65) are also interesting. GDP per capita,
median age, and hospital beds per 1,000 people are positively correlated with
each other, and GDP per capita is positively correlated with cases per million.
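
To make the mechanics concrete, here is a rough sketch of the default (Pearson) calculation for a single pair of columns, written out with NumPy and compared against DataFrame.corr. The data is invented, and the helper name pearson_r is ours, not a pandas function.

```python
import numpy as np
import pandas as pd

def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Sum of cross-deviations divided by the root of the summed squared deviations."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Invented data for illustration.
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
})

manual = pearson_r(demo["x"], demo["y"])
from_pandas = demo.corr().loc["x", "y"]
print(round(manual, 4), round(from_pandas, 4))
```

Both values come out to 0.8 here. The corr method repeats this calculation for every pair of numeric columns, which is why the result is a square, symmetric matrix.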

There’s More

I find it helpful to always have a correlation matrix or heat map close by
when I am doing exploratory analysis or statistical modeling. I understand the
data much better when I am able to keep these bivariate relationships in mind.

Conclusion

Generating a heat map from a correlation matrix is a quick way to turn a complex dataset into something readable. By visualizing the relationships between variables, a heat map surfaces patterns and correlations that may not be apparent from the raw numbers alone. Getting value from it means understanding not only how to produce the plot but also what the underlying coefficients measure, so that the visualization is read accurately. From here, other visualization techniques and tools can extend the same exploratory approach.

Written by A.I Hub
