24 — Pandas Data Cleaning: Using Isolation Forest To Find Anomalies


Anomaly detection has become an indispensable tool for businesses, especially in a data-driven world where detecting outliers can be the difference between success and failure. Enter Isolation Forest, a powerful and efficient machine learning technique for finding anomalies. Whether you are monitoring fraud in financial transactions, identifying network intrusions, or detecting rare diseases, Isolation Forest offers a distinctive approach to isolating anomalies from normal patterns. In this guide, you will discover how to leverage this technique to safeguard your business, streamline operations, and elevate your data analysis strategy. From understanding how the algorithm works to preparing your datasets for optimal performance, this article walks you through the essential steps to mastering anomaly detection with Isolation Forest. By the end, you will not only know how to implement Isolation Forest but also see how it fits into the broader landscape of machine learning techniques for anomaly detection, along with resources and tools to take your efforts further.

Table of Contents

  • Using isolation forest to find anomalies
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Using Isolation Forest To Find Anomalies

Isolation Forest is a relatively new machine learning technique for identifying anomalies. It has quickly become popular, partly because its algorithm is optimized to find anomalies rather than normal values. It finds outliers by successively partitioning the data until a data point has been isolated; points that require fewer partitions to be isolated receive higher anomaly scores. This process turns out to be fairly easy on system resources. In this recipe, we demonstrate how to use it to detect outlier COVID-19 cases and deaths.
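To see that idea in miniature before turning to the COVID data, here is a small sketch on made-up data (the two-dimensional cluster and the injected point at [8, 8] are invented for illustration). The extreme point can be isolated with very few partitions, so it is the one the model flags:

import numpy as np
from sklearn.ensemble import IsolationForest

# hypothetical data: 100 points clustered near the origin plus one extreme point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8, 8]]])

# fit_predict returns 1 for inliers and -1 for outliers
labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)
print(np.where(labels == -1)[0])  # most likely [100], the injected extreme point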

Getting Ready

You will need scikit-learn and Matplotlib to run the code in this recipe. You can install them with pip from a terminal (or PowerShell on Windows):

pip install matplotlib
pip install scikit-learn
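Note that the package installs as scikit-learn but imports as sklearn. As a quick sanity check that both libraries are available, you can print their versions from Python:

import sklearn
import matplotlib
print(sklearn.__version__)
print(matplotlib.__version__)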

How To Do It

We will use Isolation Forest to find the countries whose attributes indicate
that they are most anomalous.

  • Load pandas, Matplotlib, and the StandardScaler and IsolationForest modules from scikit-learn, then load the COVID data.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from mpl_toolkits.mplot3d import Axes3D
covidtotals = pd.read_csv("data/covidtotals.csv")
covidtotals.set_index("iso_code", inplace=True)
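Before building the analysis DataFrame, it can help to glance at what was loaded; for example:

covidtotals.shape     # number of countries and columns
covidtotals.head()    # first five rows, indexed by iso_code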
  • Create a standardized analysis data frame.

First, check for missing values; then remove all rows with missing data.

analysisvars = ['location','total_cases_pm','total_deaths_pm', 'pop_density','median_age','gdp_per_capita']
standardizer = StandardScaler()
covidtotals.isnull().sum()

Output:

lastdate           0
location           0
total_cases        0
total_deaths       0
total_cases_pm     0
total_deaths_pm    0
population         0
pop_density       11
median_age        24
gdp_per_capita    27
hosp_beds         45
region             0
dtype: int64
covidanalysis = covidtotals.loc[:, analysisvars].dropna()
covidanalysisstand = standardizer.fit_transform(covidanalysis.iloc[:, 1:])
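If you want to confirm what StandardScaler did, each column of the standardized array should now have a mean near 0 and a standard deviation near 1 (a quick check using NumPy):

import numpy as np

# after standardization, each column has mean ~0 and standard deviation ~1
print(np.round(covidanalysisstand.mean(axis=0), 2))
print(np.round(covidanalysisstand.std(axis=0), 2))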
  • Run an Isolation Forest model to detect outliers.

Pass the standardized data to the fit method. 18 countries are identified as outliers; these countries have anomaly values of -1. The number flagged is driven by the contamination parameter of 0.1.

clf = IsolationForest(n_estimators=100, max_samples='auto',
  contamination=.1, max_features=1.0)
clf.fit(covidanalysisstand)

Output:

IsolationForest(behaviour='deprecated', bootstrap=False,
  contamination=0.1, max_features=1.0, max_samples='auto',
  n_estimators=100, n_jobs=None, random_state=None, verbose=0,
  warm_start=False)

covidanalysis['anomaly'] = clf.predict(covidanalysisstand)
covidanalysis['scores'] = clf.decision_function(covidanalysisstand)
covidanalysis.anomaly.value_counts()

Output:

 1    156
-1     18
Name: anomaly, dtype: int64
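The 18 outliers follow directly from the contamination setting: with 174 countries in the analysis DataFrame (156 inliers plus 18 outliers), roughly 10% are flagged. A quick check:

# contamination=0.1 asks the model to flag roughly 10% of observations
print((covidanalysis.anomaly == -1).mean())   # about 0.10 (18 of 174 countries)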
  • Create outlier and inlier data frames.

List the top 10 outliers according to anomaly score.

inlier, outlier = covidanalysis.loc[covidanalysis.anomaly==1],\
  covidanalysis.loc[covidanalysis.anomaly==-1]
outlier[['location','total_cases_pm','total_deaths_pm',\
  'median_age','gdp_per_capita','scores']].\
  sort_values(['scores']).\
  head(10)

Output:

         location        total_cases_pm  ...  gdp_per_capita  scores
iso_code                                 ...
SGP      Singapore             7,825.69  ...       85,535.38   -0.23
QAT      Qatar                35,795.16  ...      116,935.60   -0.21
BHR      Bahrain              19,082.23  ...       43,290.71   -0.14
BEL      Belgium               5,401.90  ...       42,658.58   -0.12
ITA      Italy                 4,016.20  ...       35,220.08   -0.08
CHL      Chile                16,322.75  ...       22,767.04   -0.08
ESP      Spain                 5,430.63  ...       34,272.36   -0.06
SWE      Sweden                7,416.18  ...       46,949.28   -0.04
GBR      United Kingdom        4,256.44  ...       39,753.24   -0.04
LUX      Luxembourg            7,735.12  ...       94,277.96   -0.04

(The total_deaths_pm and median_age columns are elided from the printed output.)
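The scores can also be read in the other direction. Sorting in descending order lists the countries the model considers most typical; this complementary view is not part of the original recipe output:

# the most positive scores belong to the most inlier-like countries
inlier[['location','scores']].\
  sort_values(['scores'], ascending=False).\
  head(10)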
  • Plot the outliers and inliers.
ax = plt.axes(projection='3d')
ax.set_title('Isolation Forest Anomaly Detection')
ax.set_zlabel("Cases Per Million")
ax.set_xlabel("GDP Per Capita")
ax.set_ylabel("Median Age")
ax.scatter3D(inlier.gdp_per_capita, inlier.median_age, inlier.total_cases_pm, label="inliers", c="blue")
ax.scatter3D(outlier.gdp_per_capita, outlier.median_age, outlier.total_cases_pm, label="outliers", c="red")
ax.legend()
plt.tight_layout()
plt.show()

Output:

Inlier and outlier countries by GDP, median age, and cases per million

The preceding steps demonstrate the use of Isolation Forest as an alternative
to k-nearest neighbor for anomaly detection.
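One caveat worth noting: Isolation Forest builds its trees from random partitions, so scores can shift slightly between runs (the random_state parameter shows as None in the fit output above). If you need reproducible results, you can fix it, as in this sketch (the value 0 is arbitrary):

# fixing random_state makes the scores reproducible across runs
clf = IsolationForest(n_estimators=100, max_samples='auto',
  contamination=.1, max_features=1.0, random_state=0)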

How It Works

We use Isolation Forest in this recipe much as we used k-nearest neighbor in the previous recipe. In step 3, we pass a standardized dataset to the Isolation Forest fit method and then use its predict and decision_function methods to get the anomaly flag and score, respectively. We use the anomaly flag in step 4 to separate the data into inliers and outliers. We plot the inliers and outliers in step 5. Since there are only three dimensions in the plot, it does not quite capture all of the features in our Isolation Forest model, but the outliers (the red dots) clearly have higher GDP per capita and median age; they typically sit to the right of, and behind, the inliers. The results from Isolation Forest are quite similar to the k-nearest neighbor results. Qatar, Singapore, and Hong Kong have the highest (most negative) anomaly scores. Belgium is not far behind, just as with the KNN model, most likely because Belgium has an exceptionally high total of deaths per million, the highest in the dataset. We should consider removing these four observations from any multivariate analyses we conduct.
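Removing them is straightforward once the anomaly flag and scores are in place. Two hedged sketches (covidanalysis_trimmed and mostanomalous are names introduced here for illustration):

# one option: keep only the rows the model flagged as inliers
covidanalysis_trimmed = covidanalysis.loc[covidanalysis.anomaly == 1]

# or drop only the observations with the most negative scores; the exact
# set of four depends on the scores in your run
mostanomalous = covidanalysis.sort_values(['scores']).head(4).index
covidanalysis_trimmed = covidanalysis.drop(mostanomalous)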

There’s More

Isolation Forest is a good alternative to k-nearest neighbor, particularly when working with large datasets. The efficiency of its algorithm allows it to handle large samples and a high number of features (variables). The anomaly detection techniques we have used in the last three recipes were designed to improve multivariate analyses and the training of machine learning models. However, we might want to exclude the outliers they help us identify much earlier in the analysis process. For example, if it makes sense to exclude Qatar from our modeling, it might also make sense to exclude Qatar from some descriptive statistics, as sketched below.
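As a sketch of that idea, here is one way to compare descriptive statistics with and without the flagged outliers, using the anomaly flag created in step 3:

# descriptive statistics on the full analysis DataFrame
covidanalysis[['total_cases_pm','total_deaths_pm']].describe()

# the same statistics with the flagged outliers excluded
covidanalysis.loc[covidanalysis.anomaly == 1,
  ['total_cases_pm','total_deaths_pm']].describe()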

See Also

In addition to being useful for anomaly detection, the Isolation Forest algorithm is quite satisfying intuitively. I think the same could be said about k-nearest neighbor. You can read more about Isolation Forest in the original paper:

https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf.

Conclusion

In conclusion, using Isolation Forest to find anomalies is an efficient and highly effective approach for identifying outliers in complex datasets. As businesses continue to rely on data-driven decisions, having a robust method for anomaly detection can significantly enhance operational security, quality control, and fraud detection. Getting ready involves preparing your data, ensuring it is clean and normalized, and understanding the parameters that affect the model's performance. Implementation is straightforward with modern machine learning libraries, which make the process accessible even to those new to the concept. What makes Isolation Forest stand out is how it works: by progressively partitioning the data and identifying anomalies as the points that need the fewest partitions to be isolated, it stays highly efficient. But there is more to anomaly detection than implementation: fine-tuning, validation, and understanding the results are critical to ensuring that your model performs well in real-world applications. As you delve deeper into Isolation Forest, exploring additional resources and techniques in anomaly detection can further strengthen your ability to tackle diverse data challenges. See also related approaches, such as clustering-based methods or deep learning techniques, for broader insight into anomaly detection solutions.
