23 — Pandas Data Cleaning: Using K-nearest Neighbour To Find Outliers

A.I Hub · Sep 13, 2024

Outliers can distort data, skew results, and mislead businesses into making costly mistakes. But what if there were a simple yet powerful way to detect them? Enter the k-nearest neighbors (KNN) algorithm, a technique that not only classifies data but can also help identify outliers hiding in your datasets. Whether you are a data scientist, an analyst, or just getting started in the field, knowing how to use KNN for outlier detection gives you an edge in uncovering anomalies that would otherwise go unnoticed. In this post, we dive into how the method works, how to get started, and why it matters for data-driven decision making. With KNN, you can pinpoint deviations quickly, leaving your data cleaner, more reliable, and primed for better predictive accuracy.

Table of Contents

  • Using K-Nearest Neighbour To Find Outliers
  • Getting Ready
  • How To Do It
  • How It Works
  • There's More
  • See Also

Using K-Nearest Neighbour To Find Outliers

Unsupervised machine learning tools can help us identify observations that are unlike others when we have unlabeled data, that is, when there is no target or dependent variable. In the previous recipe, we used total cases per million as the dependent variable. Even when selecting targets and factors is relatively straightforward, it can be helpful to identify outliers without making any assumptions about relationships between variables. We can use k-nearest neighbors to find the observations that are most unlike the others, those where there is the greatest difference between their values and their nearest neighbors' values.

Getting Ready

You will need PyOD (Python Outlier Detection) and scikit-learn to run the code in this recipe. You can install both by entering the following pip commands in a terminal (or PowerShell on Windows):

pip install pyod
pip install scikit-learn
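If you want to confirm that both packages are available before running the recipe, a minimal import check like the following works; the printed version is just for reference.

import sklearn
from pyod.models.knn import KNN  # raises ImportError if PyOD is missing

print(sklearn.__version__)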

How To Do It

We will use k-nearest neighbors to identify countries whose attributes indicate
that they are most anomalous.

  • Load pandas, PyOD, and scikit-learn, along with the COVID case data.
import pandas as pd
from pyod.models.knn import KNN
from sklearn.preprocessing import StandardScaler

# load the covid data, using the ISO country code as the index
covidtotals = pd.read_csv("data/covidtotals.csv")
covidtotals.set_index("iso_code", inplace=True)
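Before standardizing, it can be worth a quick look at what we loaded. This is an optional check, and it assumes the CSV has the columns used later in this recipe.

# optional sanity check on the loaded data
print(covidtotals.shape)
print(covidtotals[['location', 'total_cases_pm', 'total_deaths_pm']].head())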
  • Create a standardized data frame of the analysis columns.
standardizer = StandardScaler()
analysisvars = ['location', 'total_cases_pm', 'total_deaths_pm',
  'pop_density', 'median_age', 'gdp_per_capita']
covidanalysis = covidtotals.loc[:, analysisvars].dropna()
# standardize all columns except location, which is a string
covidanalysisstand = standardizer.fit_transform(covidanalysis.iloc[:, 1:])
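To verify that the scaling did what we expect, each standardized column should now have a mean of roughly 0 and a standard deviation of roughly 1. A minimal check:

# each standardized column should have mean ~0 and standard deviation ~1
print(covidanalysisstand.mean(axis=0).round(2))
print(covidanalysisstand.std(axis=0).round(2))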
  • Run the KNN model and generate anomaly scores.

We set the contamination parameter to 0.1, a somewhat arbitrary choice: it tells the model to label the 10% of observations with the highest anomaly scores as outliers.

clf_name = 'KNN'
clf = KNN(contamination=0.1)
clf.fit(covidanalysisstand)

Output:

KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest',
metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

y_pred = clf.labels_
y_scores = clf.decision_scores_
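For a sense of how contamination translates into labels: after fitting, PyOD stores the score cutoff in the detector's threshold_ attribute, and labels_ flags every observation whose decision score exceeds it. A quick check, assuming the fit above has run:

import numpy as np

# the score cutoff implied by contamination=0.1
print(clf.threshold_)
# the share of observations labeled outliers; should be close to 0.1
print(np.mean(y_pred == 1))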
  • Show the predictions from the model.

Create a data frame from the y_pred and y_scores NumPy arrays, setting its index to the covidanalysis data frame's index so that we can easily combine the two later. Notice that the decision scores for outliers (outlier = 1) are all higher than those for inliers (outlier = 0).

pred = pd.DataFrame(zip(y_pred, y_scores),
  columns=['outlier', 'scores'],
  index=covidanalysis.index)
pred.sample(10, random_state=1)

Output:

outlier scores
iso_code
LTU 0 0.29
NZL 0 0.61
BTN 0 0.20
HTI 0 0.49
EST 0 0.35
VCT 0 0.34
PHL 0 0.42
BRB 0 0.87
MNG 0 0.27
NPL 0 0.36
pred.outlier.value_counts()

Output:

0 156
1 18
Name: outlier, dtype: int64
pred.groupby(['outlier'])[['scores']].agg(['min','median','max'])

Output:

scores
min median max
outlier
0 0.10 0.44 1.74
1 1.76 1.95 11.86
  • Show the COVID data for the outliers.

First, merge the covid analysis and pred data frames.

covidanalysis.join(pred).\
    loc[pred.outlier == 1,
        ['location', 'total_cases_pm',
         'total_deaths_pm', 'scores']].\
    sort_values(['scores'], ascending=False).\
    head(10)

Output:

          location       total_cases_pm  total_deaths_pm  scores
iso_code
SGP       Singapore            7,825.69             4.44   11.86
QAT       Qatar               35,795.16            50.68    7.54
BHR       Bahrain             19,082.23            61.12    4.01
BEL       Belgium              5,401.90           844.03    3.02
LUX       Luxembourg           7,735.12           175.73    2.39
CHL       Chile               16,322.75           359.96    2.36
USA       United States        9,811.66           407.29    2.21
KWT       Kuwait              12,658.28            90.39    2.17
ITA       Italy                4,016.20           577.97    1.95
NLD       Netherlands          2,968.57           357.63    1.95

These steps show how we can use k-nearest neighbor to identify outliers
based on multivariate relationships.
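To make the workflow reusable, the steps above can be collected into a small helper function. This is a sketch rather than part of the original recipe; get_knn_outliers is a hypothetical name, and it operates on numeric columns only, since standardization requires numbers.

import pandas as pd
from pyod.models.knn import KNN
from sklearn.preprocessing import StandardScaler

def get_knn_outliers(df, cols, contamination=0.1, n_neighbors=5):
    # hypothetical helper: returns the rows flagged as outliers by a KNN
    # detector, with their decision scores, most anomalous first
    analysis = df.loc[:, cols].dropna()
    standardized = StandardScaler().fit_transform(analysis)
    clf = KNN(contamination=contamination, n_neighbors=n_neighbors)
    clf.fit(standardized)
    scores = pd.Series(clf.decision_scores_, index=analysis.index, name='scores')
    outliers = analysis.join(scores)[clf.labels_ == 1]
    return outliers.sort_values('scores', ascending=False)

numericvars = ['total_cases_pm', 'total_deaths_pm',
               'pop_density', 'median_age', 'gdp_per_capita']
print(get_knn_outliers(covidtotals, numericvars).head())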

How It Works

PyOD is a package of Python outlier detection tools. We use it here as a wrapper around scikit-learn's KNN functionality, which simplifies some tasks. Our focus in this recipe is not on building a model, but on getting a quick sense of which observations (countries) are significant outliers once we take all of the data we have into account.

This analysis supports our developing sense that Singapore and Qatar are very different observations from the others in our dataset; they have very high decision scores. The table in step 5 is sorted in descending order of score. Countries such as Belgium, Bahrain, and Luxembourg might also be considered outliers, though that is less clear-cut. The previous recipe did not indicate that they had an overwhelming influence on a regression model, but that model did not take both cases per million and deaths per million into account at the same time. That could also explain why Singapore is even more of an outlier than Qatar here: it has very high cases per million but below-average deaths per million.

Scikit-learn makes scaling very easy. We use the standard scaler in step 2, which returns the z-score for each value in the data frame. The z-score subtracts the variable's mean from each value and divides the result by the variable's standard deviation. Many machine learning tools require standardized data to run well.
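To make the z-score concrete, here is the calculation by hand for one column. Note that StandardScaler divides by the population standard deviation (ddof=0), while pandas defaults to ddof=1.

# z-score by hand for one column, matching StandardScaler's approach
col = covidanalysis['total_cases_pm']
zscores = (col - col.mean()) / col.std(ddof=0)  # ddof=0 matches StandardScaler
print(zscores.head())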

There’s More

K-nearest neighbors is a very popular machine learning algorithm. It is easy to run and to interpret. Its main limitation is that it runs slowly on large datasets.

We have skipped steps we might usually take when building machine learning models. We did not create separate training and test datasets, for example. PyOD allows this to be done easily, but it is not necessary for our purposes here.
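If you did want a train/test workflow, a minimal sketch might look like the following, using scikit-learn's train_test_split; the split parameters here are illustrative.

from sklearn.model_selection import train_test_split
from pyod.models.knn import KNN

# fit the detector on training rows only
X_train, X_test = train_test_split(covidanalysisstand,
                                   test_size=0.3, random_state=0)
clf = KNN(contamination=0.1)
clf.fit(X_train)

# score and label the held-out rows
test_scores = clf.decision_function(X_test)  # raw anomaly scores for new data
test_labels = clf.predict(X_test)            # 0 = inlier, 1 = outlier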

See Also

The PyOD toolkit offers a large number of supervised and unsupervised learning techniques for detecting anomalies in data. The documentation is available at https://pyod.readthedocs.io/en/latest/.

Conclusion

Using k-nearest neighbors to identify outliers is a powerful technique for detecting anomalies with precision. It not only improves data quality but also strengthens predictive models by flagging data points that could skew results. Preparation matters: KNN relies on accurate, well-structured, standardized data. Once you understand how to apply it, the process becomes intuitive, helping you spot unusual patterns that might otherwise go unnoticed, whether in financial data, customer behavior, or industrial processes. And KNN's usefulness doesn't stop there. Combining it with other algorithms, or tuning parameters such as the number of neighbors and the contamination rate, can push your anomaly detection even further. To dive deeper, check out the PyOD documentation and the related recipes mentioned above.
