How To Use Isolation Forest to Detect Outliers in Machine Learning

Detecting anomalies or outliers in machine learning means identifying the data points that do not follow the usual trend. There are various methods to detect and handle outliers. Doing so is important because outliers can strongly affect the training process and leave us with a weak predictive model. In this article, we will learn how to use isolation forest to detect outliers in machine learning using Python. We will also learn how the isolation forest works and how it detects outliers in a dataset.

Before going to the isolation forest, we assume that you already have a solid knowledge of machine learning models, decision trees, and random forest algorithms, as the isolation forest is based on decision trees.




How to use isolation forest to detect outliers in Machine Learning?

Isolation forest is an unsupervised machine learning approach for detecting outliers in a dataset. It builds on the decision tree algorithm: each tree isolates data points by randomly choosing a feature and then a random split value between the minimum and maximum values of that feature. Because of this random partitioning, anomalous data points end up on shorter paths in the trees than the rest of the data, and that is how they are distinguished.

Before we see how to use isolation forest to detect outliers, let us understand what an outlier in a dataset is.

What is an outlier in a Dataset?

As we said, an outlier is any anomalous behavior in the dataset. In other words, a data point is identified as an outlier if it does not follow the usual trend. Outliers can be caused by many factors, including human error, context, and collection error.

[Figure: graph in which one point, shown in green, is an outlier]

As you can see, the green point in the above graph does not follow the usual trend and hence is an outlier.

How does the Isolation forest detect outliers?

As we discussed earlier, the isolation forest is based on the decision tree algorithm. It isolates points by randomly selecting a feature and splitting the dataset into branches. Under these random splits, outliers have a high chance of being isolated sooner than normal points.

[Figure: decision tree isolating an outlier]

As you can see, it takes fewer splits to isolate the outlier than the usual data points. Let us now take a simple example to understand how the isolation forest actually isolates the data points.

Let us assume that we have the following data points with an outlier.

[Figure: sample dataset with one outlier]

As you can see, we have an outlier in our dataset. So, we will now learn how the isolation forest will detect the outlier.

The very first thing that the isolation forest does is randomly split the dataset, building a binary decision tree. Let’s say that it splits the data in this way.

[Figure: first random split of the sample dataset]

The same process of random splitting continues until all the data points are isolated. Let us assume that the second split takes place in the following way.

[Figure: second random split of the sample dataset]

As you can see, in only two splits, we were able to isolate the outlier. Because outliers lie far away from the other data points, they are easy to isolate and it takes fewer splits to reach them. If we continue splitting, it will take many more steps to isolate the other points, as they are very close to each other.

The algorithm builds a forest of these random trees and, for each data point, determines the average number of splits needed to isolate it. The fewer splits it takes to isolate a data point, the more likely that point is an outlier.
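The intuition above can be sketched in a few lines of plain Python. This is only a one-dimensional illustration and not how scikit-learn implements the algorithm: the helpers splits_to_isolate and average_splits and the sample values are made up for this example. The code repeatedly picks a random cut between the current minimum and maximum, keeps the side that contains the point of interest, and counts the cuts until that point is alone.

```python
import random

def splits_to_isolate(values, target, rng):
    """Count random cuts until `target` is alone (values must be distinct)."""
    subset = list(values)
    splits = 0
    while len(subset) > 1:
        # random split value between the current minimum and maximum
        cut = rng.uniform(min(subset), max(subset))
        # keep only the side of the cut that still contains the target
        if target < cut:
            subset = [v for v in subset if v < cut]
        else:
            subset = [v for v in subset if v >= cut]
        splits += 1
    return splits

def average_splits(values, target, trials=1000, seed=0):
    """Average the split count over many random runs, like trees in a forest."""
    rng = random.Random(seed)
    total = sum(splits_to_isolate(values, target, rng) for _ in range(trials))
    return total / trials

# ten points close together and one point far away (hypothetical data)
values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 100]
avg_outlier = average_splits(values, target=100)
avg_inlier = average_splits(values, target=14)
print("average splits to isolate 100:", avg_outlier)
print("average splits to isolate 14:", avg_inlier)
```

The far-away value needs clearly fewer cuts on average than a value in the middle of the cluster, which is the signal the isolation forest relies on.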

Once the algorithm isolates all the data points, then it uses the following equation to detect the anomalies or outliers.

[Figure: anomaly score equation]

Using the above function, the algorithm assigns a score to each data point. If the score is close to 1, the data point is considered to be an outlier, and if the score is well below 0.5, the data point is considered to be a normal point.
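The equation itself is not reproduced in the text, so the sketch below assumes it is the score from the original Isolation Forest paper: s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length of point x over the trees, n is the number of points used to build each tree, and c(n) = 2H(n-1) - 2(n-1)/n with H(i) approximated by ln(i) + 0.5772156649. The function names c and anomaly_score are chosen for this example only.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Normalizing term: average path length of an unsuccessful search among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256
print("short path:", anomaly_score(2.0, n))
print("average path:", anomaly_score(c(n), n))
print("long path:", anomaly_score(12.0, n))
```

A point whose average path length equals c(n) gets a score of exactly 0.5, which is why 0.5 acts as the dividing line between short and long paths.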

Implementation of isolation forests to detect outliers in machine learning

A dataset whose output values are continuous is known as a regression dataset. In this section, we will use a sample dataset about the price of houses and will detect the outliers in the prices of houses.

Before going to the implementation of isolation forest to detect outliers in machine learning, make sure that you have installed the following modules on your system as we will be using them.

  • scikit-learn (imported as sklearn)
  • pandas
  • NumPy
  • matplotlib
  • seaborn
  • plotly

You can use either conda or pip command to install the modules.
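If you are not sure which of these modules are already installed, a quick check along the following lines can help. It is only a convenience sketch: the helper check_modules is not part of any library, and the list uses the import names (for example, sklearn is installed with pip under the name scikit-learn).

```python
import importlib.util

# import names of the modules used in this article
REQUIRED = ["sklearn", "pandas", "numpy", "matplotlib", "seaborn", "plotly"]

def check_modules(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_modules(REQUIRED)
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```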

Let us now import the dataset and explore it using the pandas module.

# importing the module
import pandas as pd

# importing dataset 
data = pd.read_csv('house.csv')

# heading of the dataset
data.head()

Output:

[Figure: first rows of the regression dataset]

As you can see, there are null values in our dataset. Let us first remove them.

# removing the null values
data.dropna(inplace=True)

Now, let us visualize the price of houses in a scatter plot.

# importing the module
import plotly.express as px

# plotting a scatter plot of the price against the row index
fig = px.scatter(data, x=data.index, y='price')
fig.show()

Output:

[Figure: scatter plot of house prices]

As you can see, there are some outliers in our dataset. We will use the isolation forest to detect them.

Training isolation forest to detect outliers in machine learning

Now, the next step is to train the model using the dataset and find the outliers. As isolation forest is an unsupervised machine learning algorithm, we will not split the dataset into testing and training parts or into input and output variables.

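One of the important parameters of an isolation forest is contamination. The contamination is the estimated proportion of outliers in our dataset, given as a fraction: for example, 0.01 means that about 1 percent of the points are expected to be outliers.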
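To get a feel for this parameter, here is a small sketch on synthetic data (randomly generated points, not the house dataset). It assumes scikit-learn and NumPy are installed and fixes random_state only so that the run is repeatable. The number of flagged points grows with the contamination value.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # synthetic data

flagged = {}
for rate in (0.01, 0.05, 0.10):
    model = IsolationForest(contamination=rate, random_state=0)
    preds = model.fit_predict(X)               # 1 = normal, -1 = outlier
    flagged[rate] = int((preds == -1).sum())   # how many points were flagged
    print(f"contamination={rate}: {flagged[rate]} of {len(X)} points flagged")
```

So contamination does not make the model better at finding outliers; it only decides how many of the most isolated points get the outlier label.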

Let us train the model using a contamination of 0.01, that is, 1 percent.

# importing the module
from sklearn.ensemble import IsolationForest as IF

# isolation forest with 0.01 contamination rate
model = IF(contamination = 0.01)

# model training
model.fit(data)

# making predictions 
preds = model.predict(data)

Let us now print out the predictions:

# printing
print(preds)

Output:

[Figure: printed array of predictions]

The predictions of the isolation forest will be either 1 or -1, where 1 marks the normal values and -1 marks the anomalies.
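Besides the 1 and -1 labels, scikit-learn's IsolationForest also exposes a continuous score through decision_function, where negative values correspond to the -1 label. The sketch below shows the relationship on synthetic data (not the house dataset); the shifted points and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))   # synthetic data
X[:5] += 8.0                    # move five points far away from the rest

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
preds = model.predict(X)               # 1 = normal, -1 = outlier
scores = model.decision_function(X)    # negative score means outlier

is_outlier = preds == -1
print("flagged rows:", np.flatnonzero(is_outlier))
print("five lowest scores:", np.sort(scores)[:5])
```

The scores are useful when you want to rank the points from most to least anomalous instead of relying only on the cut-off chosen by contamination.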

Now, let us visualize the findings of the isolation forests.

# adding the outliers to the dataset 
data['outliers'] = preds

# checking the outliers
outliers = data.query('outliers == -1')

# importing the plot
import plotly.graph_objects as go

# plotting the graph normal points
normal = go.Scatter(x=data.index.astype(str),y=data['price'],name="Normal",mode='markers')

# plotting the outliers
outlier = go.Scatter(x=outliers.index.astype(str),y=outliers['price'],name="Outliers",mode='markers',
                marker=dict(color='red', size=6,
                            line=dict(color='red', width=1)))

# labeling the graph
layout = go.Layout(title="Isolation Forest to detect outliers in machine learning",yaxis_title='Price',xaxis_title='x-axis',)

# plotting 
dataset = [normal, outlier]

# plotting the graph
fig = go.Figure(data=dataset, layout=layout)
fig.show()

Output:

[Figure: house prices with the detected outliers highlighted]

As you can see, the algorithm has flagged some of the points as outliers. Now, you might be wondering why some of the flagged points look like normal data points. This does not mean the detection is wrong. We trained the algorithm on the whole dataset, which contains information about the number of rooms, area, location, and price, but we are visualizing only the price (because we cannot visualize all the attributes at once). So some points may not be outliers in terms of price but can be outliers in another attribute. Let us visualize the area to understand this.

# adding the outliers to the dataset 
data['outliers'] = preds

# checking the outliers
outliers = data.query('outliers == -1')

# importing the plot
import plotly.graph_objects as go

# plotting the graph normal points
normal = go.Scatter(x=data.index.astype(str),y=data['area'],name="Normal",mode='markers')

# plotting the outliers
outlier = go.Scatter(x=outliers.index.astype(str),y=outliers['area'],name="Outliers",mode='markers',
                marker=dict(color='red', size=6,
                            line=dict(color='red', width=1)))

# labeling the graph
layout = go.Layout(title="Isolation Forest to detect outliers in machine learning",yaxis_title='area',xaxis_title='x-axis',)

# plotting 
dataset = [normal, outlier]

# plotting the graph
fig = go.Figure(data=dataset, layout=layout)
fig.show()

Output:

[Figure: house areas with the detected outliers highlighted]

So, if we want to find the outliers in only one attribute, say only the price of houses, we have to train the isolation forest model on only the price rather than on the whole dataset.

Let us drop other input variables and then train the model on only price.

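# dropping the unwanted attributes
# (the label column added above is named 'outliers')
data.drop('outliers', axis=1, inplace=True)
data.drop('number_of_rooms', axis=1, inplace=True)
data.drop('area', axis=1, inplace=True)
data.drop('floor', axis=1, inplace=True)
data.drop('latitude', axis=1, inplace=True)
data.drop('longitude', axis=1, inplace=True)

# isolation forest with 0.01 contamination rate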
model = IF(contamination = 0.01)

# model training
model.fit(data)

# making predictions 
preds = model.predict(data)

We can also start again from a fresh copy of the dataset and repeat the same steps.

# importing dataset 
data1 = pd.read_csv('house.csv')

# removing the null values
data1.dropna(inplace=True)

# dropping the unwanted attributes
data1.drop('number_of_rooms', axis=1, inplace=True)
data1.drop('area', axis=1, inplace=True)
data1.drop('floor', axis=1, inplace=True)
data1.drop('latitude', axis=1, inplace=True)
data1.drop('longitude', axis=1, inplace=True)

# isolation forest with 0.01 contamination rate
model = IF(contamination = 0.01)

# model training
model.fit(data1)

# making predictions 
preds = model.predict(data1)

Now, let us visualize the outliers in the price attribute only.

# adding the outliers to the dataset 
data1['outliers'] = preds

# checking the outliers
outliers = data1.query('outliers == -1')

# importing the plot
import plotly.graph_objects as go

# plotting the graph normal points
normal = go.Scatter(x=data1.index.astype(str),y=data1['price'],name="Normal",mode='markers')

# plotting the outliers
outlier = go.Scatter(x=outliers.index.astype(str),y=outliers['price'],name="Outliers",mode='markers',
                marker=dict(color='red', size=6,
                            line=dict(color='red', width=1)))

# labeling the graph
layout = go.Layout(title="Isolation Forest to detect outliers in machine learning",yaxis_title='Price',xaxis_title='x-axis',)

# plotting 
dataset = [normal, outlier]

# plotting the graph
fig = go.Figure(data=dataset, layout=layout)
fig.show()

Output:

[Figure: house prices with the outliers detected by the price-only model]

As you can see, this time the model has flagged only the outliers in the price, as we trained the model on the price attribute alone.
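The difference between training on all attributes and training on the price alone can also be reproduced on synthetic data. The sketch below does not use house.csv: the price and area values are randomly generated, and one made-up house with a typical price but an extreme area is appended at the end. It assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
price = rng.normal(loc=100.0, scale=10.0, size=200)  # synthetic prices
area = rng.normal(loc=50.0, scale=5.0, size=200)     # synthetic areas

# append one house with a typical price but an extreme area
price = np.append(price, 100.0)
area = np.append(area, 500.0)
odd = len(price) - 1  # row index of the appended house

X_all = np.column_stack([price, area])  # price and area together
X_price = price.reshape(-1, 1)          # price only

pred_all = IsolationForest(contamination=0.01, random_state=0).fit_predict(X_all)
pred_price = IsolationForest(contamination=0.01, random_state=0).fit_predict(X_price)

print("flagged by the model trained on price and area:", pred_all[odd] == -1)
print("flagged by the model trained on price only:", pred_price[odd] == -1)
```

The appended house is an outlier only because of its area, so the model that never sees the area has no reason to flag it.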

Learn about different statistical methods to detect and handle outliers as well.

NOTE: You can get access to the source code and the data from my GitHub account. Please don’t forget to follow and give me a star.

Summary

Isolation forest is an unsupervised technique to detect the outliers in a dataset. It builds binary decision trees by splitting the dataset randomly until all the data points are isolated. As the outliers are far away from the usual data points, it takes fewer steps to isolate them. In this article, we discussed how we can use isolation forests to detect outliers in machine learning and visualize them.
