What are n_estimators in a random forest?

When implementing a random forest with the sklearn module, you might wonder what the n_estimators parameter is and how it affects the predictions of the model. In this article, we will learn how n_estimators affects the output of the random forest and how we can assign a suitable value to it. Before going into the details of the n_estimators parameter, we have to understand how the random forest actually works.

What is Random Forest in Machine Learning?

A random forest is a supervised machine learning algorithm, which means it is trained on data that contains both input and output values. It builds a forest of decision trees, each one trained on a random sample of the training dataset. That is why it is known as a random forest: it is a forest of randomly built decision trees, as shown below:

[Figure: a random forest made up of decision trees]

To understand in more detail how the random forest works, and to learn how to implement it in Python on classification and regression datasets, you can go through this article.

How is the training of random forest done?

The training of a random forest is similar to that of other supervised learning models. We provide the training dataset to the model. Remember that the training data contains both the input and the output values, so that the model can learn the relation between them. During training, the random forest builds each decision tree on a randomly selected sample of the training data. Each decision tree then makes its own prediction, and the forest combines them: for classification, the class that gets the most votes across the trees (majority voting) becomes the final prediction, and for regression, the predictions of the trees are averaged.

The following diagram helps us to understand the training process of the random forest more clearly.

[Figure: training process of the random forest]

As you can see, we first pass the dataset to the random forest model. The model then creates a number of decision trees, each on a randomly selected sample of the dataset. Once the trees are built, the model does not pick a single best tree. Instead, every tree makes a prediction, and the results are combined by majority voting (or by averaging, for regression) to give the final prediction.
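The way the trees' predictions are combined can be checked directly. The following is a minimal sketch, assuming a small synthetic regression dataset from make_regression (not the house dataset used later in this article). It compares the prediction of a fitted RandomForestRegressor with the average of the predictions of its individual trees, which are available through the estimators_ attribute. The variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic data, used only for illustration (assumption, not the article's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# a small forest with 10 trees
forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# prediction of every individual tree, shape (n_trees, n_samples)
tree_preds = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# averaging over the trees
mean_of_trees = tree_preds.mean(axis=0)

# comparing with the prediction of the whole forest
same = np.allclose(mean_of_trees, forest.predict(X_demo))
print("forest prediction equals mean of tree predictions:", same)
```

For a regressor, the two should match, which suggests that no single tree is selected and that all trees contribute to the final prediction.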

What are n_estimators in a random forest?

As the random forest creates a forest of decision trees on randomly selected samples of the dataset, the n_estimators parameter decides the number of decision trees in that forest. It can be any positive integer. If we set n_estimators to 1, the forest contains only one tree, so the random forest behaves much like a single decision tree (by default, one trained on a random sample of the data).

[Figure: a random forest with n_estimators set to 30]

The figure shows that when n_estimators is set to 30, the random forest creates 30 different decision trees during training.
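The link between n_estimators and the number of trees can be verified on a fitted model. This is a small sketch, assuming synthetic data from make_regression for illustration; it counts the trees stored in the estimators_ attribute after training.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic data, used only for illustration (assumption)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, random_state=0)

# asking for 30 trees
model = RandomForestRegressor(n_estimators=30, random_state=0)
model.fit(X_demo, y_demo)

# counting the fitted decision trees
n_trees = len(model.estimators_)
print("number of trees:", n_trees)  # number of trees: 30
```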

Implementation of n_estimators in a random forest using Python

Now, we will use Python and the sklearn module to implement the random forest and see how n_estimators affects the output when we change its value.

Let us first import the dataset which we want to use in order to implement the random forests.

# importing the pandas module
import pandas as pd

# importing dataset
dataset = pd.read_csv('house.csv')

# printing
dataset.head()

Output:

[Output: first rows of the dataset]

As you can see, there are some null values in our dataset. We will now remove the null values from the dataset.

# removing null values
dataset.dropna(inplace=True)

Our data is now clean as we removed all the null values.
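Whether the null values are really gone can be checked by counting them before and after the call to dropna. The sketch below assumes a tiny made-up DataFrame (the column names and values are hypothetical and are not taken from house.csv).

```python
import numpy as np
import pandas as pd

# a tiny made-up dataset with two missing values (hypothetical data)
demo = pd.DataFrame({
    "area": [1200, np.nan, 1500, 900],
    "price": [200000, 250000, np.nan, 150000],
})

# counting null values before cleaning
nulls_before = int(demo.isnull().sum().sum())

# removing every row that contains a null value
demo_clean = demo.dropna()

# counting null values after cleaning
nulls_after = int(demo_clean.isnull().sum().sum())

print(nulls_before, nulls_after, len(demo_clean))  # 2 0 2
```

Note that dropna removes whole rows, so the dataset becomes smaller. Here, two of the four rows are dropped.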

Splitting the dataset

Once the data is preprocessed, the next step is to split the dataset into input and output values. Let us now separate the input values from the output and store them in separate variables.

# dividing the dataset
X = dataset.drop('price', axis=1)
y = dataset['price']

We have used the price of the house as the output value in this case.

In the next step, we will use the sklearn splitting method to split the dataset into testing and training parts.

# importing the train_test_split method from sklearn
from sklearn.model_selection import train_test_split

# splitting the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

We split the dataset into testing and training parts. The testing part contains 25% of the total data and the remaining 75% has been assigned to the training part.
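The sizes of the two parts can be confirmed from the shapes of the returned arrays. This is a minimal sketch, assuming a made-up array with 100 rows in place of the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data with 100 rows and 2 columns (assumption, for illustration only)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# same split settings as above
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)
```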

Default n_estimators in random forest

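The default value of n_estimators in sklearn's random forest is 100 (since version 0.22; older versions used 10). It means that the model will create 100 decision trees while training on the given dataset and will combine their predictions, by averaging for regression or by majority voting for classification.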
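The default can be read from the model itself instead of relying on memory. The sketch below uses get_params, which every sklearn estimator provides; the printed value assumes scikit-learn 0.22 or newer.

```python
from sklearn.ensemble import RandomForestRegressor

# creating the model without passing n_estimators
default_trees = RandomForestRegressor().get_params()["n_estimators"]

print("default n_estimators:", default_trees)  # 100 on scikit-learn 0.22 or newer
```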

Let us train the random forest model with the default n_estimators values.

# importing the random forest regressor
from sklearn.ensemble import RandomForestRegressor

# instantiating the model with the default parameters (n_estimators=100)
regressor = RandomForestRegressor()

# fitting the model on the training data
regressor.fit(X_train, y_train)

Once the model is trained, we can make predictions and find out the R2-score.

# making predictions on the test data
y_pred = regressor.predict(X_test)

# Importing the required module
from sklearn.metrics import r2_score

# Evaluating the model
print('R score is :', r2_score(y_test, y_pred))

Output:

R score is : 0.32817737008495285

As shown above, we got an R2 score of about 0.33.
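To see what the R2 score measures, it can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The following sketch uses a few made-up numbers (not the house prices) and compares the manual result with r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# made-up true values and predictions (hypothetical numbers)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# residual sum of squares: errors of the model
ss_res = np.sum((y_true - y_hat) ** 2)

# total sum of squares: errors of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(y_true, y_hat))
```

Both values are 0.925 for these numbers. A score of 1 means perfect predictions, and a score of 0 means the model is no better than always predicting the mean.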

Changing n_estimators in random forest

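By changing n_estimators, we can get more accurate or less accurate results, because the number of trees has a direct effect on the predictions of the model. In general, adding more trees makes the combined prediction more stable, but the gains level off after a point while the training time keeps growing.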

This time, we will train the random forest model using 200 decision trees and check the R2 score.

# importing the random forest regressor
from sklearn.ensemble import RandomForestRegressor

# instantiating the model with 200 decision trees
regressor = RandomForestRegressor(n_estimators=200)

# fitting the model on the training data
regressor.fit(X_train, y_train)

The next step is to make predictions and evaluate the performance of the model.

# making predictions on the test data
y_pred = regressor.predict(X_test)

# Importing the required module
from sklearn.metrics import r2_score

# Evaluating the model
print('R score is :', r2_score(y_test, y_pred))

Output:

R score is : 0.37000627897332816

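This time, we got a better result than the previous one (about 0.37 compared to about 0.33) after increasing n_estimators to 200. Note that we did not fix random_state for the model, so part of the difference may come from the randomness of training, and the exact scores will vary from run to run.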
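One simple way to look for a suitable value is to train the model with several values of n_estimators and compare the scores. The following is a rough sketch, assuming synthetic data from make_regression instead of the house dataset, and a fixed random_state so that only the number of trees changes between runs. The list of candidate values is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# synthetic data, used only for illustration (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# training one model per candidate value and storing its R2 score
scores = {}
for n in [1, 10, 50, 100, 200]:
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_tr, y_tr)
    scores[n] = r2_score(y_te, model.predict(X_te))

# printing the scores
for n, score in scores.items():
    print(f"n_estimators={n}: R2 = {score:.3f}")
```

The scores depend on the data, so no output is shown here. On a real dataset, a more reliable comparison would use cross-validation rather than a single train and test split.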

Summary

The n_estimators parameter in a random forest represents the number of decision trees created while training the model. As the random forest builds a forest of decision trees on the dataset, n_estimators decides how many trees that forest contains. In this short article, we discussed what n_estimators is in a random forest and how it can affect the predictions of the model. Moreover, we implemented the random forest in Python and showed how we can change the n_estimators value while training the model.
