# Extra trees classifier and regressor using Python

The extra trees classifier is used for classification problems in machine learning, and the extra trees regressor is used for regression problems. Both work much like the random forest algorithm, so it is highly recommended that you go through the random forest and decision tree algorithms first in order to understand the concepts fully.

In simple words, the extra trees classifier and regressor build many decision trees. At every node they consider a random subset of features and pick the split thresholds at random instead of searching for the best one. The trees are not pruned; the predictions of all trees are combined, by voting for classification and by averaging for regression. In this article, we will discuss how the extra trees classifier and regressor work. We will also explain the difference between the extra trees algorithm, decision trees, and random forests. Moreover, we will implement the extra trees classifier and regressor on classification and regression problems respectively.


## How does the extra trees algorithm work?

The extra trees algorithm is also known as Extremely Randomized Trees. It generates predictive models for classification and regression problems. It is similar to other tree-based approaches like decision trees and random forests, but it injects more randomness when the trees are built, which makes the individual trees less correlated with each other. Because it does not search for the best threshold at every node, training is usually quicker than for a random forest. As a result, it is an effective tool for predictive modeling and data mining.

Like the random forests technique, the extra trees algorithm generates a large number of decision trees, but by default each tree is trained on the whole original dataset rather than on a bootstrap sample. At each node, a predetermined number of features is randomly chosen from the entire set of features. The most significant and distinctive aspect of extra trees is how the split value is selected: rather than searching for the locally optimal threshold of each candidate feature using Gini or entropy, the algorithm draws one threshold at random for each candidate feature and keeps the best of these random splits. As a result, the trees are diverse and only weakly correlated.
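The random split selection described above can be sketched in a few lines of NumPy. This is only a minimal illustration under simplifying assumptions, not scikit-learn's implementation: the helper names `gini` and `random_split` and the toy data are made up for this example, the code handles a single node only, and it scores candidates with the weighted Gini impurity.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def random_split(X, y, n_candidates, rng):
    """Hypothetical helper: draw one random threshold for each randomly
    chosen feature and keep the candidate with the lowest weighted Gini."""
    n_samples, n_features = X.shape
    features = rng.choice(n_features, size=n_candidates, replace=False)
    best = None
    for f in features:
        lo, hi = X[:, f].min(), X[:, f].max()
        if lo == hi:  # constant feature, cannot be split
            continue
        threshold = rng.uniform(lo, hi)  # random cut point, no search
        left = X[:, f] <= threshold
        # weighted impurity of the two child nodes
        score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n_samples
        if best is None or score < best[2]:
            best = (int(f), float(threshold), float(score))
    return best  # (feature index, threshold, weighted impurity) or None

# toy data: the class depends only on feature 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 2] > 0).astype(int)

feature, threshold, impurity = random_split(X, y, n_candidates=2, rng=rng)
print("feature:", feature, "threshold:", threshold, "impurity:", impurity)
```

A random forest would instead try every possible threshold of each candidate feature and keep the best one, which is the more expensive step that extra trees skip.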

### Extra Trees vs Random Forest algorithm

The extra trees algorithm is very similar to the random forest algorithm; the differences lie in how the decision trees are constructed. The following are the main differences between the two:

• The extra trees algorithm uses the whole original dataset for every tree, while the random forest trains each tree on a bootstrap replica.
• The two algorithms also differ in how they select cut points to split the nodes. The random forest searches for the optimum split, while the extra trees algorithm selects the cut points randomly.
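Both differences are visible in the scikit-learn defaults. The short check below is a sketch that assumes scikit-learn is installed; it fits both ensembles on the iris data only so that the individual trees exist and can be inspected.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = load_iris(return_X_y=True)

extra = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# bootstrap=False: every extra tree sees the whole training set
print("extra trees bootstrap:  ", extra.bootstrap)
# bootstrap=True: every forest tree sees a resampled copy of the training set
print("random forest bootstrap:", forest.bootstrap)

# the underlying trees use different split strategies
print("extra trees splitter:   ", extra.estimators_[0].splitter)
print("random forest splitter: ", forest.estimators_[0].splitter)
```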

### Why choose the Extra trees classifier and regressor over the random forest?

Here are some of the properties that can make the extra trees algorithm a better choice.

• The extra trees algorithm uses the whole original sample of data for every tree instead of bootstrap samples.
• It chooses the split points randomly, which reduces variance.
• It is usually faster to train than the random forest algorithm, as it does not spend time searching for the optimal split point at each node.
• The extra randomness lowers the variance of the model, which makes overfitting less likely, although it can slightly increase the bias.
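If you want to check the speed claim on your own machine, a rough timing sketch could look like the following. The synthetic dataset, its size, and the helper name `fit_seconds` are assumptions made for this illustration, and the measured times will differ from machine to machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# synthetic data, used here only for timing purposes
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def fit_seconds(model):
    """Return the wall-clock time needed to fit the model on X, y."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

extra_time = fit_seconds(ExtraTreesClassifier(n_estimators=100, random_state=0))
forest_time = fit_seconds(RandomForestClassifier(n_estimators=100, random_state=0))

print(f"extra trees fit time:   {extra_time:.3f} s")
print(f"random forest fit time: {forest_time:.3f} s")
```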

## Extra Trees classifier using Python

Now we will use the extra trees classifier to predict the flower type. In this section, we will use the well-known iris dataset, which contains information about three different types of flowers. The data ships with scikit-learn in the `sklearn.datasets` submodule, so we just need to load it from there.

Let us first load the dataset.

```python
# importing dataset
from sklearn.datasets import load_iris

# loading the iris dataset
iris = load_iris()
```

### Training extra trees classifier using Python

Before training the extra trees classifier on the given dataset, we have to split the dataset into testing and training parts so that we can use the testing data later to evaluate the model. We will also set the random state to 1 so that the split is reproducible.

```python
# splitting the data into inputs and outputs
Input = iris.data
output = iris.target

# importing the module
from sklearn.model_selection import train_test_split

# splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(Input, output, test_size=0.25, random_state=1)
```

As you can see, we have assigned 25% of the total dataset to the testing part, and the remaining 75% to the training dataset.

Let us now initialize the extra trees classifier and train the model on the training dataset.

```python
# importing the module
from sklearn.ensemble import ExtraTreesClassifier

# initializing the model
extra_classifier = ExtraTreesClassifier()

# Training the model
extra_classifier.fit(X_train, y_train)
```

Once the training is complete, we can then use the model to predict the output class using the testing dataset.

```python
# making predictions
y_pred = extra_classifier.predict(X_test)
```

Now we have the predictions, but we don’t know how good they are, so in order to evaluate the model, we will use several evaluation metrics.

### Evaluating the extra trees classifier

We will use the confusion matrix to evaluate the performance of the extra trees classifier. A simple way to understand the confusion matrix is that every value that lies in the main diagonal shows the correct classification.

```python
# importing seaborn
import seaborn as sns

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix

# providing actual and predicted values
cm = confusion_matrix(y_test, y_pred)

# plotting the matrix as a heatmap; annot=True writes the value in each cell
sns.heatmap(cm, annot=True)
```

Output:

As you can see, only two values have been incorrectly classified by the model while the rest have been correctly classified.

Let us also calculate the classification report of the model, which contains accuracy, precision, recall, and f1-score. All of these metrics can be derived from the confusion matrix.

```python
# importing the classification report
from sklearn.metrics import classification_report

# printing the classification report
print(classification_report(y_test, y_pred))
```

Output:

As you can see, we get an accuracy score of 95% which means only 5% of the testing data has been incorrectly classified while the rest has been correctly classified by the model.
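To make the link between the confusion matrix and these metrics concrete, here is a small worked sketch. The labels below are made up for illustration and are not the iris predictions from this article.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# made-up labels for three classes
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_hat = [0, 0, 1, 2, 1, 2, 2, 1]

# rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)

# accuracy: correct predictions (main diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# recall per class: diagonal over the row sums (actual counts)
recall = np.diag(cm) / cm.sum(axis=1)

# precision per class: diagonal over the column sums (predicted counts)
precision = np.diag(cm) / cm.sum(axis=0)

print(cm)
print("accuracy: ", accuracy)
print("recall:   ", recall)
print("precision:", precision)
```

Here 6 of the 8 labels lie on the main diagonal, so the accuracy is 0.75, the same value that `accuracy_score` returns.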

### Extra Trees classifier vs Random Forest classifier

Let us now train the random forest classifier and extra trees classifier on the same dataset with default parameter values and see which one will perform better.

First, we will initialize the random forest classifier and make predictions.

```python
# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

# instantiate the classifier
random_classifier = RandomForestClassifier()

# fit the model
random_classifier.fit(X_train, y_train)

# testing the model
random_pred = random_classifier.predict(X_test)
```

As you can see, we have trained the random forest model and then made predictions. Let us now calculate the accuracies of both models.

```python
# importing accuracy score
from sklearn.metrics import accuracy_score

# accuracy score
print("Accuracy of extra trees algorithm:  ", accuracy_score(y_test, y_pred))
print("Accuracy of random forest algorithm: ", accuracy_score(y_test, random_pred))
```

Output:

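As you can see, on the given dataset the random forest classifier has performed better than the extra trees classifier. Keep in mind that neither model was given a random state, so the exact scores can change from run to run.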

### Extra trees classifier vs Decision trees classifier

Let us now use the decision trees classifier model and train it on the same training dataset and make predictions to compare it with the extra trees classifier.

```python
# importing decision tree algorithm
from sklearn.tree import DecisionTreeClassifier

# initializing the model with default parameters (Gini criterion)
decision_classifier = DecisionTreeClassifier()

# providing the training dataset
decision_classifier.fit(X_train, y_train)

# making predictions
decision_pred = decision_classifier.predict(X_test)
```

Once the training is complete, let us calculate the accuracy score and compare it with the extra trees classifier.

```python
# accuracy score
print("Accuracy of extra trees algorithm:  ", accuracy_score(y_test, y_pred))
print("Accuracy of decision tree algorithm: ", accuracy_score(y_test, decision_pred))
```

Output:

As you can see, the decision tree classifier also did better than the extra trees classifier on the given dataset.
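A single train/test split on a dataset as small as iris can favor any of the three models by chance, so a fairer comparison averages over several splits. The following is a minimal sketch using 5-fold cross-validation; the `random_state=1` values and the dictionary names `models` and `cv_scores` are chosen here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "extra trees": ExtraTreesClassifier(random_state=1),
    "random forest": RandomForestClassifier(random_state=1),
    "decision tree": DecisionTreeClassifier(random_state=1),
}

# mean accuracy over 5 folds for each model
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")
```

If the mean scores are within one standard deviation of each other, the ranking seen on a single split should not be read as a real difference between the algorithms.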

## Extra trees regressor using Python

Now let us use the extra trees regressor on a regression dataset. This time, we will use a dataset about Bitcoin. Let us first import the dataset and print a few rows.

```python
# importing pandas
import pandas as pd
```

Output:

As you can see, there are a number of columns. We don’t need all these columns. We will use the open and closing price to predict the Volume value. So, let us remove all other columns.

```python
# dropping the columns we do not need
data.drop(columns=["Date", "High", "Low"], inplace=True)
```

If you want to learn more about the Bitcoin dataset and explore it further, please have a look at Two simple ways to analyze the stock market.

Our data is ready and let us move to the training part of the extra trees regressor.

### Training the extra trees regressor model

Before going to the training of the model, let us first split the dataset into testing and training parts.

```python
# splitting the data into inputs and outputs
Input = data.drop("Volume", axis=1)
output = data['Volume']

# importing the module
from sklearn.model_selection import train_test_split

# splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(Input, output, test_size=0.25)
```

Once the splitting is complete, we can then initialize the extra trees regressor and train the model on the training dataset.

```python
# importing the module
from sklearn.ensemble import ExtraTreesRegressor

# initializing the model
regressor = ExtraTreesRegressor()

# Training the model
regressor.fit(X_train, y_train)
```

Once the training is complete, we can then use the testing dataset to make predictions using the trained model.

```python
# Making predictions
y_pred = regressor.predict(X_test)
```

As you can see, our model has made the predictions, but we don’t know how good the predictions are. So, let us jump into the evaluating part.

### Evaluating the extra trees regressor model

Let us first visualize the actual and predicted values using a line graph.

```python
# importing module
import matplotlib.pyplot as plt

# fixing the size of the plot
plt.figure(figsize=(15, 8))

# plotting the graphs
plt.plot(range(len(y_test)), y_test, color='green', label="Actual values")
plt.plot(range(len(y_test)), y_pred, color='red', label="Predicted values")

# showing the plot
plt.legend()
plt.show()
```

Output:

As you can see, the green line shows the actual values while the red line shows the predicted values. Let us also calculate the R-square score of the model.

```python
# Importing the required module
from sklearn.metrics import r2_score

# Evaluating model performance
print('R-square score is :', r2_score(y_test, y_pred))
```

Output:

As you can see, we get an R-square score of 0.122, which means the model explains only about 12% of the variance in the Volume values.
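To see what the R-square score measures, it helps to compute it by hand once. The sketch below uses four made-up values, not the Bitcoin data, and applies the usual definition: one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import r2_score

# made-up actual and predicted values
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 6.0, 9.5])

# residual sum of squares: error left after the model's predictions
ss_res = np.sum((actual - predicted) ** 2)

# total sum of squares: error of always predicting the mean
ss_tot = np.sum((actual - actual.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print("manual R-square: ", r2_manual)
print("sklearn R-square:", r2_score(actual, predicted))
```

Here the residual sum of squares is 1.75 and the total sum of squares is 20, so the score is 1 - 1.75 / 20 = 0.9125. A score close to 0 means the model is barely better than always predicting the mean.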

### Extra trees regressor vs Random forest regressor

Let us now train a random forest regressor and evaluate it to compare the results with the extra trees regressor.

First, we need to initialize the random forest regressor, then train the model and finally make predictions.

```python
# importing the random forest regressor
from sklearn.ensemble import RandomForestRegressor

# initializing the model
random_regressor = RandomForestRegressor()

# training the model
random_regressor.fit(X_train, y_train)

# making predictions
random_pred = random_regressor.predict(X_test)
```

Once the model has completed the predictions, we can then compare the R-square score of the random forest regressor with the extra tree regressor.

```python
# Evaluating model performance
print('R-square score of extra trees is  :', r2_score(y_test, y_pred))
print('R-square score of random forest is  :', r2_score(y_test, random_pred))
```

Output:

As you can see, the extra tree regressor performed better than the random forest regressor on the given dataset.

### Extra trees regressor vs Decision trees

Now, we will compare the results of the extra trees regressor with the decision trees. Let us first initialize the decision tree regressor and then train the model to make predictions.

```python
# importing the decision tree regressor
from sklearn.tree import DecisionTreeRegressor

# initializing the model
decision_regressor = DecisionTreeRegressor()

# training the model
decision_regressor.fit(X_train, y_train)

# making predictions
decision_pred = decision_regressor.predict(X_test)
```

Once the training and prediction is complete, we can compare the r-square scores.

```python
# Evaluating model performance
print('R-square score of extra trees is  :', r2_score(y_test, y_pred))
print('R-square score of decision tree is  :', r2_score(y_test, decision_pred))
```

Output:

As you can see, the extra trees regressor performed better than the decision trees.

NOTE: You can access the source code and dataset from my GitHub account. Please don’t forget to follow and give me a star.

## Summary

The extra trees algorithm is short for extremely randomized trees. It is similar to the random forest algorithm, but in extra trees the split points of the nodes are chosen at random and each tree is trained on the whole dataset by default. The extra trees algorithm can be used for classification and regression problems. In this article, we discussed how we can use the extra trees algorithm for classification and regression problems. Moreover, we compared the results with the random forest and decision tree algorithms.
