Categorical encoding in pandas is the process of converting categorical data into numeric form with the pandas library so that the data can be passed to machine learning models. We need to encode categorical values because most machine learning models work only with numeric values, not strings or objects. In this article, we will learn four different ways of categorical encoding in pandas: replacing values, label encoding, one-hot encoding (dummy variables), and custom encoding. We will also look at why encoding categorical values is important before training machine learning models.

## What is feature encoding in Machine Learning?

Machine learning models can only work with numerical values. For this reason, it is necessary to transform the categorical values of the relevant features into numerical ones. This process is called feature encoding. For example, a color feature with the values red, green, and blue can be encoded as 0, 1, and 2.

After encoding, every categorical value is represented by a numeric value.
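The color example can be sketched in a few lines. This is a minimal illustration on made-up data: the `color` column and its values are assumptions, and `pd.factorize()` is just one convenient way to assign the integer codes.

```python
import pandas as pd

# Hypothetical toy data; the "color" column is an assumption for illustration
colors = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# pd.factorize assigns an integer code to each distinct value in order of
# first appearance and also returns the distinct values themselves
codes, uniques = pd.factorize(colors["color"])
colors["color_encoded"] = codes

print(colors)
print(list(uniques))  # ['red', 'green', 'blue']
```

Here red, green, and blue receive the codes 0, 1, and 2 because that is the order in which they first appear.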

### Why do we do feature encoding in Machine Learning?

Most machine learning models require all input and output variables to be numeric. This means that if your data contains categorical values, you must encode them as numbers before you can fit and evaluate a model. Some libraries, such as CatBoost and LightGBM, can handle categorical features themselves. For other models, such as linear regression, KNN, SVM, decision trees, and isolation forest, we need to convert the data into numeric values ourselves.

For example, let us import a dataset that has categorical values and apply the XGBoost algorithm.

```python
# importing pandas and the xgboost module
import pandas as pd
import xgboost as xgb

# importing dataset
data = pd.read_excel('Label_Encoding.xlsx')

# dividing the dataset into features and target
X = data.drop('Marrige_Status', axis=1)
y = data['Marrige_Status']

# default parameters
xgboost_clf = xgb.XGBClassifier()

# training the model
xgboost_clf.fit(X, y)
```

Fitting the model raises an error because the algorithm cannot work with non-numeric values. That is why it is necessary to convert the data into numeric values before training the model. In the upcoming sections, we will learn various ways of categorical encoding in pandas.
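Before training, it can help to check which columns are non-numeric and therefore need encoding. The sketch below uses a small stand-in DataFrame: only the `Marrige_Status` column comes from the article, while the `Age` and `Salary` columns and all values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the dataset; "Age" and "Salary" are assumed columns
data = pd.DataFrame({
    "Age": [25, 32, 47, 51],
    "Salary": [30000, 52000, 61000, 58000],
    "Marrige_Status": ["No", "Yes", "Yes", "No"],
})

# select_dtypes(exclude="number") keeps only the non-numeric columns,
# which are the ones that need encoding before training
non_numeric = data.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # ['Marrige_Status']
```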

## Categorical encoding in pandas

Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. Apart from data manipulations, pandas can also be used to visualize data in various plots.

Now let us see how we can do categorical encoding in pandas. First, let us import the dataset.

```python
# importing pandas
import pandas as pd

# importing dataset
data = pd.read_excel('Label_Encoding.xlsx')

# heading
data.head()
```

The output shows that the dataset contains categorical values. Let us now jump into different ways of categorical encoding in pandas.

### Method-1 for categorical encoding in pandas – Replacing

Before going into some standard methods of categorical encoding in pandas, let us try to encode in a simple way: replacing each category with a numeric value. pandas makes it easy to replace the text values directly with their numeric equivalents by using the replace() method.

First, let us print all the categorical values from the dataset.

```python
# printing counts of values
data["Marrige_Status"].value_counts()
```

Output:

```
Yes    9
No     6
Name: Marrige_Status, dtype: int64
```

Next, we create a nested mapping dictionary: the outer key is the column to process, and the inner dictionary maps each category to its numeric value.

```python
# dictionary with values
encoding = {"Marrige_Status": {"Yes": 0, "No": 1}}
```

As you can see, we have created a dictionary that specifies the numeric values for each of the categories.

Now we can use replace() to replace the categorical values with the specified numeric values.

```python
# copying the dataset
encoded = data.copy()

# categorical encoding in pandas using the replace method
encoded = encoded.replace(encoding)

# heading
encoded.head()
```

The categorical values have been replaced by the corresponding numeric values. Now, we can apply any of the machine learning algorithms.
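The same mapping can also be applied to a single column with `Series.map()`. This is an alternative sketch on made-up rows, not the method used above. Unlike replace(), map() turns any value that is missing from the dictionary into NaN, which makes unexpected categories easy to spot.

```python
import pandas as pd

# Hypothetical rows standing in for the Marrige_Status column
data = pd.DataFrame({"Marrige_Status": ["Yes", "No", "Yes", "Yes", "No"]})

# same mapping as in the replace() example
mapping = {"Yes": 0, "No": 1}

encoded = data.copy()
encoded["Marrige_Status"] = encoded["Marrige_Status"].map(mapping)

# values not present in the mapping would show up as NaN here
print(encoded["Marrige_Status"].isna().sum())  # 0
print(encoded["Marrige_Status"].tolist())      # [0, 1, 0, 0, 1]
```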

### Method-2 for categorical encoding in pandas – Label Encoding

Label encoding converts the labels in a column into numeric form so that the column becomes machine-readable. Each distinct value in the column is simply assigned its own number.

Let us first print the info about the dataset.

```python
# printing info about data
data.info()
```

The output shows that Marrige_Status has the object type. Before applying label encoding in pandas, we first have to convert the object type into a categorical type, because the cat accessor is only available on categorical columns.

```python
# converting object type into categorical type
data["Marrige_Status"] = data["Marrige_Status"].astype('category')
```

Now, we can apply label encoding in pandas.

```python
# copying the dataset
encoded = data.copy()

# label encoding in pandas
encoded["Marrige_Status"] = encoded["Marrige_Status"].cat.codes

# printing
encoded.head()
```

The categorical values have been converted into numeric values, with a unique number assigned to each category. pandas assigns the codes in the sorted order of the categories, so here No becomes 0 and Yes becomes 1.
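To see which code belongs to which category, we can inspect the categories of the column. The sketch below uses made-up rows, including one missing value, to show that cat.codes marks missing values with -1.

```python
import pandas as pd

# Hypothetical rows; the None entry is included to show how missing values are coded
data = pd.DataFrame({"Marrige_Status": ["Yes", "No", None, "Yes"]})

status = data["Marrige_Status"].astype("category")

# the position of each category in cat.categories is its code
code_to_label = dict(enumerate(status.cat.categories))
print(code_to_label)  # {0: 'No', 1: 'Yes'}

# missing values receive the code -1
codes = status.cat.codes
print(codes.tolist())  # [1, 0, -1, 1]
```

Keeping the `code_to_label` dictionary around makes it possible to translate model inputs or outputs back into the original labels later.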

### Method-3 for categorical encoding in pandas – One Hot encoding

One-hot encoding is a common method for dealing with categorical data in machine learning. It represents each category as its own binary column, a format that can be fed into machine learning algorithms.

Label encoding has the advantage that it is straightforward, but it has the disadvantage that the numeric values can be “misinterpreted” by the algorithms. For example, the code 0 is obviously less than the code 4, but that ordering may not correspond to anything meaningful in the real data. To overcome this problem, we can use one-hot encoding.

One-hot encoding converts each category value into a new column and assigns a 1 or 0 value to the column. This has the benefit of not weighting a value improperly but does have the downside of adding more columns to the data set.

Let us copy the original dataset and then use one-hot encoding to convert the categorical values into numeric values.

In pandas, we can use the get_dummies() function for one-hot encoding.

```python
# copying the dataset
encoded = data.copy()

# one-hot encoding in pandas
encoded = pd.get_dummies(encoded, columns=["Marrige_Status"])

# heading
encoded.head()
```

The Marrige_Status column has been replaced by one indicator column per category, so the categorical values are now encoded with the one-hot method in pandas.
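Two get_dummies() options are worth knowing. Newer pandas releases return True/False columns by default, so passing `dtype=int` gives the 1/0 values described above. With `drop_first=True`, the first category's column is dropped, which is enough for a two-category feature. The sketch below runs on made-up rows, and the `Age` column is an assumption.

```python
import pandas as pd

# Hypothetical rows; "Age" is an assumed extra column
data = pd.DataFrame({
    "Age": [25, 32, 47],
    "Marrige_Status": ["No", "Yes", "Yes"],
})

# dtype=int produces 1/0 instead of True/False indicator values
encoded = pd.get_dummies(data, columns=["Marrige_Status"], dtype=int)
print(encoded.columns.tolist())
# ['Age', 'Marrige_Status_No', 'Marrige_Status_Yes']

# drop_first=True keeps one column fewer than there are categories
reduced = pd.get_dummies(
    data, columns=["Marrige_Status"], dtype=int, drop_first=True
)
print(reduced.columns.tolist())
# ['Age', 'Marrige_Status_Yes']
```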

### Method-4 for categorical encoding in pandas – Custom encoding

Depending on the data set, you may be able to use some combination of label encoding and one-hot encoding to create a binary column that meets your needs for further analysis. Here we use NumPy's where() function to write 1 where the status contains "Yes" and 0 everywhere else.

```python
# importing numpy
import numpy as np

# copying the dataset
encoded = data.copy()

# custom encoding
encoded["Marrige_Status"] = np.where(encoded["Marrige_Status"].str.contains("Yes"), 1, 0)

# heading
encoded.head()
```

The column now contains 1 for all the values that were Yes and 0 for the rest.
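One detail to keep in mind is that str.contains() matches substrings. The sketch below compares it with an exact comparison on made-up rows. The value "Yes, remarried" is hypothetical and only serves to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "Yes, remarried" is an invented value for illustration
data = pd.DataFrame({"Marrige_Status": ["Yes", "No", "Yes, remarried", "No"]})

encoded = data.copy()

# substring match: flags every value that contains "Yes"
encoded["contains_yes"] = np.where(
    encoded["Marrige_Status"].str.contains("Yes"), 1, 0
)

# exact match: flags only values equal to "Yes"
encoded["equals_yes"] = np.where(encoded["Marrige_Status"] == "Yes", 1, 0)

print(encoded)
```

Which of the two is appropriate depends on whether related values should fall into the same binary group.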

Now you can easily apply any machine learning algorithm on the given dataset to train the model.

## Summary

Machine learning models can only work with numerical values. For this reason, it is necessary to transform the categorical values of the relevant features into numerical ones. This process is called feature encoding. There are various modules and methods for encoding. In this short article, we learned four different methods of categorical encoding with the pandas module: replacing values, label encoding, one-hot encoding, and custom encoding.
