Sklearn **standardscaler** converts numeric data to a standard scale, which makes it easier for a machine learning model to analyze. Machine learning models often perform better when the features are on a comparable scale, especially algorithms that are sensitive to the magnitude of the input values, such as linear regression, KNN, and logistic regression. In this short article, we will learn how to use sklearn standardscaler to convert data to a standard scale. We will also learn why it is important to scale the data before training the model.

## Introduction to sklearn standardscaler

Before going into sklearn standardscaler, let us first understand the concept of scaling. In machine learning, scaling means bringing the features of a dataset onto a comparable range. A dataset can contain features with very different units and magnitudes, which can affect the training process of a model. A model trained on unscaled data can be biased toward the features with the largest values. So, it is usually important to scale the data before applying a machine learning model that is sensitive to feature magnitude.
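To make the effect of unscaled features concrete, here is a minimal sketch using made-up values (the salary and experience columns are hypothetical, not taken from any dataset in this article). It compares Euclidean distances, which distance-based models such as KNN rely on, before and after standard scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical samples: [salary, years of experience]
X = np.array([[50000.0, 1.0],
              [51000.0, 20.0],
              [80000.0, 2.0]])

# Euclidean distances from sample 0 on the raw data
raw_d01 = np.linalg.norm(X[0] - X[1])
raw_d02 = np.linalg.norm(X[0] - X[2])
# salary dominates: the 19-year experience gap barely registers
print(raw_d01, raw_d02)

# the same distances after standard scaling
X_scaled = StandardScaler().fit_transform(X)
scaled_d01 = np.linalg.norm(X_scaled[0] - X_scaled[1])
scaled_d02 = np.linalg.norm(X_scaled[0] - X_scaled[2])
# both features now contribute to the distance
print(scaled_d01, scaled_d02)
```

On the raw data, the distance is driven almost entirely by salary. After scaling, a large gap in experience counts about as much as a large gap in salary.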

Sklearn **standardscaler** is one of the scaling methods that scale the data in a standard way and make it suitable for machine learning models.

Before scaling, the features can be spread over very different ranges. After scaling, each feature is centered on zero and has a comparable spread.

### What are numeric data scaling methods?

The two most common scaling techniques are normalization and standardization. In normalization, each data point is rescaled to the range 0-1. Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation, so that the distribution has a mean of zero and a standard deviation of one.

Normalization uses the following equation:

y = (x - min) / (max - min)

Here, min and max are the minimum and maximum values of the feature being scaled.
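As a small illustration of min-max normalization, the sketch below rescales a single column by hand with NumPy and compares the result with scikit-learn's `MinMaxScaler`. The values are the first column of the dataset used later in this article; the variable names are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# first column of the example dataset used later in this article
x = np.array([100, 8, 50, 88, 4, 35, 45, 34], dtype=float)

# manual min-max normalization: (x - min) / (max - min)
x_norm = (x - x.min()) / (x.max() - x.min())

# same computation with scikit-learn, which expects a 2D array
x_norm_sk = MinMaxScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(x_norm.min(), x_norm.max())       # 0.0 1.0
print(np.allclose(x_norm, x_norm_sk))   # True
```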

Standardizing a dataset involves rescaling the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1. Standardization works best when the data is roughly normally distributed, although it can still be applied when it is not.

It uses the following equation:

y = (x - mean) / standard deviation

Where the mean and standard deviation are calculated as follows:

| Mean | Std Deviation |
| --- | --- |
| mean = sum(x) / count(x) | std = sqrt( sum( (x - mean)^2 ) / count(x) ) |
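The formulas above can be checked with a few lines of NumPy. This is a minimal sketch that standardizes one column by hand and compares it with `StandardScaler`. Note that the standard deviation divides by count(x) (the population standard deviation), which is what `StandardScaler` uses, whereas pandas' `.std()` divides by count(x) - 1 by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# first column of the example dataset used later in this article
x = np.array([100, 8, 50, 88, 4, 35, 45, 34], dtype=float)

# mean = sum(x) / count(x)
mean = x.sum() / x.size
# std = sqrt(sum((x - mean)^2) / count(x)), i.e. the population standard deviation
std = np.sqrt(((x - mean) ** 2).sum() / x.size)
# y = (x - mean) / std
z = (x - mean) / std

# same computation with scikit-learn, which expects a 2D array
z_sk = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(mean)                   # 45.5
print(np.allclose(z, z_sk))   # True
```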

So far we have covered the theory behind sklearn standardscaler. Now it is time to move to the practical part and implement it.

## Examples of sklearn standardscaler

In this section, we will work through several examples of sklearn standardscaler and scale our data. Before going to the practical part, make sure that you have installed the following Python libraries using the pip command. Note that the package name on PyPI is `scikit-learn`, not `sklearn`. Seaborn is used for the box plots, and openpyxl is needed by pandas to read `.xlsx` files.

```
pip install scikit-learn
pip install pandas
pip install numpy
pip install matplotlib
pip install seaborn
pip install openpyxl
```

### Example 1: sklearn standardscaler on a simple dataset

First, let us create a simple dataset.

```
# importing numpy array
from numpy import asarray
# creating dataset
data = asarray([[100, 0.001],
[8, 0.05],
[50, 0.005],
[88, 0.07],
[4, 0.1],
[35, 1],
[45, 0.006],
[34, 0.3]])
```

As you can see, we have a dataset with two columns, where the first column has much larger values than the second. Let us visualize the data with a box plot to see the distribution.

```
# importing seaborn module
import seaborn as sns
# plotting box plot
sns.boxplot(data=data)
```

The box plot shows a huge difference in the distribution of the two columns. Now, let us apply sklearn standardscaler and scale the dataset.

```
# importing sklearn standardscaler
from sklearn.preprocessing import StandardScaler
# define standard scaler
scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(data)
# plotting the data
sns.boxplot(data=scaled)
```

After scaling, both columns are on the same scale, which makes them directly comparable.
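Besides the plot, the result can be checked numerically. The following sketch refits the scaler on the same dataset and inspects the statistics it learned (`mean_` and `scale_`), confirms that each scaled column has a mean of about 0 and a standard deviation of about 1, and uses `inverse_transform` to recover the original values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# same dataset as in Example 1
data = np.asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07],
                   [4, 0.1], [35, 1], [45, 0.006], [34, 0.3]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# statistics learned during fit, one value per column
print(scaler.mean_)    # column means
print(scaler.scale_)   # column standard deviations

# each scaled column has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))

# inverse_transform maps the scaled values back to the original scale
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, data))   # True
```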

### Example 2: Sklearn standardscaler on specific column

In the first example, we have applied sklearn standardscaler to the whole dataset. In this section, we will learn how we can scale a specific column in sklearn.

We will take the same dataset and apply the sklearn standardscaler to the very first column.

```
# importing sklearn standardscaler
from sklearn.preprocessing import StandardScaler
# define standard scale
scaler = StandardScaler()
# transform only the first column; the slice [:, :1] keeps the array 2D
scaled = scaler.fit_transform(data[:, :1])
# plotting the data
sns.boxplot(data=scaled)
```

After scaling, the first column is centered on zero and ranges from roughly -1.3 to 1.7, while the original values ranged from 4 to 100.
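If the goal is to scale one column while keeping the rest of the dataset unchanged, one option is scikit-learn's `ColumnTransformer`. The sketch below is one possible approach, not the only one. The step name `scale_first` is an arbitrary label chosen for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# same dataset as in Example 1
data = np.asarray([[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07],
                   [4, 0.1], [35, 1], [45, 0.006], [34, 0.3]])

# scale column 0 only and pass the remaining column through untouched
ct = ColumnTransformer(
    [("scale_first", StandardScaler(), [0])],
    remainder="passthrough",
)
result = ct.fit_transform(data)

print(result.shape)                            # (8, 2)
print(np.allclose(result[:, 1], data[:, 1]))   # True: second column is unchanged
```

The transformed columns come first in the output, followed by the passed-through ones, so the column order here stays the same as in the input.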

### Example 3: Sklearn standardscaler on data frame

So far we have scaled the dataset that we created on our own. This time we will read a dataset from an external file and then scale it.

Let us first import the dataset.

```
# importing pandas
import pandas as pd
# importing dataset
data = pd.read_excel('data.xlsx')
# plotting box plot with seaborn
sns.boxplot(data=data)
```

The box plot shows that most of the data is in the range of 20-45. Now, let us apply standard scaling and see the result.

If you pass the `Age` column directly as a Series object, you will get the following error, because a Series is one-dimensional.

```
# define standard scale
scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(data['Age'])
# plotting the data
sns.boxplot(data=scaled)
```

Output:

```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_239177/163014125.py in <module>
6
7 # transform data
----> 8 scaled = scaler.fit_transform(data['Age'])
9
10 # plotting the data
~/.local/lib/python3.10/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
865 if y is None:
866 # fit method of arity 1 (unsupervised transformation)
--> 867 return self.fit(X, **fit_params).transform(X)
868 else:
869 # fit method of arity 2 (supervised transformation)
~/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y, sample_weight)
807 # Reset internal state before fitting
808 self._reset()
--> 809 return self.partial_fit(X, y, sample_weight)
810
811 def partial_fit(self, X, y=None, sample_weight=None):
~/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y, sample_weight)
842 """
843 first_call = not hasattr(self, "n_samples_seen_")
--> 844 X = self._validate_data(
845 X,
846 accept_sparse=("csr", "csc"),
~/.local/lib/python3.10/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
575 raise ValueError("Validation should be done on X, y or both.")
576 elif not no_val_X and no_val_y:
--> 577 X = check_array(X, input_name="X", **check_params)
578 out = X
579 elif no_val_X and not no_val_y:
~/.local/lib/python3.10/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
877 # If input is 1D raise error
878 if array.ndim == 1:
--> 879 raise ValueError(
880 "Expected 2D array, got 1D array instead:\narray={}.\n"
881 "Reshape your data either using array.reshape(-1, 1) if "
ValueError: Expected 2D array, got 1D array instead:
array=[21. 18. 20. 65. 18. 24. 45. 35. 23. 32. 34. 31. 43. 32. 20.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
```

To avoid this error, we need to reshape the column into a 2D array, as the error message suggests:

```
# importing numpy
import numpy as np
# define standard scaler
scaler = StandardScaler()
# reshape the 1D column to shape (n_samples, 1), then transform it
scaled = scaler.fit_transform(np.array(data['Age']).reshape(-1, 1))
# plotting the data
sns.boxplot(data=scaled)
```

This time the scaling works, and the `Age` values are centered on zero with a standard deviation of one.
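An alternative to reshaping is to select the column with double brackets, which returns a one-column DataFrame (2D) instead of a Series (1D). The sketch below uses a stand-in DataFrame built from the `Age` values shown in the error message above, since the contents of `data.xlsx` are not available here. The `Age_scaled` column name is only an illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# stand-in DataFrame holding the Age values from the error message above
df = pd.DataFrame({"Age": [21, 18, 20, 65, 18, 24, 45, 35,
                           23, 32, 34, 31, 43, 32, 20]})

print(df["Age"].ndim)     # 1: a Series, rejected by StandardScaler
print(df[["Age"]].ndim)   # 2: a one-column DataFrame, accepted

scaler = StandardScaler()
# fit_transform returns a 2D array; ravel() flattens it to store as a column
df["Age_scaled"] = scaler.fit_transform(df[["Age"]]).ravel()
print(df.head())
```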

## Summary

Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. In this short article, we learned how we can use sklearn **standardscaler** to scale the dataset in a specific range using various examples.
