# K-means clustering in Python | Visualize and implement

Unsupervised learning uses machine learning algorithms to analyze and group unlabeled datasets, finding hidden patterns or data groupings without human supervision. Two common categories of unsupervised learning algorithms are clustering and association rules. K-means is one of the clustering methods used in machine learning. In this article, we will visualize and implement k-means clustering in Python using various Python modules. Moreover, we will learn how to find the optimum number of clusters for a given dataset.


## What is the k-means clustering algorithm?

Clustering is one of the most popular exploratory data analysis methods used to understand the structure of data. It is used to find homogeneous subgroups within the dataset, so that the points in each subgroup are similar to one another according to a similarity metric such as Euclidean distance or correlation-based distance. For example, let us assume that we have the following dataset.

As you can see, the data seems to be randomly distributed and does not seem to have any relation to the axes. But if we apply the k-means clustering algorithm (with 3 clusters), the model will create three clusters by grouping the data points that are closest to each other, as shown below:

The figure above shows how k-means clustering creates clusters based on the similarities of the data points.

Now you have a basic idea of what clusters are and how they work. Let us now understand k-means clustering in more detail.

K-means is an unsupervised machine learning algorithm that divides a dataset into K distinct, non-overlapping subgroups, also known as clusters. It is also known as a centroid-based algorithm. By specifying the number of clusters, we can easily divide the data into groups: there will be two clusters if K=2, three clusters if K=3, and so on. In this way, K-means can uncover groups in an unlabeled dataset. For now, don’t get confused about K. It is a parameter that defines the number of clusters, and we will discuss how to choose it in the upcoming sections.

The goal of the K-means algorithm is to assign each data point to a cluster, represented by its centroid, while minimizing the sum of squared distances between the data points and the centroids of their respective clusters. Here are some of the key features of the k-means clustering algorithm.

• Iteratively determines the best positions for the centroids (the center points of the K clusters)
• Assigns each data point to its closest centroid. The data points nearest to the same centroid form a cluster
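The quantity that k-means tries to minimize can be written compactly as follows. The notation is introduced here for illustration and is not taken from the original text: $C_k$ is assumed to denote the set of points in cluster $k$, $\mu_k$ its centroid, and distances are assumed to be Euclidean.

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
$$

Under these assumptions, the inner sum measures how spread out one cluster is around its centroid, and the outer sum adds this up over all K clusters.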

## How does the k-means clustering algorithm work?

The k-means clustering algorithm works using the following steps:

• Define the number of clusters based on the provided K value
• Select K random points as the initial centroids
• Form the K clusters by assigning each data point to its closest centroid
• Calculate the mean of the points in each cluster and use it as the new centroid of that cluster
• Repeat the process from the third step, reassigning each data point to the new closest centroid, until the assignments no longer change
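The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by scikit-learn: the function name `kmeans_sketch`, its parameters, and the synthetic data are all assumptions made for this example, and the initial centroids are assumed to be drawn from the data points themselves.

```python
import numpy as np


def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Toy k-means loop; names and defaults are illustrative only."""
    rng = np.random.default_rng(seed)
    # second step: pick k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # third step: assign every point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # fourth step: move each centroid to the mean of its assigned points
        # (an empty cluster simply keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # fifth step: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# two well-separated synthetic groups of points, made up for this example
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans_sketch(points, k=2)
print(centroids)
```

On data this clearly separated, the two centroids should end up near the centers of the two groups. On real data the result can depend on the initial centroids, which is why the random state matters later in this article.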

Let us now understand the working of the k-means clustering algorithm step by step. Let us assume that we have the following dataset.

The very first step in K-means clustering is to define the K value. As we discussed earlier, K is a parameter and must be defined by us. For simplicity, in this section, we will take the value of K as 2. So, the algorithm will choose K random points (in our case 2) as the initial centroids of the clusters. These points can be either points from the dataset or any other points. Let’s say the algorithm selects the following two centroids:

The orange squares show the centroids randomly selected by the model. Now the model calculates the distance between each data point and each centroid and assigns every point to its nearest centroid, which draws a dividing line between the two clusters:

As you can see, two clusters have been created based on the randomly selected centroids. The K-means clustering algorithm then computes a new centroid for each cluster as the center of gravity (mean) of its points, reassigns each data point to the nearest new centroid, and obtains a new dividing line between the clusters. For example, in our case, the new centroids shift toward the denser regions of data points.

The centroids keep moving in this way until the algorithm converges, that is, until the cluster assignments stop changing.

## What is the elbow method in k-means clustering?

The number of clusters has a significant impact on the quality of the K-means clustering result, so selecting the right value for K is crucial. Training the model with many different numbers of clusters and inspecting each result by hand is time-consuming; the elbow method makes this search more systematic. It computes the Within-Cluster Sum of Squares (WCSS), the sum of squared distances between each data point and the centroid of its cluster, for a range of K values and plots WCSS against K.
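To make WCSS concrete, the following sketch computes it by hand and compares it with the `inertia_` attribute of a fitted scikit-learn `KMeans` model. The helper name `within_cluster_sum_of_squares` and the synthetic data are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans


def within_cluster_sum_of_squares(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((points - centroids[labels]) ** 2).sum())


# small synthetic dataset, made up for this example
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
manual = within_cluster_sum_of_squares(points, km.labels_, km.cluster_centers_)

# the hand-computed value and scikit-learn's inertia should agree up to rounding
print(manual, km.inertia_)
```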

The simplest way to understand the elbow method is to visualize it.

The point where the curve bends sharply, like the elbow of an arm, is taken as the best value for K. One limitation of the elbow method is that it does not always work, especially when the data is categorical.

### When to apply the k-means clustering algorithm?

K-means clustering works best on datasets that are randomly distributed and have no predefined output classes. The problem that can arise when k-means is applied to data with categorical output values is that some samples of a category can end up in different clusters. For example, let us assume that we have the following dataset.

As you can see, our dataset has three categorical classes. Now, we will apply k-means clustering to the above dataset with 3 clusters. We might expect each of the categories to be in a different cluster, but that is not what happens when k-means is applied to such a dataset. Because k-means clustering works by creating centroids, it fails to classify such data correctly. This is shown in the graph below.

As you can see, the clusters have been created around the centroids, and they do not match the original classes. So, it is a good idea not to apply the k-means clustering algorithm to a classification dataset when the output class is categorical.

## Implementing k-means clustering in Python

By now, you have a basic understanding of k-means clustering. Now, it is time to implement the concept in Python and visualize the clusters. Before going to the implementation part, make sure that you have installed the following Python modules, as we will be using them in the implementation.

• scikit-learn (imported as `sklearn`)
• numpy
• pandas
• matplotlib
• plotly

We will use these modules in the implementation part.

First, let us import the dataset and print a few rows in order to get familiar with the type of dataset.

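```
# import pandas
import pandas as pd

# importing dataset
# NOTE: the file path is not shown in this article;
# "path/to/dataset.csv" is a placeholder to replace with your own file
dataset = pd.read_csv("path/to/dataset.csv")

# printing the first few rows of the dataset
print(dataset.head())
```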
Output:

As we discussed earlier, k-means clustering works better when the data is dispersed rather than categorical. For that purpose, we will take only two attributes of our dataset: we will train the model on the annual income and spending score columns. But first, let us visualize the data points.

```
# creating a new dataset with only two columns
data = dataset.loc[:, ['Annual Income (k$)', 'Spending Score (1-100)']]

# importing the module
import matplotlib.pyplot as plt

# figure size
plt.figure(figsize=(10, 5))

# plotting the scatter graph
plt.scatter(x=data['Annual Income (k$)'], y=data['Spending Score (1-100)'], c='m')

# labeling the axes
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)');
```

Output:

As you can see, the data is more dispersed.

## k-means clustering in Python with 2 clusters

As an example and for simplicity, we will first train the k-means clustering model with 2 clusters. First, we import `KMeans` from scikit-learn and then initialize the model with 2 clusters.

```
# importing the k-means
from sklearn.cluster import KMeans

# initializing k-means clustering in Python with two clusters
km = KMeans(n_clusters=2)

# training the k-means clustering in Python
km.fit(data)
```
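Once a model is fitted, its results can be inspected through a few attributes. The sketch below uses synthetic points as a stand-in for the two-column dataset, so the variable names `sample` and `new_points` and all the numbers are assumptions made for this example; `cluster_centers_`, `labels_`, and `predict` are part of the scikit-learn `KMeans` API.

```python
import numpy as np
from sklearn.cluster import KMeans

# stand-in for the two-column dataset: two synthetic groups of points
rng = np.random.default_rng(1)
sample = np.vstack([
    rng.normal(loc=(20.0, 80.0), scale=3.0, size=(40, 2)),
    rng.normal(loc=(90.0, 15.0), scale=3.0, size=(40, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sample)

# one (x, y) centroid per cluster
print(km.cluster_centers_)
# cluster index assigned to the first five training points
print(km.labels_[:5])

# predict() returns the index of the nearest centroid for unseen points
new_points = np.array([[22.0, 78.0], [88.0, 17.0]])
predicted = km.predict(new_points)
print(predicted)
```

Note that the cluster indices themselves carry no meaning: which group is called 0 and which is called 1 can change from run to run.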

Once the training is complete, we can visualize the clusters using the `matplotlib` module.

```
# plotting the graph of the clusters
plt.figure(figsize=(10, 5))
plt.scatter(x=data.iloc[:, 0], y=data.iloc[:, 1], c=km.labels_)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)');
```

Output:

As you can see, the model has split the dataset into two clusters. Another parameter of k-means clustering in Python is the random state. It can have a great impact on the formation of clusters, especially when the model uses a single random initialization (`init='random'`, `n_init=1`). For example, if we change the random state, the formation of clusters can be different. Let us verify it using different random states for the model. We will use subplots to show the differences.

```
# plotting side by side in subplots
fig, ax = plt.subplots(1, 2, gridspec_kw={'wspace': 0.3}, figsize=(15, 5))

# k-means clustering in Python with two different random states
for i in range(2):
    km = KMeans(n_clusters=2, init='random', n_init=1, random_state=i)
    km.fit(data)
    ax[i].scatter(x=data.iloc[:, 0], y=data.iloc[:, 1], c=km.labels_);
```

Output:

As you can see, the random state affects how the clusters are formed. So, while training the model, set the random state to a fixed value so that the results are reproducible.
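A quick way to check reproducibility is to fit the model twice with the same random state and compare the labels. This is a small sketch on synthetic stand-in data; the helper `fit_labels` and the variable names are hypothetical and introduced only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic stand-in data, made up for this example
rng = np.random.default_rng(7)
sample = rng.uniform(0, 100, size=(150, 2))


def fit_labels(seed):
    """Fit k-means with a single random initialization and return the labels."""
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    return km.fit(sample).labels_


first_run = fit_labels(0)
second_run = fit_labels(0)

# with the same random state, both runs should produce the same labels
print(np.array_equal(first_run, second_run))
```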

### How to evaluate k-means clustering in Python?

As we have seen above, the k-means clustering model created different clusters for the same data when we changed the random state. So how do we know which clustering is better? In such cases, inertia helps us. Inertia is the within-cluster sum of squares: the sum of squared distances between each point and the centroid of its cluster. For the same number of clusters, the lower the inertia, the tighter the clusters are. So, let us find the inertia for the clusters.

```
# creating 2 subplots
fig, ax = plt.subplots(1, 2, gridspec_kw={'wspace': 0.3}, figsize=(15, 5))

# k-means clustering in Python with two different random states
for i in range(2):
    km = KMeans(n_clusters=2, init='random', n_init=1, random_state=i)
    km.fit(data)
    ax[i].scatter(x=data.iloc[:, 0], y=data.iloc[:, 1], c=km.labels_)
    # using the inertia value as the subplot title
    ax[i].set_title(f"Inertia = {round(km.inertia_, 2)}");
```

Output:

As you can see, the first plot has a lower inertia value, which means its clusters fit the data more tightly.
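The comparison can also be done without plots by fitting one model per random state and keeping the one with the lowest inertia. The following is a sketch on synthetic stand-in data; the variable names and the choice of five seeds are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic stand-in data, made up for this example
rng = np.random.default_rng(3)
sample = rng.uniform(0, 100, size=(200, 2))

# fit one model per random state, each with a single random initialization
models = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(sample)
    for seed in range(5)
]
inertias = [model.inertia_ for model in models]

# keep the run with the lowest inertia
best = models[int(np.argmin(inertias))]
print(inertias)
print(best.inertia_)
```

The `n_init` parameter of `KMeans` does something similar internally: it runs the algorithm several times with different initial centroids and keeps the run with the lowest inertia.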

### How to find optimum clusters in k-means clustering in Python?

As we discussed earlier, we can use the elbow method to get the optimum number of clusters in k-means clustering. Let us plot the elbow method on our dataset and find the optimum number of clusters.

```
# WCSS (inertia) for each number of clusters
wcss = []

# looping over the candidate numbers of clusters
for i in range(2, 15):
    # training k-means clustering in Python with i clusters
    km = KMeans(n_clusters=i)
    km.fit(data)
    wcss.append(km.inertia_)

# plotting the elbow graph
plt.plot(range(2, 15), wcss, 'og-')
plt.annotate('optimum clusters', xy=(5, 50000), xytext=(6, 100000), arrowprops=dict(facecolor='blue', shrink=0.05))

# labeling the axes
plt.xlabel("Number of clusters")
plt.ylabel("Inertia");
```

Output:

As you can see, the elbow method suggests that 5 is the optimum number of clusters.
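Reading the elbow from a plot is a judgment call. One simple heuristic, sketched below, is to pick the point where the slope of the curve changes the most, measured by the second difference of the WCSS values. This is only one of several possible heuristics and is not part of scikit-learn; the function name `elbow_by_second_difference` and the WCSS numbers are made up for this example.

```python
import numpy as np


def elbow_by_second_difference(ks, wcss_values):
    """Return the k at which the WCSS curve bends most sharply (a rough heuristic)."""
    # second differences measure how much the slope changes at each interior point
    curvature = np.diff(wcss_values, n=2)
    # curvature[j] belongs to the point at index j + 1
    return ks[int(np.argmax(curvature)) + 1]


# made-up WCSS values with an obvious bend at k = 3
ks = [1, 2, 3, 4, 5, 6]
wcss_values = [1000.0, 500.0, 120.0, 100.0, 85.0, 75.0]
elbow = elbow_by_second_difference(ks, wcss_values)
print(elbow)
```

With the variables from the loop above, the call would be `elbow_by_second_difference(list(range(2, 15)), wcss)`. The result is only a hint and should be checked against the plot.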

### K-means clustering in Python with 5 clusters

Let us now train the model again with 5 clusters.

```
# importing the k-means
from sklearn.cluster import KMeans

# initializing k-means clustering in Python with five clusters
km = KMeans(n_clusters=5)

# training the k-means clustering in Python
km.fit(data)

# plotting the graph of the clusters
plt.figure(figsize=(10, 5))
plt.scatter(x=data.iloc[:, 0], y=data.iloc[:, 1], c=km.labels_)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)');
```

Output:

As you can see, the dataset has been split into 5 different optimum clusters.

### Finding optimum clusters

We found that the optimum number of clusters is 5. In this section, we will use inertia to find the best cluster positions. As we have seen before, the positions of the clusters change as we change the random state. So let us find the optimum positions for the clusters.

```
# creating 5 subplots
fig, ax = plt.subplots(1, 5, gridspec_kw={'wspace': 0.3}, figsize=(15, 5))

# k-means clustering in Python with five different random states
for i in range(5):
    km = KMeans(n_clusters=5, init='random', n_init=1, random_state=i)
    km.fit(data)
    ax[i].scatter(x=data.iloc[:, 0], y=data.iloc[:, 1], c=km.labels_)
    # using the inertia value as the subplot title
    ax[i].set_title(f"Inertia = {round(km.inertia_, 2)}");
```

Output:

As you can see, we have the lowest inertia value when the random state is either 0 or 1.

## Summary

K-means clustering is a method used for clustering analysis, especially in data mining and statistics. It aims to partition a set of observations into a number of clusters (k), resulting in the partitioning of the data into Voronoi cells. In this article, we discussed how k-means clustering works. We also learned how to get the optimum number of clusters using the elbow method. Moreover, we implemented k-means clustering in Python on a real dataset.
