Hierarchical clustering

Introduction

Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. It does not require specifying the number of clusters in advance, and its results can be visualized with a dendrogram. There are two main types of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down).

Here's an overview of hierarchical clustering, focusing on the agglomerative approach:

Agglomerative hierarchical clustering

In agglomerative hierarchical clustering, each data point starts in its own cluster, and the algorithm successively merges the closest pair of clusters until only one cluster remains.

 

  1. Distance Matrix
  • The algorithm begins by calculating the distance between each pair of data points, creating a distance matrix (a minimal sketch follows this item).
  • Pairwise distances can be measured using various metrics, such as Euclidean distance, Manhattan distance, or correlation distance.
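
A minimal sketch of this step, using scipy's pdist and squareform on a small made-up array (the five points below are illustrative, not the data from the example later in this post):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# five 2-D points, chosen only for illustration
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])

# pdist returns the condensed pairwise distances; squareform
# unfolds them into the full symmetric n-by-n distance matrix
dist_matrix = squareform(pdist(points, metric='euclidean'))
print(dist_matrix.round(2))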

 

  2. Cluster Similarity
  • At each iteration, the algorithm identifies the two clusters with the smallest distance between them, where cluster-to-cluster distance is defined by a linkage criterion such as single, complete, average, or Ward linkage (see the sketch after this item).
  • These clusters are merged into a single cluster, and the distance matrix is updated to reflect the distances between the new cluster and the remaining clusters.
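
A hedged sketch of how the choice of linkage criterion affects the merge sequence, reusing a small made-up array (the method names are scipy's own):

import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])

# each row of a linkage matrix records one merge:
# [cluster_i, cluster_j, merge_distance, new_cluster_size]
for method in ('single', 'complete', 'average', 'ward'):
    Z = linkage(points, method=method)
    print(method, '-> final merge at distance', round(Z[-1, 2], 2))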

 

  3. Dendrogram
  • A dendrogram is a tree-like diagram that illustrates the hierarchical relationships between clusters.
  • The height of each vertical line in the dendrogram represents the distance between the two clusters at the time of merging.
  • By cutting the dendrogram at different heights, you can obtain different numbers of clusters (see the sketch after this item).
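
A minimal sketch of cutting the tree at different heights with scipy's fcluster and the 'distance' criterion (the cut heights here are made-up values for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
Z = linkage(points, method='ward')

# cutting low yields many small clusters; cutting high yields few large ones
for height in (0.5, 2.0, 10.0):
    labels = fcluster(Z, t=height, criterion='distance')
    print('cut at', height, '->', labels)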

 

  4. Stopping Criteria
  • The algorithm continues merging clusters until all data points are in a single cluster, or until a stopping criterion is met (a sketch follows this item).
  • Common stopping criteria include reaching a specified number of clusters or a threshold distance between clusters.
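
As a sketch of a distance-based stopping criterion, scikit-learn's AgglomerativeClustering can stop merging at a distance threshold instead of a fixed cluster count (the threshold of 2.0 below is an illustrative assumption):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])

# n_clusters must be None when distance_threshold is given;
# merging stops once the next merge would exceed the threshold
model = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0)
labels = model.fit_predict(points)
print(labels)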

 

  5. Choosing the number of clusters
  • The number of clusters in hierarchical clustering is not pre-specified but can be determined by inspecting the dendrogram.
  • One common approach is to find the tallest vertical stretch in the dendrogram that is not crossed by any horizontal merge line and cut at that height; equivalently, look for the largest gap between successive merge distances in the linkage matrix (see the sketch after this item).
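
A minimal sketch of that largest-gap heuristic, computed directly from the linkage matrix (this is one common rule of thumb, not the only way to choose):

import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
Z = linkage(points, method='ward')

# merge distances sit in column 2 of Z, in increasing order;
# the largest jump between consecutive merges suggests where to cut
heights = Z[:, 2]
k = len(points) - (np.argmax(np.diff(heights)) + 1)
print('suggested number of clusters:', k)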

 

  6. Implementation
  • Hierarchical clustering algorithms are available in various Python libraries, such as scipy and scikit-learn.
  • These libraries provide functions to compute the hierarchical clustering and to visualize the resulting dendrogram.

Hierarchical clustering is useful for exploring the hierarchical structure of the data and identifying nested clusters at different scales. It is commonly used in biology for gene expression analysis, in social sciences for clustering individuals based on similarity, and in document clustering for organizing text documents into topics or themes.

Let's walk through an example of hierarchical clustering using the AgglomerativeClustering algorithm from scikit-learn:

  1. We import the necessary libraries and modules, including NumPy for numerical operations, Matplotlib for visualization, and scikit-learn for generating synthetic data and performing hierarchical clustering.

 

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

 

  2. We generate synthetic data using the make_blobs function from scikit-learn. We create 300 data points in 4 clusters, each with a standard deviation of 0.6.

 

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

 

  3. We perform hierarchical clustering using the linkage function from the scipy.cluster.hierarchy module. We use the 'ward' method, which at each step merges the pair of clusters that yields the smallest increase in total within-cluster variance.

 

linkage_matrix = linkage(X, method='ward')

 

  4. We plot the dendrogram using the dendrogram function from the scipy.cluster.hierarchy module. The dendrogram visualizes the hierarchical clustering process and helps in determining the number of clusters.

 

plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

 

  5. We perform agglomerative clustering using the AgglomerativeClustering class from scikit-learn. We specify the number of clusters as 4.

 

n_clusters = 4  # number of clusters to extract
agg_clustering = AgglomerativeClustering(n_clusters=n_clusters)
agg_labels = agg_clustering.fit_predict(X)
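
As an optional cross-check (an addition of mine, not part of the original walkthrough), cutting the scipy linkage matrix into four clusters with fcluster should recover the same partition, up to a relabeling of the clusters:

from scipy.cluster.hierarchy import fcluster

scipy_labels = fcluster(linkage_matrix, t=4, criterion='maxclust')
print(len(set(scipy_labels)))  # 4 clusters, matching AgglomerativeClustering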

 

  6. We visualize the clusters by plotting the data points colored according to their cluster assignments obtained from the agglomerative clustering algorithm.

 

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=agg_labels, cmap='viridis', s=50, alpha=0.8)
plt.title('Agglomerative Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

This example demonstrates how to perform hierarchical clustering and visualize the resulting dendrogram and clusters. Hierarchical clustering allows us to explore the hierarchical structure of the data and identify clusters at different levels of granularity, providing insights into the underlying data structure.
