k-Nearest Neighbors (k-NN)
Introduction
k-Nearest Neighbors (k-NN) is a non-parametric and lazy learning algorithm used for both classification and regression tasks. It’s based on the idea that data points with similar features tend to belong to the same class or have similar target values.
Here's how k-Nearest Neighbors works:
- Data Representation
Each data point in the dataset is represented as a point in a multi-dimensional feature space, where each dimension corresponds to a feature or attribute.
- Distance Metric
k-NN relies on a distance metric (e.g., Euclidean distance, Manhattan distance) to measure the similarity between data points in the feature space. The choice of distance metric depends on the nature of the data and the problem at hand (a short sketch after this list shows the distance computation together with the neighbor search and voting/averaging steps).
- k-Nearest Neighbors
Given a new data point, the algorithm identifies the k nearest neighbors to that point based on the chosen distance metric. These neighbors are the data points with the smallest distances to the new point.
- Classification
For classification tasks, the algorithm assigns the majority class among the k nearest neighbors to the new data point. The class label of the new point is determined by a majority vote among its neighbors.
- Regression
For regression tasks, the algorithm predicts the target value of the new data point by averaging the target values of its k nearest neighbors.
- Hyperparameter Tuning (k)
The number of neighbors k is a hyperparameter that needs to be specified by the user. Choosing the right value of k is crucial and can significantly impact the performance of the algorithm. A smaller value of k can lead to a more flexible model with higher variance but lower bias, while a larger value of k can lead to a smoother decision boundary with lower variance but higher bias.
- Scaling
It's important to scale the features before applying k-NN so that features measured on large scales don't dominate the distance computation and bias the neighbor search; a minimal pipeline that handles this is sketched just before the worked example below.
- Computational Complexity
The main drawback of k-NN is its computational cost at prediction time: a naive implementation computes the distance from the query point to every training point, which becomes slow for large training sets, especially in many dimensions. Index structures such as KD-trees and ball trees can speed this up when the dimensionality is low to moderate.
- Curse of Dimensionality
In high-dimensional feature spaces, the concept of proximity becomes less meaningful, and the performance of k-NN can deteriorate due to the curse of dimensionality.
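To make the steps above concrete, here is a minimal NumPy sketch of the whole procedure: distance computation, neighbor search, and the voting/averaging step. The function name, its keyword arguments, and the toy data are made up purely for illustration; this is not how scikit-learn implements k-NN internally.
import numpy as np
# Illustrative helper (not part of any library): predict for a single query point
def knn_predict(X_train, y_train, x_new, k=3, metric="euclidean", task="classification"):
    # Distance from the query point to every training point
    if metric == "euclidean":
        dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    else:  # Manhattan distance
        dists = np.abs(X_train - x_new).sum(axis=1)
    # Indices of the k smallest distances, i.e. the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    if task == "classification":
        # Majority vote among the neighbors' class labels
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]
    # Regression: average the neighbors' target values
    return y_train[nearest].mean()
# Toy data: four labeled 2-D points and one query point
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([0.9, 0.2]), k=3))  # prints 1
scikit-learn's KNeighborsClassifier and KNeighborsRegressor implement the same idea with efficient neighbor-search structures, and the classifier is what we use in the worked example below.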
Despite its simplicity, k-NN can be effective for a wide range of classification and regression tasks, especially for small to medium-sized datasets with low dimensionality. However, its performance can degrade significantly in high-dimensional spaces or when dealing with imbalanced datasets.
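Because k-NN is distance based, the scaling step mentioned above is usually handled by standardizing the features inside a pipeline. Here is a minimal sketch using scikit-learn's StandardScaler and make_pipeline; the X_train, y_train, X_test, and y_test variables are assumed to come from a train/test split like the one in the worked example below.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Standardize each feature to zero mean and unit variance before the
# distance-based classifier, so no single feature dominates the distances.
# X_train, y_train, X_test, y_test are assumed to come from a split as in the example below.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)
print(scaled_knn.score(X_test, y_test))  # mean accuracy on the held-out data
In the synthetic example below, both features are already on comparable scales, so the scaling step is omitted there for brevity.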
Here's an example of how to use k-Nearest Neighbors (k-NN) for classification using the KNeighborsClassifier class from scikit-learn:
- We import the necessary libraries: NumPy and Matplotlib for numerical work and plotting, plus the scikit-learn modules for data generation, splitting, modeling, and evaluation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
- We generate synthetic data using the make_classification function from scikit-learn. This function creates a random binary classification problem with a specified number of samples, features, and classes; because we use only two features, we explicitly mark both as informative and none as redundant.
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_classes=2, random_state=42)
- We split the data into training and testing sets using the train_test_split function, holding out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- We create an instance of the KNeighborsClassifier class with n_neighbors=5 (5 nearest neighbors) and fit it to the training data.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
- We make predictions on the test data using the predict method.
y_pred = model.predict(X_test)
- We evaluate the model's performance using metrics such as accuracy, confusion matrix, and classification report.
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
- Finally, we visualize the decision boundary of the k-NN classifier by predicting the class of every point on a fine grid and overlaying the data as a scatter plot. The decision boundary separates the two classes based on the majority class of the nearest neighbors.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, cmap='bwr', alpha=0.2)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', marker='o', alpha=0.6, label='Actual')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('k-Nearest Neighbors: Decision Boundary')
plt.legend()
plt.show()
This example demonstrates how to use k-Nearest Neighbors for a binary classification task. You can adjust the hyperparameter n_neighbors to control the number of neighbors considered during classification and optimize the model's performance.
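One common way to pick n_neighbors is a cross-validated grid search. The sketch below reuses the training and test splits from the example above; the candidate values for k are an arbitrary illustrative grid.
from sklearn.model_selection import GridSearchCV
# 5-fold cross-validated search over a range of neighbor counts.
# Reuses X_train, y_train, X_test, y_test and KNeighborsClassifier from the example above.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 15, 21]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print("Best k:", search.best_params_["n_neighbors"])
print("Cross-validated accuracy:", search.best_score_)
print("Test accuracy with best k:", search.score(X_test, y_test))
If you use the scaling pipeline shown earlier, the same search works by prefixing the parameter name with the pipeline step, e.g. kneighborsclassifier__n_neighbors.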