Introduction to Machine Learning: k-Nearest Neighbors

Introduction

When considering machine learning models, a fundamental distinction is whether the model is supervised or unsupervised, and this is determined by the nature of the problem being addressed. For a more detailed comparison between supervised and unsupervised models, refer to this resource. Among the many machine learning models available, supervised regression is a common starting point. However, regression is not suitable for every scenario and does not always generate accurate predictions.

That is where the k-Nearest Neighbors (kNN) algorithm can be a useful addition to your machine learning toolbox. kNN is a supervised learning model that predicts a target variable from one or more independent variables.

Learning Objective

By the end of this article, you should be able to:

  • Describe basic kNN algorithm concepts

  • Use common scikit-learn functions to perform kNN classification or regression

Introduction To k-Nearest Neighbors

In machine learning, kNN is an algorithm used for both classification and regression problems. It is non-parametric and lazy, which means it makes no assumptions about the data's underlying distribution and relies solely on the training data when generating predictions.

The kNN algorithm works by finding the k nearest data points in the training set to a given test data point, based on some distance metric (such as Euclidean distance). Then, for classification problems, the algorithm predicts the class of the test data point as the majority class among the k nearest neighbors. For regression problems, the algorithm predicts the value of the test data point as the mean or median value of the target variable among the k nearest neighbors.
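The classification case described above can be sketched in a few lines with scikit-learn. The toy data here is purely illustrative: two well-separated clusters, with `k=3` so each prediction is decided by a majority vote among the three nearest training points under Euclidean distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: two features, two well-separated classes.
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# k=3: each prediction is the majority class among the 3 nearest
# neighbors, measured with Euclidean distance (the default metric).
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# One query point near each cluster.
print(model.predict([[2, 2], [8, 7]]))  # → [0 1]
```

For regression, `KNeighborsRegressor` works the same way but averages the target values of the k neighbors instead of taking a vote.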
Advantages:

  1. Simple

  2. Training is trivial

  3. Works with any number of classes

  4. Few parameters

Disadvantages:

  1. High prediction cost

  2. Does not work well with datasets with many columns

  3. Categorical features do not work as well

The video below explains the concept well.

Make sure you understand the content in the video above! I recommend coding along while viewing the content, and pausing the video whenever you need more time. If you haven’t yet installed Jupyter Notebook on your machine, you can refer to this guide.

Below, I have provided some resources that may be useful, including the scikit-learn documentation and further details on some of the more complex ideas briefly mentioned in the lessons.

Cleaning The Data

Watch the video below:

In this step, we will clean and analyze a dataset on adults in the United States. Download the data here.

Let’s kickstart our analysis. First, import the libraries needed for this step, then do the necessary cleaning and analysis of the dataset.
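A minimal cleaning sketch with pandas is shown below. The inline `StringIO` sample is only a stand-in for the downloaded file (replace it with `pd.read_csv(...)` pointing at your copy of the data), and the assumption that missing values are encoded as `"?"` is a common convention in the adult dataset, but verify it against your file.

```python
import io
import pandas as pd

# Stand-in for the downloaded file; in practice use
# pd.read_csv("path/to/your/adult/data.csv") instead.
csv_data = io.StringIO(
    "age,workclass,income\n"
    "39,State-gov,<=50K\n"
    "50,?,>50K\n"
    "39,State-gov,<=50K\n"
)
df = pd.read_csv(csv_data)

# The adult data commonly marks missing values with "?": convert those
# to NaN, drop the affected rows, then remove exact duplicate rows.
df = df.replace("?", pd.NA).dropna().drop_duplicates()

print(df.shape)  # → (1, 3): one "?" row and one duplicate removed
```

`df.info()` and `df.describe()` are also worth running at this point to check dtypes and spot obviously bad values.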

Exploratory Data Analysis

In this step, we will perform exploratory data analysis (EDA). In the context of kNN, EDA can help identify the optimal value of k for the dataset: plot the accuracy of the model for different values of k and select the value of k that gives the highest accuracy.

Next, we will also be scaling the data. Scaling is the process of transforming the data so that all features have a similar range or distribution. This is important in kNN because the distance metric used in kNN is sensitive to the scale of the features.
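Scaling can be illustrated with scikit-learn's `StandardScaler`, which standardizes each feature to zero mean and unit variance. The two-feature example below is hypothetical, but it shows why this matters for kNN: without scaling, the large-valued feature would dominate the Euclidean distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. age vs. income):
# unscaled, the second column would dominate any distance calculation.
X = np.array([[25, 30000.0], [40, 60000.0], [55, 90000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After standardization each column has mean 0 and unit variance.
print(X_scaled.mean(axis=0))  # ≈ [0. 0.]
print(X_scaled.std(axis=0))   # ≈ [1. 1.]
```

Note that the scaler should be fit on the training set only, and then applied to the test set with `transform`, to avoid leaking test-set information.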

Training And Modelling

To train and model in kNN, various steps must be taken, including preprocessing the data, splitting the dataset into training and testing sets, choosing the value of k, training the model, making predictions, evaluating the performance, and tuning the model if necessary.
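The steps above can be sketched end to end. Synthetic data from `make_classification` stands in for the cleaned, scaled adult dataset here, and `k=5` is just a placeholder choice; everything else follows the standard scikit-learn split/fit/predict/evaluate pattern.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the preprocessed dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Split into training and testing sets (70% / 30%).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Choose k, train the model, make predictions, evaluate performance.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, pred):.3f}")
```

Tuning the model, covered below, mostly amounts to repeating the fit/evaluate loop for different values of k.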

Choosing an appropriate value of k is a critical step in the kNN algorithm. A small value of k may result in overfitting, while a large value of k may result in underfitting.

Do note that the training process could take a while to load due to a large amount of data.

Retraining The Model

Watch the video for this step.

In this method, the performance of the kNN model is evaluated for different values of k, and the value of k that yields the highest accuracy is selected. To do this, we plot the performance metric against the value of k and examine the plot to identify the point at which performance begins to plateau. That value of k is then chosen as the optimal value for the model.
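This elbow-style search can be sketched as a simple loop. As before, synthetic data stands in for the prepared dataset, so the best k found here will not be the article's k=29; the point is the mechanism of computing and plotting the test-set error rate for each candidate k.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Test-set error rate for each candidate k.
ks = range(1, 40)
error_rates = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates.append((knn.predict(X_test) != y_test).mean())

# Plot error rate vs. k and look for the elbow / lowest point.
plt.plot(ks, error_rates, marker="o")
plt.xlabel("k")
plt.ylabel("error rate")
plt.title("Error rate vs. k")
plt.savefig("error_vs_k.png")

best_k = list(ks)[error_rates.index(min(error_rates))]
print("lowest-error k:", best_k)
```

A cross-validated search (e.g. `GridSearchCV`) is a more robust variant of the same idea, since it does not reuse a single test split for model selection.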

From the previous step, the plot shows that k=29 has the lowest error rate, so we will retrain the model with this value.

To evaluate the performance of the kNN model, we will use the classification report, which summarizes the precision, recall, F1-score, and support for each class in the target variable.
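Producing the report is one call to `classification_report`. The synthetic data below again stands in for the real dataset; `n_neighbors=29` mirrors the value chosen in the article, though on this stand-in data it is not necessarily optimal.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for the prepared dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Retrain with the k chosen from the error-rate plot (29 in the article).
knn = KNeighborsClassifier(n_neighbors=29).fit(X_train, y_train)
pred = knn.predict(X_test)

# Precision, recall, F1-score, and support for each class.
print(classification_report(y_test, pred))
```

The support column is simply the number of test examples in each class, which is useful context when judging per-class precision and recall.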

Conclusion

Machine learning is a complex field that requires a fundamental understanding of various algorithms and models; kNN and regression are two popular examples discussed here. When working with machine learning models, it is essential to consider whether a supervised or unsupervised model is the most appropriate for the given problem. Before training and modeling with kNN, various steps must be taken, such as data preprocessing, splitting the dataset into training and testing sets, selecting the value of k, and evaluating model performance. Finally, it is always helpful to have access to external resources and to explore the data through exploratory data analysis (EDA) before diving into modeling.