3.6 Feature Selection: k-Nearest Neighbors (k-NN)


The fourth method for addressing feature selection is the k-nearest neighbors (k-NN) method, which works by finding the k known training cases closest to a new case and then combining their known answers (e.g., by averaging) to estimate the new case's value or class. If there are M input features, the cases are points in an M-dimensional space.
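
For concreteness, the sketch below implements this averaging step under stated assumptions: the function name knn_predict, the use of Euclidean distance, and the default of k = 5 are illustrative choices rather than part of the reading.

    import numpy as np

    def knn_predict(X_train, y_train, x_new, k=5):
        """Estimate the value of x_new by averaging the targets of its k
        nearest training cases (Euclidean distance in M-dimensional space)."""
        # Distance from the new case to every known training case
        distances = np.linalg.norm(X_train - x_new, axis=1)
        # Indices of the k closest training cases
        nearest = np.argsort(distances)[:k]
        # Combine their known answers by averaging; for classification,
        # a majority vote over y_train[nearest] would be used instead
        return y_train[nearest].mean()

    # Example: three training cases with two features each (M = 2)
    X_train = np.array([[1.0, 2.0], [2.0, 1.0], [10.0, 10.0]])
    y_train = np.array([1.0, 2.0, 30.0])
    print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=2))  # 1.5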

It is important to consider the set of input variables carefully before using k-NN, but the effort can be worthwhile because k-NN can track unusual surfaces, is intuitive to explain, and involves no model fitting. Using a representative subset of the sample is also helpful, because k-NN estimation time increases nonlinearly with the number of cases.

Advantages of k-NN are that this method: 

  • is straightforward to understand; 
  • makes no assumptions about the sample or model parameters, as it is a nonparametric method; 
  • is relatively robust to outliers, as their influence is contained within a neighborhood; and 
  • continuously evolves as new data are presented. 

Disadvantages of k-NN are that this method: 

  • requires the researcher to identify the optimal number of neighbors, the hyperparameter k (a common tuning approach is sketched after this list); 
  • does not perform well when there is a substantial imbalance in the data (e.g., few bankruptcies in a dataset used to predict bankruptcies); 
  • can be seriously harmed by irrelevant variables, because they add noise to the true distance between any two points; and 
  • is slow when the dataset is large.
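
One common way to address the first disadvantage is to tune k by cross-validation. The sketch below is illustrative only: it assumes scikit-learn is available, uses a synthetic dataset in place of a real research sample, searches an arbitrary range of 1 to 20 neighbors, and standardizes the features so that no single variable dominates the distance calculation (standardizing helps with the distance-distortion issue but does not remove truly irrelevant variables).

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Illustrative dataset; in practice X and y come from the research sample
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Score each candidate k with 5-fold cross-validation; standardizing the
    # features keeps any one variable from dominating the distance measure
    scores = {}
    for k in range(1, 21):
        model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
        scores[k] = cross_val_score(model, X, y, cv=5).mean()

    best_k = max(scores, key=scores.get)
    print(f"Best k: {best_k} (mean CV accuracy {scores[best_k]:.3f})")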