K Nearest Neighbors


Machine Learning Model - K Nearest Neighbor
The KNeighborsClassifier model categorizes a data point by assigning it to the class that is most common among the data points nearest to it. The k value is selected manually by the programmer to optimize the model and can vary depending on the data set. Typically, a larger k value reduces the effect of noise, but it also blurs the boundaries between classes, so finding the best value is a balancing act.
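
A minimal sketch of fitting such a classifier with scikit-learn; the data here is a random stand-in, since the actual movie features are not shown in this section:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    # Hypothetical stand-in data: 500 movies, 5 numeric features,
    # and a binary label (0 = fail, 1 = success)
    rng = np.random.default_rng(42)
    X = rng.random((500, 5))
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=42, stratify=y)

    # KNN is distance-based, so scale features to comparable ranges
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model = KNeighborsClassifier(n_neighbors=17)
    model.fit(X_train_scaled, y_train)
    print(f"k=17 Test Accuracy: {model.score(X_test_scaled, y_test):.3f}")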

The k value represents the number of neighboring data points that are compared; a simple majority vote among those neighbors determines the class of the data point being reviewed.
For our models, the k values were selected by plotting the k values against the training accuracy scores with matplotlib and choosing the lowest k at which the scores appear to stabilize.
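
A sketch of how such a graph might be produced, continuing from the stand-in data above (plotting test accuracy alongside training accuracy is a common addition):

    import matplotlib.pyplot as plt

    # Score a range of odd k values on both splits
    k_values = list(range(1, 40, 2))
    train_scores, test_scores = [], []
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train_scaled, y_train)
        train_scores.append(knn.score(X_train_scaled, y_train))
        test_scores.append(knn.score(X_test_scaled, y_test))

    plt.plot(k_values, train_scores, marker="o", label="Train")
    plt.plot(k_values, test_scores, marker="x", label="Test")
    plt.xlabel("k")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()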


Results:
0 = Fail
1 = Success



When categorizing the movie data into success/failure categories, the model had the following accuracy scores:
k=17 Train Accuracy: 0.843
k=17 Test Accuracy: 0.840

This model has 84% accuracy, but it under-predicts movie success: it predicts a fail for movies whose scores actually indicate success. As a result, a movie fanatic might skip a movie expecting it to be bad and miss out on something they would actually have enjoyed.
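
One way to see this under-prediction is with a confusion matrix; a sketch, assuming the fitted model and scaled splits from above:

    from sklearn.metrics import confusion_matrix, classification_report

    y_pred = model.predict(X_test_scaled)
    # Rows are actual classes, columns are predicted classes; the
    # (Success, Fail) cell counts true successes predicted as fails
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, target_names=["Fail", "Success"]))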

When investigating which data points had the most impact on the success/failure categorization, removing the columns that indicated the genre of the movie improved the model, giving the following accuracy scores:

k=13 Train Accuracy: 0.879
k=13 Test Accuracy: 0.884
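
A sketch of that feature-removal experiment; the DataFrame layout and one-hot genre column names here are assumptions, not the actual movie data set:

    import pandas as pd

    # Hypothetical feature names for the stand-in matrix above
    df = pd.DataFrame(X, columns=["budget", "runtime", "genre_action",
                                  "genre_comedy", "genre_drama"])
    genre_cols = [c for c in df.columns if c.startswith("genre_")]
    X_reduced = df.drop(columns=genre_cols).to_numpy()

    Xtr, Xte, ytr, yte = train_test_split(X_reduced, y, random_state=42, stratify=y)
    scaler2 = StandardScaler().fit(Xtr)
    knn13 = KNeighborsClassifier(n_neighbors=13).fit(scaler2.transform(Xtr), ytr)
    print(f"k=13 Test Accuracy: {knn13.score(scaler2.transform(Xte), yte):.3f}")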


When categorizing the data into the integer rating scores from 2 to 9 (eight classes), the model had the following accuracy scores:

k=23 Train Accuracy: 0.506
k=23 Test Accuracy: 0.456

This shows that the model categorizes the data much more accurately when splitting it into two groups (success/failure) than when splitting it into eight separate rating groups.
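
For completeness, a sketch of the multiclass variant; the integer ratings here are randomly generated stand-ins for the real scores:

    # Hypothetical multiclass target: integer ratings from 2 to 9
    y_rating = rng.integers(2, 10, size=len(X))

    Xtr_r, Xte_r, ytr_r, yte_r = train_test_split(X, y_rating, random_state=42)
    scaler_r = StandardScaler().fit(Xtr_r)
    knn23 = KNeighborsClassifier(n_neighbors=23).fit(scaler_r.transform(Xtr_r), ytr_r)
    print(f"k=23 Test Accuracy: {knn23.score(scaler_r.transform(Xte_r), yte_r):.3f}")

KNeighborsClassifier handles multiclass targets natively, so the only change from the binary case is the label column.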