Machine Learning Model - K Nearest Neighbor
The KNeighborsClassifier model categorizes data by assigning each data point to the class
that is most common among the data points nearest to it. The k value is selected manually
by the programmer to optimize the model and can vary depending on the data set used.
Typically, a larger k value reduces the effect of noise but makes the boundaries between
classes less distinct, so choosing the best value is a balancing act.
The k value is the number of neighboring data points that are compared; a simple majority
vote among those neighbors determines the class of the data point being reviewed.
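The majority vote described above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the function name knn_predict, the Euclidean distance metric, and the toy data are assumptions made for the example.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors wins the vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per point, labels 0 = Fail, 1 = Success.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(points, labels, query=(3.5, 3.5), k=3)
print(prediction)
```

Using an odd k, as in the models here, avoids tied votes when there are only two classes.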
For our models, the k values were selected by reviewing a matplotlib graph of training accuracy
scores against k values and choosing the lowest k at which the accuracy appears to stabilize.
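The selection was done by eye from the graph. One way to make "lowest value where the accuracy stabilizes" concrete is sketched below. The function lowest_stable_k, the tolerance of 0.005, and the accuracy curve are all hypothetical and are not the values or method used for our models.

```python
def lowest_stable_k(k_values, scores, tolerance=0.005):
    """Return the smallest k after which every score stays within
    `tolerance` of the score at that k."""
    for i, k in enumerate(k_values):
        # Compare the score at this k with the scores at all larger k values.
        if all(abs(later - scores[i]) <= tolerance for later in scores[i:]):
            return k
    return None  # only reached when the input lists are empty


# Hypothetical accuracy curve for odd k values; not the project's real scores.
k_values = [1, 3, 5, 7, 9, 11, 13, 15]
scores = [0.70, 0.76, 0.80, 0.83, 0.84, 0.842, 0.841, 0.843]

best_k = lowest_stable_k(k_values, scores)
print(best_k)
```

A tighter tolerance pushes the chosen k higher, so the threshold plays the same role as the judgment call made when reading the graph.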
Results:
0 = Fail
1 = Success
When categorizing the movie data into success/failure categories, the model had the
following accuracy scores:
k=17 Train Accuracy: 0.843
k=17 Test Accuracy: 0.840
This model has 84% accuracy, but it under-predicts success: it tends to predict a fail
for movies whose actual score indicates a success.
As a result, a movie fanatic may skip a movie expecting it to be bad
and miss out on something they would actually have enjoyed.
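Under-prediction of this kind shows up as false negatives (actual 1, predicted 0) outnumbering false positives. The sketch below counts each outcome; the function confusion_counts and the label lists are hypothetical and chosen only to illustrate the pattern, not taken from our results.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for labels 0 = Fail, 1 = Success."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1  # success correctly predicted
        elif a == 0 and p == 0:
            counts["tn"] += 1  # fail correctly predicted
        elif a == 0 and p == 1:
            counts["fp"] += 1  # fail predicted as success
        else:
            counts["fn"] += 1  # success predicted as fail (assumes 0/1 labels only)
    return counts


# Hypothetical labels, chosen only to illustrate under-prediction of success.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(actual, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
# Recall: the share of actual successes that the model predicted as successes.
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, accuracy, recall)
```

Overall accuracy can look reasonable while recall for the Success class stays low, which is why the two numbers are worth reading together.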
To find which data points had more impact on the success/failure classification, we removed
the columns that indicated the genre of the movie. This improved the model, which had the
following accuracy scores:
k=13 Train Accuracy: 0.879
k=13 Test Accuracy: 0.884
When categorizing the data into the integer rating scores from 2 to 9, the model had the
following accuracy scores:
k=23 Train Accuracy: 0.506
k=23 Test Accuracy: 0.456
This suggests that the model is better at categorizing the data into two groups than at
splitting the data into 8 separate groups.