Machine Learning Model - Logistic Regression
Logistic Regression is a statistical method for predicting binary outcomes from data.
Like linear regression, it represents the model as an equation.
Input values (x) are combined linearly using weights (coefficients), and the result is passed
through the logistic function to predict an output value (y).
The logistic function, also known as the sigmoid function, is an S-shaped curve that
maps any real-valued number to a value between 0 and 1, which can be read as the probability
of the positive outcome.
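As a minimal sketch of the idea above, the following Python code combines inputs linearly and passes the result through the sigmoid function. The weights, bias, inputs, and the 0.5 decision threshold are hypothetical values chosen only for illustration; they are not taken from the movie model.

```python
import math


def sigmoid(z):
    """Map a real number to a value between 0 and 1."""
    # Two branches avoid overflow in math.exp for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(x, weights, bias):
    """Combine the inputs linearly, then squash the result with the sigmoid."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


# Hypothetical weights, bias, and inputs (illustration only).
weights = [0.8, -0.5]
bias = 0.1
p = predict_probability([2.0, 1.0], weights, bias)

# Assumed decision rule: predict class 1 when the probability is at least 0.5.
label = 1 if p >= 0.5 else 0
print(round(p, 4), label)
```

Here the linear combination is 0.1 + 0.8 * 2.0 - 0.5 * 1.0 = 1.2, so the predicted probability is above 0.5 and the predicted class is 1.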
Implementing the Movie Dataset:
After the ETL was complete, the clean movie file was used to perform Logistic Regression.
Before the Logistic Regression model could be created, two columns had to be dropped from
the dataset. First, the movie titles were dropped because they are strings, which the model
cannot use directly. Second, the Rating Integer was dropped because it was used in the ETL to
create the performance variable, which records whether a movie's score was successful or
unsuccessful. Keeping it as an input would give the model the answer it is supposed to predict.
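The column drops described above might look like the following with pandas. This is only a sketch: the DataFrame is a tiny stand-in for the clean movie file, and every column name and value in it is hypothetical, since the actual schema is not shown here.

```python
import pandas as pd

# Tiny stand-in for the clean movie file.
# All column names and values are hypothetical.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "rating_integer": [8, 6, 7],
    "runtime": [120, 95, 110],
    "performance": [1, 0, 0],
})

# Drop the string column, the column the label was derived from,
# and the label itself, leaving only the input features.
features = movies.drop(columns=["title", "rating_integer", "performance"])

# The performance variable is kept separately as the target.
target = movies["performance"]

print(list(features.columns))
```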
The Train Test Split was created using the performance variable as the target. This variable
was created during the ETL and records whether a movie’s Rating Integer was greater than 7
(successful) or less than or equal to 7 (unsuccessful/fail).
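A minimal sketch of this step, assuming pandas and scikit-learn, is shown below. The column names, the sample values, the 25% test size, the random seed, and the stratified split are all assumptions made for illustration; the split settings actually used are not stated here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical sample data; column names are illustrative only.
movies = pd.DataFrame({
    "rating_integer": [9, 8, 7, 6, 8, 5, 10, 7],
    "runtime": [130, 115, 100, 90, 125, 85, 140, 105],
})

# Performance variable: 1 (success) if the rating is greater than 7,
# otherwise 0 (fail). A rating of exactly 7 counts as a fail.
movies["performance"] = (movies["rating_integer"] > 7).astype(int)

# The rating itself is left out of the features.
X = movies[["runtime"]]
y = movies["performance"]

# Assumed settings: 25% test size, fixed seed, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```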
Results
0 = Fail
1 = Success
The Logistic Regression model correctly predicts whether a movie is a success or a failure
about 88% of the time, on both the training data and the testing data.
However, the model tends to overpredict success, meaning it predicts a success for movies
whose actual score indicates a failure (a false positive).
A model that overpredicts movie scores could lead someone to watch a movie
expecting it to be good, only to be disappointed.
Training Data Score: 0.882876468494126
Testing Data Score: 0.880469583778015
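Scores like these, and the tendency to overpredict, could be checked with something like the sketch below, assuming scikit-learn. It runs on synthetic data from make_classification, not the movie dataset, so its numbers will not match the scores above. The max_iter value and random seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the movie dataset).
X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() returns mean accuracy for classifiers.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Training Data Score: {train_score}")
print(f"Testing Data Score: {test_score}")

# Confusion matrix for labels 0 (fail) and 1 (success):
# fp counts predicted successes that were actually fails,
# fn counts predicted fails that were actually successes.
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print(f"False positives: {fp}, false negatives: {fn}")
```

If the false positives clearly outnumber the false negatives on the test data, that would be consistent with a model that overpredicts success.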