For those of you who haven't (completely) read the previous thread, let me briefly explain what this is all about.
A few years ago I started making a movie database that I used to check which movies I'd probably like and which I wouldn't. At first I based my personal recommendations only on a movie's IMDB rating and the number of Oscars it had won. Over the years I've added more information to base the recommendations on. A bit over two months ago I found Criticker, and since then I've added a few extra parameters and automated the whole thing somewhat, so that I can now generate recommendations not only for myself but for any Criticker user.
The parameters that I use in my calculations are the following:
- Criticker PSI: this is the Probable Score Index that Criticker displays to you for each movie that you haven't seen. The higher the PSI, the more likely you'll like the movie.
- IMDB: this is the movie's average user rating as shown on IMDB. At the moment I use the rating from 18-29 year olds by default.
- Keywords: my program analyses the movies that you've seen for their keywords, and then favours movies that share keywords with other movies that you liked (see the sketch after this list).
- Genre: just like with keywords, my program analyses the movies that you've seen for their genres, and then favours movies that share genres with other movies that you liked.
- Votes: maybe I should rename this to popularity. This parameter is based on the number of votes that each movie has on IMDB. The more popular a movie, the better.
- Awards: this is the number of awards a movie has won or been nominated for. The more awards a movie has won, the better.
- Length: this parameter is based on the running time of the movie. I noticed that the movies I like best tend to run longer than the movies I like least. It seems I'm not the only person who favours longer movies: every other user I've generated recommendations for so far prefers longer movies over shorter ones.
- Age: this parameter shows how recent a movie is. It seems I'm pretty much the only one who prefers more recent movies; the other users I've made recommendations for prefer older ones. Lately I've been wondering how useful the Age parameter really is, though. Anyway...
- Popcorn: it took me a really long time to find a way to quantify this parameter, but I'm quite confident that it is pretty accurate now. This parameter tries to distinguish popcorn movies from non-popcorn movies by combining data on box office results, IMDB ratings and genres. Popcorn movies are light, accessible, fun movies that you go to mainly to enjoy your friends' company, rather than to be awed by the movie itself. For example, three movies on the popcorn end of the scale are Meet the Spartans, Anaconda, and Date Movie. Three movies on the non-popcorn end of the scale are Dead Man, The Celebration, and Irréversible.
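To make the keyword and genre idea a bit more concrete, here's a rough Python sketch of the general approach. This is a simplified illustration only, not my actual code; the scoring used here (averaging your ratings per keyword and then averaging those over the keywords an unseen movie shares with your rated movies) is just one way to do it.

```python
# Rough sketch of the keyword/genre idea (simplified illustration, not my actual code).
# Each movie you've rated contributes its keywords, weighted by how much you liked it;
# an unseen movie then scores higher when its keywords match the ones you tend to like.
from collections import defaultdict

def keyword_scores(rated_movies, unseen_movies):
    """rated_movies: list of (keywords, your_rating); unseen_movies: list of keyword lists."""
    total = defaultdict(float)
    count = defaultdict(int)
    for keywords, rating in rated_movies:
        for kw in keywords:
            total[kw] += rating      # sum of your ratings for movies with this keyword
            count[kw] += 1
    avg = {kw: total[kw] / count[kw] for kw in total}  # average rating per keyword

    scores = []
    for keywords in unseen_movies:
        known = [avg[kw] for kw in keywords if kw in avg]
        # an unseen movie's keyword score is the average of the per-keyword averages
        scores.append(sum(known) / len(known) if known else None)
    return scores

# Example: you rated two movies, and we score one unseen movie on its keywords.
rated = [(["heist", "twist-ending"], 85), (["road-trip", "heist"], 60)]
print(keyword_scores(rated, [["heist", "twist-ending", "neo-noir"]]))  # -> [78.75]
```

The genre parameter works along the same lines, just with genres instead of keywords.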
Here's an example of what my system comes up with:
If you want me to make one of these for you, message me here on Criticker with your Criticker password (after you've changed it for safety!) and I'll get you a list like the one above within a day or so. Unfortunately I need your password in order to get hold of the PSIs that Criticker shows you.
And now for those who've been following this thread from the beginning...
I've made a few changes to the system since the last version. These are the following:
- The Awards parameter is back in, as well as a popularity parameter, based on the number of votes a movie has on IMDB.
- I've tweaked the popcorn parameter so that it is also based on a movie's genres, and not just on its box office results. If two movies have made the same amount of money at the box office and have the same IMDB rating, the one with action and romance as genres will get a higher popcorn rating than the one with history and documentary as genres (see the sketch after this list).
- I've found a completely new way to optimize the weights of the parameters. The primary gain is in computing time: where my previous code took 20 minutes to get accurate weights, my new code takes only 1 second to generate the final scores (over 1000 times faster). For some more info on the math, see below. How well it works also varies from user to user: for me personally it works a bit less well, but for KGB and AFlickering the final correlations were higher with the new method.
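To give an idea of how the genre tweak fits in with the box office and IMDB ingredients, here's a rough Python sketch of that kind of calculation. The genre values and the way the three ingredients are combined are made up for the example; they are not the numbers my system actually uses.

```python
# Rough sketch of the popcorn idea (illustration only; the weights below are made up,
# they are not the numbers my system actually uses).
# A movie that earns a lot at the box office relative to its IMDB rating, and whose
# genres lean towards "light" ones, ends up on the popcorn end of the scale.

GENRE_POPCORN = {        # how "popcorn" each genre is, on a 0-1 scale (assumed values)
    "action": 0.9, "comedy": 0.8, "romance": 0.7,
    "history": 0.2, "documentary": 0.1, "drama": 0.3,
}

def popcorn_score(box_office, imdb_rating, genres):
    # High box office pushes the score up, a high IMDB rating pulls it down,
    # and the genre term shifts movies with "light" genres further up.
    money_term = min(box_office / 100_000_000, 1.0)   # normalise gross to a 0-1 range
    rating_term = 1.0 - imdb_rating / 10.0            # low rating -> more popcorn
    genre_term = sum(GENRE_POPCORN.get(g, 0.5) for g in genres) / len(genres)
    return round((money_term + rating_term + genre_term) / 3, 2)

# Two movies with the same gross and rating: the action/romance one rates as more popcorn.
print(popcorn_score(50_000_000, 6.0, ["action", "romance"]))       # -> 0.57
print(popcorn_score(50_000_000, 6.0, ["history", "documentary"]))  # -> 0.35
```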
Here are updated lists for:
You'll notice that I removed the weight percentages from the .pdf. This is because, with my new method, I haven't yet found a way to quantify the weights of the original parameters. I could give you the weights of the mutually independent parameters, but those are very different from the original ones and thus not really relevant for showing you what matters and what doesn't. I think the correlations already show you quite well which parameters work for you and which don't. You might also notice a significant change compared with your previous lists because of the new optimization method. If you find the new lists (much) worse, please let me know why you feel that way, so that I can figure out whether I can do something about it.
* The math of my new optimization method: In my previous algorithm, I searched for the maximum of the final correlation in the N-dimensional space of weights, where N was the number of parameters. The reason I couldn't just use each parameter's own correlation with my ratings to determine the weights is that the parameters all have non-zero correlations with each other, i.e. they are not mutually independent. Part of the information in the PSI parameter, for example, is also present in the IMDB parameter, because for most people, movies that score high on IMDB also score high on Criticker. With my new method, I convert these 9 mutually dependent parameters into 9 mutually independent parameters. The parameters then share no information, and the correlations between them are zero. That means a parameter with a higher correlation with my ratings simply gets a higher weight in my final score, and I don't have to worry about mutual dependencies anymore. I'm neither a mathematician nor a computer programmer, so I can't really explain why the new method works better for some users and worse for others, but the important thing is that it generally performs about equally well while being a thousand times faster.
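For those who want to see what such a decorrelation step can look like, here's a small numpy sketch of one possible way to do it, using a QR decomposition on made-up data. It's only meant to illustrate the principle (orthogonalise the parameters, then use each new parameter's correlation with the ratings directly as its weight), not to show my actual code.

```python
# Small sketch of the decorrelation idea (one possible way to do it, illustration only).
# Columns of X stand in for the original, mutually correlated parameters (PSI, IMDB, ...);
# a QR decomposition gives orthogonal columns Q that share no information, so each
# column's correlation with the ratings can be used directly as its weight.
import numpy as np

rng = np.random.default_rng(0)
n_movies, n_params = 500, 9

# Fake data standing in for the real parameters and ratings.
X = rng.normal(size=(n_movies, n_params))
X[:, 1] = 0.7 * X[:, 0] + 0.3 * X[:, 1]       # make two parameters correlated, like PSI and IMDB
ratings = X @ rng.normal(size=n_params) + rng.normal(scale=0.5, size=n_movies)

# Centre the parameters, then orthogonalise them.
Xc = X - X.mean(axis=0)
Q, _ = np.linalg.qr(Xc)                        # columns of Q are mutually uncorrelated

# Because the new parameters are uncorrelated, each one's correlation with the ratings
# can serve directly as its weight in the final score - no iterative search needed.
weights = np.array([np.corrcoef(Q[:, j], ratings)[0, 1] for j in range(n_params)])
final_scores = Q @ weights

print("correlation of final score with ratings:", np.corrcoef(final_scores, ratings)[0, 1])
```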