
I have a post, and I need to predict its final score as closely as I can.

Apparently using curve_fit should do the trick, although I am not really understanding how I should use it.

I have two known values, which I collect 2 minutes after the post goes up.

These are the comment count, referred to as n_comments, and the vote count, referred to as n_votes.

After an hour, I check the post again and get the final_score (the sum of all votes), which is what I want to predict.

I've looked at different examples online, but they all use multiple data points (I have just 2). Also, my initial data point contains more information (n_votes and n_comments), as I've found that you cannot accurately predict the score from either one alone.

To use curve_fit you need a function. Mine looks like this:

def func(datapoint,k,t,s):
    return ((datapoint[0]*k+datapoint[1]*t)*60*datapoint[2])*s

And a sample datapoint looks like this:

[n_votes, n_comments, hour] 
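From the curve_fit docs, it seems xdata can also be a 2-D array with one row per feature and one column per observation, so presumably my lists could be stacked like this (just a sketch; I dropped s here, since it only rescales k and t and makes the fit degenerate):

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, k, t):
    # x is 2-D: row 0 = n_votes, row 1 = n_comments, row 2 = hour
    return (x[0] * k + x[1] * t) * 60 * x[2]

# one column per observed post (numbers taken from my lists below)
xdata = np.vstack([
    [3, 1, 2, 1, 0],    # n_votes at ~2 minutes
    [0, 3, 0, 1, 64],   # n_comments at ~2 minutes
    [1, 1, 1, 1, 1],    # hour (assumed 1 here)
])
ydata = np.array([26.0, 12.0, 13.0, 14.0, 229.0])  # final scores

popt, pcov = curve_fit(func, xdata, ydata)
predicted = func(xdata, *popt)
```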

This is the broken mess of my attempt, and the result doesn't look right at all.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit


initial_votes_list = [3, 1, 2, 1, 0]
initial_comment_list = [0, 3, 0, 1, 64]
final_score_list = [26, 12, 13, 14, 229]

# These lists contain data about multiple posts; I want to predict one at a time, passing the parameters to the next.

def func(x, k, t, s):
    return ((x[0]*k + x[1]*t) * 60 * x[2]) * s

x = np.array([3, 0, 1])
y = np.array([26, 0, 2])
#X = [[a, b, c] for a, b, c in zip(initial_votes_list, initial_comment_list, [i for i in range(len(initial_votes_list))])]


popt, pcov = curve_fit(func, x, y)

plt.plot(x, [1, func(x, *popt), 2], 'g--',
         label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))

plt.xlabel('Time')
plt.ylabel('Score')
plt.legend()
plt.show()

The plot should display the initial/final score and the current prediction.

I have some doubts regarding the function too. Initially, it looked like this:

(votes_per_minute + n_comments) * 60 * hour

But I replaced votes_per_minute with just votes. Considering that I collect this data after 2 minutes, and that I have a parameter there, I'd say it's not too bad, but I really don't know.

Again, who guarantees that this is the best possible function? It would be nice to have the function discovered automatically, but I think that's ML territory...
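One sanity check I can do (just a sketch over the five posts above, not a guarantee of anything) is to fit two candidate forms and compare their squared residuals. The form with more parameters can only do equal or better on the training data, so this is no substitute for testing on new posts:

```python
import numpy as np
from scipy.optimize import curve_fit

votes    = np.array([3, 1, 2, 1, 0], dtype=float)
comments = np.array([0, 3, 0, 1, 64], dtype=float)
scores   = np.array([26, 12, 13, 14, 229], dtype=float)
xdata    = np.vstack([votes, comments])

# two candidate forms; neither is guaranteed to be "the" right one
def linear(x, k, t):
    return k * x[0] + t * x[1]

def affine(x, k, t, c):
    return k * x[0] + t * x[1] + c

sse = {}
for model in (linear, affine):
    popt, _ = curve_fit(model, xdata, scores)
    # sum of squared residuals: lower means a closer fit to these 5 posts
    sse[model.__name__] = float(np.sum((scores - model(xdata, *popt)) ** 2))
print(sse)
```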

EDIT:

Regarding the measurements: I can get as many as I want (every 15/30/60 s), although they have to be collected while the post is <= 3 minutes old.
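For example (made-up numbers, just to illustrate the shape of the input), each snapshot of a single post could become its own column, with the elapsed time as a third feature, reviving my original votes-per-minute idea:

```python
import numpy as np
from scipy.optimize import curve_fit

# hypothetical snapshots of ONE post at 1, 2 and 3 minutes of age;
# every snapshot is its own column, with elapsed minutes as a feature
def func(x, k, t):
    # x[0] = votes, x[1] = comments, x[2] = elapsed minutes;
    # dividing by elapsed minutes gives a per-minute rate, *60 extrapolates to an hour
    return (x[0] * k + x[1] * t) / x[2] * 60

xdata = np.vstack([
    [1.0, 2.0, 3.0],   # votes (made-up)
    [0.0, 1.0, 1.0],   # comments (made-up)
    [1.0, 2.0, 3.0],   # elapsed minutes
])
ydata = np.array([26.0, 26.0, 26.0])  # same final score repeated per snapshot

popt, pcov = curve_fit(func, xdata, ydata)
```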

G. Ramistella
  • The question is a bit hard to follow, I think that might be why there is no answer yet. Perhaps you can include some sample data for people to play with, and think about how you are explaining it? Do you have many such measurements, or just one? Because with just two datapoints pretty much any model will give a fit. – tBuLi May 19 '19 at 10:46
  • @tBuLi I have included 5 measured 'posts' to play with (build the datapoints with the lists I have included in the post), although I am working on including more – G. Ramistella May 19 '19 at 11:24
  • Ah sorry, you're right. – tBuLi May 19 '19 at 13:18
  • So these three arrays, `initial_votes_list` etc., should somehow go into `func`, which is currently not happening. Also, what are `x` and `y`? – tBuLi May 19 '19 at 13:23
  • It was just an attempt at making things work (notice that the values are the same as the lists @ i = 0 ) – G. Ramistella May 19 '19 at 13:27

1 Answer


Disclaimer: This is just a suggestion on how you may approach this problem. There might be better alternatives.

I think it might be helpful to take into consideration the relationship between elapsed-time-since-posting and the final-score. The following curve, from [OC] Upvotes over time for a reddit post, models the behavior of the final-score or total-upvotes-count over time:

(figure: upvotes over time for a reddit post)

The curve obviously relies on the fact that once a post is online, you expect a roughly linear climb in upvotes that slowly converges/stabilizes around a maximum (and from there you have a gentle/flat slope).

Moreover, we know that the number of votes/comments usually grows as a function of time. The relationship between these elements can be modeled as a series; I chose a geometric progression (you can try an arithmetic one if it fits better). You also have to keep in mind that some users are counted twice: some commented and upvoted, and some can comment multiple times but upvote only once. I chose to assume that only 70% of commenters are unique (in code, p = 0.7) and that users who both commented and upvoted represent 60% of the total number of users, commenters and upvoters combined (in code, e = 1 - 0.6 = 0.4). The result of these assumptions:

(figure showed the two model equations; combined and averaged, as in the code below: score = (a * exp(1 - b/t^d) + q^t * e * (votes + p * comments)) / 2)

So we have two equations to model the score, which you can combine and average. In code, that looks like this:

import warnings 
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from mpl_toolkits.mplot3d import axes3d
# filter warnings
warnings.filterwarnings("ignore")

class Cfit: 
    def __init__(self, votes, comments, scores, fit_size):
        self.votes    = votes
        self.comments = comments
        self.scores   = scores
        self.time     = 60          # prediction time 
        self.fit_size = fit_size
        self.popt     = []

    def func(self, x, a, d, q):
        e = 0.4
        b = 1
        p = 0.7
        return (a * np.exp( 1-(b / self.time**d )) + q**self.time * e * (x + p*self.comments[:len(x)]) ) /2

    def fit_then_predict(self):
        popt, pcov = curve_fit(self.func, self.votes[:self.fit_size], self.scores[:self.fit_size])
        return popt, pcov


# init
init_votes    = np.array([3,   1,  2,  1,   0])
init_comments = np.array([0,   3,  0,  1,  64])
final_scores  = np.array([26, 12, 13, 14, 229])

# fit and predict
cfit       = Cfit(init_votes, init_comments, final_scores, 15)
popt, pcov = cfit.fit_then_predict()

# plot expectations
fig = plt.figure(figsize = (15,15))
ax1 = fig.add_subplot(2,3,(1,3), projection='3d')
ax1.scatter(init_votes, init_comments, final_scores,                 c='g', label='expected')
ax1.scatter(init_votes, init_comments, cfit.func(init_votes, *popt), c='r', label='predicted')
# axis
ax1.set_xlabel('init votes count')
ax1.set_ylabel('init comments count')
ax1.set_zlabel('final score')
ax1.set_title('final score = f(init votes count, init comments count)')

plt.legend()

# evaluation: diff = expected - prediction
diff = abs(final_scores - cfit.func(init_votes, *popt))
ax2  = fig.add_subplot(2,3,4)
ax2.plot(init_votes, diff, 'ro', label='fit: a=%5.3f, d=%5.3f, q=%5.3f' % tuple(popt))
ax2.grid('on')
ax2.set_xlabel('init votes count')
ax2.set_ylabel('|expected-predicted|')
ax2.set_title('|expected-predicted| = f(init votes count)')


# plot expected and predictions as f(init-votes)
ax3  = fig.add_subplot(2,3,5)
ax3.plot(init_votes, final_scores, 'gx', label='expected')
ax3.plot(init_votes, cfit.func(init_votes, *popt), 'rx', label='fit: a=%5.3f, d=%5.3f, q=%5.3f' % tuple(popt))
ax3.set_xlabel('init votes count')
ax3.set_ylabel('final score')
ax3.set_title('final score = f(init votes count)')
ax3.grid('on')

# plot expected and predictions as f(init-comments)
ax4  = fig.add_subplot(2,3,6)
ax4.plot(init_comments, final_scores, 'gx', label='expected')
ax4.plot(init_comments, cfit.func(init_votes, *popt), 'rx', label='fit: a=%5.3f, d=%5.3f, q=%5.3f' % tuple(popt))
ax4.set_xlabel('init comments count')
ax4.set_ylabel('final score')
ax4.set_title('final score = f(init comments count)')
ax4.grid('on')
plt.show()

The output of the previous code is the figure with the four plots described above (3-D scatter of expected vs. predicted scores, plus the three diagnostic plots). Obviously, the provided data set is too small to evaluate any approach, so it is up to you to test this with more data.

The main idea here is that you assume your data follows a certain function/behavior (described in func), but you give it certain degrees of freedom (the parameters a, d and q). Using curve_fit, you approximate the combination of these variables that best fits your input data to your output data. Once you have the parameters returned by curve_fit (popt in the code), you just run your function with those parameters, for example like this (add this section at the end of the previous code):

# a function similar to func, to predict the score for given input values
def score(votes_count, comments_count, popt):
    e, b, p = 0.4, 1, 0.7
    a, d, q = popt[0], popt[1], popt[2]
    t       = 60
    return (a * np.exp( 1-(b / t**d )) + q**t * e * (votes_count + p*comments_count )) /2

print("score for init-votes = 2 & init-comments = 0 is ", score(2, 0, popt))

Output:

score for init-votes = 2 & init-comments = 0 is 14.000150386210994

You can see that this output is close to the true value of 13; hopefully, with more data, you can get better/more accurate approximations of your parameters and consequently better "predictions".
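As a side note, the pcov returned by curve_fit can also tell you how well-determined each parameter is: the square roots of its diagonal are one-standard-deviation errors on the entries of popt. A generic toy fit (not the score model above) to illustrate:

```python
import numpy as np
from scipy.optimize import curve_fit

# toy exponential model, just to show how to read pcov
def model(x, a, b):
    return a * np.exp(b * x)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = model(x, 2.0, 1.5) + rng.normal(0.0, 0.05, x.size)  # small noise

popt, pcov = curve_fit(model, x, y, p0=(1.0, 1.0))
perr = np.sqrt(np.diag(pcov))  # one-sigma uncertainty per parameter
for name, value, err in zip(("a", "b"), popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

With more (or cleaner) data, these error bars shrink, which is a quick way to see whether collecting more posts is actually tightening the fit.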

SuperKogito
  • This is by far the best answer I've ever received. Thank you. I will expand the dataset and test some more. – G. Ramistella May 19 '19 at 15:19
  • How do I predict the score for a single post? From what I understand the `init_comments` is used in the training phase but not in the prediction phase – G. Ramistella May 19 '19 at 15:33
  • Look at the last part of the answer (I just edited the answer). It shows how to predict the score for one case. If the answer is good consider accepting it. – SuperKogito May 19 '19 at 17:50
  • With a bigger dataset I am reaching an accuracy of 55%. By looking at the data there are a lot of cases where the same initial conditions result in a different final score (not by much though). I am guessing that adding more initial conditions (the same but @ minute 1-2-3) will yield better results. Is it possible to expand the `curve_fit` to include this information? – G. Ramistella May 19 '19 at 20:30
  • Duplicates cannot all be fitted at the same time and might reduce your accuracy. Yes you can, by slightly changing your `votes_count` and `comments_count` that you are passing to `Cfit` and accordingly your `func`. But you need to find a proper formatting of your input data. Is it times? or votes and comments ? and accordingly structure your approach. – SuperKogito May 19 '19 at 20:43
  • I wanted to collect both `votes_count` and `comments_count` every minute for 3 minutes and then do the prediction. How should I edit the code to do this? – G. Ramistella May 19 '19 at 20:45
  • You might need to ask more on this on a different [stack-exchange-sites](https://stackexchange.com/sites) just to get more help with the reasoning. [code-review](https://codereview.stackexchange.com/), [cross-validation](https://stats.stackexchange.com/) and [data-science](https://datascience.stackexchange.com/) are imo nice places to start. – SuperKogito May 19 '19 at 20:48
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/193599/discussion-between-superkogito-and-a-dandelion). – SuperKogito May 19 '19 at 20:50