
I'm trying to fit a curve that estimates the number of likes a news article has as a function of the article's age. I have a dataset with 5000 data points. The x-axis is the time since publication in hours and the y-axis is the number of likes the article has.

The constraints on the function are that it must not have a negative derivative (an article will not lose likes as it gets older) and that y can't be larger than 0 at x = 0.
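To make the two constraints concrete, here is a minimal sketch of one model family that satisfies both by construction. The form a*log(1 + x/b), the exponential reparameterisation, and the synthetic data are all assumptions for illustration; this is not the fit described in the rest of the question.

```matlab
% Hypothetical model: y = a*log(1 + x/b) with a = exp(q(1)) > 0, b = exp(q(2)) > 0.
% Then y(0) = a*log(1) = 0 and dy/dx = a/(b + x) > 0 for every x >= 0,
% so both constraints hold whatever values the optimiser picks for q.
constrainedModel = @(q,x) exp(q(1)) .* log(1 + x ./ exp(q(2)));

% Synthetic stand-in data (assumption, not the real dataset).
rng(0);
x = linspace(0, 240, 200)';
y = 40 .* log(1 + x ./ 5) + 5 .* randn(size(x));

% nlinfit comes from the Statistics and Machine Learning Toolbox.
q = nlinfit(x, y, constrainedModel, [log(10) log(10)]);
a = exp(q(1));   % recovered scale, positive by construction
b = exp(q(2));   % recovered time constant, positive by construction
```

Fitting the logarithms of a and b is just one way to keep both parameters positive without needing a constrained optimiser.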

The only way I managed to get a fit that roughly satisfies these constraints was to use the function a*log(x-1)/log(b)+c and apply it only to the first 240 hours or so. If I use a longer time span, the fit becomes an essentially linear estimate with y(0) > 0. I also had to remove all data points with more than 500 likes; otherwise the curve ends up far too high.

I used the following MATLAB code:

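% p = [a b c]; note that log(x-1) is real-valued only for x > 1
modelFunc = @(p,x) p(1) .* log(x-1) ./ log(p(2)) + p(3);
% B(:,2) = age in hours, B(:,1) = likes; three parameters, so three starting values
coef = nlinfit(B(:,2),B(:,1),modelFunc,[1 2 0])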

But this approach has several problems that make the result close to useless:

  • I assume that the growth is logarithmic

  • I arbitrarily picked the cut-off value in time to make the graph "look good"

  • I arbitrarily picked a cut-off value for the "abnormally high" like counts

So this estimation line is based more on what looks good to my eyes than on mathematical calculations...
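One possible way to avoid a hand-picked cut-off for unusually high like counts is robust weighting, which `nlinfit` supports through its options structure. The sketch below is only an illustration under assumptions: the one-parameter model a*log(1 + x), the synthetic data, and the injected outliers are all hypothetical.

```matlab
% Synthetic stand-in data with a few "viral" outliers (assumption, not the real dataset).
rng(1);
x = linspace(0, 240, 300)';
y = 30 .* log(1 + x) + 10 .* randn(size(x));
y(1:25:end) = y(1:25:end) + 2000;          % every 25th article gets 2000 extra likes

% Hypothetical one-parameter model, chosen only to keep the example small.
model = @(p,x) p(1) .* log(1 + x);

% Ordinary least squares: every point, including the outliers, has full weight.
pPlain = nlinfit(x, y, model, 10);

% Robust fit: bisquare weights shrink the influence of large residuals,
% so no manual threshold on the like count is needed.
opts = statset('nlinfit');
opts.RobustWgtFun = 'bisquare';
pRobust = nlinfit(x, y, model, 10, opts);
```

Comparing `pPlain` and `pRobust` shows how strongly the few extreme articles pull the ordinary fit. Whether down-weighting them is appropriate depends on whether those articles belong in the prediction at all.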

Any ideas on how to get a good estimate?

The data set and an approximation function

Close up of some datapoints

Daniel Falk
  • The pictures don't make sense to me. Are there multiple dots with the same X value? – Timothy Shields Oct 09 '15 at 18:50
  • It does not look like your description/model fits the data. It's clear that you are getting higher numbers of shares for lower values of x. Before trying to fit a curve, do some more data exploration. I would group your data by x values and find the mean and median of each group. Plot that to understand what's happening. – Salix alba Oct 09 '15 at 19:28
  • Yes, I agree with you that in this selection of data the older articles have a lower number of likes. But what I want to predict is not a function of the date they were published but a function of how many hours they have been online. Because a news article can't lose likes (that is more or less impossible), this is probably a sign that people shared less during those dates, for example because there is less sharing in the summer than in the winter or similar. I'm interested in the likes based on the age, no matter what time of year it is when I do the prediction. – Daniel Falk Oct 10 '15 at 03:17
  • Ok, let me clarify a bit. Yes, there can be more than one dot with the same x-value, because more than one news article can have the same age but a different number of likes. – Daniel Falk Oct 10 '15 at 03:17
  • Are the data you're showing us the number of likes received on a given day? – pragmatist1 Oct 10 '15 at 03:33
  • The problem is that from this data, how long it's been up is not the major factor (it's clear there's an impact to do with something like topic, who it's been shared by, etc). You're not going to cleanly predict anything from time since posting alone. – nkjt Oct 11 '15 at 08:34
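
The grouping suggested in the comments (mean and median of the likes per group of x values) could be sketched roughly as follows. The 24-hour bin width and the synthetic `B` matrix are assumptions; `B` only mimics the layout used in the question (column 1 = likes, column 2 = age in hours).

```matlab
% Synthetic stand-in for B (assumption): column 1 = likes, column 2 = age in hours.
rng(2);
hours = 240 .* rand(1000, 1);
likes = round(30 .* log(1 + hours) .* exp(0.5 .* randn(1000, 1)));
B = [likes hours];

edges = 0:24:240;                         % one bin per day of article age
bin = discretize(B(:,2), edges);          % bin index of each article
nBins = numel(edges) - 1;

% Median and mean likes per bin; bins without data would be filled with NaN.
medLikes  = accumarray(bin, B(:,1), [nBins 1], @median, NaN);
meanLikes = accumarray(bin, B(:,1), [nBins 1], @mean, NaN);

% Plot both summaries against the bin centres.
centers = edges(1:end-1)' + 12;
plot(centers, medLikes, 'o-', centers, meanLikes, 's-');
legend('median', 'mean');
xlabel('age (hours)');
ylabel('likes');
```

A large gap between the mean and median curves would indicate that a few heavily liked articles dominate each age group.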
