{ "blogid": 11, "blog_authorid": 2, "blog_content": "(this is blog complete content: html encoded on base64 such as) PHNlY3Rpb24+PGRpdiBjbGFzcz0icm93Ij4KICAgICAgICA8ZGl2IGNsYXNzPSJjb2wtc20tMTIiIGRhdGEtdHlwZT0iY29udGFpbmVyLWNvbnRlbn", "blog_timestamp": "2018-03-17 00:00:00", "blog_title": "Amazon India Fashion Week: Autumn-", "blog_subtitle": "", "blog_featured_img_link": "link to image", "blog_intropara": "Introductory para to article", "blog_status": 1, "blog_lastupdated": "\"Mar 19, 2018 7:42:23 AM\"", "blog_type": "Blog", "blog_tags": "1,4,6", "blog_uri": "Amazon-India-Fashion-Week-Autumn", "blog_categories": "1", "blog_readtime": "5", "ViewsCount": 0 }

Above is one sample blog as per my API. I have a JsonArray of such blogs.

I am trying to predict 3 similar blogs based on a blog's properties (e.g. tags, categories, author, keywords in the title/subtitle) and its content. I have no user data, i.e. there is no logged-in user data such as ratings or reviews. I know that without user data the results will not be very accurate, but I'm just getting started with data science and ML. Any suggestions or links are appreciated. I prefer Java, but Python, PHP or any other language also works for me. I need an easy-to-implement model, as I am a beginner. Thanks in advance.

sns

1 Answer

My intuition is that this question might not be at the right address.

BUT

I would do the following:

  1. Create a dataset of sites that serves as the inventory from which to predict. For each site, list one or more features: number of tags, number of posts, average time between posts in days, etc.
    It sounds like this is for training and you are not too worried about accuracy, so numeric features should suffice.
  2. Work back from a k-NN algorithm. Don't worry about the classifiers. Instead of classifying a blog, list its 3 closest neighbors (k = 3). A good implementation of the algorithm is here. Have fun simplifying it for your purposes.

Your algorithm should be a step or two shorter than k-NN, which is considered one of the simpler ML algorithms and a good place to start.
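A minimal sketch of that shortened k-NN, assuming each blog has already been reduced to a numeric feature vector. The feature choice (number of tags, read time, number of categories), the blog ids other than 11, and the function names are all made up for illustration:

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_blogs(target_id, features, k=3):
    """Return the ids of the k blogs closest to target_id.

    This is k-NN without the voting/classification step: we only
    rank the neighbors and return them.
    """
    target = features[target_id]
    distances = [
        (euclidean(target, vec), blog_id)
        for blog_id, vec in features.items()
        if blog_id != target_id  # a blog is not its own recommendation
    ]
    distances.sort()  # smallest distance first; ties broken by blog id
    return [blog_id for _, blog_id in distances[:k]]


# Hypothetical vectors: (number of tags, read time in minutes, number of categories)
features = {
    11: (3, 5, 1),
    12: (3, 6, 1),
    13: (1, 20, 4),
    14: (2, 5, 1),
    15: (4, 4, 2),
}

print(nearest_blogs(11, features))  # [12, 14, 15]
```

Features on very different scales (read time vs. tag count) will dominate the distance, so in practice it is probably worth normalizing each column first.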

Good luck.

EDIT:

You want to build a recommender engine using text, tags, numeric and maybe time-series data. This is a broad request. Just like you, when faced with this request, I'd need to dive into the data and research the best approach. Different approaches require different sets of data, e.g. collaborative vs. content-based filtering.

  • A few things may have been missed on the user side that can serve as a sort of rating. You do not need a login feature to get this information: cookie ID or IP-based DMA, GEO and viewing duration should be available to the web server.
  • On the blog side: you need to process the texts to identify related terms. For other blog features, see the examples I gave above.
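As a rough sketch of the text-processing step on the blog side, assuming TF-IDF as the term-weighting technique (one common choice, not the only one) and assuming `blog_content` holds base64-encoded HTML as in the sample. The helper names and the three tiny documents are invented for illustration:

```python
import base64
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def blog_text(blog_content_b64):
    """Decode the base64 field and strip the HTML tags."""
    html = base64.b64decode(blog_content_b64).decode("utf-8", errors="replace")
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Lowercase and keep alphabetic words only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """docs: {blog_id: [tokens]} -> {blog_id: {term: tf-idf score}}."""
    n = len(docs)
    df = Counter()  # number of documents each term appears in
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for blog_id, tokens in docs.items():
        tf = Counter(tokens)
        total = len(tokens) or 1
        scores[blog_id] = {
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return scores


# Made-up content standing in for a real blog_content value.
encoded = base64.b64encode(b"<section><p>Fashion week in Delhi</p></section>").decode("ascii")
docs = {
    1: tokenize(blog_text(encoded)),
    2: tokenize("Fashion trends this autumn"),
    3: tokenize("Cricket week"),
}
scores = tf_idf(docs)
```

Terms that appear in fewer blogs ("delhi") score higher than terms shared by many ("fashion"), which is what makes the high-scoring terms usable as a fingerprint of each blog.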

I am aware that this is a lot of hand-waving, but there's no actual code question here. To reiterate, my intuition is that this question might not be at the right address. I really want to help, but this is the best I can do.

EDIT 2:

If I understand your new comments correctly, each blog has the following for every other blog:

  • A Jaccard similarity coefficient.
  • A set of TF-IDF generated words with scores.
  • A Euclidean distance based on numeric data.
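For reference, the Jaccard coefficient in the first bullet is the size of the intersection divided by the size of the union. A small sketch, assuming the comma-separated id format of `blog_tags` / `blog_categories` from the sample; the second tag string is made up:

```python
def parse_ids(csv_field):
    """Turn a field like "1,4,6" into the set {"1", "4", "6"}."""
    return {part.strip() for part in csv_field.split(",") if part.strip()}


def jaccard(a, b):
    """|A intersect B| / |A union B|; defined here as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


tags_a = parse_ids("1,4,6")  # blog_tags from the sample blog
tags_b = parse_ids("4,6,9")  # hypothetical second blog
print(jaccard(tags_a, tags_b))  # 0.5 (2 shared ids out of 4 distinct ids)
```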

I would create a heuristic from these and allow the process to adjust the importance of each statistic.
The challenge would be to quantify the TF-IDF output (words with scores). You can treat the words above a certain score as tags and run another similarity analysis, or count the overlap.
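One way such a heuristic could look, as a sketch only: threshold the TF-IDF terms into a set, compare the sets with Jaccard, turn the Euclidean distance into a similarity, and take a weighted sum. The weights, the threshold and the `1 / (1 + distance)` conversion are all assumptions to be tuned, not recommended values:

```python
def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_terms(tfidf_scores, min_score):
    """Treat every term scoring at least min_score as a 'tag'."""
    return {term for term, score in tfidf_scores.items() if score >= min_score}


def combined_similarity(tag_sim, term_sim, distance, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three statistics.

    The Euclidean distance is mapped to a similarity in (0, 1] so that
    all three inputs point in the same direction (higher = more similar).
    """
    numeric_sim = 1.0 / (1.0 + distance)
    w_tag, w_term, w_num = weights
    return w_tag * tag_sim + w_term * term_sim + w_num * numeric_sim


# Made-up TF-IDF outputs for two blogs.
terms_a = top_terms({"fashion": 0.30, "delhi": 0.22, "the": 0.01}, min_score=0.1)
terms_b = top_terms({"fashion": 0.28, "runway": 0.25, "a": 0.02}, min_score=0.1)

term_sim = jaccard(terms_a, terms_b)  # 1 shared term out of 3 distinct terms
score = combined_similarity(tag_sim=0.5, term_sim=term_sim, distance=1.0)
```

Ranking all other blogs by `score` and keeping the top 3 gives the recommendations; adjusting `weights` is how the importance of each statistic would be tuned.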

You already started on this path, and this answer assumes you will continue. IMO the best path is to see which dedicated recommender engines can help you without constructing statistics piecemeal (numeric w/ Euclidean, tags w/ Jaccard, text w/ TF-IDF).

AChervony
  • Thanks for your response. This is for my own website's blogs. I don't have a login feature as of now, so I don't have user data. I am worried about accuracy based on the various properties of each blog, such as title, categories, tags, author, textual body, etc. I'll take a look at the link you suggested. I was also thinking about using the k-NN algorithm. It would be really great if you could help/suggest based on my requirements above. Thanks – sns May 04 '18 at 11:10
  • Added notes to the answer. – AChervony May 15 '18 at 17:26
  • I have added one sample blog data above. Hope it helps in understanding my current situation. Thanks a lot already. – sns May 16 '18 at 13:53
  • As of now, I have used Jaccard similarity for categorical data, TF-IDF for textual data and Euclidean distance for numeric data. Let me know if I can do anything better in those places. – sns May 16 '18 at 13:56
  • Some more hand-waving. Hopefully more specific and relevant. – AChervony May 16 '18 at 18:26