2

The MovieLens data set provides a table with columns:

userid | movieid | tag | timestamp

I have trouble reproducing the way they pruned the MovieLens data set used in:

Tag Informed Collaborative Filtering, by Zhen, Li and Young

In 4.1 Data Set of the above paper, it writes "For the tagging information, we only keep those tags which are added on at least 3 distinct movies. As for the users, we only keep those users who used at least 3 distinct tags in their tagging history. For movies, we only keep those movies that are annotated by at least 3 distinct tags."

I tried to query the database:

select TMP.userid, count(*) as tagnum
from (select distinct T.userid as userid, T.tag as tag from tags T) AS TMP 
group by TMP.userid
having tagnum >= 3;

I got a list of 1760 users who labeled 3 distinct tags. However, some of the tags are not added on at least 3 distinct movies.

Any help is appreciated.

Yuval F
  • 20,565
  • 5
  • 44
  • 69
  • What RDBMS are you using? Also is the effect meant to be cumulative or independent? e.g. If a tag is discarded as not on at least 3 distinct movies can this deletion affect the calculation of those users who used at least 3 distinct tags in their tagging history? (Independent seems to make more sense) – Martin Smith Jun 06 '10 at 11:41
  • I am using MYSQL. If a tag is discarded as not on at least 3 distinct movies, I think the deletion should affect the calculation of those users who used at least 3 distinct tags in their tagging history. The author of the paper got a list of 757 users. If it is independent, one should have gotten 1760 by using the SQL query I typed in the first post. – Patrick Chan Jun 06 '10 at 13:12

1 Answers1

0

You aren't anywhere limiting the movies per tag anywhere. It seems like you should first discard tags which have not been used on at least three movies and by three users. Then limit to users who have tagged three times.

This query should give you the tags that are both tagged by three+ users, and on three+ movies:

select T1.tag,
       (select count( distinct T2.movieid ) from tags T2 where T2.tag = T1.tag) as mcount,
       (select count( distinct T3.userid ) from tags T3 where T3.tag = T1.tag) as ucount
from tags T1
having mcount >= 3 and ucount >= 3;

If you query instead by users, and use the whole thing as a subquery, you should be able to check for the users who also have tagged three times:

select T4.user, count(*) as ucount from
 (select T1.userid as user,
         (select count( distinct T2.movieid ) from tags T2 where T2.tag = T1.tag) as mcount,
         (select count( distinct T3.userid ) from tags T3 where T3.tag = T1.tag) as ucount
  from tags T1
  having mcount >= 3 and ucount >= 3) as T4
group by user
having ucount > 3;
asc99c
  • 3,815
  • 3
  • 31
  • 54