I have a Last.fm dataset of songs and the tags that users have assigned to them. I want to cluster the dataset in order to find groups of songs based on their tags.

The dataset has 200k songs and 119k distinct tags. My first idea was to build an N×M matrix, where N is the number of songs and M is the number of tags, with each entry set to 1 if the song has the tag and 0 otherwise. However, the sheer size of that matrix has stopped me from doing so. I have considered applying SVD to reduce the dimensionality before clustering, but I am not sure it is the best approach.
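To make the size concern concrete, here is a rough sketch of the arithmetic, together with a sparse alternative that stores only the tags each song actually has. The helper name `build_sparse` and the toy songs are hypothetical, used only for illustration.

```python
# Dense size: one cell per (song, tag) pair.
N_SONGS, N_TAGS = 200_000, 119_000
dense_cells = N_SONGS * N_TAGS  # 23.8 billion cells


def build_sparse(song_tags):
    """Store only the 1-entries: for each song, the column indices of its tags.

    song_tags: dict mapping a song id to a set of tag strings (toy input).
    Returns (rows, tag_index), where rows[i] is a sorted list of column
    indices for the i-th song and tag_index maps each tag to its column.
    """
    tag_index = {}
    rows = []
    for song, tags in song_tags.items():
        # Assign a new column index the first time a tag is seen.
        cols = [tag_index.setdefault(t, len(tag_index)) for t in sorted(tags)]
        rows.append(sorted(cols))
    return rows, tag_index


# Toy data, not taken from the real dataset.
song_tags = {"song_a": {"hip hop", "90s"}, "song_b": {"hip hop"}}
rows, tag_index = build_sparse(song_tags)
stored_entries = sum(len(r) for r in rows)
```

With this layout the memory cost grows with the number of (song, tag) assignments rather than with N×M.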

Does anybody know of work in the literature that attempts this kind of clustering? Or does anyone have other ideas for my problem?

Thank you very much in advance

Thiago

1 Answer

Clustering is probably the wrong tool for your problem.

Are you sure you want to split your data into (usually) non-overlapping chunks? What if some overlap is needed? Say there are songs that are both "hip hop" and "driving beats", but these tags are not synonyms.

Frequent itemset mining (market basket analysis) is much more applicable, isn't it?

Consider every song to be a "market basket" and every tag to be an "item" in these transactions. Frequent itemset mining (FIM) will then identify frequent tag combinations and derive patterns from them.
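As a rough sketch of the idea (an Apriori-style counter in plain Python, not a tuned implementation): each song is a basket of tags, and a tag set is kept if it occurs in at least `min_support` songs. The function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the toy songs are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, min_support, max_size=3):
    """Return {frozenset(tags): count} for tag sets found in >= min_support baskets."""
    baskets = [frozenset(b) for b in baskets]

    # Level 1: count single tags and keep the frequent ones.
    counts = Counter(tag for b in baskets for tag in b)
    current = {frozenset([t]): c for t, c in counts.items() if c >= min_support}
    result = dict(current)

    size = 2
    while current and size <= max_size:
        # Only tags that still occur in some frequent set can extend it.
        frequent_tags = set().union(*current)
        counts = Counter()
        for b in baskets:
            kept = sorted(b & frequent_tags)
            for combo in combinations(kept, size):
                # Apriori pruning: every (size-1)-subset must itself be frequent.
                if all(frozenset(sub) in current
                       for sub in combinations(combo, size - 1)):
                    counts[frozenset(combo)] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        size += 1
    return result


# Toy data: every song is a "market basket" of tags.
songs = {
    "song_a": {"hip hop", "driving beats", "90s"},
    "song_b": {"hip hop", "driving beats"},
    "song_c": {"hip hop", "90s"},
    "song_d": {"rock", "driving beats"},
    "song_e": {"hip hop", "driving beats", "rock"},
}
patterns = frequent_itemsets(songs.values(), min_support=3)
```

On this toy input, "hip hop" and "driving beats" come out as a frequent pair even though neither tag implies the other, which is the kind of overlap a hard partition would hide. For 200k songs a dedicated FIM implementation (for example, one based on FP-growth) would likely be a better fit than this loop.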

Has QUIT--Anony-Mousse