
I'm doing a project on categorizing users based on their surfing patterns on a site.

For this I need to find patterns in the data and then cluster them. The clustering is the problem: the algorithms I tried (k-means, agglomerative, and DBSCAN) don't accept variable-length lists as input data.

I have lists of the pages visited, one list per session.

Example:

data = [[1, 2, 5],
        [2, 4],
        [2, 3],
        [1, 2, 4],
        [1, 3],
        [2, 3],
        [1, 3],
        [7, 8, 9],
        [9, 8, 7],
        [1, 2, 3, 5],
        [1, 2, 3]]

Each list represents a session of visited pages. Each number stands for a part of the URL.

Example:

1 = '/home'
2 = '/blog'
3 = '/about-us'
...

I put the data through a pattern mining script.

Code:

import pyfpgrowth # pip install pyfpgrowth

data = [[1, 2, 5],
        [2, 4],
        [2, 3],
        [1, 2, 4],
        [1, 3],
        [2, 3],
        [1, 3],
        [7, 8, 9],
        [9, 8, 7],
        [1, 2, 3, 5],
        [1, 2, 3]]

# find itemsets that appear in at least 2 sessions (minimum support of 2)
patterns = pyfpgrowth.find_frequent_patterns(data, 2)
print(patterns)

# derive association rules with a minimum confidence of 0.7
rules = pyfpgrowth.generate_association_rules(patterns, 0.7)
print(rules)

Result:

# print(patterns)

{(1,): 6,
 (1, 2): 4,
 (1, 2, 3): 2,
 (1, 2, 5): 2,
 (1, 3): 4,
 (1, 5): 2,
 (2,): 7,
 (2, 3): 4,
 (2, 4): 2,
 (2, 5): 2,
 (4,): 2,
 (5,): 2,
 (7,): 2,
 (8,): 2,
 (9,): 2}

# print(rules)

{(1, 5): ((2,), 1.0),
 (2, 5): ((1,), 1.0),
 (4,): ((2,), 1.0),
 (5,): ((1, 2), 1.0)}

According to a paper I'm using (page 118, chapter 4.3), the next step would be to use the found patterns as input for the clustering algorithm, but as far as I know clustering algorithms don't accept variable-length lists as input.

I have tried this, but it didn't work.

Code:

from sklearn.cluster import KMeans

# fails: `patterns` is a dict mapping tuples to support counts,
# not the 2-D numeric array that fit() expects
kmeans = KMeans(n_clusters=4, random_state=0).fit(patterns)

test = [1, 8, 2]

# would also fail: predict() expects a 2-D array of samples
print(kmeans.predict(test))

What should I do so that the k-means algorithm can predict which group a surfing pattern belongs to, or is there another algorithm better suited to this?

Thanks in advance!

Ben Blanc

1 Answer


Both HAC and DBSCAN could be used with lists.

You just need to compute the distance matrix yourself, because you obviously cannot use Euclidean distance on this data. Instead, you could consider Jaccard distance, for example.

K-means cannot be used. It needs continuous data in R^d.
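
In code, that could look something like the following (a minimal sketch, not code from the question or answer; the cluster count, linkage, and eps values are illustrative guesses):

import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

data = [[1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3], [2, 3],
        [1, 3], [7, 8, 9], [9, 8, 7], [1, 2, 3, 5], [1, 2, 3]]

# pairwise Jaccard distance, 1 - |A & B| / |A | B|, treating each session as a set
sets = [set(s) for s in data]
n = len(sets)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1.0 - len(sets[i] & sets[j]) / len(sets[i] | sets[j])
        dist[i, j] = dist[j, i] = d

# both algorithms accept a precomputed distance matrix
# (older scikit-learn versions spell the argument affinity="precomputed")
hac = AgglomerativeClustering(n_clusters=3, metric="precomputed", linkage="average")
print(hac.fit_predict(dist))

db = DBSCAN(eps=0.5, min_samples=2, metric="precomputed")
print(db.fit_predict(dist))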

Has QUIT--Anony-Mousse
  • Thank you for the response, I'll try this out. – Ben Blanc May 08 '19 at 07:03
  • I have tried both Hierarchical Agglomerative Clustering and DBSCAN. Neither accepts lists in the form of the example above. Is there a way you can suggest to put the data in the right format for use in the clustering methods mentioned above? – Ben Blanc May 08 '19 at 12:38
  • As I wrote, *you* need to compute the distance matrix yourself and provide it as input to HAC or DBSCAN. For example, compute a matrix of Jaccard coefficients. – Has QUIT--Anony-Mousse May 08 '19 at 18:27
  • I have made a similarity matrix, gave it to the mentioned clustering models, and it worked! I also tried it with a matrix of Euclidean distances, which also worked. Was there a reason why you thought Euclidean distance wouldn't work? Perhaps a second question: how does a clustering algorithm know how to cluster based on that matrix? It doesn't look at all like my original dataset. Do you have a suggestion for testing whether it clusters correctly? – Ben Blanc May 10 '19 at 09:33
  • How did you generate a matrix of Euclidean distances? That distance is defined on R^p and I can't see how your data could be a vector space like that. – Has QUIT--Anony-Mousse May 11 '19 at 07:05
  • I made a dataframe whose columns are the patterns found through PrefixSpan on my data, with a 1 or 0 per row (session) depending on whether the column's pattern is a subsequence of that session. I used that dataframe for calculating the distance matrix the Jaccard way and also the Euclidean way. (A sketch of this encoding appears after these comments.) – Ben Blanc May 13 '19 at 07:31
  • import pandas as pd
    from scipy.spatial.distance import euclidean, pdist, squareform

    def similarity_func(u, v):
        return 1 / (1 + euclidean(u, v))

    dists = pdist(df_data, similarity_func)
    df_euclid = pd.DataFrame(squareform(dists), columns=df_data.index, index=df_data.index)
    print(df_euclid)

    – Ben Blanc May 13 '19 at 07:32
  • The previous comment is the code used for calculating the distance matrix the Euclidean way. I got it from this post: https://stackoverflow.com/questions/35758612/most-efficient-way-to-construct-similarity-matrix#comment59191999_35758999 – Ben Blanc May 13 '19 at 07:34
  • The approach you chose adds some very weird bias, based on the patterns found by PrefixSpan. I am not a fan of dummy coding either. I'd rather use the real Jaccard on sets. You're usually better off with something *explainable*. – Has QUIT--Anony-Mousse May 13 '19 at 18:16
  • What do you mean by "I'd rather use the real Jaccard on sets"? The reason I'm doing it the way described before is that I got errors saying the clustering model doesn't accept input where the columns aren't as long as the rows. Clustering based on the Jaccard matrix of my dataframe with frequent patterns as features didn't go well: I set it to 10 clusters, but everything seemed to belong to cluster 0. Clustering based on the Euclidean matrix of my dataframe with frequent patterns as features went well: 10 clusters and the data was clustered in 10 groups. – Ben Blanc May 14 '19 at 09:47
  • It's easy to get a bad result with k-means... Make sure to set *all* the parameters correctly. If you have a proper distance matrix, it will A) have the same length, so that error cannot arise, and B) be, by definition, a square matrix anyway; and when configured to use a distance and not (!) a data matrix, the algorithm should check that it is square. – Has QUIT--Anony-Mousse May 14 '19 at 18:40
  • The clustering I meant in my earlier comment was hierarchical agglomerative clustering (I also tried it on DBSCAN). Neither gave good results with Jaccard, but with Euclidean it seems to do well. – Ben Blanc May 15 '19 at 13:53
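
For reference, a minimal sketch of the dummy-coding step described in the comments above (a reconstruction, not Ben's actual code): one column per frequent pattern, one row per session, with a simple subset test standing in for the PrefixSpan subsequence test. The pattern list here is illustrative.

import pandas as pd
from scipy.spatial.distance import pdist, squareform

data = [[1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3], [2, 3],
        [1, 3], [7, 8, 9], [9, 8, 7], [1, 2, 3, 5], [1, 2, 3]]
patterns = [(1,), (2,), (7,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]  # e.g. from the mining step

# binary encoding: 1 if all pages of the pattern occur in the session, else 0
df_data = pd.DataFrame(
    [[int(set(p) <= set(s)) for p in patterns] for s in data],
    columns=[str(p) for p in patterns])

# Jaccard distances on the binary rows, as the answerer suggested
df_jaccard = pd.DataFrame(squareform(pdist(df_data, metric="jaccard")),
                          index=df_data.index, columns=df_data.index)
print(df_jaccard.round(2))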