
I am new to clustering techniques and would highly value any input on my problem below. Basically, I want to cluster URLs based on their structural patterns, for example:

  • cluster1 - simple URLs https://domain/path/file
  • cluster2 - shortened URLs
  • cluster3 - redirect URLs
  • ....
  • cluster k - new URL pattern

Given a URL dataset, I want to understand how many different URL pattern clusters exist and then visually see the differences between them.

The existing methods I have seen cluster URLs domain-wise (URLs of the same website end up in the same cluster), which is not what I want. The same thing happens when I try NLP-based (word-based) similarity clustering, because URLs from the same website tend to share the same words with only small differences.

Instead, I want to focus on the URL structure and identify URL patterns. Removing all the special characters and just creating a bag of words for each URL nullifies the URL structure. Can anyone help me identify a suitable clustering technique, as well as a vectorizing technique, to find the different URL pattern clusters?
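To show what I mean, here is a minimal sketch of the bag-of-words preprocessing I tried (the URLs are made-up examples). Splitting on special characters throws away exactly the delimiters that define the structure:

import re

# Made-up example URLs covering three different structural patterns.
urls = [
    "https://example.com/blog/2021/post.html",             # simple path URL
    "https://bit.ly/3xYzAbC",                              # shortened URL
    "https://example.com/go?url=https://other.com/page",   # redirect URL
]

for url in urls:
    # Strip special characters and keep a bag of word tokens.
    # This loses the :// / ? = delimiters that carry the structure.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]
    print(tokens)

After this step, only the shared vocabulary is left, so clustering groups URLs from the same website rather than URLs with the same shape.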

Thanks in advance, Matheesha

  • @Erwan Hopefully this question clarifies my intention for URL clustering. Please have a look and advise. – Mathee Dec 29 '22 at 01:45
  • The [on-topic guide](https://stackoverflow.com/help/on-topic) suggests that questions should be about specific programming problems, and that questions asking for general software or tool recommendations are likely to lead to opinion-based answers. This question might be improved with some code showing what you've tried so far, or if it focuses on a specific programming problem. – Alexander L. Hayes Dec 29 '22 at 02:27

1 Answer


Here is an example of how to cluster strings by edit distance using Affinity Propagation; swap the sample words for your URLs.

import numpy as np
from sklearn.cluster import AffinityPropagation
import distance  # the "distance" package from PyPI

# Sample vocabulary to cluster; replace these words with your own strings (e.g. URLs).
words = "kitten belly squooshy merley best eating google feedback face extension impressed map feedback google eating face extension climbing key".split(" ")
words = np.asarray(words)  # so that indexing with a list will work

# Pairwise similarity matrix: Affinity Propagation expects similarities,
# so the Levenshtein distances are negated (larger = more similar).
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

# Print each cluster along with its exemplar (its most representative member).
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Result:

 - *eating:* climbing, eating
 - *google:* google, squooshy
 - *feedback:* feedback
 - *face:* face, map
 - *impressed:* impressed
 - *extension:* extension
 - *key:* belly, best, key, kitten, merley
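For URL patterns specifically, the same recipe can be pointed at structure instead of shared words. Below is a minimal, untested sketch (the URLs and the skeleton rule are made-up assumptions): each URL is reduced to a structural "skeleton" that keeps the delimiters and collapses alphanumeric runs, and the Levenshtein similarity is computed on those skeletons.

import re
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance  # the "distance" package from PyPI

def skeleton(url):
    # Replace every alphanumeric run with a placeholder, keeping the
    # :// / ? = . delimiters, so only the URL's shape remains.
    return re.sub(r"[A-Za-z0-9]+", "x", url)

# Made-up example URLs; replace with your dataset.
urls = np.asarray([
    "https://example.com/path/file.html",
    "https://example.org/dir/page.html",
    "https://bit.ly/3xYzAbC",
    "https://tinyurl.com/abcd123",
    "https://example.com/redirect?url=https://other.com/a",
])

skeletons = [skeleton(u) for u in urls]
lev_similarity = -1 * np.array(
    [[distance.levenshtein(s1, s2) for s1 in skeletons] for s2 in skeletons]
)

# random_state needs scikit-learn >= 0.23; it just makes the run reproducible.
affprop = AffinityPropagation(affinity="precomputed", damping=0.5, random_state=0)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = urls[affprop.cluster_centers_indices_[cluster_id]]
    members = urls[np.nonzero(affprop.labels_ == cluster_id)]
    print(" - *%s:* %s" % (exemplar, ", ".join(members)))

The idea is that two URLs with the same shape (e.g. two shorteners) get near-identical skeletons even when they share no vocabulary, which is the opposite behaviour of word-based clustering.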