
I have a dataset of records where each record has 5 labels, and the labels differ in importance.

I know how to order the labels by importance but I don't know the exact weights, so the difference between two records looks like: a*dist of label1 + b*dist of label2 + c*dist of label3, such that a + b + c = 1.
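The weighted difference described above could be sketched roughly as follows. The per-label distance functions (`abs_diff`, `mismatch`), the sample records, and the weight values are all hypothetical; only the form "weighted sum with weights adding up to 1" comes from the question.

```python
def weighted_distance(rec1, rec2, weights, label_dists):
    """Weighted sum of per-label distances; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(
        w * dist(x, y)
        for w, dist, x, y in zip(weights, label_dists, rec1, rec2)
    )


# Hypothetical per-label distances, each returning a value in [0, 1].
def abs_diff(x, y):
    return min(abs(x - y), 1.0)


def mismatch(x, y):
    return 0.0 if x == y else 1.0


# a, b, c ordered by importance; the actual values are guesses to be tuned.
weights = (0.5, 0.3, 0.2)
label_dists = (abs_diff, mismatch, mismatch)

# Two made-up records with three labels each.
r1 = (0.10, "morning", "workday")
r2 = (0.40, "evening", "workday")
d = weighted_distance(r1, r2, weights, label_dists)
```

Keeping every per-label distance in the same range (here 0 to 1) means the weights alone decide how much each label contributes.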

The dataset contains around 3000 records and I want to cluster it in some way (I don't know the number of clusters).

I thought about DBSCAN, but it does not work well with high-dimensional data.

Hierarchical clustering needs to know the number of clusters, and I also think it depends on which record you compare to first, so the result might be wrong in this case.

I also looked at graph clustering, where the difference between two records would be the weight of the edge between the two nodes, but I didn't find an algorithm that does that.
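One simple graph-based option, sketched here under assumptions, is to connect every pair of records whose difference is at most some threshold and take the connected components as clusters. This gives the same partition as cutting a single-linkage hierarchical clustering at that height, needs no cluster count, and the resulting grouping does not depend on the order of the records. The function name `threshold_graph_clusters`, the threshold `eps`, and the toy data are illustrative, not something from the question.

```python
def threshold_graph_clusters(records, distance, eps):
    """Link every pair with distance <= eps, then label connected components."""
    n = len(records)
    parent = list(range(n))  # union-find forest, one node per record

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs: about 4.5 million comparisons for 3000 records.
    for i in range(n):
        for j in range(i + 1, n):
            if distance(records[i], records[j]) <= eps:
                parent[find(i)] = find(j)

    # Number the components 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy one-dimensional data; a real run would pass the weighted record distance.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
labels = threshold_graph_clusters(points, lambda a, b: abs(a - b), eps=0.15)
```

Single linkage tends to chain nearby groups together, so the threshold needs tuning if places like a gym and a mall sit close to each other.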

EDIT:

The data is CDR (call detail record) data representing the antennas a user connected to while using their cellphone for calls, SMS and internet, so the labels are:

location (longitude, latitude), part_of_day (night, morning-noon, afternoon, evening),
workday/weekend, day_of_week, number of days of connection to this antenna
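For labels of mixed type like these, each one needs its own notion of distance. The sketch below is one possible choice, not a prescribed one: great-circle (haversine) distance for location, a 0/1 mismatch for categorical labels, and a cyclic gap for day_of_week. The 5 km cap, the 0..6 day encoding, and all function names are assumptions.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (longitude, latitude) pairs in degrees."""
    lon1, lat1 = map(radians, p)
    lon2, lat2 = map(radians, q)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def location_dist(p, q, cap_km=5.0):
    """Location distance scaled to [0, 1]; cap_km is an assumed cutoff."""
    return min(haversine_km(p, q), cap_km) / cap_km


def category_dist(x, y):
    """0 if two categorical values (part_of_day, workday/weekend) match, else 1."""
    return 0.0 if x == y else 1.0


def day_of_week_dist(d1, d2):
    """Cyclic gap between days encoded 0..6, scaled to [0, 1] (largest gap is 3 days)."""
    gap = abs(d1 - d2) % 7
    return min(gap, 7 - gap) / 3.0
```

Treating day_of_week as cyclic makes Sunday and Monday neighbours; a plain mismatch would also be a reasonable choice if the order of days carries no meaning for the user.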

I want to cluster the data to detect this user's points of interest, such as a gym, a mall, etc. In particular, I want the clustering to separate the gym from the mall even when they are close to each other, because they are different activities.

Any ideas about how to do it?

Konrad Rudolph
Roy Ancri
  • So your dataset has 3000 records but only 5 columns? In that case it is not high-dimensional. Maybe you could be more precise about your columns and the types of values they can take by providing some sample data? – PlasmaBinturong Dec 09 '19 at 12:23
  • Agglomerative clustering doesn't need to know the number of clusters per se; however, e.g. the scikit-learn implementation has that as an option (but you can likely inspect the underlying graph). You could also have a look at hdbscan, which works similarly to dbscan but should scale better. – user2653663 Dec 09 '19 at 12:59
  • @user2653663 I will check it out. Thank you. Does it have the option to assign weights to each dimension? – Roy Ancri Dec 09 '19 at 13:13
  • In both cases, you can just scale your input features before training/predicting. Since they are both using distance metrics, this should have the same effect. – user2653663 Dec 09 '19 at 13:53
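The feature-scaling suggestion in the last comment can be checked with a small sketch. It assumes numeric features and a Euclidean metric: multiplying feature j by sqrt(w_j) makes the plain squared Euclidean distance equal to the weighted one. The weights and vectors below are made up for illustration.

```python
from math import sqrt


def weighted_sq_euclidean(x, y, weights):
    """Squared Euclidean distance with an explicit weight per feature."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))


def scale_features(x, weights):
    """Multiply each feature by sqrt(weight) before handing it to a clusterer."""
    return [sqrt(w) * a for w, a in zip(weights, x)]


def sq_euclidean(x, y):
    """Plain squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


weights = [0.5, 0.3, 0.2]  # illustrative values only
x = [1.0, 0.0, 2.0]
y = [0.0, 1.0, 0.0]

lhs = weighted_sq_euclidean(x, y, weights)
rhs = sq_euclidean(scale_features(x, weights), scale_features(y, weights))
```

Note that this yields a weighted Euclidean distance, not the weighted sum of per-label distances from the question; with a Manhattan metric, scaling each feature by w_j itself would give the sum of w_j * |x_j - y_j|. Categorical labels would first need a numeric encoding for either variant.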

0 Answers