
I have a dataset containing coordinates and categorical data, like the sample below:

[dataset screenshot]

I have searched many papers and journals for guidance on which distance measure to apply to my dataset with the DBSCAN algorithm. The dataset is mixed: Latitude and Longitude (coordinates) plus Jenis Kecelakaan (Accident Type) as categorical data. This is where I'm stuck: how do we cluster a mixed dataset like the one above? Are there any recommendations for a distance measure that works well with DBSCAN in my case?

I've been stuck on this problem for days. Please help me out by giving me an explanation or a link to a paper, journal, or blog post (e.g., Medium or Towards Data Science).

HelpMe
  • Welcome to Stack Overflow. Please add some of your attempts and decisions to the question. Your question is too broad right now because it doesn't make clear which problem should be solved. – Dmitrii Sidenko Jun 14 '21 at 16:37
  • @ДмитрийСиденко thanks a lot for the advice. I've added a detailed explanation of my problem; I hope it makes things clearer. – HelpMe Jun 15 '21 at 03:01
  • Please list all the accident types and their English descriptions; that will help people on this forum. This question is quite subjective, in my opinion: the distance between different accident types is entirely up to the analyst. For example, Case 1: two pedestrian accidents separated by 1 km. Case 2: one pedestrian accident and one auto crash at the same location. Which of these two should belong to the same cluster? (i.e., is the 'distance' between a pedestrian accident and an auto crash more than 1 km if we converted all distances to Euclidean ones?) This is up to the analyst to decide. – Joydeep Sen Sarma Jun 15 '21 at 17:25
  • I want to do clustering with DBSCAN using 3 features (lat, long, accident_type), where accident_type is categorical. I want to cluster the locations based on accident_type instead of only lat and long, but I am still confused about which distance metric / data transformation I should use to fit the DBSCAN algorithm. Would you mind giving me some tips about this? – HelpMe Jul 05 '21 at 03:15

3 Answers


I prefer using one-hot encoding, e.g. with pandas get_dummies:

import pandas as pd

your_df = pd.read_csv('./your_data.csv')

# generate binary values using get_dummies
dum_df = pd.get_dummies(your_df, columns=["Jenis Kecelakaan"])

dum_df.head()
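
To then feed the mixed features to DBSCAN, one option (a minimal sketch, not part of the original answer; the column names, eps, and min_samples below are assumptions to adjust for your data) is to combine the coordinates with the dummy columns and cluster the resulting numeric matrix:

import numpy as np
from sklearn.cluster import DBSCAN

# combine the coordinate columns with the one-hot columns produced above
# ('Latitude' / 'Longitude' are assumed column names)
features = dum_df[['Latitude', 'Longitude']].join(dum_df.filter(like='Jenis Kecelakaan'))

# plain Euclidean distance mixes degrees with 0/1 dummies, so the relative
# scale of the columns implicitly weights them; eps here is illustrative only
db = DBSCAN(eps=0.05, min_samples=5).fit(features.to_numpy())
print(np.unique(db.labels_))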
Maytham
  • Thank you so much for responding to my question. Can one-hot-encoded features be clustered along with coordinate data in DBSCAN? If so, do you have any recommendations for a distance measure for coordinates + one-hot-encoded data? Hope you are willing to answer. – HelpMe Jun 15 '21 at 02:51
  • One-hot encoding converts your categorical features to numeric, so yes, you can cluster them with DBSCAN. I think each clustering algorithm (k-means, DBSCAN, etc.) has its own distance measure, so I don't think you need to do anything about that. – Maytham Jun 15 '21 at 05:42

Try it this way.

# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint


# define the number of kilometers in one radian
kms_per_radian = 6371.0088


# load the data set
df = pd.read_csv('C:\\travel-gps-full.csv', encoding = "ISO-8859-1")
df.head()


# how many rows are in this data set?
len(df)


# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)

[scatterplot of the raw GPS points]

# represent points consistently as (lat, lon)
df_coords = df[['lat', 'lon']]

# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian


start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)

# get the number of clusters
num_clusters = len(set(cluster_labels))


# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))

# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):
    
    size = 150
    if cluster_label == -1: #make the noise (which is labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30
    
    # plot only the points that match the current cluster label
    mask = (cluster_labels == cluster_label)
    ax.scatter(x=df_coords['lon'][mask], y=df_coords['lat'][mask],
               c=[color], edgecolor='k', s=size, alpha=0.5)

ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()

[scatterplot of points color-coded by cluster, noise in gray]

coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))


# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian

# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)

# get the number of clusters
num_clusters = len(set(cluster_labels))

# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))


# Result:
Silhouette coefficient: 0.854
Clustered 1,759 points down to 138 clusters, for 92.2% compression in 0.17 seconds

coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))



# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
print('Number of clusters: {}'.format(num_clusters))


# Result:
Number of clusters: 138


# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([df_coords[cluster_labels == n] for n in range(num_clusters)])
clusters.tail()
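
The post linked below goes on to reduce each cluster to the point nearest its centroid, which is what the otherwise-unused MultiPoint and great_circle imports above are for. A sketch along those lines, assuming the clusters series built above:

def get_centermost_point(cluster):
    # cluster is a DataFrame of (lat, lon) rows for one cluster;
    # return the row closest to the cluster's centroid
    points = cluster.to_numpy()
    centroid = (MultiPoint(points).centroid.x, MultiPoint(points).centroid.y)
    return tuple(min(points, key=lambda point: great_circle(point, centroid).m))

# one representative point per cluster
centermost_points = clusters.map(get_centermost_point)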

data: https://github.com/gboeing/2014-summer-travels/tree/master/data

sample code: https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/

ASH
  • Thank you so much for the explanation, but I'm sorry, I think I didn't explain the main focus of my problem in detail. I have already clustered my data by coordinates (lat and long). But I want to improve my analysis by adding the "Accident Type" feature, which is categorical, alongside the latitude and longitude features (numerical). Here's my problem: how do we cluster a mixed dataset (coordinates + categorical data) in DBSCAN? Is it even possible? Hope you are willing to explain. – HelpMe Jun 15 '21 at 02:46
  • You can add as many features as you want, but as you add more and more, it will probably be less and less easy to interpret the results. data = np.asarray([np.asarray(dataframe['Lat']), np.asarray(dataframe['Lon']), np.asarray(dataframe['Feature1']), np.asarray(dataframe['Feature2'])]).T – ASH Jun 15 '21 at 02:57
  • I want to do clustering with DBSCAN using 3 features (lat, long, accident_type), where accident_type is categorical. I want to cluster the locations based on accident_type instead of only lat and long, but I am still confused about which distance metric / data transformation I should use to fit the DBSCAN algorithm. Would you mind giving me some tips about this? – HelpMe Jul 05 '21 at 03:14
  • You can use one-hot encoding or label encoding for that categorical feature. See the link below for all the details: https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/ – ASH Jul 08 '21 at 18:05

I had the same question, and this is the best link I could find online. It's a bit complex, but I think creating the distance matrix yourself, as suggested in the link, is the best option I'm aware of.

Many ML algorithms build a distance matrix internally to find neighbors. Here you build it yourself: compute one pairwise distance matrix from lat/long using the Haversine formula, compute another distance matrix for the categorical feature, combine the two into a single matrix (for example, a weighted sum), and pass it to the model as a precomputed distance.
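
A minimal sketch of that approach with scikit-learn's DBSCAN, which accepts metric='precomputed'. The column names, the weight w on a category mismatch, and the eps value are all assumptions you would tune for your data:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

# hypothetical file with Latitude, Longitude, Jenis Kecelakaan columns
df = pd.read_csv('your_data.csv')

# spatial part: pairwise great-circle distances in kilometers
coords = np.radians(df[['Latitude', 'Longitude']].to_numpy())
spatial_km = haversine_distances(coords) * 6371.0088

# categorical part: 0 if same accident type, 1 otherwise (simple matching)
types = df['Jenis Kecelakaan'].to_numpy()
cat_mismatch = (types[:, None] != types[None, :]).astype(float)

# combine: a mismatch in accident type "costs" as much as w kilometers;
# w is an assumption reflecting how much a type difference should matter
w = 5.0
dist = spatial_km + w * cat_mismatch

# eps is now in the same units (km) as the combined matrix; illustrative only
db = DBSCAN(eps=2.0, min_samples=5, metric='precomputed').fit(dist)
print(np.unique(db.labels_))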

user3665906