
I am using DBSCAN on my training dataset (7,697 rows, 8 columns) to find outliers and remove them before training my model. Here is my code:

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(X_train[all_features])
model = DBSCAN(eps=0.3, min_samples=10).fit(X)
print(model)

# Drop the rows DBSCAN labelled as noise (-1)
X_train_1 = X_train.drop(X_train[model.labels_ == -1].index).copy()
X_train_1.reset_index(drop=True, inplace=True)

Q-1: Some of these features are discrete and some are continuous. Is it OK to scale both the discrete and the continuous features, or only the continuous ones? Q-2: Do I need to map the clustering learned on the training data onto the test data?
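One option for Q-1, sketched below with made-up column names standing in for the real features: use a `ColumnTransformer` so only the continuous columns are standardised while the discrete columns pass through untouched. Whether this is appropriate depends on the data; it is a sketch, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real training data (column names are placeholders)
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, 100),   # continuous
    "age": rng.normal(40, 12, 100),              # continuous
    "num_children": rng.integers(0, 5, 100),     # discrete
})

continuous = ["income", "age"]

ct = ColumnTransformer(
    [("scale", StandardScaler(), continuous)],
    remainder="passthrough",  # discrete columns pass through unscaled
)
X_scaled = ct.fit_transform(X_train)
# Scaled continuous columns now have zero mean and unit variance;
# num_children is unchanged in the last output column
```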

user172500
  • Welcome to StackOverflow. Make sure you take the [tour](https://stackoverflow.com/tour). If you are using DBSCAN just to remove outliers before training, then why should you run it on the test data? – gnodab May 11 '20 at 13:10
  • @gnodab thanks for answering. I kept the test set separate from the train data. I think that to avoid data leakage we should preprocess the train data and apply the same processing to the test data. Please correct me if I am wrong – user172500 May 11 '20 at 13:44
  • You are right. Typically DBSCAN is used on unlabelled data for clustering. In your case, one option might be to create a new "noise" label on your training set. Then, when you predict on test, the classifier could predict whether a test point is a noise point or not. – gnodab May 11 '20 at 13:51
  • @gnodab sorry if I am not clear; let me rephrase. Before training, I want to clean the outliers from my dataset using DBSCAN in order to get a more generalised form of the data. Once I am done on the train data, I would repeat the same on the test data. DBSCAN just gives -1 for outliers; everything else is not an outlier. From your suggestion I infer two algorithms: one learns the -1 outlier label, and the same one is then used on test to decide whether each test record is an outlier; if not, the record is kept for classification. Is that doable? – user172500 May 11 '20 at 14:09
  • You could do both. Just try it. I would be careful running DBSCAN on test with the same settings as on train, though. If your test dataset contains less data, the same parameters may give different results. That's why I suggested learning the outliers. – gnodab May 11 '20 at 14:14
  • @gnodab thanks for the clarification. Can you please give some suggestions regarding question 1? Of my 7 features, some are discrete and some are continuous; is it OK to scale both the discrete and the continuous ones, or just the continuous ones, before using DBSCAN? – gnodab was asked by user172500 May 11 '20 at 14:19
  • Sure. I don't have enough information about your dataset to know if it is OK to scale your features, but in general I don't think it's a problem. I would recommend looking at sklearn's [standard scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). You will want to fit on train, then transform test. This ensures that you are scaling all data in the same way. – gnodab May 11 '20 at 14:58
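A minimal sketch of gnodab's last point, with synthetic data standing in for the real features: the scaler's statistics come from the training set only, and the same fitted object is reused on test, so both are scaled identically with no leakage.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(10, 3, size=(80, 2))   # synthetic train features
X_test = rng.normal(10, 3, size=(20, 2))    # synthetic test features

scaler = StandardScaler().fit(X_train)      # learn mean/std from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)         # reuse the train statistics on test
```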

1 Answer


DBSCAN will handle those outliers for you. That's what it was built for. See the example below and post back if you have additional questions.

import seaborn as sns
import pandas as pd

# Load the Titanic sample dataset and drop rows with missing values
titanic = sns.load_dataset('titanic')
titanic = titanic.dropna()
titanic['age'].plot.hist(
    bins=50,
    title="Histogram of the age variable"
)

from scipy.stats import zscore

# Flag points more than 2.5 standard deviations from the mean
titanic["age_zscore"] = zscore(titanic["age"])
titanic["is_outlier"] = titanic["age_zscore"].apply(
    lambda x: x <= -2.5 or x >= 2.5
)
titanic[titanic["is_outlier"]]

ageAndFare = titanic[["age", "fare"]]
ageAndFare.plot.scatter(x="age", y="fare")

from sklearn.preprocessing import MinMaxScaler

# Scale both features to [0, 1] so neither dominates the distance metric
scaler = MinMaxScaler()
ageAndFare = scaler.fit_transform(ageAndFare)
ageAndFare = pd.DataFrame(ageAndFare, columns=["age", "fare"])
ageAndFare.plot.scatter(x="age", y="fare")

from sklearn.cluster import DBSCAN

outlier_detection = DBSCAN(
    eps=0.5,
    metric="euclidean",
    min_samples=3,
    n_jobs=-1)
clusters = outlier_detection.fit_predict(ageAndFare)
clusters  # noise points are labelled -1

from matplotlib import cm

# Colour each point by its cluster label; noise (-1) gets its own colour
cmap = cm.get_cmap('Accent')
ageAndFare.plot.scatter(
    x="age",
    y="fare",
    c=clusters,
    cmap=cmap,
    colorbar=False
)
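To actually remove the flagged points rather than just colour them, filter with a boolean mask on the labels. A standalone sketch with synthetic data (the titanic columns above would work the same way):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(0, 0.05, size=(50, 2))       # one tight cluster
outliers = np.array([[2.0, 2.0], [-2.0, 3.0]])  # two isolated points
X = np.vstack([dense, outliers])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
X_clean = X[labels != -1]   # keep only points assigned to a cluster
print(len(X), "->", len(X_clean))
```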

[Scatter plot of age vs. fare, coloured by DBSCAN cluster label]

ASH
  • thanks for the detail. I know DBSCAN separates noise from the dataset; my question is that my dataset has 7 features, of which some are discrete and some are continuous. Is it OK to scale both the discrete and the continuous ones, or just the continuous ones? Second question: do I need to map the clustering learned from training onto the test data? To avoid data leakage we should preprocess the train data and apply the same processing to the test data; how can we do the same clustering on test data? @gnodab answered my query; if you have any thoughts please share – user172500 May 12 '20 at 03:39
  • sorry, I noticed clusters = outlier_detection.fit_predict(ageAndFare); is it predicting clusters? I believe DBSCAN is an unsupervised algorithm; can I use the same on test data? – user172500 May 12 '20 at 03:48
  • As for the first question, the data has to be continuous (i.e., numeric); the code will fail if the data fed into the model is not of this type. I'm not sure what you mean in the second question. To cross-validate, you always carve out a training set and a testing set. You can map two data sets together based on the index of the data frame. Data sets can be modified to the n-th degree (features added or deleted); the indexes (row numbers) should never change. – ASH May 12 '20 at 03:54
  • All clustering is unsupervised. Unsupervised machine learning algorithms infer patterns from a data set without reference to known, or labeled, outcomes. Unlike supervised machine learning, unsupervised machine learning methods cannot be directly applied to a regression or a classification problem because you have no idea what the values for the output data might be, making it impossible for you to train the algorithm the way you normally would. Unsupervised learning can instead be used to discover the underlying structure of the data. – ASH May 12 '20 at 03:56
  • thanks for clearing up my doubt. So my whole intent in using DBSCAN is to find associations and structures in the data that are hard to find manually or via EDA, and with its help to keep the relevant, useful and consistent data to feed the machine, so it can learn a generalised pattern. Am I on the right track? The other problem is that the more features I give DBSCAN, the more outliers it finds in the dataset. – user172500 May 12 '20 at 04:07
  • I think...and this is a stretch...you just need to do a little more analysis, testing, experimenting, etc., and you will hit your target. I don't really understand some of your questions. That's totally fine. I have only been in the data science space for a few years. I think you have arrived here very recently. Maybe both of us have some ignorance that exceeds our knowledge. We will all learn new things every single day, and build on that foundation over time. I think we beat this to death, so to speak. Start a new post, and describe your issue/question well, if you need more help. – ASH May 12 '20 at 04:16
  • fair enough. My question 2 is how I can map DBSCAN onto test data or live data. I think that after removing data using DBSCAN and feeding the remaining data into an ML model, the model learns from the given data; for example, a tree-based or boosted model learns splitting criteria from the training data, and this learning (the splitting criteria) is then used on test data or any unseen live data. – user172500 May 12 '20 at 11:29
  • Yes, yes, that's all correct. You can feed in a live data stream, but you will almost certainly need to have a rolling window of sorts. Also, stick with DBSCAN, rather than K-Means. K-Means runs over many iterations to converge on a good set of clusters, and cluster assignments can change on each iteration. DBSCAN makes only a single pass through the data, and once a point has been assigned to a particular cluster, it never changes. – ASH May 12 '20 at 12:35
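Pulling the thread together, a hedged end-to-end sketch of the workflow discussed above, with synthetic data and a random forest as placeholders: DBSCAN runs once on the scaled training data, its noise rows are dropped, and only the fitted scaler and classifier are applied to test or live data; DBSCAN is never re-run on test.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder dataset; the real one has 7,697 rows and 8 columns
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_tr)          # statistics from train only
X_tr_s = scaler.transform(X_tr)

labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X_tr_s)
keep = labels != -1                          # drop DBSCAN's noise rows
clf = RandomForestClassifier(random_state=0).fit(X_tr_s[keep], y_tr[keep])

# Test data is only scaled and classified; no clustering happens here
acc = clf.score(scaler.transform(X_te), y_te)
print(f"kept {keep.sum()}/{len(keep)} train rows, test accuracy {acc:.2f}")
```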