Preparing data for clustering analysis and data preprocessing

Question

I want to implement a rough c-means clustering algorithm, but I have no prior experience with clustering, so I'm wondering whether I need to preprocess the data to make it usable for clustering.

For example, let's say I have a CSV file with a lot of attributes, some numeric, some strings.

In order to apply rough c-means clustering (or any other kind of clustering), should I first apply other rough set methods such as attribute selection, rule discovery, discretization, or computing the lower/upper approximations?

What would be the normal flow of a set of mixed data for clustering? What would the data go through if I were to use a rough set approach algorithm for clustering?

Is there a certain order in which things are supposed to happen? I tried looking this information up, but I couldn't find it clearly stated anywhere.

Any ideas? Or how could I make my question clearer in order to get an answer? I can't find anything that would help me get started with clustering data, and I don't see how clustering raw data would help me.

    rank       discipline  yrs.since.phd  yrs.service  sex     salary
1   Prof       B           19             18           Male    139750
2   Prof       B           20             16           Male    173200
3   AsstProf   B           4              3            Male    79750
4   Prof       B           45             39           Male    115000
5   Prof       B           40             41           Male    141500
6   AssocProf  B           6              6            Male    97000
7   Prof       B           30             23           Male    175000
8   Prof       B           45             45           Male    147765
9   Prof       B           21             20           Male    119250
10  Prof       B           18             18           Female  129000
11  AssocProf  B           12             8            Male    119800
12  AsstProf   B           7              2            Male    79800
13  AsstProf   B           1              1            Male    77700
14  AsstProf   B           2              0            Male    78000
15  Prof       B           20             18           Male    104800
16  Prof       B           12             3            Male    117150
17  Prof       B           19             20           Male    101000
18  Prof       A           38             34           Male    103450
19  Prof       A           37             23           Male    124750
20  Prof       A           39             36           Female  137000

Can you post a sample of your data at least? In my experience, the main thing is you'll need a way to measure the distance between two datapoints. — Matt Cremeens, Apr 17 '17 at 13:03
Well, I want to make something general, but I posted some sample data anyway. — Mocktheduck, Apr 18 '17 at 10:57
I'm thinking you may need to change the fields that are strings to something numeric so you can more easily calculate distance measurements. — Matt Cremeens, Apr 18 '17 at 14:21
Would encoding the fields help, say Female and Male as 0 and 1, and Prof, AssocProf and AsstProf as 0, 1, 2? And what steps would I have to take to get from this dataset to the clusters, considering that I also want to use Rough Sets in my algorithm? — Mocktheduck, Apr 18 '17 at 15:19
I think you might want to try that encoding. By 'Rough Sets' do you mean 'Fuzzy Sets'? — Matt Cremeens, Apr 18 '17 at 15:30
No. I really mean Rough Sets/Rough Logic https://en.wikipedia.org/wiki/Rough_set — Mocktheduck, Apr 18 '17 at 15:33
My thinking is that your current clustering algorithm will easily calculate centroids at each iteration with the data points having numeric fields, as opposed to some linguistic fields, so Male=1 and Female=0, as you mentioned already, might be the way to go. — Matt Cremeens, Apr 18 '17 at 15:48
Yes, but what I'm asking is whether I have to do some data preprocessing too before clustering, like removing useless or confusing data. — Mocktheduck, Apr 18 '17 at 19:04
I don't think so. I think the only thing you might have to do is quantify the linguistic fields somehow. I think the best way for anyone to help you on this site is to try something and if you get stuck, post it. — Matt Cremeens, Apr 18 '17 at 19:14
C-means will only work on *continuous* variables of the same scale. So the salary attribute is okay, but you would need multiple attributes like this. — Has QUIT--Anony-Mousse, May 02 '17 at 18:31
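
The suggestions in the comments above (encode the string fields as numbers, bring the attributes to the same scale, then measure distances) could be sketched roughly as follows. This is only an illustration under stated assumptions, not a definitive pipeline. The code tables follow the 0/1 and 0,1,2 encodings proposed in the comments, while the discipline codes, the `threshold` value, the choice of two rows as initial centroids and all function names are hypothetical. The `rough_assign` step assumes the common difference-based rule for rough c-means: a point that is almost equally close to several centroids goes only into their upper approximations, otherwise it goes into the lower approximation of the nearest one. Note that integer codes for nominal fields impose an artificial distance between categories, so they are an approximation at best.

```python
from math import dist
from statistics import mean, pstdev

# A few rows copied from the sample data in the question.
rows = [
    {"rank": "Prof", "discipline": "B", "yrs_since_phd": 19, "yrs_service": 18, "sex": "Male", "salary": 139750},
    {"rank": "AsstProf", "discipline": "B", "yrs_since_phd": 4, "yrs_service": 3, "sex": "Male", "salary": 79750},
    {"rank": "AssocProf", "discipline": "B", "yrs_since_phd": 6, "yrs_service": 6, "sex": "Male", "salary": 97000},
    {"rank": "Prof", "discipline": "A", "yrs_since_phd": 39, "yrs_service": 36, "sex": "Female", "salary": 137000},
]

# Encodings as proposed in the comments; the discipline codes are an assumption.
RANK_CODES = {"Prof": 0, "AssocProf": 1, "AsstProf": 2}
SEX_CODES = {"Female": 0, "Male": 1}
DISCIPLINE_CODES = {"A": 0, "B": 1}


def encode(row):
    """Turn one record into a purely numeric feature vector."""
    return [
        float(RANK_CODES[row["rank"]]),
        float(DISCIPLINE_CODES[row["discipline"]]),
        float(row["yrs_since_phd"]),
        float(row["yrs_service"]),
        float(SEX_CODES[row["sex"]]),
        float(row["salary"]),
    ]


def standardize(vectors):
    """Z-score every column so that no attribute dominates the distance."""
    columns = list(zip(*vectors))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; divide by 1.0 instead.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(vec, means, stds)] for vec in vectors]


def rough_assign(point, centroids, threshold):
    """Return (lower, upper): lower is a cluster index or None, upper is a list of indices."""
    distances = [dist(point, c) for c in centroids]
    nearest = min(range(len(centroids)), key=distances.__getitem__)
    close = [
        j for j, d in enumerate(distances)
        if j != nearest and d - distances[nearest] <= threshold
    ]
    if close:
        # Boundary point: member of several upper approximations, of no lower one.
        return None, sorted([nearest] + close)
    return nearest, [nearest]


encoded = [encode(row) for row in rows]
scaled = standardize(encoded)
centroids = [scaled[0], scaled[1]]  # hypothetical initial centroids
for vec in scaled:
    print(rough_assign(vec, centroids, threshold=0.5))
```

A full rough c-means loop would then recompute each centroid from its lower approximation and boundary region and repeat the assignment until the memberships stop changing; that part is left out here.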
