
I have a dataset:

  • almost 45K samples
  • 8 features
  • 4 classes

The percentage of samples per class is different. I wanted to draw a scatter chart for every pair of features, that is to say 28 charts, using the whole dataset.

So at the end I get, for each chart, a scatter plot where I see the samples distributed by class. However, I have seen an example in a book where these scatter plots are drawn using the same number of samples for each class.

For example: 100 samples of class 0, 100 samples of class 1, 100 samples of class 2, 100 samples of class 3.

Question: I am wondering whether using the whole dataset, with a different percentage of samples for each class, is correct or not.

Note: I want to get a view of whether the features, taken in pairs, are linearly separable or not.

  • It depends on the purpose of the exercise, I guess. Just generating a visualisation, there is no 'correct' or 'not correct', as that depends on the purpose of the visualisation. If you're trying to get a view of the distribution of each class across (X,Y), then going for balanced samples is probably helpful. If you're trying to go for a visualisation that shows the dominance of one class vs the other in (X,Y), then you'll want to use representative sampling – Henry Nov 16 '18 at 09:07
  • Well, I want to get a view of whether the features, taken in pairs, are linearly separable or not. So, from your reply, I should go for balanced samples for each class. – Alex Nov 16 '18 at 09:14
  • How do you get those 28 graphs? – mrk Nov 16 '18 at 09:23
  • By using the itertools.combinations() function in the itertools module. In detail, I am using nested for loops: one to get the feature pair to plot and a second one to draw the scatter chart (see the sketch after these comments). So at the end, for each scatter, the samples are grouped by class only more or less, because the problem does not look linearly separable at first sight. – Alex Nov 16 '18 at 09:33
  • I'm voting to close this question as off-topic because it is not about programming. – desertnaut Nov 17 '18 at 00:19
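
For reference, a minimal sketch of the pairwise-scatter approach described in the comments above. It assumes the data sits in a pandas DataFrame df whose feature columns are listed in features and whose class column is named label; these names are illustrative, not taken from the original post.

    import itertools

    import matplotlib.pyplot as plt

    def plot_pairwise_scatters(df, features, label_col="label"):
        """Draw one scatter chart per feature pair, colouring points by class."""
        # itertools.combinations yields the 28 unordered pairs for 8 features.
        for x_col, y_col in itertools.combinations(features, 2):
            fig, ax = plt.subplots()
            for cls, group in df.groupby(label_col):
                ax.scatter(group[x_col], group[y_col], s=5, alpha=0.5,
                           label=f"class {cls}")
            ax.set_xlabel(x_col)
            ax.set_ylabel(y_col)
            ax.legend()
            plt.show()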

1 Answer


This sounds like feature analysis or feature selection.

  1. If you want to find out from your plots whether your features are linearly separable or not, I would go for all the samples of each class. Otherwise, choosing a random set of, say, 100 samples may leave you with ambiguous plots and thus ambiguous interpretations.
  2. When trying to make sense of features, a mere qualitative "look" at plots shouldn't be the end of the pipeline. Rather, turn to some decent feature selection strategies and approaches, such as Recursive Feature Elimination, a correlation matrix, etc. (there are some examples in R for a start); a short Python sketch follows this list.
  3. When trying to make sense of a set of features, there are methods such as the elbow method and others.
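
A minimal Python sketch of two of the strategies mentioned in point 2 (the answer's own examples are in R). The synthetic data from make_classification merely stands in for the real 45K-sample dataset; all names here are illustrative.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    # Toy stand-in data: 8 features, 4 classes, roughly like the question's dataset.
    X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                               n_classes=4, random_state=0)

    # Recursive Feature Elimination: repeatedly drops the weakest feature.
    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
    rfe.fit(X, y)
    print("Selected feature mask:", rfe.support_)
    print("Feature ranking:      ", rfe.ranking_)

    # Correlation matrix: highly correlated feature pairs are redundancy candidates.
    print(np.round(np.corrcoef(X, rowvar=False), 2))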
  • So, you are saying the contrary of what @Henry said, that is to say, I have to consider all samples of my dataset, even if they have a different percentage distribution per class. Well, to assess/select the most important features I already used a random forest technique: I split my dataset, used 70% as the training set, and trained RandomForestClassifier() with 10K estimators. What is not clear is that the feature ranked least important seems to be discriminating in the scatter chart when paired with another feature. Does that make sense? – Alex Nov 16 '18 at 09:46
  • I am not sure I have understood everything, but I think you shouldn't neglect data points you have. And there are techniques for feature set selection as well. Maybe that can be of use here. – mrk Nov 16 '18 at 09:52
  • At the moment, I have plotted all combinations without neglecting any data point. Before doing that, I had already used an ensemble technique to select relevant features based on a random forest, using 70% of the dataset for training (see the sketch after these comments). What I'm noticing is that the least important feature according to the bar chart (random forest technique) seems to be discriminating in some scatter charts where it is paired with another feature; that is to say, in those scatters the 4 clusters are more separated than for feature pairs that do not contain that least important feature. – Alex Nov 16 '18 at 10:13
  • Also, *Class Correlation* is worth mentioning, +1 all the way! – Yahya Nov 16 '18 at 22:01
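
For completeness, a rough sketch of the importance-ranking workflow Alex describes in the comments (70/30 split, RandomForestClassifier with 10K trees, then a bar chart of importances). The synthetic data again stands in for the real dataset; the variable names are illustrative.

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the real 8-feature, 4-class dataset.
    X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                               n_classes=4, random_state=0)

    # 70/30 split, stratified so each class keeps its proportion in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    forest = RandomForestClassifier(n_estimators=10_000, n_jobs=-1, random_state=0)
    forest.fit(X_train, y_train)

    # Bar chart of impurity-based feature importances, as in the workflow above.
    plt.bar(range(X.shape[1]), forest.feature_importances_)
    plt.xlabel("feature index")
    plt.ylabel("importance")
    plt.show()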