
I have a dataset:

  • almost 45K samples
  • 8 features
  • 4 classes

The percentage of samples per class is different. I wanted to draw a scatter chart for every pair of features, that is to say 28 charts, using the whole dataset.

So at the end I get, for each chart, a scatter plot where I see the samples distributed by class. However, I have seen an example in a book where these scatter plots are drawn using the same number of samples for each class.

For example: 100 samples of class 0, 100 samples of class 1, 100 samples of class 2, 100 samples of class 3.

Question: I am wondering whether using the whole dataset, with a different percentage of samples for each class, is correct or not.

Note: I want to get a view of whether the features, taken in pairs, are linearly separable or not.

  • It depends on the purpose of the exercise, I guess. Just generating a visualisation, there is no 'correct' or 'not correct', as that depends on the purpose of the visualisation. If you're trying to get a view of the distribution of each class across (X,Y), then going for balanced samples is probably helpful. If you're trying to go for a visualisation that shows the dominance of one class vs the other in (X,Y), then you'll want to use representative sampling – Henry Nov 16 '18 at 09:07
  • Well, I want to get a view of whether the features, taken in pairs, are linearly separable or not. So, from your reply, I should go for balanced samples for each class. – Alex Nov 16 '18 at 09:14
  • How do you get those 28 graphs? – mrk Nov 16 '18 at 09:23
  • By using the itertools.combinations() function in the itertools module. In detail, I am using nested for loops: one to get the feature pair to plot and a second one to draw the scatter chart (see the sketch after these comments). So at the end, for each scatter, the samples are grouped by class only more or less, because the problem does not look linearly separable at first sight. – Alex Nov 16 '18 at 09:33
  • I'm voting to close this question as off-topic because it is not about programming. – desertnaut Nov 17 '18 at 00:19
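
For reference, a minimal sketch of the pairwise-scatter approach described in the comments above. It assumes the data sits in a pandas DataFrame df whose feature columns are listed in features and whose class column is named label; these names are illustrative, not taken from the original post.

    import itertools

    import matplotlib.pyplot as plt

    def plot_pairwise_scatters(df, features, label_col="label"):
        """Draw one scatter chart per feature pair, colouring points by class."""
        # itertools.combinations yields the 28 unordered pairs for 8 features.
        for x_col, y_col in itertools.combinations(features, 2):
            fig, ax = plt.subplots()
            for cls, group in df.groupby(label_col):
                ax.scatter(group[x_col], group[y_col], s=5, alpha=0.5,
                           label=f"class {cls}")
            ax.set_xlabel(x_col)
            ax.set_ylabel(y_col)
            ax.legend()
            plt.show()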

1 Answer


This sounds like feature analysis or feature selection.

  1. If you want to find out from your plots whether your features are linearly separable or not, I would go for all the samples of each class. Otherwise, choosing a random set of, say, 100 samples may leave you with ambiguous plots and thus ambiguous interpretations.
  2. When trying to make sense of features, a mere qualitative "look" at plots shouldn't be the end of the pipeline. Rather, turn to some decent feature selection strategies and approaches, such as Recursive Feature Elimination, a correlation matrix, etc. (there are some examples in R for a start); a short Python sketch follows this list.
  3. When trying to make sense of a set of features, there are methods such as the elbow method and others.
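
A minimal Python sketch of two of the strategies mentioned in point 2 (the answer's own examples are in R). The synthetic data from make_classification merely stands in for the real 45K-sample dataset; all names here are illustrative.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    # Toy stand-in data: 8 features, 4 classes, roughly like the question's dataset.
    X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                               n_classes=4, random_state=0)

    # Recursive Feature Elimination: repeatedly drops the weakest feature.
    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
    rfe.fit(X, y)
    print("Selected feature mask:", rfe.support_)
    print("Feature ranking:      ", rfe.ranking_)

    # Correlation matrix: highly correlated feature pairs are redundancy candidates.
    print(np.round(np.corrcoef(X, rowvar=False), 2))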
  • So, you are saying the contrary of what @Henry said, that is to say, I have to consider all samples of my dataset, even if they have a different percentage distribution per class. Well, to assess/select the most important features I already used a random forest technique: I split my dataset, used 70% as the training set, and trained RandomForestClassifier() with 10K estimators. What is not clear is that the feature ranked least important seems to be discriminating in the scatter chart when paired with another feature. Does that make sense? – Alex Nov 16 '18 at 09:46
  • I am not sure I have understood everything, but I think you shouldn't neglect data points you have. And there are techniques for feature set selection as well. Maybe that can be of use here. – mrk Nov 16 '18 at 09:52
  • At the moment, I have plotted all combinations without neglecting any data point. Before doing that, I had already used an ensemble technique to select relevant features based on a random forest, using 70% of the dataset for training (see the sketch after these comments). What I'm noticing is that the least important feature according to the bar chart (random forest technique) seems to be discriminating in some scatter charts where it is paired with another feature; that is to say, in those scatters the 4 clusters are more separated than for feature pairs that do not contain that least important feature. – Alex Nov 16 '18 at 10:13
  • Also, *Class Correlation* is worth mentioning, +1 all the way! – Yahya Nov 16 '18 at 22:01
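
For completeness, a rough sketch of the importance-ranking workflow Alex describes in the comments (70/30 split, RandomForestClassifier with 10K trees, then a bar chart of importances). The synthetic data again stands in for the real dataset; the variable names are illustrative.

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Toy stand-in for the real 8-feature, 4-class dataset.
    X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                               n_classes=4, random_state=0)

    # 70/30 split, stratified so each class keeps its proportion in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    forest = RandomForestClassifier(n_estimators=10_000, n_jobs=-1, random_state=0)
    forest.fit(X_train, y_train)

    # Bar chart of impurity-based feature importances, as in the workflow above.
    plt.bar(range(X.shape[1]), forest.feature_importances_)
    plt.xlabel("feature index")
    plt.ylabel("importance")
    plt.show()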