
I have several thousand samples, which are already labeled as "A" or "Not A". Each sample has [0-n] categories assigned to it.

What I am trying to achieve is to find out which categories are suitable for labeling a new sample as "A" or "Not A".

My approach is to split the samples into two sets, one containing all samples labeled "A" and the other containing all samples labeled "Not A".

Next I build the set of all distinct categories and count how often each category occurs in the "A" set and in the "Not A" set.

Then, for each category, I calculate an error ratio from the occurrences in the two sets: #occurrences in "Not A" / (#occurrences in "A" + #occurrences in "Not A"). The categories are then sorted ascending by error ratio.
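For illustration, this is a minimal sketch of the counting and ratio calculation in Python, assuming the samples are available as (categories, label) pairs; the function and variable names are made up for the example:

    from collections import Counter

    def error_ratios(samples):
        # samples: iterable of (categories, label) pairs, where categories is a
        # set of category names and label is "A" or "Not A".
        occ_a, occ_not_a = Counter(), Counter()
        for categories, label in samples:
            counter = occ_a if label == "A" else occ_not_a
            for category in categories:
                counter[category] += 1

        ratios = {}
        for category in set(occ_a) | set(occ_not_a):
            a, not_a = occ_a[category], occ_not_a[category]
            ratios[category] = not_a / (a + not_a)

        # Sorted ascending by error ratio, as described above.
        return sorted(ratios.items(), key=lambda item: item[1])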

So now the challenge is to find out which of these categories are suitable for labeling a sample as "A".

----------------------------------------------------
| Category | error ratio | #occ "A" | #occ "Not A" |
----------------------------------------------------
| V        | 0           | 2        | 0            |
| W        | 0           | 59       | 0            |
| X        | 0.138       | 125      | 20           |
| Y        | 0.901       | 9        | 82           |
| Z        | 1           | 0        | 1            |
----------------------------------------------------

So first of all I need to decide how many observations a category needs before it is worth considering at all. In the table above, V and Z are probably not good categories to choose, since they have too few occurrences. But is there a statistical approach to decide which categories should be discarded?

After that I need to choose where my decision boundary lies. I was thinking about creating all possible combinations of categories, measuring the accuracy of each, and choosing the largest set with an accuracy higher than ~95%.

In the first step I would use only {V} to decide whether a sample is "A" or "Not A", then {W}, ..., {V, W}, {V, X}, ..., {V, W, X}, ..., {V, W, X, Y, Z}. That amounts to (2^n - 1) combinations.
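For concreteness, a brute-force version of this search could look like the sketch below. It assumes a sample is predicted as "A" if it contains at least one of the chosen categories (an assumption of the example, not stated above), and the names are made up:

    from itertools import combinations

    def best_subset(samples, categories, min_accuracy=0.95):
        # samples: list of (category_set, label) pairs; label is "A" or "Not A".
        best = None
        for k in range(1, len(categories) + 1):
            for subset in combinations(categories, k):
                chosen = set(subset)
                correct = sum(
                    ("A" if cats & chosen else "Not A") == label
                    for cats, label in samples
                )
                accuracy = correct / len(samples)
                if accuracy >= min_accuracy and (best is None or len(chosen) > len(best)):
                    best = chosen
        return best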

Since I have several thousand categories, this is infeasible. Is there an optimisation algorithm I can use for this purpose?

  • You can sort the categories by error ratio, from small to large, in O(n log n). Then choose the top k categories based on the cumulative error ratio computed from the cumulative counts of #occ "A" and #occ "Not A" over those top k categories, which takes O(n). Additionally, you may add a "prior" in the form of small constants to #occ "A" and #occ "Not A", so that categories with less certain error ratios can be filtered out. – Sanghack Lee Jul 22 '17 at 01:34
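The following sketch shows one way to implement the suggestion from the comment above, assuming the per-category counts are already available. The cumulative-ratio threshold and the pseudo-counts alpha and beta are placeholder values, not part of the comment:

    def top_k_categories(stats, max_cumulative_ratio=0.05, alpha=1, beta=1):
        # stats: dict mapping category -> (occ_a, occ_not_a).
        # Rank by a smoothed error ratio; alpha/beta act as the "prior" pseudo-counts.
        ranked = sorted(
            stats.items(),
            key=lambda kv: (kv[1][1] + beta) / (kv[1][0] + kv[1][1] + alpha + beta),
        )
        chosen, cum_a, cum_not_a = [], 0, 0
        for category, (occ_a, occ_not_a) in ranked:
            # Stop as soon as adding the next category pushes the cumulative
            # error ratio above the allowed threshold.
            ratio = (cum_not_a + occ_not_a) / (cum_a + occ_a + cum_not_a + occ_not_a)
            if ratio > max_cumulative_ratio:
                break
            chosen.append(category)
            cum_a += occ_a
            cum_not_a += occ_not_a
        return chosen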

1 Answer


You probably do not have to reinvent the wheel.

You can encode your data in a binary way, like this:

A  V  W  X  Y  Z
1  1  1  0  0  1
0  0  1  1  0  0
1  0  1  1  1  0 
...

Thereafter, you can feed your data to any classification algorithm, such as Naive Bayes, logistic regression, a decision tree classifier, an SVM, et cetera.
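For example, a minimal sketch with scikit-learn: MultiLabelBinarizer produces the 0/1 columns shown above, and logistic regression stands in for any of the classifiers mentioned. The example data simply mirrors the three rows above:

    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import MultiLabelBinarizer

    # Categories per sample and the corresponding labels (same rows as above).
    category_sets = [{"V", "W", "Z"}, {"W", "X"}, {"W", "X", "Y"}]
    labels = ["A", "Not A", "A"]

    encoder = MultiLabelBinarizer()
    X = encoder.fit_transform(category_sets)  # one binary column per category
    clf = LogisticRegression().fit(X, labels)

    # Predict the label of a new sample from its categories.
    print(clf.predict(encoder.transform([{"W", "X"}])))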

David Dale