
I'm new to scikit-learn and I'm working on a classification problem with two main classes: Class_1, benign programs, and Class_2, malware (malicious programs). The malware class is made up of several sub-classes: worms, viruses, Trojans, etc.

My data set contains samples of benign programs, worms, viruses, etc.

Since I already get pretty good accuracy when classifying just the two main classes (benign vs. malware), I would rather not frame the problem as a flat multi-class problem from the start (benign vs. Trojan vs. virus vs. worm ...). Instead, I would like to use scikit-learn to build a composite classifier that first assigns each sample to a main class (malware or benign) and then, if the sample is classified as malware, goes on to solve the multi-class problem (worm vs. virus vs. Trojan ...).

I don't know how to do that directly with scikit-learn's functions. I have heard about multi-label and multi-output classification, but I don't know whether my problem can be interpreted and implemented in scikit-learn as a multi-output problem, that is, two main classes (malware, benign) with multiple outputs (the sub-classes worm, Trojan, ...) for the malware class.
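For illustration, the two-stage scheme described above could be sketched by chaining two ordinary scikit-learn classifiers by hand. This is only a minimal sketch under assumptions: the data is synthetic, the label encoding (0 = benign, 1 = worm, 2 = virus, 3 = trojan), the choice of `RandomForestClassifier`, and the helper name `predict_hierarchical` are all hypothetical and not part of the original problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix (assumption).
# Assumed label encoding: 0 = benign, 1 = worm, 2 = virus, 3 = trojan.
X, y_fine = make_classification(
    n_samples=400, n_features=20, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
FAMILY_NAMES = np.array(["benign", "worm", "virus", "trojan"])

# Stage 1: benign (0) vs. malware (1), trained on every sample.
is_malware = (y_fine != 0).astype(int)
stage1 = RandomForestClassifier(n_estimators=50, random_state=0)
stage1.fit(X, is_malware)

# Stage 2: malware family, trained on the malware samples only.
malware_rows = y_fine != 0
stage2 = RandomForestClassifier(n_estimators=50, random_state=0)
stage2.fit(X[malware_rows], y_fine[malware_rows])


def predict_hierarchical(X_new):
    """Return one family name per row; stage 2 only runs on rows flagged as malware."""
    coarse = stage1.predict(X_new)
    fine = np.zeros(len(X_new), dtype=int)  # default: 0 = benign
    flagged = coarse == 1
    if flagged.any():
        fine[flagged] = stage2.predict(X_new[flagged])
    return FAMILY_NAMES[fine]


predictions = predict_hierarchical(X[:10])
print(predictions)
```

With this layout, the benign-vs-malware model can be tuned and evaluated on its own, and a stage-1 mistake is never corrected by stage 2, so the quality of the first stage bounds the quality of the whole chain.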

Thanks in advance for your help.

NL_user

1 Answer


What you are doing is multi-class classification. You can get the behaviour you want by specifying your own loss function, for instance:

loss(benign, trojan) = loss(benign, virus) = ... = 10

loss(trojan, virus) = loss(trojan, worm) = ... = 1

This will make your classifier "learn" that misclassifying a worm as a virus isn't as important as misclassifying malware as a benign program.
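As a hedged illustration of the cost idea, the sketch below does not plug a custom loss into training. It uses a different but related technique: a standard probabilistic classifier is trained as usual, and the cost matrix is applied at prediction time by picking the class with the lowest expected cost. The synthetic data, the label encoding (0 = benign, 1 = worm, 2 = virus, 3 = trojan), the choice of `LogisticRegression`, and the name `COST` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data (assumption).
# Assumed label encoding: 0 = benign, 1 = worm, 2 = virus, 3 = trojan.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# COST[i, j] = cost of predicting class j when the true class is i.
# Benign/malware confusions cost 10, confusions between families cost 1.
COST = np.array([
    [0, 10, 10, 10],
    [10, 0, 1, 1],
    [10, 1, 0, 1],
    [10, 1, 1, 0],
])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Columns of proba follow clf.classes_, which matches the row order of COST here.
proba = clf.predict_proba(X)

# expected_cost[n, j] = sum_i proba[n, i] * COST[i, j]
expected_cost = proba @ COST

# Choose, for each sample, the prediction with the lowest expected cost.
y_pred = clf.classes_[expected_cost.argmin(axis=1)]
print(y_pred[:10])
```

Because the costs only enter at decision time, the same fitted model can be reused with a different cost matrix without retraining.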

Be Chiller Too
  • First of all, thank you very much for your answer! I have just two questions: 1) Could you please give me a code example, or a link to one, that does this in scikit-learn? 2) I have also heard about multi-label and multi-output classification. Don't you think my problem could be interpreted and implemented in scikit-learn as a multi-output problem, that is, two main classes (malware, benign) with multiple outputs (the sub-classes worm, Trojan, ...) for the malware class? – NL_user Feb 28 '18 at 11:59
  • 1) Sorry but I don't have examples. 2) You could solve your problem in many ways, multi-label may be a solution. More precisely, what you are trying to do is "Structured Learning", especially "Hierarchical Multiclass Classification". Maybe [pyStruct](https://pystruct.github.io/user_guide.html#multi-label-svm) could help you but I don't really know this module. – Be Chiller Too Feb 28 '18 at 13:03
  • Or maybe [sklearn.multioutput.MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) – Be Chiller Too Feb 28 '18 at 13:05
  • Hello, sorry to bother you again. I have difficulty distinguishing between multi-output classification and multi-label classification. Could you please briefly explain the difference between the two (this will be my last question), and tell me whether my problem is closer to multi-label or multi-output classification? – NL_user Feb 28 '18 at 16:44
  • Hello, I don't really know but you can find more info here: https://stats.stackexchange.com/q/11859 – Be Chiller Too Feb 28 '18 at 17:05
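
To make the distinction raised in the comments concrete: in scikit-learn's terminology, multi-label classification gives each sample a set of binary labels, while multi-output (multiclass-multioutput) classification gives each sample several targets, each of which may have more than two classes. The sketch below frames the malware problem as two outputs using the `MultiOutputClassifier` linked above. The synthetic data, the label encoding (0 = benign, 1 = worm, 2 = virus, 3 = trojan), and the base estimator are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in for the real data (assumption).
# Assumed label encoding: 0 = benign, 1 = worm, 2 = virus, 3 = trojan.
X, family = make_classification(
    n_samples=400, n_features=20, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Two target columns per sample: [is_malware, family].
Y = np.column_stack([(family != 0).astype(int), family])

# MultiOutputClassifier fits one independent copy of the base estimator per column.
model = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=50, random_state=0)
)
model.fit(X, Y)

Y_pred = model.predict(X[:5])  # one row per sample, one column per output
print(Y_pred)
```

Note that the two outputs are fitted independently, so nothing prevents an inconsistent prediction such as "not malware" together with a malware family; a hand-chained two-stage classifier avoids that by construction.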