
I am doing a data analysis / machine learning project.

The main goal is to identify which component is causing the problem in a large dataset.

The dataset contains many rows; each row represents a single test, and each test contains information such as TestingName, ComponentUsed, and Result.

For example, the data looks like this:

Test  ComponentUsed    Result
1     1,2,3,7          Fail
2     2,3              Pass
3     1,3,4,5          Fail
4     3,4              Pass
5     5,6,7            Fail
6     4,5,6            Pass
7     7                Fail
8     1,2              Fail
9     2,5,6            Pass
10    2,3,5,7          Fail
11    2,3,4            Pass
12    1,2,3,4,5,6,7    Fail

From the table above, using human interpretation, we can see that whenever component 1 is present the result is Fail, and whenever component 7 is present the result is Fail. Hence we can conclude that components 1 and 7 are faulty and cannot be used for testing.

Note that there could be more than 100 components used at once, and millions of tests. There are also scenarios where a test containing 1 or 7 sometimes passes; they don't fail 100% of the time. Another scenario is that a combination of components used together could lead to a failure: for example, if 2 and 3 are used together the test fails, but if 2 is used without 3 it may pass.
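Since each test carries a variable-length list of components, the usual first step is to turn those lists into a binary (multi-hot) feature matrix with one 0/1 column per component, which is what a tree or linear model expects. A minimal sketch using scikit-learn's `MultiLabelBinarizer` (the `tests` structure here is an assumption, not the real data format):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical representation of the first three rows of the table.
tests = [
    {"components": {1, 2, 3, 7}, "result": "Fail"},
    {"components": {2, 3}, "result": "Pass"},
    {"components": {1, 3, 4, 5}, "result": "Fail"},
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform([t["components"] for t in tests])  # one 0/1 column per component
y = [t["result"] == "Fail" for t in tests]               # True = Fail

print(mlb.classes_)  # which component each column corresponds to
print(X)
```

Interaction effects like "2 and 3 together fail" can then be captured either by a tree model splitting on both columns, or by explicitly adding product columns such as `X[:, i] * X[:, j]`.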

I have tried a decision tree in Python: I trained a DecisionTreeClassifier from scikit-learn and extracted `feature_importances_`. It does tell me which features are most important, but it doesn't show whether a feature affects the result positively or negatively. Why does it only show the importance value, and how do I find the direction of the effect?
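The workflow described above can be sketched as follows (toy multi-hot data, not the real dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy multi-hot matrix: one 0/1 column per component (components 1..3).
X = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
])
y = np.array([1, 1, 0, 0])  # 1 = Fail, 0 = Pass; here failures track component 1

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ are Gini-based: non-negative and summing to 1,
# so they rank features by how much they reduce impurity but carry no sign.
print(tree.feature_importances_)
```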

LaLaOng

1 Answer


You need to understand how feature importances are calculated and what they represent. For example, if you're using scikit-learn and a tree-based classifier such as a random forest, the feature importance is usually the Gini-impurity-based importance. From the scikit-learn documentation:

The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

To answer your question: feature importances are always non-negative and add up to 1, so they convey magnitude but no sign. The importance tells you how much a feature reduces impurity across the tree's splits, not which direction it pushes the prediction.
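If you want a signed effect per component, one common alternative (not what the tree itself provides) is to fit a logistic regression on the same multi-hot matrix: the sign of each coefficient indicates whether that component pushes tests toward Fail or toward Pass. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy multi-hot matrix: columns = components 1..4; 1 = component used in the test.
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = Fail, 0 = Pass; failures track component 1

clf = LogisticRegression().fit(X, y)

# A positive coefficient means the component is associated with Fail,
# a negative one with Pass -- the sign that Gini importance lacks.
for comp, coef in zip([1, 2, 3, 4], clf.coef_[0]):
    print(f"component {comp}: {coef:+.3f}")
```

Another option that keeps the tree model is to inspect the tree's splits directly (e.g. via `sklearn.tree.export_text`) and read off which side of each split leads to the Fail leaves.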

Suraj Shourie