I am working on a data analysis / machine learning project.
The main goal is to identify which components are causing failures in a large dataset.
The dataset contains many rows. Each row represents a single test and holds information such as the test name, the components used, and the result.
For example, the data looks like this:
Test | ComponentUsed | Result |
---|---|---|
1 | 1,2,3,7 | Fail |
2 | 2,3 | Pass |
3 | 1,3,4,5 | Fail |
1 | 3,4 | Pass |
1 | 5,6,7 | Fail |
1 | 4,5,6 | Pass |
1 | 7 | Fail |
1 | 1,2 | Fail |
1 | 2,5,6 | Pass |
1 | 2,3,5,7 | Fail |
1 | 2,3,4 | Pass |
1 | 1,2,3,4,5,6,7 | Fail |
Reading the table by hand, we can see that:

- whenever component 1 is used, the test fails;
- whenever component 7 is used, the test fails.

Hence we can conclude that components 1 and 7 are faulty and should not be used for testing.

Note that a single test could use more than 100 components at once, and there could be millions of tests. The failures are also not always deterministic: sometimes a test that uses 1 or 7 will pass, so a faulty component does not fail 100% of the time. Another scenario is that a combination of components could lead to a failure: for example, tests might fail when 2 and 3 are used together, but pass when 2 is used without 3.
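The by-hand reading of the example table amounts to counting, for each component, how often the tests that used it failed. A minimal sketch in plain Python, with the rows copied from the table (the variable names are only illustrative, and this simple count does not address the combination scenario):

```python
from collections import Counter

# Rows copied from the example table: (components used, result).
tests = [
    ({1, 2, 3, 7}, "Fail"),
    ({2, 3}, "Pass"),
    ({1, 3, 4, 5}, "Fail"),
    ({3, 4}, "Pass"),
    ({5, 6, 7}, "Fail"),
    ({4, 5, 6}, "Pass"),
    ({7}, "Fail"),
    ({1, 2}, "Fail"),
    ({2, 5, 6}, "Pass"),
    ({2, 3, 5, 7}, "Fail"),
    ({2, 3, 4}, "Pass"),
    ({1, 2, 3, 4, 5, 6, 7}, "Fail"),
]

used = Counter()    # number of tests each component appeared in
failed = Counter()  # number of those tests that failed

for components, result in tests:
    for c in components:
        used[c] += 1
        if result == "Fail":
            failed[c] += 1

# Fraction of failing tests among the tests that used each component.
fail_rate = {c: failed[c] / used[c] for c in sorted(used)}

# Components whose tests failed every single time.
always_fail = [c for c, rate in fail_rate.items() if rate == 1.0]

for c, rate in fail_rate.items():
    print(f"component {c}: {failed[c]}/{used[c]} failed ({rate:.0%})")
print("always failing:", always_fail)
```

On this table, only components 1 and 7 end up with a failure rate of 100%, which matches the manual conclusion.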
I have tried a decision tree in Python: I trained a decision tree from scikit-learn and extracted `feature_importances_`. It tells me which features are most important, but it does not show the direction of the effect, i.e. whether a feature affects the result positively or negatively. Why does it only show the importance value, and how do I find the direction of each feature's effect?
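For reference, the attempt can be sketched roughly as follows on the example table. This is an illustrative reconstruction rather than the exact script: the use of `MultiLabelBinarizer` to turn the component lists into 0/1 columns, the choice of `DecisionTreeClassifier`, and all variable names are assumptions.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Rows copied from the example table.
component_sets = [
    [1, 2, 3, 7], [2, 3], [1, 3, 4, 5], [3, 4], [5, 6, 7], [4, 5, 6],
    [7], [1, 2], [2, 5, 6], [2, 3, 5, 7], [2, 3, 4], [1, 2, 3, 4, 5, 6, 7],
]
results = ["Fail", "Pass", "Fail", "Pass", "Fail", "Pass",
           "Fail", "Fail", "Pass", "Fail", "Pass", "Fail"]

# Assumed encoding: one 0/1 column per component (1 = component was used).
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(component_sets)
y = [1 if r == "Fail" else 0 for r in results]  # 1 = Fail, 0 = Pass

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# One importance value per component column. The values are non-negative
# and normalised, so they say nothing about the direction of the effect.
importances = {int(c): float(v)
               for c, v in zip(mlb.classes_, clf.feature_importances_)}

for component, value in sorted(importances.items()):
    print(f"component {component}: importance {value:.3f}")
```

The printed values only rank the components; none of them carries a sign, which is exactly the limitation I am asking about.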