Improving prediction accuracy in Bayesian Causal Network

Question

I would like to determine the causes of an unexpected outcome (or anamoly) in a thermodynamic process. I have continuous data of the associated variables and trying to make use of 'Bayesian Network (BN)' for the determination of causality relationships. For this purpose, I used a library called 'Causalnex' in Python.

I have followed the tutorial section of this library to build the DAG,BN model and everything works fine upto the step of predictions. The prediction results of minority/less majority classes have an accuracy of around 60-70% (80-90% with SMOTE/SMOTETomek and a particular random state) whereas a stable accuracy of more than 90% is expected. I have implemented following data-preprocessing steps.

Ensuring no missing/NaN values
Discretization (only it is supported by the library)
SMOTE/SMOTETomek for data balancing
Various train/test size combinations

I am struggling to figure out the ways to optimize the model. I could not find any supportive material in Internet for the same.

Are there any Guidelines or 'Best practices' of data pre-processing techniques and dataset requirements that particulary work for this library/ BN model? Could you please suggest any troubleshooting methods to identify the causes of low accuracy/metrics? Perhaps a misunderstood node-node causal relationship in DAG causes mediocre accuracy?

Any ideas/literature/other suitable library regarding this would be of great help!

Beware that notears (and other causal discovery using the same or similar score functions) are highly sensitive to the data scale (see [this article](https://proceedings.neurips.cc/paper_files/paper/2021/file/e987eff4a7c7b7e580d659feb6f60c1a-Paper.pdf) for more information). In fact, the results are often dominated by information in the data scale. On synthetic data the algorithm tends to perform well only when data are not standardized, which is a problem for applications where data scales are arbitrary or do not contain information. This could be the case in your setting. — Scriddie, Jul 11 '23 at 08:01

score 2 · Answer 1 · answered Jul 10 '20 at 00:25

A few tips that can help:

Changing/Tuning the Structure learning.

Trying different thresholds. When doing from_pandas, you can experiment with different w-threshold values (and the beta term (if you are using from_pandas_lasso)).

This will change the density of the network. A more dense structure implies a BN with more parameters. If the structure is more dense, you have more parameters and your model may perform better. If it is too dense, though, you may not have enough data to train it and may overfit.
Center the Data. Empirically, it seems that NOTEARS (the algorithm behind from_pandas) works best if the data is centered. So, subtracting the mean of the see this may be a good idea.
Ensure causality. NOTEARS does not ensure causality. So we need "experts" to judge the output and make the necessary modifications. If you see edges that don't make causal sense, you can either remove them or add them as tabu_edges and train your network again.

Experiment with discretisation. The performance can be very sensitive to how you discretise the data. Experimenting with various types of discretisation can help. You can use:

Methods available in Causalnex (uniform, for example)
fixed discretisations based on what thresholds make sense for your data
MDLP is a supervised way to discretise data. You can apply MDLP for each node having as "target" one of its children. There are 2 main packages for MDLP in pypy: mdlp and mdlp-discretization

Improving prediction accuracy in Bayesian Causal Network

1 Answers1