The problem I am facing is that I want to do process mining on a set of data, but I am having difficulty converting the raw data in Excel into the event log required by process mining software. Is there any way to transform the data from Excel into an event log?
2 Answers
If your data is in CSV format, you can transform it to XES in ProM/ProM Lite by using the "convert csv to xes" plugin.

- Thank you, but can I download the XES file from ProM Lite for Python programming? – Sukanya s May 26 '20 at 12:35
- Yes, you can use XES in PM4Py. There is a simple function there to load an event log in the XES format: `from pm4py.objects.log.importer.xes import factory as xes_import_factory` and then `log = xes_import_factory.apply("")` – MRFS May 26 '20 at 16:42
It is not a trivial task at all. However, it follows an approach that is simply not a common tool in the typical data scientist's toolbox :) Here, I explain the basic idea!
Many process mining papers mention that most existing information systems are PAIS (process-aware information systems) and are therefore suitable candidates for process mining. This is true, BUT it does not mean you can get the event data out of the box!
What's the solution? You transform the existing data (typically from the relational database behind your business solution, e.g., an ERP or HIS system) into an event log that process mining tools can understand.
It works like this: you look into the table containing, e.g., the patient registration data. From this table you need the patient ID and the registration timestamp for each ID. You create an empty table for your event log, typically called the "Activity_Table". You give each activity a name that fits the business context; in our example, "Patient Registration" would be a sound name. You then insert all the patient IDs with their respective timestamps into the Activity_Table, followed by the same activity name in every row, i.e., "Patient Registration" (a small pandas sketch follows the table below). The result looks like this:
|Patient-ID | Activity | timestamp |
|:----------|:--------------------:| -------------------:|
| 111 |"Patient Registration"| 2021.06.01 14:33:49 |
| 112 |"Patient Registration"| 2021.06.18 10:03:21 |
| 113 |"Patient Registration"| 2021.07.01 01:20:00 |
| ... | | |
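A minimal pandas sketch of this first step, assuming the raw data sits in an Excel file; the file, sheet, and column names (`hospital_data.xlsx`, `registrations`, `patient_id`, `registration_time`) are made up for illustration:

```python
import pandas as pd

# Read the raw registration data exported from Excel.
# File, sheet, and column names are assumptions for this example.
raw = pd.read_excel("hospital_data.xlsx", sheet_name="registrations")

# Build the mini activity table: one row per patient registration.
registration = pd.DataFrame({
    "Patient-ID": raw["patient_id"],
    "Activity": "Patient Registration",
    "timestamp": pd.to_datetime(raw["registration_time"]),
})
```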
Congrats! You have an event log with one activity. The rest works just the same: you create the same kind of table for every important action that has a timestamp in your database, e.g., "Diagnose finished", "Lab test requested", "Treatment A finished".
|Patient-ID | Activity | timestamp |
|:----------|:-----------------:| -------------------:|
| 111 |"Diagnose finished"| 2021.06.21 18:03:19 |
| 112 |"Diagnose finished"| 2021.07.02 01:22:00 |
| 113 |"Diagnose finished"| 2021.07.01 01:20:00 |
| ... | | |
Then you UNION all these mini tables and sort the result by Patient-ID and then by timestamp (a sketch follows the table below):
|Patient-ID | Activity | timestamp |
|:----------|:--------------------:| -------------------:|
| 111 |"Patient Registration"| 2021.06.01 14:33:49 |
| 111 |"Diagnose finished" | 2021.06.21 18:03:19 |
| 112 |"Patient Registration"| 2021.06.18 10:03:21 |
| 112 |"Diagnose finished" | 2021.07.02 01:22:00 |
| 113 |"Patient Registration"| 2021.07.01 01:20:00 |
| 113 |"Diagnose finished" | 2021.07.01 01:20:00 |
| ... | | |
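Continuing the sketch above (each mini table, e.g. `registration` or the hypothetical `diagnose_finished` and `lab_test_requested`, is a DataFrame with the same three columns), the UNION and sort could look like this:

```python
import pandas as pd

# Stack (UNION) all mini activity tables into one event log ...
event_log = pd.concat(
    [registration, diagnose_finished, lab_test_requested],
    ignore_index=True,
)

# ... and sort it per case, then chronologically within each case.
event_log = event_log.sort_values(["Patient-ID", "timestamp"]).reset_index(drop=True)
```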
If you look closely, the last two rows have the same timestamp. This is very common when working with real data. To handle such ties, we need an extra ordering column (called "Order" below) that tells the process mining algorithm the "normal" order of activities sharing a timestamp, according to the nature of the underlying business. In this case we know that registration happens before diagnosis, so we assign a low value (e.g., 1) to all "Patient Registration" activities (a sketch follows the table below). The table might look like this:
|Patient-ID | Activity | timestamp |Order |
|:----------|:--------------------:|:-------------------:| ----:|
| 111 |"Patient Registration"| 2021.06.01 14:33:49 | 1 |
| 111 |"Diagnose finished" | 2021.06.21 18:03:19 | 2 |
| 112 |"Patient Registration"| 2021.06.18 10:03:21 | 1 |
| 112 |"Diagnose finished" | 2021.07.02 01:22:00 | 2 |
| 113 |"Patient Registration"| 2021.07.01 01:20:00 | 1 |
| 113 |"Diagnose finished" | 2021.07.01 01:20:00 | 2 |
| ... | | | |
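Adding the tie-breaker column to the combined `event_log` from the previous sketch could look like this; the priority values are assumptions that encode the business order of the activities:

```python
# Assumed priority of activities that may share a timestamp;
# lower values should come first within one case.
activity_order = {
    "Patient Registration": 1,
    "Diagnose finished": 2,
}

# Map each activity to its priority and use it as a tie-breaker after the timestamp.
event_log["Order"] = event_log["Activity"].map(activity_order)
event_log = event_log.sort_values(
    ["Patient-ID", "timestamp", "Order"]
).reset_index(drop=True)
```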
Now, you have an event log that process mining algorithms understand!
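If you want to continue in Python, this table can be handed to PM4Py directly; a minimal sketch, assuming a recent pm4py 2.x release and the column names used above:

```python
import pm4py

# Tell pm4py which columns play the case, activity, and timestamp roles.
formatted = pm4py.format_dataframe(
    event_log,
    case_id="Patient-ID",
    activity_key="Activity",
    timestamp_key="timestamp",
)

# Convert to an event log object and export it as XES,
# so tools like ProM can read it as well.
log = pm4py.convert_to_event_log(formatted)
pm4py.write_xes(log, "hospital_event_log.xes")
```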
Side note: there have been many attempts to automate the event log extraction process. The work of Eduardo González López de Murillas is really interesting if you want to follow this topic. I can also recommend this open-access paper by Eduardo et al. (2018): "Connecting databases with process mining: a meta-model and toolset" (https://link.springer.com/article/10.1007/s10270-018-0664-7)
