I have a CSV file with the following attributes: (id, date, status).

First I load the values into a dataframe and process them:

import pandas as pd

# get the data in a data frame
log_csv = pd.read_csv('HILTGLOBAL.csv', sep=',')

from pm4py.objects.log.util import dataframe_utils

# processing the data
log_csv = dataframe_utils.convert_timestamp_columns_in_df(log_csv)
log_csv['createdDate'] = pd.to_datetime(log_csv.createdDate)
# truncate the timestamps to day resolution
log_csv['createdDate'] = log_csv['createdDate'].values.astype('datetime64[D]')
log_csv = log_csv.sort_values('createdDate')

After that I rename some columns as required by PM4Py and convert the dataframe to an event log:

from pm4py.objects.conversion.log import converter as log_converter

# renaming
log_csv.rename(columns={'currentStatus': 'concept:name',
                        'createdDate': 'time:timestamp',
                        'candidateId': 'case:concept:name'}, inplace=True)

# getting the event logs
log = log_converter.apply(log_csv)

Then I try to get the directly follows graph of the above dataframe. I want the edges to represent the average time between each stage.

from pm4py.algo.discovery.dfg import algorithm as dfg_discovery
from pm4py.objects.dfg.retrieval.log import Parameters
from pm4py.visualization.dfg import visualizer as dfg_visualization

# discover the performance DFG, aggregating edge durations with the mean
dfg = dfg_discovery.apply(log, variant=dfg_discovery.Variants.PERFORMANCE,
                          parameters={Parameters.AGGREGATION_MEASURE: 'mean'})

# render the DFG with the performance values on the edges
gviz = dfg_visualization.apply(dfg, log=log, variant=dfg_visualization.Variants.PERFORMANCE,
                               parameters={Parameters.AGGREGATION_MEASURE: 'mean'})
dfg_visualization.view(gviz)

However, outliers appear to be ignored when the average time is calculated. I do not know how to fix this so that all data points are considered.

ayushi

1 Answer

Have you tried to play around with MAX_NO_EDGES_IN_DIAGRAM in parameters?

Add it to parameters like so, and try changing the number of edges:

dfg = dfg_discovery.apply(log, variant=dfg_discovery.Variants.PERFORMANCE,
                          parameters = {Parameters.AGGREGATION_MEASURE:'mean', Parameters.MAX_NO_EDGES_IN_DIAGRAM:80})
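
If it is unclear whether the averages shown in the diagram really cover every observation, one way to check is to compute the mean durations independently with pandas and compare them with the edge labels. The following is only a minimal sketch: the toy data and the helper name mean_transition_seconds are made up for illustration, it assumes the column names from the renamed dataframe in the question, and PM4Py's own computation may differ in details.

```python
import pandas as pd

# Toy event log with the PM4Py column names (all values are made up).
events = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c1", "c2", "c2", "c2"],
    "concept:name": ["applied", "screened", "hired",
                     "applied", "screened", "hired"],
    "time:timestamp": pd.to_datetime([
        "2021-01-01", "2021-01-03", "2021-01-10",
        "2021-01-02", "2021-01-12", "2021-01-13",
    ]),
})


def mean_transition_seconds(df):
    """Mean elapsed seconds per directly-follows pair, using every observation."""
    # Order events chronologically within each case.
    df = df.sort_values(["case:concept:name", "time:timestamp"])
    grouped = df.groupby("case:concept:name")
    # Pair each event with the next event of the same case.
    pairs = pd.DataFrame({
        "source": df["concept:name"],
        "target": grouped["concept:name"].shift(-1),
        "seconds": (grouped["time:timestamp"].shift(-1)
                    - df["time:timestamp"]).dt.total_seconds(),
    }).dropna(subset=["target"])  # the last event of a case has no successor
    return pairs.groupby(["source", "target"])["seconds"].mean()


means = mean_transition_seconds(events)
print(means)
```

If these values match the edge labels, all observations are being averaged and the difference lies only in which edges the diagram displays; if they do not match, the discrepancy comes from the aggregation itself.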
Jorge Luis