I have a CSV file with the following attributes: (id, date, status).
First I load the values into a dataframe and process them:
import pandas as pd
from pm4py.objects.log.util import dataframe_utils
# get the data in a data frame
log_csv = pd.read_csv('HILTGLOBAL.csv', sep=',')
# processing the data
log_csv = dataframe_utils.convert_timestamp_columns_in_df(log_csv)
log_csv['createdDate'] = pd.to_datetime(log_csv.createdDate)
# truncate the timestamps to day precision
log_csv['createdDate'] = log_csv['createdDate'].values.astype('datetime64[D]')
log_csv = log_csv.sort_values('createdDate')
After that I rename some columns as required by PM4Py and build the event log:
from pm4py.objects.conversion.log import converter as log_converter
# renaming
log_csv.rename(columns={'currentStatus': 'concept:name', 'createdDate': 'time:timestamp', 'candidateId': 'case:concept:name'}, inplace=True)
# getting the event logs
log = log_converter.apply(log_csv)
Then I try to get the directly-follows graph of the above dataframe. I want the edges to represent the average time between each stage:
from pm4py.algo.discovery.dfg import algorithm as dfg_discovery
from pm4py.objects.dfg.retrieval.log import Parameters
from pm4py.visualization.dfg import visualizer as dfg_visualization
# discover the performance DFG, aggregating edge durations with the mean
dfg = dfg_discovery.apply(log, variant=dfg_discovery.Variants.PERFORMANCE, parameters={Parameters.AGGREGATION_MEASURE: 'mean'})
# render the DFG with the same aggregation measure
gviz = dfg_visualization.apply(dfg, log=log, variant=dfg_visualization.Variants.PERFORMANCE, parameters={Parameters.AGGREGATION_MEASURE: 'mean'})
dfg_visualization.view(gviz)
However, the outliers are ignored when the average time is calculated. How can I fix this so that all data points are considered?
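To compare against the values drawn on the edges, the mean time between directly following stages can be recomputed independently of PM4Py. The following is a minimal sketch using only the standard library. The function name `mean_seconds_between_stages` and the sample rows are hypothetical, and it is not PM4Py's implementation, so details such as the handling of events with identical timestamps may differ.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mean_seconds_between_stages(rows):
    """Mean seconds between directly following activities.

    rows: iterable of (case_id, activity, timestamp) tuples (hypothetical layout).
    Returns a dict mapping (source_activity, target_activity) to the mean duration.
    """
    by_case = defaultdict(list)
    for case_id, activity, timestamp in rows:
        by_case[case_id].append((timestamp, activity))

    durations = defaultdict(list)
    for events in by_case.values():
        # Stable sort on the timestamp only, so ties keep their input order.
        events.sort(key=lambda event: event[0])
        for (t0, a0), (t1, a1) in zip(events, events[1:]):
            durations[(a0, a1)].append((t1 - t0).total_seconds())

    # Plain arithmetic mean over every observed pair: nothing is filtered out.
    return {edge: mean(values) for edge, values in durations.items()}


# Made-up sample data: case 2 contains a deliberately slow A -> B transition.
sample_rows = [
    (1, "A", datetime(2021, 1, 1)),
    (1, "B", datetime(2021, 1, 2)),
    (1, "C", datetime(2021, 1, 4)),
    (2, "A", datetime(2021, 1, 1)),
    (2, "B", datetime(2021, 1, 11)),
]
example = mean_seconds_between_stages(sample_rows)
```

With the renamed dataframe, the rows could be passed as `log_csv[['case:concept:name', 'concept:name', 'time:timestamp']].itertuples(index=False, name=None)`. Because the timestamps were truncated to whole days earlier, any transitions that happen on the same day contribute a duration of zero to the mean.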