
I have a dataset of air quality values. For my research, I am comparing the Apriori, ECLAT and FP-growth algorithms against each other. For Apriori and FP-growth I used the mlxtend library, and for ECLAT I used pyECLAT. However, when I run the three algorithms, Apriori and FP-growth give the same result, but ECLAT gives a different one.

Apriori implementation

from mlxtend.frequent_patterns import apriori

frequent_itemset = apriori(encoded_data, min_support=min_support, use_colnames=True)

FP-growth implementation

from mlxtend.frequent_patterns import fpgrowth

frequent_itemset = fpgrowth(encoded_data, min_support=min_support, use_colnames=True)

I get the same result for Apriori and FP-growth

Support Itemset
1 (pm25-Zelo dobra)
1 (pm25-Zelo dobra, nox-Zelo nizka)
1 (nox-Zelo nizka)
0.99 (pm25-Zelo dobra, nox-Zelo nizka, no2-Zelo dobra)
0.99 (pm25-Zelo dobra, no2-Zelo dobra)
0.99 (nox-Zelo nizka, no2-Zelo dobra)
0.99 (no2-Zelo dobra)

ECLAT implementation

from pyECLAT import ECLAT

eclat_instance = ECLAT(encoded_data, verbose=False)
get_ECLAT_indexes, get_ECLAT_supports = eclat_instance.fit(min_support=min_support, min_combination=1, max_combination=3, separator=' & ', verbose=False)

import pandas as pd

# Turn the {itemset string: support} dict into an mlxtend-style DataFrame.
frequent_itemset = pd.DataFrame(get_ECLAT_supports.items(), columns=['itemsets', 'support'])
frequent_itemset = frequent_itemset[['support', 'itemsets']]
# Split each ' & '-joined string into a tuple of item names.
frequent_itemset['itemsets'] = frequent_itemset['itemsets'].apply(
    lambda s: tuple(item.strip() for item in s.split('&'))
)
frequent_itemset

I used this proposed workaround so that the result has the same frequent_itemset format that mlxtend produces.

ECLAT result

Support Itemset
1 (pm25-Zelo dobra, benzen-Dobro)
1 (pm25-Zelo dobra,)
1 (nox-Zelo nizka, pm25-Zelo dobra, benzen-Dobro)
1 (nox-Zelo nizka, pm25-Zelo dobra)
1 (nox-Zelo nizka, benzen-Dobro)
1 (nox-Zelo nizka,)
1 (benzen-Dobro,)
0.99 (pm25-Zelo dobra, no2-Zelo dobra, benzen-Dobro)

Should the result be the same for all three?

I know that the three algorithms work differently: one is better suited to larger datasets, another to smaller ones, and they differ in speed.
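As far as I understand, the algorithms differ in how support is counted, not in what counts as a frequent itemset. The sketch below is my own illustration on hypothetical toy transactions. The data and both function names are made up, and neither function is the mlxtend or pyECLAT implementation. It contrasts scanning the transactions (horizontal, Apriori-style counting without pruning) with intersecting transaction-id sets (vertical, ECLAT-style):

```python
from itertools import combinations

# Hypothetical toy transactions, NOT the air quality data from the question.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c"},
]


def horizontal_supports(transactions, min_support, max_len):
    """Count every candidate itemset by scanning all transactions."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_len + 1):
        for candidate in combinations(items, k):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[frozenset(candidate)] = count / n
    return result


def vertical_supports(transactions, min_support, max_len):
    """Intersect per-item transaction-id sets, depth first."""
    n = len(transactions)
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)

    result = {}

    def extend(prefix, prefix_tids, remaining):
        for i, item in enumerate(remaining):
            # The first level has no prefix, so use the item's own tidset.
            tids = tidsets[item] if prefix_tids is None else prefix_tids & tidsets[item]
            if len(tids) / n >= min_support:
                itemset = prefix | {item}
                result[frozenset(itemset)] = len(tids) / n
                if len(itemset) < max_len:
                    extend(itemset, tids, remaining[i + 1:])

    extend(frozenset(), None, sorted(tidsets))
    return result


horizontal = horizontal_supports(transactions, min_support=0.6, max_len=3)
vertical = vertical_supports(transactions, min_support=0.6, max_len=3)
same = horizontal == vertical
```

On this toy input both approaches return the same itemsets with the same supports, which is why I expected the three libraries to agree when given the same data, the same min_support and the same maximum itemset size.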

The only difference in parameters in the implementation is that the ECLAT function has min_combination and max_combination. I have tried playing around with those values, but could not reproduce the same result.
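To pin down exactly where the outputs diverge, one option is to normalise both results to order-independent sets and diff them. This is a minimal sketch. The helper names `normalize` and `diff_itemsets` are hypothetical, and the rows are only a few entries copied from the two outputs above, not the full results:

```python
def normalize(rows):
    """Map each itemset to a frozenset so that item order does not matter."""
    return {frozenset(items): round(support, 2) for support, items in rows}


def diff_itemsets(left_rows, right_rows):
    """Return itemsets found only on one side, plus those whose supports differ."""
    left, right = normalize(left_rows), normalize(right_rows)
    only_left = set(left) - set(right)
    only_right = set(right) - set(left)
    mismatched = {k for k in set(left) & set(right) if left[k] != right[k]}
    return only_left, only_right, mismatched


# A few (support, itemset) rows taken from the Apriori/FP-growth output.
mlxtend_rows = [
    (1.0, ("pm25-Zelo dobra",)),
    (1.0, ("pm25-Zelo dobra", "nox-Zelo nizka")),
    (1.0, ("nox-Zelo nizka",)),
    (0.99, ("pm25-Zelo dobra", "no2-Zelo dobra")),
]

# A few (support, itemset) rows taken from the ECLAT output.
eclat_rows = [
    (1.0, ("pm25-Zelo dobra",)),
    (1.0, ("nox-Zelo nizka", "pm25-Zelo dobra")),
    (1.0, ("nox-Zelo nizka",)),
    (1.0, ("benzen-Dobro",)),
]

only_mlxtend, only_eclat, mismatched = diff_itemsets(mlxtend_rows, eclat_rows)
```

With the real DataFrames, the rows could be built with `zip(df['support'], df['itemsets'])`. Comparing as frozensets means that a pair listed in a different item order by the two libraries is not reported as a difference, so only genuinely missing itemsets such as `benzen-Dobro` remain.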
