
I am trying to use WARMR to find frequent relational patterns in my data; for this I am using Aleph in SWI-Prolog. However, I am struggling to figure out how to do this and why my previous attempts did not work.

I want to get a toy example working before moving on to my full data. For this I took the toy "trains" data from the Aleph pack page: http://www.swi-prolog.org/pack/list?p=aleph

The Aleph manual says this about the ar search:

ar Implements a simplified form of the type of association rule search conducted by the WARMR system (see L. Dehaspe, 1998, PhD Thesis, Katholieke Universitaet Leuven). Here, Aleph simply finds all rules that cover at least a pre-specified fraction of the positive examples. This fraction is specified by the parameter pos_fraction.

Accordingly, I have inserted

:- set(search,ar).
:- set(pos_fraction,0.01). 

into the background (.b) file (and deleted :- set(i,2).), and erased the .n file of negative examples. I have also commented out all the determinations and the modeh declaration, the logic being that we are searching for frequent patterns, not rules: in a supervised context the head would be an "output" variable and the literals in the body "inputs" trying to explain that output, whereas this is an unsupervised task.

Now, the original trains dataset is set up to construct rules for "eastbound" trains. This is done with predicates like car, shape, has_car(train, car), etc. Originally, all the background knowledge relating to these is located in the .b file and the five positive examples (e.g. eastbound(east1).) in the .f file (plus five negative examples, e.g. eastbound(west1)., in the .n file). Leaving the files unchanged (save for the changes described above) and running induce. does not produce a sensible result (it returns ground terms like train(east1) as a "rule", for example). I have tried moving some of the background knowledge to the .f file, but that did not produce anything sensible either.

How do I go about constructing the .f and .b files? What should go into the positive examples file if we are not really looking to explain any positive examples (which would surely constitute a supervised problem) but instead to find frequent patterns in the data (an unsupervised problem)? Am I missing something?

Any help would be greatly appreciated.

stas g

1 Answer


First of all, if you can use the original WARMR, I think that is better. However, I think you need to be an academic for free use; you can try asking for a licence: https://dtai.cs.kuleuven.be/ACE/

To get association rules, I put all the examples I want in the .f file. The .n file can contain examples or, I think, be empty.
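For the toy trains data, that would mean an .f file along these lines (assuming the east1/west1 naming used in the original example files; the elided trains follow the same pattern):

    % train.f -- every train listed as an example, east- and westbound alike
    eastbound(east1).
    eastbound(east2).
    % ... remaining east trains ...
    eastbound(west1).
    % ... remaining west trains ...

The head predicate name does not matter for the pattern search; eastbound/1 is just reused from the original files so the examples match the mode declarations.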

The only thing I change is to put:

 :- set(search,ar).
 :- set(pos_fraction,0.01).  

in the .b file. Keep the determinations and mode declarations.
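Putting that together, the top of the .b file would look roughly like this (the mode declarations and determinations below are the kind shipped with the trains example; your copy may use slightly different ones):

    % train.b -- switch on the WARMR-style association rule search
    :- set(search,ar).
    :- set(pos_fraction,0.01).

    % keep the original language bias, e.g.:
    :- modeh(1,eastbound(+train)).
    :- modeb(*,has_car(+train,-car)).
    :- modeb(1,short(+car)).
    :- determination(eastbound/1,has_car/2).
    :- determination(eastbound/1,short/1).

The rest of the .b file (the background facts about cars, shapes, loads, etc.) stays as it is.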

The set(i,2) limits the length of the query to two additional literals (I think), so you might want to make this larger.
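If the discovered queries come out too short, the bound can be raised in the same .b file, e.g.:

    :- set(i,3).  % was set(i,2) in the original files; larger values allow longer queries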

 ?- read_all(train).
 ?- induce.

You will then get an output of 'good clauses', which I think are the frequent queries:

 [good clauses]
 eastbound(A). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), long(B). [pos cover = 2 neg cover = 0] [pos-neg] [2]
 eastbound(A) :- has_car(A,B), open_car(B). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), shape(B,rectangle). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), wheels(B,2). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), load(B,rectangle,3). [pos cover = 1 neg cover = 0] [pos-neg] [1]
 eastbound(A) :- has_car(A,B), has_car(A,C). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), has_car(A,C). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), has_car(A,C). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), short(B). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), closed(B). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), shape(B,rectangle). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), wheels(B,2). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), load(B,triangle,1). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), has_car(A,C). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), has_car(A,C). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), long(B). [pos cover = 2 neg cover = 0] [pos-neg] [2]
 eastbound(A) :- has_car(A,B), open_car(B). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), shape(B,rectangle). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), wheels(B,3). [pos cover = 3 neg cover = 0] [pos-neg] [3]
 eastbound(A) :- has_car(A,B), load(B,hexagon,1). [pos cover = 1 neg cover = 0] [pos-neg] [1]
 eastbound(A) :- has_car(A,B), has_car(A,C). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), short(B). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), open_car(B). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), shape(B,rectangle). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), wheels(B,2). [pos cover = 5 neg cover = 0] [pos-neg] [5]
 eastbound(A) :- has_car(A,B), load(B,circle,1). [pos cover = 3 neg cover = 0] [pos-neg] [3]
 eastbound(A) :- has_car(A,B), open_car(B), shape(B,rectangle). [pos cover = 4 neg cover = 0] [pos-neg] [4]

etc etc

The rules are of the form eastbound(A) :- blah blah., but Aleph is only counting over the eastbound examples. So think of this as example_covered(A) :- blah blah.
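So a clause from the output above can be read as a frequent query over the examples, with the head name ignored; mentally it is something like:

    % the body is the frequent pattern; [pos cover] is how many of the
    % listed examples it matches (4 of the 5 trains here)
    example_covered(A) :- has_car(A,B), open_car(B), shape(B,rectangle).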

user27815
  • user27815, thank you for your answer. I have tried it out, but I am still confused about needing to have positive examples. I think `show(features)` prints all the "good clauses". In contrast, when searching for constraints (through `induce_constraints`) one does not need to provide either the .f file or the .n file. – stas g Sep 22 '15 at 14:52
  • My data is of genetic origin (information on markers, genes, mutations, etc.) and I want to learn frequent patterns in it. I do have phenotypic (trait) data as well, but I need to learn features of the genetic data not tied to any of the traits (which are the only candidates for positive examples). – stas g Sep 22 '15 at 14:55
  • WARMR needs to count something. In the original WARMR you can set what it will count using a key. In the trains example, presumably you want to count trains. Because you are not interested in positives or negatives, you only need one file to describe the trains. You might think that Aleph would just combine all the trains from the positives and negatives, but it does not do that: it only takes example trains from the positive file. If you want it to find frequent queries over all the trains, you need to put all the examples in the .f file. induce_constraints is different from frequent queries. – user27815 Sep 22 '15 at 15:58
  • When the search is ar, it runs show(features), so the output is similar, but the criterion used is also coverage of positives only. You can look at the source code; it might help. – user27815 Sep 22 '15 at 16:06
  • Hi, thank you for the comments! It definitely makes sense now and I think it has finally sunk in. I've put all the train examples (east and west) in the positive examples file and run `induce`. show(features) now displays clauses with the desired coverage (according to the value of pos_fraction). Predictably, the only rule found is `eastbound(A).`, which also makes sense. I have also checked examples from the ACE software, and the way things are encoded in the .s file does seem a bit more intuitive; so does the output. However, faffing about with obtaining the licence is a deterrent. – stas g Sep 22 '15 at 16:21
  • I will try to move on to my bioinformatics data now. In light of our helpful discussion, I am thinking markers would have to play the role of examples, as it is their properties (location, ORFs they sit in, etc.) that we are concerned with. Have you published any of your data mining work on bioinformatics data? – stas g Sep 22 '15 at 16:23
  • No problem, they are quite responsive in giving licences, normally just a quick email. The main advantage of using ACE is that you can add constraints to the language bias and use the rmodes, which are quite powerful. I have a short paper here which I am expanding: http://www.ilp2015.jp/papers/ILP2015_submission_29.pdf – user27815 Sep 22 '15 at 16:25