
I am trying to construct an artificial intelligence unit. I plan to do this by first collecting sensory input ('observations') into a short-term working-memory list, continually forming patterns found in this list ('ideas'), and committing those ideas to a long-term storage memory when they reach a substantial size, perhaps seven chained observations. For any philosophy folks, this is similar to Locke's Essay Concerning Human Understanding, but it won't be a tabula rasa; there needs to be an encoded underlying structure.
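To make that concrete, here is a minimal sketch of the memory layout I have in mind (Python; the names like WorkingMemory and CHUNK_SIZE are purely illustrative, nothing is implemented yet):

```python
# Illustrative sketch only: a working-memory buffer that commits
# "ideas" to long-term storage once they reach a threshold size.

CHUNK_SIZE = 7  # "perhaps seven chained observations"

class WorkingMemory:
    def __init__(self):
        self.short_term = []   # recent observations, in arrival order
        self.long_term = []    # consolidated ideas (tuples of observations)

    def observe(self, token):
        """Append a new sensory token and consolidate if the buffer is full."""
        self.short_term.append(token)
        if len(self.short_term) >= CHUNK_SIZE:
            idea = tuple(self.short_term[:CHUNK_SIZE])
            self.long_term.append(idea)
            self.short_term = self.short_term[CHUNK_SIZE:]

# Example: feed a stream of single-character observations.
wm = WorkingMemory()
for ch in "ABCDABCABCD":
    wm.observe(ch)
print(wm.long_term)   # ideas committed so far
print(wm.short_term)  # leftovers still in working memory
```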

Thus, my question is:

Is there (and if so, where can I find) a good algorithm for dynamically consolidating or 'pattern-izing' the largest repeated substrings of this constantly growing observation string? For example: if I have thus far been given ABCDABCABC, I want an ABC idea, a D, and two other ABC ideas; then, if another D is observed and added to the short-term memory, I want an ABCD token, an ABC token, and another ABCD token. I don't want to use Shortest Common Substring, because I would need to rerun it after an arbitrary number of character additions; I think I'd prefer some easily searchable/modifiable tree structure.
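To show the flavour of what I mean by dynamically growing patterns (not necessarily the algorithm I will end up with), here is a rough LZ78/LZW-style sketch; it does not chunk exactly as described above, but the dictionary does grow longer patterns as they recur:

```python
# Rough illustration only: an LZ78/LZW-style incremental dictionary
# that learns longer patterns each time a known pattern recurs.

def consolidate(stream):
    patterns = set()   # known patterns ("ideas")
    tokens = []        # tokens emitted so far
    current = ""       # pattern currently being extended
    for ch in stream:
        if current + ch in patterns:
            current += ch              # keep extending a known pattern
        else:
            if current:
                tokens.append(current)  # emit the longest known pattern
            patterns.add(current + ch)  # learn the slightly longer pattern
            current = ch
    if current:
        tokens.append(current)
    return tokens

print(consolidate("ABCDABCABCD"))
# ['A', 'B', 'C', 'D', 'AB', 'C', 'ABC', 'D'] -- patterns lengthen over time
```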

Does this look like a decent enough solution? http://www.cs.ucsb.edu/~foschini/files/licenza_spec_thesis.pdf. If nothing else, I think the other data-miners may enjoy it.

Bondolin
  • what are {A,B,C,D}? single characters, words/tokens, or substrings? – wildplasser Feb 28 '13 at 00:28
  • I am no expert in this field, but this sounds very much like what you want to do when you build a dictionary for a compression algorithm. – Ryan Feb 28 '13 at 00:29
  • @wildplasser - they are 'observations,' sensory input tokens, but as far as I'm concerned they could be characters. – Bondolin Feb 28 '13 at 00:42
  • @Ryan - as in Huffman coding? – Bondolin Feb 28 '13 at 00:43
  • Then: construct a DFA, and run it on the token stream. – wildplasser Feb 28 '13 at 00:45
  • @wildplasser - Thanks for the replies; this is getting at what I am trying to do; but I want the algorithm to develop the DFAs on its own. An encoded DFA already knows the language's words; I would like an algorithm that comes up with the words. – Bondolin Feb 28 '13 at 01:00
  • 1
    You'll need a first pass (the tokeniser) which consumes characters, and returns them as "words" or "tokens". (if you cannot construct such a tokenizer, your token will be the smallest possible token: a single character) Developing a "dynamic" DFA is very close to constructing a Markov tree (or decision tree, or Bayes tree). (except , for instance, that loops are possible) See my profile for Wakkerbot's tokeniser (which is rather advanced, IMHO, but very domain-specific) – wildplasser Feb 28 '13 at 01:06
  • 1
    Huffman coding, or some of the LZ class algorithms. – Ryan Feb 28 '13 at 03:19
  • @wildplasser - great leads, thanks. I will DFAnitely check out Wakkerbot. Would you mind making an answer out of the comment so I can accept it? – Bondolin Feb 28 '13 at 03:21
  • Ok, I'm back. I'll construct an answer. – wildplasser Feb 28 '13 at 22:49
  • @Ryan - thanks. Not sure how well I could use Huffman coding, but the dictionary algorithms look interesting. – Bondolin Mar 01 '13 at 01:31

1 Answer


First step: the tokeniser. Define what you consider {A,B,C,D} and what not (a minimal sketch of such a tokeniser follows the list below).

  • you need at least one extra token for garbage/miscellaneous stuff (the good news is that if this token occurs, the state machine that follows will always be reset to its starting state)
  • you may or may not want to preserve whitespace (which would again cause an extra token, and a lot of extra states later in the DFA or NFA recogniser)
  • maybe you need some kind of equivalence class: e.g. wrap all numeric strings into one token type; fold lower/uppercase; accept some degree of misspelling (difficult!)
  • you might need special dummy token types for begin-of-line/end-of-line and the like.
  • you must make some choice about the number of false positives versus the number of false negatives that you allow.
  • if there is text involved make sure that all the sources are in the same canonical encoding, or preprocess them to bring them into the same encoding.
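A minimal tokeniser sketch along these lines, assuming character input and the equivalence classes above (the token names such as NUMBER and GARBAGE are just illustrative):

```python
import re

# Minimal tokeniser sketch: maps raw characters onto a small set of
# token types, with a garbage class that resets the recogniser.
TOKEN_SPEC = [
    ("NUMBER",  r"\d+"),        # all numeric strings collapse to one type
    ("WORD",    r"[A-Za-z]+"),  # case is folded below
    ("NEWLINE", r"\n"),         # dummy token for end-of-line
    ("SPACE",   r"[ \t]+"),     # whitespace, if you decide to keep it
    ("GARBAGE", r"."),          # everything else: reset the state machine
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenise(text):
    for m in TOKEN_RE.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind == "WORD":
            value = value.lower()   # fold upper/lower case
        yield kind, value

print(list(tokenise("ABC 123 abc?")))
```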

Building the tokeniser is an excellent way to investigate your corpus: if it is real data from the outside world, you will be amazed at the funky cases you did not even know existed when you started!

The second step (the recogniser) will probably be much easier, given the right tokenisation. For a normal deterministic state machine (with predefined sequences to recognise) you can use the standard algorithms from the Dragon Book, or from Crochemore.
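For the deterministic case, a trie over the predefined token sequences, walked greedily over the stream, already gets you quite far. A rough sketch (the pattern set here is made up to match the question's example):

```python
# Rough sketch of a deterministic recogniser: a trie over token
# sequences, walked greedily (longest match wins) over the stream.

def build_trie(patterns):
    root = {}
    for pat in patterns:
        node = root
        for tok in pat:
            node = node.setdefault(tok, {})
        node["$end"] = pat          # mark an accepting state
    return root

def recognise(tokens, trie):
    i = 0
    while i < len(tokens):
        node, j, match = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "$end" in node:
                match = (node["$end"], j)   # remember longest match so far
        if match:
            yield match[0]
            i = match[1]
        else:
            yield tokens[i]                  # no pattern starts here
            i += 1

trie = build_trie([("A", "B", "C"), ("A", "B", "C", "D")])
print(list(recognise(list("ABCDABCABCD"), trie)))
# [('A','B','C','D'), ('A','B','C'), ('A','B','C','D')]
```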

For fuzzy self-learning matchers, I would start by building Markov chains or trees (maybe Bayes trees, I am not an expert on this). I don't think it will be very hard to start with a standard state machine, and add some weights and counts to the nodes and edges. And dynamically add edges to the graph, or remove them. (This is where I expect it to start getting hard.)
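A minimal sketch of the 'weights and counts' idea, assuming a plain first-order Markov chain over tokens (all names here are illustrative):

```python
from collections import defaultdict

# Minimal sketch: a first-order Markov chain over tokens, where edges
# carry counts that are strengthened as the stream is consumed.

class MarkovChain:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # state -> next -> count

    def observe(self, tokens):
        for prev, nxt in zip(tokens, tokens[1:]):
            self.counts[prev][nxt] += 1     # dynamically add/strengthen an edge

    def probability(self, prev, nxt):
        total = sum(self.counts[prev].values())
        return self.counts[prev][nxt] / total if total else 0.0

mc = MarkovChain()
mc.observe(list("ABCDABCABCD"))
print(mc.probability("C", "D"))   # how often does D follow C?
print(mc.probability("C", "A"))
```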

A strategic decision: do you need a database? If your model fits in core, you won't need one, and you should not use one. (Databases are not intended to fetch one row, process it, store it, then fetch the next row, etc.) If your data does not fit in core, you'll have more than a data-modelling problem. BTW: all the DNA assemblers/matchers that I know of work in core and with flat files (maybe backed up by a database for easy management and inspection).

wildplasser
  • Thanks much; especially eager to look into the self-learning matchers. – Bondolin Mar 01 '13 at 01:01
  • BTW: what is your domain of interest? Natural language? Logfile analysis? DNA recognition? Digital music recognition ;-? – wildplasser Mar 01 '13 at 01:03
  • I briefly mention it in the first paragraph of the question. I wish to make an autonomous learning unit, with an input stream of sensory input tokens (the A, B, C, and Ds). Neural association on a low level takes place by them simply arriving adjacent to each other. The pattern recognition encodes ideas. I suppose this isn't quite an AI yet, but like a computer's ALU, this ALU is meant to function as the core of operations for a functional AI. Being as man is made in the image of God (Gen. 1:27), I do not hope for a recreation of the soul, but am optimistic of reproducing an animal brain. – Bondolin Mar 01 '13 at 01:29
  • You tell me: I created a monster: http://twitter.com/Hubert_B_Both and it just won't die. Next weekend it will be Bible-enabled (TM) and Genesis-proof, too. – wildplasser Mar 01 '13 at 01:36
  • Interesting. Can it reason? What do you mean by Bible-enabled? – Bondolin Mar 01 '13 at 02:40
  • 1) No, it cannot reason like Watson. It just mumbles like an old man, triggered by keywords (twitter is an excellent medium for this). 2) I will soon pour a complete bible into the corpus. – wildplasser Mar 01 '13 at 08:45