3

i have some data:

A
AXNHJNEHWXNOECMEJK
DNFJNXYEEQWhsdbchjsxs
XMJQWsdsEOJdfsKMDJE

....

Each row is array and each letter is object. I have comparer function which could say that letter A is equavalent of letter a(actually it is not letter. It's russian words and comparer function use morphology to let me know that word are equal for example матрешка==матрешки==матрешкины and arrays are russian sentences. For example: "Мама мыла раму"). I want to create tree data structure which looks like:

1) A
2.1) BA
2.2) DHBAFH
3.1) BEDMEWA
etc...

Otherwise child nodes must contain letters from parent nodes. If you know how to work google adwords i think you can understand me. My question is how to do that FAST. I need to create tree with thousands arrays. Compare function works very slow(it use big dictionary) that's why speed is real problem.

Some simple data(sorry for russian):

here is set of sentences

сайты        
сайты недорого
сайты дешево
сайты дешево и быстро
красивый сайт по доступным ценам 
хочу купить хороший стул 
стул по доступным ценам

we must create following tree data structure

1) сайты
1->2.1) сайты недорого
1->2.2) сайты дешево
1->2.3) красивый сайт по доступным ценам 
1->2.2->3) сайты дешево и быстро

other parent nodes:

1) хочу купить хороший стул 
1) стул по доступным ценам

Child nodes must contain more words then parent.

Neir0
  • 12,849
  • 28
  • 83
  • 139
  • 1
    Could you show some sample data and what tree would you like to build out of them? Because it is not clear to me, what exactly do you want to do. – svick May 21 '11 at 15:06
  • @Neir0, why is “красивый сайт по доступным ценам” a child of “сайты”? Because your comparer says that “сайты” == “сайт”? – svick May 21 '11 at 15:35
  • @svick Yeah. "сайт" translates as "site" and "сайты" as "sites". It's different form of the same word. – Neir0 May 21 '11 at 15:39
  • @Neir0, what if one sentence would logically belong under two roots? E.g. if you had “сайты”, “недорого” and “сайты недорого”, should “сайты недорого” be in the tree twice, once under “сайты” and once under “недорого”? – svick May 21 '11 at 15:51
  • @svick Yes. Then "сайты недорого" has two parents "сайты" and "недорого" – Neir0 May 21 '11 at 15:55

2 Answers2

1

Well,

Seems that this link could be helpful for your problem

Fast String Searching With Suffix Trees: http://marknelson.us/1996/08/01/suffix-trees/

and

Suffix tree

http://en.wikipedia.org/wiki/Suffix_tree

k-dev
  • 1,657
  • 19
  • 30
1

Start with sentences that have one word. They all are going to be parent nodes, so this is simple.

Then continue with two-word sentences. You have to match each of them with every one-word parent node, which is going to be quite slow, because of your slow comparison function. You can do two optimizations, though: first check whether the words are exactly the same. You can do this yourself and it's going to be fast. Another one is to remember the results of the comparison function for every pair of compared words. You're going to waste some memory, but you're going to gain some speed.

When a node matches, add the sentence to it. When the sentence doesn't match any node, make it a parent node.

For sentences with gradually increasing lengths, you do the same, except you have to try matching children of a node that matched, to find the correct place to add the sentence.

svick
  • 236,525
  • 50
  • 385
  • 514