2

I have a 500k+ word list that I have loaded into a DAWG data structure. My app is for mobile phones, so I of course don't want to repeat all the conversion steps every time: keeping the raw word list on the phone would take too much storage space, and loading it into a DAWG on every launch would take too much time. So I am looking for a way to store the data in my DAWG to a file or DB in a format that will both conserve space and allow me to quickly load it back into my DAWG data structure.

One suggestion I received was to store each node in a SQLite DB, but I am not sure how exactly that would work, or how I would retrieve the data quickly if I did; I certainly wouldn't want to run lots of queries. Would some other type of storage method be better? I also received suggestions to create a serialized file or to store it as a bitmap.

phs
Alex

3 Answers

2

You can basically do a memory dump; just use offsets instead of pointers (in Java terms, put all nodes in an array, and use the array index to refer to a node).
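A minimal sketch of the idea, assuming lowercase a–z words and a fixed per-node layout (the class and field names here are illustrative, not from the question). Each node lives at an index in a flat list, and every edge stores the child's index rather than an object reference, so the whole structure can be written to disk as plain ints and read back verbatim:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlatDawg {
    // One int[] per node: 26 child slots (a-z), -1 = no child,
    // plus one extra slot holding the end-of-word flag (0 or 1).
    static final int ALPHABET = 26;
    final List<int[]> nodes = new ArrayList<>();

    FlatDawg() { nodes.add(newNode()); }          // index 0 is the root

    private int[] newNode() {
        int[] n = new int[ALPHABET + 1];
        Arrays.fill(n, -1);
        n[ALPHABET] = 0;                          // not an end of word yet
        return n;
    }

    void add(String word) {
        int cur = 0;
        for (char c : word.toCharArray()) {
            int slot = c - 'a';
            if (nodes.get(cur)[slot] == -1) {     // allocate child by index
                nodes.get(cur)[slot] = nodes.size();
                nodes.add(newNode());
            }
            cur = nodes.get(cur)[slot];
        }
        nodes.get(cur)[ALPHABET] = 1;             // mark end of word
    }

    boolean contains(String word) {
        int cur = 0;
        for (char c : word.toCharArray()) {
            cur = nodes.get(cur)[c - 'a'];
            if (cur == -1) return false;
        }
        return nodes.get(cur)[ALPHABET] == 1;
    }
}
```

Strictly speaking this builds a trie rather than a minimized DAWG, but the storage trick is the same: since every reference is just an int index, dumping the structure is simply writing the arrays out in order, and loading is reading them back, with no pointer fixup needed.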

500k words doesn't seem like an amount that would be problematic for a modern phone, especially since a DAWG is quite compact already. If you mmap the file, you can work with the data structure even if it doesn't fit in memory.
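On Android this can be done with `java.nio`; a sketch, assuming a hypothetical dump file where node i occupies a fixed number of ints, so a child reference is just another node index into the same buffer:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class DawgDump {
    // Hypothetical fixed record layout: 26 child indices + 1 flag per node.
    static final int RECORD_INTS = 27;

    // Map the node dump read-only. The OS pages it in on demand, so the
    // structure is usable even when it is larger than the available heap.
    // (The mapping stays valid after the channel is closed.)
    static MappedByteBuffer open(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // Read field `slot` of node `node` straight out of the mapped file.
    static int field(MappedByteBuffer buf, int node, int slot) {
        return buf.getInt((node * RECORD_INTS + slot) * 4);
    }
}
```

Traversal then works exactly as with the in-memory array, except each `array[i]` becomes a `buf.getInt(...)` at the computed byte offset.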

Kornel
  • I should correct something: I meant it is 500,000 words, not 500k of storage space. The actual file is several MB. I know that on my G1 most of the apps take up less than 1 MB each, including data. I had to delete a lot of apps from my phone due to running out of space, and I didn't want that to happen with my app, so I wanted to be as efficient as possible. That is also one of the reasons I want to use a DAWG: to be efficient with memory usage. Thanks for the suggestion. I'll give that a try too and see which works best for me. – Alex Jan 11 '11 at 19:22
1

Did you try to reduce the word list? Are you saving only the word stem, if that is possible for your application?

On the other hand: you should never need to rebuild the data structure, because the word list is constant. Try a memory dump as suggested. Use mmap for the file, Java serialization, or pickling techniques to load a ready-made data structure into memory.
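For the Java serialization route, a minimal sketch (class name hypothetical): build the structure once on a desktop machine, serialize it, ship the file with the app, and deserialize at startup instead of rebuilding.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class DawgIo {
    // Write the prebuilt structure to disk once (offline, not on the phone).
    static void save(Serializable dawg, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(dawg);
        }
    }

    // Load the ready-made structure at app startup.
    static Object load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return in.readObject();
        }
    }
}
```

Note that default Java serialization carries per-class overhead; if file size matters, a hand-rolled dump of the node arrays (as in the accepted answer) is usually smaller.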

user624214
0

I guess you are using the DAWG for fast lookup of a word in a dictionary. A DAWG has O(LEN) search complexity.

Many years ago, I developed a J2ME app and faced the same problem. But in those days phones definitely couldn't provide that amount of RAM to store 500k+ strings. The solution I used is the following:

  1. Read all words, sort them, and write them to a file line by line, precomputing skipBytes for each word: the number of bytes before that word. Computing skipBytes is trivial; the pseudocode is skipBytes[0] = 0; for i = 1 to n: skipBytes[i] = skipBytes[i-1] + words[i-1].getByteLength().
  2. When the app starts, read the 500k skipBytes values into an int array. That is much smaller than 500k strings.
  3. To search for a word in the dictionary, do a binary search. Imagine you are performing it on a sorted array, but instead of reading array[i] you do something like a RandomAccessFile read at offset skipBytes[i] (look up Java random access files; this pseudocode is of course only a sketch of the direction).
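The steps above can be sketched as follows (class and field names are illustrative; the constructor plays the role of the offline precomputation step):

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

class SortedFileDict {
    final long[] skip;              // byte offset of word i in the file
    final int[] len;                // byte length of word i (without newline)
    final RandomAccessFile file;

    // Step 1: write sorted words one per line, recording skip offsets.
    SortedFileDict(String path, String[] sortedWords) throws IOException {
        skip = new long[sortedWords.length];
        len = new int[sortedWords.length];
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            long pos = 0;
            for (int i = 0; i < sortedWords.length; i++) {
                byte[] b = sortedWords[i].getBytes("UTF-8");
                skip[i] = pos;
                len[i] = b.length;
                out.write(b);
                out.write('\n');
                pos += b.length + 1;
            }
        }
        file = new RandomAccessFile(path, "r");   // step 2's arrays stay in RAM
    }

    // Read word i straight from the file at its precomputed offset.
    String wordAt(int i) throws IOException {
        byte[] b = new byte[len[i]];
        file.seek(skip[i]);
        file.readFully(b);
        return new String(b, "UTF-8");
    }

    // Step 3: classic binary search, but array[mid] is a file read.
    boolean contains(String w) throws IOException {
        int lo = 0, hi = skip.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = wordAt(mid).compareTo(w);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }
}
```

In a real app the file and the skipBytes array would of course be built once ahead of time and shipped with the app; only the int array needs to be loaded at startup.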

Complexity is O(LEN * LOG(N)): the LOG(N) comes from the binary search, and comparing strings is linear in their length. LOG(500,000) ≈ 19, and LEN is the average word length, at worst maybe 50 (a generous upper bound), so a search is still very fast: roughly 1,000 operations, done in microseconds. The advantage is small memory usage.

I should mention that in the case of a web app, where many users perform searches, the LOG(N) factor becomes important, but if your app provides a service for only one person, LOG(500,000) doesn't change much, as long as it isn't performed inside a loop.