
I have a txt file with 100,000,000 words, one word per line.

I want to write a function that takes a word as input and checks whether that word is present in the txt file.

I have tried this with a map and with a trie, but I'm getting a std::bad_alloc error. This is due to the large number of words. Can anyone suggest how to solve the issue?

  • Is the file sorted? If so you could probably just do a binary search of the file without needing to read it all into memory – Alan Birtles May 07 '22 at 14:46
  • What's the problem with a simple sequential search? Load the file chunk-wise and search those chunks for the word. At least the way the problem is described right now, there's absolutely no need to keep the entire file in memory –  May 07 '22 at 14:47
  • Memory map it? Perhaps using the window-based approach mentioned by @Paul? But be ready to handle partial words. – Some programmer dude May 07 '22 at 14:48
  • Or, if each word is stored on a separate line, read a couple of thousand lines, check those, and if not found, continue by reading the next few thousand lines. There are many different ways to handle it without reading the whole file into memory. – Some programmer dude May 07 '22 at 14:50
  • Thank you all for your suggestions; sequential will work. I want to load the data into memory and search for words multiple times so that the search time will be less. Any code reference for chunk-wise search, @Paul? I want to use a trie chunk-wise but I'm confused about how to free the trie's data for every chunk – Crazy Thoughts May 07 '22 at 15:01
  • Sort the text file. Do a binary search of the text file for the word. – Eljay May 07 '22 at 15:05
  • @CrazyThoughts why do you even want to use a trie for this? It'll just add complexity without any benefit. –  May 07 '22 at 15:34
  • @Paul Tries benefit from logarithmic-time search (assuming words are uniformly distributed) while being more compact in memory than a binary search tree, assuming they are efficiently implemented. This assumes multiple words are searched. Otherwise any data structure is worthless and a basic sequential search of the file is mandatory, unless the file is structured (e.g. sorted). – Jérôme Richard May 07 '22 at 15:48
  • @JérômeRichard I know what a trie is and how it works. But the question asks for searching for a single word, not multiple. –  May 07 '22 at 15:55
  • @Paul It looks like the OP wants to search for multiple words, as they said "*I want to load the data into memory and search the words multiple times so that the search time will be less*" in the above comments, though it is not very clear (at least not in the question)... – Jérôme Richard May 07 '22 at 16:05

2 Answers


Data structures are quite important when programming. If possible, I would recommend using something like a binary tree, though this would require sorting the text file. If you cannot sort the text file, the best approach is to simply iterate over it until you find the word you are looking for. Also, your question should contain more information to allow us to diagnose your problem more easily.
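
For the unsorted case, a minimal sketch of that line-by-line scan (assuming one word per line, as stated in the question; `find_word` is just an illustrative name):

```cpp
#include <fstream>
#include <string>

// Scan the file line by line; only one line is held in memory at a time.
bool find_word(const std::string& path, const std::string& word) {
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line))
        if (line == word) return true;
    return false;
}
```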

Olivia Smith
  • A binary search tree is what they were referring to when they mentioned trying `map`, which is an associative container that is part of the C++ standard library. There is no need to index the file or its contents if only searching for a single word. The problem seems to be they run out of memory trying to do so, but without any code or information about their platform, you are correct that a proper diagnosis is not possible. On that note, this answer should really be a comment (although you are just echoing things that are already in the comments). – paddy May 07 '22 at 15:58
  • Thanks for the feedback. I will take that into mind next time I answer a question. – Olivia Smith May 07 '22 at 16:23

I assume you want to search this word list over and over, because for a small number of searches you would just search linearly through the file.

Parsing the word list into a suffix tree takes about 20 times the size of the file, more if not optimized. Since you ran out of memory constructing a trie of the word list, I assume it's really big. So let's not keep it in memory, but instead process it a bit so you can search faster.

The solution I would propose is to do a dictionary search.

So first turn every whitespace character into a newline so you have one word per line instead of multiple lines with multiple words, then sort the file and store it. While you are at it you can remove duplicates. That is our dictionary. Remember the length of the longest word (L) while you do that.
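
A minimal sketch of that normalization pass, assuming plain single-byte text (the name `normalize` is made up here; the sorting and deduplication themselves are best left to an external-memory sort such as the system `sort -u`, since the file is too big for RAM):

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>

// Collapse every run of whitespace into a single newline so the output
// has exactly one word per line, and return the longest word length (L).
std::size_t normalize(const char* in_path, const char* out_path) {
    std::ifstream in(in_path, std::ios::binary);
    std::ofstream out(out_path, std::ios::binary);
    std::size_t longest = 0, current = 0;
    char c;
    while (in.get(c)) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            if (current > 0) {               // a word just ended
                out.put('\n');
                if (current > longest) longest = current;
                current = 0;
            }
        } else {
            out.put(c);
            ++current;
        }
    }
    if (current > 0) {                       // file didn't end in whitespace
        out.put('\n');
        if (current > longest) longest = current;
    }
    return longest;                          // this is L
}
```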

To access the dictionary you need a helper function that reads the word at offset X, which may point into the middle of a word. The function should seek to X - L (clamped at 0) and read 2 * L bytes into a buffer. Then, from the middle of the buffer, search backward and forward for the enclosing newlines to find the word at offset X.
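
A sketch of such a helper, assuming the dictionary file uses `'\n'` as the separator and X lies within the file (the name `word_at` is illustrative):

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Read the word covering byte offset x. We read up to L bytes before
// and after x, then scan outward from x to the enclosing newlines.
std::string word_at(std::ifstream& dict, std::size_t x, std::size_t L) {
    std::size_t start = x > L ? x - L : 0;
    dict.clear();                             // reset a possible EOF state
    dict.seekg(start);
    std::string buf(2 * L, '\0');
    dict.read(&buf[0], buf.size());
    buf.resize(dict.gcount());
    std::size_t mid = x - start;              // where offset x sits in buf
    if (mid >= buf.size()) mid = buf.size() - 1;
    while (mid > 0 && buf[mid] == '\n') --mid;     // step off a separator
    std::size_t b = mid;
    while (b > 0 && buf[b - 1] != '\n') --b;       // backward to word start
    std::size_t e = mid;
    while (e < buf.size() && buf[e] != '\n') ++e;  // forward to word end
    return buf.substr(b, e - b);
}
```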

Now to search, you open the dictionary and read the words at offset left = 0 and offset right = size_of_file, i.e. the first and the last word. If your search term is less than the first word or greater than the last word, you are done: the word is not found. If either of them equals the search term, you are also done.

Next, in a binary search, you would take the std::midpoint of left and right, read the word at that offset, check whether the search term is less or greater, and recurse into the corresponding interval. This requires O(log n) reads to find the word or determine it's not present.
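
Written iteratively, that binary search could look like this, building on the `word_at` helper above (illustrative; `std::filesystem::file_size` needs C++17):

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Binary search directly on the sorted file: O(log n) calls to word_at.
// Works because mapping a byte offset to the word covering it is monotone.
bool contains(const std::string& path, const std::string& term, std::size_t L) {
    std::ifstream dict(path, std::ios::binary);
    std::size_t left = 0;
    std::size_t right = std::filesystem::file_size(path) - 1;
    while (left <= right) {
        std::size_t mid = left + (right - left) / 2;
        std::string word = word_at(dict, mid, L);
        if (word == term) return true;
        if (word < term) {
            left = mid + 1;
        } else {
            if (mid == 0) break;              // avoid size_t underflow
            right = mid - 1;
        }
    }
    return false;
}
```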

A dictionary search can do better. Instead of using the midpoint, you can approximate where the word should be in the dictionary. Say your dictionary goes from "Aal" to "Zoo" and you are searching for "Zebra". Would you open the dictionary in the middle? No, you would open it near the end, because "Zebra" is much closer to "Zoo" than to "Aal". So you need a function that gives you a value M between 0 and 1 indicating where a search term lies relative to the left and right words. Your "midpoint" for the search is then left + (right - left) * M. Then, like with binary search, determine whether the search term is in the left or right interval and recurse.
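
One simple way to compute such an M is to treat the first few characters of each word as a base-256 number (a rough sketch; the names `key` and `fraction` are made up, and a real implementation might account for the actual alphabet):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Interpret the first 8 characters of a word as a base-256 integer.
static double key(const std::string& w) {
    double v = 0.0;
    for (std::size_t i = 0; i < 8; ++i)
        v = v * 256.0 + (i < w.size() ? static_cast<unsigned char>(w[i]) : 0);
    return v;
}

// Estimate where `term` lies between `left` and `right` as M in [0, 1].
double fraction(const std::string& left, const std::string& term,
                const std::string& right) {
    double lo = key(left), hi = key(right), t = key(term);
    if (hi <= lo) return 0.5;                 // degenerate interval
    return std::clamp((t - lo) / (hi - lo), 0.01, 0.99);
}
```

The next probe offset is then left + (right - left) * M; clamping M away from 0 and 1 ensures the interval still shrinks at every step.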

A dictionary search takes only O(log log n) reads on average if the word list has a reasonably uniform distribution.

Goswin von Brederlow