5

I have a 50 GB text file of random strings, and I want to count the number of occurrences of a substring in that file, many times, for different substrings that are not known in advance.

I was wondering if there is another way to approach the problem.

A probabilistic way

Something like a Bloom filter, but with probabilistic counting instead of a probabilistic membership check. That data structure would be used for count estimation.

Other statistical method(?)

Is there any rough method that I could use to estimate the number of occurrences of a string in a text file? I am open to alternatives.

It would be nice if each query could be answered in at most logarithmic time, as I will be performing this task many times.

RetroCode
  • Why do you think you can't use a counter? You don't need to specify the keys ahead of time. Even if you don't want to process the whole file, you could use a counter to sample some part of it. – jonrsharpe Nov 11 '16 at 20:11
  • @jonrsharpe You are right about that, but I forgot to add that I do not have 50 GB of RAM. – RetroCode Nov 11 '16 at 20:13
  • A counter won't take up 50 GB, and you don't need to hold the entire file in memory at once. You can read a bit at a time. It's perfectly possible to count every character. – Carcigenicate Nov 11 '16 at 20:15
  • Why do you think that you will need 50 GB of RAM? The size of the file does not matter at all; it's the number of different words that counts, and there probably aren't more than a few thousand of those, particularly if you apply stemming first. – tobias_k Nov 11 '16 at 20:18
  • @tobias_k Words, sure. But what about combinations of characters? – RetroCode Nov 11 '16 at 20:19
  • @RetroCode How is a counter impractical? You have to give detail in your question to avoid this exact scenario of us giving you decent answers and you shutting them down. – MooingRawr Nov 11 '16 at 20:20
  • Well, I assumed "words" because you used "text file", but if it's instead 50 GB of, e.g., continuous genome sequences, then you should say so in the question. – tobias_k Nov 11 '16 at 20:21
  • Even if it is continuous data, you can still read it in chunks, or lazily. – Carcigenicate Nov 11 '16 at 20:35
  • Are these passwords, newspaper articles, Wikipedia? Just wanted to know. – john mangual Nov 11 '16 at 20:42
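As the comments suggest, a plain collections.Counter over chunked reads works without holding the whole file in memory. Here is a minimal sketch, assuming the substrings of interest have a fixed length k; the file name and chunk size below are illustrative, not from the thread:

```python
from collections import Counter

def count_kgrams(path, k, chunk_size=1 << 20):
    """Count every overlapping substring of length k without loading the whole file."""
    counts = Counter()
    tail = ""  # carry the last k-1 characters so k-grams spanning chunk boundaries are counted
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = tail + chunk
            counts.update(data[i:i + k] for i in range(len(data) - k + 1))
            tail = data[-(k - 1):] if k > 1 else ""
    return counts

# Usage (hypothetical file name):
# counts = count_kgrams("big.txt", k=4)
# print(counts["abcd"])
```

For truly random data the number of distinct k-grams grows quickly with k, which is where the sketch-based suggestions in the answers below come in.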

2 Answers

1

Some streaming algorithms seem relevant to this problem, either on their own or in combination with each other.

  1. An initial pass over the file could give an approximation of the heavy hitters. Depending on your problem, it's possible that the heavy hitters' distribution is sufficient for you, and that this set is small enough to hold in memory. If so, you could perform a second pass, counting only the heavy hitters found in the first pass.

  2. The count-min sketch data structure can perform approximate counting. You could either use this data structure on its own, or you could use it for counting the occurrences of the heavy hitters.

Since this is tagged as Python: both of these are simple enough to implement in a few dozen lines of pure Python.
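For the second point, a count-min sketch might look like the following. This is an illustrative sketch, not a specific library: it counts fixed-length k-grams streamed from the file (so you would need one sketch per substring length you care about), and the width, depth, and hashing scheme are arbitrary choices.

```python
import hashlib

class CountMinSketch:
    """Approximate counter: estimates never undercount, and overcount by a bounded amount."""

    def __init__(self, width=2 ** 20, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # Derive `depth` hash values for the item from one blake2b digest.
        digest = hashlib.blake2b(item.encode(), digest_size=8 * self.depth).digest()
        for row in range(self.depth):
            chunk = digest[8 * row: 8 * (row + 1)]
            yield int.from_bytes(chunk, "big") % self.width

    def add(self, item):
        for row, idx in enumerate(self._indexes(item)):
            self.tables[row][idx] += 1

    def estimate(self, item):
        # Taking the minimum over rows limits the effect of hash collisions.
        return min(self.tables[row][idx] for row, idx in enumerate(self._indexes(item)))


def build_sketch(path, k, chunk_size=1 << 20):
    """Stream the file and add every overlapping k-gram to the sketch."""
    sketch = CountMinSketch()
    tail = ""  # carry the last k-1 characters across chunk boundaries
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = tail + chunk
            for i in range(len(data) - k + 1):
                sketch.add(data[i:i + k])
            tail = data[-(k - 1):] if k > 1 else ""
    return sketch

# Usage (hypothetical file name):
# sketch = build_sketch("big.txt", k=8)
# print(sketch.estimate("abcdefgh"))
```

Widening the tables reduces the error of the estimates, and the same streaming pass could also feed a heavy-hitters pass as in point 1.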

Ami Tavory
1

You could compute a suffix array for your file.

This array contains the starting positions of the suffixes in sorted order. With 50 GB of text, you could allocate 5 bytes per position and end up with a suffix array of 5 × 50 = 250 GB. If this is too much, you could try a compressed suffix array.

Computing this array can be done in O(n) (probably a few hours with an appropriate algorithm, mostly limited by disk read/write speed).

Once you have the array, you can count the number of occurrences of any substring in logarithmic time. In practice the time would be dominated by seek times to different parts of your disk, so this part will be much faster if you store the file on a solid-state drive.
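A minimal sketch of the counting step, assuming the text and suffix array fit in memory. The naive construction below is only for illustration (for 50 GB you would use a linear-time construction as described above), and the binary-search counting uses the key argument of bisect, which requires Python 3.10+:

```python
import bisect

def build_suffix_array(text):
    """Naive construction that sorts full suffixes; fine for small examples only."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    """Count occurrences of `pattern` with two binary searches over the suffix array."""
    m = len(pattern)

    def key(i):
        # Compare only the first m characters of each suffix against the pattern.
        return text[i:i + m]

    lo = bisect.bisect_left(sa, pattern, key=key)
    hi = bisect.bisect_right(sa, pattern, key=key)
    return hi - lo

# Usage:
text = "banana"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "ana"))  # -> 2
```

Each query inspects O(log n) suffixes (each comparison costing up to len(pattern) characters), which matches the logarithmic bound asked for in the question.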

Peter de Rivaz