Is it possible to count the frequency of a word in a file precisely using two buffers in C?

Question

I have a 1 GB file and want to count how many times the word "sosowhat" occurs in it. My first version used fgetc(), which reads one character at a time and is far too slow for a file of that size. So I allocated a buffer of size 1000 (using malloc) to hold 1000 characters of the file at a time, and I used strstr() to count the occurrences of "sosowhat". The logic is fine, but there is a problem: if the "so" part of "sosowhat" falls at the end of one buffer and the "sowhat" part at the start of the next, the word is not counted. So I now use two buffers, old_buffer and current_buffer. At the start of each new buffer I also want to check the last few characters of the old buffer. Is this possible? How can I go back to the old buffer? Is it possible without memmove()? As a beginner, I would be very grateful for your help.

Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/217856/discussion-on-question-by-sharon-shelton-is-it-possible-to-count-the-frequency-o). — Samuel Liew, Jul 15 '20 at 03:51

score 0 · Answer 1 · answered Jul 14 '20 at 12:51

Yes, it can be done, and there are several possible approaches.

The first one, which is the cleanest, is to keep a second buffer, as you suggested, in which you store the last chunk of the old buffer. To hold that chunk as a string, it needs to be exactly the length of the searched word, because you store wordLen - 1 characters plus the null terminator. The quickest way is then to append the first wordLen - 1 characters of the new buffer to this stored chunk and search for your word there. For that, the buffer must be large enough to hold both chunks (the last bytes of the old buffer and the first bytes of the new one), i.e. 2 * (wordLen - 1) characters plus the terminator. Then continue with your search normally in the new buffer.
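As a rough illustration of this first approach, here is a minimal sketch. The names `count_word`, `count_in`, `tail` and `seam` are made up for the example and are not from the question. It assumes that overlapping occurrences should all be counted, and it uses a plain `memcmp` scan instead of `strstr`, so the buffers do not need a null terminator:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count occurrences (overlapping ones included) of word[0..w) in hay[0..n). */
static size_t count_in(const char *hay, size_t n, const char *word, size_t w)
{
    size_t count = 0;
    for (size_t i = 0; i + w <= n; i++)
        if (memcmp(hay + i, word, w) == 0)
            count++;
    return count;
}

/* Hypothetical helper: count `word` in `fp`, reading `chunk_size` bytes at a time. */
size_t count_word(FILE *fp, const char *word, size_t chunk_size)
{
    size_t w = strlen(word);
    assert(w > 0 && chunk_size > 0);

    size_t keep = w - 1;                  /* bytes carried across a boundary */
    char *chunk = malloc(chunk_size);
    char *tail = malloc(keep + 1);        /* +1 avoids a zero-size allocation */
    char *seam = malloc(2 * keep + 1);    /* old tail + head of the new chunk */
    size_t tail_len = 0, total = 0, n;

    if (chunk == NULL || tail == NULL || seam == NULL) {
        free(chunk); free(tail); free(seam);
        return 0;                         /* sketch only: no real error reporting */
    }

    while ((n = fread(chunk, 1, chunk_size, fp)) > 0) {
        size_t head = n < keep ? n : keep;
        size_t seam_len = tail_len + head;

        /* Matches that straddle the boundary: search tail + head of new chunk. */
        memcpy(seam, tail, tail_len);
        memcpy(seam + tail_len, chunk, head);
        total += count_in(seam, seam_len, word, w);

        /* Matches that lie entirely inside the new chunk. */
        total += count_in(chunk, n, word, w);

        /* Remember the last `keep` bytes seen so far for the next round. */
        if (n >= keep) {
            memcpy(tail, chunk + n - keep, keep);
            tail_len = keep;
        } else {
            /* Short chunk: here the seam holds tail + whole chunk. */
            size_t new_len = seam_len < keep ? seam_len : keep;
            memcpy(tail, seam + seam_len - new_len, new_len);
            tail_len = new_len;
        }
    }

    free(chunk); free(tail); free(seam);
    return total;
}
```

You would call it as `count_word(fp, "sosowhat", 1000)` on a file opened with `fopen(path, "rb")`. No match is counted twice, because the seam contains fewer than wordLen bytes from either side, so every match found in it must cross the boundary.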

Another approach (which I don't recommend, but which can turn out to be a bit easier to code) is to fseek wordLen - 1 bytes backwards in the file after each read. This "moves" the chunk stored in the previous approach into the next buffer. It is a bit dirtier, because you read some of the file's contents twice. Although the performance cost is not noticeable, I would still recommend something like the first approach instead.
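For comparison, the fseek variant could look roughly like this. Again a sketch only: `count_word_seek` is a made-up name, the stream is assumed to be opened in binary mode (relative seeks on text streams are not portable), and the chunk must be at least as long as the word so that every read still makes progress:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count `word` in `fp` by re-reading the last
   strlen(word) - 1 bytes of every full chunk. Overlapping matches count. */
size_t count_word_seek(FILE *fp, const char *word, size_t chunk_size)
{
    size_t w = strlen(word);
    assert(w > 0 && chunk_size >= w);   /* otherwise the loop would not advance */

    size_t keep = w - 1;
    char *chunk = malloc(chunk_size);
    size_t total = 0, n;

    if (chunk == NULL)
        return 0;                       /* sketch only: no real error reporting */

    while ((n = fread(chunk, 1, chunk_size, fp)) > 0) {
        for (size_t i = 0; i + w <= n; i++)
            if (memcmp(chunk + i, word, w) == 0)
                total++;

        if (n < chunk_size)
            break;                      /* short read: end of file (or error) */

        /* Step back so the next chunk starts with the end of this one. */
        if (keep > 0 && fseek(fp, -(long)keep, SEEK_CUR) != 0)
            break;
    }

    free(chunk);
    return total;
}
```

The re-read region is shorter than the word, so a match found in one chunk can never be found again in the next one.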

Vlad Rusu

So as far as I understand, if a buffer of size 1000 is allocated with malloc, 999 characters will come from the file and the 1000th position will hold the null terminator? – Sharon Shelton Jul 14 '20 at 13:11
@SharonShelton: If you store a null-terminated string in that buffer, then yes. The string length will be limited to 999 chars and the subsequent null terminator. If you use the function [`fgets`](https://en.cppreference.com/w/c/io/fgets), then it will always write a null terminator. However, if you use a function intended for binary data, for example [`fread`](https://en.cppreference.com/w/c/io/fread), then it will not write a null terminator and you can use all 1000 bytes. However, in this case, I recommend that you use null-terminated strings, so don't use `fread`. – Andreas Wenzel Jul 14 '20 at 13:21
@SharonShelton: If you use `fgetc` instead of `fgets`, then a null-terminator will not be written automatically, you must write it yourself or use some other method of keeping track of the length of the data. – Andreas Wenzel Jul 14 '20 at 13:25
@SharonShelton: Reading 1000 bytes using `fread` is the same as calling `fgetc` exactly 1000 times. It will not write a null terminator. Therefore, you will have to keep track of the length of the valid data in the buffer yourself if you use these functions, or you can write your own null terminator. – Andreas Wenzel Jul 14 '20 at 13:32
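
To illustrate the point made in these comments: if you read with `fread` but still want to use `strstr`, you can reserve one byte and write the terminator yourself. A small sketch with made-up helper names, assuming the file contains no embedded `'\0'` bytes:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read up to buf_size - 1 bytes and terminate them,
   so the buffer can be passed to string functions. Returns the bytes read. */
size_t read_chunk_as_string(FILE *fp, char *buf, size_t buf_size)
{
    assert(buf_size > 0);
    size_t n = fread(buf, 1, buf_size - 1, fp);
    buf[n] = '\0';                      /* fread never writes this itself */
    return n;
}

/* Count occurrences (overlapping ones included) of word in a string. */
size_t count_in_string(const char *s, const char *word)
{
    assert(word[0] != '\0');
    size_t count = 0;
    for (const char *p = strstr(s, word); p != NULL; p = strstr(p + 1, word))
        count++;
    return count;
}
```

With a 1000-byte buffer this reads at most 999 bytes per call, which matches the 999 + terminator layout discussed above.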

score 0 · Answer 2 · answered Jul 14 '20 at 13:05

Use the same algorithm as with fgetc, only read the characters from the buffers you created. It will be just as efficient, since strstr iterates through the string char by char as well.
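
The answer does not say which character-by-character algorithm to use. One possible sketch is below, with the assumption (mine, not the answer's) that the match state is kept with a Knuth-Morris-Pratt failure table. A plain "reset to zero on mismatch" counter would miss the word in input such as "sososowhat". The name `count_word_stream` is made up:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: feed the file through a KMP matcher one char at a
   time. The match state survives from one buffer to the next, so buffer
   boundaries need no special handling. Overlapping matches are counted. */
size_t count_word_stream(FILE *fp, const char *word, size_t chunk_size)
{
    size_t w = strlen(word);
    assert(w > 0 && chunk_size > 0);

    size_t *fail = malloc(w * sizeof *fail);
    char *chunk = malloc(chunk_size);
    if (fail == NULL || chunk == NULL) {
        free(fail); free(chunk);
        return 0;                       /* sketch only: no real error reporting */
    }

    /* fail[i] = length of the longest proper prefix of word[0..i]
       that is also a suffix of it. */
    fail[0] = 0;
    for (size_t i = 1, k = 0; i < w; i++) {
        while (k > 0 && word[i] != word[k])
            k = fail[k - 1];
        if (word[i] == word[k])
            k++;
        fail[i] = k;
    }

    size_t matched = 0, total = 0, n;
    while ((n = fread(chunk, 1, chunk_size, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            char c = chunk[i];
            while (matched > 0 && c != word[matched])
                matched = fail[matched - 1];
            if (c == word[matched])
                matched++;
            if (matched == w) {         /* full word seen */
                total++;
                matched = fail[w - 1];
            }
        }
    }

    free(fail); free(chunk);
    return total;
}
```

Here the only thing carried between buffers is the integer `matched`, so no second buffer and no memmove are needed.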

0___________

But how do you deal with the case when the searched word lies in 2 separate file reads? – Vlad Rusu Jul 14 '20 at 13:08
You avoid this problem by reading char by char. When you reach the end of one buffer, you start reading from the next one. – 0___________ Jul 14 '20 at 13:09
@P__J__ It is not possible to get characters from the buffer using fgetc(). fgetc() is able to read characters directly from the file only – Sharon Shelton Jul 14 '20 at 13:14
@SharonShelton Write a function getCharFromBuffer. BTW, it is possible to use fgetc as well, but you would need to add support for your buffers to the file system (how depends on the OS, hardware and implementation) – 0___________ Jul 14 '20 at 13:21
strstr probably uses an algorithm like Boyer-Moore that _doesn't_ have to look at every character – zwol Jul 14 '20 at 20:52
