0

For example, I have an index text file that has 400+ English words, and then I have another text file with decrypted text on each line.

I want to check each English word in my index file against each line of my decrypted text file (so checking 400+ English words for a match per line of decrypted text).

I was thinking of using strncmp(decryptedString, indexString, 10) because I know that strncmp terminates if the next character is NULL.

Each line of my decrypted text file is 352 characters long, and there are ~40 million lines of text stored in there (each line comes from a different output).

This is to decrypt a Playfair cipher; I know that my decryption algorithm works because my professor gave us an example to test our program against and it worked fine.

I've been working on this project for six days straight and this is the only part I've been stuck on. I simply can't get it to work. I've tried using

while(getline(&line, &len, decryptedFile) != -1){
    while(getline(&line2, &len2, indexFile) != -1){
        if(strncmp(decryptedString, indexString, 10) == 0){
            fprintf(potentialKey, "%s", key); 
        }
    }
}

But I never get any matches. I've tried storing each string in arrays and testing them one character at a time, and that didn't work for me either, since it would list all the English words on one line. I'm simply lost, so any help or pointers in the right direction would be much appreciated. Thank you in advance.

EDIT: Based on advice from Clifford in the comments, here's an example of what I'm trying to do

Let's say indexFile contains:

HELLO
WORLD
PROGRAMMING
ENGLISH

And the decryptedFile contains

HEVWIABAKABWHWHVWC
HELLOHEGWVAHSBAKAP
DHVSHSBAJANAVSJSBF
WORLDHEEHHESBVWJWU
PROGRAMMINGENGLISH

I'm trying to compare each word from indexFile to decryptedFile, one line at a time. So all four words from indexFile will be compared to line 1, then line 2, line 3, line 4, and line 5 in turn.

G_3
  • Did you consider using memcmp(const void *str1, const void *str2, size_t n)? I don't understand why you are only comparing 10 bytes – H.cohen Oct 20 '18 at 19:54
  • Consider whether `strstr()` can help. Also explain whether you need to find `din` in `ordinary` or not. – Jonathan Leffler Oct 20 '18 at 20:04
  • I'm only comparing 10 bytes because the decrypted text consists of only English words, so if the first 10 bytes aren't a match, then I know that's the wrong decrypted text and I'll move onto the next line. – G_3 Oct 20 '18 at 20:12
  • The problem with strstr() in my case is that it's a match every single time because there's a word from the index found in each line of decrypted text, so it doesn't make my list any smaller – G_3 Oct 20 '18 at 20:14
  • EDIT: Jonathan, your comment gave me an idea and I would like to hear your opinion. If I deleted all words <4 letters in the index file, then the odds of strstr() getting a match would be relatively low, meaning it would be more effective in getting matches, right? – G_3 Oct 20 '18 at 20:23
  • Within the loop, you have `strncmp(decryptedString, indexString, 10)`, but neither `decryptedString` nor `indexString` is modified in the loop; meanwhile `line` and `line2` are ignored. Currently your code does nothing useful. – Clifford Oct 20 '18 at 20:27
  • Your justification for comparing only 10 bytes makes little sense - there are many English words of more than ten characters which are also not unique in the first ten characters - `friendship` and `friendships` for example. Both `strcmp` and `strncmp` will terminate at the first character mismatch, so there is no performance benefit in using some arbitrarily short compare length. – Clifford Oct 20 '18 at 20:36
  • From my understanding, line and line2 would iterate through each line; I didn't realize that it wasn't doing anything useful. Do you have any advice in this case? Also, my index file has English words given to me by my professor, so I'm not comparing every word in the English language. However, you are right about the termination: it was definitely terminating every time because not every word is a fixed size. – G_3 Oct 20 '18 at 20:40
  • Between your title and the body text it is entirely unclear what you are trying to do. In one you refer to comparing _lines_ of text, and in the other you talk about comparing individual words. An easy way to add clarity is to give an example of input and expected output. – Clifford Oct 20 '18 at 20:40
  • After `while(getline(&line, &len, decryptedFile) != -1){` and _before_ `while(getline(&line2, &len2, indexFile) != -1){` you need to add `rewind(indexFile);` Otherwise, you'll only get a match on the _first_ line of `decryptedFile` – Craig Estey Oct 20 '18 at 22:37

2 Answers

1

If what you are trying to do is check to see if an input line starts with a word, you should use:

strncmp(line, word, strlen(word));

If you know that line is longer than word, you can use

memcmp(line, word, strlen(word));

If you are doing that repeatedly with the same word(s), you'd be better off saving the length of the word in the same data structure as the word itself, to avoid recomputing it each time.
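A minimal sketch of that idea, assuming a small helper type: `struct index_word`, `make_index_word` and `starts_with` are illustrative names, not anything from the question. The length is computed once when the entry is built and reused on every prefix check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: an index word stored together with its length. */
struct index_word {
    const char *text;
    size_t len;   /* strlen(text), computed once when the index is loaded */
};

/* Build an entry; this is the only place strlen() is called for the word. */
struct index_word make_index_word(const char *text)
{
    assert(text != NULL);
    struct index_word w = { text, strlen(text) };
    return w;
}

/* Return true if line begins with the index word. */
bool starts_with(const char *line, const struct index_word *w)
{
    return strncmp(line, w->text, w->len) == 0;
}
```

With the sample data from the question, `starts_with("HELLOHEGWVAHSBAKAP", &w)` is true for the entry built from `"HELLO"`, and false for `"HEVWIABAKABWHWHVWC"`.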

This is a common use case for strncmp. Note that your description of strncmp is slightly inaccurate. It will stop when it hits a NUL in either argument, but it only returns equal if both arguments have a NUL in the same place or if the count is exhausted without encountering a difference.

strncmp is safer than depending on the fact that line is longer than word, given that the speed difference between memcmp and strncmp is very small.
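For reference, here is a rough sketch of how the nested loop from the question might look with the prefix check applied to the buffers that `getline()` actually fills, and with the `rewind()` suggested in the comments. It assumes POSIX `getline()`; the function name `report_prefix_matches` and the choice to write the matching line (rather than the key, which the question does not show) are my own assumptions. Note that `getline()` keeps the trailing newline, so it has to be stripped from each index word or the comparison can never succeed.

```c
#define _POSIX_C_SOURCE 200809L  /* for getline() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Strip a trailing "\n" or "\r\n" in place and return the new length. */
size_t chomp(char *s, ssize_t n)
{
    size_t len = (size_t)n;
    while (len > 0 && (s[len - 1] == '\n' || s[len - 1] == '\r'))
        s[--len] = '\0';
    return len;
}

/* Write every decrypted line that starts with some index word to out.
 * Returns the number of lines written. */
long report_prefix_matches(FILE *decryptedFile, FILE *indexFile, FILE *out)
{
    char *line = NULL, *word = NULL;
    size_t cap = 0, wcap = 0;
    ssize_t n, wn;
    long matches = 0;

    assert(decryptedFile != NULL && indexFile != NULL && out != NULL);
    while ((n = getline(&line, &cap, decryptedFile)) != -1) {
        chomp(line, n);
        rewind(indexFile);  /* restart the word list for every line */
        while ((wn = getline(&word, &wcap, indexFile)) != -1) {
            size_t wlen = chomp(word, wn);
            if (wlen > 0 && strncmp(line, word, wlen) == 0) {
                fprintf(out, "%s\n", line);
                matches++;
                break;  /* one matching word is enough for this line */
            }
        }
    }
    free(line);
    free(word);
    return matches;
}
```

Re-reading the index file for each of ~40 million lines is slow; loading the 400+ words into memory once would avoid that, but the sketch stays close to the original loop on purpose.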

However, with that much data and that many words to check, you should try something which reduces the number of comparisons you need to do. You could put the words into a trie, for example. Or, if that seems like too much work, you could at least categorize them by their first letter and only check the ones whose first letter matches the first letter of the line, if there are any.
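The first-letter idea could be sketched roughly like this. It is not a trie, just 26 buckets; it assumes the words and lines are uppercase A-Z in an ASCII-compatible character set (as in the question's sample), and the fixed cap `MAX_PER_LETTER` and all the names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PER_LETTER 64   /* assumed cap on words per first letter */

struct bucket {
    const char *words[MAX_PER_LETTER];
    size_t lens[MAX_PER_LETTER];   /* cached strlen() of each word */
    size_t count;
};

/* One bucket per first letter, 'A' through 'Z'. */
struct first_letter_index {
    struct bucket by_letter[26];
};

/* Add an uppercase word; returns false if it does not start with A-Z
 * or its bucket is already full. The pointer is stored, not copied. */
bool index_add(struct first_letter_index *idx, const char *word)
{
    assert(idx != NULL && word != NULL);
    if (word[0] < 'A' || word[0] > 'Z')
        return false;
    struct bucket *b = &idx->by_letter[word[0] - 'A'];
    if (b->count == MAX_PER_LETTER)
        return false;
    b->words[b->count] = word;
    b->lens[b->count] = strlen(word);
    b->count++;
    return true;
}

/* Return the first stored word that line starts with, or NULL.
 * Only words sharing the line's first letter are compared. */
const char *index_match_prefix(const struct first_letter_index *idx,
                               const char *line)
{
    assert(idx != NULL && line != NULL);
    if (line[0] < 'A' || line[0] > 'Z')
        return NULL;
    const struct bucket *b = &idx->by_letter[line[0] - 'A'];
    for (size_t i = 0; i < b->count; i++)
        if (strncmp(line, b->words[i], b->lens[i]) == 0)
            return b->words[i];
    return NULL;
}
```

A zero-initialized `struct first_letter_index` is an empty index, so it can be declared with `= {0}` and filled while reading the index file once.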

If you are looking for an instance of the word(s) anywhere in the line, then you'll need a more sophisticated search strategy. There are lots of algorithms for this problem; Aho-Corasick is effective and simple, although there are faster ones.
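As a baseline before reaching for Aho-Corasick, a naive scan with `strstr()` is easy to write. To be clear, this is not Aho-Corasick, just one `strstr()` call per word. The `min_len` parameter is an assumption added to reflect the idea from the comments of ignoring very short words, and `count_words_in_line` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count how many of the index words (of at least min_len characters)
 * occur anywhere in line. Each word is counted at most once. */
size_t count_words_in_line(const char *line, const char *const *words,
                           size_t nwords, size_t min_len)
{
    assert(line != NULL && words != NULL);
    size_t hits = 0;
    for (size_t i = 0; i < nwords; i++)
        if (strlen(words[i]) >= min_len && strstr(line, words[i]) != NULL)
            hits++;
    return hits;
}
```

A line could then be kept as a candidate only if the count exceeds some threshold. Where to put that threshold is a judgment call that depends on the word list.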

rici
  • You're absolutely right; line is definitely longer than word so I'll give strlen(word) for size_t a try. It's something so simple, yet it completely escaped my mind. Thank you, this has definitely pointed me in the right direction! – G_3 Oct 20 '18 at 21:36
0

If a line of decrypted text is 352 characters long and each word in the index is not 352 characters long, then a line of decrypted text will never match any word in the index.

From this I think you've misunderstood the requirements and asked a question based on the misunderstanding.

Specifically, I suspect that you want to compare each individual word in the decrypted line (and not the whole line) with each word in your index, to determine if all words in the decrypted line are acceptable. To do that, the first step would be to break the decrypted line of characters into individual words - e.g. maybe finding the characters that separate words (spaces, tabs, commas?) within the decrypted text and replacing them with a zero terminator (so that you can use strcmp() and don't need to worry about "foobar" incorrectly matching "foo" just because the first letters match).
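A minimal sketch of that approach, assuming the decrypted text really does contain separators (the sample in the question does not, so this only applies if yours does). `strtok()` does the "replace the separator with a zero terminator" step; the separator set and the function names are guesses for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if word is exactly equal to one of the index words. */
bool in_index(const char *word, const char *const *words, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(word, words[i]) == 0)
            return true;
    return false;
}

/* Split line in place on spaces, tabs, commas and newlines, and return
 * true only if it has at least one word and every word is in the index.
 * The line is modified: strtok() overwrites separators with '\0'. */
bool all_words_known(char *line, const char *const *words, size_t n)
{
    static const char seps[] = " \t,\n";
    bool saw_word = false;

    assert(line != NULL && words != NULL);
    for (char *w = strtok(line, seps); w != NULL; w = strtok(NULL, seps)) {
        if (!in_index(w, words, n))
            return false;
        saw_word = true;
    }
    return saw_word;
}
```

Because the comparison is a full `strcmp()`, "WORLDS" does not match "WORLD", which is the "foobar" versus "foo" case mentioned above.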

Note that there are probably potential optimisations. E.g. if you know that a word from the decrypted text is 8 characters (which you would've had to have known to place the zero terminator in the right spot) and if your index is split into "one list for each word length" (e.g. a list of index words with 3 characters, a list of index words with 4 characters, etc) then you might be able to skip a lot of string comparisons (and only compare the word from the decrypted line with words that have the same length in the index). In this case (where you know both words have the same length already) you can also avoid modifying the original 352 characters (you won't need to insert the zero terminator after each word).
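Roughly, the non-modifying variant could look like the sketch below. For brevity it filters on a cached length inside one list rather than keeping a separate list per length, so it shows the comparison logic but not the full speed-up. The separator set and all names are again assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical index entry: a word and its cached length. */
struct index_word {
    const char *text;
    size_t len;
};

/* Return true if the len bytes at start equal some index word of the
 * same length. memcmp() is only reached when the lengths agree. */
bool span_in_index(const char *start, size_t len,
                   const struct index_word *words, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (words[i].len == len && memcmp(start, words[i].text, len) == 0)
            return true;
    return false;
}

/* Walk the line without modifying it: find each run of non-separator
 * characters and look it up by (pointer, length). Returns true only if
 * there is at least one word and every word is in the index. */
bool all_spans_known(const char *line, const struct index_word *words,
                     size_t n)
{
    static const char seps[] = " \t,\n";
    bool saw_word = false;

    assert(line != NULL && words != NULL);
    const char *p = line + strspn(line, seps);   /* skip leading separators */
    while (*p != '\0') {
        size_t len = strcspn(p, seps);           /* length of this word */
        if (!span_in_index(p, len, words, n))
            return false;
        saw_word = true;
        p += len;
        p += strspn(p, seps);                    /* skip to the next word */
    }
    return saw_word;
}
```

Since the line is taken as `const char *`, the same buffer can be reused for other checks afterwards.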

Brendan