3

If I have:

const char *mystr = "cheesecakes";
FILE *myfile = fopen("path/to/file.exe","r");

I need to write a function to determine whether myfile contains any occurrences of mystr. Could anyone help me? Thanks!

UPDATE: So it turns out the platform I need to deploy to doesn't have memstr. Does anyone know of a free implementation I can use in my code?

igul222
  • 8,557
  • 14
  • 52
  • 60
  • 1
    can you load the whole file into memory? – Chris H Feb 03 '10 at 00:44
  • It's not too large (<2MB), but I don't know how (I have about 30 minutes C experience so far) – igul222 Feb 03 '10 at 00:55
  • I've put an implementation of `memmem` in my answer - and `memstr(mem, memlen, str)` is just `memmem(mem, memlen, str, strlen(str) + 1)`. – caf Feb 03 '10 at 01:29

6 Answers6

15

If you can't fit the whole file into memory, and you have access to the GNU memmem() extension, then:

  • Read as much as you can into a buffer;
  • Search the buffer with memmem(buffer, len, mystr, strlen(mystr) + 1);
  • Discard all but the last strlen(mystr) characters of the buffer, and move those to the start;
  • Repeat until end of file reached.

If you don't have memmem, then you can implement it in plain C using memchr and memcmp, like so:

/*
 * The memmem() function finds the start of the first occurrence of the
 * substring 'needle' of length 'nlen' in the memory area 'haystack' of
 * length 'hlen'.
 *
 * The return value is a pointer to the beginning of the sub-string, or
 * NULL if the substring is not found.
 */
void *memmem(const void *haystack, size_t hlen, const void *needle, size_t nlen)
{
    int needle_first;
    const void *p = haystack;
    size_t plen = hlen;

    if (!nlen)
        return NULL;

    needle_first = *(unsigned char *)needle;

    while (plen >= nlen && (p = memchr(p, needle_first, plen - nlen + 1)))
    {
        if (!memcmp(p, needle, nlen))
            return (void *)p;

        p++;
        plen = hlen - (p - haystack);
    }

    return NULL;
}
caf
  • 233,326
  • 40
  • 323
  • 462
  • Thanks; I ended up loading the whole thing into memory, then using your memmem to find the needle. – igul222 Feb 03 '10 at 03:03
  • this finds only the FIRST occurrence.. what about the others? – Zibri Mar 16 '18 at 15:11
  • @Zibri: You call it in a loop. The return value, if non-NULL, is a pointer to the found substring within the search buffer, so you can use that to work out where to start searching for the next occurrence (depending on whether you want to count occurrences that overlap with previously-found occurrences or not). – caf Mar 17 '18 at 12:28
  • @cat yep, thanks I figured that out already. by the way (off topic) do you have any idea why using fwrite on an mmapped input file (to write it to disk) fails unless you do a malloc buf, memcpy and write the file in blocks? – Zibri Mar 20 '18 at 06:48
  • @Zibri: That should work just fine. I guess post a new question. – caf Mar 20 '18 at 13:41
3

Because there is no memmem or memstr to find a string in a binary array (others suggested to read it into memory and use strstr - no this doesn't work) you have to read it byte by byte with "fgetch" and write a small state machine to match it while reading.

Lothar
  • 12,537
  • 6
  • 72
  • 121
1

See here:

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

for a Boyer-Moore implementation in C99. This is a very common string searching algorithm and runs in O(n).

Demi
  • 6,147
  • 7
  • 36
  • 38
0

Here is a slapped together version of it. It has no error checking and probably has overflow bugs. But I think it finds the desired string and accounts for the backtracking necessary for partial substring matches. I doubt there are more than 15 bugs left.

Edit: There was at least one in the first answer. I woke up in the middle of the night and realized the backtracking check was wrong. It didn't find '12123' in '1212123'. It might still be wrong, but at least it finds that one now.

int main( int argc, char* argv[] )
{
    FILE *fp;
    char *find, *hist;
    int  len, pos=0, hl=0, i;
    char c;

    fp = fopen( argv[1], "r" );
    find = argv[2];
    len = (int)strlen( find );
    hist = malloc( len );
    memset( hist, 0, len );
    while ( !feof( fp )) {
        c = fgetc( fp );
        if ( find[pos++] == c ) {
            if ( pos == len ) {
                printf( "Found it\n" );
                return 1;
            }
        }
        else {
            // check history buffer (kludge for backtracking)
            if ( pos > 0 ) {
                pos = 0;
                for ( i = 0; i < len - 1; i++ )
                    if ( 0 == memcmp( hist+len-i-1, find, i + 1 )) {
                    // we had a mismatch, but the history matches up to len i
                    pos = i;
                }
            }
        }
        // update history buffer - this is innefficient - better as circular buffer
        memmove( hist, hist + 1, len - 1 );
        hist[len-1] = c;
    }
    printf( "Not found\n" );
}
Mark Wilkins
  • 40,729
  • 5
  • 57
  • 110
0

Here's a function that will search for a string in a buffer.

Limitations: it doesn't handle wide characters (in the case of internationalization). You'll have to write your own code to read the file into memory. It won't find the pattern if the pattern is split between 2 read buffers.

/*****************************************************
const char *buffer    pointer to your read buffer (the larger, the better).
size_t bufsize        the size of your buffer
const char *pattern   pattern you are looking for.

Returns an index into the buffer if pattern is found.
-1 if pattern is not found.

 Sample:
    pos = findPattern (buffer, BUF_SIZE, "cheesecakes");
*****************************************************/

int findPattern (const char *buffer, size_t bufSize, const char *pattern)
{
    int i,j;
    int patternLen;

    // minor optimization. Determine patternLen so we don't 
    // bother searching buffer if fewer than patternLen bytes remain.
    patternLen = strlen (pattern);

    for (i=0; i<bufSize-patternLen; ++i)
    {
        for (j=0; j<patternLen; ++j)
        {
            if (buffer[i+j] != pattern[j])
            {
                break;
            }
        }
        if (j == patternLen)
        {
            return i;
        }
    }
    return -1;
}
Mike Yam
  • 49
  • 1
  • 2
-1
 chat match = "findthis";
 int depth = 0;
 while(not eof)
 {
     char ch = getonebyte();
     if(ch == match[depth])
     {  
         if (depth == strlen(match))
            break;
         else
            depth++;
      }
      else 
           depth = 0;
 }

roughly (I am sure there are off by ones in there)

pm100
  • 48,078
  • 23
  • 82
  • 145
  • 1
    Good try. Unfortunately this has an edge case where it will fail when you have a search string like "112" and in the file you have "1112" – R Samuel Klatchko Feb 03 '10 at 01:03
  • i dont think so, i start at the beginning of the match string when i hit a mismatch – pm100 Feb 03 '10 at 01:19
  • pm100, yes you reset your needle index, but you've already lost the part of the haystack you need. – Matthew Flaschen Feb 03 '10 at 03:08
  • you think wrong. Yes, you reset depth and start at the beginning of the match string, but you don't seek back in the data stream. Try testing it (of course, you'll actually have to make it work for the non-edge case first). – R Samuel Klatchko Feb 03 '10 at 05:29