split a char array into tokens where the separator is NUL char

Question

I want to split a char array into tokens using the NUL char as the separator.

I have a char array that I've received over the network from a recv call, so I know the length of the char array. In that char array there are a bunch of strings that are separated by the NUL char (\0).

Because the separator is the NUL char, that means I can't use strtok, because it uses NULL for its own purposes.

So I want to iterate through all the strings starting from byte 8 (the strings are preceded by two 32-bit integers).

I was thinking I could iterate through all the characters looking for the \0 character and then do a memcpy of the length I have found so far, but I figured there must be a nicer method than this.

What other approach can I take?

If you don't need to re-use the buffer for something else, you can keep the strings stored there, and just use pointers to their beginnings. They are already NUL-terminated. — Thomas Padron-McCarthy, May 15 '16 at 11:56
@ThomasPadron-McCarthy I haven't done much c in a while, how do I move the position of the pointer again? — Joel Pearson, May 15 '16 at 12:00
Or if you do need to re-use the buffer, you can find the first character of each string and `strcpy()` or `strdup()` it. Either way, do be certain that the last string is null-terminated, too, or otherwise handle it as a special case. — John Bollinger, May 15 '16 at 12:01
"how do I move the position of the pointer [...]?" You don't. A pointer is a value -- it is what it is. But you can compute a *new* pointer value that points to the desired location via pointer arithmetic. If you wish, you can store your new pointer value in the same *variable* that initially held the original one. If this is not yet making sense to you then it's time to review your C language basics. — John Bollinger, May 15 '16 at 12:06
"_the strings are preceded by 32 bit integers_". 'Strings' is plural. Is _every_ string preceded by a 32 bit length? Btw, 32 bit integers are only 4 bytes, not 8. — Paul Ogilvie, May 15 '16 at 12:13
@PaulOgilvie I think he means the data is of the form ("integer" ,"integer", 0-term string, 0-term string,.... ) and the last string might or might not be 0-terminated (this is not specified) and the total length is pre-given. — Henno Brandsma, May 15 '16 at 12:33
@PaulOgilvie yep, I understand that 32 bit integers are 4 bytes, I have 2 integers at the beginning, which is why I want to skip the first 8 bytes. — Joel Pearson, May 15 '16 at 12:51
@alk Thanks for the clarification about `NUL` and `NULL`. How come you deleted your answer out of curiosity? — Joel Pearson, May 15 '16 at 13:04
@JoelPearson: It had a bug, I needed to fix. It's back now. :-). Writing code without testing sometimes fails ... ;-) — alk, May 15 '16 at 13:22

Support Ukraine · Accepted Answer · 2016-05-15T13:15:04.613

Here is some simple code showing how you can get the contained strings:

#include <stdio.h>
#include <string.h>

int main(void) {
    char recbuf[7] = {'a', 'b', 'c', '\0', 'd', 'e', '\0'};
    int recbuf_size = 7;
    int j = 0;
    char* p = recbuf;
    while(j < recbuf_size) 
    {
        printf("%s\n", p);  // print the string found
                            // Here you could copy the string if needed, e.g.
                            // strcpy(mySavedStrings[stringCount++], p);

        int t = strlen(p);  // get the length of the string just printed
        p += t + 1;         // move to next string - add 1 to include string termination
        j += t + 1;         // remember how far we are
    }
    return 0;
}

Output:

abc
de

If you need to skip some bytes in the start of the buffer then just do:

int number_of_bytes_to_skip = 4;
int j = number_of_bytes_to_skip;
char* p = recbuf + number_of_bytes_to_skip;
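For the layout described in the question (two 32-bit integers followed by the strings) the skip would be 8 bytes. Here is a minimal sketch that also decodes the two integers before returning the position of the first string. It assumes the integers are sent in big-endian (network) byte order, which the question does not state, and the names `HEADER_SIZE`, `read_u32_be` and `skip_header` are made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>

enum { HEADER_SIZE = 8 };  /* two 32-bit integers, as described in the question */

/* Hypothetical helper: decode one 32-bit integer from 4 bytes.
   ASSUMPTION: big-endian (network) byte order; adjust if the sender differs. */
static uint32_t read_u32_be(const char *bytes)
{
    const unsigned char *b = (const unsigned char *)bytes;
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Stores the two header integers in *first and *second and returns a pointer
   to the first string, or NULL if the buffer is too short to hold the header. */
static const char *skip_header(const char *recbuf, size_t recbuf_size,
                               uint32_t *first, uint32_t *second)
{
    if (recbuf_size < HEADER_SIZE)
        return NULL;
    *first  = read_u32_be(recbuf);
    *second = read_u32_be(recbuf + 4);
    return recbuf + HEADER_SIZE;
}
```

The returned pointer can be used as the initial value of `p`, with `j` starting at `HEADER_SIZE`.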

Notice:

The code above assumes that the receive buffer is always correctly terminated with a '\0'. In real world code, you should check that before running the code and add error handling, e.g.:

if (recbuf[recbuf_size-1] != '\0')
{
    // Some error handling...
}
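An alternative to checking only the last byte is to bound every search by the number of bytes received, so that an unterminated last string can never cause a read past the end of the buffer. This is a sketch, not a drop-in replacement; `next_string` is a hypothetical helper name, and it uses `memchr()` instead of `strlen()` because `memchr()` takes a length limit:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the string starting at buf[*pos] and advances
   *pos past its terminator. Returns NULL when no complete (NUL-terminated)
   string remains within the first `size` bytes. */
static const char *next_string(const char *buf, size_t size, size_t *pos)
{
    if (*pos >= size)
        return NULL;

    const char *start = buf + *pos;
    const char *nul = memchr(start, '\0', size - *pos);  /* bounded search */
    if (nul == NULL)
        return NULL;                    /* trailing bytes are not terminated */

    *pos = (size_t)(nul - buf) + 1;     /* continue after the terminator */
    return start;
}

/* Typical use, with recbuf and recbuf_size as in the code above:
 *
 *     size_t pos = 0;   // or the number of bytes to skip
 *     const char *s;
 *     while ((s = next_string(recbuf, recbuf_size, &pos)) != NULL)
 *         printf("%s\n", s);
 */
```

When the function returns NULL with `pos` still less than the buffer size, the remaining bytes are an unterminated fragment, which is the case that needs the error handling.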

I re-use the recbuf for multiple invocations of recv. Does that mean I'd need to reset the pointer back to the beginning to use this approach? If I simply make a copy of recbuf before iterating through it, then the original pointer should be ok? — Joel Pearson, May 15 '16 at 13:02
@JoelPearson - whenever you receive a new buffer, you need to set the pointer `p` to the address of the new buffer (and add the number of bytes to skip). If you reuse the same receive buffer then you need to set the pointer `p` back before parsing the buffer — Support Ukraine, May 15 '16 at 13:04
Just tried it out and it works great for what I need, thanks a lot! — Joel Pearson, May 15 '16 at 13:12

Vagish · Answer 2 · 2016-05-15T13:03:41.983

NUL separation actually makes your job a lot easier.

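char* DestStrings[MAX_STRINGS];
int j = 0;            // number of NUL terminators passed so far
int length = 0;       // total characters of all strings found so far
int prevLength = 0;   // value of length before the current string
int offset = 8;       // skip the two 32-bit integers
for(int i = 0;i<MAX_STRINGS;i++) 
{
     length += strlen(&srcbuffer[j+offset+length]); 
     if(length == prevLength)    // empty string: end of the data
     { 
       break;
     }
     else
     {
       DestStrings[i] = malloc(length-prevLength+1);
       // copy from the start of the current string, not from its terminator
       strcpy(DestStrings[i],&srcbuffer[j+offset+prevLength]);
       prevLength = length;
       j++;
     }
}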

You need to add a few additional checks to avoid potential buffer overflow errors. I hope this code gives you an idea of how to proceed.

EDIT 1: This is only code to start from, not the entire solution, but because of the down-votes I have modified the indexing.

EDIT 2: Since the length of the received data is already known, append a NUL to the received data to make this code work as is. Alternatively, the length of the received data can itself be compared against the length copied so far.
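A rough sketch of the "append NUL" idea, assuming the received bytes are copied into a slightly larger buffer before parsing. Note that it appends two NUL bytes rather than one: the first terminates a last string that may have arrived unterminated, and the second provides the empty string that the loop above treats as its stop condition. The name `copy_with_end_marker` is hypothetical:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies `len` received bytes into a new buffer that is
   two bytes larger and ends with "\0\0". Returns NULL on allocation failure;
   the caller must free() the result. */
static char *copy_with_end_marker(const char *data, size_t len)
{
    char *copy = malloc(len + 2);
    if (copy == NULL)
        return NULL;
    if (len > 0)
        memcpy(copy, data, len);
    copy[len]     = '\0';   /* terminates a possibly unterminated last string */
    copy[len + 1] = '\0';   /* empty string that marks the end of the list */
    return copy;
}
```

With such a copy, every `strlen()` call in the loop stops inside the allocated buffer, whatever the received bytes contain.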

alk · Answer 3 · 2016-05-16T07:07:43.173

Assuming this input data:

char input[] = {
  0x01, 0x02, 0x0a, 0x0b,  /* A 32-bit integer */
  'h', 'e', 'l', 'l', 'o', 0x00, 
  'w', 'o', 'r', 'l', 'd', 0x00,
  0x00 /* Necessary to mark the end of the payload. */
};

A 32-bit integer at the beginning gives:

const size_t header_size = sizeof (uint32_t);

Parsing the input can be done by identifying each "string"'s first character, storing a pointer to it, and then moving on by exactly the length of the string found (plus 1 for its terminator), repeating until the end of the input has been reached.

size_t strings_elements = 1; /* Set this to which ever start size you like. */
size_t delta = 1; /* 1 is conservative and slow for larger input, 
                     increase as needed. */

/* Result as array of pointers to "string": */
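char ** strings = malloc((strings_elements + 1) * sizeof *strings); /* One more 
                                              for the NULL stopper element. */
if (NULL == strings)
{
  perror("malloc() failed");
  exit(EXIT_FAILURE);
}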

{  
  char * pc = input + header_size;
  size_t strings_found = 0;
  /* Parse input, if necessary increase result array, and populate its elements: */
  while ('\0' != *pc)
  {
    if (strings_found >= strings_elements)
    {
      strings_elements += delta;
      void * pvtmp = realloc(
        strings, 
        (strings_elements + 1) * sizeof *strings /* Allocate one more to have a 
                                        stopper, being set to NULL as a sentinel.*/
      ); 

      if (NULL == pvtmp)
      {
        perror("realloc() failed");
        exit(EXIT_FAILURE);
      }

      strings = pvtmp;
    }

    strings[strings_found] = pc; 
    ++strings_found;

    pc += strlen(pc) + 1;
  }

  strings[strings_found] = NULL; /* Set a stopper element. 
                                    NULL terminate the pointer array. */
}

/* Print result: */
{
  char ** ppc = strings;
  for(; NULL != *ppc; ++ppc)
  {
    printf("%zu: '%s'\n", (size_t) (ppc - strings + 1), *ppc);
  }
}

/* Clean up: */
free(strings);

If you need to copy on split, replace this line

  strings[strings_found] = pc;

by

  strings[strings_found] = strdup(pc);

and add clean-up code after using and before free()ing strings:

{
  char ** ppc = strings;
  for(; NULL != *ppc; ++ppc)
  {
    free(*ppc);
  }
}

The code above assumes that at least one '\0' (NUL, aka the null character) follows the payload.

If that condition is not met, you either need some other terminating sequence to be defined, or you need to know the size of the input from some other source. Otherwise the problem is not solvable.

The code above needs the following headers:

#include <inttypes.h> /* for uint32_t */
#include <stdio.h> /* for printf(), perror() */
#include <string.h>  /* for strlen() */
#include <stdlib.h> /* for malloc(), realloc(), free(), exit() */

It might also need one of the following defines:

#define _POSIX_C_SOURCE 200809L

#define _GNU_SOURCE

or whatever else your C compiler requires to make strdup() available.
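If none of those defines is an option, a small stand-in written with standard C functions only can be used instead. This is a sketch; `dup_string` is a hypothetical name chosen to avoid clashing with a real `strdup()`:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for strdup(): returns a malloc()ed copy of s,
   or NULL if the allocation fails. The caller must free() the result. */
static char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;   /* + 1 for the terminating '\0' */
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}
```

Like strdup(), it can return NULL, so the result should be checked before it is stored in the result array.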

Adrian Jałoszewski · Answer 4 · 2016-05-15T14:07:17.303

I'd suggest using a structure that implements the tokenizer for this kind of work. It'll be easier to read and to maintain because it resembles object-oriented code. It isolates the memcpy, so I think it's "nicer".

First, the headers I'll use:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

The Tokenizer structure has to remember the beginning of the string (so that we can free the memory once it's no longer needed), the current position (actual_index), and the end position (end_index), which lets us check whether we have already parsed the whole string:

struct Tokenizer {
    char *string;
    char *actual_index;
    char *end_index;
};

I suggest using a factory-like function to create a tokenizer. It copies the input with memcpy, because the str* functions from string.h (such as strcpy) stop at the first '\0' character.

struct Tokenizer getTokenizer(char string[], unsigned length) {
    struct Tokenizer tokenizer;
    tokenizer.string = (char *)malloc(length);
    tokenizer.actual_index = tokenizer.string;
    tokenizer.end_index = tokenizer.string + length;
    memcpy(tokenizer.string, string, length); 
    return tokenizer;
}

Now the function responsible for getting the tokens. It returns newly allocated strings, each terminated by a '\0' character, and advances actual_index to the start of the next token. It takes the address of the tokenizer as its argument so that it can change the tokenizer's values:

char * getNextToken(struct Tokenizer *tokenizer) {
    char * token;
    unsigned length;
    if(tokenizer->actual_index == tokenizer->end_index) 
        return NULL;
    length = strlen(tokenizer->actual_index);
    token = (char *)malloc(length + 1); 
    // + 1 because the '\0' character has to fit in
    strncpy(token, tokenizer->actual_index, length + 1);
    for(;*tokenizer->actual_index != '\0'; tokenizer->actual_index++) 
        ; // getting the next position
    tokenizer->actual_index++;
    return token;
}

Sample use of the tokenizer, showing how to use it and how to handle the memory allocation:

int main() {
    char c[] = "Lorem\0ipsum dolor sit amet,\0consectetur"
        " adipiscing elit. Ut\0rhoncus volutpat viverra.";
    char *temp;
    struct Tokenizer tokenizer = getTokenizer(c, sizeof(c));
    while((temp = getNextToken(&tokenizer))) {
        puts(temp);
        free(temp);
    }
    free(tokenizer.string);
    return 0;
}

I come from a Java background, so this way does read a lot nicer to me, thanks. — Joel Pearson, May 15 '16 at 21:56
