I have this piece of code that is supposed to split a string on a delimiter. It seems to work when the delimiter is '\0' (the null character), but when I try it with '\n' (newline) I get a weird result that I can't explain. Here's the code:

  char **results = malloc(*nb_delimiter * sizeof(char *));

  // Separate result with delimiter
  char *next;
  char *pointer = "1\n2\n3\n4";
  for (int i = 0; i < *nb_delimiter; ++i)
  {
    next = strchr(pointer, delimiter);
    results[i] = pointer;
    pointer = next + 1;
  }
  printf("%s\n", results[0]);

For example, when I run it, it gives me:

  • results[0] : 1\n2\n3\n4
  • results[1] : 2\n3\n4
  • results[2] : 3\n4
  • results[3] : 4

Instead of:

  • results[0] : 1
  • results[1] : 2
  • results[2] : 3
  • results[3] : 4

I really don't get it. When I search for ways to split a string in C, the majority recommend strchr(), so I don't understand why I have this problem.

Thanks

  • *"...the majority recommend strchr()"* - interesting, because the *vast* majority I continually see recommend `strtok_r`. Regardless, nowhere in your code do you (a) make copies of the strings, nor (b) pilfer the existing string data by hard-writing nullchar values on the delimiters (and in this case you can't anyway, because you'd be modifying a literal, so even `strtok_r` would be off limits). – WhozCraig Oct 21 '20 at 04:52
  • You don't write a null byte over the newline characters you find, so the strings continue after the newlines. Using `strtok()` or `strtok_r()` or `strtok_s()`, the delimiter is zapped with a null byte, so the newlines go and you get the shorter strings. The disadvantage of the `strtok()` routines is that you don't know which character was zapped if there are multiple possibilities — when there's only one (a newline, for example), there isn't a problem. – Jonathan Leffler Oct 21 '20 at 05:04

1 Answer

With char* strings, the representation is very primitive: it is simply a sequence of chars stored contiguously in memory, and a null character '\0' (zero byte) marks where the string ends. There is no side information that otherwise records the string length. (See null-terminated string on Wikipedia.)
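
To make that concrete, here is a minimal sketch. The helper name show_from is made up for illustration and is not part of the question's code. It prints whatever "%s" sees from a given pointer and returns the length strlen() reports, which shows that both simply run forward until the first '\0':

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: prints the string that
 * starts at p and returns its length. Both printf("%s") and strlen()
 * read forward from p until they reach the terminating '\0'.
 */
size_t show_from(const char *p)
{
    assert(p != NULL);
    size_t len = strlen(p);
    printf("\"%s\" (%zu chars)\n", p, len);
    return len;
}
```

With const char *s = "1\n2\n3\n4";, show_from(s) reports 7 characters and show_from(s + 2) reports 5. Moving the pointer changes where the string starts, but it still ends at the same single '\0'.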

Your results "1\n2\n3\n4", "2\n3\n4", etc. are pointing to the start of the strings as intended, but they don't end where they should. You need to write null characters into the original string, replacing the delimiters, so that it splits into multiple pieces.

However, the source string is a literal, so attempting to modify it is undefined behavior (UB). To do this properly, first make a copy of the string (e.g. by strcpy'ing it into a buffer of at least strlen(source) + 1 bytes), then manipulate the copy.
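
A minimal sketch of that approach follows. The function name split_in_copy and its parameters are assumptions made for this example, not something from the question. It copies the source, then walks the copy with strchr() and overwrites each delimiter with '\0':

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 * Copies src into a heap buffer, replaces each delimiter in the copy
 * with '\0', and stores a pointer to the start of each piece in results.
 * Returns the number of pieces stored (at most max_pieces); any text
 * beyond the last stored piece is ignored. Returns 0 if malloc fails.
 * results[0] points at the start of the buffer, so free(results[0])
 * releases all pieces at once.
 */
size_t split_in_copy(const char *src, char delimiter,
                     char **results, size_t max_pieces)
{
    assert(src != NULL && results != NULL && max_pieces > 0);
    assert(delimiter != '\0'); /* strchr would match the terminator itself */

    char *copy = malloc(strlen(src) + 1); /* +1 for the terminating '\0' */
    if (copy == NULL)
        return 0;
    strcpy(copy, src);

    size_t count = 0;
    char *pointer = copy;
    while (count < max_pieces) {
        results[count++] = pointer;
        char *next = strchr(pointer, delimiter);
        if (next == NULL)
            break;            /* no more delimiters: last piece stored */
        *next = '\0';         /* end the current piece here */
        pointer = next + 1;   /* next piece starts after the delimiter */
    }
    return count;
}
```

Called as split_in_copy("1\n2\n3\n4", '\n', results, 4) with a char *results[4] array, this should store four pieces, "1", "2", "3" and "4". The NULL check on strchr() also avoids computing next + 1 from a null pointer when the final piece has no trailing delimiter.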

Pascal Getreuer