7

In the following program, strtok() works as expected for the most part, but I just can't comprehend the reason behind one finding. I have read this about strtok():

To determine the beginning and the end of a token, the function first scans from the starting location for the first character not contained in delimiters (which becomes the beginning of the token). And then scans starting from this beginning of the token for the first character contained in delimiters, which becomes the end of the token.

Source: http://www.cplusplus.com/reference/cstring/strtok/

And as we know, strtok() places a \0 at the end of each token. But in the following program, the last delimiter is a dot (.), and Toad sits between that dot and the closing quotation mark ("). The dot is a delimiter in my program, but there is no delimiter after Toad, not even a white space (which is also a delimiter in my program). Please clear up the following confusion arising from this premise:

Why does strtok() consider Toad a token even though it is not between two delimiters? This is what I read about strtok() when it encounters a null character (\0):

Once the terminating null character of str has been found in a call to strtok, all subsequent calls to this function with a null pointer as the first argument return a null pointer.

Source: http://www.cplusplus.com/reference/cstring/strtok/

Nowhere does it say that once a null character is encountered, a pointer to the beginning of the token is returned. (We don't even have a token here: we never got an end of the token, because no delimiter character was found after the scan began from the beginning of the token, i.e. from the 'T' of Toad. We only found a null character, not a delimiter.) So why is the part between the last delimiter and the closing quotation mark of the argument string considered a token by strtok()? Please explain this.

Code:

#include <stdio.h>
#include <string.h>

int main(void)
{
  char str[] = " Falcon,eagle-hawk..;buzzard,gull..pigeon sparrow,hen;owl.Toad";
  char *pch = strtok(str, " ;,.-");

  while (pch != NULL)
  {
    printf("%s\n", pch);
    pch = strtok(NULL, " ;,.-");
  }

  return 0;
}

Output:

Falcon
eagle
hawk
buzzard
gull
pigeon
sparrow
hen
owl
Toad

Rüppell's Vulture
  • Not sure I understand your question; what output did you expect? That `Toad` would not be printed? Going by that logic if you remove the leading space in the input string, `Falcon` shouldn't be printed either. I would say that makes for some unintuitive behavior. – Praetorian May 15 '13 at 17:13
  • If you deleted the blank before the Falcon, `strtok()` would still consider 'Falcon' to be the first token. – Jonathan Leffler May 15 '13 at 17:44
  • @JonathanLeffler I have deliberately done that. Like I said, all is as expected from `strtok()`, except the last token, which is clearly not between two delimiters. – Rüppell's Vulture May 15 '13 at 19:07
  • @JonathanLeffler I regret I had to go outside right after posting this question. – Rüppell's Vulture May 15 '13 at 19:07
  • @Praetorian Why shouldn't I expect `Falcon` to be printed? I have quoted from the source that `the function first scans from the starting location for the first character not contained in delimiters`, i.e. for the beginning of the token we don't need a delimiter (space is a delimiter in my program), but to mark the end of the token we clearly **need** a delimiter, and the null character at the end of the string is not on the delimiter list. – Rüppell's Vulture May 15 '13 at 19:10
  • @JonathanLeffler I am surprised I couldn't convey my point even to you in this question. – Rüppell's Vulture May 15 '13 at 19:13

5 Answers

9

The standard's specification of strtok (7.24.5.8) is pretty clear. In particular, paragraph 4 is directly relevant to the question, if I understand it correctly:

3 The first call in the sequence searches the string pointed to by s1 for the first character that is not contained in the current separator string pointed to by s2. If no such character is found, then there are no tokens in the string pointed to by s1 and the strtok function returns a null pointer. If such a character is found, it is the start of the first token.

4 The strtok function then searches from there for a character that is contained in the current separator string. If no such character is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token will return a null pointer. If such a character is found, it is overwritten by a null character, which terminates the current token. The strtok function saves a pointer to the following character, from which the next search for a token will start.

In a call

char *where = strtok(string_or_NULL, delimiters);

the returned pointer, if any, points to a token that extends from the first non-delimiter character found at or after the starting position (inclusive) to the next delimiter character (exclusive) if one exists, or otherwise to the end of the string.

The linked description, unlike the standard, doesn't explicitly mention the case of a token extending to the end of the string, so it is incomplete in that respect.
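To make paragraphs 3 and 4 concrete, here is a minimal sketch of how a strtok-like function could be written with strspn() and strcspn(). The name sketch_strtok and its internals are hypothetical and only for illustration; real library implementations differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration of the standard's wording, not a real library's code. */
char *sketch_strtok(char *s, const char *delims)
{
    static char *saved = NULL;      /* where the next search will start */

    assert(delims != NULL);
    if (s == NULL)
        s = saved;                  /* continue with the previous string */
    if (s == NULL)
        return NULL;                /* nothing left to search */

    /* Paragraph 3: skip separators to find the start of a token. */
    s += strspn(s, delims);
    if (*s == '\0') {
        saved = NULL;
        return NULL;                /* only separators remained: no token */
    }

    /* Paragraph 4: search from there for a separator. */
    char *end = s + strcspn(s, delims);
    if (*end == '\0') {
        saved = NULL;               /* token extends to the end of the string */
    } else {
        *end = '\0';                /* overwrite the separator */
        saved = end + 1;            /* next search starts after it */
    }
    return s;
}
```

In this sketch, strcspn() stops at the terminating null character when no separator follows, so the final token (Toad in the question) is returned like any other, and the call after it returns a null pointer.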

Daniel Fischer
  • `If no such character is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token will return a null pointer` Thank you, that **nails it**, taken right from the standard. That's **exactly** what I wanted to know. – Rüppell's Vulture May 15 '13 at 19:55
  • **BULL'S EYE**. **BANG ON TARGET!!** – Rüppell's Vulture May 15 '13 at 19:59
4

Going to the description in POSIX for strtok(), the description says:

char *strtok(char *restrict s1, const char *restrict s2);

A sequence of calls to strtok() breaks the string pointed to by s1 into a sequence of tokens, each of which is delimited by a byte from the string pointed to by s2. The first call in the sequence has s1 as its first argument, and is followed by calls with a null pointer as their first argument. The separator string pointed to by s2 may be different from call to call.

The first call in the sequence searches the string pointed to by s1 for the first byte that is not contained in the current separator string pointed to by s2. If no such byte is found, then there are no tokens in the string pointed to by s1 and strtok() shall return a null pointer. If such a byte is found, it is the start of the first token.

The strtok() function then searches from there for a byte that is contained in the current separator string. If no such byte is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token shall return a null pointer. If such a byte is found, it is overwritten by a NUL character, which terminates the current token. The strtok() function saves a pointer to the following byte, from which the next search for a token shall start.

Note the second sentence of the third paragraph:

If no such byte is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token shall return a null pointer.

This clearly states that in the example in the question, Toad is indeed a token. One way to think of it is that the list of delimiters always includes the NUL '\0' at the end of the delimiter string.


Having diagnosed that, note that strtok() is not a good function to use: it is neither thread-safe nor reentrant. On Windows, you can use strtok_s() instead; on Unix, you can usually use strtok_r(). These are better functions because they don't store internally the pointer at which the search is to resume.

Because strtok() is not reentrant, you cannot call a function that uses strtok() from inside a function that itself uses strtok() while it is using strtok(). Also, any library function that uses strtok() must be clearly identified as doing so because it cannot be called from a function that is using strtok(). So, using strtok() makes life hard.
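As a sketch of why the caller-supplied state matters, the following uses POSIX strtok_r() to run one tokenizing loop inside another, which plain strtok() cannot do safely. The function name count_words and the record format (records separated by ';', words by spaces) are assumptions made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts space-separated words across ';'-separated records.
   The input buffer is modified, just as with strtok(). */
size_t count_words(char *text)
{
    size_t total = 0;
    char *outer_state;   /* resume position for the record loop */
    char *inner_state;   /* resume position for the word loop */

    assert(text != NULL);
    for (char *record = strtok_r(text, ";", &outer_state);
         record != NULL;
         record = strtok_r(NULL, ";", &outer_state)) {
        /* The inner loop keeps its own state, so it does not disturb the outer one. */
        for (char *word = strtok_r(record, " ", &inner_state);
             word != NULL;
             word = strtok_r(NULL, " ", &inner_state)) {
            total++;
        }
    }
    return total;
}
```

With plain strtok(), the inner loop would overwrite the single hidden resume pointer, and the outer loop would not continue from where it left off.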

The other problem with the strtok() family of functions (and with strsep(), which is related) is that they overwrite the delimiter; you can't find out what the delimiter was after the tokenizer has tokenized the string. This can matter in some applications, such as parsing shell command lines, where it matters whether the delimiter is a pipe or a semicolon or an ampersand (or ...). So shell parsers usually don't use strtok(), despite the number of questions on SO about shells where the parser does use strtok().
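One possible way to keep the delimiter visible is to scan without modifying the string, using strspn() and strcspn() directly. The struct token_span and the function next_span below are hypothetical names for a minimal sketch of that idea, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: describes one token without altering the input. */
struct token_span {
    const char *start;    /* first character of the token, NULL if none */
    size_t      length;   /* number of characters in the token */
    char        ended_by; /* separator that ended it, '\0' at end of string */
};

/* Finds the next token at or after `from`; the string is left untouched. */
struct token_span next_span(const char *from, const char *delims)
{
    struct token_span span = { NULL, 0, '\0' };

    assert(from != NULL && delims != NULL);
    from += strspn(from, delims);           /* skip leading separators */
    if (*from == '\0')
        return span;                        /* no more tokens */

    span.start    = from;
    span.length   = strcspn(from, delims);  /* stops at a separator or at '\0' */
    span.ended_by = from[span.length];
    return span;
}
```

To get the following token, call next_span() again with start + length. Note that this sketch records only the first separator after a token; a run of several separators would need extra handling.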

Generally, you should steer clear of plain strtok(), and it is up to you to decide whether strtok_r() or strtok_s() is appropriate for your purposes.

Jonathan Leffler
2

Because cplusplus.com isn't telling you the whole story. Cppreference.com has a better description.

Cplusplus.com also fails to mention that strtok is not thread-safe, and only documents the strtok function of the C++ programming language, whereas cppreference.com does mention the thread safety issue and documents the strtok functions of both the C and the C++ programming languages.

Oktalist
0

strtok breaks a string into a sequence of tokens, separated by the given delimiters. Delimiters only separate tokens; they do not necessarily terminate them on both sides.

0

Are you perhaps just mis-reading the description?

Once the terminating null character of str has been found in a call to strtok, all subsequent calls to this function with a null pointer as the first argument return a null pointer.

Given 'subsequent', I'm reading this as every call to strtok after the one that discovered \0, not necessarily the current one itself. So, the definition is consistent with behavior (and with what you would expect from strtok).

Matt Phillips
  • From the description in the source, it is obvious that it says the end of the token is not possible without a delimiter. Subsequent calls or the current call doesn't matter in this context. Here is what it says about the end of a token: `And then scans starting from this beginning of the token for the first character contained in delimiters, which becomes the end of the token.` – Rüppell's Vulture May 15 '13 at 19:12
  • @Rüppell'sVulture I agree that that description doesn't describe well in a case where the initial string is ".Toad". However it seems clear at this point that the issue here is just poor documentation on the part of the source, nothing wrong with `strtok` per se. – Matt Phillips May 15 '13 at 19:38
  • I won't say `strtok` is wrong even by a slip of the tongue!! Anyway, you got close to what I intend to ask. See, at the end of the penultimate token, the pointer is pointing to the `T` of `Toad`, but to mark the end of the token, it needs a delimiter. But there is no delimiter after that, and the null character is encountered, at which point it stops. So how is **Toad** a token? – Rüppell's Vulture May 15 '13 at 19:41
  • @Rüppell'sVulture :) I'm not sure where you're from but in the US we say 'Uncle!' at this point--yes, you're right! cplusplus.com's documentation is inadequate. But though popular, there is no sense I know of in which it's canonical or representative of the C language in any official way. So perhaps shoot them an email... – Matt Phillips May 15 '13 at 20:01
  • Search for any library function in C and cplusplus.com comes first on Google. That had made me feel it's as holy as the Bible. But now I am having second thoughts. I have been cautioned many times in the last few days about that site. Whom to trust in this world now? – Rüppell's Vulture May 15 '13 at 20:03
  • Hey Matt, look at the two new answers I got from DF and JL. – Rüppell's Vulture May 15 '13 at 20:04
  • @Rüppell'sVulture Right, indeed "The linked description doesn't explicitly mention the case of a token extending until the end of the string, as opposed to the standard, so it is incomplete in that respect." from DF is exactly what I was saying. So looks like there's a pretty clear consensus on this one. – Matt Phillips May 15 '13 at 20:11
  • @MattPhllips I'll be careful about that site henceforth. Actually, the layout of that site is very attractive and professional-looking. And it has the "Business only, no small talk" feel about it. – Rüppell's Vulture May 15 '13 at 20:13