0

I am tokenizing lines of delimited by commas and since some titles have commas in them, they are surrounded by quotes. I tokenize up to the first quote, store it in fronttoken, then up to the 2nd quote, store it into title, and then finally I tokenize until the \n. I don't understand why.

Does it have to do with strsep? I didn't use strtok because I want to catch the null tokens if the string consisted of ",," multiple delimiters with nothing in between them.

Since tempStr has malloced well over the necessary amount, it should be sufficient to as a *dest for strcpy. I have been stuck on this for hours. If anyone could point out my mistake, I will greatly appreciate it. Thank you.

int main(int argc, char * argv[])
{   
    char* one = "hello, my, name, is, code monkey, \"This, is a title\", more, random, stuff\n";
    char* two = "blah blah blah";

    char* tempStr= malloc(1000);
    void* freeTempStr = tempStr;

    strcpy(tempStr, one);

    char* fronttoken = strsep(&tempStr, "\"");
    char* title = strsep(&tempStr, "\"");
    char* backtoken = strsep(&tempStr, "\n");
    char* token;

    strcpy(tempStr, fronttoken);
    token = strsep(&tempStr, ",");
    while (token != NULL)
    {
        printf("Front tokens: %s\n", token);
        token = strsep(&tempStr, ",");
    }
    printf("Title: %s\n", title);
    strcpy(tempStr, backtoken);
    token = strsep(&tempStr, ",");
    while (token != NULL)
    {
        printf("Back tokens: %s\n", token);
        token = strsep(&tempStr, ",");
    }
    //2nd strcpy gives segmentation fault

    free(freeTempStr)
    return 0;
}

Output...

Front tokens: hello
Front tokens:  my
Front tokens:  name
Front tokens:  is
Front tokens:  code monkey
Front tokens:
Title: This, is a title
Segmentation fault
codemonkey
  • 58
  • 1
  • 8
  • 1
    I don't understand what you are trying to do, have you really understand the behavior of `strsep()` ? – Stargateur Nov 29 '17 at 02:06
  • 1
    `strsep` will set `tempStr` to `NULL` if the separator is not found, so you are probably passing NULL as first argument to `strcpy`, causing segfault. You could test out this hypothesis by using the debugger and inspecting the value of `tempStr` when the segfault happens – M.M Nov 29 '17 at 02:06
  • @Stargateur I am trying to tokenize a large CSV file containing movie attributes properly. Some of the movies have commas in the title, which is why I have a front and back token in order to get around the comma in the title. – codemonkey Nov 29 '17 at 03:08
  • A properly formatted CSV file will have double quotes around fields that contain commas. You have to do different parsing if the field starts with a double quote. If your file is not properly formatted, then it is ambiguous. Also, you will lose embedded commas in strings if you don’t handle them properly. Inside a double quoted field, two adjacent double quotes in the input map to one in the converted string. You have to undo this on output, of course. – Jonathan Leffler Nov 29 '17 at 03:31
  • @M.M The initial value of tempStr before tokenizing the front token is 0x604440 "" then when strsep tokenizes the last token, then tempStr becomes 0x0. Do I have to malloc the tempStr again before using it again? – codemonkey Nov 29 '17 at 03:33
  • @JonathanLeffler For my CSV, the max amount of "quoted" fields is 1, which is the title, which is why I split each line into three parts in order to maintain the comma in the quoted title. – codemonkey Nov 29 '17 at 03:36
  • You don't have to do any more mallocs... you have to understand what the `strsep` function does to `tempStr` and write the rest of your code accordingly – M.M Nov 29 '17 at 03:38
  • @M.M the strsep function advances the pointer to the next delimeter and when it doesn't find the delimeter, the pointer is lost? So should I save the original pointer when I malloc and just set tempStr back to the original pointer when I need to strsep again? – codemonkey Nov 29 '17 at 03:47
  • **The code doesn't compile for me on Unbuntu 16 with gcc 6.2.0, even after adding necessary standard library headers.** No semicolon after the last `free` call. – Kaz Nov 29 '17 at 18:38

2 Answers2

0

Having looked harder at the code and compiled and experimented with it, I agree with M.M's analysis that the immediate problem is that you're trying to copy to a null pointer.

However, I suspect you've not carefully considered the memory management in general. When you do strcpy(tempStr, one);, you copy the source string into allocated memory. (Un)fortunately, you allocated far more memory than the copy needed. When you subsequently do strcpy(tempStr, fronttoken);, you are copying the fronttoken to a location after the original copy of one in tempStr. You then split that up.

You crash because you stop when tempStr has been set to null, and then you try strcpy(tempStr, backtoken), copying to the null pointer.

If you fix that, you could run into the problem of overlapping string copy. Your current set of back tokens is small enough that it isn't a problem, but if you had a 100 bytes of back tokens, you would have an overlapping string copy and undefined behaviour.

This code shows the problem and fixes it. Note that it includes square brackets around the tokens so it is easier to see exactly what is found.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{   
    char* one = "hello, my, name, is, code monkey, \"This, is a title\", more, random, stuff\n";

    char* tempStr= malloc(1000);
    void* freeTempStr = tempStr;

    strcpy(tempStr, one);
    printf("tempStr = [%p .. %p)\n", (void *)tempStr, (void *)(tempStr + 1000));

    char* fronttoken = strsep(&tempStr, "\"");
    printf("tempStr = %p; fronttoken = %p\n", (void *)tempStr, (void *)fronttoken);
    char* title = strsep(&tempStr, "\"");
    printf("tempStr = %p; title = %p\n", (void *)tempStr, (void *)title);
    char* backtoken = strsep(&tempStr, "\n");
    printf("tempStr = %p; backtoken = %p\n", (void *)tempStr, (void *)backtoken);
    char* token;

    printf("tempStr = %p; fronttoken = %p - before strcpy 1\n", (void *)tempStr, (void *)fronttoken);
    strcpy(tempStr, fronttoken);
    token = strsep(&tempStr, ",");
    while (token != NULL)
    {
        printf("Front tokens: %p [%s]\n", (void *)token, token);
        token = strsep(&tempStr, ",");
    }
    printf("Title: [%s]\n", title);
    printf("tempStr = %p; backtoken = %p - before strcpy 2 (unfixed)\n", tempStr, backtoken);
    tempStr = freeTempStr;
    printf("tempStr = %p; backtoken = %p - before strcpy 2 (fixed - but beware overlap)\n", tempStr, backtoken);
    strcpy(tempStr, backtoken);
    token = strsep(&tempStr, ",");
    while (token != NULL)
    {
        printf("Back tokens: %p [%s]\n", (void *)token, token);
        token = strsep(&tempStr, ",");
    }

    free(freeTempStr);
    return 0;
}

Sample output (Mac running macOS High Sierra 10.13.1, with the security update 2017-001 installed — macOS 10.13.1 (17B1002) — install it if you've not already done so!):

tempStr = [0x7fb91ac02880 .. 0x7fb91ac02c68)
tempStr = 0x7fb91ac028a3; fronttoken = 0x7fb91ac02880
tempStr = 0x7fb91ac028b4; title = 0x7fb91ac028a3
tempStr = 0x7fb91ac028ca; backtoken = 0x7fb91ac028b4
tempStr = 0x7fb91ac028ca; fronttoken = 0x7fb91ac02880 - before strcpy 1
Front tokens: 0x7fb91ac028ca [hello]
Front tokens: 0x7fb91ac028d0 [ my]
Front tokens: 0x7fb91ac028d4 [ name]
Front tokens: 0x7fb91ac028da [ is]
Front tokens: 0x7fb91ac028de [ code monkey]
Front tokens: 0x7fb91ac028eb [ ]
Title: [This, is a title]
tempStr = 0x0; backtoken = 0x7fb91ac028b4 - before strcpy 2 (unfixed)
tempStr = 0x7fb91ac02880; backtoken = 0x7fb91ac028b4 - before strcpy 2 (fixed - but beware overlap)
Back tokens: 0x7fb91ac02880 []
Back tokens: 0x7fb91ac02881 [ more]
Back tokens: 0x7fb91ac02887 [ random]
Back tokens: 0x7fb91ac0288f [ stuff]

Note that using a debugger would also allow you to find this problem pretty quickly.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
0

Your code doesn't compile as posted; there is a missing semicolon. Moreover, numerous standard headers are missing.

On 64 bit Ubuntu 16 with gcc 6.2.6, a crash occurs other than where you are observing it. Because the <string.h> header is missing, the strsep function is implicitly declared as returning int, which is wrong. Because of this, the fronttoken variable receives a garbage value and the very first strcpy fails.

The first thing to do is get a clean build with no errors or warnings (after turning on any diagnostics that are widely recommended by the community of users of your compiler).

After all that is fixed you have a simple logic problem:

while (token is not null) {
   token = strsep(&tempStr, ...)

}

strcpy into tempStr

Since the while loop contains no break statements, the only way it terminates is if token becomes null.

The only way token becomes null is if strsep returns null.

The only way strsep returns null, according to the documentation, is if tempStr is null.

I.e. the fact that token is null proves that tempStr must be null, which means that we must not use tempStr as the destination of a strcpy.

How tempStr becomes null is that in a previous call to strsep, no token delimiter was found. In that case strsep returns the whole string as the extracted token, and overwrites the pointer with null.

In other words, after extracting the last token from the string, strsep overwrites the pointer with null. Then the next time strsep is called, it returns null, indicating "no more tokens are available". This makes strsep easy to use: just keep calling it until you get a null. But you have to understand that the temporary context pointer gets nulled out in the process.

Kaz
  • 55,781
  • 9
  • 100
  • 149