Boost tokenizer fails to parse csv file having field with double quote

Question

I am parsing a .csv file that has two columns. I am trying to parse each row with the Boost tokenizer, and one of the fields in a row is in double quotes (e.g. 1,"test"). After tokenizing, the field in tok comes back without its double quotes (1,test).

// Assumes #include <boost/tokenizer.hpp>, using namespace std; and using namespace boost;
// inputFile is an open ifstream and line is a string.
typedef tokenizer<escaped_list_separator<char>> Tokenizer;
if (getline(inputFile, line))
{
    Tokenizer tok(line);
    vector<string> vec(tok.begin(), tok.end());

    // Here vec[1] is the string test, without the double quotes
}

Is there any way to get this second field with its double quotes?

Indeed, the double quotes are 'eaten' by the tokenizer. But if they weren't, you'd have to remove them yourself. Or, if you're really attached to them, why not add the quotes back yourself? So, please elaborate: why is the presence of the quotes important to you? That helps us to come up with inspiring ideas. — Klaas van Gend, Feb 26 '18 at 21:00
I get data from the user with double quotes and write it into the CSV, so I need to preserve the quotes when presenting this data back to the user. The data set will be huge, so adding the quotes back myself would be difficult, since I would have to remember which fields were quoted. So I am looking for a way to keep the double quotes on fields even after tokenizing. — dev, Feb 27 '18 at 05:03

sehe · Answer 1 · 2018-02-26T21:45:49.763

The quotes are a presentation thing. Once you parse/tokenize the data, you want the unescaped data back.

The quoted/escaped representation is to protect special characters in your data in transit only (to prevent them from interfering with your protocol¹).

Once you read it back, it is no longer in transit. To "keep" the escapes or quotes (or whatever other artefacts come with your protocol¹) would be an error, and is in fact a frequent source of bugs, not seldom security vulnerabilities².

Samples

CSV a or "a" corresponds to a value of a
likewise "\"" corresponds to "
"\\\"" corresponds to \"
"\" is incomplete (the quoted construct is not closed)

The important thing is that your values roundtrip without loss of information. If you parsed "a" as the value "a" (quotes included), converting it back to the quoted-escaped format would suddenly produce "\"a\"", which is an entirely different thing!
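To make the roundtrip argument concrete, here is a minimal sketch in standard C++ (no Boost). quote_field and unquote_field are hypothetical helpers, not Boost APIs. They assume the backslash-escape convention used in the samples above and are deliberately simplified: any escaped character is copied verbatim, so sequences such as \n are not given a special meaning.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode a value as a quoted field, escaping
// '"' and '\' with a backslash.
std::string quote_field(const std::string& value) {
    std::string out = "\"";
    for (char c : value) {
        if (c == '"' || c == '\\') out += '\\';
        out += c;
    }
    out += '"';
    return out;
}

// Hypothetical helper: decode one quoted field back to its value.
// Throws if the field is not a complete quoted construct.
std::string unquote_field(const std::string& field) {
    if (field.size() < 2 || field.front() != '"')
        throw std::invalid_argument("not a quoted field");
    std::string out;
    std::size_t i = 1;
    for (; i < field.size(); ++i) {
        char c = field[i];
        if (c == '\\') {
            // An escape must be followed by the character it protects.
            if (++i == field.size())
                throw std::invalid_argument("dangling escape");
            out += field[i];
        } else if (c == '"') {
            break;  // unescaped quote closes the field
        } else {
            out += c;
        }
    }
    // The closing quote must be the last character of the field.
    if (i + 1 != field.size())
        throw std::invalid_argument("unterminated quote or trailing data");
    return out;
}
```

With these, unquote_field(quote_field(v)) gives back v for any string v. If the quotes are instead kept as part of the value, encoding that value again yields "\"a\"" rather than "a", which is the information-changing roundtrip described above.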

¹ presentation format or transport protocol

² most commonly, code injection:

Code injection vulnerabilities (injection flaws) occur when an application sends untrusted data to an interpreter. Injection flaws are most often found in SQL, LDAP, XPath, or NoSQL queries; OS commands; XML parsers, SMTP headers, program arguments, etc. Injection flaws tend to be easier to discover when examining source code than via testing.[1] Scanners and fuzzers can help find injection flaws.[2]

I understand that the Boost tokenizer turns "a" into a when tokenizing, but I am looking for a way to keep the double quotes even after tokenizing. I need to present the data as I accepted it when writing it into the CSV. — dev, Feb 27 '18 at 04:59
You need to present the **data**. Not the CSV encoding. If you **also** want to display the source encoded form (that is **not** the data) then you need to use something else for parsing. E.g. http://coliru.stacked-crooked.com/a/dc5f0f1cfd91c456 — sehe, Feb 28 '18 at 15:48
See here for relevant evidence [that injection is a real problem](https://rhinosecuritylabs.com/azure/cloud-security-risks-part-1-azure-csv-injection-vulnerability/). Be safe! — sehe, Feb 28 '18 at 15:49
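If the source-encoded form really does have to be displayed, one option is to split the line without decoding it. The following is a minimal sketch in standard C++, and it is not the code behind the Coliru link above. split_raw_fields is a hypothetical helper, not part of Boost; it assumes comma separators, double-quote quoting and backslash escapes, and returns each field exactly as written.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not a Boost API): split one CSV line on commas that
// are outside double quotes, keeping quotes and escapes in the result.
std::vector<std::string> split_raw_fields(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        char c = line[i];
        if (c == '\\' && i + 1 < line.size()) {
            current += c;
            current += line[++i];  // keep the escaped character verbatim
        } else if (c == '"') {
            in_quotes = !in_quotes;  // track quoting, but keep the quote itself
            current += c;
        } else if (c == ',' && !in_quotes) {
            fields.push_back(current);  // separator outside quotes ends a field
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);  // last field has no trailing comma
    return fields;
}
```

For the line 1,"test" this returns the two fields 1 and "test", the second one with its quotes. These raw strings are the encoding, not the data, so they are suitable for display only; anything that processes the values should still work on the decoded form.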
