
I have the following row in a CSV file that I am ingesting into a Splunk index:

"field1","field2","field3\","field4"

Excel and the default Python CSV reader both correctly parse that as 4 separate fields. Splunk does not: it seems to be treating the backslash as an escape character and interpreting `field3","field4` as a single mangled field. It is my understanding that the standard way to escape a double quote inside a quoted CSV field is another double quote, according to RFC 4180:

"If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
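The Python claim above is easy to check with the standard library's `csv` module, using its defaults (no custom dialect):

```python
import csv
import io

# The problematic row, exactly as it appears in the file.
# (The Python string literal needs "\\" for the single literal backslash.)
raw = '"field1","field2","field3\\","field4"'

# csv.reader defaults to doublequote=True and escapechar=None, which matches
# RFC 4180: "" escapes a quote inside a quoted field; backslash is plain text.
row = next(csv.reader(io.StringIO(raw)))

print(row)       # ['field1', 'field2', 'field3\\', 'field4']
print(len(row))  # 4
```

The third field comes back as `field3\` (a literal trailing backslash), and the row has four fields, exactly as Excel reports.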

Why is Splunk treating the backslash as an escape character, and is there any way to change that behavior, via props.conf or otherwise? I have set:

INDEXED_EXTRACTIONS = csv
KV_MODE = none

for this sourcetype in props.conf, and it is working fine for rows without backslashes in them.

UPDATE: Yeah so Splunk's CSV parsing is indeed not RFC-4180 compliant, and there's not really any workaround that I could find. In the end I changed the upstream data pipeline to output JSON instead of CSVs for ingestion by Splunk. Now it works fine. Let this be a cautionary tale if anyone stumbles across this question while trying to parse CSVs in Splunk!
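For anyone making the same switch, here is a minimal sketch of the pipeline change (field names and values are just placeholders). Emitting one JSON object per line sidesteps the quoting ambiguity, because `json.dumps` escapes both backslashes and double quotes unambiguously, and Splunk can ingest it with `INDEXED_EXTRACTIONS = json`:

```python
import json

# Hypothetical record containing the characters that tripped up
# Splunk's CSV parser: backslashes and embedded double quotes.
record = {
    "field1": "plain value",
    "field2": "trailing backslash \\",
    "field3": 'embedded "quotes" and \\',
}

# One JSON object per line (newline-delimited JSON).
line = json.dumps(record)
print(line)
```

Round-tripping the line through `json.loads` recovers the original record exactly, which is precisely the guarantee the CSV path was failing to provide.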

chmod_007
  • Splunk's interpreting the `\"` *exactly* as you describe from the RFC citation. A backslash before a quote mark is *precisely* how you escape it. That Excel *isn't* following the rules is...concerning, honestly :) – warren Jul 19 '22 at 15:43
  • @warren the RFC says you escape a double quote by "preceding it with another double quote", not a backslash. As in, `""` rather than `\"` – chmod_007 Jul 19 '22 at 15:56
  • but *commas* are being used to separate fields - with *quotes* being used to contain the values in a given field :) – warren Jul 19 '22 at 16:16
  • @warren yes. And then if you have a double quote inside a quoted field value, it must be preceded by another double quote. Backslashes should be irrelevant. See section 2 paragraph 7: https://datatracker.ietf.org/doc/html/rfc4180 Excel, Python, and every other application I have tried follow this standard. Splunk apparently does not. – chmod_007 Jul 19 '22 at 16:23
  • 1
    Also for reference, backlash appears nowhere in the grammar: ` file = [header CRLF] record *(CRLF record) [CRLF] header = name *(COMMA name) record = field *(COMMA field) name = field field = (escaped / non-escaped) escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE non-escaped = *TEXTDATA COMMA = %x2C CR = %x0D ; DQUOTE = %x22; LF = %x0A; CRLF = CR LF; TEXTDATA = %x20-21 / %x23-2B / %x2D-7E; ` – Steve Cox Jul 19 '22 at 16:36
  • 1
    Since the ascii code for '\' is 0x5C (0x2D < 0x5C < 0x7E), a compliant csv parser must consider it TEXTDATA – Steve Cox Jul 19 '22 at 16:42
  • @SteveCox exactly! That was my reading as well. I just wish the Splunk documentation would mention that it does not comply with the dominant CSV standard :( – chmod_007 Jul 19 '22 at 16:45
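Steve Cox's observation in the comments can be checked mechanically. A quick sketch verifying that the backslash codepoint (0x5C) falls inside RFC 4180's TEXTDATA ranges, so a compliant parser must treat it as ordinary field text:

```python
# RFC 4180: TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
TEXTDATA_RANGES = [(0x20, 0x21), (0x23, 0x2B), (0x2D, 0x7E)]

def is_textdata(ch: str) -> bool:
    """Return True if the character is TEXTDATA under the RFC 4180 grammar."""
    return any(lo <= ord(ch) <= hi for lo, hi in TEXTDATA_RANGES)

print(is_textdata("\\"))  # True: 0x5C lies within %x2D-7E, so '\' is plain text
print(is_textdata('"'))   # False: 0x22 is DQUOTE, excluded from TEXTDATA
print(is_textdata(","))   # False: 0x2C is COMMA, excluded from TEXTDATA
```

In other words, the grammar gives backslash no special role at all, which is why treating it as an escape character is non-compliant.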

0 Answers