2

Take this CSV file:

ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes"",300
4,""Surrounded with quotes"",300

It loads just fine in most statistical programs (R, SAS, etc.) but in Excel the third row is misinterpreted because it has two quotation marks. Escaping the last quote as \" will also not work in Excel. The only way I have found so far is to replace the one double quote with two double quotes:

ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes""",300
4,"""Surrounded with quotes""",300

But that would render the file completely useless for all other programs (R, SAS, etc.)

Is there a way to format the CSV file where strings can begin or end with the same characters as that used to surround them, such that it would work in Excel as well as commonly used statistical software?

Merik
  • 2,767
  • 6
  • 25
  • 41
  • 1
    That sounds backwards. The second representation is the standard way to represent embedded quotes in Excel and SAS. I am sure R can read such files also. The trick with adding backslash is what you would do if you were trying to type to line into a unix shell. – Tom Dec 04 '18 at 19:15
  • 2
    I am able to import the data into Excel using: (1) Data from text file; (2) Specify comma as delimiter; and (3) Specify text qualifier as {none}. Can you be more specific about the problem you're having? – xidgel Dec 04 '18 at 19:18
  • Please post what data you are trying to represent. For example does the value of NAME for case where ID=2 have quotes in it or not? – Tom Dec 04 '18 at 20:00

2 Answers2

1

Your second representation is the normal way to generate a CSV file and so should be easy to work with in any software. See the RFC 4180 specifications. https://www.ietf.org/rfc/rfc4180.txt

So your second example represents this data:

Obs    id    name                        value

 1      1    Blah                         100
 2      2    Has space                    200
 3      3    Ends with quotes"            300
 4      4    "Surrounded with quotes"     300

If you want to represent it as a delimited file where none of the values are allowed to contain the delimiter (in other words NOT as a standard CSV file) than it would look like:

id,name,value
1,Blah,100
2,Has space,200
3,Ends with quotes",300
4,"Surrounded with quotes",300

But if you want to allow the values to contain the delimiter then you need some way to distinguish embedded delimiters from real delimiters. So the standard forces values that contain the delimiter to be quoted. But once you do that you also need to also add quotes around fields that contain the quote character itself (and double the embedded quotes) to avoid making an ambiguous file. For example the quotes in the 4th observation in your first file look like they are optional quotes around a value instead of part of the value.

Many programs try to handle ambiguous situations. For example SAS does not allow values to contain embedded line breaks so you will always get four observations with your first example file.

But EXCEL allows the embedding of the end of line character(s) inside of quoted values. So in your original file the value of the second field in the third observations looks like what you would start to get if you added quotes around this value:

Ends with quotes",300
4,"Surrounded with quotes",300

So instead of 4 complete observations of three fields values in each there are only three observations and the last observation has only two field values.

Tom
  • 47,574
  • 2
  • 16
  • 29
  • I was not aware of this RFC, thanks for linking to it. I think that answers my question. Programs that do not obey this RFC should be fixed. – Merik Dec 06 '18 at 01:49
0

This is caused by the fact that escape character for " in Excel is "": Escaping quotes and delimiters in CSV files with Excel

A quick and simple workaround that comes to mind in R is to first read the content of the csv with readLines, then replace the double (escaped) double quotes with just one double quotes, and then read.table:

read.table(
  text = gsub(pattern = "\"\"", "\"", readLines("data.csv")),
  sep = ",", 
  header = TRUE
)
Jozef
  • 2,617
  • 14
  • 19