read.csv() in R with all character columns and one numeric column

Question

I am having trouble importing a CSV file into R (using read.csv()).

I would like to import a CSV data file into a data frame in R and set all columns except "Value" as character columns. The "Value" column should be numeric. Can someone help me, please?

I have done this many times with other files, but this one for some reason does not cooperate. The problem might be caused by the fact that the file is European style (decimal is "."). I am not sure.

This is the link to a file: https://www.dropbox.com/s/9kqjiy5phj9qkg3/albania_%2B.csv?dl=0

Yes, thank you. I am aware of it but I could not make it work with this file. — carpediem, May 12 '15 at 03:44
You are aware of which? Nb, for European style CSV files, there is read.csv2(). — gung - Reinstate Monica, May 12 '15 at 03:46
Your file looks pretty screwed up, actually. Are all those quote marks meant to be there? Have you tried opening the file in another program (like Excel)? — Hong Ooi, May 12 '15 at 04:01

G. Grothendieck · Answer 1 · 2015-05-12T04:43:46.613

2

Read the file with readLines and remove the first (^") and last ("$) double quote on each line, as well as any double quote followed by another double quote ("(?=")), creating L. Then use read.csv to read L, specifying as.is = TRUE so that text columns stay "character" (rather than becoming factors) alongside the "numeric" ones.

L <- gsub('^"|"$|"(?=")', '', readLines("albania_+.csv"), perl = TRUE)    
DF <- read.csv(text = L, as.is = TRUE)

giving:

> str(DF)
'data.frame':   544 obs. of  10 variables:
 $ Country.or.Area: chr  "Albania" "Albania" "Albania" "Albania" ...
 $ Year           : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ Area           : chr  "Urban" "Urban" "Urban" "Urban" ...
 $ Sex            : chr  "Female" "Female" "Female" "Female" ...
 $ Age            : chr  "Total" "0 - 4" "5 - 9" "10 - 14" ...
 $ Record.Type    : chr  "Estimate - de facto" "Estimate - de facto" "Estimate - de facto" "Estimate - de facto" ...
 $ Reliability    : chr  "Final figure, complete" "Final figure, complete" "Final figure, complete" "Final figure, complete" ...
 $ Source.Year    : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ Value          : num  763925 39796 42761 55894 68627 ...
 $ Value.Footnotes: logi  NA NA NA NA NA NA ...
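In the result above, Year and Source.Year still come in as integer. If every column except Value really must be character, as the question asks, one possible approach is to pass colClasses to read.csv. The following is only a sketch on a few made-up lines in the cleaned-up format (the sample data and the names `hdr` and `cls` are illustrative assumptions, not part of the original answer):

```r
# Made-up sample in the cleaned-up format (not the real file contents)
L <- c(
  'Country or Area,Year,Area,Age,Reliability,Value',
  '"Albania",2012,"Urban","Total","Final figure, complete",763925',
  '"Albania",2012,"Urban","0 - 4","Final figure, complete",39796'
)

# Read just the header row to learn the column names as read.csv reports them
hdr <- names(read.csv(text = L, nrows = 1))

# One class per column, by position: numeric for Value, character for the rest
cls <- ifelse(hdr == "Value", "numeric", "character")

DF <- read.csv(text = L, colClasses = cls)
str(DF)
```

Because `cls` is an unnamed vector with one entry per column, it is matched by position, so it does not matter how read.csv rewrites header names such as "Country or Area".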

Here is a visualization of the regular expression:

^"|"$|"(?=")

[Figure: regular expression visualization (Debuggex demo)]
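To see what the regular expression does, it can be tried on a couple of lines that imitate the malformed layout. The lines below are made up for illustration (an assumption about the layout, based on the description of the file in this thread), not copied from the real file:

```r
# Made-up lines: each is wrapped in an extra pair of quotes and has
# its field quotes doubled
raw <- c(
  '"Country,Year,Reliability,Value"',
  '"""Albania"",2012,""Final figure, complete"",763925"'
)

# Drop the leading quote, the trailing quote, and the first quote of any
# doubled pair (the lookahead leaves the second quote of the pair in place)
L <- gsub('^"|"$|"(?=")', '', raw, perl = TRUE)
print(L)

# The cleaned lines now parse as ordinary CSV; the embedded comma in
# "Final figure, complete" stays inside one field
DF <- read.csv(text = L, as.is = TRUE)
str(DF)
```

perl = TRUE is needed here because the lookahead (?=") is a Perl-style construct.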

edited May 12 '15 at 04:43

answered May 12 '15 at 04:01

G. Grothendieck


This doesn't seem to be correct. The `Final figure, complete` seems to be part of the same field. Plus, it's the `Value Footnotes` field that is always empty. See my answer, as I guess that produces a more correct `data.frame`. – nicola May 12 '15 at 04:07
I very much appreciate your help. I am getting an error message when I enter the first line of your code. The message looks as follows: `Warning message: In readLines("albania_+.csv") : incomplete final line found on 'albania_+.csv'` What am I doing wrong? – carpediem May 12 '15 at 04:18
@mayerkat Again, that is not an error, it's a warning. Just don't care about it. – nicola May 12 '15 at 04:19
@nicola I see now... I have checked it and you are absolutely right. Thank you for your help and patience. I really appreciate it. – carpediem May 12 '15 at 04:33
Have revised to fix the `Final figure, complete` problem. – G. Grothendieck May 12 '15 at 04:33
Yes, based on your comment I revised, and looking at your answer now it does seem both of them are very similar. I did use a different regular expression and only one gsub, but I think it's inevitable that they will be similar since you can't use read.csv/read.table in that many ways. – G. Grothendieck May 12 '15 at 04:48
@G.Grothendieck Aside from the regex used, the issue comes from the bad quotes the file had. Your very first answer didn't correctly deal with that issue, while mine did and explained what had to be done with the quotes. Next, you proposed an exact copy of my answer and now you enriched it with a (very smart and elegant) different regex. Honestly, had I been in your shoes, I'd have just deleted the answer and made a comment pointing out your clever regex. If you want to keep your answer, I will delete mine, since they are basically the same, mine is more recent and has a less elegant solution. – nicola May 12 '15 at 05:00

nicola · Answer 2 · 2015-05-12T04:18:48.393

I took a look at your file and it seems very badly formatted. There are three issues:

Every line starts with an unnecessary quote (").
Every line ends with an unnecessary quote (").
Quotes are doubled for some reason. Instead of "fieldvalue" you have ""fieldvalue"" in your file.

This is just a workaround to read this file (don't worry about the warning you'll receive after the first line):

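 # Read the raw lines (the "incomplete final line" warning is harmless)
 textfile <- readLines("albania_+.csv")
 # Inner gsub: drop the stray quote at the start and end of each line.
 # Outer gsub: collapse every doubled quote into a single one.
 x <- gsub('"{2}', '"', gsub('(^"|"$)', "", textfile))
 res <- read.csv(text = x, stringsAsFactors = FALSE)
 str(res)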
 #'data.frame': 544 obs. of  10 variables:
 #$ Country.or.Area: chr  "Albania" "Albania" "Albania" "Albania" ...
 #$ Year           : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 #$ Area           : chr  "Urban" "Urban" "Urban" "Urban" ...
 #$ Sex            : chr  "Female" "Female" "Female" "Female" ...
 #$ Age            : chr  "Total" "0 - 4" "5 - 9" "10 - 14" ...
 #$ Record.Type    : chr  "Estimate - de facto" "Estimate - de facto" "Estimate - de facto" "Estimate - de facto" ...
 #$ Reliability    : chr  "Final figure, complete" "Final figure, complete" "Final figure, complete" "Final figure, complete" ...
 #$ Source.Year    : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 #$ Value          : num  763925 39796 42761 55894 68627 ...
 #$ Value.Footnotes: logi  NA NA NA NA NA NA ...
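The warning mentioned above appears when a file's last line is not terminated by a newline; readLines still returns every line. As a small sketch on a throwaway temporary file (the file and its contents are made up for illustration), the result is the same with or without the warning, and warn = FALSE silences it:

```r
# Write a tiny file whose last line has no trailing newline
tmp <- tempfile(fileext = ".csv")
cat("a,b\n1,2", file = tmp)

# Default call: a warning is raised, but the lines are still returned.
# The handler just reports the warning text and then lets execution continue.
lines_warned <- withCallingHandlers(
  readLines(tmp),
  warning = function(w) {
    message("warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# Same read with the warning switched off
lines_quiet <- readLines(tmp, warn = FALSE)

identical(lines_warned, lines_quiet)
unlink(tmp)
```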

Thank you for your help. When I enter the 1st line of your code, I get an error message: `Warning message: In readLines("albania_+.csv") : incomplete final line found on 'albania_+.csv'` — carpediem, May 12 '15 at 04:11
@mayerkat You just get a warning, not an error. Go ahead and it should work. — nicola, May 12 '15 at 04:17
