0

I'm trying to import some publicly available life outcomes data using the code below:

require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls", 
                        sheet = "Quick Lookup", verbose = TRUE)

Naturally, the imported data frame doesn't look good: enter image description here I would like to amend my column names using the code below:

# Clean column names
names(simd.sg.xls) <- make.names(names = as.character(simd.sg.xls[1,]),
                                    unique = TRUE,allow_ = TRUE)

But it produces rather unpleasant results:

> names(simd.sg.xls)
 [1] "X1"       "X1.1"     "X771"     "X354"     "X229"     "X74"      "X67"      "X33"      "X19"      "X1.2"    
[11] "X6"       "X1.3"     "X8"       "X7"       "X7.1"     "X6506"    "X21"      "X1.4"     "X6158"    "X6506.1" 
[21] "X6506.2"  "X6506.3"  "X6263"    "X6506.4"  "X6468"    "X1010"    "X815"     "X99"      "X58"      "X65"     
[31] "X60"      "X6506.5"  "X21.1"    "X1.5"     "X6173"    "X5842"    "X6506.6"  "X6506.7"  "X6263.1"  "X6506.8" 
[41] "X6481"    "X883"     "X728"     "X112"     "X69"      "X56"      "X54"      "X6506.9"  "X21.2"    "X1.6"    
[51] "X6143"    "X5651"    "X6506.10" "X6506.11" "X6263.2"  "X6506.12" "X6480"    "X777"     "X647"     "X434"    
[61] "X518"     "X246"     "X436"     "X6506.13" "X21.3"    "X1.7"     "X6136"    "X5677"    "X6506.14" "X6506.15"
[71] "X6263.3"  "X6506.16" "X660"     "X567"     "X480"     "X557"     "X261"     "X456"  

My question is if there is a way to neatly force the values from the first row to the column names? As I'm doing a lot of data I'm looking for solution that would be easily reproducible, I can accommodate a lot of violation to the actual strings to get syntactically correct names but ideally I would avoid faffing around with elaborate regular expressions as I'm often reading files like the one linked here and don't wan to be forced to adjust the rules for each single import.

Konrad
  • 17,740
  • 16
  • 106
  • 167

1 Answers1

1

It looks like the problem is that the header is on the second line, not the first. You could include a skip=1 argument but a more general way of dealing with this using read.xls seems to be to use the pattern and header arguments which force the first line which matches the pattern string to be treated as the header. Your code becomes:

require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls", 
                        sheet = "Quick Lookup", verbose = TRUE, 
                        pattern="DATAZONE", header=TRUE)

UPDATE

I don't get the warning messages you do when I execute the code. The messages refer to an issue with locale. The locale settings on my system are:

Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Yours are probably different. Locale data could be OS dependent. I'm using Windows 8.1. Also I'm using Strawberry Perl; you appear to be using something else. So some possible reasons for the discrepancy in warning messages but nothing more specific.

On the second question in your comment, to read the entire file, and convert a particular row ( in this case, row 2) to column names, you could use the following code:

simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls", 
                        sheet = "Quick Lookup", verbose = TRUE, 
                        header=FALSE, stringsAsFactors=FALSE)

   names(simd.sg.xls) <- make.names(names = simd.sg.xls[2,],
                                   unique = TRUE,allow_ = TRUE)
   simd.sg.xls <- simd.sg.xls[-(1:2),]

All data will be of character type so you'll need to convert to factor and numeric as necessary.

WaltS
  • 5,410
  • 2
  • 18
  • 24
  • It works but why so many [warnings](http://pastebin.com/HxsGUy22). Why this cannot be addressed through `make.names`? I wanted to have the full table somewhere and then modify it accordingly. – Konrad Apr 19 '15 at 17:15