Reading oddly formatter CSV file

Question

I'm looking at downloading some data from the statistics.gov.scot website. For example, I would like to source some data on the rates of hospital admissions. The query to source the data table I'm interested in is of format:

http://statistics.gov.scot/slice/observations.csv?&dataset=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions&http%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23measureType=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fmeasure-properties%2Fratio&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fage=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fage%2Fall&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fgender=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fgender%2Fall

and be accessed via this link, for those who want to try. The query generates a *.CSV file with the relevant information, however, the format of the file poses some challenges.

File example

The file content looks like that:

Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00
http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions
measure type,""
Admission Type,""
Age,""
Gender,""
Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"

,,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003,http://reference.data.gov.uk/id/year/2004,http://reference.data.gov.uk/id/year/2005,http://reference.data.gov.uk/id/year/2006,http://reference.data.gov.uk/id/year/2007,http://reference.data.gov.uk/id/year/2008,http://reference.data.gov.uk/id/year/2009,http://reference.data.gov.uk/id/year/2010,http://reference.data.gov.uk/id/year/2011,http://reference.data.gov.uk/id/year/2012
http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262","9,261","9,347","9,723","10,517","10,293","10,150","10,024","10,232","10,194"

when imported to Excel:

However, when imported to R via the read.csv it looks like that:

> head(problematicFile)
                                                   V1                        V2
1             Generated by http://statistics.gov.scot 2016-03-15T10:36:29+00:00
2 http://statistics.gov.scot/data/hospital-admissions       Hospital Admissions
3                                        measure type                          
4                                      Admission Type                          
5                                                 Age                          
6                                              Gender

Problem

The read.csv import returns only two columns. I'm guessing that the problem relates to some of the initial columns being empty. I want to read this file in a manner similar to the illustrated import achieved in Excel. The point is that, I intend to use vales from the row 7 in columns A and B and, naturally, the data table below. In terms of generating the data.frame I would be happy to contain NA values where there are empty cells but to be of the dimensions equivalent to those in Excel. I tried:

read.csv(file = link, header = FALSE, na.strings = "",
                               fill = TRUE)

but I keep on arriving at the same problem.

Desired results

The desired results should look like that (extract generated by hand):

Generated by http://statistics.gov.scot 2016-03-15T10:41:28+00:00   NA  NA  NA  NA  NA  NA  NA
http://statistics.gov.scot/data/hospital-admissions Hospital Admissions NA  NA  NA  NA  NA  NA  NA
measure type    NA  NA  NA  NA  NA  NA  NA  NA
Admission Type  NA  NA  NA  NA  NA  NA  NA  NA
Age NA  NA  NA  NA  NA  NA  NA  NA
Gender  NA  NA  NA  NA  NA  NA  NA  NA
Measure (cell values):  Ratio (Rate Per 100,000 Population)         NA  NA  NA  NA  NA
NA  NA  NA  NA  NA  NA  NA  NA  NA
NA  NA  http://reference.data.gov.uk/id/year/2002   http://reference.data.gov.uk/id/year/2003   http://reference.data.gov.uk/id/year/2004   http://reference.data.gov.uk/id/year/2005   http://reference.data.gov.uk/id/year/2006   http://reference.data.gov.uk/id/year/2007   http://reference.data.gov.uk/id/year/2008
http://purl.org/linked-data/sdmx/2009/dimension#refArea Reference Area  2002    2003    2004    2005    2006    2007    2008
http://statistics.gov.scot/id/statistical-geography/S92000003   Scotland    9,351   9,262   9,261   9,347   9,723   10,517  10,293
http://statistics.gov.scot/id/statistical-geography/S16000082   Angus South 8,236   8,500   8,523   8,371   8,616   8,978   9,325
http://statistics.gov.scot/id/statistical-geography/S16000106   Edinburgh Northern and Leith    9,040   8,040   7,925   9,042   10,355  11,833  8,916
http://statistics.gov.scot/id/statistical-geography/S16000140   Renfrewshire South  9,391   9,122   9,491   9,586   10,425  10,900  11,065
http://statistics.gov.scot/id/statistical-geography/S16000108   Edinburgh Southern  5,878   5,910   6,101   6,035   7,426   9,343   6,766
http://statistics.gov.scot/id/statistical-geography/S16000075   Aberdeen Donside    10,047  10,963  10,629  10,512  10,383  10,787  10,685
http://statistics.gov.scot/id/statistical-geography/S16000137   Perthshire North    9,388   9,524   7,799   9,350   9,543   9,791   9,991
http://statistics.gov.scot/id/statistical-geography/S16000077   Aberdeenshire East  7,211   7,300   7,153   7,411   7,435   7,268   7,547
http://statistics.gov.scot/id/statistical-geography/S16000114   Galloway and West Dumfries  9,861   9,165   8,143   9,258   7,508   10,213  10,399
http://statistics.gov.scot/id/statistical-geography/S16000096   Dumbarton   8,703   8,570   8,727   9,310   9,389   9,885   10,237

Screenshot

Just to illustrate further, I want to maintain the dimensions and populate missing values with NAs:

score 2 · Accepted Answer · answered Feb 01 '17 at 17:35

Parsing the metadata from the headers is a bit tricky. You might prefer to download the whole normalised dataset instead of that cross-tabulated slice.

> reconv <- read.csv("http://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions")

> head(reconv)

  GeographyCode DateCode Measurement                              Units Value Gender Age
1     S92000003     2003        Mean Average reconvictions per offender  0.62    All All
2     S92000003     2004        Mean Average reconvictions per offender  0.33    All All
3     S92000003     2004        Mean Average reconvictions per offender  0.61    All All
4     S92000003     2005        Mean Average reconvictions per offender  0.60    All All
5     S92000003     2006        Mean Average reconvictions per offender  0.60    All All
6     S92000003     2007        Mean Average reconvictions per offender  0.11    All All

This will put all of the metadata in factor levels (so you don't have to parse it):

> str(reconv)

'data.frame':   10119 obs. of  7 variables:
 $ GeographyCode: Factor w/ 26 levels "S12000005","S12000006",..: 26 26 26 26 26 26 26 26 26 26 ...
 $ DateCode     : int  2003 2004 2004 2005 2006 2007 2007 2008 2008 2009 ...
 $ Measurement  : Factor w/ 2 levels "Mean","Ratio": 1 1 1 1 1 1 1 1 1 1 ...
 $ Units        : Factor w/ 2 levels "Average reconvictions per offender",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Value        : num  0.62 0.33 0.61 0.6 0.6 0.11 0.57 0.6 0.33 0.33 ...
 $ Gender       : Factor w/ 3 levels "All","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Age          : Factor w/ 6 levels "21-25","26-30",..: 4 4 4 4 4 4 4 4 4 4 ...

You can select the slice you're interested in:

> slice <- subset(reconv, Measurement=="Ratio" & Gender=="All" & Age=="All")

And get back to the original cross-tabulated slice if you want:

> library(reshape2)
> dcast(slice, GeographyCode ~ DateCode, value.var="Value", fun.aggregate = first)

   GeographyCode 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
1      S12000005 41.4 34.3 41.0 40.7 37.4 37.2 33.3 34.6 35.8 33.0 32.8
2      S12000006 34.9 36.0 31.9 34.2 31.1 28.7 27.9 29.6 27.5 26.8 27.0
3      S12000008 33.7 33.2 33.7 33.2 31.7 32.8 30.4 31.5 29.1 28.1 28.7
4      S12000010 26.7 24.5 25.7 26.9 26.7 27.8 29.3 25.1 22.4 29.0 28.2
5      S12000013 31.7 26.1 30.6 35.4 31.6 25.9 24.0 18.9 30.5 22.8 18.6
...

Ujjwal Kumar · Answer 2 · 2016-03-15T11:22:02.613

1

You need to manually specify the col.names in order to force read.csv to read more than one column. Also specifying na.strings as an empty string would keep NA values in empty columns.

read.csv(<parameters>, col.names=c("Col1","Col2".....), na.strings="")

edited Mar 15 '16 at 11:22

answered Mar 15 '16 at 11:13

Ujjwal Kumar

581
3
12

Thanks for showing the interest but I'm precisely trying to avoid that. I need that information as it contains indicator names and some other data that I'm going to use. If I skip through the file I will have to read it **twice** once to get first 9 lines with the relevant metadata and then one more time to get the actual data. I want to avoid that, I want to have one big table with NAs placed in the blank columns and then reference values I need, **including** stuff from the first column. – Konrad Mar 15 '16 at 11:16
@Konrad see if the change helps – Ujjwal Kumar Mar 15 '16 at 11:22
`more columns than column names` The thing is that I won't know the dimensions of the file before importing it. Alternative approach could be to read it through `readLines` and then faff around with deriving the table from the lines with data and some other values from the first few rows. Ideally, I would rather have a table with NAs so I could do: `indicatorName <- x[7,2]` or whatever else I may need to pick from that stuff. – Konrad Mar 15 '16 at 11:26
Do you have any estimate on the max number of columns the data could have? If yes, then you can put that many column names when importing. Then you can check if a column is all `NA` and remove all such columns from imported data(if you want to keep the dataframe clean, if not then no need to remove such columns). Hope that helps – Ujjwal Kumar Mar 15 '16 at 11:31

score 0 · Answer 3 · answered Mar 15 '16 at 11:27

0

You could specify the number of columns by using read.table and the supply of column names:

read.table(file = link, 
           fill = TRUE,
           sep = ",",
           na.strings = "",
           col.names = paste("c", 1:12, sep = ""))

However, I don't know whether this is a good solution as you would need to know the number of columns a priori.

Another way would be to read the whole csv as string. Then you could pre-process by storing the header in another object (e.g. a list) and you could just use the "table part" as data frame.

answered Mar 15 '16 at 11:27

bet.bom

196
5

Thanks, it's a start. I was hoping that I will be able to somehow read everything at once as I could then jump around the `data.frame` and pick what I want. I've a list that sources those files in a loop so I can break it further into two objects one with header but was thinking that somehow it will be possible to avoid that. – Konrad Mar 15 '16 at 11:31

Reading oddly formatter CSV file

File example

Problem

Desired results

Screenshot

3 Answers3