paste values within categories defined by multiple columns

Question

I want to pivot the result column in df horizontally creating a data set with a separate row for each region, state, county combination where the columns are ordered by year then city.

I also want to identify each row in the new data set by region, state and county and remove the white space between the four results columns. The code below does all of that, but I suspect it is not very efficient.

Is there a way to do this with reshape2 without creating a unique identifier for each group and numbering observations within each group? Is there a way to use apply in place of the for-loop to remove white space from a matrix? (Matrix here being used in a different manner than a mathematical or programming construct.) I realize those are two separate questions and maybe I should post each question separately.

Given that I can achieve the desired result and am only looking to improve the code I do not know whether I should even post this, but I am hoping to learn. Thanks for any advice.

df <- read.table(text= "
region   state    county city  year result
1          1        1      1     1     1
1          1        1      2     1     2
1          1        1      1     2     3
1          1        1      2     2     4
1          1        2      3     1     4
1          1        2      4     1     3
1          1        2      3     2     2
1          1        2      4     2     1
1          2        1      1     1     0
1          2        1      2     1    NA
1          2        1      1     2     0
1          2        1      2     2     0
1          2        2      3     1     2
1          2        2      4     1     2
1          2        2      3     2     2
1          2        2      4     2     2
2          1        1      1     1     9
2          1        1      2     1     9
2          1        1      1     2     8
2          1        1      2     2     8
2          1        2      3     1     1
2          1        2      4     1     0
2          1        2      3     2     1
2          1        2      4     2     0
2          2        1      1     1     2
2          2        1      2     1     4
2          2        1      1     2     6
2          2        1      2     2     8
2          2        2      3     1     3
2          2        2      4     1     3
2          2        2      3     2     2
2          2        2      4     2     2
", header=TRUE, na.strings=NA)

desired.result <- read.table(text= "
region   state    county results
1          1        1     1234
1          1        2     4321
1          2        1     0.00
1          2        2     2222
2          1        1     9988
2          1        2     1010
2          2        1     2468
2          2        2     3322
", header=TRUE, colClasses=c('numeric','numeric','numeric','character'))

# redefine variables for package reshape2 creating a unique id for each
# region, state, county combination and then number observations in
# each of those combinations

library(reshape2)

id.var <- df$region*100000 + df$state*1000 + df$county
obsnum <- sequence(rle(id.var)$lengths)

df2 <- dcast(df, region + state + county ~ obsnum, value.var = "result")

# remove spaces between columns of results matrix
# with a for-loop.  How can I use apply to do this?

x <- df2[,4:(4+max(obsnum)-1)]

# use a dot to represent a missing observation

x[is.na(x)] = '.'

x.cat = numeric(nrow(x))

for(i in 1:nrow(x)) {
  x.cat[i] = paste(x[i,], collapse="")
}

df3 <- cbind(df2[,1:3],x.cat)
colnames(df3) <- c("region", "state", "county", "results")
df3

df3 == desired.result

EDIT:

Matthew Lundberg's answer below is excellent. Afterwards I realized I also needed to create an output data set in which the four result columns above contain numeric, rational numbers and are separated by a space. So, I have posted an apparent way to do that below that modifies Matthew's answer. I do not know whether this is accepted protocol, but the new scenario seems so immediately related to the original post that I did not think I should post a new question.

Matthew Lundberg · Accepted Answer · 2012-12-31T23:15:54.907

1

I think this does what you want:

df$result <- as.character(df$result)
df$result[is.na(df$result)] <- '.'


aggregate(result ~ county+state+region, data=df, paste0, collapse='')

  county state region result
1      1     1      1   1234
2      2     1      1   4321
3      1     2      1   0.00
4      2     2      1   2222
5      1     1      2   9988
6      2     1      2   1010
7      1     2      2   2468
8      2     2      2   3322

This relies on your data frame being sorted in the proper order (as yours is).

edited Dec 31 '12 at 23:15

answered Dec 31 '12 at 23:09

Matthew Lundberg

42,009
6
90
112

Thank you for the outstanding answer. Later I realized I also needed an output data set in which the four result columns are numeric and are separated by a space. I could not modify your answer to do that, but I came close, and posted the code here. – Mark Miller Jan 01 '13 at 12:10

Mark Miller · Answer 2 · 2013-01-01T12:17:11.520

Matthew Lundberg's answer is excellent. Afterwards I realized I also needed to create an output data set in which the four result columns above contain numeric, rational numbers and are separated by a space. So, here I provide an apparent way to do that using a modification of Matthew's answer. I do not know whether this is accepted protocol, but the new scenario seems so immediately related to the original post that I did not think I should post a new question.

The first two lines are modifications of Matthew's answer.

df$result[is.na(df$result)] <- 'NA'
df2 <- aggregate(result ~ county+state+region, data=df, paste)

Then I specify that NA represents missing observations and use apply to obtain the numeric output.

df2$result[df2$result=='NA'] = NA
new.df <- data.frame(df2[,1:3], apply(df2$result,2,as.numeric))

The output is below except note that I added 0.5 to each value in df shown in the original post.

  county state region  X1  X2  X3  X4
     1     1      1   1.5 2.5 3.5 4.5
     2     1      1   4.5 3.5 2.5 1.5
     1     2      1   0.5  NA 0.5 0.5
     2     2      1   2.5 2.5 2.5 2.5
     1     1      2   9.5 9.5 8.5 8.5
     2     1      2   1.5 0.5 1.5 0.5
     1     2      2   2.5 4.5 6.5 8.5
     2     2      2   3.5 3.5 2.5 2.5

score 0 · Answer 3 · answered Jan 01 '13 at 20:06

In my original post I asked how to remove spaces between columns in a data set using apply. That did not prove necessary thanks to Matthew Lundberg's answer to my larger question. Nevertheless, removing spaces between columns of a data set is something I frequently have to do. For completeness, here I post a way to do that using paste0 and apply that arose, in part, from Matthew's answer.

To remove all spaces from the data set x:

x <- read.table(text= "
A    B    C    D
1    1    1    1
1    1    2    2
1   NA    1    3
1    1    2    4
1    2    1    5
1    2   NA    6
1    2    1    7
1    2    2    8
", header=TRUE, na.strings=NA)

# use a dot to represent a missing observation

x[is.na(x)] = '.'

y <- as.data.frame(apply(x, 1, function(i) paste0(i, collapse='')))
colnames(y) <- 'result'
y

Gives:

The following code removes the spaces between just the second and third columns:

z <- as.data.frame(apply(x[,2:3], 1, function(i) paste0(i, collapse='')))

y <- data.frame(x[,1], z, x[,4])
colnames(y) <- c('A','BC','D')
y

Giving:

No need to create the anonymous functions for `apply`. Use the `...` argument to pass to `paste0` instead. `apply(x, 1, paste0, collapse='')`, — Matthew Lundberg, Jan 01 '13 at 20:16

paste values within categories defined by multiple columns

3 Answers3