0

I am struggling to work out how to do this in R. I have data like this from a set of ~50 csv files, each detailing an individual books sale transaction:

**week 1**
**author** **title** **customerID**
author1 title1 1
author1 title2 2
author2 title3 3
author3 title4 3

**week 2**
**author** **title** **customerID**
author1 title1 4
author3 title4 5
author4 title5 1
author5 title6 6

... ~ 50 weeks, each from a separate csv file

I want to get a new table, each row representing an author that appears in the complete data set, and with columns for each of the ~50 weeks that I have data for. Each cell should be the number of book sales of that author in that week. That can be calculated simply from summing the number of rows with that author in that week's sales file. So it should look something like this:

**author** **week1** **week2** ... **week50**
author1 2 1 ...
author2 1 0 ...
author3 1 1 ...
author4 0 1 ...
author5 0 1 ...
...

Any ideas? I know how to get the list of unique authors to make the first column from. And I know how to load each week's sale data into a data frame. But I need help automating this process: 1) iterating over the unique authors 2) iterating over each week's csv file or data frame 3) summing the sales for that author in that week 4) adding count as the value for that cell

Could anyone help? Thanks :-)

Harry Palmer
  • 468
  • 1
  • 6
  • 17
  • 1
    You don't need a loop. Put all the data in one data.frame with an additional column `week` and than use `plyr::ddply` to aggregate. If you really need to, you can reshape the data.frame afterwards. – Roland Jul 10 '12 at 09:54
  • plyr:ddply looks promising. but surely a loop would be a good strategy because I need to do this for ~50 different weeks. – Harry Palmer Jul 10 '12 at 10:15

1 Answers1

1
text1<-"**week 1**
**author** **title** **customerID**
author1 title1 1
author1 title2 2
author2 title3 3
author3 title4 3
"

df1<-read.table(header=T,skip=1,stringsAsFactors=F,text=text1)
week1<-read.table(header=F,nrows=1,stringsAsFactors=F,text=text1,sep=";")
week1<-substr(week1,3,nchar(week1)-2)
df1$week<-rep(week1,nrow(df1))

text2<-"**week 2**
**author** **title** **customerID**
author1 title1 4
author3 title4 5
author4 title5 1
author5 title6 6
"

df2<-read.table(header=T,skip=1,stringsAsFactors=F,text=text2)
week2<-read.table(header=F,nrows=1,stringsAsFactors=F,text=text2,sep=";")
week2<-substr(week2,3,nchar(week2)-2)
df2$week<-rep(week2,nrow(df2))

df<-rbind(df1,df2)
names(df)<-c("author","title","customerID","week")

require(plyr)
agg<-ddply(df,~author*week,function(df) length(df$title))


require(reshape)
res<-cast(agg,author~week,value="V1",fill=0)
res

   author week 1 week 2
1 author1      2      1
2 author2      1      0
3 author3      1      1
4 author4      0      1
5 author5      0      1

You only need a loop to read in your data. For that you can use something like

ff<-list.files(pattern="*.[Cc][Ss][Vv]")
for (i in 1:length(ff)){
  code for reading the data 
  and constructing the data.frame 
}
Roland
  • 127,288
  • 10
  • 191
  • 288
  • You are welcome. If you give my answer a green tick people will know that the question has been answered to your satisfaction. – Roland Jul 10 '12 at 13:43
  • I've almost got it working but I can't get past this error while trying to do the reshaping (res<-cast(agg,author~week,value="V1",fill=0) ): "Error: Casting formula contains variables not found in molten data: author, week". And when I change the names of the columns with names(agg) <- c("author", "week", "count") I get this "Error in data.frame(data[, c(variables), drop = FALSE], result = data$value) : arguments imply differing number of rows: 18143, 0". Am I using cast wrong? – Harry Palmer Jul 10 '12 at 14:49
  • I noticed I wasn't including to quotation marks as you did. Now when I include it in the command, I get this: "Error in data.frame(rrownames(x), unx, check.names = FALSE) : invalid multibyte string 1259 ". I have absolutely no idea what this means and googling doesn't shed any light either. – Harry Palmer Jul 10 '12 at 15:00
  • Nevermind - it was just some non-standard characters in the author fields that were screwing up the cast function. You are an R saviour! :-) – Harry Palmer Jul 10 '12 at 16:00