-2

I have row-wise transaction data in the following long format

TransactionId ItemCode
101   123
101   521
102   423
103   871
103   982
103   131

I want to reshape this to get all items in one transaction in one row, so the new reshaped table looks like this (wide)

TransactionID ItemCode(s)
101   123   521
102   423
103   871   982   131

I have tried melting and casting in R, but i am a bit confused about the how the methods work. Also my dataset is huge, it has 30M rows. How should I go about this process?

SuhrudL
  • 21
  • 1
  • 5
  • your output table isn't in rectangular form. You have too few columns relative to data below it (which has three columns) – Cyrus Mohammadian Aug 25 '16 at 16:44
  • @CyrusMohammadian I edited the transaction ids, maybe this makes what I want to do more clear. My rows will not always have the same number of outputs, because transactions can have variable number of items in them. – SuhrudL Aug 25 '16 at 17:24
  • Do you want your item codes in separate columns or in one column, and if in one column, do you want them as a character vector or as a data.frame per cell? – aichao Aug 25 '16 at 17:27
  • @aichao multiple columns is fine, with one item code in each cell in a row. – SuhrudL Aug 25 '16 at 17:35
  • 1
    I just tried `dcast(products_sample, TransactionID~ ItemCode)`, this gives me a sparse matrix with all the item codes as columns and the transaction ids as row names. Can i convert this into a format described above? – SuhrudL Aug 25 '16 at 17:37

1 Answers1

1

You can use the plyr package to accomplish what you want. Assuming your data is in the data frame products_sample, you can use ddply with spread from the tidyr package. This will put the different ItemCodes from each TransactionId into different columns.

library(plyr)
librray(tidyr)
result <- ddply(products_sample, "TransactionId", spread, ItemCode, ItemCode)

Using the data you provided:

  TransactionId ItemCode
1           101      123
2           101      521
3           102      423
4           103      871
5           103      982
6           103      123

This gives

print(result)
##  TransactionId 123 521 423 131 871 982
##1           101 123 521  NA  NA  NA  NA
##2           102  NA  NA 423  NA  NA  NA
##3           103  NA  NA  NA 131 871 982

The last two arguments to ddply are the key and value arguments passed to spread. The key identifies the column in products_sample whose values are to be used as the column names. The value identifies the column in products_sample whose values will populate the cells. See ?spread for details.

aichao
  • 7,375
  • 3
  • 16
  • 18
  • thanks! why are you specifying `ItemCode` twice? when i tried running this, i get the error: `Error: Duplicate identifiers for rows (1, 2)` – SuhrudL Aug 25 '16 at 17:49
  • @SuhrudL: one possible cause for your error message is that you used `TransactionId` as the key in spread. Recall that this spread operation is performed with the data grouped by `TransactionId`, the second argument in `ddply`. Consequently, the identifiers for the `key` will be the same for all rows in the `spread`. – aichao Aug 25 '16 at 18:08