Creating new variables based on two columns as index one column as new variable names python pandas or R

Question

Please help me edit the title if there's a better phrasing once you read the question.

I have data that looks like this:

Location    Date    Item    Price
 12           1       A       1
 12           2       A       2      
 12           3       A       4
 13           1       A       1
 13           2       A       4
 12           1       B       1
 12           2       B       8
 13           1       B       1
 13           2       B       2
 13           3       B       11

I want to use location and date to create a new variable for each item that is that items price, for example, the output I want is:

Location    Date    PriceA   PriceB
 12           1       1       1
 12           2       2       8      
 12           3       4       NaN
 13           1       1       1
 13           2       4       2
 13           3       NaN     11

akrun · Accepted Answer · 2014-12-11T05:06:27.257

3

You could try reshape from base R

 reshape(df, idvar=c('Location', 'Date'), timevar='Item', direction='wide')
 #    Location Date Price.A Price.B
 #1        12    1       1       1
 #2        12    2       2       8
 #3        12    3       4      NA
 #4        13    1       1       1
 #5        13    2       4       2
 #10       13    3      NA      11

Or

library(reshape2)
dcast(df, Location+Date~paste0('Price',Item), value.var='Price')
#    Location Date PriceA PriceB
#1       12    1      1      1
#2       12    2      2      8
#3       12    3      4     NA
#4       13    1      1      1
#5       13    2      4      2
#6       13    3     NA     11

Or you could use dcast.data.table (would be faster) after converting to data.table

library(data.table)
dcast.data.table(setDT(df)[,Item:=paste0('Price', Item)],
                                         ...~Item, value.var='Price')

Or

library(tidyr)
library(dplyr)
spread(df, Item, Price) %>%
                      rename(PriceA=A, PriceB=B)
#   Location Date PriceA PriceB
#1       12    1      1      1
#2       12    2      2      8
#3       12    3      4     NA
#4       13    1      1      1
#5       13    2      4      2
#6       13    3     NA     11

Update

If you don't need Price as prefix, just do:

dcast.data.table(setDT(df), ...~Item, value.var='Price')

and the reshape2 option would be

dcast(df,...~Item, value.var='Price')

data

df <- structure(list(Location = c(12L, 12L, 12L, 13L, 13L, 12L, 12L, 
13L, 13L, 13L), Date = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 
3L), Item = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"
), Price = c(1L, 2L, 4L, 1L, 4L, 1L, 8L, 1L, 2L, 11L)), .Names = c("Location", 
"Date", "Item", "Price"), class = "data.frame", row.names = c(NA, 
-10L))

edited Dec 11 '14 at 05:06

answered Dec 11 '14 at 04:44

akrun

874,273
37
540
662

I am running the first solution now. I have 266 unique Items and the total rows is about 3,000,000. Do you know which is more efficient offhand? – wolfsatthedoor Dec 11 '14 at 04:49
1

@robbieboy74 I am using `R 3.1.2`. Could you upgrade to recent version of R? I would think `tidyr` to be efficient. But, you could convert `data.frame` to `data.table` and use `dcast.data.table` which would be faster as well. – akrun Dec 11 '14 at 04:51
Is there an easy way to make it so the names of the new columns aren't Price.A and Price.B but rather just A and B? – wolfsatthedoor Dec 11 '14 at 05:02
1

@robbieboy74 By default, the `dcast`, `dplyr`, `dcast.data.table` creates only `A` and `B`. I created the `PriceA` etc after looking at your expected output. – akrun Dec 11 '14 at 05:03
1

@robbieboy74 Using the example dataset, I am not getting any error. – akrun Dec 11 '14 at 05:08
Sorry, ok, I got it all sorted! Thank you for such a great answer! – wolfsatthedoor Dec 11 '14 at 05:09
1

That's a nice collection! `tidyr`'s casting function is just a wrapper for `dcast` AFAICT, which wouldn't be as fast / memory efficient as `dcast.data.table`. – Arun Dec 11 '14 at 08:32

Creating new variables based on two columns as index one column as new variable names python pandas or R

1 Answers1

Update

data