0

I'd like to create a weighted average from one data frame and add it to another. Typically, I do this in SQL, but I cannot in this case. The example I give here is very simplified.

The first data frame is called Customer. It has a CustomerID column.

Customer <- data.frame(
     CustomerID = sample(1:10)
)

The second data frame is called Order. It has four fields: CustomerID, Year, Weight, and TotalCost.

Order <- data.frame(
     CustomerID = sample(1:9, 100, replace=TRUE),
     Year = sample(2014:2016, 100, replace=TRUE),
     Weight = sample(1:3, 100, replace=TRUE),
     TotalCost = sample(200:400, 100, replace=TRUE)
)

I'd like to add a column WeightedCost to the Customer data frame that is the weighted average TotalCost for that customer, calculated Sum(Weight*TotalCost)/Sum(Weight) limited by the CustomerID field and where the Year > 2015.

I've looked at looping thru the Customer table, but I'm pretty sure there is a better vectorized solution.

Additionally (late edit), I'd like to know how to implement this if the year comes from the Customer table rather than explicitly defined. Here is the new Customer data frame:

c = c(1,1,2,2,3,3,4,4,5,5)
y = c(2014,2015,2014,2015,2014,2015,2014,2015,2014,2015)
Customer <- data.frame(
    CustomerID = c,
    Year = y
)
Soma Holiday
  • 185
  • 1
  • 1
  • 13

2 Answers2

2

Update

No idea if OP is still interested, but for posterity...

First, may I say that If you had asked the right question in the first place, and shown your desired output, you would likely have got an answer quicker. Secondly, I have rolled back to my first answer because it at least answered the question as originally phrased, whereas later edits were clearly not what you were looking for.

Rolled back answer

library(dplyr)
Order %>% 
     filter(Year > 2015) %>% 
     group_by(CustomerID) %>% 
     transmute(w.mean = sum(TotalCost * Weight) / sum(Weight) %>% 
     slice(1) %>% 
     full_join(Customer)

Edit - Added join to Customer data

Result

CustomerID     w.mean
     1         321.5556
     2         264.3333
     3         231.2000
     4         397.0000
     5         250.0000
     6         266.6250
     7         237.0000
     8         258.2000
     9         384.8333
    10         NA

Updated answer

Finally I have added a new series of manipulations that add all the weighted costs from back to the specified year. I'm still kind of guessing what your outcome should look like, but this is what you were asking for, right?

Data manipulation

Order %>% 
     group_by(CustomerID, Year) %>% 
     summarize(w.mean = weighted.mean(TotalCost, Weight)) %>% #sum(TotalCost * Weight) / sum(Weight)) %>%
     arrange(CustomerID, desc(Year)) %>% 
     mutate(w.mean = cumsum(w.mean)) %>%
     right_join(Customer2, c('CustomerID', 'Year'))

Output

CustomerID  Year   w.mean
         1  2014 939.3500
         1  2015 602.3500
         2  2014 860.7063
         2  2015 566.9286
         3  2014 780.8819
         3  2015 522.4274
         4  2014 922.6154
         4  2015 569.6154
         5  2014 945.9679
         5  2015 654.7179

If this is still the wrong output, and you are still interested, please post an example of the correct output.

  • I'm not familiar with the dplyr package, though I did install it and add the library. When I put your code into R verbatim, it doesn't "complete." It is looking for another line of code. – Soma Holiday Jan 25 '17 at 15:58
  • Yes, there was a bracket missing from the end of the `transmute` line - I edited it last night, I guess after you copied it? - I just double checked it - looks fine now, sorry :) –  Jan 25 '17 at 16:01
  • How would I implement this if the year (in this case: 2015) was a column in the Customer table? – Soma Holiday Jan 25 '17 at 16:22
  • @Soma Holiday, I don't know what `Customer` looks like and I am not sure what your desired output is now. Please have a read of the links Imo posted under your question –  Jan 25 '17 at 16:35
  • I still show you limiting by 2015 in your filter and mutate statements. The results should be overlapping where 1/2014 includes orders from 2015 and 2016 while 1/2015 includes just orders from 2016. – Soma Holiday Jan 27 '17 at 16:08
1

Consider a base R solution with aggregate(), transform(), and merge():

# CREATE WEIGHT COST PRODUCT UNIT LEVEL COLUMN
Order$WgtCostProd <- Order$Weight * Order$TotalCost

# AGGREGATE WITH NEW WEIGHTEDCOST COLUMN
OrderAgg <- transform(aggregate(.~CustomerID, Order[Order$Year > 2015,], FUN=sum),
                      WeightedCost = WgtCostProd / Weight)

# LEFT JOIN MERGE TO ORIGINAL DATAFRAME
Customer <- merge(Customer, OrderAgg[c("CustomerID", "WeightedCost")], 
                  by="CustomerID", all.x=TRUE)
Customer

#    CustomerID WeightedCost
# 1           1     302.6667
# 2           2     281.2941
# 3           3     229.6667
# 4           4     308.6000
# 5           5     305.6667
# 6           6     315.0000
# 7           7     312.5833
# 8           8     333.0000
# 9           9     295.6667
# 10         10           NA
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • This gives me a WeightedCost.x (CustomerId or NAN) and WeightedCost.y (WeightedCost) column in the Customer table. – Soma Holiday Jan 24 '17 at 23:16
  • 1
    You are continually merging onto *Customer* df. With each run, rerun initial *Customer* with `data.frame` call. Or at end, create a new dataframe instead of overwriting it as this answer shows. – Parfait Jan 24 '17 at 23:53
  • How would I implement this if the year (in this case: 2015) was a column in the Customer table? – Soma Holiday Jan 25 '17 at 16:23
  • I like this solution, but I really want to see how it extends if we include the year from the Customer dataframe as a limiting factor. – Soma Holiday Jan 27 '17 at 16:10
  • In the `merge` just filter for year: `merge(Customer[Customer$Year==2015,], ...)` or `merge(subset(Customer, Year==2015), ...)`. – Parfait Jan 27 '17 at 17:25