1

I'm struggling a bit with R and plyr and don't know how to obtain the result I'm interested in. I have a data frame that looks like this:

Region Price
Alentejano 71
Andalucia 30
Bordeaux 135
Bordeaux 500
Bordeaux 185

And so on. I would like to get the mean for each Region. So far I have tried plyr with this code:

means <- ddply(data, ~ Region, summarise, mean = mean(Price), sd=sd(Price))

which successfully gives me the standard deviation for regions where I have more than one observation. I do not get any means. How do I write code that gives me the mean for regions with multiple observations, but leaves the number as is if there is only one observation?

Jaap
Martin Andersen
  • If there is only one observation, you get NA for sd, do you want the observation to replace the NA? – akrun Oct 09 '15 at 03:51
  • The description is confusing. Can you update the question with the expected output (based on the data shown)? – akrun Oct 09 '15 at 04:25
  • What's the output you are getting? As you can see in my answer, I'm getting a correct output. – Jaap Oct 09 '15 at 07:05

2 Answers

1

Based on your code, you are using plyr, not dplyr. Taking the mean of a single observation simply returns the value of that observation. On your example data:

aggregate(Price ~ Region, dat, FUN = mean)

returns:

      Region    Price
1 Alentejano  71.0000
2  Andalucia  30.0000
3   Bordeaux 273.3333

As you can see, for regions "Alentejano" and "Andalucia" the same values as in the original data are returned.

Using the code you provided:

library(plyr)
ddply(dat, ~ Region, summarise, mean = mean(Price), sd=sd(Price))

I get:

      Region     mean       sd
1 Alentejano  71.0000       NA
2  Andalucia  30.0000       NA
3   Bordeaux 273.3333 197.8846

Which is the expected & valid outcome.
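To see why the single-observation regions come out this way, here is a minimal base R sketch using values copied from the example data. The variable names are made up for illustration. The mean of one value is the value itself, while the sample standard deviation divides by n - 1 and is therefore undefined for a single value:

```r
# Illustrative values taken from the example data
single  <- 71                 # Alentejano: one observation
several <- c(135, 500, 185)   # Bordeaux: three observations

mean_single  <- mean(single)   # the observation itself
sd_single    <- sd(single)     # NA: the n - 1 denominator is zero
mean_several <- mean(several)
sd_several   <- sd(several)

print(c(mean_single, sd_single, mean_several, sd_several))
```

So an NA in the sd column does not indicate a problem with the mean column.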

If you are using both plyr and dplyr, make sure that you load plyr before dplyr. Otherwise you will get the following warning message:

------------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
------------------------------------------------------------------------------------
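If load order is hard to control, one option is to call the functions with an explicit namespace prefix, so it does not matter which package was attached last. This is a rough sketch, assuming plyr is installed. The `requireNamespace()` guard and the `res` name are additions for illustration only:

```r
# Same example data as used in this answer
dat <- read.table(text = "Region Price
Alentejano 71
Andalucia 30
Bordeaux 135
Bordeaux 500
Bordeaux 185", header = TRUE)

# Only run if plyr is available; nothing is attached with library()
if (requireNamespace("plyr", quietly = TRUE)) {
  # plyr:: makes sure plyr's ddply() and summarise() are used,
  # even if dplyr is attached and masks summarise()
  res <- plyr::ddply(dat, ~ Region, plyr::summarise,
                     mean = mean(Price), sd = sd(Price))
  print(res)
}
```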

Used data:

dat <- read.table(text="Region Price
Alentejano 71
Andalucia 30
Bordeaux 135
Bordeaux 500
Bordeaux 185", header=TRUE)
Jaap
  • I think the OP also want SD. Also, the original code in OP's post gives the output you showed for the mean. I would say the expected output is not clear. It could be a case that the OP loaded both dplyr and plyr and got the masking issue. – akrun Oct 09 '15 at 06:56
  • @akrun both added :-) – Jaap Oct 09 '15 at 06:59
  • I am so sorry, the code did work. I had also loaded dplyr, although after plyr, as I should. For some reason my values were not numeric; after I fixed that I had no problem. – Martin Andersen Oct 11 '15 at 07:51
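
The asker's follow-up points to a non-numeric Price column as the real cause. The comment does not say what type the column actually had, so the sketch below assumes it was read in as character. The names `dat_chr`, `bad_mean` and `fixed` are hypothetical. It shows how to spot the problem with `str()` and convert the column in base R:

```r
# Hypothetical reproduction: Price stored as text instead of numbers
dat_chr <- data.frame(
  Region = c("Alentejano", "Andalucia", "Bordeaux", "Bordeaux", "Bordeaux"),
  Price  = c("71", "30", "135", "500", "185"),
  stringsAsFactors = FALSE
)

str(dat_chr)  # Price is listed as chr rather than num or int

# mean() on a character vector returns NA and emits a warning
bad_mean <- suppressWarnings(mean(dat_chr$Price))

# Going through as.character() first also handles factor columns
dat_chr$Price <- as.numeric(as.character(dat_chr$Price))

fixed <- aggregate(Price ~ Region, dat_chr, FUN = mean)
print(fixed)
```

Running `str()` on the data right after import is a cheap way to catch this before summarising.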
0

This will give you the needed answer:

means <- ddply(data, ~ Region, summarise, mean = mean(Price[duplicated(Price)]), sd=sd(Price))
Anuja Parikh
  • I get all `NaN` for the `mean` column based on the input dataset. – akrun Oct 09 '15 at 04:11
  • That's because there are no duplicates in the data provided. – Anuja Parikh Oct 09 '15 at 04:17
  • I don't know the expected output, but based on your code, the sd for Bordeaux is 197.8846 and all others are NA, the mean values are all NaN. The OP's original code gives `mean` values that are not NA.... – akrun Oct 09 '15 at 04:18
  • This is the expectation of the user "How do I make a code that gives me a mean for multiple observances, but leaves the number if there is only one observance?" As per that I am giving the code which gives mean for multiple observance and leaves the number if it has only one observance – Anuja Parikh Oct 09 '15 at 04:28
  • I am not sure. I get all NaN for the `mean` column based on your code, which can be achieved also by simply changing the 'Price' column to NA. – akrun Oct 09 '15 at 04:29
  • I have not tried the script; I have given the member a suggestion. Let him try it and answer. Whatever queries you have, please ask the member, not me. – Anuja Parikh Oct 09 '15 at 05:06
  • Listen, please mind your tone and then speak. The answer I gave was as per the question. What you are doing is only testing it on the 5 records given, whereas I tested it on my own sample data, which had unique and duplicated values as required by the member. My answer is specific to the question. I'm not trying it just on the data the member gave, because logically there are no duplicated values in that data, so there won't be any data left to take the mean of. That's why the mean gives NAs. – Anuja Parikh Oct 09 '15 at 06:14
  • It is your tone that is rude. I was just reporting that I got NaN. You said that it is not tested and ask the OP. Anyway, I can only test based on the input data OP given. Using the input data from OP, the OP's code give me `mean` values that are not NaN, but using your code, it gives NaN. Anyway, I think there is no point in continuing with this. – akrun Oct 09 '15 at 06:16
  • Do as per what you feel, Stop bothering me. – Anuja Parikh Oct 09 '15 at 06:18
  • More than one observation per group and duplicated are different. – akrun Oct 09 '15 at 06:23