
I have a CSV file with a structure similar to this:

hashID,value,flag
98fafd,   35,   1
fh56w2,   25,   0
ggjeas,   55,   1
adfh5d,   45,   0

I want to compute the median of the value column, including only the rows where flag == 1 in the calculation.

Is this even possible in R? I've searched around and haven't found anything like this.

Thomas
SansStef
    I'd suggest reading a few of the online FAQs for R. This is a very basic question that has been answered many times. You're looking to `subset` your data. The function for that in R is `[`. Look at `?"["` – Justin Jul 02 '13 at 21:03
    As Justin says it's really simple in R, and probably has been asked before many times, but I just went out and tried a couple of searches (using only the words in the title) and also searched through the "Introduction to R" and didn't really succeed. Could this question be another search target to the next questioner? – IRTFM Jul 02 '13 at 22:02
    Most of the other questions are about `mean` and use keywords like "subset", "conditional", or "specific rows" so I link to them here since it's essentially [the](http://stackoverflow.com/questions/12350783/find-max-mean-min-of-the-a-subset-in-r) [same](http://stackoverflow.com/questions/12394332/how-to-get-column-mean-for-specific-rows-only/12394419) [question](http://stackoverflow.com/questions/12555179/conditional-mean-statement/12587505), but without those terms. – Thomas Jul 03 '13 at 05:53

2 Answers


You can also do this in a quick one-liner by using a logical vector to index into the data frame column:

# read the data from a csv file
newdata <- read.csv("file.csv")
# this will give you a vector of boolean values of length nrow(newdata)
newdata$flag==1
# and this line uses the above vector to retrieve only those elements of 
# newdata$value for which the row contains a flag value of 1
median(newdata$value[newdata$flag==1])
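
As a minimal, self-contained sketch of the same indexing idea, the block below inlines the question's sample rows through the `text` argument of `read.csv` so it runs without a file. The extra row with a missing flag (`zzzzzz`) is hypothetical and is not part of the original data; it is there only to illustrate that an `NA` in the logical index carries through to the result unless it is dropped, for example with `which()`.

```r
# Sample rows from the question, inlined so no file is needed.
csv_text <- "hashID,value,flag
98fafd,   35,   1
fh56w2,   25,   0
ggjeas,   55,   1
adfh5d,   45,   0"
newdata <- read.csv(text = csv_text)

# Logical index: TRUE where flag equals 1.
keep <- newdata$flag == 1
flag_median <- median(newdata$value[keep])

# Hypothetical extra row with a missing flag (an assumption for illustration).
with_na <- rbind(newdata, data.frame(hashID = "zzzzzz", value = 99L, flag = NA))

# Indexing with a logical vector that contains NA yields an NA element,
# so the plain version propagates NA into the median.
naive_median <- median(with_na$value[with_na$flag == 1])

# which() keeps only the positions that are TRUE, dropping the NA flag.
safe_median <- median(with_na$value[which(with_na$flag == 1)])

print(flag_median)
print(naive_median)
print(safe_median)
```

If the flag column in the real file never contains missing values, the one-liner above is sufficient as written.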
LSE

Here is one possibility:

Read your data set using the following command:

newdata <- read.csv("stackoverflow questions/mediancol.csv")
# I assume you have the data in csv format

# Showing the data I used for the computation
newdata <- structure(list(hashID = structure(c(1L, 3L, 4L, 2L), .Label = c("98fafd", 
"adfh5d", "fh56w2", "ggjeas"), class = "factor"), value = c(35L, 
25L, 55L, 45L), flag = c(1L, 0L, 1L, 0L)), .Names = c("hashID", 
"value", "flag"), class = "data.frame", row.names = c(NA, -4L
))
> newdata
  hashID value flag
1 98fafd    35    1
2 fh56w2    25    0
3 ggjeas    55    1
4 adfh5d    45    0

# Subset the data to the rows where flag == 1
newdata1 <- subset(newdata, flag == 1)

# Look at the summary of the data

> summary(newdata1)
    hashID      value         flag  
 98fafd:1   Min.   :35   Min.   :1  
 adfh5d:0   1st Qu.:40   1st Qu.:1  
 fh56w2:0   Median :45   Median :1  
 ggjeas:1   Mean   :45   Mean   :1  
            3rd Qu.:50   3rd Qu.:1  
            Max.   :55   Max.   :1

# Only look at the median 
median(newdata1$value)
[1] 45
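
If the median is wanted for every flag value rather than only for flag == 1, one possible variation (not part of the answer above) is to group with base R's `tapply`. The sketch below rebuilds the same four rows with `data.frame` so it is self-contained; the variable name `medians_by_flag` is made up for illustration.

```r
# Same four rows as above, built directly so the example is self-contained.
newdata <- data.frame(
  hashID = c("98fafd", "fh56w2", "ggjeas", "adfh5d"),
  value  = c(35L, 25L, 55L, 45L),
  flag   = c(1L, 0L, 1L, 0L)
)

# Median of value within each distinct flag; the result is named by flag value.
medians_by_flag <- tapply(newdata$value, newdata$flag, median)
print(medians_by_flag)

# Pull out the flag == 1 group by name.
flag1_median <- as.numeric(medians_by_flag["1"])
print(flag1_median)
```

For a single group, the `subset` approach above is just as direct; grouping only pays off when several flag values need to be compared.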
Jd Baba