-2

I have written a function which takes a subset of data based on the value of name column.It Computes the outlier for column "mark" and replaces all the outliers. However when I try to combine these different subsets, the order of my elements changes. Is there any way by which I can maintain the order of my elements in the column "mark"

My data set is:

 name   mark  
    A  100.0   
    B  0.5   
    C  100.0  
    A  50.0   
    B  90.0   
    B  1000.0   
    C  1200.0   
    C  5000.0   
    A  210.0 

The function which I have written is :

data.frame(do.call("rbind", as.list(by(data, data$name, 
   function(x){apply(x[, .(mark)],2, 
   function(y) {y[y > (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] 
   <- (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark));y})}))))

The result of the above function is the first column below (I've manually added back name for illustratory purposes):

      mark      NAME
100.000 ----- A
50.000 ----- A
210.000 ----- A
0.500 ----- B
90.000 ----- B
839.625 ----- B
100.000 ----- C
1200.000 ----- C
4875.000 ----- C

In the above result, the order of the values for mark column are changed. Is there any way by which I can maintain the order of the elements ?

Hack-R
  • 22,422
  • 14
  • 75
  • 131

2 Answers2

1

Are you sure that code is doing what you think it is?

It looks like you're replacing any value greater than the median (third returned value of quantile) with the median + 1.5*IQR. Maybe that's what you intend, I don't know. The bigger problem is that you're doing that in an apply function, so it's going to re-calculate that median and IQR each iteration, updated with the previous rows already being changed. I'd wager that's not what you intend, but I suppose I've seen stranger.

A better option might be to create an external function to do the work, which takes in all of the data, does the calculation, then outputs all the data. I like dplyr for this simply because it's clean.

Reading your data in (why the "----"?)

scores <- read.table(text="
name   mark  
A 100.0   
B 0.5   
C 100.0  
A 50.0   
B 90.0   
B 1000.0   
C 1200.0   
C 5000.0   
A 210.0", header=TRUE)

and creating a function that does something a little more sensible; replaces any value greater than the 75% quantile (referenced by name so you know what it is) or less than the 25% quantile with that limiting value

scale_outliers <- function(data) {

  lim <- quantile(data, na.rm = TRUE)
  data[data > lim["75%"]] <- lim["75%"]
  data[data < lim["25%"]] <- lim["25%"]
  return(data)

}

Chaining this processing into dplyr::mutate is neat, and can then be passed on to ggplot. Here's the original data

gg1 <- scores %>% ggplot(aes(x=name, y=mark)) 
gg1 <- gg1 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg1

gg1

And if we alter it with the new function we get the data back without rows changed around

scores %>% mutate(new_mark = scale_outliers(mark))
#>   name   mark new_mark
#> 1    A  100.0      100
#> 2    B    0.5       90
#> 3    C  100.0      100
#> 4    A   50.0       90
#> 5    B   90.0       90
#> 6    B 1000.0     1000
#> 7    C 1200.0     1000
#> 8    C 5000.0     1000
#> 9    A  210.0      210

and we can plot that,

gg2 <- scores %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark)) 
gg2 <- gg2 + geom_point() + geom_boxplot()  + coord_cartesian(ylim=range(scores$mark))
gg2

gg2

Best of all, if you now want to do that quantile comparison group-wise (say, by the name column, it's as easy as using dplyr::group_by(name),

gg3 <- scores %>% group_by(name) %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark)) 
gg3 <- gg3 + geom_point() + geom_boxplot()  + coord_cartesian(ylim=range(scores$mark))
gg3

gg3

Hack-R
  • 22,422
  • 14
  • 75
  • 131
Jonathan Carroll
  • 3,897
  • 14
  • 34
0

A slightly refactored version of Hack-R's answer -- you can add a index to your data.table:

data <- data.table(name = c("A", "B","C", "A","B","B","C","C","A"),mark = c(100,0.5,100,50,90,1000,1200,5000,210))
data[,i:=.I]

Then you perform your calculation but you keep the name and i:

df <- data.frame(do.call("rbind", as.list(
    by(data, data$name,
       function(x) cbind(i=x$i, 
                     name=x$name,
                     apply(x[, .(mark)], 2,function(y) {y[y > (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] <- (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark));y})
                     )))))

And finally you order using the index:

df[order(df$i),]
  i name    mark
1 1    A     100
4 2    B     0.5
7 3    C     100
2 4    A      50
5 5    B      90
6 6    B 839.625
8 7    C    1200
9 8    C    4875
3 9    A     210
Hack-R
  • 22,422
  • 14
  • 75
  • 131
HubertL
  • 19,246
  • 3
  • 32
  • 51