Only include outliers from each column in a dataframe

Question

I have a dataframe as follows:

 chr   leftPos         TBGGT     12_try      324Gtt       AMN2
  1     24352           34         43          19         43
  1     53534           2          1           -1         -9
  2      34            -15         7           -9         -18
  3     3443           -100        -4          4          -9
  3     3445           -100        -1          6          -1
  3     3667            5          -5          9           5
  3     7882           -8          -9          1           3

I have to create a loop which:

a) Calculates the upper and lower limit (UL and LL) for each column from the third column onwards.
b) Only includes rows that fall outside of the UL and LL (Zoutliers).
c) Then count the number of rows where the Zoutlier is the same direction (i.e. positive or negative) as the previous or the subsequent row for the same chr.

The output would therefore be:

 ZScore1    TBGGT     12_try      324Gtt       AMN2
 nrow        4         6            4           4

So far I have code as follows:

  library(data.table)#v1.9.5
  f1 <- function(df, ZCol){

  #A) Determine the UL and LL and then generate the Zoutliers
  UL = median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
  LL = median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
  Zoutliers <- which(ZCol > UL | ZCol < LL)

  #B) Exclude Zoutliers per chr if same direction as previous or subsequent row
  na.omit(as.data.table(df)[, {tmp = sign(eval(as.name(ZCol)))
  .SD[tmp==shift(tmp) | tmp==shift(tmp, type='lead')]},
  by=chr])[, list(.N)]}

  nm1 <- paste0(names(df)
  setnames(do.call(cbind,lapply(nm1, function(x) f1(df, x))), nm1)[]

The code is patched together from various places. The problem I have is combining parts A) and B) of the code to get the output I want

Is `Zcol` supposed to be essentially `3:ncol(df)` , i.e. all columns from number 3 onward, or just one column at a time? — Carl Witthoft, Apr 15 '15 at 11:34
It should calculate it one column at a time. I guess the output of the first part of the code should give me all the Z outliers with the chr and leftPos they are in which I think it does. The second part should then take that column and for each chr then assess each row as described. That's the idea. So should I pass Zoutliers to the second part? — Sebastian Zeki, Apr 15 '15 at 14:59
If I just concentrate on the first part- how would I get the Zoutliers with chr and leftPos associated which I could then pass to the second part of the problem — Sebastian Zeki, Apr 15 '15 at 15:03

Roman · Answer 1 · 2016-04-19T15:52:27.063

Can you try this function? I was not sure what alpha is, so I could not reproduce the expected output and included it as variable in the function.

# read your data per copy&paste
d <- read.table("clipboard",header = T)
# or as in Frank comment mentioned solution via fread
d <- data.table::fread("chr   leftPos         TBGGT     12_try      324Gtt       AMN2
                                     1     24352           34         43          19         43
                                     1     53534           2          1           -1         -9
                                     2      34            -15         7           -9         -18
                                     3     3443           -100        -4          4          -9
                                     3     3445           -100        -1          6          -1
                                     3     3667            5          -5          9           5
                                     3     7882           -8          -9          1           3")


# set up the function
foo <- function(x, alpha, chr){
  # your code for task a) and b)
  UL = median(x, na.rm = TRUE) + alpha*IQR(x, na.rm = TRUE)
  LL = median(x, na.rm = TRUE) - alpha*IQR(x, na.rm = TRUE)
  Zoutliers <- which(x > UL | x < LL)
  # part (c
  # factor which specifies the direction. 0 values are set as positives
  pos_neg <- ifelse(x[Zoutliers] >= 0, "positive", "negative")
  # count the occurrence per chromosome and direction.
  aggregate(x[Zoutliers], list(chr[Zoutliers], pos_neg), length)
}

# apply over the columns and get a list of dataframes with number of outliers per chr and direction.
apply(d[,3:ncol(d)], 2, foo, 0.95, d$chr)

Fyi, the package now offers the `fread` function which you can use to read in text like `DT= fread("text text text")` — Frank, Apr 19 '16 at 14:34
@Frank Oh, good to know. Included this function in my answer. — Roman, Apr 19 '16 at 15:50

Only include outliers from each column in a dataframe

1 Answers1

Linked