1

I have a genetic dataset where each row describes a gene, with a Beta column holding several beta values compressed into one cell (collapsed from the variant level, where multiple variants in one gene each gave a beta). Beta is the effect size the gene can have on a condition, so large negative values matter as much as large positive ones. I am trying to write code that selects the largest absolute value in each row, and then creates a new column recording whether that value was originally negative. I have a biology background, so I'm not sure whether this is possible or the best way to do it.

For example my data looks like this:

Gene    Beta
ACE     0.01, -0.6, 0.4
BRCA    0.7, -0.2, 0.2 
ZAP70   NA
P53     0.8, -0.6, 0.001

Expected output would be something like this (selecting the largest absolute value and keeping track of which numbers used to be negative):

Gene    Beta     Negatives
ACE      0.6         1
BRCA     0.7         0
ZAP70    NA          NA
P53      0.8         0

I am currently stuck on getting the largest absolute value from each row. This is what I am trying:

abs2 = function(x) if(all(is.na(x))) NA else abs(x,na.rm = T)
getabs = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x)abs2(as.numeric(x)) ) %>%
  unlist() 

test <- df %>%
  mutate_at(names(df)[2],getabs)

#Outputs:
 Error in abs(x, na.rm = T) : 2 arguments passed to 'abs' which requires 1 

Any help on getting the largest absolute value per cell/row would be appreciated. I assume I could then also make a column with the largest negative value, match it against the absolute values, and use that as my record of negatives.
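The error above arises because abs() takes a single argument and has no na.rm option; it is max() that reduces a vector to one number and accepts na.rm. A minimal sketch of a corrected version of the attempt, where max_abs and get_max_abs are hypothetical names and the beta vector is a stand-in copied from the example table:

```r
library(stringr)

# abs() accepts only one argument; na.rm belongs to max(), which does the
# per-cell reduction to a single number.
max_abs <- function(x) {
  if (all(is.na(x))) NA_real_ else max(abs(x), na.rm = TRUE)
}

# Hypothetical replacement for getabs(): one largest |beta| per cell.
get_max_abs <- function(col) {
  nums <- str_extract_all(col, "-?[0-9.]+")  # list of character vectors
  vapply(nums, function(x) max_abs(as.numeric(x)), numeric(1))
}

# Stand-in for df$Beta, using the values from the example table.
beta <- c("0.01, -0.6, 0.4", "0.7, -0.2, 0.2", NA, "0.8, -0.6, 0.001")
result <- get_max_abs(beta)
```

This only recovers the magnitude; the sign of the winning value still has to be recorded separately.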

Input data:

dput(df)
structure(list(Gene = c("ACE", "BRCA", "ZAP70", "P53"), `Beta` = c("0.01, -0.6, 0.4", 
"0.7, -0.2, 0.2", "0.001, 0.02, -0.003", "0.8, -0.6, 0.001")), row.names = c(NA, 
-4L), class = c("data.table", "data.frame"))
Sotos
DN1
  • Just to clarify, you wanted to select the _largest_ absolute value from all of the values contained in each row? (As well as recording whether the original value corresponding to this largest value was positive or negative). – Edward Mar 11 '20 at 10:26
  • Yes exactly this – DN1 Mar 11 '20 at 10:54
  • Related post: https://stackoverflow.com/q/60616503/680068 – zx8754 Mar 11 '20 at 11:03
  • Would be nice to add details on what you are trying to achieve. Are you trying to find the most significant variant in the gene, or wanting to assign the highest beta to a gene? Does it make sense statistically? – zx8754 Mar 11 '20 at 11:06
  • Is there a reason this can't be addressed by simply having two column, one for value, and one for absolute value? – caw5cv Mar 11 '20 at 13:16

3 Answers

4

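You can simply split each string, convert the pieces to numeric, find the index of the largest absolute value, and check whether that value is negative, i.e.

sapply(strsplit(df$Beta, ', '), function(i) {
  i1 <- as.numeric(i)        # betas for this gene as numbers
  i2 <- which.max(abs(i1))   # index of the largest |beta| (empty if all NA)
  if (length(i2) == 0) NA else i1[i2] < 0  # test the numeric value, not the string
}) * 1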

#[1]  1  0 NA  0
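
The snippet above returns only the negatives flag. As a rough base R sketch of the same idea that builds both requested columns at once (largest_abs is a hypothetical helper name, and the data frame is rebuilt here from the question's example table so the block runs on its own):

```r
# Toy data copied from the question's example table (ZAP70 has no betas).
df <- data.frame(
  Gene = c("ACE", "BRCA", "ZAP70", "P53"),
  Beta = c("0.01, -0.6, 0.4", "0.7, -0.2, 0.2", NA, "0.8, -0.6, 0.001"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one cell, return the largest absolute beta and
# a 0/1 flag saying whether that value was originally negative.
largest_abs <- function(cell) {
  v <- as.numeric(strsplit(as.character(cell), ",")[[1]])
  v <- v[!is.na(v)]
  if (length(v) == 0) return(c(Beta = NA_real_, Negatives = NA_real_))
  top <- v[which.max(abs(v))]   # first value wins if magnitudes tie
  c(Beta = abs(top), Negatives = as.numeric(top < 0))
}

res <- t(sapply(df$Beta, largest_abs, USE.NAMES = FALSE))
out <- data.frame(Gene = df$Gene, res)
```

If a gene has both 0.6 and -0.6, which.max() picks whichever comes first, so the flag depends on the order of the values in the cell.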
Sotos
3

One way using dplyr is to split the comma-separated values into separate rows, group_by Gene, take the maximum absolute value of Beta, and check whether that value is negative.

library(dplyr)

df %>%
  tidyr::separate_rows(Beta, sep = ",", convert = TRUE) %>%
  group_by(Gene) %>%
  summarise(Negatives = +(min(Beta) == -max(abs(Beta))),
            Beta = max(abs(Beta), na.rm = TRUE))

# A tibble: 4 x 3
#  Gene  Negatives   Beta
#  <fct>     <int>  <dbl>
#1 ACE           1    0.6
#2 BRCA          0    0.7
#3 P53           0    0.8
#4 ZAP70        NA   -Inf  
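
The -Inf for ZAP70 comes from calling max(..., na.rm = TRUE) on a group with no non-missing values. One possible variant, sketched under the assumption that an all-missing gene should give NA in both columns (the data frame is rebuilt from the question's example table so the block is self-contained):

```r
library(dplyr)
library(tidyr)

# Toy data copied from the question's example table.
df <- data.frame(
  Gene = c("ACE", "BRCA", "ZAP70", "P53"),
  Beta = c("0.01, -0.6, 0.4", "0.7, -0.2, 0.2", NA, "0.8, -0.6, 0.001"),
  stringsAsFactors = FALSE
)

out <- df %>%
  separate_rows(Beta, sep = ",\\s*", convert = TRUE) %>%  # one beta per row
  group_by(Gene) %>%
  summarise(
    # Sign of the value with the largest magnitude; NA if the gene has no betas.
    Negatives = if (all(is.na(Beta))) NA_integer_
                else as.integer(Beta[which.max(abs(Beta))] < 0),
    # Largest magnitude itself, guarded so an all-NA group gives NA, not -Inf.
    Beta = if (all(is.na(Beta))) NA_real_ else max(abs(Beta), na.rm = TRUE)
  )
```

Negatives is computed first so that it still sees the original Beta values before Beta is overwritten by its summary.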

data

df <- structure(list(Gene = structure(c(1L, 2L, 4L, 3L), .Label = c("ACE", 
"BRCA", "P53", "ZAP70"), class = "factor"), Beta = structure(c(1L, 
2L, NA, 3L), .Label = c("0.01, -0.6, 0.4", "0.7, -0.2, 0.2", 
"0.8, -0.6, 0.001"), class = "factor")), class = "data.frame", 
row.names = c(NA, -4L))
Ronak Shah
2

You can write your custom function f and vectorize it via Vectorize, i.e.,

f <- Vectorize(function(x) {
  v <- as.numeric(unlist(strsplit(as.character(x),split = ",")))
  c(Beta = max(abs(v)),Negatives = sum(v<0 & v==-max(abs(v))))
})

and then run

df <- cbind(df[1],t(f(df$Beta)))

such that

> df
   Gene Beta Negatives
1   ACE  0.6         1
2  BRCA  0.7         0
3 ZAP70   NA        NA
4   P53  0.8         0
ThomasIsCoding