Select min or max values within one cell (delimited string)

Question

I have a data frame where for each sample the columns can have multiple values, for example:

Gene       Pvalue1             Pvalue2              Pvalue3                  Beta
Ace    0.0381, ., 0.00357    0.01755, 0.001385    0.0037, NA , 0.039         -0.03,1,15
NOS          NA                  0.02              0.001, 0.00067              0.00009,25,30

I want to apply min() and max() for each gene's data (I have thousands of genes in total) in each column and get the smallest value for the pvalues but the largest value for columns such as the beta. So the output data would look like this:

Gene       Pvalue1             Pvalue2              Pvalue3                  Beta
Ace        0.00357              0.001385             0.0037                   15
NOS          NA                  0.02                0.00067                  30

I'm new to R and not sure if what I'm asking is possible, if there are multiple values in one cell are they viewed as strings?

Hi! It is easier if you would provide a sample that can be put into R directly.Your question "are they viewed as strings" could be easiest answered that way, but I would say they are a string indeed. Therefore I would try someting using the apply function, putting a combinations of functions in like: (for the minimum value) min(str_split()). — Annet, Jan 17 '20 at 11:17
Hi thank you for this. I will look into the apply function and your suggested combinations of functions straight away. How can I offer a sample that goes directly into R? Apologies if this is a basic question, that sounds like it would be very useful to give. — DN1, Jan 17 '20 at 11:20
using something like dput() will give the structure/data, the first few rows usually suffice to answer questions like this (for which you can use the head() function) — Annet, Jan 17 '20 at 11:29
It's possible (and sometimes very useful) to have multiple numeric values in one cell `df <- data.frame(a = I(list(c(1, 2, 3))), b = 1); df[[1,1]]`; the answers thusfar assume (perhaps with justification) that this is not the case — alan ocallaghan, Jan 17 '20 at 11:39

denis · Accepted Answer · 2020-01-20T08:12:33.190

A possible solution using stringr and dplyr:

library(dplyr)
library(stringr)

getmin = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x) min(as.numeric(x),na.rm = T) ) %>%
  unlist() 

df %>%
  mutate_at(names(df)[-1],getmin)

  Gene Pvalue1  Pvalue2 Pvalue3  Beta
1  Ace 0.00357 0.001385 0.00370 -3e-02
2  NOS     Inf 0.020000 0.00067 9e-05

Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In min(as.numeric(x), na.rm = T) :
  no non-missing arguments to min; returning Inf

The function getmin extract the number with str_extract_all:

 str_extract_all(df$Pvalue2,"[0-9\\.-]+")

[[1]]
[1] "0.01755"  "0.001385"

[[2]]
[1] "0.02"

It has the advantage of being insensible to space or other characters, but can extract just a dot. I then loop on this list to extract in each cell the minimum, and convert the list into a vector with unlist. Using the as.numeric() function convert the possible extracted . to NA.

the code df %>% mutate_at(names(df)[-1],getmin) just apply this function on all columns exept the first one

edit: if you want to avoid inf values, you can use this slight modified version:

min2 = function(x) if(all(is.na(x))) NA else min(x,na.rm = T)
getmin = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x)min2(as.numeric(x)) ) %>%
  unlist() 

df %>%
    mutate_at(names(df)[-1],getmin)

  Gene Pvalue1  Pvalue2 Pvalue3  Beta
1  Ace 0.00357 0.001385 0.00370 -3e-02
2  NOS      NA 0.020000 0.00067 9e-05

data:

df <- read.table(text = "
                 Gene       Pvalue1             Pvalue2              Pvalue3                  Beta
Ace    0.0381,.,0.00357    0.01755,0.001385    0.0037,NA,0.039         -0.03,1,15
                 NOS          NA                  0.02              0.001,0.00067              0.00009,25,30
                 ",header = T)

Thank you for this, this almost works except for some reason it values my negative numbers as positive numbers (and therefor selects the wrong min/max numbers). This is the explanation I've understood the most so I will keep working to see if I can resolve this, thank you for your help! — DN1, Jan 17 '20 at 11:48
I've been trying to figure it out, it seems like I need to modify ```"[0-9\\.]+"``` somehow to acconut for negative numbers, but I haven't found any similar problems. Please let me know if you know a solution to this. — DN1, Jan 17 '20 at 15:20
Sorry I didn't see the negative number. You can use `[0-9\\.-]` instead, see my edits. — denis, Jan 20 '20 at 08:10

ThomasIsCoding · Answer 2 · 2020-01-17T12:43:23.553

Here is a base R solution using regmatches + gregexpr to sort out numbers, i.e.,

dPvalue <- t(apply(df[grep("Pvalue",names(df))], 1, function(v) {
  unlist(Map(function(x) ifelse(length(x)>0, min(as.numeric(x)),NA), regmatches(v, gregexpr("-?\\d+(\\.\\d+)?",v))))
}))

Beta <- apply(df[grep("Beta",names(df))], 1, function(v) {
  unlist(Map(function(x) ifelse(length(x)>0, max(as.numeric(x)),NA), regmatches(v, gregexpr("-?\\d+(\\.\\d+)?",v))))
})

dfout <- cbind(df["Gene"],Pvalue,Beta)

such that

> dfout
  Gene Pvalue1  Pvalue2 Pvalue3 Beta
1  Ace 0.00357 0.001385 0.00370   15
2  NOS      NA 0.020000 0.00067   30

DATA

df <- structure(list(Gene = structure(1:2, .Label = c("Ace", "NOS"), class = "factor"), 
    Pvalue1 = structure(c(1L, NA), .Label = "0.0381,.,0.00357", class = "factor"), 
    Pvalue2 = structure(1:2, .Label = c("0.01755,0.001385", "0.02"
    ), class = "factor"), Pvalue3 = structure(2:1, .Label = c("0.001,0.00067", 
    "0.0037,NA,0.039"), class = "factor"), Beta = structure(1:2, .Label = c("-0.03,1,15", 
    "0.00009,25,30"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Thank you for this. This also almost works, but is producing 6000+ variables, do you know why that might be? I'll keep looking into fixing this because besides that the code does run. — DN1, Jan 17 '20 at 12:36
@DN1 If you want to keep clean in your global environment and have less variables, you can write your customized function to wrap all the things inside — ThomasIsCoding, Jan 17 '20 at 12:40
Thank you I understand it better now. I have 6000+ genes and it's given me the min max of the columns that I've wanted but it tranposes the results so pvalues per gene has become a row itself. I'll look into transposing this thanks again for your help — DN1, Jan 17 '20 at 15:05

score 1 · Answer 3 · answered Jan 17 '20 at 12:50

Using data.table, convert wide-to-long, split on comma, get min for P-values and max for Betas, and finally convert back to long-to-wide.

library(data.table)

dt1 <- fread("
Gene       Pvalue1             Pvalue2              Pvalue3                  Beta
Ace    0.0381,.,0.00357    0.01755,0.001385    0.0037,NA,0.039         -0.03,1,15
NOS          NA                  0.02              0.001,0.00067              0.00009,25,30
            ")

dcast(
  melt(dt1, id.vars = "Gene")[, paste0("col", 1:3) := lapply(tstrsplit(value, ","), as.numeric) 
                              ][, MinMax := ifelse(grepl("Pvalue", variable),
                                                   pmin(col1, col2, col3, na.rm = TRUE),
                                                   pmax(col1, col2, col3, na.rm = TRUE)) ],
  Gene ~ variable, value.var = "MinMax")

#    Gene Pvalue1  Pvalue2 Pvalue3 Beta
# 1:  Ace 0.00357 0.001385 0.00370   15
# 2:  NOS      NA 0.020000 0.00067   30
# Warning message:
# In lapply(tstrsplit(value, ","), as.numeric) : NAs introduced by coercion

Note: same steps can be applied using dplyr/tidyr.

Jaap · Answer 4 · 2020-01-21T09:37:51.293

Another option is using data.table and matrixstats:

library(data.table)
library(matrixStats)

pval_cols <- grep("Pvalue", names(DT), fixed = TRUE, value = TRUE)

min_fun <- function(x) {
  y <- tstrsplit(x, split = ",", fixed = TRUE)
  y <- rowMins(sapply(y, as.numeric), na.rm = TRUE)
  y <- replace(y, !is.finite(y), NA)
  return(y)
}

DT[, (pval_cols) := lapply(.SD, min_fun)
   , .SDcols = pval_cols][]

which gives:

> DT
   Gene Pvalue1  Pvalue2 Pvalue3          Beta
1:  Ace 0.00357 0.001385 0.00370    -0.03,1,15
2:  NOS      NA 0.020000 0.00067 0.00009,25,30

For the Beta-column(s) you can create a similar max_fun: just replace rowMins with rowMaxs.

score 0 · Answer 5 · answered Jan 17 '20 at 11:35

Here's the general idea.

applyFunctionToString <- function(
    string
  , sep = ","
){
    string <- gsub(" ", "", string)
    string <- unlist(strsplit(string, sep))
    string[string == "NA"] <- NA
    numbers <- as.numeric(string)
    min(numbers, na.rm = TRUE)
}

sapply(c("0.01755, 0.001385", "0.0037, NA , 0.039"), applyFunctionToString)

You actually want to get into string operations, convert each string into a numeric vector and then do your summary function (minor max).

The code I wrote here works in this instance, but you have to consider more factors:

Do your strings contain other characters the need to be removed?
What are missing values represented by?

You can also pass the function that you want to apply (min for example), but then you have other questions like how do you pass additional arguments to that function (using ...) - this would be out of scope.

Hope, it still helps a little.

You should test the `str_extract` and `str_extract_all` functions from `stringr`, which allows most of what your home made function does — denis, Jan 17 '20 at 11:40

Select min or max values within one cell (delimited string)

5 Answers5

Linked