I have data like the following:

A <- c("-0.00023--0.00243unitincrease", "-0.00176-0.02176pmol/Lincrease(replication)",
       "0.00180-0.01780%varianceunitdecrease")

I want to split each string into its numeric part and the remaining text, as two columns B and C. After extraction, the result should be the following data frame:

#                                           A                 B                           C
#               -0.00023--0.00243unitincrease -0.00023--0.00243                unitincrease
# -0.00176-0.02176pmol/Lincrease(replication)  -0.00176-0.02176 pmol/Lincrease(replication)
#        0.00180-0.01780%varianceunitdecrease   0.00180-0.01780       %varianceunitdecrease

How can I get that result in R?

jay.sf
Zhoufeng

2 Answers

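Use strsplit with a positive lookbehind and lookahead, which splits at the zero-width position between a digit and the first character of the text part. The class [a-z%] covers the lowercase letters a to z plus the % sign; expand it if the text part can start with other characters.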

r1 <- do.call(rbind, strsplit(A, "(?<=\\d)(?=[a-z%])", perl=TRUE))
res1 <- setNames(as.data.frame(cbind(A, r1)), LETTERS[1:3])
res1
#                                             A                 B                           C
# 1               -0.00023--0.00243unitincrease -0.00023--0.00243                unitincrease
# 2 -0.00176-0.02176pmol/Lincrease(replication)  -0.00176-0.02176 pmol/Lincrease(replication)
# 3        0.00180-0.01780%varianceunitdecrease   0.00180-0.01780       %varianceunitdecrease
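
If the text part can start with characters outside [a-z%] (an uppercase letter, say), one option is to invert the lookahead so that it matches anything that cannot belong to the number. The sketch below is a variation, not part of the answer above. The fourth input string is made up for illustration, and the approach assumes the text part never contains a digit directly followed by a letter or symbol, since that would produce an extra split.

```r
A <- c("-0.00023--0.00243unitincrease",
       "-0.00176-0.02176pmol/Lincrease(replication)",
       "0.00180-0.01780%varianceunitdecrease",
       "0.10-0.20Unitincrease")  # hypothetical input with an uppercase start

# Split after a digit when the next character is not a digit, a dot, or a hyphen
parts <- do.call(rbind, strsplit(A, "(?<=\\d)(?=[^\\d.-])", perl = TRUE))
res <- setNames(as.data.frame(cbind(A, parts)), LETTERS[1:3])
res
```

Negating the class avoids listing every possible first character of the unit text, at the cost of being stricter about what the text part may contain.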

You may also want the two numbers as separate numeric columns:

res2 <- type.convert(as.data.frame(
  do.call(rbind, strsplit(A, "(?<=\\d)-|(?<=\\d)(?=[a-z%])", perl=TRUE))))
res2
#         V1       V2                          V3
# 1 -0.00023 -0.00243                unitincrease
# 2 -0.00176  0.02176 pmol/Lincrease(replication)
# 3  0.00180  0.01780       %varianceunitdecrease

Here the first two columns are numeric:

str(res2)
# 'data.frame': 3 obs. of  3 variables:
# $ V1: num  -0.00023 -0.00176 0.0018
# $ V2: num  -0.00243 0.02176 0.0178
# $ V3: Factor w/ 3 levels "%varianceunitdecrease",..: 3 2 1
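
The Factor column above reflects older R defaults. On R 4.0 or later, where stringsAsFactors defaults to FALSE, a variant that passes as.is = TRUE explicitly should keep the text column as character. This is a sketch under that assumption; the column names lower, upper, and unit are illustrative and do not come from the question.

```r
A <- c("-0.00023--0.00243unitincrease",
       "-0.00176-0.02176pmol/Lincrease(replication)",
       "0.00180-0.01780%varianceunitdecrease")

# Split on a hyphen that follows a digit, or at the digit/text boundary
parts <- do.call(rbind, strsplit(A, "(?<=\\d)-|(?<=\\d)(?=[a-z%])", perl = TRUE))

# as.is = TRUE converts numeric-looking columns and leaves the rest as character
res2 <- type.convert(as.data.frame(parts, stringsAsFactors = FALSE), as.is = TRUE)
names(res2) <- c("lower", "upper", "unit")  # illustrative names
str(res2)
```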
jay.sf

In base R, you can use strcapture with a regex that has one capture group per output column.

Here A is divided into two columns, B and C. B holds an optionally negative decimal number, a -, and a second decimal number (which may itself be negative); C holds everything else:

result <- cbind(A, strcapture('(-?\\d+\\.\\d+.*-\\d+\\.\\d+)(.*)', A, 
                   proto = list(B = character(), C = character())))
result

#                                            A                 B                           C
#1               -0.00023--0.00243unitincrease -0.00023--0.00243                unitincrease
#2 -0.00176-0.02176pmol/Lincrease(replication)  -0.00176-0.02176 pmol/Lincrease(replication)
#3        0.00180-0.01780%varianceunitdecrease   0.00180-0.01780       %varianceunitdecrease
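
Because strcapture coerces each capture group to the type of the matching proto entry, the same idea can return the two bounds as numeric columns. The following is a minimal sketch, not part of the answer above: the three-group pattern and the names lower, upper, and unit are assumptions made for illustration, and perl = TRUE is passed only so that \\d is handled by PCRE.

```r
A <- c("-0.00023--0.00243unitincrease",
       "-0.00176-0.02176pmol/Lincrease(replication)",
       "0.00180-0.01780%varianceunitdecrease")

# Group 1: first number, group 2: second number, group 3: remaining text
out <- strcapture("(-?\\d+\\.\\d+)-(-?\\d+\\.\\d+)(.*)", A,
                  proto = list(lower = numeric(), upper = numeric(),
                               unit = character()),
                  perl = TRUE)
out
```

Strings that do not match the pattern come back as NA in every column, which makes malformed inputs easy to spot.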

You can use the same regex in tidyr::extract, which gives the same output.

library(magrittr)  # provides the %>% pipe

data.frame(A) %>%
  tidyr::extract(A, c('B', 'C'), '(-?\\d+\\.\\d+.*-\\d+\\.\\d+)(.*)', remove = FALSE)
Ronak Shah