I'd like to extract the numerical values from a string, for example:

x <- c("14 trucks and 3298 pounds of tuna",
       "228 gallons and 190 pounds of sand",
       "161751 barrels gell, 3438540 pounds proppant",
       "29 pounds of hay, 100 barrels of water, 30 pins")   

I would like to extract only the number that comes immediately before the word "pounds", so the result should be

c(3298, 190, 3438540, 29)   
Sinh Nguyen
srog

3 Answers

Using sub with a lazy .*?, which matches as few characters as possible until it reaches a number followed by "pounds", we can do

sub(".*?(\\d+) pounds.*", "\\1", x)
#[1] "3298"    "190"     "3438540" "29"   

You might want to wrap this in as.integer() to convert the result to integer.
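
One caveat is that sub returns the input unchanged when the pattern does not match, so as.integer would then give NA with a warning. A minimal sketch of an alternative, assuming Perl-style lookahead via perl = TRUE (a different technique from the sub call above), and using a made-up fifth record that has no "pounds" in it:

```r
x <- c("14 trucks and 3298 pounds of tuna",
       "228 gallons and 190 pounds of sand",
       "161751 barrels gell, 3438540 pounds proppant",
       "29 pounds of hay, 100 barrels of water, 30 pins",
       "no weight listed")   # hypothetical record, not in the question

# Locate digits that are directly followed by " pounds" (first hit per string);
# regexpr() returns -1 for elements without a match.
m <- regexpr("\\d+(?= pounds)", x, perl = TRUE)

# Start with NA everywhere, then fill in only the elements that matched.
out <- rep(NA_integer_, length(x))
out[m != -1] <- as.integer(regmatches(x, m))
out
#[1]    3298     190 3438540      29      NA
```

Here regmatches only returns the matched pieces, which is why they are assigned back through the m != -1 index.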

Ronak Shah

If "##### pounds" appears only once in each record, then this use of regex will work:

x <- c("14 trucks and 3298 pounds of tuna",
       "228 gallons and 190 pounds of sand",
       "161751 barrels gell, 3438540 pounds proppant",
       "29 pounds of hay, 100 barrels of water, 30 pins")

y <- gsub("(^|.* )(\\d+) pounds.*$", "\\2", x)

y
[1] "3298"    "190"     "3438540" "29" 
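
If a record can mention "pounds" more than once, a single substitution can only return one of the values. A hedged sketch for collecting all of them, assuming gregexpr with a Perl lookahead (perl = TRUE) and a made-up record y that is not part of the question:

```r
y <- "5 pounds of salt and 70 pounds of sugar"   # hypothetical record

# gregexpr() finds every run of digits directly followed by " pounds";
# regmatches() returns a list with one character vector per input string.
hits <- regmatches(y, gregexpr("\\d+(?= pounds)", y, perl = TRUE))

all_pounds <- as.integer(hits[[1]])
all_pounds
#[1]  5 70
```

For a vector of records, lapply(hits, as.integer) would give one integer vector per record.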
Sinh Nguyen

I am not sure whether your data contains floating-point numbers, so you can try the following code for general use:

as.numeric(gsub(".*?(\\d+\\.?\\d*)\\spound(s?).*","\\1",x))

which gives:

> as.numeric(gsub(".*?(\\d+\\.?\\d*)\\spound(s?).*","\\1",x))
[1]    3298     190 3438540      29

The digits after the optional decimal point are matched with \\d* so that single-digit values such as "5 pounds" still match. Since some values might be 1 pound or less, I used pound(s?) in the pattern so that the singular "pound" also matches.
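
To illustrate the decimal and singular cases, here is a small sketch on made-up data (the vector z is hypothetical, not from the question). It assumes Perl-style regex (perl = TRUE) with a lookahead, which is a variation on the gsub approach above rather than the same call:

```r
z <- c("0.5 pound of salt",        # hypothetical records
       "12.75 pounds of flour",
       "3 pounds of rice")

# Digits, an optional ".digits" part, then a lookahead requiring
# whitespace and "pound" or "pounds" as a whole word.
pat <- "\\d+(?:\\.\\d+)?(?=\\s+pounds?\\b)"

vals <- as.numeric(regmatches(z, regexpr(pat, z, perl = TRUE)))
vals
#[1]  0.50 12.75  3.00
```

Note that regmatches drops elements without a match, so this form assumes every record contains a weight.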

ThomasIsCoding