I'm just learning R for data science, and used these few lines to extract numbers from data (using data.table):

library(stringr)
library(data.table)
prods[, weights := str_extract(NombreProducto, "([0-9]+)[kgKG]+")]
prods[, weights := str_extract(weights, "[0-9]+")]
prods[, weights := as.numeric(weights)]

Here's an example of the 'NombreProducto' field I want to extract numbers/text from:

"Tostado 210g CU BIM 1182"

Is there an easy way to do this in a succinct one-liner? I tried

prods[, weights := str_match(NombreProducto, "([0-9]+)[kgKG]+")[2]]

but it set everything in the 'weights' column to the first result from the data.table. This is from the Grupo Bimbo Kaggle competition by the way.

cchamberlain
wordsforthewise

2 Answers

Without stringr, you can use base R's sub with the pattern ".*?(\\d+)[kgKG].*" and a backreference:

s = "Tostado 210g CU BIM 1182"

sub(".*?(\\d+)[kgKG].*", "\\1", s)
# [1] "210"
  • (\\d+)[kgKG] matches one or more digits followed by one of the letters k, K, g, or G;
  • the .* before and after the pattern consume the rest of the string, so everything except the captured digits is removed;
  • the ? after the first .* makes it non-greedy (lazy), so it stops before the first digit and all three digits end up in the capture group;
  • \\1 refers to the capture group (\\d+).
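One thing to watch when applying this to a whole column: sub returns the input unchanged when the pattern does not match, so a product name without a weight would come back as the full name rather than NA. A minimal sketch of one way to handle that, assuming a plain character vector (the second and third sample strings are made up for illustration, and the names nombres, pattern, raw and weights are hypothetical):

```r
# Sample product names; only the first comes from the question,
# the other two are invented for illustration.
nombres <- c("Tostado 210g CU BIM 1182",
             "Pan Blanco 1kg BIM 2233",
             "No Weight Here BIM 999")

pattern <- ".*?(\\d+)[kgKG].*"

# sub() is vectorised, so it handles the whole vector in one call.
raw <- sub(pattern, "\\1", nombres)

# sub() leaves non-matching strings untouched, so blank those out
# explicitly before converting to numeric.
raw[!grepl(pattern, nombres)] <- NA

weights <- as.numeric(raw)
```

The grepl step is my addition, not part of the original answer; without it, as.numeric would still give NA for the unmatched name, but with a coercion warning.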
Psidom
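We can do this with stringr in a single line using a regex lookahead, which requires the digits to be followed by k, g, K or G without including that letter in the match.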

prods[, weights := as.numeric(str_extract(NombreProducto, "([0-9]+)(?=[kgKG])"))] 
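As a rough, self-contained check of the one-liner, here is a sketch on a small made-up table (only the first row comes from the question; the other rows and the weights2 column are hypothetical). It also shows why the str_match attempt in the question misbehaved: str_match returns a character matrix with the full match in column 1 and the first capture group in column 2, so `[2]` picks a single element that `:=` then recycles across all rows, while `[, 2]` picks the whole capture-group column.

```r
library(data.table)
library(stringr)

# Made-up sample table; only the first row comes from the question.
prods <- data.table(NombreProducto = c("Tostado 210g CU BIM 1182",
                                       "Pan Blanco 1kg BIM 2233",
                                       "No Weight Here BIM 999"))

# Lookahead version: keep digits only when immediately followed by k, g, K or G.
# str_extract() returns NA where nothing matches.
prods[, weights := as.numeric(str_extract(NombreProducto, "[0-9]+(?=[kgKG])"))]

# str_match() version: index the capture-group column with [, 2], not [2].
prods[, weights2 := as.numeric(str_match(NombreProducto, "([0-9]+)[kgKG]")[, 2])]
```

Both columns should hold the same values, with NA for the row that has no weight.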
akrun