5

I need to extract some numbers from a text. The text is:

x <- "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae;"

The numbers to be extracted are 325 and 232. They are inside brackets and at the end of a sentence. All other numbers should be excluded. I tried strsplit(x, "[A-Za-z]+"), but it does not give me what I need.

bartektartanus

4 Answers

5

Here's a stringi approach

x <- "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae; Claudii libidini, qui tum erat summo ne imperio, dederetur"

library(stringi)
stri_extract_all_regex(x, "(?<=[\\[(])\\d+(?=[\\])][.?!])")

## [[1]]
## [1] "325" "232"
hwnd
Tyler Rinker
  • 2
    Curious, doesn't `qdap` have a function for getting text between brackets? I thought I saw you use it a few times before. – Rich Scriven Aug 24 '14 at 19:33
  • Yeah it has `bracketXtract` but this regex is less general (forces digits between) thus more accurate. And I'm becoming a big fan of the `stringi` package with fast, consistent results. – Tyler Rinker Aug 24 '14 at 20:15
4

Another one:

r <- gregexpr("[[(]\\d+[])](?=\\.)", x, perl = TRUE)
(m <- regmatches(x, r)[[1]])
# [1] "(325)" "[232]"

as.integer(gsub("\\D", "", m))
# [1] 325 232
lukeA
3

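Here is a solution using strsplit. Note that it picks the numbers by position (the 3rd and 4th pieces), so it only works for this particular string: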

> x <- 'Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae;'
> strsplit(x, '[^0-9]+')[[1]][3:4]
## [1] "325" "232"

Or using base R to extract these values.

> regmatches(x, gregexpr('[[(]\\K\\d+(?=[])](?!,))', x, perl = TRUE))[[1]]
## [1] "325" "232"
hwnd
0

With Python's re module:

import re

string = "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae;"

print(string)

# Digits preceded by "[" or "(" and followed by "]" or ")" and a period
pattern = re.compile(r'(?<=[\[(])\d+(?=[\])]\.)')

result = pattern.findall(string)

print(result)  # ['325', '232']
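
As a possible extension of the snippet above, here is a minimal sketch that wraps the same lookaround idea in a function and returns integers. The function name `extract_bracketed_numbers` and the choice to also accept `?` and `!` as sentence endings (as the stringi answer does) are assumptions for illustration, not part of the original answer.

```python
import re

text = ("Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). "
        "Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? "
        "Sequitur disserendi ratio cognitioque 295. naturae;")

# Digits preceded by "[" or "(" and followed by "]" or ")" plus one of . ? !
# (assumption: ? and ! also count as sentence endings)
BRACKETED_AT_SENTENCE_END = re.compile(r"(?<=[\[(])\d+(?=[\])][.?!])")


def extract_bracketed_numbers(s):
    """Hypothetical helper: return the matching numbers as a list of ints."""
    return [int(m) for m in BRACKETED_AT_SENTENCE_END.findall(s)]


numbers = extract_bracketed_numbers(text)
print(numbers)  # [325, 232]
```

The pattern does not check that the opening and closing brackets are of the same kind, so a mixed pair such as `[12).` would also match; if that matters, two alternatives (one per bracket type) would be needed.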
Stefan Gruenwald