
I'm trying to pull data using what I believe to be the Python flavor of regex. It has worked so far, but I've now come across data containing unwanted numbers (which vary across the documents I'm processing), so I'm wondering whether there's a way to skip past those numbers. The anchor I'm using will always be the same: Georgia in my example below. The words and numbers are all separated by little circles, which should make this fairly easy; I'm just having trouble applying some Stack Overflow help to my problem.

A sample of what I've been working with (the value I need is bolded):

Georgia * 372,000 * 0 * 0 * 145,982 * 36,000 * 0.09216

I've been using the pattern below to grab the anchor word, and then another piece of code, Match(0).Value, to grab the following word or number. That has worked until now. I tried changing that 0 to a 5 to grab the 6th value, but it's not letting me do that.

(?<=State\sName\s)(.*?(?=\s))

I've been looking at "RegEx skip word" to try to solve my problem, but I'm confused.

Update: I got some help from someone, who suggested I try this:

(Georgia)(?:\s*\*\s*\S+)(?:\s*\*\s*\S+)(?:\s*\*\s*\S+)(?:\s*\*\s*\S+)(?:\s*\*\s*\S+)\s*\*\s*([0-9,.]+)

I was able to use this shortened part of it

(Georgia)(?:\s*\*\s*\S+){5}

to highlight everything up to the value I want to extract, but I can't figure out how to highlight just the value I want.

user438383
bigdataguy

2 Answers


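I hope I understood correctly what you wanted.

You can use this regex and change the '4' in the repetition count to choose which term you get (the count is the number of values to skip after Georgia, so 0 returns the first value): https://regex101.com/r/zXiSTv/1/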

For example:

In [1]: import re

In [2]: def get_nth_element(text, element):
   ...:     result = re.search(r"(Georgia)(?: \* [^ ]+){{{}}} \* ([^ ]+)".format(element), text)
   ...:     return result.group(1), result.group(2)
   ...:

In [3]: get_nth_element("Georgia * 372,000 * 0 * 0 * 145,982 * 36,000 * 0.09216", 3)
Out[3]: ('Georgia', '145,982')

In [4]: get_nth_element("Georgia * 372,000 * 0 * 0 * 145,982 * 36,000 * 0.09216", 4)
Out[4]: ('Georgia', '36,000')

In [5]: get_nth_element("Georgia * 372,000 * 0 * 0 * 145,982 * 36,000 * 0.09216", 1)
Out[5]: ('Georgia', '0')

In [6]: get_nth_element("Georgia * 372,000 * 0 * 0 * 145,982 * 36,000 * 0.09216", 0)
Out[6]: ('Georgia', '372,000')
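
Note that the function above raises an AttributeError when the anchor is not in the text, because re.search returns None in that case. Below is a minimal sketch of a more defensive variant of the same idea. The name get_nth_value and the anchor parameter are made up for illustration, and it assumes the values are separated by a literal * with optional whitespace around it:

```python
import re


def get_nth_value(text, anchor, n):
    """Return the n-th (zero-based) value after `anchor`, or None if not found.

    Hypothetical helper: assumes values are separated by a literal '*'.
    """
    # re.escape protects against anchors containing regex metacharacters.
    # The non-capturing group skips n values; the capture group takes the next one.
    pattern = r"{}(?:\s*\*\s*[^\s*]+){{{}}}\s*\*\s*([^\s*]+)".format(
        re.escape(anchor), n
    )
    match = re.search(pattern, text)
    return match.group(1) if match else None


line = "Georgia * 372,000 * 0 * 0 * 145,982 * 36,000 * 0.09216"
print(get_nth_value(line, "Georgia", 4))  # 36,000
print(get_nth_value(line, "Florida", 4))  # None
```

Returning None instead of raising leaves it to the caller to decide how to treat documents where the anchor or the requested position is missing.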
Ron Serruya
  • hi thank you so much for helping! Unfortunately I'm a complete idiot and just realized that what I'm trying to do is R-Regex. Seriously thanks for helping though you're a beast for that! – bigdataguy Jun 18 '21 at 15:32

You might use 2 capture groups, and in the second capture group match digits with an optional comma-separated part

\b(Georgia)(?:[^*]*\*){5}\s*(\d+(?:,\d+)?)\b
  • \b A word boundary to prevent a partial match
  • (Georgia) Capture Georgia in group 1
  • (?:[^*]*\*){5} Repeat 5 times matching any char except * followed by matching *
  • \s* match optional whitespace chars
  • (\d+(?:,\d+)?) Capture 1+ digits, optionally followed by a comma and 1+ more digits (such as 36,000), in group 2
  • \b A word boundary

Regex demo

library(stringr)

s <- "Georgia * 372,000 * 0 * 0 * 145,982 * 36,000 * 0.09216"
str_match_all(s, "\\b(Georgia)(?:[^*]*\\*){5}\\s*(\\d+(?:,\\d+)?)\\b")[[1]][,2:3]

Output

[1] "Georgia" "36,000"
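
For comparison, here is a rough Python sketch of the same pattern, in case the regex ends up being run from Python instead of R. The function name extract_state_value is made up for illustration, and the final conversion to an integer is an extra step that is not part of the R code above:

```python
import re

# Same pattern as in the R example; a raw string avoids doubling the backslashes.
PATTERN = re.compile(r"\b(Georgia)(?:[^*]*\*){5}\s*(\d+(?:,\d+)?)\b")


def extract_state_value(text):
    """Return (anchor, value) from the two capture groups, or None if no match."""
    match = PATTERN.search(text)
    return match.groups() if match else None


s = "Georgia * 372,000 * 0 * 0 * 145,982 * 36,000 * 0.09216"
result = extract_state_value(s)
print(result)  # ('Georgia', '36,000')

# Optional: strip the thousands separator to get a number.
if result is not None:
    print(int(result[1].replace(",", "")))  # 36000
```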
The fourth bird