1

I am trying to extract string of text from a whole field with Openrefine. This is an extract of my dataset:

172.    D3B: 23Y1-Up, 27Y1-Up  (36 LK) 6-S/F Rollers, 4-D/F Rollers, 2-Carrier Rollers
179.    D3C: 23Y2508-UP (37LK) 6-S/F, 4-D/F, 2-T/C  
180.    27Y5050-UP (37LK) 6-S/F, 4-D/F, 2-T/C   
181.    2XF622-UP (37LK) 6-S/F, 4-D/F, 2-T/C    
182.    3RF0147-UP (36LK) 6-S/F, 4-D/F, 2-T/C
200.    D4D:67A1-UP, 78A1-UP, 85A1-UP, 86A1-UP, 59J1-644, 58J1-UP, 49J1-473, 22C1-UP, 91A1-UP, 88A1-UP

I want to extract 23Y1-Up, 27Y1-Up from record 172, 23Y2508-UP from record 179, 27Y5050-UP from record 180 and the whole 67A1-UP, 78A1-UP, 85A1-UP, 86A1-UP, 59J1-644, 58J1-UP, 49J1-473, 22C1-UP, 91A1-UP, 88A1-UP from record 200

So basically the rule would be to extract everything between :if present and ( if present. Maybe restricting it to where there is one or more occurrence of the string UP

So I am adding a new column based on existing column using value.match. I tried to adapt some query to my scope but I am very far from succeding despite multiple attempts. I started with this regex expression value.match(/\:?\s*(\w+\.?)+?.*/)[0] that I tought would isolate any word AFTER the semicolon (and the space) but it works only with words BEFORE... Yesterday I successfully extracted the numbers before the LK that is also relevant information for my dataset, but I can't grasp this.

Any help is much appreciated! Thanks

1 Answers1

0

Using match matches the whole string.

You can use a single capture group with a negated character class to exclude matching (

^[^:]*:\s*([^(]+).*$
  • ^[^:]*:\s* Match until the first : followed by optional whitespace chars
  • ( Capture group 1
    • [^(]+ Match 1+ occurrence of any char except (
  • ) Close group 1
  • .*$ Match the rest of the line

regex demo

Or capture in a group matching only word characters separated by a hyphen

^[^:]*:\s*(\w+-\w+(?:,\s+\w+-\w+)*).*$

regex demo

The fourth bird
  • 154,723
  • 16
  • 55
  • 70
  • Hi Thanks. This works great in regex, but not in openrefine's GREL using value.match. I am trying to figure out how to adapt it based on this document https://docs.openrefine.org/manual/grelfunctions/#string-functions – datacleaner Mar 04 '21 at 10:30
  • @datacleaner Try matching the entire string `^[^:]*:\s*([^(]+).*$` https://regex101.com/r/cRwYf8/1 llike `"179. D3C: 23Y2508-UP (37LK) 6-S/F, 4-D/F, 2-T/C".match(/^[^:]*:\s*([^(]+).*$/)` – The fourth bird Mar 04 '21 at 10:41
  • Kind of works.. still missing values from lines that do not have the semicolon which is optional. Also i have this line with semicolon that is showing null... 977E: 20A2509-4222 (40 Link) 6-S/F , 8-D/F , 4- – datacleaner Mar 04 '21 at 11:35
  • @datacleaner Can you give an example of a line without a `:` with the expected match? You can update the link with the example and add it again to a comment https://regex101.com/r/cRwYf8/1 – The fourth bird Mar 04 '21 at 11:37
  • line 180 of my post should match for 27Y5050-UP – datacleaner Mar 04 '21 at 11:38
  • @datacleaner This would match it https://regex101.com/r/DQeMQy/1 `^\d+\.\s*(?:\w+:)?\s*([^(]+).*$` or this https://regex101.com/r/WccLJx/1 `^\d+\.\s*(?:\w+:)?\s*(\w+-\w+(?:,\s+\w+-\w+)*).*$` – The fourth bird Mar 04 '21 at 11:42
  • Thanks, this works great in the editor, unfortunately not in GREL..... can't understand why. I used value.match and the opening/closing slash like this `value.match(/^\d+\.\s*(?:\w+:)?\s*(\w+-\w+(?:,\s+\w+-\w+)*).*$/)` – datacleaner Mar 04 '21 at 12:06
  • @datacleaner Ther are no leading whitespaces in the line? So the example data has a digit directly at the start of the line? Try it like this without the anchors and leading whitespace https://regex101.com/r/c5hjty/1 `\s*\d+\.\s*(?:\w+:)?\s*(\w+-\w+(?:,\s+\w+-\w+)*).*` – The fourth bird Mar 04 '21 at 12:17
  • All these results in null. I think it is because of how value.match works as explained here...https://docs.openrefine.org/manual/grelfunctions/#string-functions under "match (p,s).Your regex code is correct. I only managed to get some values ouf of this `value.match(/A*(\w+-\w+(?:,\s+\w+-\w+)*).*/)` still better than nothing...thx – datacleaner Mar 04 '21 at 12:51
  • @datacleaner I don't see an `A` but you can also skip the digits part and try a non greedy dot `.*?(?:\w+:)?\s*(\w+-\w+(?:,\s+\w+-\w+)*).*` – The fourth bird Mar 04 '21 at 12:54