Extract abbreviation from string of the words longer than 3 letters by regex

Question

string1 =  'Department of the Federal Treasury "IFTS No. 43"'
string2 =  'Federal Treasury Company "Light-8"'

I need to get the initial capital letters of the words longer than 3 characters that come before the opening quote, and also extract the quoted expression, using one pattern that works for both strings.

Final string should be:

for string1: 'IFTS No. 43, DFT'.
for string2: 'Light-8, FTC'.

I would like a single pattern that works for both strings so that I can reuse it on a DataFrame.

The word `Russia` appears nowhere in the first input string. From where is it coming? — Tim Biegeleisen, Apr 28 '23 at 13:41
Look into built in method str.split(). It will return a list of elements in your string and you can iterate through each one and apply logic desired. https://docs.python.org/3/library/stdtypes.html#str.split — Igor, Apr 28 '23 at 15:40
Please provide enough code so others can better understand or reproduce the problem. — Igor, Apr 28 '23 at 15:41
Try something like [`\"([^\"]+)\"|\b([A-Z])`](https://regex101.com/r/O3VBWP/1) [group](https://www.regular-expressions.info/brackets.html) 1 > quoted, group 2 > capital letters. — bobble bubble, Apr 28 '23 at 16:17
Tim Biegeleisen, Sorry, I didn't understand your question at first. Then I found an error in my assignment. I fixed it. But at the same time, I ended up solving it myself. This pattern fits right here r'(\w)\w{3,}|\"([^\"]+)\"'. The output needs a bit of work, but so far it suits me the best. — broncaeux, Apr 28 '23 at 16:26
My code:

import re

row = 'Federal Treasury Company "Light-8"'
pattern = r'(\w)\w{3,}|\"([^\"]+)\"'
matches = re.findall(pattern, row)
print('{}, {}'.format([match[1] for match in matches if match[1]][0], ''.join(map(str,[match[0] for match in matches if match[0]]))))
# output 'Light-8, FTC'

— broncaeux, Apr 28 '23 at 16:37
@broncaeux Have a look at [this Python demo](https://tio.run/##VY1NSwMxFEX3@RWXt0poHVrrSuhCkIKbbuzKyQijfdMJNB@8pGDB/z5mRChu77n3nnQtYwybaXI@RSkQVkr4xF/YQsiSbt8tdQtj6dt@6Pbp7q0zpFSumJ459VI8h4I4oIyMHR9Z@jMOwn2@yBWWXnaHV@xjg4eNJRvq1vflc@T5QbgZXDi6wqJ/pUtkM/tn2BItQdQpNUSBhwv4Wz4qwA3aNyeJl6TvjZkT1LvcrjsstrihCvic@VZY/Susqy6JC6X6s5mmHw). — bobble bubble, Apr 28 '23 at 17:07
@broncaeux You're welcome, glad that helped! :) I put it as an answer. — bobble bubble, Apr 28 '23 at 18:23

bobble bubble · Answer 1 · 2023-05-02T10:20:30.987

You can use a capturing group and alternation.

"([^"]+)"|\b[A-Z]

See this demo at regex101 (FYI read: The Trick)

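The pattern is an alternation: either it matches a quoted part and captures the text between the double quotes ([^"]+, one or more non-quote characters) into the first capturing group, or it matches a capital letter at a \b word boundary (the start of a word). Because a match on the quoted part consumes everything up to the closing quote, capital letters inside the quotes are not matched separately.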

import re

regex = r"\"([^\"]+)\"|\b[A-Z]"

s = "Department of the Federal Treasury \"IFTS No. 43\"\n"

res = ["", ""]

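for m in re.finditer(regex, s):
  if m.group(1):
    # quoted part: keep the text captured between the double quotes
    res[0] += m.group(1)
  else:
    # capital letter at the start of a word: append it to the abbreviation
    res[1] += m.group(0)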

print(res)

Python demo at tio.run, which prints:

['IFTS No. 43', 'DFT']
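
To get the exact string asked for in the question, the two parts can be joined in a small helper. The sketch below is only one possible way to do it: the name abbreviate is hypothetical, and the pattern is a variant of the one above that adds a second capturing group plus \w{3,}, so that only capitalised words of at least 4 characters contribute a letter. With the two sample strings the result is the same either way, since "of" and "the" are lowercase.

```python
import re

# Group 1: text between double quotes.
# Group 2: initial capital of a word that has at least 4 word characters
#          (the capital itself plus 3 or more following characters).
PATTERN = re.compile(r'"([^"]+)"|\b([A-Z])\w{3,}')


def abbreviate(text):
    """Return 'quoted part, ABBREVIATION' for one string (hypothetical helper)."""
    quoted = []
    initials = []
    for m in PATTERN.finditer(text):
        if m.group(1) is not None:
            quoted.append(m.group(1))    # the quoted alternative matched
        else:
            initials.append(m.group(2))  # the capital-letter alternative matched
    return "{}, {}".format(" ".join(quoted), "".join(initials))


print(abbreviate('Department of the Federal Treasury "IFTS No. 43"'))
print(abbreviate('Federal Treasury Company "Light-8"'))
```

For a DataFrame, the same function could presumably be applied to a column with Series.apply, for example df["name"].apply(abbreviate), where the column name "name" is an assumption made for illustration.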
