I am working on a function that checks whether LP or LLP appears, optionally preceded by a space, right after a " at any position in the string. If it does, I'd like to move the LP or LLP substring inside the quoted substring, as shown below.

# input
'blabla "RANDOM COMPANY ONE "LLP blabla'
'blabla "RANDOM COMPANY TWO " LLP blabla'
'blabla "RANDOM COMPANY THREE " LP blabla'
'blabla "RANDOM COMPANY FOUR "LP blabla'

# output
'blabla "RANDOM COMPANY ONE LLP" blabla'
'blabla "RANDOM COMPANY TWO LLP" blabla'
'blabla "RANDOM COMPANY THREE LP" blabla'
'blabla "RANDOM COMPANY FOUR LP" blabla'

So far, I got to this function, which almost does what I want:

import re

def fix_entity_broken_by_quotes(text):
    match = r'"\s*(LL?P)'
    replace = r'" \1 "'
    return ' '.join(re.sub(match, replace, text).split())

# run

>>> fix_entity_broken_by_quotes('blabla "RANDOM COMPANY ONE" LLP blabla')
Out[1]: 'blabla "RANDOM COMPANY ONE" LLP " blabla'

I would not want the " after ONE in the resulting string.

As always, any hint or explanation on what I am missing is very welcome.

Thanks!

Wiktor Stribiżew
Tytire Recubans

2 Answers


You may try using re.sub:

import re

inp = ['blabla "RANDOM COMPANY ONE "LLP blabla', 'blabla "RANDOM COMPANY TWO " LLP blabla', 'blabla "RANDOM COMPANY THREE " LP blabla', 'blabla "RANDOM COMPANY FOUR "LP blabla']
output = [re.sub(r'"[ ]?(LP|LLP)', r'\1"', x) for x in inp]
print(output)

This prints:

['blabla "RANDOM COMPANY ONE LLP" blabla',
 'blabla "RANDOM COMPANY TWO LLP" blabla',
 'blabla "RANDOM COMPANY THREE LP" blabla',
 'blabla "RANDOM COMPANY FOUR LP" blabla']
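
One case worth checking, offered as a sketch rather than as part of the answer above: if the quote follows the company name with no space before it (as in the question's own test call, 'blabla "RANDOM COMPANY ONE" LLP blabla'), the pattern above would glue the suffix onto the name and give 'ONELLP"'. Assuming that case matters, the pattern can absorb whitespace on both sides of the quote and the replacement can emit a single space. The helper name move_suffix_inside_quotes and the \b word boundary are illustrative additions, not taken from the question or the answer.

```python
import re

# Hypothetical helper. The trailing \b is an added assumption so that
# text such as "LPG is left untouched.
PATTERN = re.compile(r'\s*"\s*(LL?P)\b')

def move_suffix_inside_quotes(text):
    # Swallow any whitespace around the closing quote, then write
    # exactly one space, the suffix, and the quote.
    return PATTERN.sub(r' \1"', text)

samples = [
    'blabla "RANDOM COMPANY ONE "LLP blabla',
    'blabla "RANDOM COMPANY TWO " LLP blabla',
    'blabla "RANDOM COMPANY ONE" LLP blabla',  # no space before the quote
]
for sample in samples:
    print(move_suffix_inside_quotes(sample))
```

A quoted name that itself starts with LP or LLP (for example "LP Holdings") would still be rewritten, so this sketch assumes such names do not occur in the data.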
Tim Biegeleisen

hint or explanation on what I am missing is very welcome.

You have a leading " in your replace:

match = r'"\s*(LL?P)'
replace = r'" \1 "'

Changing replace to r' \1 "' should help.
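
Applied to the question's function, the change could look like the sketch below. It makes one assumption beyond the suggestion above: the space after \1 is dropped too, because r' \1 "' would leave 'LLP "' with a space before the closing quote, while the expected output in the question has 'LLP"'.

```python
import re

def fix_entity_broken_by_quotes(text):
    match = r'"\s*(LL?P)'
    # No leading quote, and no space between the suffix and the quote.
    replace = r' \1"'
    # split/join collapses any doubled spaces left by the substitution.
    return ' '.join(re.sub(match, replace, text).split())

print(fix_entity_broken_by_quotes('blabla "RANDOM COMPANY ONE" LLP blabla'))
# blabla "RANDOM COMPANY ONE LLP" blabla
print(fix_entity_broken_by_quotes('blabla "RANDOM COMPANY THREE " LP blabla'))
# blabla "RANDOM COMPANY THREE LP" blabla
```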

Daweo