Extract multiple strings between brackets or two different delimiters

Question

I'm trying to extract data between these two delimiters [,]
for two cases: Beginning with #CB or CR:

Input data:

Scenario,Order_ID,Review
1,1234,<Order No. 10><Remark>Food was good</Remark><UserID>7890</UserID><Filter>[#CB:Customer happy with service]</Filter><Rating>5</Rating>
2,1235,<Rating>Five</Rating><Order No. 17><UserID>7880</UserID><Filter>[#CB:Customer had a good time]</Filter><Remark>Food was good</Remark><Additional>Service up to par</Additional><Remark>[#CB:Customer will return again]</Remark><End>Thats all</End>
3,1236,<Start>We shall begin</Start><Intro>[CR:Fine Dining Experience]</Intro><Rating>Three</Rating><Order No. 12><Filter>[#CB:Food was well prepared]</Filter><Remark>Food was good</Remark><Additional>Ambiance was lively</Additional><Remark>[CR:Hygine was maintained]</Remark><End>Could be better</End><UserID>7880</UserID>

Data wanted:

Scenario,Order_ID,Review1,Review2,Review3,Review4
1,1234,#CB:Customer happy with service,,,
2,1235,#CB:Customer had a good time,#CB:Customer will return again,,
3,1236,CR:Fine Dining Experience,#CB:Food was well prepared,CR:Hygine was maintained,

How do you intend to get the data? It is from a sql server or something else? — , Apr 21 '23 at 18:39
Data: Five7880[#CB:Customer had a good time]Food was goodService up to par[#CB:Customer will return again]Thats all We shall begin[CR:Fine Dining Experience]Three[#CB:Food was well prepared]Food was goodAmbiance was lively[CR:Hygine was maintained]Could be better7880 — Storm, Apr 21 '23 at 18:44
[Edit] your question and put the data there (with correct formatting), not into a comment. — Andrej Kesely, Apr 21 '23 at 18:45
If your data are text readers can cut-and-paste it to test code they may wish to suggest in an answer or comment. By displaying it as an image every reader wanting to use it must transcribe it manually, a time-consuming, tedious, irritating and wholly unnecessary exercise. — Cary Swoveland, Apr 21 '23 at 20:11

cpix · Answer 1 · 2023-04-21T18:51:04.313

It appears that the data is in XML format (although the <Order No. 10>. makes it invalid; I've swapped it with <Order>. You can iterate through each value and check whether they begin with #CB. Alternatively you can perform a regex search as shown below. However if you are confident that the data always follows the XML format, the former method would be more suitable.

import re

text = ...

# Extract between [#CB: and ]
cb_data = re.findall(r"\[#CB:(.*?)\]", text)
print(cb_data) # prints out CB

# Extract between [CR: and ]
cr_data = re.findall(r"\[CR:(.*?)\]", text)
print(cr_data) # prints out CR

If your data is always in XML:

import xml.etree.ElementTree as ET

xml_str1 = '<Order><Remark>Food was good</Remark><UserID>7890</UserID><Filter>[#CB:Customer happy with service]</Filter><Rating>Five</Rating></Order>'
xml_str2 = '<Order><UserID>7880</UserID><Filter>[#CB:Customer had a good time</Filter><Remark>Food was good</Remark><Additional>Service up to par</Additional><Remark>(#CB:Customer will return again]</Remark><End>Thats all</End></Order>'

root1 = ET.fromstring(xml_str1)
root2 = ET.fromstring(xml_str2)

cb_filter1 = root1.find('Filter').text
cb_data1 = cb_filter1[cb_filter1.find('#CB:')+4:cb_filter1.find(']')]

cb_filter2 = root2.find('Filter').text
cb_data2 = cb_filter2[cb_filter2.find('#CB:')+4:cb_filter2.find(']')]

print(cb_data1) # Customer happy with service
print(cb_data2) # Customer had a good time

score 1 · Accepted Answer · answered Apr 21 '23 at 19:34

Building on the other answer, this is how you could get it in the format you asked for.

import re
import pandas as pd

p = re.compile(r"\[(.*?)\]")

s1 = "<Order No. 10>Food was good7890[#CB:Customer happy with service]Five"
s2 = " Five<Order No. 17>7880[#CB:Customer had a good time]Food was goodService up to par[#CB:Customer will return again]Thats all"
print(p.findall(s1))
print(p.findall(s2))

d = pd.DataFrame({'scenario':[1,2], 'order_id':['1234', '1235'], 'reviews':[s1,s2] })

def padList(l, length=4):
  l = l[:length]
  l += [''] * (length - len(l))
  return l

d[['review1','review2','review3','review4']] = d.apply(lambda row: padList(p.findall(row['reviews'])), axis=1, result_type='expand')

d

Extract multiple strings between brackets or two different delimiters

2 Answers2