I am trying to parse a vocabulary list that contains a lot of Unicode characters. I DON'T want to strip or escape these characters; I want to handle them like normal characters if possible. My data:
en antropologi /ɑntrupulu¹giː/ antropologien, antropologier, antropologiene an anthropology 01A
en arkitektur /ɑrkitek¹tʉːr/ arkitekturen, arkitekturerarkitekturene an architecture 01A
ei avis /ɑ¹viːs/ avisa, aviser, avisene a newspaper 01P
Barcelona /bɑʃe¹luːnɑ/ proper name 01M
bare /²bɑːre/ just, only 01M
bare bra! /bɑre ¹brɑː/ just fine! 01M
en bensinstasjon /ben¹siːnstɑˌʃuːn/ bensinstasjonen, bensinstasjoner, bensinstasjonene a petrol station 01P
I want a regex with two groups:
group(1): the whole vocabulary entry, without the trailing chapter ID
group(2): only the chapter ID
Example:
group(1): en antropologi /ɑntrupulu¹giː/ antropologien, antropologier, antropologiene an anthropology
group(2): 01A
I tried the following patterns, both of which work fine on https://regex101.com/ (which I used for debugging): "(.+)(01\S)\n" and "(\D+)(01\S)\n".
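For reference, here is a minimal sketch of how the first pattern behaves in Python when applied to an in-memory string instead of the file. The names `sample`, `pattern` and `pairs` are only illustrative, and the two lines are copied from the data above.

```python
import re

# Two lines copied from the data above; `sample` is an illustrative name.
sample = (
    "ei avis /ɑ¹viːs/ avisa, aviser, avisene a newspaper 01P\n"
    "bare /²bɑːre/ just, only 01M\n"
)

# Raw string so that \S and \n reach the regex engine unchanged.
pattern = re.compile(r"(.+)(01\S)\n")

# group(1) is the entry, group(2) is the chapter ID.
pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(sample)]
for voc, cap in pairs:
    print(repr(voc), repr(cap))
```

With a Python 3 `str` as input, the Unicode characters need no special treatment here. Note that group(1) keeps the space in front of the chapter ID; calling `.rstrip()` on it would remove that.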
This is my code and the error I get:
import re

def readTemplate(filepath): #reading a file
    try:
        with open(filepath, "r") as template:
            data = template.read()
            return data
    except:
        return False

def parseData(data): #parse file data
    voc = []
    cap = []

    regexMatch = re.compile("(.+)(01\S)\n").finditer(data)
    for matches in regexMatch:
        voc.append(str(matches.group(1)))
        cap.append(str(matches.group(2)))

    return voc, cap

#-----------------------------Main Prog.-----------------------------
data = readTemplate('Vocubulary.txt') #open file

voc, cap = parseData(data) #parse Data
Traceback (most recent call last):
  File "C:/User...Vocabulary.py", line 25, in <module>
    voc, cap = parseData(data) #parse Data
  File "C:/Users...Vocabulary.py", line 15, in parseData
    regexMatch = re.compile("(.+)(01\S)\n").finditer(data)
TypeError: expected string or bytes-like object
TypeError: expected string or bytes-like object
Process finished with exit code 1
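The TypeError means that `data` was not a string when it reached `finditer`. In the code above that can only happen when `readTemplate` returns `False`, that is, when `open` or `read` raised an exception and the bare `except` hid it. One possible cause, which is an assumption and not verified against the real file, is that the file is saved as UTF-8 while `open` falls back to a different platform default encoding. The sketch below passes the encoding explicitly and lets errors surface. The names `read_template` and `parse_data`, the `encoding` default and the temporary file are hypothetical and exist only to make the example self-contained.

```python
import os
import re
import tempfile

def read_template(filepath, encoding="utf-8"):
    """Return the file contents as str; I/O and decoding errors propagate."""
    # Assumption: the vocabulary file is UTF-8 encoded.
    with open(filepath, "r", encoding=encoding) as template:
        return template.read()

def parse_data(data):
    """Split each line into a vocabulary entry and a chapter ID."""
    voc, cap = [], []
    for match in re.finditer(r"(.+)(01\S)\n", data):
        voc.append(match.group(1))
        cap.append(match.group(2))
    return voc, cap

# Demo with a throwaway UTF-8 file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_vocabulary.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("bare /²bɑːre/ just, only 01M\n")
    voc, cap = parse_data(read_template(path))

print(voc, cap)
```

If the real file uses another encoding, or the path is wrong, this version fails inside `read_template` with the actual exception instead of passing `False` on to the regex.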