How to ignore unicode character with regex in python 3?

Question

I am trying to parse a vocabulary list that contains a lot of Unicode characters. I don't want to match these characters specially; I want to handle them like normal characters if possible. My data:

en  antropologi     /ɑntrupulu¹giː/     antropologien, antropologier, antropologiene    an anthropology     01A
en  arkitektur  /ɑrkitek¹tʉːr/  arkitekturen, arkitekturerarkitekturene     an architecture     01A
ei  avis    /ɑ¹viːs/    avisa, aviser, avisene  a newspaper     01P
    Barcelona   /bɑʃe¹luːnɑ/        proper name     01M
    bare    /²bɑːre/        just, only  01M
    bare bra!   /bɑre ¹brɑː/        just fine!  01M
en  bensinstasjon   /ben¹siːnstɑˌʃuːn/  bensinstasjonen, bensinstasjoner, bensinstasjonene  a petrol station    01P

I want the regex to capture two groups:
group(1): the whole vocabulary entry without the trailing chapter ID
group(2): only the chapter ID

Example: group(1): en antropologi /ɑntrupulu¹giː/ antropologien, antropologier, antropologiene an anthropology
group(2): 01A

I tried the following patterns, which work fine on https://regex101.com/ (which I used for debugging): "(.+)(01\S)\n" works, and so does "(\D+)(01\S)\n".

This is my code and the error I get:

import re

def readTemplate(filepath): #reading a file
    try:
        with open(filepath, "r") as template:
            data = template.read()
        return data
    except:
        return False

def parseData(data): #parse file data
    voc = []
    cap = []

    regexMatch = re.compile("(.+)(01\S)\n").finditer(data)
    for matches in regexMatch:
        voc.append(str(matches.group(1)))
        cap.append(str(matches.group(2)))

    return voc, cap

#-----------------------------Main Prog.-----------------------------

data = readTemplate('Vocubulary.txt') #open file
voc, cap = parseData(data) #parse Data

Traceback (most recent call last):
  File "C:/User...Vocabulary.py", line 25, in <module>
    voc, cap = parseData(data) #parse Data
  File "C:/Users...Vocabulary.py", line 15, in parseData
    regexMatch = re.compile("(.+)(01\S)\n").finditer(data)
TypeError: expected string or bytes-like object

Process finished with exit code 1

What if you add `data = ""` right below `def readTemplate(filepath):`? In `parseData`, add a condition `if data:` before parsing. See https://ideone.com/ly8bLN (not tested) - Wiktor Stribiżew, Sep 08 '19 at 11:35
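A minimal sketch of what this comment suggests, under some assumptions. In the question's code, `readTemplate` returns `False` whenever anything goes wrong while opening or reading the file, and `finditer(False)` then raises the `TypeError` shown in the traceback. The version below returns an empty string instead and skips parsing when nothing was read. The `encoding="utf-8"` argument and the printed message are assumptions added for illustration; they are not part of the original code or of the comment.

```python
import re

def readTemplate(filepath, encoding="utf-8"):
    """Return the file's text, or "" if the file cannot be read."""
    try:
        # Assumption: the vocabulary file was saved as UTF-8.
        with open(filepath, "r", encoding=encoding) as template:
            return template.read()
    except (OSError, UnicodeDecodeError) as exc:
        # Report the real cause instead of hiding it behind a bare except.
        print(f"Could not read {filepath!r}: {exc}")
        return ""

def parseData(data):
    """Split each line into (vocabulary entry, chapter ID)."""
    voc, cap = [], []
    if data:  # skip parsing when nothing was read
        for match in re.finditer(r"(.+)(01\S)\n", data):
            voc.append(match.group(1))
            cap.append(match.group(2))
    return voc, cap
```

With this guard, a missing or undecodable file yields two empty lists plus a message naming the underlying problem, rather than a `TypeError` raised inside the regex call.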

score -1 · Answer 1 · answered Sep 08 '19 at 11:06

The code works without errors on my machine (Linux), but you could try using a raw string for the pattern:

regexMatch = re.compile(r"(.+)(01\S)\n").finditer(data)
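As a quick check that the pattern itself copes with the phonetic characters, here is a small sketch that applies the raw-string pattern to an in-memory string instead of a file. The two sample lines are taken from the question's data; the names `PATTERN`, `sample` and `results` are only illustrative.

```python
import re

# Raw string: \S and \n are passed to the regex engine unchanged.
PATTERN = re.compile(r"(.+)(01\S)\n")

# Two lines from the question's data, held in an ordinary Python 3 str.
sample = (
    "ei  avis    /ɑ¹viːs/    avisa, aviser, avisene  a newspaper     01P\n"
    "    bare    /²bɑːre/        just, only  01M\n"
)

# group(1) is the vocabulary entry, group(2) the chapter ID.
results = [(m.group(1).strip(), m.group(2)) for m in PATTERN.finditer(sample)]

for entry, chapter in results:
    print(chapter, entry)
```

In Python 3 a `str` is already Unicode, so `.` matches characters such as `ɑ` or `ː` like any other character. Since `"\S"` is not a recognized string escape, the non-raw pattern reaches the regex engine with the same content, so switching to a raw string mainly avoids the invalid-escape warning and is unlikely to change the `TypeError` on its own.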

Christopher Krause
