I'm building a word list from a .txt file (with 65000 words) using collections.Counter and re.findall(). It works well for English, but it ignores letters with diacritics that occur in other languages, such as â, á, ü and ö. I also want combined words like "t'appele" and "signifie-t-elle" to be counted as one distinct word each. I have tried all sorts of regex combinations without success. Does anyone know how to make it include these characters? My code is below.
```
import collections
import re

# text_to_load holds the path to the .txt file
with open(text_to_load) as f:
    words_from_text = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line, re.UNICODE))
```
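
For reference, here is a minimal sketch of the behaviour being asked for. It is not a confirmed fix. It assumes Python 3, where `str` patterns are Unicode-aware by default, and it assumes the text is decoded correctly when the file is read. The names `WORD_RE` and `count_words` are made up for illustration. The pattern treats a word as runs of letters optionally joined by a single apostrophe or hyphen, and each line is normalized to NFC first so that letters stored as "base letter + combining accent" are not split.

```python
import collections
import re
import unicodedata

# Hypothetical pattern: one run of letters, optionally followed by more runs
# joined by a single apostrophe (straight or typographic) or a hyphen.
# [^\W\d_] means "word character that is neither a digit nor an underscore".
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def count_words(lines):
    """Count lower-cased words in an iterable of text lines."""
    return collections.Counter(
        word.lower()
        for line in lines
        # NFC merges a base letter and a combining accent into one code point.
        for word in WORD_RE.findall(unicodedata.normalize("NFC", line))
    )


sample = ["Que signifie-t-elle ?", "Je t'appele, ça dépend.", "Ça dépend."]
counts = count_words(sample)
print(counts.most_common(3))
```

To run this on the file, the open file object can be passed directly, for example `count_words(open(text_to_load, encoding="utf-8"))`. The `utf-8` encoding is an assumption about the file. If it was saved in another encoding such as Latin-1, that name would be needed instead, and a wrong encoding may by itself explain accented letters going missing.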