I'm building a word list from a .txt file (65,000 words) using collections.Counter and re.findall(). It works well for English, but it ignores accented characters in other languages, such as â, á, ü and ö. I also want combined words like "t'appele" and "signifie-t-elle" to be counted as one distinct word each. I have tried all sorts of regex combinations without success. Does anyone know how to make it include these characters? My code is below.

import collections
import re

with open(text_to_load) as f:
    words_from_text = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line, re.UNICODE))
  • It's working well for me on all special characters / combined words you gave. How did you load the txt file? – Melkozaur Aug 10 '20 at 17:11
  • Thanks a lot, that's most probably the reason. I have updated the code with the load line. Based on your comment I also tried "with codecs.open(r'agatha_test.txt', encoding='utf-8') as f:"; it gave me á and à but not ê, ', - etc. Do you have a recommendation for how to load it? – Lukas Lejring Aug 12 '20 at 19:33
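
To illustrate the point raised in the comments, here is a minimal sketch of how the decoding step alone can change what the question's regex finds. It assumes the file is really UTF-8 and compares decoding the same bytes as UTF-8 and as Latin-1; the sample text and variable names are made up for the illustration, and the actual platform default encoding may differ.

```python
import re

# Same pattern as in the question (re.UNICODE is already the default for str in Python 3).
pattern = re.compile(r'\b[^\W\d_]+\b')

# Hypothetical sample standing in for the file contents, stored as UTF-8 bytes.
raw = "être âgé".encode("utf-8")

# Decoded with the right codec, the accented letters are ordinary word characters.
good_words = pattern.findall(raw.decode("utf-8"))

# Decoded with the wrong codec, each accented letter becomes two unrelated
# characters, some of which are not letters, so the words break apart.
bad_words = pattern.findall(raw.decode("latin-1"))

print(good_words)
print(bad_words)
```

If the two lists differ on your own data, passing encoding='utf-8' to open() is likely the fix, rather than changing the regex.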

1 Answer

Thanks a lot, you really helped me with the encoding. I still had a problem with \W in the regex, which didn't seem to match the French characters, so I solved it by splitting on whitespace instead:

import collections

words_from_text = collections.Counter()

# text_to_load is the path to the .txt file, as in the question.
with open(text_to_load, "r", encoding='utf-8') as f:
    for line in f:
        # Turn the punctuation that separates words into spaces.
        for punct in (".", "—", ","):
            line = line.replace(punct, " ")
        # A Counter returns 0 for missing keys, so no membership test is needed.
        for word in line.lower().split():
            words_from_text[word] += 1
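
Splitting on whitespace keeps "t'appele" and "signifie-t-elle" intact, but any punctuation that is not replaced by hand (?, !, ;, quotes) stays attached to the words. As an alternative, here is a rough sketch of a regex that allows an apostrophe or hyphen only between letters. The names WORD_RE and count_words and the sample lines are made up for this example, and the NFC normalization step is an extra precaution I am assuming is useful, not something from the original post.

```python
import collections
import re
import unicodedata

# One or more letters (no digits, no underscore), optionally followed by
# further letter runs joined by an apostrophe (straight or curly) or a hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")

def count_words(lines):
    """Count lower-cased words in an iterable of already-decoded text lines."""
    counts = collections.Counter()
    for line in lines:
        # Compose letter + combining accent pairs into single characters,
        # since combining marks on their own are not matched by \w.
        line = unicodedata.normalize("NFC", line)
        counts.update(WORD_RE.findall(line.lower()))
    return counts

# Hypothetical sample lines; in practice pass the file object opened with
# encoding='utf-8' instead.
sample = ["Que signifie-t-elle? Je t'appele Zoë.", "Être, âgé — über."]
counts = count_words(sample)
print(counts)
```

With a real file this would be called as count_words(f) inside the with open(text_to_load, "r", encoding='utf-8') as f: block.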