Python: normalizing a text file

Question

I have a text file which contains several spelling variants of many words:

For example:

identification ... ID .. identity...contract.... contr.... contractor...medicine...pills..tables

So I want to have a synonym text file which contains each word's synonyms, and I would like to replace all the variants with the primary word. Essentially, I want to normalize the input file.

For example, my synonym list file would look like:

identification = ID identify
contracting = contract contractor contractors contra...... 
word3 = word3_1 word3_2 word3_3 ..... word3_n
.
.
.
.
medicine = pills tables drugs...

I want the end output file to look like

identification ... identification .. identification...contractor.... contractor.... contractor...medicine...medicine..medicine

How do I go about programming this in Python?

Thanks a lot for your help!!!

score 3 · Answer 1 · answered Sep 10 '11 at 21:43

Just a thought: instead of keeping a list of every variation of a word, have a look at get_close_matches in difflib:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
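
As a rough sketch of how this might be applied to the question, one could match each word against a list of primary words only. The list PRIMARY_WORDS and the helper closest_primary below are hypothetical names made up for illustration, and the cutoff of 0.6 is simply the default used by get_close_matches, not a tuned value.

```python
from difflib import get_close_matches

# Hypothetical list of primary words, made up for this illustration.
PRIMARY_WORDS = ['identification', 'contracting', 'medicine']

def closest_primary(word, primaries=PRIMARY_WORDS, cutoff=0.6):
    """Return the most similar primary word, or the word itself if none is close."""
    matches = get_close_matches(word.lower(), primaries, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(closest_primary('contractor'))  # similar spelling, so it maps to 'contracting'
print(closest_primary('pills'))       # no similar spelling, so it is left unchanged
```

Note that this only catches variants that are spelled similarly. True synonyms such as pills, or short abbreviations such as ID, are not close to their primary word by this measure, so an explicit synonym list would still be needed for those.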

Fredrik Pihl


Thank you. I will need this type of heuristic for scrubbing. I intend to look into this at a slightly more advanced stage of the application I am working on. – Zenvega Sep 11 '11 at 18:33

unutbu · Accepted Answer · 2011-09-11T10:52:34.697

You could read the synonym file and convert it into a dictionary called table:

import re

table = {}
with open('synonyms', 'r') as syn:
    for line in syn:
        # Each line looks like: primary = synonym1 synonym2 ...
        match = re.match(r'(\w+)\s+=\s+(.+)', line)
        if match:
            primary, synonyms = match.groups()
            # Store keys in lower case so lookups can ignore case.
            for synonym in synonyms.lower().split():
                table[synonym] = primary.lower()

print(table)

yields

{'word3_1': 'word3', 'word3_3': 'word3', 'word3_2': 'word3', 'contr': 'contracting', 'contract': 'contracting', 'contractor': 'contracting', 'contra': 'contracting', 'identify': 'identification', 'contractors': 'contracting', 'word3_n': 'word3', 'id': 'identification'}

Next, you could read in the text file, and replace each word with its primary synonym from table:

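with open('textfile', 'r') as f:
    for line in f:
        # Split into alternating runs of word and non-word characters,
        # then swap each word for its primary synonym if it has one.
        print(''.join(table.get(word.lower(), word)
                      for word in re.findall(r'(\w+|\W+)', line)))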

yields

identification     identification    identity   contracting     contracting     contracting   medicine   medicine  medicine

re.findall(r'(\w+|\W+)',line) was used to split each line into runs of word and non-word characters, so whitespace and punctuation are preserved. If they are not of interest, you could also use the simpler line.split().
table.get(word.lower(),word) returns table[word.lower()] if the lower-cased word is in table, and simply returns word unchanged if it is not in the synonym table.
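
To try the two steps without creating any files, here is a minimal self-contained sketch. The function names build_table and normalize, and the sample lines, are made up for this illustration and stand in for the 'synonyms' and 'textfile' files above.

```python
import re

def build_table(lines):
    """Map each lower-cased synonym to its lower-cased primary word."""
    table = {}
    for line in lines:
        match = re.match(r'(\w+)\s+=\s+(.+)', line)
        if match:
            primary, synonyms = match.groups()
            for synonym in synonyms.split():
                table[synonym.lower()] = primary.lower()
    return table

def normalize(line, table):
    """Replace every word found in table; keep everything else unchanged."""
    tokens = re.findall(r'\w+|\W+', line)
    return ''.join(table.get(token.lower(), token) for token in tokens)

# Hypothetical sample data standing in for the two files.
synonym_lines = [
    'identification = ID identify',
    'medicine = pills tables drugs',
]
table = build_table(synonym_lines)
result = normalize('Show me your ID. Take two pills...', table)
print(result)  # Show me your identification. Take two medicine...
```

Because the non-word runs are kept as tokens and passed through untouched, the punctuation and spacing of the input survive the replacement.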

Whitespace splitting will leave trailing punctuation attached to words. For example, "Show me your ID.", if split on whitespace, won't give the nice clean "ID" string to convert to "identification". Upper/lower case will need to be handled too. — PaulMcG, Sep 11 '11 at 07:36
@Paul McGuire: Thanks for the comment. I changed `\s+|\S+` to `\w+|\W+` to separate punctuation from words, and added code to handle case. @Pradeep: These changes have unlikely but possibly problematic consequences: words with punctuation (like `can't`) in the synonym list will no longer match, and words whose meaning changes with case (`Polish` is a nationality, but `polish` is a verb) may get replaced by the same synonym. These issues can be handled with more code, but let's not do that unless it affects your situation. — unutbu, Sep 11 '11 at 11:47
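
If those two cases do matter, one possible way to handle them is sketched below. It is only an illustration under stated assumptions: the table contents and the helper names lookup and normalize are hypothetical, apostrophes are treated as word characters, and an exact-case match is tried before the lower-cased one.

```python
import re

# Hypothetical table. Keys keep their original case so that a case-sensitive
# entry ('Polish') can coexist with case-insensitive, lower-case ones ('id').
table = {
    "can't": 'cannot',
    'Polish': 'Polish_nationality',
    'id': 'identification',
}

def lookup(token, table):
    """Try an exact-case match first, then fall back to the lower-cased token."""
    if token in table:
        return table[token]
    return table.get(token.lower(), token)

def normalize(line, table):
    # Treat apostrophes as word characters so "can't" stays one token.
    tokens = re.findall(r"[\w']+|[^\w']+", line)
    return ''.join(lookup(token, table) for token in tokens)

result = normalize("I can't polish the Polish ID.", table)
print(result)  # I cannot polish the Polish_nationality identification.
```

This still has gaps: a word in single quotes keeps its quotes attached, and a sentence that starts with "Polish" used as a verb would be matched by the case-sensitive entry.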
