I am trying to match the columns of two different CSV files. I have managed to match column names that share synonyms, like "house" and "residence" or "notes" and "comments". My problem is that I cannot reliably match more complicated, multi-word names.

Example: (these are column names from different files)

"Email" and "E-mail Address". My program can detect that "Email" and "E-mail" are the same, but it cannot connect "Email" to "Address".

Other example:

"Title/Salutation" and "Title". I detect that both contain the word "Title" and I throw away the word "Salutation". This cannot be applied to the previous example, though, because I don't want to discard "Address".

How can I decide whether to keep the other words or to throw them away?

EDIT: I added a bit of code showing what I tried. Sorry if it's confusing.

elif len(list_of_tokens_1[i]) == 1 and len(list_of_tokens_2[j]) == 2:
    score1, list1_1, list1_2, syns_dict = common_words_advanced(copy_tokens_1[i][0], copy_tokens_2[j][0], syns_dict)
    score2, list2_1, list2_2, syns_dict = common_words_advanced(copy_tokens_1[i][0], copy_tokens_2[j][1], syns_dict)

list_of_tokens_1 contains the column names of the first file and list_of_tokens_2 those of the second file, both tokenized (e.g. E-mail_Address -> ['E-mail', 'Address']). copy_tokens_1 and copy_tokens_2 are copies of those lists, so I can modify them.
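To make the tokenization concrete, here is a minimal sketch of that step. The helper name tokenize_column and the choice of separators (underscores, slashes, whitespace) are assumptions for illustration; my real tokenizer may split differently.

```python
import re

def tokenize_column(name):
    """Split a column name into tokens (hypothetical helper).

    Assumption: tokens are separated by underscores, slashes or whitespace.
    Hyphens are kept, so 'E-mail' stays a single token.
    """
    return [tok for tok in re.split(r"[_/\s]+", name) if tok]

# Example column names taken from the question
list_of_tokens_2 = [tokenize_column(n) for n in ["E-mail_Address", "Title/Salutation"]]
```

With this, each entry of list_of_tokens_2 is a list of one or more tokens, which is why the code above branches on the token-list lengths.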

syns_dict contains all the synonyms of a word, with the word as the key.

common_words_advanced is a function that returns how close two words are by comparing their synonyms string to string. A score of 1 means they have at least one common synonym, so they match. A lower score means they are close but don't match.
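For readers who want something runnable, the idea can be sketched roughly as follows. This is not my actual common_words_advanced: the function synonym_score, the use of difflib.SequenceMatcher for the string-to-string comparison, and the tiny hand-written syns_dict are all assumptions made for the example.

```python
from difflib import SequenceMatcher

def synonym_score(word_1, word_2, syns_dict):
    """Return (best_ratio, synonym_1, synonym_2) over all synonym pairs.

    Hypothetical stand-in for common_words_advanced: a ratio of 1.0 means
    the two words share at least one identical synonym (case-insensitive).
    """
    # Each word is also compared as itself, in case it has no entry
    syns_1 = set(syns_dict.get(word_1, ())) | {word_1}
    syns_2 = set(syns_dict.get(word_2, ())) | {word_2}
    best = (0.0, None, None)
    for s1 in sorted(syns_1):
        for s2 in sorted(syns_2):
            ratio = SequenceMatcher(None, s1.lower(), s2.lower()).ratio()
            if ratio > best[0]:
                best = (ratio, s1, s2)
    return best

# Toy synonym dictionary (made up for illustration)
syns_dict = {
    "Email": {"email", "e-mail", "electronic mail"},
    "E-mail": {"e-mail", "email"},
    "Address": {"address", "location"},
}

score1, best1_1, best1_2 = synonym_score("Email", "E-mail", syns_dict)
score2, best2_1, best2_2 = synonym_score("Email", "Address", syns_dict)
```

Here score1 is 1.0 because the two synonym sets overlap, while score2 stays below 1 because no synonym of "Email" equals a synonym of "Address".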

list1_1, list1_2, list2_1 and list2_2 contain the best-matched synonyms returned for each word.

This code is where I try to match ['Email'] (length 1) with ['E-mail', 'Address'] (length 2). The first call compares 'Email' and 'E-mail', and the score is 1. The second call compares 'Email' and 'Address', and the score is ~0.5 (very bad).

costisst