I am trying to clean up text taken from an actual English dictionary. I have already written a Python program that loads the data from a .txt file into a SQL DB in separate columns: id, word, definition. As the next step, I want to determine the 'type' of each word by pulling markers out of its definition, such as n. for noun, adj. for adjective, adv. for adverb, and so on.

Using the following regex, I am trying to extract every token that ends with a '.' (adv., abbr., n., adj., etc.) and build a histogram of those tokens to see what the different types are. My assumption is that these markers will be more frequent than ordinary words that happen to end with '.', but I still plan to check the top results manually to confirm. Here's my code:

for row in cur:
  temp_var = re.findall('\w+[.]+ ',split)
  if len(temp_var) >=1 : 
      temp_var = temp_var.pop()
      typ_dict[temp_var] = typ_dict.get(temp_var,0) + 1

for key in typ_dict:
  if typ_dict[key] > 50:
      print(key, typ_dict[key])

After running this code I am not getting the desired result: the counts are much lower than the number of occurrences in the definitions. I tested with 'Abbr.', which this code reports as occurring 125 times, but if I change the regex from '\w+[.]+ ' to 'Abbr. ' the count shoots up to 186. I am not sure why my regex is not capturing all the occurrences.

Any idea as to why I am not getting all the matches?

Edit: Here is the type of text I am working with

Aback - adv. take aback surprise, disconcert. [old english: related to *a2]
Abm - abbr. Anti-ballistic missile
Abnormal - adj. Deviating from the norm; exceptional.  abnormality n. (pl. -ies). Abnormally adv. [french: related to *anomalous]

Each line is split in two, the word and the rest as its definition, and loaded into a SQL table.

Ctrl
  • providing a sample of your input may help you get an answer. what is ```cur``` ? – Hozayfa El Rifai May 19 '20 at 10:53
  • Cur is a cursor used to connect to the SQL database. My input is an SQL table with the definition column having content of the form "abbr. 1 automobile association. 2 alcoholics anonymous. 3 anti-aircraft." – Ctrl May 19 '20 at 10:55
  • It's hard to pin down without the text you're using, but... have you tried using '\s' instead of ' '? It's possible that some of your words are followed with another type of whitespace. – Kyle Alm May 19 '20 at 11:03
  • Yes. I have tried that as well. If you want I can manually go through the output and find exact texts which do not match '\w+[.]+ ' and which match 'Abbr. '? – Ctrl May 19 '20 at 11:05

1 Answer

If you are using a dictionary to count items, the best variant to use is Counter from the collections module. But your code has another problem. You check that temp_var has length >= 1, but then you do only one pop operation. What happens when findall returns multiple items? You also do temp_var = temp_var.pop(), which rebinds temp_var to the popped string and would prevent you from popping more items even if you wanted to. The result is that only the last match in each row is counted.

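import re
from collections import Counter

counters = Counter()

for row in cur:
    # split is assumed to hold this row's definition text, as in the question's code
    temp_var = re.findall(r'\w+[.]+ ', split)
    # count every match in the row, not just the last one
    for x in temp_var:
        counters[x] += 1

for key in counters:
    if counters[key] > 50:
        print(key, counters[key])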
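
To see the difference, here is a small sketch that runs both approaches on the three sample definitions from the question's edit. The names `definitions`, `last_only` and `all_matches` are made up for this illustration, and the list stands in for the rows you would read from your cursor, so treat it as a demonstration of the idea and not as a drop-in replacement.

```python
import re
from collections import Counter

# Definition parts of the three sample lines from the question's edit
# (stand-ins for the rows that would come from the SQL cursor).
definitions = [
    "adv. take aback surprise, disconcert. [old english: related to *a2]",
    "abbr. Anti-ballistic missile",
    "adj. Deviating from the norm; exceptional.  abnormality n. (pl. -ies). "
    "Abnormally adv. [french: related to *anomalous]",
]

pattern = re.compile(r"\w+[.]+ ")

last_only = Counter()    # mimics temp_var.pop(): only the last match per row
all_matches = Counter()  # counts every match in every row

for text in definitions:
    matches = pattern.findall(text)
    if matches:
        last_only[matches[-1]] += 1  # list.pop() returns the last element
    all_matches.update(matches)

print("last match only:", dict(last_only))
print("all matches:    ", dict(all_matches))
```

In the first sample the leading 'adv. ' is dropped by the pop() version because 'disconcert. ' comes later in the same definition, which is the kind of undercount you are seeing with 'Abbr. '.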
Booboo
  • Aha!! Maybe that's why I am missing all those matches because it is popping only the first match for my reg ex which might be something else as well. Thanks for this. I'll give it a try. – Ctrl May 19 '20 at 11:21