I need to identify all abbreviations and hyphenated words in my sentences and print them as they are found. My code does not identify them correctly.

import re

sentence_stream2=df1['Open End Text']
for sent in sentence_stream2:
    abbs_ = re.findall(r'(?:[A-Z]\.)+', sent) #abbreviations
    hypns_= re.findall(r'\w+(?:-\w+)*', sent) #hyphenated words

    print("new sentence:")
    print(sent)
    print(abbs_)
    print(hypns_)

One of the sentences in my corpus is: DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI

The output for this is:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]
['DevOps', 'with', 'APIs', 'event-driven', 'architecture', 'using', 'cloud', 'Data', 'Analytics', 'environment', 'Self-service', 'BI']

The expected output is:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
['APIs','BI']
['event-driven','Self-service']
Murmel
Shraddha Avasthy

3 Answers

Your abbreviation rule only matches capital letters that are each followed by a period, so it finds nothing in this sentence. You want words with more than one consecutive capital letter; a rule you could use is:

abbs_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', sent) #abbreviations

This would match APIs and BI.

t = "DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI"

import re

abbs_ = re.findall(r'(?:[A-Z]\.)+', t) #abbreviations
cap_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', t) #abbreviations
hypns_= re.findall(r'\w+-\w+', t) #hyphenated words fixed

print("new sentence:")
print(t)
print(abbs_)
print(cap_)
print(hypns_)

Output:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]  # your abbreviation rule - does not find any capital letter followed by .
['APIs', 'BI'] # cap_ rule
['event-driven', 'Self-service']  # fixed hyphen rule

This will most likely miss abbreviations such as those in

t = "Prof. Dr. S. Quakernack"

so you may need to tweak it against more data, for example with http://www.regex101.com
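As a rough sketch of that kind of tweaking, the pattern below also accepts title-style abbreviations that end in a period. The names DOTTED_OR_CAPS and find_abbreviations are made up for this illustration, and the rule is only a heuristic: it will also catch a short capitalised word that ends a sentence, such as "Yes.".

```python
import re

# Hypothetical combined pattern (an assumption, not part of the answer above):
#   [A-Z][a-z]{0,3}\.   one capital, up to three lowercase letters, then a period
#   [A-Z]{2,}s?\b       two or more capitals, optionally followed by a plural s
DOTTED_OR_CAPS = re.compile(r'\b(?:[A-Z][a-z]{0,3}\.|[A-Z]{2,}s?\b)')

def find_abbreviations(text):
    """Return every substring of text matched by the heuristic pattern."""
    return DOTTED_OR_CAPS.findall(text)

print(find_abbreviations("Prof. Dr. S. Quakernack"))
print(find_abbreviations("DevOps with APIs & event-driven architecture "
                         "using cloud Data Analytics environment Self-service BI"))
```

The first call picks up Prof., Dr. and S.; the second still returns only APIs and BI.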

Patrick Artner

I suggest:

abbs_ = re.findall(r'\b[A-Z]+s?\b', sent) #abbreviations
hypns_ = re.findall(r'\w+(?:-\w+)+', sent) #hyphenated words; + instead of * requires at least one hyphen
Gary Goh

"As you know, I got all As in my course".

Is "As" an abbreviation? If not, you need to discard single capital letters, whether or not they are followed by an s, and match only runs of at least two capitals, optionally followed by one s as in APIs. So,

abbs_ = re.findall(r'\b(?:[A-Z][A-Z]+s?)\b', sent) #abbreviations

The \b are needed to be sure you don't also reap things such as ImNotAGirl because of that AG pair in the middle.
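To see the effect of the word boundaries, here is a small comparison of the same rule with and without \b. The sample strings and the variable names no_boundary and with_boundary are chosen for this illustration only.

```python
import re

samples = ["I got all As in my course", "ImNotAGirl", "APIs and BI"]

no_boundary = r'[A-Z][A-Z]+s?'             # same rule without \b
with_boundary = r'\b(?:[A-Z][A-Z]+s?)\b'   # the rule from this answer

for s in samples:
    # Print the text, then what each variant finds in it.
    print(repr(s), re.findall(no_boundary, s), re.findall(with_boundary, s))
```

Without \b the middle sample yields AG; with \b it yields nothing. Neither variant picks up "As", and both find APIs and BI.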

Then you have to get the hyphenated words: a word (\w+), followed by at least one hyphen-word sequence:

hypns_= re.findall(r'\b\w+(?:-\w+)+\b', sent) #hyphenated words
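One detail worth checking with re.findall: when the pattern contains a capturing group, findall returns the text of that group rather than the whole match, so the repeated hyphen-word part is better written as a non-capturing (?:...) group. A minimal sketch on the sample sentence from the question (the variable names capturing and non_capturing are illustrative):

```python
import re

t = ("DevOps with APIs & event-driven architecture using cloud "
     "Data Analytics environment Self-service BI")

# With a capturing group, findall returns only the last repetition of the group.
capturing = re.findall(r'\b\w+(-\w+)+\b', t)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'\b\w+(?:-\w+)+\b', t)

print(capturing)      # ['-driven', '-service']
print(non_capturing)  # ['event-driven', 'Self-service']
```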
LSerni