6

I want to find abbreviations in a text and remove them. What I am currently doing is identifying runs of consecutive capital letters and removing them.

But I see that this does not remove abbreviations such as MOOCs, M.O.O.C, M.O.O.Cs. Is there an easy way of doing this in Python? Or are there any libraries that I can use instead?

2 Answers

7

Python's built-in re (regular expression) module is probably the tool for the job.

In order to remove every string of consecutive uppercase letters, the following code can be used:

import re
mytext = "hello, look an ACRONYM"
mytext = re.sub(r"\b[A-Z]{2,}\b", "", mytext)

Here, the regex "\b[A-Z]{2,}\b" searches for multiple consecutive (indicated by [...]{2,}) capital letters (A-Z), forming a complete word (\b...\b). It then replaces them with the second string, "".
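To see what this basic pattern does and does not catch, here is a small check. The sample strings are made up for illustration; note that removing a match leaves the surrounding whitespace in place.

```python
import re

# Whole words made of two or more capital letters.
pattern = re.compile(r"\b[A-Z]{2,}\b")

# Illustrative inputs (not from the question's data).
samples = ["look an ACRONYM", "I like MOOCs", "TESting"]
for s in samples:
    print(repr(s), "->", pattern.findall(s))

# The space before the removed word is kept.
cleaned = pattern.sub("", "hello, look an ACRONYM")
print(repr(cleaned))
```

"MOOCs" and "TESting" are left alone because \b requires the run of capitals to end at a word boundary, and a lowercase letter follows in both cases.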

The convenient thing about regex is how easily it can be modified for more complex cases. For example:

mytext = re.sub(r"\b[A-Z\.]{2,}\b", "", mytext)

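This will replace consecutive uppercase letters and full stops, removing acronyms like A.B.C.D as well as ABCD. A final full stop, as in A.B.C.D., is left behind, because \b only matches where a word character (letter, digit, or underscore) sits on one side. The \ before the . is optional inside [...], but outside a character class it is necessary, as . is otherwise used by regex as a kind of wildcard.

The ? quantifier, which makes the preceding character optional, could also be used to remove acronyms that end in s, for example: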

mytext = re.sub(r"\b[A-Z\.]{2,}s?\b", "", mytext)

This regex will remove acronyms like ABCD, A.B.C.D, and even A.B.C.Ds. If other forms of acronym need to be removed, the regex can easily be modified to accommodate them.
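As a rough sketch, the pattern can be wrapped in a helper and tried on the forms from the question. The name strip_acronyms and the whitespace clean-up step are my own additions for illustration, not part of the pattern itself.

```python
import re

# Capitals and full stops, optionally followed by a plural "s".
ACRONYM = re.compile(r"\b[A-Z\.]{2,}s?\b")

def strip_acronyms(text):
    """Remove matches, then collapse the whitespace they leave behind.

    Hypothetical helper; the clean-up step is an assumption about
    what the output should look like.
    """
    without = ACRONYM.sub("", text)
    return re.sub(r"\s{2,}", " ", without).strip()

print(strip_acronyms("Many MOOCs and M.O.O.Cs are free"))
print(strip_acronyms("Try M.O.O.C. now"))
```

The second call shows the limitation of \b: the trailing full stop of "M.O.O.C." stays in the text, so it may need handling separately if that form appears in your data.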

The re library also includes functions like findall, or the match function, which allow for programs to locate and process each acronym individually. This might come in handy if you want to, for example, look at a list of the acronyms being removed and check there are no legitimate words there.
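For instance, a quick way to review what would be removed is to count the matches with findall before substituting anything. This is only a sketch; the sample text is invented, and it includes a capitalised ordinary word to show the kind of false positive worth checking for.

```python
import re
from collections import Counter

ACRONYM = re.compile(r"\b[A-Z\.]{2,}s?\b")

# Invented sample; "EARLY" is an ordinary word written in capitals.
text = "NASA and the FBI met. NASA left EARLY."

# findall returns every non-overlapping match as a string.
counts = Counter(ACRONYM.findall(text))
for word, n in counts.most_common():
    print(word, n)
```

Anything in the list that is not really an acronym can then be excluded before running re.sub.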

Xeomorpher
  • 1
    Wow, this is a perfect answer. Thanks a lot :) I will apply this and see how it works on my text. –  Dec 10 '17 at 01:33
  • 2
    This expression matches partial words that start with at least three capital letters. For example: `TESting` matches `TES`. Also, if a space is missing between a sentence and the one following it, the first word of the second sentence will be removed. E.g. `Testing this expression.Then another sentence.` will remove `.Then`. – drumhellerb Dec 10 '17 at 01:39
  • Ah thank you, I should have used \b instead of \w. I edited my answer to hopefully fix those issues. – Xeomorpher Dec 10 '17 at 01:49
0

An intuitive way would be to use a regex.

This regular expression does the job: ([A-Z]\.*){2,}s?

In Python, this gives:

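import re
your_text = "Many MOOCs and M.O.O.Cs"  # sample input
# Raw string keeps the backslash intact; re.sub returns a new string.
your_text = re.sub(r"([A-Z]\.*){2,}s?", "", your_text)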

See the re documentation in case of doubt: https://docs.python.org/2/library/re.html#re.sub
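One thing to be aware of: without word boundaries, this pattern also matches capitals at the start of a longer word. The sketch below compares it with a variant wrapped in \b; the variant is an assumption about how one might tighten it, not a tested fix for every case.

```python
import re

# Pattern as given above, with no word boundaries.
unanchored = re.compile(r"([A-Z]\.*){2,}s?")

# Assumed variant: same pattern wrapped in \b, using a non-capturing group.
anchored = re.compile(r"\b(?:[A-Z]\.*){2,}s?\b")

sample = "TESting the MOOCs"
loose = unanchored.sub("", sample)
tight = anchored.sub("", sample)
print(repr(loose))
print(repr(tight))
```

With the unanchored pattern, "TESting" loses its first three letters; the \b-wrapped variant leaves it alone and still removes "MOOCs".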

Jonathan C.
  • You might want to consider using {2,} instead of +. This would match and remove single capital letters, like "I", or a sentence beginning with "A". – Xeomorpher Dec 10 '17 at 01:24
  • On second thoughts, this would also remove every capital letter in the text within a word too, so it would need some `\w`s to keep it contained – Xeomorpher Dec 10 '17 at 01:25
  • 1
    "You might want to consider using {2,}" Yes indeed, good one! Also, I forgot to double the \ in my answer to make it apparent. Thanks for pointing this out. I edited my answer. – Jonathan C. Dec 10 '17 at 01:33