Is there a way to identify and create a list of all acronyms in a dataframe?

Question

I have a dataframe with a column that has many acronyms in it.

I would like to (a) list the acronyms found in each cell in a new adjacent column and (b) produce a list of all unique acronyms found (no duplicates).

I would like to simply use pyspellchecker to find any word that is misspelled and treat it as an acronym.

I know that method will also produce non-acronyms that are simply misspelled words but I can't think of any other way to do it (unless we assume that all acronyms will also be in all uppercase which is unfortunately not the case in my dataset).

For example I have,

Column 1
I worked for the NBA
I worked at the CIA
I am seeing a pt
CIA and NBA are both cool places to work
I also worked at NSA catedslf

Desired output:

Column 1	Column 2
I worked for the NBA	NBA
I worked at the CIA	CIA
I am seeing a pt	pt
CIA and NBA are both cool places to work	CIA,NBA
I also worked at NSA catedslf	NSA, catedslf

and

{NBA, CIA, pt, NSA, catedslf}

I threw catedslf in there just to show that it's okay if I also catch misspelled words (I know it's unavoidable).

Unless you use a pattern (e.g. all caps with ≥2 chars) or a dictionary, there is no direct way. — mozway, Mar 25 '22 at 17:10
Uppercase acronyms would be easy to identify (i.e. words in all caps). But you also have lowercase ones (e.g. pt). — DarrylG, Mar 25 '22 at 17:22
@mozway can you show how you would do it with a dictionary? thank you. — user2520842, Mar 25 '22 at 17:36
@DarrylG can you show how you would do it with a dictionary? thank you. — user2520842, Mar 25 '22 at 17:37
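
The dictionary approach raised in the comments could be sketched roughly as follows. This is only an illustration on plain strings: `KNOWN_ACRONYMS` and `find_known_acronyms` are hypothetical names, and the list of acronyms is an assumption that would have to be maintained by hand for the actual dataset. Matching is case-insensitive, so lowercase acronyms such as pt are caught too.

```python
import re

# Hypothetical, hand-maintained dictionary of acronyms (stored in lowercase).
KNOWN_ACRONYMS = {"nba", "cia", "nsa", "pt"}


def find_known_acronyms(text):
    """Return the words of `text` found in KNOWN_ACRONYMS, in order, without duplicates."""
    found = []
    for word in re.findall(r"[A-Za-z]+", text):
        # Compare in lowercase so that "pt", "Pt" and "PT" all match.
        if word.lower() in KNOWN_ACRONYMS and word not in found:
            found.append(word)
    return found


rows = [
    "I worked for the NBA",
    "I am seeing a pt",
    "CIA and NBA are both cool places to work",
]
per_row = [find_known_acronyms(row) for row in rows]
# set().union(...) also works when there are no rows at all.
unique = set().union(*per_row)
```

With a dataframe, the same function could be applied per cell with `df["Column 1"].map(find_known_acronyms)`. Unlike the spell-checker idea, this will not pick up misspelled words, but it also misses any acronym that is not in the dictionary.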

score 2 · Answer 1 · answered Mar 25 '22 at 20:46

Not sure if this is exactly what you want, but maybe it helps. I assume you have a dataframe like this (not a series):

df =

                                   Column 1
0                      I worked for the NBA
1                       I worked at the CIA
2                          I am seeing a pt
3  CIA and NBA are both cool places to work
4             I also worked at NSA catedslf

Then this

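from spellchecker import SpellChecker  # provided by the pyspellchecker package

spell = SpellChecker()
# For each row, take the union of (a) the words the spell checker does not know
# and (b) every run of two or more uppercase letters.
df["Column 2"] = df.assign(
    misspelled=df["Column 1"].str.split().map(spell.unknown),
    acronyms=df["Column 1"].str.findall(r"([A-Z]{2,})").map(set)
)[["misspelled", "acronyms"]].apply(lambda row: set.union(*row), axis=1)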

results in

                                   Column 1         Column 2
0                      I worked for the NBA            {NBA}
1                       I worked at the CIA            {CIA}
2                          I am seeing a pt             {pt}
3  CIA and NBA are both cool places to work       {NBA, CIA}
4             I also worked at NSA catedslf  {catedslf, NSA}

Then

result = set.union(*df["Column 2"])

produces

{'NSA', 'CIA', 'catedslf', 'NBA', 'pt'}

and

# Sort before joining, since the iteration order of a set is arbitrary.
df["Column 2"] = df["Column 2"].map(lambda s: ", ".join(sorted(s)))

finally gives

                                   Column 1       Column 2
0                      I worked for the NBA            NBA
1                       I worked at the CIA            CIA
2                          I am seeing a pt             pt
3  CIA and NBA are both cool places to work       CIA, NBA
4             I also worked at NSA catedslf  NSA, catedslf

But there might be other problems ahead, punctuation for example. Maybe you should do something like:

from string import punctuation

df["Column 1"] = df["Column 1"].str.translate(str.maketrans("", "", punctuation))

beforehand (there might be better ways to do that).
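
To see what the punctuation stripping does to a single cell, here is a small sketch on plain strings, using only the standard library. `STRIP_PUNCTUATION` and `uppercase_acronyms` are hypothetical names, and the word boundaries (`\b`) in the pattern are an addition that is not in the code above: they keep a run of capitals inside a mixed-case word from being picked up.

```python
import re
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans("", "", punctuation)


def uppercase_acronyms(text):
    """Strip punctuation, then return the set of all-caps words of 2+ letters."""
    cleaned = text.translate(STRIP_PUNCTUATION)
    # \b...\b restricts matches to whole words made only of capitals.
    return set(re.findall(r"\b[A-Z]{2,}\b", cleaned))


with_punctuation = uppercase_acronyms("I worked at the CIA, then the N.B.A.")
mixed_case = uppercase_acronyms("iPHONE")
```

Note that removing the dots turns N.B.A. into NBA, which is probably what you want here, but removing apostrophes also turns a word like don't into dont, which a spell checker may then flag as unknown.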
