
I have a dataframe with a column that has many acronyms in it.

I would like to (a) identify all acronyms in each cell and list them in the next column, and (b) produce a list of all unique acronyms found (no duplicates).

I would like to simply use pyspellchecker to find any word that is misspelled and treat it as an acronym.

I know that method will also flag non-acronyms that are simply misspelled words, but I can't think of any other way to do it (unless we assume that all acronyms are in all uppercase, which is unfortunately not the case in my dataset).

For example, I have:

Column 1
I worked for the NBA
I worked at the CIA
I am seeing a pt
CIA and NBA are both cool places to work

Desired output:

Column 1                                    Column 2
I worked for the NBA                        NBA
I worked at the CIA                         CIA
I am seeing a pt                            pt
CIA and NBA are both cool places to work    CIA,NBA
I also worked at NSA catedslf               NSA, catedslf

and

{NBA, CIA, pt, NSA, catedslf}

I threw catedslf in there just to show that it's okay if I also catch misspelled words (I know it's unavoidable).

1 Answer

Not sure if this is exactly what you want, but maybe it helps. I suppose you have a dataframe like this (not a series):

df =

                                   Column 1
0                      I worked for the NBA
1                       I worked at the CIA
2                          I am seeing a pt
3  CIA and NBA are both cool places to work
4             I also worked at NSA catedslf

Then this

from spellchecker import SpellChecker

spell = SpellChecker()
# Per row: take the words the spell checker does not know, plus every run of
# two or more capital letters, and merge both sets into one.
df["Column 2"] = df.assign(
    misspelled=df["Column 1"].str.split().map(spell.unknown),
    acronyms=df["Column 1"].str.findall(r"([A-Z]{2,})").map(set)
)[["misspelled", "acronyms"]].apply(lambda row: set.union(*row), axis=1)

results in

                                   Column 1         Column 2
0                      I worked for the NBA            {NBA}
1                       I worked at the CIA            {CIA}
2                          I am seeing a pt             {pt}
3  CIA and NBA are both cool places to work       {NBA, CIA}
4             I also worked at NSA catedslf  {catedslf, NSA}
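
To make the per-row logic easier to follow, here is a minimal plain-Python sketch of the same idea. It does not call pyspellchecker: `KNOWN_WORDS` is a tiny hypothetical stand-in for the checker's dictionary, `find_candidates` is a name made up for illustration, and the case-insensitive lookup is an assumption of this sketch.

```python
import re

# Hypothetical stand-in for the spell checker's dictionary.
# Assumption: words are compared in lowercase.
KNOWN_WORDS = {
    "i", "worked", "for", "the", "at", "am", "seeing", "a", "and",
    "are", "both", "cool", "places", "to", "work", "also",
}


def find_candidates(text):
    """Return words missing from KNOWN_WORDS plus runs of 2+ capital letters."""
    words = text.split()
    # Words the stand-in dictionary does not know (original casing kept).
    unknown = {w for w in words if w.lower() not in KNOWN_WORDS}
    # Runs of two or more uppercase letters, wherever they occur.
    uppercase = set(re.findall(r"[A-Z]{2,}", text))
    return unknown | uppercase


rows = [
    "I worked for the NBA",
    "I am seeing a pt",
    "I also worked at NSA catedslf",
]
for row in rows:
    print(row, "->", find_candidates(row))
```

The uppercase pattern is what catches acronyms that happen to be known to the dictionary, while the dictionary lookup catches lowercase ones such as pt.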

Then

result = set.union(*df["Column 2"])

produces

{'NSA', 'CIA', 'catedslf', 'NBA', 'pt'}

and

df["Column 2"] = df["Column 2"].map(", ".join)

finally gives

                                   Column 1       Column 2
0                      I worked for the NBA            NBA
1                       I worked at the CIA            CIA
2                          I am seeing a pt             pt
3  CIA and NBA are both cool places to work       CIA, NBA
4             I also worked at NSA catedslf  NSA, catedslf
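
One caveat: sets have no guaranteed order, so joining them directly may give the items in a different order from run to run. A minimal sketch of a reproducible variant is to sort each set before joining (`join_sorted` is a made-up helper name); with the dataframe above this would presumably be `df["Column 2"].map(lambda s: ", ".join(sorted(s)))`.

```python
def join_sorted(found):
    """Join a set of strings in sorted order so the result is reproducible."""
    # sorted() compares code points, so uppercase letters come before lowercase.
    return ", ".join(sorted(found))


print(join_sorted({"NBA", "CIA"}))
print(join_sorted({"catedslf", "NSA"}))
```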

But there might be other problems ahead, for example punctuation. Maybe you should do something like:

from string import punctuation

df["Column 1"] = df["Column 1"].str.translate(str.maketrans("", "", punctuation))

beforehand (there might be better ways to do that).
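
To see what that translation table does, here is a small sketch on plain strings (the sample sentences are made up for illustration). Note the side effects: dotted acronyms collapse into one token, which is probably what you want, but hyphenated words get merged too.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
strip_table = str.maketrans("", "", punctuation)

samples = [
    "I worked for the NBA.",
    "I worked at the C.I.A., briefly",
    "a well-known pt",
]
cleaned = [s.translate(strip_table) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Because "well-known" becomes "wellknown", a spell checker would likely flag it, so replacing hyphens with a space before stripping the rest may be worth considering.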

Timus