
I am working on a text problem where I have a pandas DataFrame with many columns, one of which consists of paragraphs. I need three output columns, defined as follows -

  • The length of the longest word(s) (word_length)
  • The longest word(s) themselves, in case several are tied at that length (words)
  • The total number of such tied words (word_count)

I count something as a word if it is separated by a space. I am looking for an answer using pandas apply/map.
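
For example (a quick illustration of that word definition), splitting on spaces keeps punctuation attached to the word:

"very very huge market....".split()
# ['very', 'very', 'huge', 'market....']  -- 'market....' counts as one 10-character word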

Here's some sample input data -

import pandas as pd

df = pd.DataFrame({'text':[
    "that's not where the biggest opportunity is - it's with heart failure drug - very very huge market....",
    "Of course! I just got diagnosed with congestive heart failure and type 2 diabetes. I smoked for 12 years and ate like crap for about the same time. I quit smoking and have been on a diet for a few weeks now. Let me assure you that I'd rather have a coke, gummi bears, and a bag of cheez doodles than a pack of cigs right now. Addiction is addiction.",
    "STILLWATER, Okla. (AP) ? Medical examiner spokeswoman SpokesWoman: Oklahoma State player Tyrek Coger died of enlarged heart, manner of death ruled natural."
]})

df

    text                                                
0   that's not where the biggest opportunity is - ...   
1   Of course! I just got diagnosed with congestiv...   
2   STILLWATER, Okla. (AP) ? Medical examiner spok...   

Here is the expected output -

    text                                                word_count  word_length  words
0   that's not where the biggest opportunity is - ...           1           11  opportunity
1   Of course! I just got diagnosed with congestiv...           1           10  congestive
2   STILLWATER, Okla. (AP) ? Medical examiner spok...           2           11  spokeswoman SpokesWoman

2 Answers


The following code should do the trick:

def get_values(text):
    tokens = text.split()  # Split on whitespace
    max_word_length = -1
    list_words = []  # Words that share the current maximum length

    for token in tokens:
        if len(token) > max_word_length:
            max_word_length = len(token)
            list_words = [token]  # New maximum found: reset the list
        elif len(token) == max_word_length:
            list_words.append(token)

    words_string = ' '.join(list_words)  # Concatenate the list into a string

    return [len(list_words), max_word_length, words_string]

df['word_count'], df['word_length'], df['words'] = zip(*df['text'].map(get_values))

Edit: Forgot to concatenate list
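
As a quick sanity check, calling the function above on the first sample row (a minimal sketch; the result matches the question's expected table):

get_values("that's not where the biggest opportunity is - it's with heart failure drug - very very huge market....")
# -> [1, 11, 'opportunity']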


One possible solution using apply and applymap -

import nltk  # may require nltk.download('punkt')
import pandas as pd

# df as defined in the question's sample data

# Tokenize each text with nltk (this splits off punctuation), then expand
# the tokens into one column per token
expanded_text = df.text.apply(lambda x: ' '.join(nltk.word_tokenize(x))).str.split(" ", expand=True)

# Longest token length per row; None cells (from rows with fewer tokens) count as 0
df['word_length'] = expanded_text.applymap(lambda x: len(str(x)) if x is not None else 0).max(axis=1)

for idx in range(len(expanded_text)):
    # Boolean mask over this row's tokens: True where the token length equals the row maximum
    temp = expanded_text.iloc[idx:idx + 1, :].applymap(
        lambda x: x is not None and len(str(x)) == df.loc[idx, 'word_length']
    ).T
    idx_ = temp.index[temp[idx]].values
    words = " ".join(expanded_text.iloc[idx:idx + 1, idx_].values.tolist()[0])
    df.loc[idx, 'words'] = words
    df.loc[idx, 'word_count'] = len(words.split())

df['word_count'] = df['word_count'].astype(int)  # .loc assignment creates a float column
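
Running this on the sample frame should reproduce the question's expected output (a quick check; column order chosen to match the table above):

print(df[['word_count', 'word_length', 'words']])
#    word_count  word_length                    words
# 0           1           11              opportunity
# 1           1           10               congestive
# 2           2           11  spokeswoman SpokesWoman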