
For example, given this DataFrame:

import re
import pandas as pd

df = pd.DataFrame(data = {'id': ['393848', '30495'],
                         'text' : ['This is Gabanna. @RT Her human Jose rushed past firefighters into his burning home to rescue her. She suffered burns on her nose and paws, but will be just fine. The family lost everything else. You can help them rebuild below. 14/10 for both (via @KUSINews)',
                                  'Meet Milo. He’s a smiley boy who tore a ligament in his back left zoomer. The surgery to fix it went well, but he’s still at the hospital being monitored. He’s going to work very hard to fetch at full speed again, and you can help him do it below. 13/10']
                         })

I wrote some functions:

def tokenize(df): 
    def process_tokens(df): #return column with lists of tokens
        def process_reg(text): #return plain text
            return " ".join([i for i in re.sub(r'[^a-zA-Z\s]', "", str(text)).split()])
        df['tokens'] = [process_reg(text).split() for text in df['text']]
    return process_tokens(df) 

tokenize(df)

def process(df): #return column with dicts
    def process_group(token): #convert list of tokens into dictionary
        return pd.DataFrame(token, columns=["term"]).groupby('term').size().to_dict()
    df['dic'] = [process_group(token) for token in df['tokens']]

process(df)

They work fine when I call them one after another, and I get the result I expect.

I'm looking for a way to nest all of these functions into one, so that I only have to pass the DataFrame once, but I can't find one.

Please help.

AlexR
  • `process(tokenize(df))`? – Stop harming Monica Mar 18 '19 at 20:26
  • `--------------------------------------------------------------------------- TypeError Traceback (most recent call last) in () ----> 1 process(tokenize(df)) in process(df) 2 def process_group(token): #convert list of tokens into dictionery 3 return pd.DataFrame(token, columns=["term"]).groupby('term').size().to_dict() ----> 4 df['dic'] = [process_group(token) for token in df['tokens']] TypeError: 'NoneType' object is not subscriptable` – AlexR Mar 18 '19 at 20:50
  • Write a function that calls `tokenize` and then `process`. – Stop harming Monica Mar 18 '19 at 20:56
  • So I need one more function? Could you show an example, please? And how do I nest all of them into one container? – AlexR Mar 18 '19 at 21:01
  • Well, it looks to me like you want a function that you do not have yet, so you need one more function to get what you want. Do you not know how to put `tokenize(df); process(df)` inside a function? – Stop harming Monica Mar 18 '19 at 21:50
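
The traceback in the comments happens because `tokenize` modifies `df` in place and returns `None` (its inner `process_tokens` has no `return` statement), so `process(tokenize(df))` is really `process(None)`. Below is a minimal sketch of the wrapper suggested above, assuming the two functions from the question keep their in-place behaviour. The name `tokenize_and_process` and the small sample frame are made up for illustration.

```python
import re

import pandas as pd


def tokenize(df):
    # Adds a 'tokens' column in place and returns None, as in the question.
    def process_reg(text):
        # Strip everything except letters and whitespace, then normalise spaces.
        return " ".join(re.sub(r"[^a-zA-Z\s]", "", str(text)).split())
    df["tokens"] = [process_reg(text).split() for text in df["text"]]


def process(df):
    # Adds a 'dic' column (token -> count) in place and returns None.
    def process_group(tokens):
        return pd.DataFrame(tokens, columns=["term"]).groupby("term").size().to_dict()
    df["dic"] = [process_group(tokens) for tokens in df["tokens"]]


def tokenize_and_process(df):  # hypothetical name, not from the thread
    # Call the steps one after another instead of chaining their return values.
    tokenize(df)
    process(df)
    return df  # returning df also makes further chaining possible


# Illustrative data only, much shorter than the tweets in the question.
df = pd.DataFrame({"id": ["1", "2"],
                   "text": ["Meet Milo. Milo is a good boy 13/10",
                            "Hello, hello world"]})
result = tokenize_and_process(df)
```

The wrapper does not need the other two functions to be defined inside it. It only needs to call them in order.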

1 Answer

def ad(df):
    def tokenize(df): #add column with lists of tokens
        def process_tokens(df): #return column with lists of tokens
            def process_reg(text): #return plain text
                return " ".join([i for i in re.sub(r'[^a-zA-Z\s]', "", str(text)).split()])
            df['tokens'] = [process_reg(text).split() for text in df['text']]
        return process_tokens(df)

    tokenize(df)

    def process(df):
        def process_dic(df): #add column with dicts
            def process_group(token): #convert list of tokens into dictionary
                return pd.DataFrame(token, columns=["term"]).groupby('term').size().to_dict()
            df['dic'] = [process_group(token) for token in df['tokens']]
        return process_dic(df)

    return process(df)

Then calling

ad(df)

works well. I suspect there is another way of writing this that would run faster, but that is a challenge for another day.

Thank you for your support, @Goyo!

AlexR
  • You do not need to put the definitions of `tokenize` and `process` inside `ad`. [Flat is better than nested.](https://www.python.org/dev/peps/pep-0020/) – Stop harming Monica Mar 19 '19 at 18:01
  • I agree technically, but this is for my own use and I personally want one function for all of this. I tried to put one function inside another but failed, which is why I came here with the question. I know it is not perfect for now ) – AlexR Mar 21 '19 at 00:23
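
For what it's worth, a flat version along the lines of the comment above might look like the sketch below. It swaps the per-row `groupby` for `collections.Counter`, which is a different technique from the one in the answer and is probably (not measured here) lighter than building a small DataFrame for every row. The names `tokenize_text` and `add_token_counts` are made up, and returning a copy instead of modifying `df` in place is a choice of this sketch, not something from the original code.

```python
import re
from collections import Counter

import pandas as pd

# Same character filter as in the question: keep only letters and whitespace.
NON_LETTERS = re.compile(r"[^a-zA-Z\s]")


def tokenize_text(text):  # hypothetical helper
    # Remove unwanted characters, then split on any run of whitespace.
    return NON_LETTERS.sub("", str(text)).split()


def add_token_counts(df):  # hypothetical name
    # Work on a copy so the caller's DataFrame is left untouched.
    out = df.copy()
    out["tokens"] = [tokenize_text(text) for text in out["text"]]
    # Counter tallies the tokens of one row; dict() gives a plain dictionary.
    out["dic"] = [dict(Counter(tokens)) for tokens in out["tokens"]]
    return out


# Illustrative data only.
df = pd.DataFrame({"id": ["1", "2"],
                   "text": ["Meet Milo. Milo is a good boy 13/10",
                            "Hello, hello world"]})
result = add_token_counts(df)
```

The DataFrame is still passed only once, to `add_token_counts`. The dictionaries hold the same counts as the `groupby` version, although the keys come out in order of first appearance and not sorted.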