I have a dataframe where each row contains a list of strings. I have written a function that performs a Bernoulli trial on each string: with some probability (0.5 here) the trial succeeds, and the string is then deleted. See below:

import numpy as np
import pandas as pd

def bernoulli_trial (sublist, prob = 0.5):

    # create mask of trial outcomes per each object in sublist
    mask = np.random.binomial(n=1, p=prob, size=len(sublist))

    # perform transformation on bernoulli successes
    transformed_sublist = [token for delete, token in zip(mask, sublist) if not delete]

    return transformed_sublist

This works as expected when I pass every row of a dataframe, as per:

df = pd.DataFrame(data={'store': [1,2,3], 'colours': [['red','blue','yellow','green','brown','pink'],
                                                      ['black','white'],
                                                      ['purple','orange','cyan','mauve']]})

df['colours'] = df['colours'].apply(bernoulli_trial)

Out: 
0      [red, green]
1           [black]
2    [orange, cyan]
Name: colours, dtype: object

However, rather than applying the function uniformly to every sublist and every string, I now want to apply conditions on (a) whether a given sublist is passed to the function at all (yes/no), and (b) which strings within that sublist are subject to the trial (i.e. by specifying that I only want to test certain colours).

I think I have a working solution for part (a): wrapping the Bernoulli function inside a function that checks whether a given condition is met (e.g. does the sublist contain more than 2 objects?). This works (see below), but I'm unsure whether there is a more efficient (read: more Pythonic) way to do this.

def sublist_condition_check(sublist):
    if len(sublist) > 2:
        sublist = bernoulli_trial(sublist)
    else:
        sublist = sublist
    return sublist

Note that any sublists that do not meet the condition should remain unchanged.

df['colours'].apply(sublist_condition_check)

Out: 
0      [red, brown]
1    [black, white] # this sublist had only two elements so remains unchanged
2           [mauve]
Name: colours, dtype: object

However, I am completely stuck on how to go about applying conditional logic on each word. Say, for example, I wanted to only apply the trial to a prespecified list of colours ['red','mauve','black'] - subject to it passing the sublist condition check - how could I go about that?

Pseudo-code for what I am hoping to achieve would be something like the following:

for sublist in df:
    if len(sublist) > 2:     # check if sublist contains more than two objects
        for colour in sublist:     # cycle through each colour within the sublist
            if colour in ['red','mauve','black']:     
                colour = bernoulli_trial (colour)     # only run bernoulli if colour in list
            else:
                colour = colour     # if colour not in list, colour remains unchanged
    else:
        sublist = sublist     # if sublist <= 2, sublist remains unchanged

I know a literal interpretation of this won't work, as the initial bernoulli_trial function receives a list rather than the individual string. But hopefully it describes what I want to achieve.

cookie1986

1 Answer

Unsure of the etiquette regarding answering my own question, but thought I'd provide some detail of a working solution I have identified in case anyone encounters a similar situation.

I have extended the initial Bernoulli function with an additional if statement that checks whether each string meets an inclusion criterion.

import numpy as np

# internal function - bernoulli trial for each string in sublist
def bernoulli_trial(sublist, prob=0.50):

    # set token criteria for performing bernoulli trial
    token_criteria = ['red','black','purple'] # perform trial only on these strings

    # create mask of trial outcomes per each word in sublist
    mask = np.random.binomial(n=1, p=prob, size=len(sublist))

    # perform transformation (deletion) on bernoulli successes
    transformed_sublist = []
    for token, delete in zip(sublist, mask):
        if token not in token_criteria:
            # strings outside the criteria are never tested, so always retained
            transformed_sublist.append(token)
        elif delete == 0: # retain only those strings not marked for deletion
            transformed_sublist.append(token)

    return transformed_sublist

Combined with the sublist_condition_check function described in the question, this now performs as expected.
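
For anyone who would rather not hard-code the criteria, the two checks can also be folded into a single function. The following is only a minimal sketch, not the code I actually ran: the name conditional_trial and the parameters targets, min_len and rng are made up for illustration, and it draws from the standard-library random module instead of np.random.binomial so that it has no dependencies.

```python
import random

def conditional_trial(sublist, targets, prob=0.5, min_len=3, rng=random):
    """Delete each token found in `targets` with probability `prob`.

    Sublists shorter than `min_len` are returned unchanged, and tokens
    that are not in `targets` are always kept.
    (All names here are hypothetical, chosen for this sketch.)
    """
    if len(sublist) < min_len:
        return sublist
    # `or` short-circuits, so a random draw is only made for targeted tokens;
    # a targeted token survives when the draw is >= prob.
    return [tok for tok in sublist
            if tok not in targets or rng.random() >= prob]


colours = [['red', 'blue', 'yellow', 'green', 'brown', 'pink'],
           ['black', 'white'],
           ['purple', 'orange', 'cyan', 'mauve']]
targets = {'red', 'mauve', 'black'}

rng = random.Random(0)  # seeded only so that a run can be repeated
result = [conditional_trial(row, targets, rng=rng) for row in colours]
```

With a dataframe, this should be usable as df['colours'].apply(conditional_trial, targets=targets), since apply forwards extra keyword arguments to the function. The default min_len=3 corresponds to the len(sublist) > 2 check in the question.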

cookie1986