3

I'm trying to merge/concatenate two columns where both have related, but separate text data delimited by "|" in addition to replacing certain names with "" and replace the | with '\n'.

For example, the original data may be:

    First Names            Last Names
0   Jim|James|Tim          Simth|Jacobs|Turner
1   Mickey|Mini            Mouse|Mouse
2   Mike|Billy|Natasha     Mills|McGill|Tsaka

If I want to merge/concatenate to derive Full Names and remove entries tied to "Smith" the final df should look like:

    First Names            Last Names            Full Names
0   Jim|James|Tim          Simth|Jacobs|Turner   James Jacobs\nTim Turner
1   Mickey|Mini            Mouse|Mouse           Mickey Mouse\nMini Mouse
2   Mike|Billy|Natasha     Mills|McGill|Tsaka    Mike Mills\nBilly McGill\nNatasha Tsaka

My current approach so far has been:

def parse_merge(df, col1, col2, splitter, new_col, list_to_exclude):

    orig_order = pd.Series(list(df.index)).rename('index')

    col1_df = pd.concat([orig_order, df[col1], df[col1].str.split(splitter, expand=True)], axis = 1)
    col2_df = pd.concat([orig_order, df[col2], df[col2].str.split(splitter, expand=True)], axis = 1)

    col1_melt = pd.melt(col1_df, id_vars=['index', col1], var_name='count')
    col2_melt = pd.melt(col2_df, id_vars=['index', col2], var_name='count')

    col2_melt['value'] = '(' + col2_melt['value'].astype(str) + ')'
    col2_melt = col2_melt.rename(columns={'value':'value2'})

    melted_merge = pd.concat([col1_melt, col2_melt['value2']], axis = 1 )

    if len(list_to_exclude) > 0:
         list_map = map(re.escape, list_to_exclude)

    melted_merge.ix[melted_merge['value2'].str.contains('|'.join(list_map)), ['value', 'value2']] = ''

    melted_merge[new_col] = melted_merge['value'] + " " + melted_merge['value2']

if I call:

parse_merge(names, 'First Names', 'Last Names', 'Full Names', ['Smith'])

The data becomes:

    Index   First Names        count    value            value2        Full Names
0   0       Jim|James|Tim      0        Jim              Smith         ''
1   1       Mickey|Mini        0        Mickey           Mouse         Mickey Mouse
2   2       Mike|Billy|Natasha 0        Mike             Mills         Mike Mills

Just not sure how to finish this out without any loops or if there is a more efficient / totally different approach.

Thanks for all the input!

wingsoficarus116
  • 429
  • 5
  • 17

3 Answers3

4

Here is a condensed solution using pd.DataFrame.apply and some of python's nice built-in features:

def combine_names(row):

    pairs = list(zip(row[0].split('|'), row[1].split('|')))
    return '\n'.join([' '.join(p) for p in pairs if p[1] != 'Simth'])

df['Full Name'] = df.apply(combine_names, axis=1)
Alex
  • 12,078
  • 6
  • 64
  • 74
  • Nice solution @AlexG – nipy Jan 10 '17 at 22:50
  • Appreciate the solution! How would I generalize this to be applicable to a dataframe with n columns where I simply wanted to add a column consisting of two columns merged as such? Also, I'm not too familiar with .apply - can I pass arguments into combine_names? It would be ideal to pass a list of names to skip instead of just 'Smith' as that was an example. – wingsoficarus116 Jan 10 '17 at 23:57
  • You could code a list of names directly into the `combine_names ` function, but it must only take one argument. The argument passed is the row (assuming axis is set to 1). You could also call it like this: `df[['First Names', 'Last Names']].apply(combine_names, axis=1)` if you have more than just those two columns. Coming back to your first point, you could change: `if p[1] != 'Simth'` to something like this: `if p[1] not in ['Simth', 'John', 'King']` – Alex Jan 11 '17 at 00:05
  • I've worked around this by embedding the combine_names function in another function which creates a separate dataframe for combine_names to work on and then concats to the original df. Many thanks, very clever! – wingsoficarus116 Jan 11 '17 at 01:00
3

I really like @AlexG's solution - please use it.

Here is my attempt to create a creative one-liner solution - it's absolutely perverse, so it should NOT be used - it's just for fun:

In [78]: df
Out[78]:
          First Names           Last Names
0       Jim|James|Tim  Simth|Jacobs|Turner
1         Mickey|Mini          Mouse|Mouse
2  Mike|Billy|Natasha   Mills|McGill|Tsaka

In [79]: df['Full Names'] = \
    ...: (df.stack()
    ...:    .str.split(r'\|', expand=True)
    ...:    .unstack(level=1)
    ...:    .groupby(level=0, axis=1)
    ...:    .apply(lambda x: x.add(' ').sum(axis=1).str.strip())
    ...:    .replace([r'\w+\s+Simth'], [np.nan], regex=True)
    ...:    .apply(lambda x: x.dropna().str.cat(sep='\n'), axis=1)
    ...: )
    ...:

In [80]: df
Out[80]:
          First Names           Last Names                               Full Names
0       Jim|James|Tim  Simth|Jacobs|Turner                 James Jacobs\nTim Turner
1         Mickey|Mini          Mouse|Mouse                 Mickey Mouse\nMini Mouse
2  Mike|Billy|Natasha   Mills|McGill|Tsaka  Mike Mills\nBilly McGill\nNatasha Tsaka
Community
  • 1
  • 1
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
2

I've got a lot of comprehension

l = df.values.tolist()

['|'.join(n)
 for n in [[' '.join(z)
 for z in zip(*[s.split('|')
 for s in r]) if z[1] != 'Smith']
 for r in l]]

['James Jacobs|Tim Turner',
 'Mickey Mouse|Mini Mouse',
 'Mike Mills|Billy McGill|Natasha Tsaka']

l = df.values.tolist()

df['Full Names'] = [
     '|'.join(n)
     for n in [[' '.join(z)
     for z in zip(*[s.split('|')
     for s in r]) if z[1] != 'Smith']
     for r in l]]

df

enter image description here


word play aside, this is pretty snappy over sample data

enter image description here


longer explanation

l

[['Jim|James|Tim', 'Simth|Jacobs|Turner'],
 ['Mickey|Mini', 'Mouse|Mouse'],
 ['Mike|Billy|Natasha', 'Mills|McGill|Tsaka']]
  • l is a list of lists. I will make extensive use of list comprehensions and iterables.
  • Each sub-list consists of 2 strings that I will split and zip together.
  • The result of the split will be a "list" of tuples consisting of (first, last) names. I'll use if z[1] != 'Smith' to filter out the smiths.
    • BTW, in this line you could use z[1] not in list_of_names
  • I'll then use ' '.join (that's actually a function) to combine each tuple to first last
  • I'll then use another '|'.join to combine the sub-list of first last to first1 last1|first2 last2... so on and so forth

The reason why this is quicker is because comprehensions have been optimized to a great extent. The other solutions are using apply which is a generic looping structure that can only leverage fast looping under special circumstances (someone who knows more, please correct me if I'm wrong). Using lambda is definitely not one of those circumstnces.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Can you explain how this is so efficient and what exactly is it doing? I can understand the first answer more or less (the "combin_names" approach by AlexG - but this is beyond me. Apologies for my limited knowledge. – wingsoficarus116 Jan 11 '17 at 01:16
  • @wingsoficarus116 updated with something resembling an explanation – piRSquared Jan 11 '17 at 01:52