3

I have a dataframe with a column containing lists. For each row, I want to concatenate the value in 'molecule' with each element of that row's list in 'species'. I am trying to write code that produces the result shown in 'molecule_species'. Any thoughts on this would be appreciated.

Dataframe =

import pandas as pd
df = pd.DataFrame({'molecule': ['a',
                                'b',
                                'c',
                                'd',
                                'e'],
                   'species' : [['dog'],
                                ['horse','pig'],
                                ['cat', 'dog'],
                                ['cat','horse','pig'],
                                ['chicken','pig']]})

This is the new column I am trying to create: for each row, 'molecule' is concatenated with each element of the list contained in 'species'.

df['molecule_species'] = [['a dog'],
                          ['b horse','b pig'],
                          ['c cat', 'c dog'],
                          ['d cat','d horse','d pig'],
                          ['e chicken','e pig']]
  • Does [this](https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python) question help? You might also consider the [concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) method. – S.Au.Ra.B.H Jan 16 '20 at 19:21
  • All of the solutions accomplish what you want, but as you can see, they all require a loop over the rows at some point. pandas isn't meant to store complex objects like lists, and often the most performant way to deal with such objects in pandas is to move away from pandas (Andy L.'s solution). It seems that all of the information you need is available from `df.explode('species')`, and that format is more suitable for later manipulation with pandas. – ALollz Jan 16 '20 at 19:46
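
As a minimal sketch of the long format mentioned in the comment above, assuming pandas >= 0.25.0 and a shortened copy of the question's data (the name `long_df` is only illustrative), the concatenation becomes a plain element-wise operation once the lists are exploded:

```python
import pandas as pd

df = pd.DataFrame({
    'molecule': ['a', 'b', 'c'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog']],
})

# One row per (molecule, species) pair; the original row label is repeated.
long_df = df.explode('species')

# With flat columns, string concatenation is element-wise and needs no loop.
long_df['molecule_species'] = long_df['molecule'] + ' ' + long_df['species']

print(long_df)
```

Whether this shape is preferable depends on what happens next; if the lists are required downstream, the answers below rebuild them.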

3 Answers

6

Pandas >= 0.25.0

Use DataFrame.explode to get one row per list element, join the values of each row with a space, then rebuild the lists with GroupBy.agg:

df['molecule_species'] = (df.explode('species')
                            .apply(' '.join,axis=1)
                            .groupby(level=0)
                            .agg(list) )
print(df)

  molecule            species         molecule_species
0        a              [dog]                  [a dog]
1        b       [horse, pig]         [b horse, b pig]
2        c         [cat, dog]           [c cat, c dog]
3        d  [cat, horse, pig]  [d cat, d horse, d pig]
4        e     [chicken, pig]       [e chicken, e pig]
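
One thing to watch: `apply(' '.join, axis=1)` joins every column in the row, so it assumes the frame holds only string columns at that point. The sketch below restricts the operation to the two relevant columns; the `weight` column is a hypothetical extra column added only to illustrate the issue, and the intermediate names are not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'molecule': ['a', 'b'],
    'species': [['dog'], ['horse', 'pig']],
    'weight': [1.5, 2.0],  # hypothetical non-string column
})

# Step 1: keep only the columns to combine, then one row per list element.
exploded = df[['molecule', 'species']].explode('species')

# Step 2: join the two string values of each row with a space.
joined = exploded.apply(' '.join, axis=1)

# Step 3: group by the original row label and collect the strings into lists.
df['molecule_species'] = joined.groupby(level=0).agg(list)

print(df)
```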

Pandas < 0.25.0

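import numpy as np

# Repeat each row once per list element, then overwrite 'species'
# with the flattened list items (a manual equivalent of explode).
df['molecule_species'] = (df.reindex(df.index.repeat(df.species.str.len()))
                            .assign(species=np.concatenate(df.species.values))
                            .apply(' '.join, axis=1)
                            .groupby(level=0)
                            .agg(list))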
print(df)
  molecule            species         molecule_species
0        a              [dog]                  [a dog]
1        b       [horse, pig]         [b horse, b pig]
2        c         [cat, dog]           [c cat, c dog]
3        d  [cat, horse, pig]  [d cat, d horse, d pig]
4        e     [chicken, pig]       [e chicken, e pig]
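
To make the pre-0.25.0 idiom easier to follow, here is a sketch that builds the pieces one at a time on a two-row copy of the data. The intermediate variable names are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'molecule': ['a', 'b'],
    'species': [['dog'], ['horse', 'pig']],
})

# Number of list elements in each row.
lengths = df['species'].str.len()

# Each row label repeated once per list element.
repeated_index = df.index.repeat(lengths)

# All list items flattened into one array, in row order.
flat_species = np.concatenate(df['species'].values)

# Reindexing duplicates the rows; assign replaces the lists with single items.
manual = df.reindex(repeated_index).assign(species=flat_species)

print(manual)
```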

Another approach is Series.str.cat:

df2 = df.explode('species')
df['molecule_species'] = (df2['molecule'].str.cat(df2['species'], sep=' ')
                                         .groupby(level=0)
                                         .agg(list))
E. Zeytinci
ansev
5

You can try this:

>>> import pandas as pd
>>> df = pd.DataFrame({'molecule': ['a',
                                'b',
                                'c',
                                'd',
                                'e'],
                   'species' : [['dog'],
                                ['horse','pig'],
                                ['cat', 'dog'],
                                ['cat','horse','pig'],
                                ['chicken','pig']]})

>>> df['molecule_species'] = (df
    .apply(lambda x: [x['molecule'] + ' ' + m for m in x['species']], axis=1))
>>> df
  molecule            species         molecule_species
0        a              [dog]                  [a dog]
1        b       [horse, pig]         [b horse, b pig]
2        c         [cat, dog]           [c cat, c dog]
3        d  [cat, horse, pig]  [d cat, d horse, d pig]
4        e     [chicken, pig]       [e chicken, e pig]
E. Zeytinci
  • 1
    Honestly, this should have more upvotes. Yes, **apply is slow**, but there's no way around it with a DataFrame of lists. This solution **is** faster than the explode and it's also concise +1. – ALollz Jan 16 '20 at 19:50
  • 2
    @ALollz: I prefer list comprehension over `apply`. However, I agree it is faster than `explode`. Upvoted :) +1 – Andy L. Jan 16 '20 at 19:53
4

You may try a nested list comprehension. When processing sub-lists and concatenating strings within pandas cells, a list comprehension is typically much faster than the built-in pandas methods.

df['molecule_species'] = [[mol + ' ' + a_spec for a_spec in specs]
                          for mol, specs in zip(df.molecule, df.species)]

Out[87]:
  molecule            species         molecule_species
0        a              [dog]                  [a dog]
1        b       [horse, pig]         [b horse, b pig]
2        c         [cat, dog]           [c cat, c dog]
3        d  [cat, horse, pig]  [d cat, d horse, d pig]
4        e     [chicken, pig]       [e chicken, e pig]
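
To check the speed claim on your own machine, a rough timing sketch follows. The function names, the 10,000-row size, and the repeat count are arbitrary choices for illustration, and the explode variant uses plain string concatenation rather than `apply(' '.join, axis=1)`. Results will vary with the pandas version and the data, so no timings are quoted here.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})
# Repeat the sample data to get a frame large enough to time (10,000 rows).
big = pd.concat([df] * 2000, ignore_index=True)


def with_explode(frame):
    # Flatten, concatenate element-wise, then regroup into lists.
    flat = frame[['molecule', 'species']].explode('species')
    return (flat['molecule'] + ' ' + flat['species']).groupby(level=0).agg(list)


def with_apply(frame):
    # Row-wise apply with an inner list comprehension.
    return frame.apply(
        lambda row: [row['molecule'] + ' ' + s for s in row['species']], axis=1)


def with_comprehension(frame):
    # Nested list comprehension over the two columns.
    return [[mol + ' ' + s for s in specs]
            for mol, specs in zip(frame['molecule'], frame['species'])]


for func in (with_explode, with_apply, with_comprehension):
    seconds = timeit.timeit(lambda: func(big), number=3)
    print(f'{func.__name__}: {seconds:.3f} s for 3 runs')
```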
Andy L.
  • 1
    Suggestion: `from itertools import product; df['molecule_species'] = [[' '.join(pair) for pair in product([first], last)] for first, last in zip(df.molecule, df.species)]`? – sammywemmy May 04 '20 at 01:30