1

I am writing a program that is able to scrape album informations from the Discogs music database. The scraper works fine.

Now I have a Data Frame with lots of duplicated artists and titles where just the formats cell is different (see for example 'Sido', 'Ich und keine Maske' in the snippet of my data frame below).

    Interpret                    Title                              Formats
0   Afrika Bambaataa And Family  The Decade Of Darkness 1990-2000   CD, Album, RE
1   Sha Hef                      Out The Mud
2   Sido                         Ich Und Keine Maske                CD, Album
3   Sido                         Ich Und Keine Maske                2xLP, Album
...

Now I am looking for a way to combine these double entries without loss of information. Can somebody give me a hint? The final result should be look like this:

    Interpret                    Title                              Formats
0   Afrika Bambaataa And Family  The Decade Of Darkness 1990-2000   CD, Album, RE
1   Sha Hef                      Out The Mud
2   Sido                         Ich Und Keine Maske                CD, Album, 2xLP
...

I have tried

r = dataframe.groupby('Interpret')['Formate'].apply(','.join)

but the result is a Pandas Series with removed 'title'-column, so I lost information.

Jonnes
  • 25
  • 5
  • have you tried assign your series back to the column? something like `dataframe['Formate'] = dataframe.groupby('Interpret')['Formate'].apply(','.join)` – moys Oct 04 '19 at 09:56

1 Answers1

1

have you tried

import pandas as pd
df = pd.DataFrame({'Interpret': ['Afrika Bambaataa And Family', 'Sha Hef', 'Sido', 'Sido'],
                   'Title': ['The Decade Of Darkness 1990-2000', 'Out The Mud', 'Ich Und Keine Maske', 'Ich Und Keine Maske'],                             
                   'Formats': ['CD, Album, RE', 'CD, Album', 'CD, Album', '2xLP, Album']})

# remove duplicate interprets and merge formats...
df1 = df.groupby('Interpret').agg(lambda x: ', '.join(x.unique())).reset_index()

# now to get rid of duplicate entries in 'Formats' column...
def drop_dupes(row):
    l = row.split(', ')
    return ', '.join(list(set(l)))

df1['Formats'] = df1['Formats'].apply(drop_dupes)

? That gives you

Out[40]: 
                     Interpret  ...          Formats
0  Afrika Bambaataa And Family  ...    CD, RE, Album
1                      Sha Hef  ...        CD, Album
2                         Sido  ...  CD, Album, 2xLP

and is basically a slightly modified version of the answer you find for this question.

FObersteiner
  • 22,500
  • 8
  • 42
  • 72